library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)
Cyclistic is a Chicago-based bike-share program with two main customer segments: annual members and casual riders. The business problem is straightforward:
How do members and casual riders use Cyclistic differently, and how can those differences be used to convert more casual riders into annual members?
This report walks through the full analysis in R: data preparation, feature engineering, exploratory analysis, and practical recommendations that marketing and product teams can act on.
The analysis uses trip-level data from Cyclistic bike share for:
Both CSVs are stored in the project under:
data_raw/trips_2019.csv
data_raw/trips_2020.csv
Each record represents one bike ride, with start/end timestamps, stations, and rider type.
All analysis is performed in R using:
tidyverse – data wrangling and visualization
lubridate – date/time handling
janitor – clean column names
skimr – quick data summaries
rmarkdown – this reproducible report
The 2019 and 2020 files use slightly different schemas. First we load each year, clean column names, and standardize them to a common structure.
# 2019 trips: older schema
trips_2019_raw <- read_csv(
"data_raw/trips_2019.csv",
col_types = cols(
trip_id = col_character()
)
) |>
clean_names()
names(trips_2019_raw)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
# 2020 trips: newer schema
trips_2020_raw <- read_csv(
"data_raw/trips_2020.csv",
col_types = cols(
ride_id = col_character()
)
) |>
clean_names()
names(trips_2020_raw)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Key steps:
Convert start_time / end_time to POSIXct
Compute ride length in minutes
Map usertype → member_casual
Align station columns with 2020 naming
trips_2019 <- trips_2019_raw |>
mutate(
started_at = ymd_hms(start_time),
ended_at = ymd_hms(end_time),
ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),
member_casual = case_when(
usertype == "Subscriber" ~ "member",
usertype == "Customer" ~ "casual",
TRUE ~ usertype
),
start_station_name = from_station_name,
end_station_name = to_station_name,
start_station_id = from_station_id,
end_station_id = to_station_id,
year = 2019
) |>
select(
ride_id = trip_id,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_length_min,
year
)
skimr::skim(trips_2019)
| Name | trips_2019 |
| Number of rows | 365069 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 8 | 8 | 0 | 365069 | 0 |
| start_station_name | 0 | 1 | 10 | 43 | 0 | 594 | 0 |
| end_station_name | 0 | 1 | 10 | 43 | 0 | 600 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_station_id | 0 | 1 | 198.09 | 153.49 | 2.00 | 76.00 | 170.00 | 287.00 | 665.0 | ▇▅▃▁▁ |
| end_station_id | 0 | 1 | 198.58 | 154.47 | 2.00 | 76.00 | 168.00 | 287.00 | 665.0 | ▇▅▃▁▁ |
| ride_length_min | 0 | 1 | 16.94 | 465.40 | 1.02 | 5.43 | 8.73 | 14.43 | 177200.4 | ▇▁▁▁▁ |
| year | 0 | 1 | 2019.00 | 0.00 | 2019.00 | 2019.00 | 2019.00 | 2019.00 | 2019.0 | ▁▁▇▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2019-01-01 00:04:37 | 2019-03-31 23:53:48 | 2019-02-25 07:52:56 | 343022 |
| ended_at | 0 | 1 | 2019-01-01 00:11:07 | 2019-06-17 16:04:35 | 2019-02-25 08:03:50 | 338367 |
2020 data mostly matches the target schema already; we just ensure types and add year.
trips_2020 <- trips_2020_raw |>
mutate(
started_at = ymd_hms(started_at),
ended_at = ymd_hms(ended_at),
ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),
year = 2020
) |>
select(
ride_id,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_length_min,
year
)
skimr::skim(trips_2020)
| Name | trips_2020 |
| Number of rows | 426887 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 8 | 16 | 0 | 426887 | 0 |
| start_station_name | 0 | 1 | 5 | 43 | 0 | 607 | 0 |
| end_station_name | 1 | 1 | 5 | 43 | 0 | 602 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_station_id | 0 | 1 | 209.80 | 163.22 | 2.0 | 77.00 | 176.00 | 298.00 | 675.0 | ▇▆▃▁▁ |
| end_station_id | 1 | 1 | 209.34 | 163.20 | 2.0 | 77.00 | 175.00 | 297.00 | 675.0 | ▇▅▃▁▁ |
| ride_length_min | 0 | 1 | 22.12 | 619.04 | -9.2 | 5.48 | 9.17 | 15.82 | 156450.4 | ▇▁▁▁▁ |
| year | 0 | 1 | 2020.00 | 0.00 | 2020.0 | 2020.00 | 2020.00 | 2020.00 | 2020.0 | ▁▁▇▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-01-01 00:04:44 | 2020-03-31 23:51:34 | 2020-02-17 05:01:27 | 399265 |
| ended_at | 0 | 1 | 2020-01-01 00:10:54 | 2020-05-19 20:10:34 | 2020-02-17 05:48:58 | 399532 |
We bind the rows, then create useful features:
ride_date – calendar date
weekday – ordered day of week
month – month bucket
hour – hour of day
day_type – weekday vs weekend
trips <- bind_rows(trips_2019, trips_2020)
trips <- trips |>
mutate(
ride_date = as.Date(started_at),
weekday = wday(started_at, label = TRUE, week_start = 1),
month = floor_date(started_at, "month"),
hour = hour(started_at),
day_type = if_else(weekday %in% c("Sat", "Sun"), "Weekend", "Weekday")
)
# Reorder weekday so plots appear in calendar order
trips <- trips |>
mutate(
weekday = factor(
as.character(weekday),
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
ordered = TRUE
)
)
We remove clearly invalid or unusable records:
Ride length ≤ 0 minutes
Ride length > 24 hours
Missing rider type
Missing start station name
trips <- trips |>
filter(
!is.na(started_at),
!is.na(ended_at),
!is.na(member_casual),
!is.na(start_station_name),
ride_length_min > 0,
ride_length_min <= 1440
)
skimr::skim(trips)
| Name | trips |
| Number of rows | 791264 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| Date | 1 |
| factor | 1 |
| numeric | 5 |
| POSIXct | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 8 | 16 | 0 | 791264 | 0 |
| start_station_name | 0 | 1 | 5 | 43 | 0 | 636 | 0 |
| end_station_name | 0 | 1 | 5 | 43 | 0 | 636 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
| day_type | 0 | 1 | 7 | 7 | 0 | 2 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| ride_date | 0 | 1 | 2019-01-01 | 2020-03-31 | 2020-01-07 | 181 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| weekday | 0 | 1 | TRUE | 7 | Tue: 135895, Thu: 132927, Wed: 130207, Fri: 123598 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_station_id | 0 | 1 | 204.26 | 158.74 | 2.00 | 77.00 | 174.00 | 291.00 | 675.00 | ▇▅▃▁▁ |
| end_station_id | 0 | 1 | 204.23 | 159.13 | 2.00 | 77.00 | 174.00 | 291.00 | 675.00 | ▇▅▃▁▁ |
| ride_length_min | 0 | 1 | 13.66 | 31.05 | 0.02 | 5.47 | 8.95 | 15.15 | 1435.92 | ▇▁▁▁▁ |
| year | 0 | 1 | 2019.54 | 0.50 | 2019.00 | 2019.00 | 2020.00 | 2020.00 | 2020.00 | ▇▁▁▁▇ |
| hour | 0 | 1 | 13.28 | 4.65 | 0.00 | 9.00 | 14.00 | 17.00 | 23.00 | ▁▅▃▇▂ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2019-01-01 00:04:37 | 2020-03-31 23:51:34 | 2020-01-07 12:17:10 | 741666 |
| ended_at | 0 | 1 | 2019-01-01 00:11:07 | 2020-04-01 07:38:49 | 2020-01-07 12:27:13 | 737274 |
| month | 0 | 1 | 2019-01-01 00:00:00 | 2020-03-01 00:00:00 | 2020-01-01 00:00:00 | 6 |
At this point we have a clean, consistent dataset ready for analysis.
summary_usage <- trips |>
group_by(member_casual) |>
summarise(
total_rides = n(),
avg_ride_length_min = mean(ride_length_min),
median_ride_length = median(ride_length_min),
.groups = "drop"
)
knitr::kable(summary_usage, digits = 2)
| member_casual | total_rides | avg_ride_length_min | median_ride_length |
|---|---|---|---|
| casual | 71138 | 36.46 | 21.97 |
| member | 720126 | 11.41 | 8.47 |
Interpretation:
Members account for a large share of total rides.
Casual riders tend to have longer average ride durations, even if they ride less frequently.
usage_by_weekday <- trips |>
group_by(member_casual, weekday) |>
summarise(
rides = n(),
avg_ride_length = mean(ride_length_min),
.groups = "drop"
)
knitr::kable(usage_by_weekday, digits = 1)
| member_casual | weekday | rides | avg_ride_length |
|---|---|---|---|
| casual | Mon | 6672 | 28.7 |
| casual | Tue | 7949 | 32.6 |
| casual | Wed | 8328 | 37.8 |
| casual | Thu | 7729 | 32.4 |
| casual | Fri | 8466 | 35.0 |
| casual | Sat | 13416 | 38.9 |
| casual | Sun | 18578 | 40.9 |
| member | Mon | 110412 | 11.1 |
| member | Tue | 127946 | 11.3 |
| member | Wed | 121879 | 11.2 |
| member | Thu | 125198 | 11.1 |
| member | Fri | 115132 | 11.1 |
| member | Sat | 59381 | 12.4 |
| member | Sun | 60178 | 12.9 |
p_weekday <- trips |>
count(member_casual, weekday) |>
ggplot(aes(x = weekday, y = n, fill = member_casual)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Rides by Weekday",
x = "Day of week",
y = "Number of rides",
fill = "User type"
) +
theme_minimal()
p_weekday
Key observation:
Members dominate weekday rides, especially Monday–Friday.
Casual riders show relatively higher activity on weekends, suggesting leisure-driven usage.
p_weekday_len <- trips |>
group_by(member_casual, weekday) |>
summarise(avg_ride_length = mean(ride_length_min), .groups = "drop") |>
ggplot(aes(x = weekday, y = avg_ride_length, fill = member_casual)) +
geom_col(position = "dodge") +
labs(
title = "Average Ride Length by Weekday",
x = "Day of week",
y = "Average ride length (minutes)",
fill = "User type"
) +
theme_minimal()
p_weekday_len
Interpretation:
-Casual riders have longer rides across almost all days, reinforcing that they are more likely using bikes for longer leisure trips.
-Member rides are shorter and more consistent across weekdays, matching commute or routine trips.
p_hour <- trips |>
count(member_casual, hour) |>
ggplot(aes(x = hour, y = n, color = member_casual)) +
geom_line(linewidth = 1) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Rides by Hour of Day",
x = "Hour of day",
y = "Number of rides",
color = "User type"
) +
theme_minimal()
p_hour
Interpretation:
Members show two clear peaks: morning and evening commute hours.
Casual usage peaks later in the day, especially afternoons, and is less commute-shaped.
top_start_stations <- trips |>
group_by(member_casual, start_station_name) |>
summarise(rides = n(), .groups = "drop") |>
group_by(member_casual) |>
slice_max(order_by = rides, n = 10) |>
ungroup()
knitr::kable(top_start_stations, digits = 0)
| member_casual | start_station_name | rides |
|---|---|---|
| casual | HQ QR | 3556 |
| casual | Streeter Dr & Grand Ave | 2741 |
| casual | Lake Shore Dr & Monroe St | 2727 |
| casual | Shedd Aquarium | 1831 |
| casual | Millennium Park | 1403 |
| casual | Michigan Ave & Oak St | 1017 |
| casual | Michigan Ave & Washington St | 835 |
| casual | Dusable Harbor | 829 |
| casual | Adler Planetarium | 826 |
| casual | Theater on the Lake | 794 |
| member | Canal St & Adams St | 13797 |
| member | Clinton St & Washington Blvd | 13434 |
| member | Clinton St & Madison St | 12889 |
| member | Kingsbury St & Kinzie St | 8718 |
| member | Columbus Dr & Randolph St | 8515 |
| member | Canal St & Madison St | 7956 |
| member | Franklin St & Monroe St | 7008 |
| member | Michigan Ave & Washington St | 6684 |
| member | Larrabee St & Kingsbury St | 6466 |
| member | Clinton St & Lake St | 6437 |
This table helps identify where to place membership-focused marketing (e.g., signage, QR codes, local partnerships).
From the analysis above, several patterns stand out:
Members generate the majority of total trips, but casual riders have higher average ride durations.
Member rides spike during weekday commute hours; casual rides are concentrated on weekends and afternoons.
Both segments have a set of high-traffic start stations where targeted campaigns would have outsized impact.
Longer, less frequent rides on weekends suggest “experience” usage rather than daily utility.
Frequent weekend riders with long trips are likely to save money with memberships and may be receptive to targeted offers.
Based on these insights, here are concrete actions Cyclistic could take:
After a casual rider completes 2–3 weekend rides in a month, send an in-app or email offer highlighting how much they could save with a membership.
Advertise the reliability and cost savings of annual membership during peak weekday station usage near transit hubs and office areas.
At top casual start stations, add signage or QR codes offering a “Try one month membership” discount right after a long ride.
For riders with long but infrequent rides, experiment with bundles (e.g., “5 long rides per month”) that lead naturally into annual plans.
Use these segments (weekend casuals, frequent casuals at popular stations) as test groups for different membership offers to measure uplift scientifically.
This analysis has a few important limitations:
Only ride-level behavioral data is used; there is no demographic information.
Weather, special events, and pricing changes are not explicitly modeled here.
The analysis is descriptive, not predictive; there is no formal churn or conversion model yet.
Next steps that would deepen this work:
Build a propensity model to estimate which casual riders are most likely to convert.
Incorporate weather and event data to separate normal patterns from anomalies.
Run controlled experiments on membership offers and track impact on conversion.