library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)

1. Business Objective

Cyclistic is a Chicago-based bike-share program with two main customer segments: annual members and casual riders. The business problem is straightforward:

How do members and casual riders use Cyclistic differently, and how can those differences be used to convert more casual riders into annual members?

This report walks through the full analysis in R: data preparation, feature engineering, exploratory analysis, and practical recommendations that marketing and product teams can act on.

2. Data and Tools

2.1 Source Data

The analysis uses trip-level data from Cyclistic bike share for:

2019 trips – older schema (e.g. trip_id, start_time, usertype)
2020 trips – newer schema (e.g. ride_id, started_at, member_casual)

Both CSVs are stored in the project under:

data_raw/trips_2019.csv

data_raw/trips_2020.csv

Each record represents one bike ride, with start/end timestamps, stations, and rider type.

2.2 Tools

All analysis is performed in R using:

  • tidyverse – data wrangling and visualization

  • lubridate – date/time handling

  • janitor – clean column names

  • skimr – quick data summaries

rmarkdown – this reproducible report

3. Data Preparation

3.1 Load and standardize raw data

The 2019 and 2020 files use slightly different schemas. First we load each year, clean column names, and standardize them to a common structure.

# 2019 trips: older schema

trips_2019_raw <- read_csv(
"data_raw/trips_2019.csv",
col_types = cols(
trip_id = col_character()
)
) |>
clean_names()

names(trips_2019_raw)
##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"
# 2020 trips: newer schema

trips_2020_raw <- read_csv(
"data_raw/trips_2020.csv",
col_types = cols(
ride_id = col_character()
)
) |>
clean_names()

names(trips_2020_raw)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

3.2 Transform 2019 data to common schema

Key steps:

  • Convert start_time / end_time to POSIXct

  • Compute ride length in minutes

  • Map usertype → member_casual

  • Align station columns with 2020 naming

trips_2019 <- trips_2019_raw |>
mutate(
started_at = ymd_hms(start_time),
ended_at   = ymd_hms(end_time),

ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),

member_casual = case_when(
  usertype == "Subscriber" ~ "member",
  usertype == "Customer"   ~ "casual",
  TRUE ~ usertype
),

start_station_name = from_station_name,
end_station_name   = to_station_name,
start_station_id   = from_station_id,
end_station_id     = to_station_id,
year = 2019


) |>
select(
ride_id = trip_id,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_length_min,
year
)

skimr::skim(trips_2019)
Data summary
Name trips_2019
Number of rows 365069
Number of columns 10
_______________________
Column type frequency:
character 4
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 8 8 0 365069 0
start_station_name 0 1 10 43 0 594 0
end_station_name 0 1 10 43 0 600 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_station_id 0 1 198.09 153.49 2.00 76.00 170.00 287.00 665.0 ▇▅▃▁▁
end_station_id 0 1 198.58 154.47 2.00 76.00 168.00 287.00 665.0 ▇▅▃▁▁
ride_length_min 0 1 16.94 465.40 1.02 5.43 8.73 14.43 177200.4 ▇▁▁▁▁
year 0 1 2019.00 0.00 2019.00 2019.00 2019.00 2019.00 2019.0 ▁▁▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2019-01-01 00:04:37 2019-03-31 23:53:48 2019-02-25 07:52:56 343022
ended_at 0 1 2019-01-01 00:11:07 2019-06-17 16:04:35 2019-02-25 08:03:50 338367

3.3 Transform 2020 data to common schema

2020 data mostly matches the target schema already; we just ensure types and add year.

trips_2020 <- trips_2020_raw |>
mutate(
started_at = ymd_hms(started_at),
ended_at   = ymd_hms(ended_at),
ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),
year = 2020
) |>
select(
ride_id,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_length_min,
year
)
skimr::skim(trips_2020)
Data summary
Name trips_2020
Number of rows 426887
Number of columns 10
_______________________
Column type frequency:
character 4
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 8 16 0 426887 0
start_station_name 0 1 5 43 0 607 0
end_station_name 1 1 5 43 0 602 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_station_id 0 1 209.80 163.22 2.0 77.00 176.00 298.00 675.0 ▇▆▃▁▁
end_station_id 1 1 209.34 163.20 2.0 77.00 175.00 297.00 675.0 ▇▅▃▁▁
ride_length_min 0 1 22.12 619.04 -9.2 5.48 9.17 15.82 156450.4 ▇▁▁▁▁
year 0 1 2020.00 0.00 2020.0 2020.00 2020.00 2020.00 2020.0 ▁▁▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2020-01-01 00:04:44 2020-03-31 23:51:34 2020-02-17 05:01:27 399265
ended_at 0 1 2020-01-01 00:10:54 2020-05-19 20:10:34 2020-02-17 05:48:58 399532

3.4 Combine years and engineer features

We bind the rows, then create useful features:

  • ride_date – calendar date

  • weekday – ordered day of week

  • month – month bucket

  • hour – hour of day

day_type – weekday vs weekend

trips <- bind_rows(trips_2019, trips_2020)

trips <- trips |>
mutate(
ride_date = as.Date(started_at),
weekday   = wday(started_at, label = TRUE, week_start = 1),
month     = floor_date(started_at, "month"),
hour      = hour(started_at),
day_type  = if_else(weekday %in% c("Sat", "Sun"), "Weekend", "Weekday")
)

# Reorder weekday so plots appear in calendar order

trips <- trips |>
mutate(
weekday = factor(
as.character(weekday),
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
ordered = TRUE
)
)

3.5 Data quality filters

We remove clearly invalid or unusable records:

  • Ride length ≤ 0 minutes

  • Ride length > 24 hours

  • Missing rider type

  • Missing start station name

trips <- trips |>
filter(
!is.na(started_at),
!is.na(ended_at),
!is.na(member_casual),
!is.na(start_station_name),
ride_length_min > 0,
ride_length_min <= 1440
)

skimr::skim(trips)
Data summary
Name trips
Number of rows 791264
Number of columns 15
_______________________
Column type frequency:
character 5
Date 1
factor 1
numeric 5
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 8 16 0 791264 0
start_station_name 0 1 5 43 0 636 0
end_station_name 0 1 5 43 0 636 0
member_casual 0 1 6 6 0 2 0
day_type 0 1 7 7 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
ride_date 0 1 2019-01-01 2020-03-31 2020-01-07 181

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
weekday 0 1 TRUE 7 Tue: 135895, Thu: 132927, Wed: 130207, Fri: 123598

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_station_id 0 1 204.26 158.74 2.00 77.00 174.00 291.00 675.00 ▇▅▃▁▁
end_station_id 0 1 204.23 159.13 2.00 77.00 174.00 291.00 675.00 ▇▅▃▁▁
ride_length_min 0 1 13.66 31.05 0.02 5.47 8.95 15.15 1435.92 ▇▁▁▁▁
year 0 1 2019.54 0.50 2019.00 2019.00 2020.00 2020.00 2020.00 ▇▁▁▁▇
hour 0 1 13.28 4.65 0.00 9.00 14.00 17.00 23.00 ▁▅▃▇▂

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2019-01-01 00:04:37 2020-03-31 23:51:34 2020-01-07 12:17:10 741666
ended_at 0 1 2019-01-01 00:11:07 2020-04-01 07:38:49 2020-01-07 12:27:13 737274
month 0 1 2019-01-01 00:00:00 2020-03-01 00:00:00 2020-01-01 00:00:00 6

At this point we have a clean, consistent dataset ready for analysis.

4. Descriptive Statistics

4.1 Overall usage by rider type

summary_usage <- trips |>
group_by(member_casual) |>
summarise(
total_rides        = n(),
avg_ride_length_min = mean(ride_length_min),
median_ride_length  = median(ride_length_min),
.groups = "drop"
)

knitr::kable(summary_usage, digits = 2)
member_casual total_rides avg_ride_length_min median_ride_length
casual 71138 36.46 21.97
member 720126 11.41 8.47

Interpretation:

  • Members account for a large share of total rides.

  • Casual riders tend to have longer average ride durations, even if they ride less frequently.

4.2 Rides by weekday

usage_by_weekday <- trips |>
group_by(member_casual, weekday) |>
summarise(
rides            = n(),
avg_ride_length  = mean(ride_length_min),
.groups = "drop"
)

knitr::kable(usage_by_weekday, digits = 1)
member_casual weekday rides avg_ride_length
casual Mon 6672 28.7
casual Tue 7949 32.6
casual Wed 8328 37.8
casual Thu 7729 32.4
casual Fri 8466 35.0
casual Sat 13416 38.9
casual Sun 18578 40.9
member Mon 110412 11.1
member Tue 127946 11.3
member Wed 121879 11.2
member Thu 125198 11.1
member Fri 115132 11.1
member Sat 59381 12.4
member Sun 60178 12.9

5. Usage Patterns

5.1 Rides by weekday (visual)

p_weekday <- trips |>
count(member_casual, weekday) |>
ggplot(aes(x = weekday, y = n, fill = member_casual)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Rides by Weekday",
x = "Day of week",
y = "Number of rides",
fill = "User type"
) +
theme_minimal()

p_weekday

Key observation:

  • Members dominate weekday rides, especially Monday–Friday.

  • Casual riders show relatively higher activity on weekends, suggesting leisure-driven usage.

5.2 Average ride length by weekday

p_weekday_len <- trips |>
group_by(member_casual, weekday) |>
summarise(avg_ride_length = mean(ride_length_min), .groups = "drop") |>
ggplot(aes(x = weekday, y = avg_ride_length, fill = member_casual)) +
geom_col(position = "dodge") +
labs(
title = "Average Ride Length by Weekday",
x = "Day of week",
y = "Average ride length (minutes)",
fill = "User type"
) +
theme_minimal()

p_weekday_len

Interpretation:

-Casual riders have longer rides across almost all days, reinforcing that they are more likely using bikes for longer leisure trips.

-Member rides are shorter and more consistent across weekdays, matching commute or routine trips.

5.3 Rides by hour of day

p_hour <- trips |>
count(member_casual, hour) |>
ggplot(aes(x = hour, y = n, color = member_casual)) +
geom_line(linewidth = 1) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Rides by Hour of Day",
x = "Hour of day",
y = "Number of rides",
color = "User type"
) +
theme_minimal()

p_hour

Interpretation:

  • Members show two clear peaks: morning and evening commute hours.

  • Casual usage peaks later in the day, especially afternoons, and is less commute-shaped.

5.4 Top start stations by rider type

top_start_stations <- trips |>
group_by(member_casual, start_station_name) |>
summarise(rides = n(), .groups = "drop") |>
group_by(member_casual) |>
slice_max(order_by = rides, n = 10) |>
ungroup()

knitr::kable(top_start_stations, digits = 0)
member_casual start_station_name rides
casual HQ QR 3556
casual Streeter Dr & Grand Ave 2741
casual Lake Shore Dr & Monroe St 2727
casual Shedd Aquarium 1831
casual Millennium Park 1403
casual Michigan Ave & Oak St 1017
casual Michigan Ave & Washington St 835
casual Dusable Harbor 829
casual Adler Planetarium 826
casual Theater on the Lake 794
member Canal St & Adams St 13797
member Clinton St & Washington Blvd 13434
member Clinton St & Madison St 12889
member Kingsbury St & Kinzie St 8718
member Columbus Dr & Randolph St 8515
member Canal St & Madison St 7956
member Franklin St & Monroe St 7008
member Michigan Ave & Washington St 6684
member Larrabee St & Kingsbury St 6466
member Clinton St & Lake St 6437

This table helps identify where to place membership-focused marketing (e.g., signage, QR codes, local partnerships).

6. Key Insights

From the analysis above, several patterns stand out:

1.Members ride more often; casual riders ride longer.

Members generate the majority of total trips, but casual riders have higher average ride durations.

2.Members are weekday commuters; casuals are weekend/leisure riders.

Member rides spike during weekday commute hours; casual rides are concentrated on weekends and afternoons.

4.Casual riders show behavior consistent with tourists or occasional city users.

Longer, less frequent rides on weekends suggest “experience” usage rather than daily utility.

5.There is a clear conversion opportunity around weekend casuals who ride frequently.

Frequent weekend riders with long trips are likely to save money with memberships and may be receptive to targeted offers.

7. Recommendations

Based on these insights, here are concrete actions Cyclistic could take:

1.Weekend-to-membership offers

After a casual rider completes 2–3 weekend rides in a month, send an in-app or email offer highlighting how much they could save with a membership.

2.Commute-focused messaging for members

Advertise the reliability and cost savings of annual membership during peak weekday station usage near transit hubs and office areas.

3.On-station prompts at high casual stations

At top casual start stations, add signage or QR codes offering a “Try one month membership” discount right after a long ride.

4.Bundle passes for long-duration riders

For riders with long but infrequent rides, experiment with bundles (e.g., “5 long rides per month”) that lead naturally into annual plans.

5.A/B test pricing and messaging

Use these segments (weekend casuals, frequent casuals at popular stations) as test groups for different membership offers to measure uplift scientifically.

8. Limitations and Next Steps

This analysis has a few important limitations:

Next steps that would deepen this work: