library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr)

1. Business Objective

Cyclistic is a Chicago-based bike-share program with two main customer segments: annual members and casual riders. The business problem is straightforward:

How do members and casual riders use Cyclistic differently, and how can those differences be used to convert more casual riders into annual members?

This report walks through the full analysis in R: data preparation, feature engineering, exploratory analysis, and practical recommendations that marketing and product teams can act on.

2. Data and Tools

2.1 Source Data

The analysis uses trip-level data from Cyclistic bike share for:

2019 trips – older schema (e.g. trip_id, start_time, usertype)

2020 trips – newer schema (e.g. ride_id, started_at, member_casual)

Both CSVs are stored in the project under:

data_raw/trips_2019.csv

data_raw/trips_2020.csv

Each record represents one bike ride, with start/end timestamps, stations, and rider type.

2.2 Tools

All analysis is performed in R using:

tidyverse – data wrangling and visualization
lubridate – date/time handling
janitor – clean column names
skimr – quick data summaries

rmarkdown – this reproducible report

3. Data Preparation

3.1 Load and standardize raw data

The 2019 and 2020 files use slightly different schemas. First we load each year, clean column names, and standardize them to a common structure.

# 2019 trips: older schema

trips_2019_raw <- read_csv(
"data_raw/trips_2019.csv",
col_types = cols(
trip_id = col_character()
)
) |>
clean_names()

names(trips_2019_raw)

##  [1] "trip_id"           "start_time"        "end_time"         
##  [4] "bikeid"            "tripduration"      "from_station_id"  
##  [7] "from_station_name" "to_station_id"     "to_station_name"  
## [10] "usertype"          "gender"            "birthyear"

# 2020 trips: newer schema

trips_2020_raw <- read_csv(
"data_raw/trips_2020.csv",
col_types = cols(
ride_id = col_character()
)
) |>
clean_names()

names(trips_2020_raw)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

3.2 Transform 2019 data to common schema

Key steps:

Convert start_time / end_time to POSIXct
Compute ride length in minutes
Map usertype → member_casual
Align station columns with 2020 naming

trips_2019 <- trips_2019_raw |>
mutate(
started_at = ymd_hms(start_time),
ended_at   = ymd_hms(end_time),

ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),

member_casual = case_when(
  usertype == "Subscriber" ~ "member",
  usertype == "Customer"   ~ "casual",
  TRUE ~ usertype
),

start_station_name = from_station_name,
end_station_name   = to_station_name,
start_station_id   = from_station_id,
end_station_id     = to_station_id,
year = 2019


) |>
select(
ride_id = trip_id,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_length_min,
year
)

skimr::skim(trips_2019)

Data summary
Name	trips_2019
Number of rows	365069
Number of columns	10
_______________________
Column type frequency:
character	4
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
ride_id	1	8	8	365069
start_station_name	1	10	43	594
end_station_name	1	10	43	600
member_casual	1	6	6	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
start_station_id	1	198.09	153.49	2.00	76.00	170.00	287.00	665.0	▇▅▃▁▁
end_station_id	1	198.58	154.47	2.00	76.00	168.00	287.00	665.0	▇▅▃▁▁
ride_length_min	1	16.94	465.40	1.02	5.43	8.73	14.43	177200.4	▇▁▁▁▁
year	1	2019.00	0.00	2019.00	2019.00	2019.00	2019.00	2019.0	▁▁▇▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2019-01-01 00:04:37	2019-03-31 23:53:48	2019-02-25 07:52:56	343022
ended_at	0	1	2019-01-01 00:11:07	2019-06-17 16:04:35	2019-02-25 08:03:50	338367

3.3 Transform 2020 data to common schema

2020 data mostly matches the target schema already; we just ensure types and add year.

trips_2020 <- trips_2020_raw |>
mutate(
started_at = ymd_hms(started_at),
ended_at   = ymd_hms(ended_at),
ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),
year = 2020
) |>
select(
ride_id,
started_at,
ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
member_casual,
ride_length_min,
year
)
skimr::skim(trips_2020)

Data summary
Name	trips_2020
Number of rows	426887
Number of columns	10
_______________________
Column type frequency:
character	4
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1	8	16	426887
start_station_name	0	1	5	43	607
end_station_name	1	1	5	43	602
member_casual	0	1	6	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
start_station_id	0	1	209.80	163.22	2.0	77.00	176.00	298.00	675.0	▇▆▃▁▁
end_station_id	1	1	209.34	163.20	2.0	77.00	175.00	297.00	675.0	▇▅▃▁▁
ride_length_min	0	1	22.12	619.04	-9.2	5.48	9.17	15.82	156450.4	▇▁▁▁▁
year	0	1	2020.00	0.00	2020.0	2020.00	2020.00	2020.00	2020.0	▁▁▇▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2020-01-01 00:04:44	2020-03-31 23:51:34	2020-02-17 05:01:27	399265
ended_at	0	1	2020-01-01 00:10:54	2020-05-19 20:10:34	2020-02-17 05:48:58	399532

3.4 Combine years and engineer features

We bind the rows, then create useful features:

ride_date – calendar date
weekday – ordered day of week
month – month bucket
hour – hour of day

day_type – weekday vs weekend

trips <- bind_rows(trips_2019, trips_2020)

trips <- trips |>
mutate(
ride_date = as.Date(started_at),
weekday   = wday(started_at, label = TRUE, week_start = 1),
month     = floor_date(started_at, "month"),
hour      = hour(started_at),
day_type  = if_else(weekday %in% c("Sat", "Sun"), "Weekend", "Weekday")
)

# Reorder weekday so plots appear in calendar order

trips <- trips |>
mutate(
weekday = factor(
as.character(weekday),
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
ordered = TRUE
)
)

3.5 Data quality filters

We remove clearly invalid or unusable records:

Ride length ≤ 0 minutes
Ride length > 24 hours
Missing rider type
Missing start station name

trips <- trips |>
filter(
!is.na(started_at),
!is.na(ended_at),
!is.na(member_casual),
!is.na(start_station_name),
ride_length_min > 0,
ride_length_min <= 1440
)

skimr::skim(trips)

Data summary
Name	trips
Number of rows	791264
Number of columns	15
_______________________
Column type frequency:
character	5
Date	1
factor	1
numeric	5
POSIXct	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
ride_id	1	8	16	791264
start_station_name	1	5	43	636
end_station_name	1	5	43	636
member_casual	1	6	6	2
day_type	1	7	7	2

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
ride_date	0	1	2019-01-01	2020-03-31	2020-01-07	181

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
weekday	0	1	TRUE	7	Tue: 135895, Thu: 132927, Wed: 130207, Fri: 123598

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
start_station_id	1	204.26	158.74	2.00	77.00	174.00	291.00	675.00	▇▅▃▁▁
end_station_id	1	204.23	159.13	2.00	77.00	174.00	291.00	675.00	▇▅▃▁▁
ride_length_min	1	13.66	31.05	0.02	5.47	8.95	15.15	1435.92	▇▁▁▁▁
year	1	2019.54	0.50	2019.00	2019.00	2020.00	2020.00	2020.00	▇▁▁▁▇
hour	1	13.28	4.65	0.00	9.00	14.00	17.00	23.00	▁▅▃▇▂

Variable type: POSIXct

skim_variable	complete_rate	min	max	median	n_unique
started_at	1	2019-01-01 00:04:37	2020-03-31 23:51:34	2020-01-07 12:17:10	741666
ended_at	1	2019-01-01 00:11:07	2020-04-01 07:38:49	2020-01-07 12:27:13	737274
month	1	2019-01-01 00:00:00	2020-03-01 00:00:00	2020-01-01 00:00:00	6

At this point we have a clean, consistent dataset ready for analysis.

4. Descriptive Statistics

4.1 Overall usage by rider type

summary_usage <- trips |>
group_by(member_casual) |>
summarise(
total_rides        = n(),
avg_ride_length_min = mean(ride_length_min),
median_ride_length  = median(ride_length_min),
.groups = "drop"
)

knitr::kable(summary_usage, digits = 2)

member_casual	total_rides	avg_ride_length_min	median_ride_length
casual	71138	36.46	21.97
member	720126	11.41	8.47

Interpretation:

Members account for a large share of total rides.
Casual riders tend to have longer average ride durations, even if they ride less frequently.

4.2 Rides by weekday

usage_by_weekday <- trips |>
group_by(member_casual, weekday) |>
summarise(
rides            = n(),
avg_ride_length  = mean(ride_length_min),
.groups = "drop"
)

knitr::kable(usage_by_weekday, digits = 1)

member_casual	weekday	rides	avg_ride_length
casual	Mon	6672	28.7
casual	Tue	7949	32.6
casual	Wed	8328	37.8
casual	Thu	7729	32.4
casual	Fri	8466	35.0
casual	Sat	13416	38.9
casual	Sun	18578	40.9
member	Mon	110412	11.1
member	Tue	127946	11.3
member	Wed	121879	11.2
member	Thu	125198	11.1
member	Fri	115132	11.1
member	Sat	59381	12.4
member	Sun	60178	12.9

5. Usage Patterns

5.1 Rides by weekday (visual)

p_weekday <- trips |>
count(member_casual, weekday) |>
ggplot(aes(x = weekday, y = n, fill = member_casual)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Rides by Weekday",
x = "Day of week",
y = "Number of rides",
fill = "User type"
) +
theme_minimal()

p_weekday

Key observation:

Members dominate weekday rides, especially Monday–Friday.
Casual riders show relatively higher activity on weekends, suggesting leisure-driven usage.

5.2 Average ride length by weekday

p_weekday_len <- trips |>
group_by(member_casual, weekday) |>
summarise(avg_ride_length = mean(ride_length_min), .groups = "drop") |>
ggplot(aes(x = weekday, y = avg_ride_length, fill = member_casual)) +
geom_col(position = "dodge") +
labs(
title = "Average Ride Length by Weekday",
x = "Day of week",
y = "Average ride length (minutes)",
fill = "User type"
) +
theme_minimal()

p_weekday_len

Interpretation:

-Casual riders have longer rides across almost all days, reinforcing that they are more likely using bikes for longer leisure trips.

-Member rides are shorter and more consistent across weekdays, matching commute or routine trips.

5.3 Rides by hour of day

p_hour <- trips |>
count(member_casual, hour) |>
ggplot(aes(x = hour, y = n, color = member_casual)) +
geom_line(linewidth = 1) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Rides by Hour of Day",
x = "Hour of day",
y = "Number of rides",
color = "User type"
) +
theme_minimal()

p_hour

Interpretation:

Members show two clear peaks: morning and evening commute hours.
Casual usage peaks later in the day, especially afternoons, and is less commute-shaped.

5.4 Top start stations by rider type

top_start_stations <- trips |>
group_by(member_casual, start_station_name) |>
summarise(rides = n(), .groups = "drop") |>
group_by(member_casual) |>
slice_max(order_by = rides, n = 10) |>
ungroup()

knitr::kable(top_start_stations, digits = 0)

member_casual	start_station_name	rides
casual	HQ QR	3556
casual	Streeter Dr & Grand Ave	2741
casual	Lake Shore Dr & Monroe St	2727
casual	Shedd Aquarium	1831
casual	Millennium Park	1403
casual	Michigan Ave & Oak St	1017
casual	Michigan Ave & Washington St	835
casual	Dusable Harbor	829
casual	Adler Planetarium	826
casual	Theater on the Lake	794
member	Canal St & Adams St	13797
member	Clinton St & Washington Blvd	13434
member	Clinton St & Madison St	12889
member	Kingsbury St & Kinzie St	8718
member	Columbus Dr & Randolph St	8515
member	Canal St & Madison St	7956
member	Franklin St & Monroe St	7008
member	Michigan Ave & Washington St	6684
member	Larrabee St & Kingsbury St	6466
member	Clinton St & Lake St	6437

This table helps identify where to place membership-focused marketing (e.g., signage, QR codes, local partnerships).

6. Key Insights

From the analysis above, several patterns stand out:

1.Members ride more often; casual riders ride longer.

Members generate the majority of total trips, but casual riders have higher average ride durations.

2.Members are weekday commuters; casuals are weekend/leisure riders.

Member rides spike during weekday commute hours; casual rides are concentrated on weekends and afternoons.

3.Usage is geographically concentrated at popular stations.

Both segments have a set of high-traffic start stations where targeted campaigns would have outsized impact.

4.Casual riders show behavior consistent with tourists or occasional city users.

Longer, less frequent rides on weekends suggest “experience” usage rather than daily utility.

5.There is a clear conversion opportunity around weekend casuals who ride frequently.

Frequent weekend riders with long trips are likely to save money with memberships and may be receptive to targeted offers.

7. Recommendations

Based on these insights, here are concrete actions Cyclistic could take:

1.Weekend-to-membership offers

After a casual rider completes 2–3 weekend rides in a month, send an in-app or email offer highlighting how much they could save with a membership.

2.Commute-focused messaging for members

Advertise the reliability and cost savings of annual membership during peak weekday station usage near transit hubs and office areas.

3.On-station prompts at high casual stations

At top casual start stations, add signage or QR codes offering a “Try one month membership” discount right after a long ride.

4.Bundle passes for long-duration riders

For riders with long but infrequent rides, experiment with bundles (e.g., “5 long rides per month”) that lead naturally into annual plans.

5.A/B test pricing and messaging

Use these segments (weekend casuals, frequent casuals at popular stations) as test groups for different membership offers to measure uplift scientifically.

8. Limitations and Next Steps

This analysis has a few important limitations:

Only ride-level behavioral data is used; there is no demographic information.
Weather, special events, and pricing changes are not explicitly modeled here.
The analysis is descriptive, not predictive; there is no formal churn or conversion model yet.

Next steps that would deepen this work:

Build a propensity model to estimate which casual riders are most likely to convert.
Incorporate weather and event data to separate normal patterns from anomalies.
Run controlled experiments on membership offers and track impact on conversion.

Cyclistic Bike Share Case Study

Sreeja Edulapalle

2025-12-05

1. Business Objective

2. Data and Tools

2.1 Source Data

2019 trips – older schema (e.g. trip_id, start_time, usertype)

2020 trips – newer schema (e.g. ride_id, started_at, member_casual)

2.2 Tools

3. Data Preparation

3.1 Load and standardize raw data

3.2 Transform 2019 data to common schema

3.3 Transform 2020 data to common schema

3.4 Combine years and engineer features

3.5 Data quality filters

4. Descriptive Statistics

4.1 Overall usage by rider type

4.2 Rides by weekday

5. Usage Patterns

5.1 Rides by weekday (visual)

5.2 Average ride length by weekday

5.3 Rides by hour of day

5.4 Top start stations by rider type

6. Key Insights

1.Members ride more often; casual riders ride longer.

2.Members are weekday commuters; casuals are weekend/leisure riders.

3.Usage is geographically concentrated at popular stations.

4.Casual riders show behavior consistent with tourists or occasional city users.

5.There is a clear conversion opportunity around weekend casuals who ride frequently.

7. Recommendations

1.Weekend-to-membership offers

2.Commute-focused messaging for members

3.On-station prompts at high casual stations

4.Bundle passes for long-duration riders

5.A/B test pricing and messaging

8. Limitations and Next Steps