This project contains a solution, written in Python, for the Capstone of the Google Data Analytics Professional Certificate Program.
This project was inspired by a previous implementation I developed using the R programming language. Although this Python version is a distinct solution, some phrases and ideas in its presentation may be similar to the original R project. This is intentional, as the challenge and the problem-solving approach remain consistent across both implementations.
- Professional training designed by Google 🟢🔵🔴🟡
- Prepare for a career in Data Analytics
- 8 Course series, including Google Data Analytics Capstone: Complete a Case Study
This project is my proposed solution, using Python, to the Cyclistic bike-share analysis case study!
In this case study, I will take on the role of a junior data analyst performing real-world tasks for a fictional company, Cyclistic, working in the marketing analytics team. To address the business questions, I will follow the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
Cyclistic, a bike-share company in Chicago, launched a successful bike-share offering in 2016. Since then, the program has grown to a fleet of 5,824 geotracked bicycles that are locked into a network of 692 stations across Chicago. Cyclistic bikes can be unlocked from one station and returned to any other station in the network, anytime. The bike-share program includes traditional bikes as well as more inclusive options like reclining bikes, hand tricycles, and cargo bikes, making it useful to people with disabilities and riders who can’t use a standard two-wheeled bike. While most users ride for leisure, about 30% use the bikes to commute to work each day.
Cyclistic offers flexible pricing plans: single-ride passes, full-day passes (for casual riders), and annual memberships (for members). Cyclistic’s finance analysts have found that annual members are more profitable, and the director of marketing sees an opportunity to grow the business by converting casual riders into members. Rather than targeting all-new customers, she believes the focus should be on casual riders, as they are already familiar with the Cyclistic program and rely on it for their mobility needs.
- Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
- Cyclistic executive team: The notoriously detail-oriented executive team.
- Lily Moreno: The director of marketing and the manager of my team. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels. She believes the company’s future success depends on maximizing the number of annual memberships.
- Carla Cotas: A junior data analyst working in the marketing analytics team. My team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members.
In order to do that, however, my team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Therefore, Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Three questions will guide the future marketing program:
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?
The first question to answer was assigned to me: How do annual members and casual riders use Cyclistic bikes differently?
The local directory, folders, sub-folders and file-naming conventions follow the structure below:
|-- 20250109_CyclisticBikeShare
    |-- 1.OriginalData
    |-- 2.PreparedData
    |-- 3.UploadedData
        |-- YYYYMMDD
    |-- 4.Analysis
        |-- DataErrors
            |-- AutomaticallyExcludedResults
            |-- ManuallyExcludedResults
        |-- PythonCode
        |-- Figures
    |-- 5.Insights
        |-- MyInsights
        |-- SharedInsights
    |-- 6.Final
        |-- Report
        |-- Presentation
Cyclistic’s historical trip data, used to analyze and identify trends, is available here.
This study analyzes and identifies trends from the previous 12 months: January 2024 to December 2024 (‘202401-divvy-tripdata.csv’ → ‘202412-divvy-tripdata.csv’).
The data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Cyclistic bikes. Note, however, that data-privacy issues prohibit using riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or whether they have purchased multiple single passes.
⚠️ Note: In my previous implementation I used Posit’s RStudio, and I could only complete the R project focusing on the previous 6 months, November 2023 to April 2024 (‘202311-divvy-tripdata.csv’ → ‘202404-divvy-tripdata.csv’), so that Posit’s RStudio would not crash due to memory usage. When using Python for this project, I did not run into this memory problem and was able to carry out a more complete analysis, covering further analysis points identified in the previous R study.
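Had memory become a concern in Python as well, one common mitigation (not used here, just a hypothetical sketch) is to declare lighter dtypes for low-cardinality columns when reading the CSV files:

import pandas as pd

# Hypothetical memory-saving read: store repeated string values as categories.
light_dtypes = {'rideable_type': 'category', 'member_casual': 'category'}
sample = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202401-divvy-tripdata.csv', dtype=light_dtypes)
print(sample.memory_usage(deep=True).sum(), 'bytes')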
Cyclistic’s historical trip data is available for download as zip files (‘202401-divvy-tripdata.zip’ to ‘202412-divvy-tripdata.zip’). After being downloaded and stored in a folder (1.OriginalData), the zip files were unzipped locally, giving the original .CSV data files (‘202401-divvy-tripdata.csv’ to ‘202412-divvy-tripdata.csv’). This study uses Python 3.11 for the analysis because it is a widely used programming language in data analytics, providing extensive data libraries.
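The unzipping can also be scripted; a minimal sketch using Python’s standard library (assuming the zip files sit in the 1.OriginalData folder relative to the working directory):

import glob
import zipfile

# Extract every monthly archive next to the original zip files.
for archive in sorted(glob.glob('1.OriginalData/2024*-divvy-tripdata.zip')):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall('1.OriginalData')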
The original data files, which follow the naming convention YYYYMM-divvy-tripdata, were imported into Python.
import pandas as pd

CyclisticTripData_2024_01 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202401-divvy-tripdata.csv')
CyclisticTripData_2024_02 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202402-divvy-tripdata.csv')
CyclisticTripData_2024_03 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202403-divvy-tripdata.csv')
CyclisticTripData_2024_04 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202404-divvy-tripdata.csv')
CyclisticTripData_2024_05 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202405-divvy-tripdata.csv')
CyclisticTripData_2024_06 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202406-divvy-tripdata.csv')
CyclisticTripData_2024_07 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202407-divvy-tripdata.csv')
CyclisticTripData_2024_08 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202408-divvy-tripdata.csv')
CyclisticTripData_2024_09 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202409-divvy-tripdata.csv')
CyclisticTripData_2024_10 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202410-divvy-tripdata.csv')
CyclisticTripData_2024_11 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202411-divvy-tripdata.csv')
CyclisticTripData_2024_12 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202412-divvy-tripdata.csv')
print('Importing original data files completed!')
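The twelve imports could also be written more compactly; a minimal loop-based sketch (an alternative to the explicit version above), reading the files into a dictionary keyed by month number:

import pandas as pd

base_path = '~/projects/20250109_CyclisticBikeShare/1.OriginalData'
# monthly_trip_data[1] holds January 2024, monthly_trip_data[12] holds December 2024.
monthly_trip_data = {
    month: pd.read_csv(f'{base_path}/2024{month:02d}-divvy-tripdata.csv')
    for month in range(1, 13)
}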
A summary check of each dataframe’s structure and information was performed to explore consistency in the column numbers, ensuring the same number of columns and the same column names before moving to the next step and merging them into a single data set.
CyclisticTripData_2024_01.info()
CyclisticTripData_2024_02.info()
CyclisticTripData_2024_03.info()
CyclisticTripData_2024_04.info()
CyclisticTripData_2024_05.info()
CyclisticTripData_2024_06.info()
CyclisticTripData_2024_07.info()
CyclisticTripData_2024_08.info()
CyclisticTripData_2024_09.info()
CyclisticTripData_2024_10.info()
CyclisticTripData_2024_11.info()
CyclisticTripData_2024_12.info()
Key findings:
- the data is organized in CSV (comma-separated values) format
- the column numbers and column names are consistent across files
- the column types are consistent: 4 columns have data type ‘float64’ and 9 columns have data type ‘object’
- the column labels include ride id, rideable type, started and ended time, start and end station - name, id, latitude and longitude - and member type
- the number of observations differs from month to month
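These structural checks can also be automated; a minimal sketch, comparing every monthly dataframe’s columns and dtypes against the first one:

# Verify that all 12 monthly dataframes share the same columns and dtypes.
frames = [CyclisticTripData_2024_01, CyclisticTripData_2024_02, CyclisticTripData_2024_03,
          CyclisticTripData_2024_04, CyclisticTripData_2024_05, CyclisticTripData_2024_06,
          CyclisticTripData_2024_07, CyclisticTripData_2024_08, CyclisticTripData_2024_09,
          CyclisticTripData_2024_10, CyclisticTripData_2024_11, CyclisticTripData_2024_12]
assert all(frame.columns.equals(frames[0].columns) for frame in frames)
assert all(frame.dtypes.equals(frames[0].dtypes) for frame in frames)
print('All 12 dataframes are structurally consistent!')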
The data can be trusted: it is published by a trustworthy source, and it is aligned with the question that was assigned to me. However, all ride ids are unique, and because data privacy prohibits using riders’ personally identifiable information, it will not be possible to determine whether riders have purchased multiple single passes or whether they live in the Cyclistic service area. Now that a description of all the data sources used has been explored, the data is ready to be processed and cleaned for analysis.
In this step, the 12 dataframes are combined into one dataframe, which is then cleaned and manipulated.
The consistency in column names, column types and number of columns across the dataframes makes it possible to combine the 12 dataframes into a single dataframe for analyzing 12 months of data.
dataframes = [CyclisticTripData_2024_01, CyclisticTripData_2024_02, CyclisticTripData_2024_03,
CyclisticTripData_2024_04, CyclisticTripData_2024_05, CyclisticTripData_2024_06,
CyclisticTripData_2024_07, CyclisticTripData_2024_08, CyclisticTripData_2024_09,
CyclisticTripData_2024_10, CyclisticTripData_2024_11, CyclisticTripData_2024_12]
CyclisticTripData = pd.concat(dataframes, ignore_index=True)
Then the large dataframe can be verified to quickly check for uniformity and to get a better feel for the data.
CyclisticTripData.info()
CyclisticTripData.shape
CyclisticTripData.columns
CyclisticTripData.head()
CyclisticTripData.tail()
And the CyclisticTripData dataframe looks like this:
| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C1D650626C8C899A | electric_bike | 2024-01-12 15:30:27 | 2024-01-12 15:37:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.903267 | -87.634737 | 41.889177 | -87.638506 |
1 | EECD38BDB25BFCB0 | electric_bike | 2024-01-08 15:45:46 | 2024-01-08 15:52:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902937 | -87.634440 | 41.889177 | -87.638506 |
2 | F4A9CE78061F17F7 | electric_bike | 2024-01-27 12:27:19 | 2024-01-27 12:35:19 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902951 | -87.634470 | 41.889177 | -87.638506 |
3 | 0A0D9E15EE50B171 | classic_bike | 2024-01-29 16:26:17 | 2024-01-29 16:56:06 | Wells St & Randolph St | TA1305000030 | Larrabee St & Webster Ave | 13193 | 41.884295 | -87.633963 | 41.921822 | -87.644140 |
4 | 33FFC9805E3EFF9A | classic_bike | 2024-01-31 05:43:23 | 2024-01-31 06:09:35 | Lincoln Ave & Waveland Ave | 13253 | Kingsbury St & Kinzie St | KA1503000043 | 41.948797 | -87.675278 | 41.889177 | -87.638506 |
| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5860563 | BD56BA20F42E4794 | electric_bike | 2024-12-11 08:23:46.564 | 2024-12-11 08:37:34.532 | Clybourn Ave & Division St | TA1307000115 | NaN | NaN | 41.904634 | -87.640518 | 41.880000 | -87.630000 |
5860564 | 3074643A6B60B300 | electric_bike | 2024-12-09 12:26:15.677 | 2024-12-09 12:37:32.712 | Canal St & Jackson Blvd | 13138 | NaN | NaN | 41.878125 | -87.639968 | 41.900000 | -87.620000 |
5860565 | 15602635C5DF484E | electric_bike | 2024-12-31 17:10:03.113 | 2024-12-31 17:17:21.838 | Albany Ave & Bloomingdale Ave | 15655 | California Ave & Milwaukee Ave | 13084 | 41.914027 | -87.705126 | 41.922695 | -87.697153 |
5860566 | F15ABBA961560B75 | electric_bike | 2024-12-01 14:39:47.216 | 2024-12-01 14:45:21.268 | Albany Ave & Bloomingdale Ave | 15655 | California Ave & Milwaukee Ave | 13084 | 41.914003 | -87.705099 | 41.922695 | -87.697153 |
5860567 | 8AF273287533B527 | electric_bike | 2024-12-17 06:38:32.320 | 2024-12-17 06:46:27.167 | Albany Ave & Bloomingdale Ave | 15655 | NaN | NaN | 41.914027 | -87.705126 | 41.920000 | -87.690000 |
From looking at the output information, there are no columns to be removed; all the collected data can be used to get insights for the analysis. Moreover, looking at the output for the last 5 rows, NaN values are observed.
Now, the large dataframe is ready to be checked to confirm that the data is clean and free of errors, and to determine whether any modifications are needed.
Note: The large dataframe is exported to a .CSV file as a saved version.
CyclisticTripData.to_csv('~/projects/20250109_CyclisticBikeShare/2.PreparedData/20250205_CyclisticTripData.csv', index=False)
Next, a closer look at the data is taken to check for duplicates, null values, and inconsistent values that need to be cleaned.
duplicate_number = CyclisticTripData['ride_id'].duplicated().sum()
duplicate = CyclisticTripData[CyclisticTripData['ride_id'].duplicated()]
print("Number of duplicate ride_id:", duplicate_number)
print("Duplicate Observations:", duplicate)
- There are 211 duplicates. A closer look at the duplicates allowed me to identify that they correspond to data duplicated in months 05 and 06, for rides that started on 2024-05-31 and finished on 2024-06-01. Moreover, the collected data changed time format, from HH:MM:SS up to month 05 to HH:MM:SS.SSS from month 06 onwards. Considering the purpose of the analysis, the timestamps were therefore made consistent to match the HH:MM:SS format.
CyclisticTripData[['started_at', 'ended_at']] = CyclisticTripData[['started_at', 'ended_at']].apply(lambda x: pd.to_datetime(x, format='mixed').dt.strftime('%Y-%m-%d %H:%M:%S'))
Then, only one record is kept for each set of duplicates.
count_row_initial = CyclisticTripData.shape[0]
CyclisticTripData = CyclisticTripData.drop_duplicates(keep='first')
count_row_keep_one = CyclisticTripData.shape[0]
print('Duplicate rows removed:', count_row_initial - count_row_keep_one)
Next, a closer look at the now-unique observations is taken to check for null or empty cells.
# Note: isnull() is an alias of isna(); both counts are shown for completeness.
nan_count = CyclisticTripData.isna().sum().sum()
null_count = CyclisticTripData.isnull().sum().sum()
nan_count_column = CyclisticTripData.isna().sum()
null_count_column = CyclisticTripData.isnull().sum()
print('Number of NaN values in each column:', nan_count_column)
print('Number of null values in each column:', null_count_column)
print('Number of NaN values:', nan_count)
print('Number of null values:', null_count)
- There are 4,371,386 NaN or null values in total. These values appear in a "mixed way", without a pattern, in the columns 'start_station_name', 'start_station_id', 'end_station_name', 'end_station_id', 'end_lat' and 'end_lng'. The observations containing NaN or null values can therefore be removed (a minimal sketch of this step follows below).
- In the end, the large dataframe was left with 4,208,188 observations and 13 features.
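The exact removal code lives in the notebook; a minimal sketch, assuming a straightforward dropna over the whole dataframe:

# Drop every observation containing at least one NaN/null value.
CyclisticTripData = CyclisticTripData.dropna()
print('Observations remaining:', CyclisticTripData.shape[0])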
Now, the consistency of character lengths across columns can be checked.
max_char = CyclisticTripData.astype(str).map(len).max()
min_char = CyclisticTripData.astype(str).map(len).min()
print('Maximum character length per column:', max_char)
print('Minimum character length per column:', min_char)
- Character lengths are consistent in the columns ride_id, started_at, ended_at and member_casual.
- The columns start_station_id, start_station_name, end_station_id and end_station_name will be kept for now, but they most likely will not add value in future steps, leaving open the possibility of removing them later.
Finally, the unique values of the columns rideable_type and member_casual can be checked.
rideable_type_unique = CyclisticTripData['rideable_type'].unique()
rideable_type_counts = CyclisticTripData['rideable_type'].value_counts()
member_casual_unique = CyclisticTripData['member_casual'].unique()
member_casual_counts = CyclisticTripData['member_casual'].value_counts()
print("'rideable_type' values:", rideable_type_unique)
print("'rideable_type' total number:", rideable_type_counts)
print("'member_casual' values:", member_casual_unique)
print("'member_casual' total number:", member_casual_counts)
- There are "member" and "casual" at member_casual column.
- There are "electric_bike", "classic_bike" and "electric_scooter" at rideable_type column.
Now, the data is ready for the transformation and calculation steps to determine the ride length, month, day of the month, day of the week and hour of the day.
New columns hour, day_of_month, day_of_week, month, ride_length_seconds and ride_length are created to calculate (1) the length of each ride, by subtracting the column started_at from the column ended_at, and (2) the hour of the day, day of the month, day of the week and month in which each ride started. Both columns started_at and ended_at are consistent and hold the start and end times in the format YYYY-MM-DD hh:mm:ss.
CyclisticTripData['hour'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).hour
CyclisticTripData['day_of_month'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).day
CyclisticTripData['day_of_week'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).day_name()
CyclisticTripData['month'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).month_name()
# Ride duration in whole seconds (kept as a numeric column for checks and aggregation).
CyclisticTripData['ride_length_seconds'] = (pd.to_datetime(CyclisticTripData['ended_at']) - pd.to_datetime(CyclisticTripData['started_at'])).dt.total_seconds().astype(int)
# Human-readable HH:MM:SS representation of the same duration.
CyclisticTripData['ride_length'] = pd.to_datetime(CyclisticTripData['ride_length_seconds'], unit='s').dt.strftime('%H:%M:%S')
CyclisticTripData.head(3)
Now, a closer look at the data in the new columns is taken to check the consistency and meaning of the values.
month_unique = CyclisticTripData['month'].unique()
day_of_week_unique = CyclisticTripData['day_of_week'].unique()
hour_min = CyclisticTripData['hour'].min()
hour_max = CyclisticTripData['hour'].max()
day_of_month_min = CyclisticTripData['day_of_month'].min()
day_of_month_max = CyclisticTripData['day_of_month'].max()
ride_length_min = CyclisticTripData['ride_length_seconds'].min()
ride_length_max = CyclisticTripData['ride_length_seconds'].max()
ride_length_zeronegative = (CyclisticTripData['ride_length_seconds'] <= 0).sum()
print("'month' values:", month_unique)
print("'day_of_week' total number:", day_of_week_unique)
print(f'minimun hour is {hour_min} and maximum hour is {hour_max}')
print(f'minimun day of month is {day_of_month_min} and maximum day of month is {day_of_month_max}')
print(f'minimun ride length is {ride_length_min} seconds and maximum ride length is {ride_length_max} seconds')
print(f'There are {ride_length_zeronegative} observations with negative or zero ride length')
Key findings:
- The day of the week that each ride started is consistent, with unique values corresponding to the names of the days of the week
- The months are consistent, with unique values corresponding to the months of a year
- There is at least one negative ride length value, and at least one ride lasting a little more than a day (circa 25 hours)
- There are 213 observations with a ride length that is negative or equal to 00:00:00; these observations are removed (see the sketch after this list)
- In the end, the data ready for analysis has 4,207,975 observations and 19 features
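The exact filtering code lives in the notebook; a minimal sketch, assuming ride_length_seconds holds the signed duration:

# Keep only rides with a strictly positive duration.
CyclisticTripData = CyclisticTripData[CyclisticTripData['ride_length_seconds'] > 0]
print('Observations remaining:', CyclisticTripData.shape[0])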
Once the data is prepared, processed, and stored appropriately, it is ready to be analyzed to find trends or relationships, and perhaps to discover some surprises in the data. This will help show how these insights answer the question assigned to me. This step includes conducting descriptive analysis and identifying trends and relationships.
For this step, I followed the Case Study Roadmap for the Analyze phase, and I started by looking at the following questions:
- How should you organize your data to perform analysis on it?
- Has your data been properly formatted?
- What surprises did you discover in the data?
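To give a flavor of this step, a minimal descriptive-analysis sketch (the full analysis lives in the notebook; this assumes the cleaned CyclisticTripData from the Process step):

# Compare ride durations between members and casual riders.
print(CyclisticTripData.groupby('member_casual')['ride_length_seconds'].agg(['count', 'mean', 'median']))

# Count rides per day of the week, split by rider type.
print(CyclisticTripData.groupby(['member_casual', 'day_of_week'])['ride_id'].count())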
README.md: That's this file, where you can describe your project and how you built it.
20250109_Python_CyclisticBikeShareCapstone.ipynb: That's the Jupyter notebook with the code for this project.
The Background and Business Question sections were inspired by the business case description provided in Google Data Analytics Capstone: Complete a Case Study.