This project contains a solution, written in Python, for the Capstone of the Google Data Analytics Professional Certificate Program.
This project was inspired by a previous implementation I developed using the R programming language. Although this Python version is a distinct solution, some phrases and ideas in its presentation may be similar to the original R project. This is intentional, as the challenge and the problem-solving approach remain consistent across both implementations.
- Professional training designed by Google 🟢🔵🔴🟡
- Prepare for a career in Data Analytics
- 8 Course series, including Google Data Analytics Capstone: Complete a Case Study
This project is my proposed solution, using Python, to the Cyclistic bike-share analysis case study!
In this case study, I will take on the role of a junior data analyst performing real-world tasks for a fictional company, Cyclistic, working in the marketing analytics team. To address the business questions, I will follow the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.
Cyclistic, a bike-share company in Chicago, launched a successful bike-share offering in 2016. Since then, the program has grown to a fleet of 5,824 geotracked bicycles that are locked into a network of 692 stations across Chicago. Cyclistic bikes can be unlocked from one station and returned to any other station in the network, anytime. The bike-share program includes traditional bikes as well as more inclusive options like reclining bikes, hand tricycles, and cargo bikes, making it useful to people with disabilities and riders who can’t use a standard two-wheeled bike. While most users ride for leisure, about 30% use the bikes to commute to work each day.
Cyclistic offers flexible pricing plans: single-ride passes, full-day passes (for casual riders), and annual memberships (for members). Cyclistic’s finance analysts have found that annual members are more profitable, and the director of marketing sees an opportunity to grow the business by converting casual riders into members. Rather than targeting all-new customers, she believes the focus should be on casual riders, as they are already familiar with the Cyclistic program and rely on it for their mobility needs.
- Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
- Cyclistic executive team: The notoriously detail-oriented executive team.
- Lily Moreno: The director of marketing and the manager of my team. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels. She believes the company’s future success depends on maximizing the number of annual memberships.
- Carla Cotas: A junior data analyst working in the marketing analytics team. My team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members.
In order to do that, however, my team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Therefore, Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Three questions will guide the future marketing program:
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
- How can Cyclistic use digital media to influence casual riders to become members?
The first question to answer was assigned to me: How do annual members and casual riders use Cyclistic bikes differently?
The local directory, folders, sub-folders and file-naming conventions follow the structure below:
|-- 20250109_CyclisticBikeShare
    |-- 1.OriginalData
    |-- 2.PreparedData
    |-- 3.UploadedData
        |-- YYYYMMDD
    |-- 4.Analysis
        |-- DataErrors
            |-- AutomaticallyExcludedResults
            |-- ManuallyExcludedResults
        |-- PythonCode
        |-- Figures
    |-- 5.Insights
        |-- MyInsights
        |-- SharedInsights
    |-- 6.Final
        |-- Report
        |-- Presentation
Cyclistic’s historical trip data, used to analyze and identify trends, is available here.
This study analyzes and identifies trends from the previous 12 months: January 2024 to December 2024 (‘202401-divvy-tripdata.csv’ → ‘202412-divvy-tripdata.csv’).
The data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Cyclistic bikes. Note, however, that data-privacy issues prohibit using riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or whether they have purchased multiple single passes.
⚠️ Note: In my previous implementation I used Posit’s RStudio, and I could only complete the R project focusing on the previous 6 months, November 2023 to April 2024 (‘202311-divvy-tripdata.csv’ → ‘202404-divvy-tripdata.csv’), so that Posit’s RStudio would not crash due to memory usage. When using Python for this project, I did not run into this memory problem and was able to carry out a more complete analysis, covering further analysis points identified in the previous R study.
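Had memory become a concern in Python as well, one common mitigation (not used here, just a hypothetical sketch) is to declare lighter dtypes for low-cardinality columns when reading the CSV files:

import pandas as pd

# Hypothetical memory-saving read: store repeated string values as categories.
light_dtypes = {'rideable_type': 'category', 'member_casual': 'category'}
sample = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202401-divvy-tripdata.csv', dtype=light_dtypes)
print(sample.memory_usage(deep=True).sum(), 'bytes')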
Cyclistic’s historical trip data is available for download as zip files (‘202401-divvy-tripdata.zip’ to ‘202412-divvy-tripdata.zip’). After being downloaded and stored in a folder (1.OriginalData), the zip files were unzipped locally, giving the original .CSV data files (‘202401-divvy-tripdata.csv’ to ‘202412-divvy-tripdata.csv’). This study uses Python 3.11 for the analysis because it is a widely used programming language in data analytics, providing extensive data libraries.
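The unzipping can also be scripted; a minimal sketch using Python’s standard library (assuming the zip files sit in the 1.OriginalData folder relative to the working directory):

import glob
import zipfile

# Extract every monthly archive next to the original zip files.
for archive in sorted(glob.glob('1.OriginalData/2024*-divvy-tripdata.zip')):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall('1.OriginalData')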
The original data files, which follow the naming convention YYYYMM-divvy-tripdata, were imported into Python.
import pandas as pd

CyclisticTripData_2024_01 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202401-divvy-tripdata.csv')
CyclisticTripData_2024_02 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202402-divvy-tripdata.csv')
CyclisticTripData_2024_03 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202403-divvy-tripdata.csv')
CyclisticTripData_2024_04 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202404-divvy-tripdata.csv')
CyclisticTripData_2024_05 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202405-divvy-tripdata.csv')
CyclisticTripData_2024_06 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202406-divvy-tripdata.csv')
CyclisticTripData_2024_07 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202407-divvy-tripdata.csv')
CyclisticTripData_2024_08 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202408-divvy-tripdata.csv')
CyclisticTripData_2024_09 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202409-divvy-tripdata.csv')
CyclisticTripData_2024_10 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202410-divvy-tripdata.csv')
CyclisticTripData_2024_11 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202411-divvy-tripdata.csv')
CyclisticTripData_2024_12 = pd.read_csv('~/projects/20250109_CyclisticBikeShare/1.OriginalData/202412-divvy-tripdata.csv')
print('Importing original data files completed!')
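The twelve imports could also be written more compactly; a minimal loop-based sketch (an alternative to the explicit version above), reading the files into a dictionary keyed by month number:

import pandas as pd

base_path = '~/projects/20250109_CyclisticBikeShare/1.OriginalData'
# monthly_trip_data[1] holds January 2024, monthly_trip_data[12] holds December 2024.
monthly_trip_data = {
    month: pd.read_csv(f'{base_path}/2024{month:02d}-divvy-tripdata.csv')
    for month in range(1, 13)
}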
A summary check of each dataframe’s structure and information was performed to explore consistency in the column numbers, ensuring the same number of columns and the same column names before moving to the next step and merging them into a single data set.
CyclisticTripData_2024_01.info()
CyclisticTripData_2024_02.info()
CyclisticTripData_2024_03.info()
CyclisticTripData_2024_04.info()
CyclisticTripData_2024_05.info()
CyclisticTripData_2024_06.info()
CyclisticTripData_2024_07.info()
CyclisticTripData_2024_08.info()
CyclisticTripData_2024_09.info()
CyclisticTripData_2024_10.info()
CyclisticTripData_2024_11.info()
CyclisticTripData_2024_12.info()
Key findings:
- the data is organized in CSV (comma-separated values) format
- the column numbers and column names are consistent across files
- the column types are consistent: 4 columns have data type ‘float64’ and 9 columns have data type ‘object’
- the column labels include ride id, rideable type, started and ended time, start and end station - name, id, latitude and longitude - and member type
- the number of observations differs from month to month
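These structural checks can also be automated; a minimal sketch, comparing every monthly dataframe’s columns and dtypes against the first one:

# Verify that all 12 monthly dataframes share the same columns and dtypes.
frames = [CyclisticTripData_2024_01, CyclisticTripData_2024_02, CyclisticTripData_2024_03,
          CyclisticTripData_2024_04, CyclisticTripData_2024_05, CyclisticTripData_2024_06,
          CyclisticTripData_2024_07, CyclisticTripData_2024_08, CyclisticTripData_2024_09,
          CyclisticTripData_2024_10, CyclisticTripData_2024_11, CyclisticTripData_2024_12]
assert all(frame.columns.equals(frames[0].columns) for frame in frames)
assert all(frame.dtypes.equals(frames[0].dtypes) for frame in frames)
print('All 12 dataframes are structurally consistent!')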
The data can be trusted: it is published by a trustworthy source, and it is aligned with the question that was assigned to me. However, all ride ids are unique, and because data privacy prohibits using riders’ personally identifiable information, it will not be possible to determine whether riders have purchased multiple single passes or whether they live in the Cyclistic service area. Now that a description of all the data sources used has been explored, the data is ready to be processed and cleaned for analysis.
In this step, the 12 dataframes are combined into one dataframe, which is then cleaned and manipulated.
The consistency in column names, column types and number of columns across the dataframes makes it possible to combine the 12 dataframes into a single dataframe for analyzing 12 months of data.
dataframes = [CyclisticTripData_2024_01, CyclisticTripData_2024_02, CyclisticTripData_2024_03,
CyclisticTripData_2024_04, CyclisticTripData_2024_05, CyclisticTripData_2024_06,
CyclisticTripData_2024_07, CyclisticTripData_2024_08, CyclisticTripData_2024_09,
CyclisticTripData_2024_10, CyclisticTripData_2024_11, CyclisticTripData_2024_12]
CyclisticTripData = pd.concat(dataframes, ignore_index=True)
Then the large dataframe can be verified to quickly check for uniformity and to get a better feel for the data.
CyclisticTripData.info()
CyclisticTripData.shape
CyclisticTripData.columns
CyclisticTripData.head()
CyclisticTripData.tail()
And the CyclisticTripData dataframe looks like this:
| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C1D650626C8C899A | electric_bike | 2024-01-12 15:30:27 | 2024-01-12 15:37:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.903267 | -87.634737 | 41.889177 | -87.638506 |
1 | EECD38BDB25BFCB0 | electric_bike | 2024-01-08 15:45:46 | 2024-01-08 15:52:59 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902937 | -87.634440 | 41.889177 | -87.638506 |
2 | F4A9CE78061F17F7 | electric_bike | 2024-01-27 12:27:19 | 2024-01-27 12:35:19 | Wells St & Elm St | KA1504000135 | Kingsbury St & Kinzie St | KA1503000043 | 41.902951 | -87.634470 | 41.889177 | -87.638506 |
3 | 0A0D9E15EE50B171 | classic_bike | 2024-01-29 16:26:17 | 2024-01-29 16:56:06 | Wells St & Randolph St | TA1305000030 | Larrabee St & Webster Ave | 13193 | 41.884295 | -87.633963 | 41.921822 | -87.644140 |
4 | 33FFC9805E3EFF9A | classic_bike | 2024-01-31 05:43:23 | 2024-01-31 06:09:35 | Lincoln Ave & Waveland Ave | 13253 | Kingsbury St & Kinzie St | KA1503000043 | 41.948797 | -87.675278 | 41.889177 | -87.638506 |
| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5860563 | BD56BA20F42E4794 | electric_bike | 2024-12-11 08:23:46.564 | 2024-12-11 08:37:34.532 | Clybourn Ave & Division St | TA1307000115 | NaN | NaN | 41.904634 | -87.640518 | 41.880000 | -87.630000 |
5860564 | 3074643A6B60B300 | electric_bike | 2024-12-09 12:26:15.677 | 2024-12-09 12:37:32.712 | Canal St & Jackson Blvd | 13138 | NaN | NaN | 41.878125 | -87.639968 | 41.900000 | -87.620000 |
5860565 | 15602635C5DF484E | electric_bike | 2024-12-31 17:10:03.113 | 2024-12-31 17:17:21.838 | Albany Ave & Bloomingdale Ave | 15655 | California Ave & Milwaukee Ave | 13084 | 41.914027 | -87.705126 | 41.922695 | -87.697153 |
5860566 | F15ABBA961560B75 | electric_bike | 2024-12-01 14:39:47.216 | 2024-12-01 14:45:21.268 | Albany Ave & Bloomingdale Ave | 15655 | California Ave & Milwaukee Ave | 13084 | 41.914003 | -87.705099 | 41.922695 | -87.697153 |
5860567 | 8AF273287533B527 | electric_bike | 2024-12-17 06:38:32.320 | 2024-12-17 06:46:27.167 | Albany Ave & Bloomingdale Ave | 15655 | NaN | NaN | 41.914027 | -87.705126 | 41.920000 | -87.690000 |
From looking at the output information, there are no columns to be removed; all the collected data can be used to get insights for the analysis. Moreover, looking at the output for the last 5 rows, NaN values are observed.
Now, the large dataframe is ready to be checked to confirm that the data is clean and free of errors, and to determine whether any modifications are needed.
Note: The large dataframe is exported to a .CSV file as a saved version.
CyclisticTripData.to_csv('~/projects/20250109_CyclisticBikeShare/2.PreparedData/20250205_CyclisticTripData.csv', index=False)
Next, a closer look at the data is taken to check for duplicates, null values, and inconsistent values that need to be cleaned.
duplicate_number = CyclisticTripData['ride_id'].duplicated().sum()
duplicate = CyclisticTripData[CyclisticTripData['ride_id'].duplicated()]
print("Number of duplicate ride_id:", duplicate_number)
print("Duplicate Observations:", duplicate)
- There are 211 duplicates. A closer look at the duplicates allowed me to identify that they correspond to data duplicated in months 05 and 06, for rides that started on 2024-05-31 and finished on 2024-06-01. Moreover, the collected data changed time format, from HH:MM:SS up to month 05 to HH:MM:SS.SSS from month 06 onwards. Considering the purpose of the analysis, the timestamps were therefore made consistent to match the HH:MM:SS format.
CyclisticTripData[['started_at', 'ended_at']] = CyclisticTripData[['started_at', 'ended_at']].apply(lambda x: pd.to_datetime(x, format='mixed').dt.strftime('%Y-%m-%d %H:%M:%S'))
Then, only one record is kept for each set of duplicates.
count_row_initial = CyclisticTripData.shape[0]
CyclisticTripData = CyclisticTripData.drop_duplicates(keep='first')
count_row_keep_one = CyclisticTripData.shape[0]
print('Duplicate rows removed:', count_row_initial - count_row_keep_one)
Next, a closer look at the now-unique observations is taken to check for null or empty cells.
# Note: isnull() is an alias of isna(); both counts are shown for completeness.
nan_count = CyclisticTripData.isna().sum().sum()
null_count = CyclisticTripData.isnull().sum().sum()
nan_count_column = CyclisticTripData.isna().sum()
null_count_column = CyclisticTripData.isnull().sum()
print('Number of NaN values in each column:', nan_count_column)
print('Number of null values in each column:', null_count_column)
print('Number of NaN values:', nan_count)
print('Number of null values:', null_count)
- There are 4,371,386 NaN or null values in total. These values appear in a "mixed way", without a pattern, in the columns 'start_station_name', 'start_station_id', 'end_station_name', 'end_station_id', 'end_lat' and 'end_lng'. The observations containing NaN or null values can therefore be removed (a minimal sketch of this step follows below).
- In the end, the large dataframe was left with 4,208,188 observations and 13 features.
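The exact removal code lives in the notebook; a minimal sketch, assuming a straightforward dropna over the whole dataframe:

# Drop every observation containing at least one NaN/null value.
CyclisticTripData = CyclisticTripData.dropna()
print('Observations remaining:', CyclisticTripData.shape[0])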
Now, the consistency of character lengths across columns can be checked.
max_char = CyclisticTripData.astype(str).map(len).max()
min_char = CyclisticTripData.astype(str).map(len).min()
print('Maximum character length per column:', max_char)
print('Minimum character length per column:', min_char)
- Character lengths are consistent in the columns ride_id, started_at, ended_at and member_casual.
- The columns start_station_id, start_station_name, end_station_id and end_station_name will be kept for now, but they most likely will not add value in future steps, leaving open the possibility of removing them later.
Finally, the unique values of the columns rideable_type and member_casual can be checked.
rideable_type_unique = CyclisticTripData['rideable_type'].unique()
rideable_type_counts = CyclisticTripData['rideable_type'].value_counts()
member_casual_unique = CyclisticTripData['member_casual'].unique()
member_casual_counts = CyclisticTripData['member_casual'].value_counts()
print("'rideable_type' values:", rideable_type_unique)
print("'rideable_type' total number:", rideable_type_counts)
print("'member_casual' values:", member_casual_unique)
print("'member_casual' total number:", member_casual_counts)
- There are "member" and "casual" at member_casual column.
- There are "electric_bike", "classic_bike" and "electric_scooter" at rideable_type column.
Now, the data is ready for the transformation and calculation steps to determine the ride length, month, day of the month, day of the week and hour of the day.
New columns hour, day_of_month, day_of_week, month, ride_length_seconds and ride_length are created to calculate (1) the length of each ride, by subtracting the column started_at from the column ended_at, and (2) the hour of the day, day of the month, day of the week and month in which each ride started. Both columns started_at and ended_at are consistent and hold the start and end times in the format YYYY-MM-DD hh:mm:ss.
CyclisticTripData['hour'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).hour
CyclisticTripData['day_of_month'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).day
CyclisticTripData['day_of_week'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).day_name()
CyclisticTripData['month'] = pd.DatetimeIndex(CyclisticTripData["started_at"]).month_name()
# Ride duration in whole seconds (kept as a numeric column for checks and aggregation).
CyclisticTripData['ride_length_seconds'] = (pd.to_datetime(CyclisticTripData['ended_at']) - pd.to_datetime(CyclisticTripData['started_at'])).dt.total_seconds().astype(int)
# Human-readable HH:MM:SS representation of the same duration.
CyclisticTripData['ride_length'] = pd.to_datetime(CyclisticTripData['ride_length_seconds'], unit='s').dt.strftime('%H:%M:%S')
CyclisticTripData.head(3)
Now, a closer look at the data in the new columns is taken to check the consistency and meaning of the values.
month_unique = CyclisticTripData['month'].unique()
day_of_week_unique = CyclisticTripData['day_of_week'].unique()
hour_min = CyclisticTripData['hour'].min()
hour_max = CyclisticTripData['hour'].max()
day_of_month_min = CyclisticTripData['day_of_month'].min()
day_of_month_max = CyclisticTripData['day_of_month'].max()
ride_length_min = CyclisticTripData['ride_length_seconds'].min()
ride_length_max = CyclisticTripData['ride_length_seconds'].max()
ride_length_zeronegative = (CyclisticTripData['ride_length_seconds'] <= 0).sum()
print("'month' values:", month_unique)
print("'day_of_week' total number:", day_of_week_unique)
print(f'minimun hour is {hour_min} and maximum hour is {hour_max}')
print(f'minimun day of month is {day_of_month_min} and maximum day of month is {day_of_month_max}')
print(f'minimun ride length is {ride_length_min} seconds and maximum ride length is {ride_length_max} seconds')
print(f'There are {ride_length_zeronegative} observations with negative or zero ride length')
Key findings:
- The day of the week that each ride started is consistent, with unique values corresponding to the names of the days of the week
- The months are consistent, with unique values corresponding to the months of a year
- There is at least one negative ride length value, and at least one ride lasting a little more than a day (circa 25 hours)
- There are 213 observations with a ride length that is negative or equal to 00:00:00; these observations are removed (see the sketch after this list)
- In the end, the data ready for analysis has 4,207,975 observations and 19 features
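The exact filtering code lives in the notebook; a minimal sketch, assuming ride_length_seconds holds the signed duration:

# Keep only rides with a strictly positive duration.
CyclisticTripData = CyclisticTripData[CyclisticTripData['ride_length_seconds'] > 0]
print('Observations remaining:', CyclisticTripData.shape[0])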
Once the data is prepared, processed, and stored appropriately, it is ready to be analyzed to find trends or relationships, and perhaps to discover some surprises in the data. This will help show how these insights answer the question assigned to me. This step includes conducting descriptive analysis and identifying trends and relationships.
For this step, I followed the Case Study Roadmap for the Analyze phase, and I started by looking at the following questions:
- How should you organize your data to perform analysis on it?
- Has your data been properly formatted?
- What surprises did you discover in the data?
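To give a flavor of this step, a minimal descriptive-analysis sketch (the full analysis lives in the notebook; this assumes the cleaned CyclisticTripData from the Process step):

# Compare ride durations between members and casual riders.
print(CyclisticTripData.groupby('member_casual')['ride_length_seconds'].agg(['count', 'mean', 'median']))

# Count rides per day of the week, split by rider type.
print(CyclisticTripData.groupby(['member_casual', 'day_of_week'])['ride_id'].count())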
README.md: That's this file, where you can describe your project and how you built it.
20250109_Python_CyclisticBikeShareCapstone.ipynb: That's the Jupyter notebook with the code for this project.
The Background and Business Question sections were inspired by the business case description provided in Google Data Analytics Capstone: Complete a Case Study.