Cyclistic Marketing Case Study (2023) - Google Capstone Project - Comprehensive

Cyclistic Marketing Case Study - Google Capstone Project - Comprehensive

Note: This blog post is my thorough output of the Cyclistic Marketing Case Study as a part of Google Capstone Project. I already did this case study before (check here), and that time, I used Google Sheets only. Now, I tried to incorporate it by using SQL and any other data analytic tools to make it more comprehensive and more 

Moreover, this post may contain affiliate links that at no additional cost to you, I may earn a small commission. Thank you for your support.

Business Tasks

Cyclistic Company defined casual riders are those customers who purchase either single-ride passes or full-day ride passes, while they defined Cyclistic members are those customers who purchased annual memberships. Concluded by financial analysts, the company has identified that its Cyclistic members contribute more significantly to the company's profits than casual riders. Maam Lily Moreno, a marketing director and also a stakeholder in this project, has decided to shift the focus of the marketing strategy away from acquiring new customers. Instead, the goal is to convert casual riders into Cyclistic members. As part of the company’s marketing team, she seeks to understand the differences between casual riders and Cyclistic members and how these differences can impact the design of the new marketing campaign.

Description of Data Sources Used

The company provides a historical trip dataset that is every month, starting from December 2022 to November 2023. You can verify the Cyclistic trip data here. The data is organized by Motivate International Inc. which has a data license agreement. Listed below are the 12 .csv files that are used to analyze the data.

  1. 202212-divvy-tripdata
  2. 202301-divvy-tripdata
  3. 202302-divvy-tripdata
  4. 202303-divvy-tripdata
  5. 202304-divvy-tripdata
  6. 202305-divvy-tripdata
  7. 202306-divvy-tripdata
  8. 202307-divvy-tripdata
  9. 202308-divvy-tripdata
  10. 202309-divvy-tripdata
  11. 202310-divvy-tripdata
  12. 202311-divvy-tripdata

Every file contains trip data and is structured in a long form. Part of the data license agreement is to prohibit the use of riders’ personally identifiable information for data privacy so, it cannot access the credit card numbers of every rider. The following are the columns that are indicated in the 12 CSV files.

  • ride_id
  • rideable_type
  • started_at
  • ended_at
  • start_station_name
  • start_station_id
  • end_station_name
  • end_station_id
  • start_lat
  • start_lng
  • end_lat
  • end_lng
  • member_casual

The data sources maintain ROCCC.

  • Reliable- The datasets contain accurate and unbiased information on Cyclistic’s transactions every day.
  • Original- The datasets is public and the source is based on the transactions in Chicago and has a license to use the data.
  • Comprehensive- There is a column that differentiates between casual riders and members riders.
  • Current- The latest public dataset available is December 2023 and the scope of this analysis one year.
  • Cited- The data is filed by Motivate International Inc. which has a data license agreement. Allow us to use the datasets in public.

Even though data about the station name and ID are not fully complete and mostly blanks, it does not affect getting information about the difference between the two. Hence, the remaining data is enough to achieve our business task and have a great contribution to making a new plan for the marketing campaign. 

Documentation of Data Cleaning and Data Manipulation

Microsoft Excel: Initial Data Cleaning and Manipulation

As part of the Prepare phase, I downloaded all the 12 zip files and unzipped them. I opened it and converted it from CSV to XLS files. I created folders for the CSV and XLS files. For each XLS file, I did the following:

  1. Created a column named ride_length, in which it subtracts the ended_at and started_at (=d2-c2) and formatted as hh:mm:ss.
  2. Created a column named day_of_week, in which determines the days of the week assigned by their numbers, and formatted as Numbers with no decimals.
  3. Then converted all the files into CSV files for uploading in Big Query.

Station name will be used in joins in Big Query.

Since my account in BigQuery was in trial mode, file sizes were limited to 100 MB, so some of the updated CSV files could not be uploaded. To pursue it, I modified again the CSV files:

  1. Opened new Excel file that is used for the station name, station id, and its longitude and latitude,
  2. Copied the start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, and end_lng columns from each CSV files.
  3. Pasted it in the Excel Files and used Remove Duplicates to delete all the duplications in the Excel and remains the unique data
  4. Saved it into CSV files and named as Station_name
  5. Deleted the start_station_name, start_station_id, end_station_name, and end_station_id column in each CSV file that I’d copied to reduce the file size.
  6. In that case, it would have a two sets of CSV files one for the monthly data and the other for the station names.
  7. All the CSV files were uploaded to BigQuery.

During the uploading of BigQuery, Some of the CSV files regarding to the datatrip did not continue because of the errors. I found out that the cause of the error is that time indicated in ended_at was earlier than the started_at. I did the following:

  1. Modified the ride_length in each CSV files by using an absolute to remove the negative symbol i.e. =abs(d2-c2)
  2. Saved it and uploaded it again.

BigQuery: Further Data Cleaning or Manipulating using Structured Query Language

Using the BigQuery SQL, I unite all the monthly data into one database using the UNION command. I also named the day_of_week column to the actual days of the week. See the query below:

--Objective: To clean or manipulate the data
--To unite all monthly data

With raw_data as

(SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202212-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202301-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202302-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202303-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202304-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202305-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202306-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202307-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202308-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202309-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202310-divvy-tripdata`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.202311-divvy-tripdata`

)


--Naming day_of_week into actual days

(select *

, case

  when day_of_week = 1 then 'Sunday'

  when day_of_week = 2 then 'Monday'

  when day_of_week = 3 then 'Tuesday'

  when day_of_week = 4 then 'Wednesday'

  when day_of_week = 5 then 'Thursday'

  when day_of_week = 6 then 'Friday'

  when day_of_week = 7 then 'Saturday'

  else null end as Weekdays

from raw_data

)

I also unite all the station names. The processed station names were also be used for analysis. See the query below:

--To unite all the station name

with raw_data_station_name as (

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.Stations_name-p1`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.Stations_name-p2`

union all

SELECT * FROM `gold-pod-375712.Cyclistic_Marketing_Study.Stations_name-p3`

)


--to get the start station name

,start0 as

(select

"Start" as Point

, start_lat

, start_lng

, start_station_name

, start_station_id

from raw_data_station_name)


, start1 as

(select * from start0

group by 1,2,3,4,5

)


, end0 as

(select

"End" as point

, end_lat

, end_lng

, end_station_name

, end_station_id

from raw_data_station_name)


, end1 as

(select * from end0

group by 1,2,3,4,5

)


--Uniting all the stations (both starting points and ending points)


,uniting as

(select * from end1

union all

select * from start1

order by 1 desc

)


select point

, end_lat as latittude

, end_lng as longitude

, end_station_id as station_id

, end_station_name as station_name

from uniting

group by 1,2,3,4,5


All the queries indicated above were saved as a view:

  • Cyclistic_Clean_Data
  • Cyclistic_Station_Data

Summary of Analysis 


Post a Comment

0 Comments