Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
Hey friend! Imagine you're a data analyst at an e-commerce company. The marketing team wants to know which products are driving sales, how customer behavior changes over time, and which demographics are most valuable. The problem is, the data is scattered across multiple sources! Your task is to use your Pandas magic to unite the data, answer the marketing team's questions, and become the hero of the company.
Tasks
- 1: Product Performance: Find the top 10 products with the highest total revenue, considering both sales data and product prices. The data includes columns for 'product_id', 'product_price', and 'units_sold'.
- 2: Customer Journey: Analyze how customer behavior changes over their first five purchases. The data includes columns for 'customer_id', 'purchase_date', 'product_id', and 'units_sold'.
Bonus Question:
- Demographic Insights: Calculate the average order value for each age group and gender. The data includes columns for 'customer_id', 'age', 'gender', and 'order_value'.
import libraries
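The import cell itself is not shown here; a typical setup for this kind of tutorial would be something like the following (the matplotlib import is an assumption, included because charts are generated later on):

```python
# Core libraries for the analysis; pandas does the heavy lifting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # assumed: used for the customer-journey charts later
```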
The Data
The generated dataset simulates e-commerce data with 3 tables: products, customers, and orders.
Column Descriptions:
- product_id: unique product identifier
- product_price: product price
- customer_id: unique customer identifier
- age: customer age
- gender: customer gender (male or female)
- purchase_date: order date
- units_sold: number of units sold
| | customer_id | product_id | purchase_date | units_sold | product_price | age | gender |
|---|---|---|---|---|---|---|---|
| 0 | 104 | 52 | 2022-01-01 | 1 | 44.0 | 62 | M |
| 1 | 91 | 90 | 2022-01-02 | 7 | 68.0 | 31 | F |
| 2 | 43 | 80 | 2022-01-03 | 2 | 78.0 | 56 | F |
| 3 | 63 | 46 | 2022-01-04 | 5 | 84.0 | 20 | M |
| 4 | 159 | 96 | 2022-01-05 | 8 | 59.0 | 42 | M |
| … | … | … | … | … | … | … | … |
| 1045 | 183 | 16 | 2022-04-18 | 8 | 49.0 | 36 | M |
| 1046 | 47 | 55 | 2023-01-20 | 5 | 46.0 | 58 | F |
| 1047 | 49 | 98 | 2024-01-26 | 1 | 51.0 | 48 | F |
| 1048 | 0 | 46 | 2022-02-27 | 5 | 84.0 | 18 | F |
| 1049 | 0 | 23 | 2023-03-07 | 5 | 87.0 | 18 | F |

1050 rows × 7 columns
Let's take a look at the datatypes to ensure all of the columns are of the correct type. Take note of the output below: it also reveals which columns contain null values (gender and product_price).
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1050 entries, 0 to 1049
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   customer_id    1050 non-null   int32
 1   product_id     1050 non-null   int32
 2   purchase_date  1050 non-null   datetime64[ns]
 3   units_sold     1050 non-null   int32
 4   product_price  945 non-null    float64
 5   age            1050 non-null   int32
 6   gender         997 non-null    object
dtypes: datetime64[ns](1), float64(1), int32(4), object(1)
memory usage: 41.1+ KB
```
Missing Values
To keep things simple, we can get rid of the rows that have missing prices: price is an important column, and we don't really want to make guesses here. The gender column is a different story. For this tutorial, we are not doing any filtering by gender (except in the bonus question), so for now let us ignore the missing values in the gender column.
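The drop can be done with `dropna` and its `subset` parameter, which removes rows with nulls only in the named columns. A minimal sketch with a toy frame (the values are made up):

```python
import pandas as pd

# Toy frame mirroring the tutorial's columns (values are made up)
df = pd.DataFrame({
    "product_price": [44.0, None, 78.0],
    "gender": ["M", None, "F"],
})

# Drop only the rows missing product_price; missing gender values stay
df = df.dropna(subset=["product_price"])
```

On the real dataset this is what takes the frame from 1050 rows down to 945.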
```
<class 'pandas.core.frame.DataFrame'>
Index: 945 entries, 0 to 1049
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   customer_id    945 non-null    int32
 1   product_id     945 non-null    int32
 2   purchase_date  945 non-null    datetime64[ns]
 3   units_sold     945 non-null    int32
 4   product_price  945 non-null    float64
 5   age            945 non-null    int32
 6   gender         894 non-null    object
dtypes: datetime64[ns](1), float64(1), int32(4), object(1)
memory usage: 44.3+ KB
```
Duplicates
Let's pretend you took a look at specific duplicate rows and determined they were in fact duplicates and needed to be removed.
| | customer_id | product_id | purchase_date | units_sold | product_price | age | gender |
|---|---|---|---|---|---|---|---|
| 1049 | 0 | 23 | 2023-03-07 | 5 | 87.0 | 18 | F |
| 1048 | 0 | 46 | 2022-02-27 | 5 | 84.0 | 18 | F |
| 1038 | 3 | 4 | 2024-02-29 | 2 | 77.0 | 48 | M |
| 1025 | 7 | 20 | 2024-09-07 | 4 | 91.0 | 47 | F |
| 1003 | 7 | 83 | 2022-09-23 | 6 | 57.0 | 47 | F |
| | customer_id | product_id | purchase_date | units_sold | product_price | age | gender |
|---|---|---|---|---|---|---|---|
| 430 | 0 | 23 | 2023-03-07 | 5 | 87.0 | 18 | F |
| 1049 | 0 | 23 | 2023-03-07 | 5 | 87.0 | 18 | F |
Product Performance:
Find the top 10 products with the highest total revenue, considering both sales data and product prices. The data includes columns for 'product_id', 'product_price', and 'units_sold'.
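The first step is a revenue column: revenue per order is simply units sold times the unit price. A sketch with stand-in rows (the real frame has 900 rows after cleaning):

```python
import pandas as pd

# Stand-in rows; the real frame has 900 rows after cleaning
df = pd.DataFrame({
    "units_sold": [1, 7, 2],
    "product_price": [44.0, 68.0, 78.0],
})

# Revenue for each order is units sold times the unit price
df["revenue"] = df["units_sold"] * df["product_price"]
```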
| | customer_id | product_id | purchase_date | units_sold | product_price | age | gender | revenue |
|---|---|---|---|---|---|---|---|---|
| 0 | 104 | 52 | 2022-01-01 | 1 | 44.0 | 62 | M | 44.0 |
| 1 | 91 | 90 | 2022-01-02 | 7 | 68.0 | 31 | F | 476.0 |
| 2 | 43 | 80 | 2022-01-03 | 2 | 78.0 | 56 | F | 156.0 |
| 3 | 63 | 46 | 2022-01-04 | 5 | 84.0 | 20 | M | 420.0 |
| 4 | 159 | 96 | 2022-01-05 | 8 | 59.0 | 42 | M | 472.0 |
| … | … | … | … | … | … | … | … | … |
| 994 | 116 | 67 | 2024-09-21 | 3 | 51.0 | 41 | M | 153.0 |
| 995 | 104 | 30 | 2024-09-22 | 6 | 57.0 | 62 | M | 342.0 |
| 996 | 126 | 37 | 2024-09-23 | 1 | 29.0 | 58 | NaN | 29.0 |
| 998 | 116 | 5 | 2024-09-25 | 9 | 19.0 | 41 | M | 171.0 |
| 999 | 16 | 59 | 2024-09-26 | 2 | 27.0 | 28 | F | 54.0 |

900 rows × 8 columns
To determine the products with the highest total revenue, we will make use of Pandas' groupby method. After calculating the total revenue per product, we sort the data from highest to lowest and select the top ten rows.
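That chain of steps looks roughly like this (toy revenue rows here; the tutorial's real numbers appear in the table below):

```python
import pandas as pd

# Toy revenue rows; the real result is shown in the table below
df = pd.DataFrame({
    "product_id": [72, 72, 94, 17],
    "revenue": [4000.0, 2808.0, 6650.0, 6111.0],
})

# Sum revenue per product, sort descending, keep the top ten
top10 = (
    df.groupby("product_id")[["revenue"]]
      .sum()
      .sort_values(by="revenue", ascending=False)
      .head(10)
)
```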
| product_id | revenue |
|---|---|
| 72 | 6808.0 |
| 94 | 6650.0 |
| 17 | 6111.0 |
| 29 | 5874.0 |
| 78 | 5452.0 |
| 11 | 5194.0 |
| 71 | 5152.0 |
| 23 | 5133.0 |
| 43 | 5092.0 |
| 97 | 4898.0 |
Customer Journey:
Analyze how customer behavior changes over their first 5 purchases. The data includes columns for 'customer_id', 'purchase_date', 'product_id', and 'units_sold'.
I wanted to start by looking at the units sold over the dataset. We might be able to see a pattern here, but since we are working with randomly generated data, I don't see much here. Do you?
| customer_id | | units_sold | purchase_date |
|---|---|---|---|
| 0 | 57 | 5 | 2022-02-27 |
| | 67 | 5 | 2022-03-09 |
| | 331 | 8 | 2022-11-28 |
| | 430 | 5 | 2023-03-07 |
| | 849 | 6 | 2024-04-29 |
| … | … | … | … |
| 199 | 206 | 4 | 2022-07-26 |
| | 546 | 1 | 2023-07-01 |
| | 623 | 4 | 2023-09-16 |
| | 815 | 5 | 2024-03-26 |
| | 940 | 2 | 2024-07-29 |

780 rows × 2 columns
Now that we have trimmed the data to the first 5 purchases per customer, let's generate the chart again.
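One way to do that trim, sketched on toy data (sorting first guarantees "first five" means the five earliest purchase dates):

```python
import pandas as pd

# Toy data: customer 1 has seven orders, customer 2 has three
df = pd.DataFrame({
    "customer_id": [1] * 7 + [2] * 3,
    "purchase_date": pd.date_range("2022-01-01", periods=10, freq="D"),
    "units_sold": list(range(10)),
})

# Sort chronologically, then keep each customer's first five orders
first5 = (
    df.sort_values("purchase_date")
      .groupby("customer_id")
      .head(5)
)
```

Note that `groupby(...).head(5)` keeps customers with fewer than five purchases as-is, which is usually what you want here.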
As you can see below, we are still in the same boat. At this point, you may want to further filter the data by gender/age and see if you find any patterns there.
Can You Solve the Bonus Question?
Calculate the average order value for each age group and gender. The data includes columns for 'customer_id', 'age', 'gender', and 'order_value'.
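If you want a hint, one possible approach is to bin ages with `pd.cut` and then group by the bin and gender. A sketch with made-up rows (the age bins here are an arbitrary choice, not the tutorial's):

```python
import pandas as pd

# Made-up rows with the bonus question's columns
df = pd.DataFrame({
    "age": [18, 31, 47, 62],
    "gender": ["F", "F", "M", "M"],
    "order_value": [435.0, 476.0, 364.0, 44.0],
})

# Bin ages into groups (bin edges are an arbitrary choice for this sketch),
# then average order value per (age group, gender)
df["age_group"] = pd.cut(df["age"], bins=[17, 30, 45, 65],
                         labels=["18-30", "31-45", "46-65"])
avg_order = df.groupby(["age_group", "gender"], observed=True)["order_value"].mean()
```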