Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- 🔥 Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
You are a data scientist working for an e-commerce company. The marketing team has collected customer data from various sources, including website interactions, social media, and customer surveys. However, due to the diverse sources, there are duplicate records in the dataset.
Task:
Your task is to identify and combine duplicate rows based on specific criteria, and calculate the total spend for each unique customer.
- Identify duplicate rows based on CustomerID, Name, and Email.
- Combine duplicate rows into a single row, adding up the values in the Spent column.
- Calculate the total spend for each unique customer.
Bonus Question:
- What is the average spend per customer for the top 3 customers with the highest total spend? (Answer: 700.00)
Generate the data
Here is a tiny dataset composed of 12 rows that represents customer information, including their ID, name, email, and amount spent.
Columns:
CustomerID | Name | Spent | ||
---|---|---|---|---|
0 | C001 | John | john@example.com | 100 |
1 | C002 | Mary | mary@example.com | 200 |
2 | C003 | David | david@example.com | 300 |
3 | C001 | John | john@example.com | 100 |
4 | C002 | Mary | mary@example.com | 200 |
5 | C004 | Emily | emily@example.com | 400 |
6 | C005 | Michael | michael@example.com | 500 |
7 | C003 | David | david@example.com | 300 |
8 | C006 | Sarah | sarah@example.com | 600 |
9 | C001 | John | john@example.com | 100 |
10 | C002 | Mary | mary@example.com | 200 |
11 | C003 | David | david@example.com | 300 |
Identify Duplicates
CustomerID | Name | Spent | ||
---|---|---|---|---|
3 | C001 | John | john@example.com | 100 |
9 | C001 | John | john@example.com | 100 |
4 | C002 | Mary | mary@example.com | 200 |
10 | C002 | Mary | mary@example.com | 200 |
7 | C003 | David | david@example.com | 300 |
11 | C003 | David | david@example.com | 300 |
Total Spent per Customer
I probably would have removed the duplicate rows. In this example, we are treating the duplicates as additional payments received from the customer. Remember, we are collecting data from various sources.
Spent | |||
---|---|---|---|
CustomerID | Name | ||
C001 | John | john@example.com | 300 |
C002 | Mary | mary@example.com | 600 |
C003 | David | david@example.com | 900 |
C004 | Emily | emily@example.com | 400 |
C005 | Michael | michael@example.com | 500 |
C006 | Sarah | sarah@example.com | 600 |
Can You Solve the Bonus Question?
- What is the average spend per customer for the top 3 customers with the highest total spend? (Answer: 700.00)