Note

  • This tutorial is also available on nbviewer if you prefer to read it there.
  • Pandas Ninja: Take your skills to the next level with comprehensive Jupyter Notebook tutorials covering dates, group by, plotting, pivot tables, and more. Includes specialized tutorials for Excel and SQL developers, helping you master data analysis with Pandas.

Description:

You are a data scientist working for an e-commerce company. The marketing team has collected customer data from various sources, including website interactions, social media, and customer surveys. However, due to the diverse sources, there are duplicate records in the dataset.

Task:

Your task is to identify and combine duplicate rows based on specific criteria, and calculate the total spend for each unique customer.

  • Identify duplicate rows based on CustomerID, Name, and Email.
  • Combine duplicate rows into a single row, adding up the values in the Spent column.
  • Calculate the total spend for each unique customer.

Bonus Question:

  • What is the average spend per customer for the top 3 customers with the highest total spend? (Answer: 700.00)
# import libraries
import pandas as pd
import numpy as np

Generate the data

Here is a tiny dataset of 12 rows (9 base rows plus 3 intentionally appended duplicates) representing customer information: ID, name, email, and amount spent.

Columns:

CustomerID (string): unique customer identifier
Name (string): customer name
Email (string): customer email
Spent (integer): amount spent by the customer
# sample data placed in a dictionary
data = {
    'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C002', 'C004', 'C005', 'C003', 'C006'],
    'Name': ['John', 'Mary', 'David', 'John', 'Mary', 'Emily', 'Michael', 'David', 'Sarah'],
    'Email': ['john@example.com', 'mary@example.com', 'david@example.com',
              'john@example.com', 'mary@example.com', 'emily@example.com',
              'michael@example.com', 'david@example.com', 'sarah@example.com'],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600]
}

# create the dataframe
df = pd.DataFrame(data)

# introduce duplicates
duplicates = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003'],
    'Name': ['John', 'Mary', 'David'],
    'Email': ['john@example.com', 'mary@example.com', 'david@example.com'],
    'Spent': [100, 200, 300]
})

# combine the dataframes
df = pd.concat([df, duplicates], ignore_index=True)

df
   CustomerID     Name                Email  Spent
0        C001     John     john@example.com    100
1        C002     Mary     mary@example.com    200
2        C003    David    david@example.com    300
3        C001     John     john@example.com    100
4        C002     Mary     mary@example.com    200
5        C004    Emily    emily@example.com    400
6        C005  Michael  michael@example.com    500
7        C003    David    david@example.com    300
8        C006    Sarah    sarah@example.com    600
9        C001     John     john@example.com    100
10       C002     Mary     mary@example.com    200
11       C003    David    david@example.com    300

Identify Duplicates

# intentionally not removing the duplicates, as they represent
# additional payments from the same customer
df[df.duplicated()].sort_values(by='CustomerID')
   CustomerID   Name              Email  Spent
3        C001   John   john@example.com    100
9        C001   John   john@example.com    100
4        C002   Mary   mary@example.com    200
10       C002   Mary   mary@example.com    200
7        C003  David  david@example.com    300
11       C003  David  david@example.com    300
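Note that duplicated() with its default keep='first' flags only the later copies of each record. If you want to surface every occurrence, the first copy included, you can pass keep=False. A minimal sketch on the same 12-row data:

```python
import pandas as pd

# the same 12-row frame built above (9 base rows + 3 appended duplicates)
df = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C002', 'C004',
                   'C005', 'C003', 'C006', 'C001', 'C002', 'C003'],
    'Name': ['John', 'Mary', 'David', 'John', 'Mary', 'Emily',
             'Michael', 'David', 'Sarah', 'John', 'Mary', 'David'],
    'Email': ['john@example.com', 'mary@example.com', 'david@example.com',
              'john@example.com', 'mary@example.com', 'emily@example.com',
              'michael@example.com', 'david@example.com', 'sarah@example.com',
              'john@example.com', 'mary@example.com', 'david@example.com'],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600, 100, 200, 300],
})

# keep=False marks every occurrence of a duplicated row,
# so the first copies (indices 0, 1, 2) are included as well
all_copies = df[df.duplicated(keep=False)].sort_values(by='CustomerID')
print(all_copies)
```

This returns 9 rows (three copies each of C001, C002, and C003) instead of the 6 shown above.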

Total Spent per Customer

Normally I would have removed the duplicate rows. In this example, however, we treat the duplicates as additional payments received from the same customer. Remember, we are collecting data from various sources.

group = df.groupby(['CustomerID','Name','Email'])
 
# calculate the sum
group.sum()
                                        Spent
CustomerID Name    Email
C001       John    john@example.com       300
C002       Mary    mary@example.com       600
C003       David   david@example.com      900
C004       Emily   emily@example.com      400
C005       Michael michael@example.com    500
C006       Sarah   sarah@example.com      600
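If you would rather get a flat DataFrame than the MultiIndex shown above, the same aggregation can be written with as_index=False (calling reset_index() on the grouped result is an equivalent option). A sketch on the same data:

```python
import pandas as pd

# the same 12-row frame as above
df = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C002', 'C004',
                   'C005', 'C003', 'C006', 'C001', 'C002', 'C003'],
    'Name': ['John', 'Mary', 'David', 'John', 'Mary', 'Emily',
             'Michael', 'David', 'Sarah', 'John', 'Mary', 'David'],
    'Email': ['john@example.com', 'mary@example.com', 'david@example.com',
              'john@example.com', 'mary@example.com', 'emily@example.com',
              'michael@example.com', 'david@example.com', 'sarah@example.com',
              'john@example.com', 'mary@example.com', 'david@example.com'],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600, 100, 200, 300],
})

# as_index=False keeps the grouping keys as ordinary columns
totals = df.groupby(['CustomerID', 'Name', 'Email'], as_index=False)['Spent'].sum()
print(totals)
```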

Can You Solve the Bonus Question?

  • What is the average spend per customer for the top 3 customers with the highest total spend? (Answer: 700.00)
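One possible solution, sketched with Series.nlargest (the tutorial does not prescribe a specific approach):

```python
import pandas as pd

# the same 12-row frame as above
df = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003', 'C001', 'C002', 'C004',
                   'C005', 'C003', 'C006', 'C001', 'C002', 'C003'],
    'Name': ['John', 'Mary', 'David', 'John', 'Mary', 'Emily',
             'Michael', 'David', 'Sarah', 'John', 'Mary', 'David'],
    'Email': ['john@example.com', 'mary@example.com', 'david@example.com',
              'john@example.com', 'mary@example.com', 'emily@example.com',
              'michael@example.com', 'david@example.com', 'sarah@example.com',
              'john@example.com', 'mary@example.com', 'david@example.com'],
    'Spent': [100, 200, 300, 100, 200, 400, 500, 300, 600, 100, 200, 300],
})

# total spend per unique customer
totals = df.groupby(['CustomerID', 'Name', 'Email'])['Spent'].sum()

# average of the 3 highest totals (900 + 600 + 600) / 3
top3_avg = totals.nlargest(3).mean()
print(top3_avg)  # 700.0
```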