Note
- This tutorial is also available on nbviewer if you prefer to follow along there.
- Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
You are a fintech professional working for a bank. Your team has collected a dataset of credit card transactions from various customers, including transaction dates, amounts, categories, and merchant names. However, the data is messy and needs to be cleaned and analyzed to extract valuable insights.
Task:
Use pandas to clean and analyze the dataset, answering the following questions:
- What is the total transaction amount for each category (e.g., Food, Entertainment, Shopping)?
- Which merchant has the highest total transaction amount?
- What is the average transaction amount per day for each category?
Bonus Question:
Identify the top 3 categories with the highest average transaction amount on weekends (Saturday and Sunday).
Answer: Travel, Entertainment, Food (in that order)
Generate the data
The sample dataset has 105 rows and 4 columns (date, amount, category, and merchant). It includes some missing values and duplicates to simulate real data in the wild.
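A minimal sketch of how such a dataset could be generated with numpy and pandas is shown below. The seed, date range, value ranges, and the merchant/category lists are assumptions for illustration, so the exact numbers will not reproduce the tables that follow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # illustrative seed, not the one behind the tables below
n = 100

df = pd.DataFrame({
    "date": pd.to_datetime("2024-01-01") + pd.to_timedelta(rng.integers(0, 25, n), unit="D"),
    "amount": rng.uniform(5, 100, n),
    "category": rng.choice(["Food", "Entertainment", "Shopping", "Travel"], n),
    "merchant": rng.choice(["Starbucks", "McDonalds", "Amazon", "Netflix", "Expedia"], n),
})

# Simulate messy real-world data: blank out a few amounts and append duplicate rows
df.loc[rng.choice(df.index, 5, replace=False), "amount"] = np.nan
df = pd.concat([df, df.head(5)])   # 105 rows total, with repeated index labels 0-4

df.head()
```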
|    | date       | amount    | category      | merchant  |
|----|------------|-----------|---------------|-----------|
| 0  | 2024-01-13 | 67.084665 | Food          | Starbucks |
| 1  | 2024-01-16 | 96.305434 | Shopping      | Expedia   |
| 2  | 2024-01-22 | 68.751129 | Travel        | Expedia   |
| 3  | 2024-01-01 | 67.155299 | Travel        | Amazon    |
| 4  | 2024-01-04 | NaN       | Entertainment | Amazon    |
The data types of the columns look good.
Do you see the Non-Null Count column and the value of 99 for the amount column? That is how we can tell whether our data has any null values.
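The summary below is what `df.info()` reports:

```python
# Non-Null Count shows 99 non-null amounts out of 105 entries
df.info()
```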
```
<class 'pandas.core.frame.DataFrame'>
Index: 105 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      105 non-null    datetime64[ns]
 1   amount    99 non-null     float64
 2   category  105 non-null    object
 3   merchant  105 non-null    object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 4.1+ KB
```
Duplicates
Let's take care of the duplicates in the data.
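A sketch of one way to do that: flag the fully duplicated rows first, then keep only the first occurrence of each.

```python
# Rows that appear more than once (keep=False shows every copy, not just the extras)
df[df.duplicated(keep=False)]

# Drop the extra copies, keeping the first occurrence of each duplicated row
df = df.drop_duplicates()
df.head()
```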
|    | date       | amount    | category      | merchant  |
|----|------------|-----------|---------------|-----------|
| 0  | 2024-01-13 | 67.084665 | Food          | Starbucks |
| 1  | 2024-01-16 | 96.305434 | Shopping      | Expedia   |
| 2  | 2024-01-22 | 68.751129 | Travel        | Expedia   |
| 3  | 2024-01-01 | 67.155299 | Travel        | Amazon    |
| 4  | 2024-01-04 | NaN       | Entertainment | Amazon    |
|    | date       | amount    | category | merchant  |
|----|------------|-----------|----------|-----------|
| 0  | 2024-01-13 | 67.084665 | Food     | Starbucks |
| 0  | 2024-01-13 | 67.084665 | Food     | Starbucks |
|    | date       | amount    | category      | merchant  |
|----|------------|-----------|---------------|-----------|
| 0  | 2024-01-13 | 67.084665 | Food          | Starbucks |
| 1  | 2024-01-16 | 96.305434 | Shopping      | Expedia   |
| 2  | 2024-01-22 | 68.751129 | Travel        | Expedia   |
| 3  | 2024-01-01 | 67.155299 | Travel        | Amazon    |
| 4  | 2024-01-04 | NaN       | Entertainment | Amazon    |
Missing Values
Here I am just going to fill any missing values with the previous known value. If I don't find one, I will simply get rid of the row.
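Listing the rows with a missing amount is a one-liner; a sketch:

```python
# Rows where the transaction amount is missing
df[df["amount"].isna()]
```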
|    | date       | amount | category      | merchant  |
|----|------------|--------|---------------|-----------|
| 4  | 2024-01-04 | NaN    | Entertainment | Amazon    |
| 23 | 2024-01-15 | NaN    | Food          | Expedia   |
| 24 | 2024-01-25 | NaN    | Travel        | McDonalds |
| 49 | 2024-01-08 | NaN    | Entertainment | McDonalds |
| 87 | 2024-01-24 | NaN    | Shopping      | McDonalds |
Let's look at the null rows together with the previous row, so we can check whether forward fill will work for all of them.
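One way to pull up each null row alongside the row just before it is sketched below. Note that the row labels in the next table differ from the missing-value listing above, which suggests the frame was sorted by date and re-indexed in between; that assumed step is included here.

```python
import numpy as np

# Assumed intermediate step: sort by date and reset the index
df = df.sort_values("date").reset_index(drop=True)

# Positions of the rows with a missing amount
null_pos = np.flatnonzero(df["amount"].isna())

# Show each missing row together with the row right before it
# (assumes the very first row of the frame is not one of the missing ones)
df.iloc[np.sort(np.concatenate([null_pos - 1, null_pos]))]
```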
As you can see from the table below, the row right before each missing value belongs to a different transaction (a different merchant and often a different category), so forward filling would just copy an unrelated amount. We cannot make a reasonable guess at what should be there, so we need to remove those rows.
|    | date       | amount    | category      | merchant  |
|----|------------|-----------|---------------|-----------|
| 11 | 2024-01-03 | 70.789517 | Food          | Netflix   |
| 12 | 2024-01-04 | NaN       | Entertainment | Amazon    |
| 28 | 2024-01-08 | 18.313650 | Entertainment | Amazon    |
| 29 | 2024-01-08 | NaN       | Entertainment | McDonalds |
| 47 | 2024-01-14 | 71.979506 | Travel        | McDonalds |
| 48 | 2024-01-15 | NaN       | Food          | Expedia   |
| 80 | 2024-01-24 | 50.604329 | Shopping      | Expedia   |
| 81 | 2024-01-24 | NaN       | Shopping      | McDonalds |
| 85 | 2024-01-25 | 77.425285 | Entertainment | Netflix   |
| 86 | 2024-01-25 | NaN       | Travel        | McDonalds |
Let's remove the null rows under the amount column.
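That is a straightforward dropna on the amount column; a sketch:

```python
# Remove the rows whose amount is still missing, then re-check the summary
df = df.dropna(subset=["amount"])
df.info()
```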
```
<class 'pandas.core.frame.DataFrame'>
Index: 95 entries, 0 to 99
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      95 non-null     datetime64[ns]
 1   amount    95 non-null     float64
 2   category  95 non-null     object
 3   merchant  95 non-null     object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 3.7+ KB
```
Let's Answer The Questions
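Question 1: the total transaction amount for each category is a group-by-and-sum; a sketch:

```python
# Total transaction amount per category
df.groupby("category")[["amount"]].sum()
```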
| category      | amount      |
|---------------|-------------|
| Entertainment | 1576.840628 |
| Food          | 1063.905275 |
| Shopping      | 1060.962022 |
| Travel        | 1822.804450 |
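Question 2: sum by merchant and pick the largest total; a sketch:

```python
# Merchant with the highest total transaction amount
by_merchant = df.groupby("merchant")[["amount"]].sum()
by_merchant.loc[[by_merchant["amount"].idxmax()]]
```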
| merchant  | amount      |
|-----------|-------------|
| McDonalds | 1453.475029 |
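Question 3: group on both date and category to get the average per day; a sketch:

```python
# Average transaction amount per day for each category
df.groupby(["date", "category"])[["amount"]].mean()
```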
| date       | category      | amount    |
|------------|---------------|-----------|
| 2024-01-01 | Entertainment | 29.657538 |
|            | Shopping      | 73.336355 |
|            | Travel        | 46.258544 |
| 2024-01-02 | Entertainment | 19.257702 |
|            | Food          | 35.962879 |
|            | Shopping      | 89.060697 |
|            | Travel        | 89.037412 |
Can You Solve the Bonus Question?
Identify the top 3 categories with the highest average transaction amount on weekends (Saturday and Sunday).
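If you want to check your answer against the one given at the top, a sketch along these lines should work (`dt.dayofweek` is 5 for Saturday and 6 for Sunday):

```python
# Top 3 categories by average transaction amount on weekends
weekend = df[df["date"].dt.dayofweek >= 5]
weekend.groupby("category")["amount"].mean().nlargest(3)
```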