Note

  • This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
  • Pandas Ninja: Take your skills to the next level with comprehensive Jupyter Notebook tutorials covering dates, group by, plotting, pivot tables, and more. Includes specialized tutorials for Excel and SQL developers, helping you master data analysis with Pandas.

Description:

You are a fintech professional working for a bank. Your team has collected a dataset of credit card transactions from various customers, including transaction dates, amounts, categories, and merchant names. However, the data is messy and needs to be cleaned and analyzed to extract valuable insights.

Task:

Use pandas to clean and analyze the dataset, answering the following questions:

  • What is the total transaction amount for each category (e.g., Food, Entertainment, Shopping)?
  • Which merchant has the highest total transaction amount?
  • What is the average transaction amount per day for each category?

Bonus Question:

Identify the top 3 categories with the highest average transaction amount on weekends (Saturday and Sunday).

Answer: Travel, Entertainment, Food (in that order)

# import libraries
import pandas as pd
import numpy as np

Generate the data

The code below generates a sample dataset with 105 rows and 4 columns (date, amount, category, and merchant). The data includes some missing values and duplicates to simulate real data in the wild.

# set the seed
np.random.seed(0)
 
# hold the values to randomly pick from
categories = ['Food', 'Entertainment', 'Shopping', 'Travel']
merchants = ['McDonalds', 'Amazon', 'Netflix', 'Expedia', 'Starbucks']
dates = pd.date_range('2024-01-01', '2024-01-31')
 
# generate the data
data = {
'date': np.random.choice(dates, size=100),
'amount': np.random.uniform(10, 100, size=100),
'category': np.random.choice(categories, size=100),
'merchant': np.random.choice(merchants, size=100)
}
 
# add the data to our dataframe
df = pd.DataFrame(data)
 
# introduce some missing values
# we randomly select 5 index values and set the column "amount" to null
df.loc[np.random.choice(df.index, size=5), 'amount'] = np.nan
 
# introduce some duplicates
# we select the first 5 rows and append them to the dataframe
dupes = df.iloc[:5]
df = pd.concat([df, dupes])
 
df.head()
        date     amount       category   merchant
0 2024-01-13  67.084665           Food  Starbucks
1 2024-01-16  96.305434       Shopping    Expedia
2 2024-01-22  68.751129         Travel    Expedia
3 2024-01-01  67.155299         Travel     Amazon
4 2024-01-04        NaN  Entertainment     Amazon

The data types of the columns look good.

Do you see the Non-Null Count column and the value of 99 for amount? With 105 entries in total, that tells us the amount column has 6 null values (the 5 we introduced, plus the duplicated NaN row). This is how we can determine whether our data has any nulls.

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 105 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      105 non-null    datetime64[ns]
 1   amount    99 non-null     float64
 2   category  105 non-null    object
 3   merchant  105 non-null    object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 4.1+ KB
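Another quick way to count nulls per column, if you just want the numbers rather than the full `info()` report, is `isna().sum()`. A small sketch on toy data (hypothetical values, not the tutorial's dataset):

```python
import pandas as pd
import numpy as np

# toy frame with one missing amount (hypothetical data)
toy = pd.DataFrame({
    'amount': [1.0, np.nan, 3.0],
    'category': ['Food', 'Travel', 'Food'],
})

# isna() flags nulls, sum() counts them per column
null_counts = toy.isna().sum()
print(null_counts)
```

On the tutorial's DataFrame the same call would report 6 nulls under amount and 0 everywhere else.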

Duplicates

Let’s take care of the duplicates in the data.

# here is how we can identify the duplicates
df[df.duplicated()]
        date     amount       category   merchant
0 2024-01-13  67.084665           Food  Starbucks
1 2024-01-16  96.305434       Shopping    Expedia
2 2024-01-22  68.751129         Travel    Expedia
3 2024-01-01  67.155299         Travel     Amazon
4 2024-01-04        NaN  Entertainment     Amazon
# here is how we can pinpoint one of the duplicates by its amount
# np.isclose is vectorized, so we can apply it to the whole column at once
df[np.isclose(df['amount'], 67.084665)]
        date     amount category   merchant
0 2024-01-13  67.084665     Food  Starbucks
0 2024-01-13  67.084665     Food  Starbucks
# pandas makes it very easy to get rid of the duplicates
df = df.drop_duplicates()
 
df.head()
        date     amount       category   merchant
0 2024-01-13  67.084665           Food  Starbucks
1 2024-01-16  96.305434       Shopping    Expedia
2 2024-01-22  68.751129         Travel    Expedia
3 2024-01-01  67.155299         Travel     Amazon
4 2024-01-04        NaN  Entertainment     Amazon
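Note that `df.duplicated()` only flags the later copies of each duplicate. If you want to see every copy side by side, `duplicated` takes a `keep` parameter. A small sketch on toy data (hypothetical values, not the tutorial's dataset):

```python
import pandas as pd

# toy frame where rows 0 and 2 are identical (hypothetical data)
toy = pd.DataFrame({
    'category': ['Food', 'Travel', 'Food'],
    'amount': [10.0, 25.0, 10.0],
})

# keep=False marks ALL copies of a duplicate, not just the later ones
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)
```

This is handy when you want to eyeball both the original row and its duplicate before deciding which to keep.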

Missing Values

Here I am going to forward-fill any missing values with the previous known value. Where that doesn’t make sense, I will simply drop the row.

# these are the null values
df[df.loc[:,'amount'].isna()]
         date  amount       category   merchant
4  2024-01-04     NaN  Entertainment     Amazon
23 2024-01-15     NaN           Food    Expedia
24 2024-01-25     NaN         Travel  McDonalds
49 2024-01-08     NaN  Entertainment  McDonalds
87 2024-01-24     NaN       Shopping  McDonalds

Let’s look at each null row together with the row just before it, so we can judge whether forward fill would work.

As you can see from the table below, the row before each missing value is a different transaction (a different category and/or merchant), so forward-filling would copy an unrelated amount. We cannot make a reasonable guess at what should be there, so we just need to remove those rows.

# make sure data is ordered
df = df.sort_values(by=['date','category','merchant'])
 
# make sure index values are in order
# drop=True discards the old index instead of adding it as a column
df = df.reset_index(drop=True)
 
# get null rows
null_rows = df[df['amount'].isnull()]
 
# get null indexes
null_indices = null_rows.index
 
# get previous rows
previous_rows = df.loc[null_indices - 1]
 
# combine them
pd.concat([null_rows,previous_rows]).sort_index()
         date     amount       category   merchant
11 2024-01-03  70.789517           Food    Netflix
12 2024-01-04        NaN  Entertainment     Amazon
28 2024-01-08  18.313650  Entertainment     Amazon
29 2024-01-08        NaN  Entertainment  McDonalds
47 2024-01-14  71.979506         Travel  McDonalds
48 2024-01-15        NaN           Food    Expedia
80 2024-01-24  50.604329       Shopping    Expedia
81 2024-01-24        NaN       Shopping  McDonalds
85 2024-01-25  77.425285  Entertainment    Netflix
86 2024-01-25        NaN         Travel  McDonalds
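To see concretely why a plain forward fill would mislead us here, consider a toy two-row example (hypothetical values, not the tutorial's dataset) where the row before a missing amount belongs to a different category:

```python
import pandas as pd

# toy frame: the missing Entertainment amount sits right after a Food row
toy = pd.DataFrame({
    'category': ['Food', 'Entertainment'],
    'amount': [70.79, None],
})

# a plain forward fill copies the Food amount into the Entertainment row
filled = toy['amount'].ffill()
print(filled.tolist())  # [70.79, 70.79]
```

The fill "succeeds" mechanically, but the Entertainment transaction now carries a Food transaction's amount, which is exactly the situation in our data.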

Let’s remove the rows with a null value in the amount column.

# drop nulls
df = df.dropna()
 
# verify
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 95 entries, 0 to 99
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   date      95 non-null     datetime64[ns]
 1   amount    95 non-null     float64
 2   category  95 non-null     object
 3   merchant  95 non-null     object
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 3.7+ KB

Let’s Answer The Questions

# What is the total transaction amount for each category (e.g., Food, Entertainment, Shopping)?
group = df.groupby('category')
 
group.sum(numeric_only=True)
                    amount
category
Entertainment  1576.840628
Food           1063.905275
Shopping       1060.962022
Travel         1822.804450
# Which merchant has the highest total transaction amount?
group = df.groupby('merchant')
 
group.sum(numeric_only=True).sort_values(by='amount', ascending=False).head(1)
                amount
merchant
McDonalds  1453.475029
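If you only need the merchant's name rather than the whole row, an equivalent one-liner uses `idxmax` on the grouped sums. A sketch on toy data (hypothetical values, not the tutorial's dataset):

```python
import pandas as pd

# toy transactions (hypothetical data)
toy = pd.DataFrame({
    'merchant': ['Amazon', 'Netflix', 'Amazon'],
    'amount': [40.0, 30.0, 25.0],
})

# idxmax returns the index label (here, the merchant) of the largest sum
top_merchant = toy.groupby('merchant')['amount'].sum().idxmax()
print(top_merchant)  # Amazon
```

On the tutorial's DataFrame the same pattern would return 'McDonalds'.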
# What is the average transaction amount per day for each category?
group = df.groupby(['date','category'])
 
# show top 7 rows
group.mean(numeric_only=True).head(7)
                             amount
date       category
2024-01-01 Entertainment  29.657538
           Shopping       73.336355
           Travel         46.258544
2024-01-02 Entertainment  19.257702
           Food           35.962879
           Shopping       89.060697
           Travel         89.037412
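The two-level groupby result above can be hard to scan. One way to reshape it into a wide table, with one column per category, is `unstack`. A sketch on toy data (hypothetical values, not the tutorial's dataset):

```python
import pandas as pd

# toy transactions (hypothetical data)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-01', '2024-01-02']),
    'category': ['Food', 'Food', 'Travel'],
    'amount': [10.0, 30.0, 50.0],
})

# same groupby as above, then unstack categories into columns
wide = toy.groupby(['date', 'category'])['amount'].mean().unstack()
print(wide)
```

Each cell is the average amount for that date/category pair; combinations with no transactions come out as NaN.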

Can You Solve the Bonus Question?

Identify the top 3 categories with the highest average transaction amount on weekends (Saturday and Sunday).
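One possible approach: flag weekend rows with `dt.dayofweek` (Saturday is 5, Sunday is 6), then average per category and take the three largest. A sketch on toy data (hypothetical values, not the tutorial's dataset, though the ordering here happens to match the answer given above):

```python
import pandas as pd

# toy transactions: Jan 6-7, 2024 fall on a weekend, Jan 8 is a Monday
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-06', '2024-01-06', '2024-01-07', '2024-01-08']),
    'category': ['Travel', 'Food', 'Entertainment', 'Shopping'],
    'amount': [90.0, 20.0, 60.0, 99.0],
})

# keep only Saturday (5) and Sunday (6)
weekend = toy[toy['date'].dt.dayofweek >= 5]

# average weekend amount per category, top 3
top3 = weekend.groupby('category')['amount'].mean().nlargest(3)
print(top3)
```

Note that the Monday Shopping row, despite having the largest single amount, is excluded by the weekend filter before the averages are taken.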