
Description:

Imagine you’re a Data Analyst at a bank, and your manager just dropped a bombshell on you. It seems that the bank’s financial reports are a mess, and they need your expertise to sort it out. The problem is that the data is scattered across multiple files, with different formats and inconsistencies. Your task is to use your Pandas skills to wrangle the data, identify the issues, and provide insights to help the bank get back on track.

Tasks:

  • Data Unification: Combine the ‘transactions.csv’ and ‘customers.xlsx’ files into a single dataframe, handling missing values and data type inconsistencies. Ensure that the ‘customer_id’ column is of numeric type (int or float) and the ‘date’ column is in datetime format.

  • Transaction Categorization: Create a new column ‘transaction_category’ and categorize each transaction as either ‘deposit’, ‘withdrawal’, or ‘transfer’ based on the ‘transaction_type’ column. Calculate the total value and average amount for each transaction category.

  • Customer Insights: Group by ‘gender’ and ‘location’, and calculate the total account balance (sum of the ‘amount’ column) and transaction frequency (count of transactions) for each group. Output the results as a new dataframe.

# import libraries
import pandas as pd
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

The dataset covers 1,000 bank customers and is split across two files: transactions.csv holds 10,000 transactions (deposits, withdrawals, and transfers), and customers.xlsx holds demographic information about each customer, such as age, gender, and location.

Summary of Data:

Transactions:

  • customer_id (int): unique customer identifier
  • transaction_type (str): type of transaction (deposit, withdrawal, transfer)
  • amount (float): transaction amount
  • date (datetime): transaction date

Customers:

  • customer_id (int): unique customer identifier
  • age (int): customer age
  • gender (str): customer gender
  • location (str): customer location (urban/rural)
# set the seed
np.random.seed(0)
 
transactions = pd.DataFrame({
    'customer_id': np.random.randint(1, 1000, 10000),
    'transaction_type': np.random.choice(['deposit', 'withdrawal', 'transfer'], 10000),
    'amount': np.random.uniform(100, 1000, 10000),
    'date': pd.date_range('2024-01-01', periods=10000)
})
 
# select 100 unique customer IDs
missing_ids = np.random.choice(transactions['customer_id'].unique(), 100, replace=False)
 
# for each of the selected customer IDs, set a random 'amount' value to null
for customer_id in missing_ids:
    idx = np.random.choice(transactions[transactions['customer_id'] == customer_id].index)
    transactions.loc[idx, 'amount'] = np.nan
 
customers = pd.DataFrame({
    'customer_id': np.arange(1, 1001),
    'age': np.random.randint(25, 65, 1000),
    'gender': np.random.choice(['male', 'female'], 1000),
    'location': np.random.choice(['urban', 'rural'], 1000)
})
 
# save data to disk
transactions.to_csv('transactions.csv', index=False)
customers.to_excel('customers.xlsx', index=False)

Data Unification:

  • Combine the ‘transactions.csv’ and ‘customers.xlsx’ files into a single dataframe, handling missing values and data type inconsistencies.
  • Ensure the ‘customer_id’ column is of numeric type (int or float) and the ‘date’ column is in datetime format.

Let us start by reading in the CSV and Excel files. We will then look at their data types to ensure they are correct.

transactions_df = pd.read_csv('transactions.csv')
transactions_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   transaction_type  10000 non-null  object 
 2   amount            9900 non-null   float64
 3   date              10000 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 312.6+ KB

From the output above, we can see that the date column needs to be converted to a datetime type. Currently it is stored as a plain object (string) dtype.

The easiest way is to use the pd.to_datetime function.

transactions_df['date'] = pd.to_datetime(transactions_df['date'])
transactions_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   customer_id       10000 non-null  int64         
 1   transaction_type  10000 non-null  object        
 2   amount            9900 non-null   float64       
 3   date              10000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 312.6+ KB
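
As a side note, this conversion can also happen at load time: pd.read_csv accepts a parse_dates parameter that tells Pandas which columns to interpret as dates. A minimal sketch of that alternative, reading the same file we saved above:

# parse the 'date' column while reading, which skips the separate pd.to_datetime step
transactions_df = pd.read_csv('transactions.csv', parse_dates=['date'])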

Let us now import the Excel file and run the same data type check we performed on the CSV import.

As the output below shows, the data types are already correct after the Excel import, so no conversion is needed. Yay!

customers_df = pd.read_excel('customers.xlsx')
customers_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   customer_id  1000 non-null   int64 
 1   age          1000 non-null   int64 
 2   gender       1000 non-null   object
 3   location     1000 non-null   object
dtypes: int64(2), object(2)
memory usage: 31.4+ KB
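
Before merging, it is also worth confirming that customer_id can serve as the join key and satisfies the task's numeric-type requirement. A quick, optional sanity check on the dataframes we just loaded:

# customer_id should uniquely identify each row on the customers side
print(customers_df['customer_id'].is_unique)

# customer_id should be numeric (int or float) in both dataframes
print(pd.api.types.is_numeric_dtype(transactions_df['customer_id']))
print(pd.api.types.is_numeric_dtype(customers_df['customer_id']))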

Now we can merge the dataframes. The common key between them is customer_id, and we will use the pd.merge() function to accomplish this portion of the task.

df = pd.merge(transactions_df, customers_df, on='customer_id')
df.head()
   customer_id transaction_type      amount       date  age  gender location
0          685       withdrawal  189.852038 2024-01-01   37  female    rural
1          560       withdrawal  626.720139 2024-01-02   36    male    urban
2          630       withdrawal  657.841588 2024-01-03   28    male    rural
3          193          deposit  327.129016 2024-01-04   48    male    rural
4          836         transfer  798.239145 2024-01-05   41    male    urban
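
That covers the merge, but the task also mentions handling missing values: the amount column still contains the nulls we injected when generating the data. We will leave them in place here, since the aggregations below (sum, mean, count) skip NaN values by default, but a quick inspection shows where they are, and dropna() or fillna() would be options if we wanted to remove or impute them:

# how many missing values remain in each column after the merge?
print(df.isna().sum())

# possible (not applied here) alternatives:
# df = df.dropna(subset=['amount'])                           # drop rows with missing amounts
# df['amount'] = df['amount'].fillna(df['amount'].median())   # or impute with the median amount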

Transaction Categorization:

Using the ‘transaction_type’ column, calculate the total value and average amount for each transaction category.

The transaction_type column already contains the category strings ‘deposit’, ‘withdrawal’, and ‘transfer’, so it can serve directly as the transaction category. Since we are looking for total values in addition to the averages of these three categories, let us see if a pivot table would work for this task.

Notes on the df.pivot_table method:

  • By default, the pivot table calculates the mean of the values
  • To change the aggregate function, add the aggfunc parameter, e.g. aggfunc='sum'
  • To get a different perspective, swap index for columns (see the sketch after the output table below)
df.pivot_table(index=['transaction_type'], values=['amount'], aggfunc=['sum','mean'])
                           sum        mean
                        amount      amount
transaction_type
deposit           1.794246e+06  547.694277
transfer          1.843336e+06  548.611931
withdrawal        1.785630e+06  547.067914
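
To illustrate the last note in the list above, the same aggregation can be flipped so the transaction categories appear as columns instead of the index; the groupby equivalent is also shown for comparison:

# same totals and averages, with the categories spread across the columns
df.pivot_table(columns=['transaction_type'], values=['amount'], aggfunc=['sum', 'mean'])

# an equivalent result using groupby instead of a pivot table
df.groupby('transaction_type')['amount'].agg(['sum', 'mean'])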

Customer Insights:

Group by ‘gender’ and ‘location’, and calculate the total account balance (sum of the ‘amount’ column) and transaction frequency (count of transactions) for each group. Output the results as a new dataframe.

The agg() method gives us a really nice API to work with. Not only can we use multiple aggregate functions, but Pandas also lets us choose and name the output columns.

# create group object
group = df.groupby(['gender','location'])
 
# get the total and average of the group
group.agg(
     amount_sum=pd.NamedAgg(column="amount", aggfunc="sum"),
     amount_count=pd.NamedAgg(column="amount", aggfunc="count")
)
                   amount_sum  amount_count
gender location
female rural     1.447087e+06          2620
       urban     1.280509e+06          2298
male   rural     1.312961e+06          2419
       urban     1.382655e+06          2563
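
Since the task asks for the results as a new dataframe, we can assign the output and call reset_index() so the group keys become ordinary columns again. The same named aggregation can also be written with the shorter tuple syntax; a sketch, where the variable name insights_df is just an illustrative choice:

# same aggregation using the tuple shorthand, stored as a new dataframe
insights_df = df.groupby(['gender', 'location']).agg(
    amount_sum=('amount', 'sum'),       # total transaction amount per group
    amount_count=('amount', 'count')    # number of (non-null) transactions per group
).reset_index()                         # turn the group keys back into ordinary columns

print(insights_df)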

Summary

In this tutorial, you have completed three tasks and learned valuable skills along the way. You unified data by combining multiple files into a single DataFrame, categorized transactions to calculate totals and averages, and gained customer insights by grouping data by gender and location. Through these tasks, you have learned how to work with Pandas, including saving and loading data, checking and converting data types, merging DataFrames, using pivot tables, and grouping data to calculate aggregate values.

Key Takeaways

Here are the important skills you learned in this tutorial:

  • How to create DataFrames from dictionaries and save them to CSV and Excel files
  • How to read CSV and Excel files into DataFrames
  • How to check the data types of a DataFrame’s columns
  • How to convert data types (e.g., converting a column to datetime format)
  • How to merge DataFrames based on a common column (e.g., customer_id)
  • How to use pivot tables to calculate aggregate values (e.g., sum and mean) for transaction categories
  • How to group data by multiple columns (e.g., gender and location) and calculate aggregate values (e.g., sum and count) for each group