Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- đ„ Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
Imagine youâre a Data Analyst at a bank, and your manager just dropped a bombshell on you. It seems that the bankâs financial reports are a mess, and they need your expertise to sort it out. The problem is that the data is scattered across multiple files, with different formats and inconsistencies. Your task is to use your Pandas skills to wrangle the data, identify the issues, and provide insights to help the bank get back on track.
Tasks:
-
Data Unification: Combine the âtransactions.csvâ and âcustomers.xlsxâ files into a single dataframe, handling missing values and data type inconsistencies. Ensure that the âcustomer_idâ column is of numeric type (int or float) and the âdateâ column is in datetime format.
-
Transaction Categorization: Create a new column âtransaction_categoryâ and categorize each transaction as either âdepositâ, âwithdrawalâ, or âtransferâ based on the âtransaction_typeâ column. Calculate the total value and average amount for each transaction category.
-
Customer Insights: Group by âgenderâ and âlocationâ, and calculate the average account balance (sum of âamountâ column) and transaction frequency (count of transactions) for each group. Output the results as a new dataframe.
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4
The Data
The dataset contains financial transaction records for 1,000 bank customers, including 10,000 transactions (deposits, withdrawals, and transfers) across two files (transactions.csv and customers.xlsx), with demographic information about each customer, such as age, gender, and location.
Summary of Data:
Transactions:
- customer_id (int): unique customer identifier
- transaction_type (str): type of transaction (deposit, withdrawal, transfer)
- amount (float): transaction amount
- date (datetime): transaction date
Customers:
- customer_id (int): unique customer identifier
- age (int): customer age
- gender (str): customer gender
- location (str): customer location (urban/rural)
Data Unification:
- Combine the âtransactions.csvâ and âcustomers.xlsxâ files into a single dataframe, handling missing values and data type inconsistencies.
- Ensure the âcustomer_idâ column is of numeric type (int or float) and the âdateâ column is in datetime format.
Let us start by reading in the CSV and Excel file. We will then look at their data types to ensure they are correct.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 10000 non-null int64
1 transaction_type 10000 non-null object
2 amount 9900 non-null float64
3 date 10000 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 312.6+ KB
From the table above, we can see we need to update the date
column so it is represented as a date object. Currently it is represented as a string object.
The easiest way is to use the pd.to_datetime
method.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 10000 non-null int64
1 transaction_type 10000 non-null object
2 amount 9900 non-null float64
3 date 10000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 312.6+ KB
Let us now import the Excel file and perform the same process as we did with the CSV import.
I am not seeing any issues with the datatypes after importing from Excel. Yay!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 1000 non-null int64
1 age 1000 non-null int64
2 gender 1000 non-null object
3 location 1000 non-null object
dtypes: int64(2), object(2)
memory usage: 31.4+ KB
Now we can try to merge the dataframes. The common key between them is customer_id
and we will use the .merge()
method to accomplish this portion of the task.
customer_id | transaction_type | amount | date | age | gender | location | |
---|---|---|---|---|---|---|---|
0 | 685 | withdrawal | 189.852038 | 2024-01-01 | 37 | female | rural |
1 | 560 | withdrawal | 626.720139 | 2024-01-02 | 36 | male | urban |
2 | 630 | withdrawal | 657.841588 | 2024-01-03 | 28 | male | rural |
3 | 193 | deposit | 327.129016 | 2024-01-04 | 48 | male | rural |
4 | 836 | transfer | 798.239145 | 2024-01-05 | 41 | male | urban |
Transaction Categorization:
Using the âtransaction_typeâ column, calculate the total value and average amount for each transaction category.
The transaction_type column contains the strings âdepositâ, âwithdrawalâ, and âtransferâ. Since we are looking for total values in addition to the averages of these three categories, let us see if a pivot table would work for this task.
Notes on the df.pivot_table
:
- By default the pivot table will calculate the mean of
values
- To change the aggregate function you need to add the parameter, i.e. aggfunc=âsumâ
- To get a different perspective: replace
index
withcolumns
sum | mean | |
---|---|---|
amount | amount | |
transaction_type | ||
deposit | 1.794246e+06 | 547.694277 |
transfer | 1.843336e+06 | 548.611931 |
withdrawal | 1.785630e+06 | 547.067914 |
Customer Insights:
Group by âgenderâ and âlocationâ, and calculate the average account balance (sum of âamountâ column) and transaction frequency (count of transactions) for each group. Output the results as a new dataframe.
The agg()
method gives us a really nice API to work with. We not only can use multiple aggregate functions, but Pandas allows us to choose and name the column.
amount_sum | amount_count | ||
---|---|---|---|
gender | location | ||
female | rural | 1.447087e+06 | 2620 |
urban | 1.280509e+06 | 2298 | |
male | rural | 1.312961e+06 | 2419 |
urban | 1.382655e+06 | 2563 |
Summary
In this tutorial, you have completed three tasks and learned valuable skills along the way. You unified data by combining multiple files into a single DataFrame, categorized transactions to calculate totals and averages, and gained customer insights by grouping data by gender and location. Through these tasks, you have learned how to work with Pandas, including saving and loading data, checking and converting data types, merging DataFrames, using pivot tables, and grouping data to calculate aggregate values.
Key Takeaways
Here are the important skills you learned in this tutorial:
- How to create DataFrames from dictionaries and save them to CSV and Excel files
- How to read CSV and Excel files into DataFrames
- How to check the data types of a DataFrameâs columns
- How to convert data types (e.g., converting a column to datetime format)
- How to merge DataFrames based on a common column (e.g., customer_id)
- How to use pivot tables to calculate aggregate values (e.g., sum and mean) for transaction categories
- How to group data by multiple columns (e.g., gender and location) and calculate aggregate values (e.g., sum and mean) for each group.