Note

Description:

You’re a financial analyst at a leading digital lending platform. Your company provides personal loans to customers based on their creditworthiness. The business development team wants to analyze customer loan repayment behavior to inform their marketing strategies. They’ve asked you to prepare a dataset that combines customer information, loan details, and repayment history.

Tasks:

  • Calculate the total repayment amount made by the customer and merge this information with the loan details dataset. Ensure the resulting dataset includes all columns from the loan details dataset.
  • Group customers by age brackets (25-34, 35-44, 45-54, 55+) and calculate the average loan amount and average credit score for each age group.
  • Use the pivot_table method to reshape the repayment history dataset, showing:
    • Rows: Months of repayment (January to December)
    • Columns: Years of repayment (e.g., 2022, 2023)
    • Values: Total repayment amount made in each month of each year
# import libraries
import pandas as pd
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

The dataset represents a digital lending platform’s customer loan data, comprising three tables: customer information (demographics), loan details (loan amounts and issuance dates), and repayment history (repayment amounts and dates). The dataset includes 1,000 customers, 1,000 loans, and 5,000 repayment records, providing a comprehensive view of customer loan behavior.

customer_info.csv: Contains customer demographic information.

  • customer_id (unique identifier)
  • age
  • income
  • credit_score

loan_details.csv: Contains loan information.

  • loan_id (unique identifier)
  • customer_id (foreign key referencing customer_info)
  • loan_amount
  • loan_issued_date

repayment_history.csv: Contains repayment information.

  • repayment_id (unique identifier)
  • loan_id (foreign key referencing loan_details)
  • repayment_date
  • repayment_amount
# set the seed
np.random.seed(0)
 
# customer information
customer_info = pd.DataFrame({
    'customer_id': np.arange(1000),
    'age': np.random.randint(25, 60, 1000),
    'income': np.random.randint(50000, 150000, 1000),
    'credit_score': np.random.randint(600, 850, 1000)
})
 
# loan details
loan_details = pd.DataFrame({
    'loan_id': np.arange(1000),
    'customer_id': np.random.choice(customer_info['customer_id'], 1000),
    'loan_amount': np.random.randint(1000, 50000, 1000),
    'loan_issued_date': pd.date_range('2022-01-01', periods=1000, freq='D')
})
 
# repayment history
repayment_history = pd.DataFrame({
    'repayment_id': np.arange(5000),
    'loan_id': np.random.choice(loan_details['loan_id'], 5000),
    'repayment_date': pd.date_range('2022-01-01', periods=5000, freq='D'),
    'repayment_amount': np.random.randint(50, 500, 5000)
})
 
# save the datasets to CSV files
customer_info.to_csv('customer_info.csv', index=False)
loan_details.to_csv('loan_details.csv', index=False)
repayment_history.to_csv('repayment_history.csv', index=False)

Let us read into memory the csv files and take a look at the data types of each dataset.

# create dataframes
customer_info_df = pd.read_csv('customer_info.csv')
loan_details_df = pd.read_csv('loan_details.csv')
 
# you can use the parse_dates parameter to specify that column "repayment_date" should be parsed as a date object.
repayment_history_df = pd.read_csv('repayment_history.csv', parse_dates=['repayment_date'])
customer_info_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   customer_id   1000 non-null   int64
 1   age           1000 non-null   int64
 2   income        1000 non-null   int64
 3   credit_score  1000 non-null   int64
dtypes: int64(4)
memory usage: 31.4 KB

We will need to convert the column “loan_issue_date” to a date object.

loan_details_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   loan_id           1000 non-null   int64 
 1   customer_id       1000 non-null   int64 
 2   loan_amount       1000 non-null   int64 
 3   loan_issued_date  1000 non-null   object
dtypes: int64(3), object(1)
memory usage: 31.4+ KB
repayment_history_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   repayment_id      5000 non-null   int64         
 1   loan_id           5000 non-null   int64         
 2   repayment_date    5000 non-null   datetime64[ns]
 3   repayment_amount  5000 non-null   int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 156.4 KB

Task #1

Calculate the total repayment amount made by the customer and merge this information with the loan details dataset. Ensure the resulting dataset includes all columns from the loan details dataset.

For this task, Let us merge all three datasets and then calculate the total repayment amount per customer. This is the dataframe we will use for the rest of the tutorial.

df = customer_info_df.merge(loan_details_df, on='customer_id').merge(repayment_history_df, on='loan_id')
df.head()
customer_idageincomecredit_scoreloan_idloan_amountloan_issued_daterepayment_idrepayment_daterepayment_amount
00251414457251111422022-01-0217102026-09-07303
1025141445725840257362024-04-205272023-06-12492
2025141445725840257362024-04-205892023-08-13281
3025141445725840257362024-04-2012812025-07-0599
4025141445725840257362024-04-2028312029-10-02117

We can see from the data below that our new dataframe has 5,000 rows and we do not have any null values. Remember that by defaul the .merge method will do an inner merge. This means the columns we merged on (customer_id and loan_id) need to have matching values on both dataframes being merged. If the values do not match, that row will not make it to the final result.

In this specific case, we did loose some of the data, for example:

  • The customer_info_df has a customer_id=1
  • Customer_id=1 is not present in the loan_details_df
  • This means the final dataframe (df), does not contain customer_id=1
# here is customer_id=1
mask = customer_info_df.loc[:,'customer_id'] == 1
customer_info_df[mask]
customer_idageincomecredit_score
112892694692
# customer_id=1 is not found in loan_details_df
mask = loan_details_df.loc[:,'customer_id'] == 1
loan_details_df[mask]
loan_idcustomer_idloan_amountloan_issued_date
# customer_id=1 is not found in df
mask = df.loc[:,'customer_id'] == 1
df[mask]
customer_idageincomecredit_scoreloan_idloan_amountloan_issued_daterepayment_idrepayment_daterepayment_amount

To calculate the total repayment amount per customer, we can make use of the .groupby method.

# create group object
group = df.groupby('customer_id')
 
# get the total repayment amount for the group
group['repayment_amount'].sum()
customer_id
0      1663
2      4450
3       725
4      2104
5      2094
       ... 
994     675
995    2513
996    1156
997    1155
998     442
Name: repayment_amount, Length: 629, dtype: int64

If you wanted to verify the math is correct


We can find select one customer_id and manually add the values.

  • 303 + 492 + 281 + 99 + 117 + 371 = 1,663
mask = df.loc[:,'customer_id'] == 0
df.loc[mask,'repayment_amount']
0    303
1    492
2    281
3     99
4    117
5    371
Name: repayment_amount, dtype: int64

Task #2

Group customers by age brackets (25-34, 35-44, 45-54, 55+) and calculate the average loan amount and average credit score for each age group.

  • bins defines the edges of the bins.
  • labels assigns names to each bin.
  • float(‘inf’) represents infinity, making the last bin open-ended (55+).
  • include_lowest=True ensures that the lowest value (55) is included in the last bin.
# create bins for the (25-34), (35-44), (45-54) and (55+) categories
bins = [25, 34, 44, 55, float('inf')]  
 
# labels for the three categories
labels = ['25-34', '35-44', '45-54', '55+'] 
 
# bin it up!
df['age_brackets'] = pd.cut(df['age'], bins=bins, labels=labels, include_lowest=True)
 
# here we get a frequency count of the categories
df['age_brackets'].value_counts()
age_brackets
25-34    1552
35-44    1535
45-54    1411
55+       502
Name: count, dtype: int64

The observed parameter in df.groupby affects grouping behavior when using Categorical columns, and setting it to True (future default) treats categories as part of the data’s index, whereas setting it to False (current default) treats them as “unobserved” variables.

I went in more detail in a previous tutorial: Education-Analytics-Challenge

# calculate the average loan amount and average credit score for each age group
df.groupby('age_brackets', observed=True).agg(
     avg_loan_amount=pd.NamedAgg(column="loan_amount", aggfunc="mean"),
     avg_credit_score=pd.NamedAgg(column="credit_score", aggfunc="mean")
)
avg_loan_amountavg_credit_score
age_brackets
25-3425107.916237715.453608
35-4425182.748534725.331596
45-5425559.566265725.663359
55+28249.966135712.916335

Task #3

Use the pivot_table method to reshape the repayment history dataset, showing:

  • Rows: Months of repayment (January to December)
  • Columns: Years of repayment (e.g., 2022, 2023)
  • Values: Total repayment amount made in each month of each year

In order to sort the months correctly, we need to use a map as shown below.

month_order = {'January': 1, 'February': 2, 'March': 3, 'April': 4,
               'May': 5, 'June': 6, 'July': 7, 'August': 8,
               'September': 9, 'October': 10, 'November': 11, 'December': 12}
 
repayment_pivot = repayment_history_df.pivot_table(values='repayment_amount', 
                                 index=repayment_history_df['repayment_date'].dt.strftime('%B'), 
                                 columns=repayment_history_df['repayment_date'].dt.year, aggfunc='sum')
 
repayment_pivot.sort_index(key=lambda x: x.map(month_order))
repayment_date20222023202420252026202720282029203020312032203320342035
repayment_date
January7876.08834.09711.08979.09236.08772.08576.08084.09712.08340.08167.07108.07630.08604.0
February7917.08264.08113.08181.07965.06685.07884.08308.08039.07589.07676.07565.08783.07098.0
March8090.08245.09570.09039.07522.08419.08936.07896.07604.07817.07499.08310.07289.08927.0
April7950.09636.08574.08980.09272.08009.08171.08437.07657.08692.08601.08754.08644.08464.0
May9821.08433.09034.08423.08261.08928.07685.09209.08665.08259.08926.08598.08573.09337.0
June7999.09103.07305.07271.09419.08276.07641.09278.08966.08488.08390.08898.08626.08272.0
July7786.09393.09565.08010.08378.09542.08496.08251.08633.08975.010388.07370.08617.08241.0
August7587.08239.08548.08866.08559.08532.08008.09171.08203.08873.07762.09981.07278.09633.0
September8666.08986.07753.08168.08277.08769.09079.08666.07734.07779.07864.09420.06240.02101.0
October9202.08138.07779.08989.07725.07485.09331.08433.09347.08671.09127.09622.09030.0NaN
November9295.08768.08246.08136.07776.08842.08161.08071.08535.08646.08221.07041.07790.0NaN
December8295.08646.07134.07425.08097.08750.09667.08965.08523.07022.08990.08686.07727.0NaN

Summary:

The Pandas tutorial provided a comprehensive guide to analyzing customer loan behavior using three datasets: customer information, loan details, and repayment history. The tutorial covered merging datasets, grouping data, calculating aggregates, and reshaping data using pivot tables.

Key Takeaways:

  • Merging datasets: Combine customer information, loan details, and repayment history datasets using the merge method.
  • Data grouping: Group customers by age brackets using pd.cut and calculate average loan amounts and credit scores using groupby and agg.
  • Data aggregation: Calculate total repayment amounts per customer using groupby and sum.
  • Pivot tables: Reshape repayment history data using pivot_table to show monthly repayment amounts by year.
  • Data sorting: Sort months correctly using a custom sorting map.
  • Data inspection: Use info and value_counts to understand data distribution and quality.