Note

Description

You are a data analyst at a fintech company that provides loan services to customers. Your manager has asked you to analyze the loan repayment behavior of borrowers to identify trends and insights that can inform business decisions. You have been provided with a large dataset containing loan repayment information, including loan amounts, repayment dates, and borrower information. Your task is to use Pandas to analyze this data, identify trends, and provide insights that can help the company improve its loan services.

Tasks

  • Loan Repayment Analysis: Calculate the average loan repayment amount by month and identify the top 5 months with the highest repayment amounts.
  • Borrower Insights: Group borrowers by age and calculate the average loan amount and repayment amount for each age group.
  • Late Repayments: Identify loans that were repaid late (after the due date) and calculate the total amount of late repayments.
# import libraries
import pandas as pd
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

The dataset contains information about loan repayments made by borrowers to a fintech company. The dataset includes 1200 loans, each with information about the loan amount, repayment date, borrower age, due date, and repayment amount.

Column Definitions

  • Loan Amount: The amount borrowed by the customer (float)
  • Repayment Date: The date on which the loan was repaid (datetime)
  • Borrower Age: The age of the borrower at the time of loan repayment (int)
  • Due Date: The date by which the loan was due to be repaid (datetime)
  • Repayment Amount: The amount repaid by the borrower (float)

Some additional notes on the dataset:

  • The dataset includes some missing values in the Repayment Amount column, representing loans where the repayment amount is not known.
  • Some loans have been marked as late repayments, where the Repayment Date is after the Due Date.
# set the seed
np.random.seed(0)
 
# generate the loan data
loans = pd.DataFrame({
    'Loan Amount': np.random.uniform(1000, 10000, size=1000),
    'Repayment Date': pd.date_range('2022-01-01', periods=1000),
    'Borrower Age': np.random.randint(25, 60, size=1000),
    'Due Date': pd.date_range('2022-01-01', periods=1000),
    'Repayment Amount': np.random.uniform(1000, 10000, size=1000)
})
 
# generate repayment amounts (never larger than loan amount)
loans['Repayment Amount'] = loans.apply(lambda x: np.random.choice([np.random.uniform(0, x['Loan Amount']), x['Loan Amount']]), axis=1)
 
# add some late repayments
late_repayments = loans.sample(frac=0.2)
late_repayments['Repayment Date'] += pd.Timedelta(days=30)
loans = pd.concat([loans, late_repayments])
 
# add some missing values
loans.loc[np.random.choice(loans.index, size=100), 'Repayment Amount'] = np.nan
 
loans.head()
Loan AmountRepayment DateBorrower AgeDue DateRepayment Amount
05939.3215352022-01-01422022-01-012940.536278
17436.7042972022-01-02542022-01-027436.704297
26424.8703852022-01-03312022-01-036424.870385
35903.9486472022-01-04532022-01-042272.657962
44812.8931942022-01-05352022-01-053978.737138

Let us start with taking a peek at the datatypes.

loans.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1200 entries, 0 to 941
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Loan Amount       1200 non-null   float64       
 1   Repayment Date    1200 non-null   datetime64[ns]
 2   Borrower Age      1200 non-null   int32         
 3   Due Date          1200 non-null   datetime64[ns]
 4   Repayment Amount  1074 non-null   float64       
dtypes: datetime64[ns](2), float64(2), int32(1)
memory usage: 83.9 KB

Missing Values

If I received this dataset, I would have asked if I could take a look at the source of the data. There are lots of times where the person/script that compiled the data for you is not 100% perfect. Data is usually messy and it may have edge cases that may not have been taken into account when preparing the data.

In this case, since we have no other reference point, we will simply delete rows with the missing values.

loans = loans.dropna()
loans.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1074 entries, 0 to 941
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Loan Amount       1074 non-null   float64       
 1   Repayment Date    1074 non-null   datetime64[ns]
 2   Borrower Age      1074 non-null   int32         
 3   Due Date          1074 non-null   datetime64[ns]
 4   Repayment Amount  1074 non-null   float64       
dtypes: datetime64[ns](2), float64(2), int32(1)
memory usage: 46.1 KB

Loan Repayment Analysis:

Calculate the average loan repayment amount by month and identify the top 5 months with the highest repayment amounts.

Challenge: Can you redo this section, but group by month and year which may have been the original intention of the task?

# create group object
group = loans.groupby(loans['Repayment Date'].dt.month)
 
# top 5 repayment amounts
group.mean(numeric_only=True).sort_values(by='Repayment Amount', ascending=False)['Repayment Amount'].head(5)
Repayment Date
1     4584.052671
12    4165.392593
9     4153.814819
4     4096.045033
3     4082.376501
Name: Repayment Amount, dtype: float64

Notice I was able to pull the month out of the date column.

loans['Repayment Date'].dt.month
0       1
1       1
2       1
3       1
4       1
       ..
301    11
515     6
565     8
995    10
941     8
Name: Repayment Date, Length: 1074, dtype: int32

Borrower Insights:

Group borrowers by age and calculate the average loan amount and repayment amount for each age group.

From the figure below, it seems like not a lot of people are paying their loans off!

# create group object
group = loans.groupby('Borrower Age')
 
# top 5 repayment amounts
group.mean(numeric_only=True).plot(figsize=(10,5));

Late Repayments:

Identify loans that were repaid late (after the due date) and calculate the total amount of late repayments.

# flag the loans that were repaid late
loans['late'] = loans['Repayment Date'] > loans['Due Date']
loans[loans['late'] == True].head()
Loan AmountRepayment DateBorrower AgeDue DateRepayment Amountlate
3872434.7301712023-02-22472023-01-232434.730171True
5101941.6509992023-06-25542023-05-26894.474464True
462893.4430502022-03-18282022-02-162893.443050True
209807.5650802022-02-20502022-01-212333.814874True
7207759.1948402024-01-21552023-12-227759.194840True

How many were repaid late?

loans[loans['late'] == True].count()
Loan Amount         171
Repayment Date      171
Borrower Age        171
Due Date            171
Repayment Amount    171
late                171
dtype: int64

Calculate the total amount of late repayments

We get a total of $650,231.87.

loans[loans['late'] == True].sum(numeric_only=True)['Repayment Amount']
650231.8707490516

Key Takeaways

  • How to handle missing values (dropna() function)
  • How to group data by a specific column (groupby() function)
  • How to calculate average values by group (mean() function)
  • How to sort values in descending order (sort_values() function)
  • How to extract specific columns or rows from a dataset

Tasks and Outputs

Loan Repayment Analysis:

  • Calculated average loan repayment amount by month
  • Identified top 5 months with highest repayment amounts: January, December, September, April, and March

Borrower Insights:

  • Grouped borrowers by age and calculated average loan and repayment amounts
  • Plotted a graph to visualize the data

Late Repayments:

  • Identified loans repaid late (after due date)
  • Calculated total amount of late repayments: $650,231.87