Note

  • This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
  • Pandas Ninja: Take your skills to the next level with comprehensive Jupyter Notebook tutorials covering dates, group by, plotting, pivot tables, and more. Includes specialized tutorials for Excel and SQL developers, helping you master data analysis with Pandas.

Description

You are a Fintech professional at a leading investment firm, and your team is responsible for generating quarterly reports for a client who invests in real estate. The client has a large portfolio of properties across the United States and needs to identify areas with the highest return on investment. Your task is to analyze a dataset of property sales and identify trends, patterns, and insights that will inform investment decisions.

Tasks

  • Data Wrangling: Using the ‘Date’ column, extract the quarter and year of the sale. Create a new column ‘Sale Quarter’ in the format ‘Q1 2020’.
  • Data Analysis: Calculate the total sales, average sale price, and return on investment (ROI) for each region and property type. ROI is calculated as (Sale Price - Purchase Price) / Purchase Price.
  • Data Visualization: Create a dashboard to visualize the top-performing regions and property types, and identify areas with the highest growth potential.
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # only needed for advanced plotting
import matplotlib as mpl # only needed to get version
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
print('matplotlib version: ' + mpl.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4
matplotlib version: 3.8.4

The Data

The dataset contains real estate sales data, comprising 10,000 records of property sales across the United States. The data provides insights into the real estate market, allowing for analysis of sales trends, investment opportunities, and regional performance.

Columns:

  • Region: The region where the property is located (Northeast, Southwest, West, Southeast)
  • Property Type: The type of property (Residential, Commercial, Industrial)
  • Sale Price: The sale price of the property
  • Purchase Price: The purchase price of the property
  • Date: The date the property was sold
  • Location: The location of the property (Urban, Rural, Suburban)
# set the seed
np.random.seed(0)
 
data = {
    'Region': np.random.choice(['Northeast', 'Southwest', 'West', 'Southeast'], size=10000),
    'Property Type': np.random.choice(['Residential', 'Commercial', 'Industrial'], size=10000),
    'Sale Price': np.random.randint(100000, 1000000, size=10000),
    'Purchase Price': np.random.randint(80000, 900000, size=10000),
    'Date': pd.date_range('2020-01-01', periods=10000, freq='D'),
    'Location': np.random.choice(['Urban', 'Rural', 'Suburban'], size=10000)
}
 
df = pd.DataFrame(data)
df.head()
RegionProperty TypeSale PricePurchase PriceDateLocation
0NortheastCommercial2185847732652020-01-01Rural
1SoutheastIndustrial1789131310662020-01-02Suburban
2SouthwestResidential7360167369972020-01-03Suburban
3NortheastIndustrial4084116038222020-01-04Suburban
4SoutheastIndustrial5317222807682020-01-05Rural

Let us take a look at the data types. The code to generate the data was pretty basic, so I am not expecting any issues with the data types for this tutorial.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Region          10000 non-null  object        
 1   Property Type   10000 non-null  object        
 2   Sale Price      10000 non-null  int32         
 3   Purchase Price  10000 non-null  int32         
 4   Date            10000 non-null  datetime64[ns]
 5   Location        10000 non-null  object        
dtypes: datetime64[ns](1), int32(2), object(3)
memory usage: 390.8+ KB

Data Wrangling:

Using the ‘Date’ column, extract the quarter and year of the sale. Create a new column ‘Sale Quarter’ in the format ‘Q1 2020’.

We can extract the day, the year, the month, and yes even the quarter from a date object.

.dt.year: extracts the year
.dt.month: extracts the month
.dt.day: extracts the day
.dt.hour: extracts the hour
.dt.minute: extracts the minute
.dt.second: extracts the second
.dt.microsecond: extracts the microsecond
.dt.quarter: extracts the quarter (Q1, Q2, Q3, or Q4)
.dt.dayofweek: extracts the day of the week (Monday=0, ..., Sunday=6)
.dt.dayofyear: extracts the day of the year the year
.dt.weekofyear: extracts the week of the year
.dt.days_in_month: The number of days in the month of the datetime
# let us verify quarter works
df['Date'].dt.quarter.unique()
array([1, 2, 3, 4])

Method #1: Make sure to convert numerical columns into strings

df['Sale Quarter'] = "Q" + df['Date'].dt.quarter.astype('str') + " " + df['Date'].dt.year.astype('str')
 
df.head()
RegionProperty TypeSale PricePurchase PriceDateLocationSale Quarter
0NortheastCommercial2185847732652020-01-01RuralQ1 2020
1SoutheastIndustrial1789131310662020-01-02SuburbanQ1 2020
2SouthwestResidential7360167369972020-01-03SuburbanQ1 2020
3NortheastIndustrial4084116038222020-01-04SuburbanQ1 2020
4SoutheastIndustrial5317222807682020-01-05RuralQ1 2020

Method #2: When using apply, do not use .dt to access the dates.

df['Sale Quarter'] = df['Date'].apply(lambda x: "Q{} {}".format(x.quarter, x.year))
 
df.head()
RegionProperty TypeSale PricePurchase PriceDateLocationSale Quarter
0NortheastCommercial2185847732652020-01-01RuralQ1 2020
1SoutheastIndustrial1789131310662020-01-02SuburbanQ1 2020
2SouthwestResidential7360167369972020-01-03SuburbanQ1 2020
3NortheastIndustrial4084116038222020-01-04SuburbanQ1 2020
4SoutheastIndustrial5317222807682020-01-05RuralQ1 2020

Data Analysis:

Calculate the total sales, average sale price, and return on investment (ROI) for each region and property type.

ROI is calculated as (Sale Price - Purchase Price) / Purchase Price.

We need to first calculate the ROI and add it to our dataframe.

df['ROI'] = (df["Sale Price"] - df["Purchase Price"]) / df["Purchase Price"]
 
df.head()
RegionProperty TypeSale PricePurchase PriceDateLocationSale QuarterROI
0NortheastCommercial2185847732652020-01-01RuralQ1 2020-0.717323
1SoutheastIndustrial1789131310662020-01-02SuburbanQ1 20200.365060
2SouthwestResidential7360167369972020-01-03SuburbanQ1 2020-0.001331
3NortheastIndustrial4084116038222020-01-04SuburbanQ1 2020-0.323624
4SoutheastIndustrial5317222807682020-01-05RuralQ1 20200.893813
# create group object
group = df.groupby(['Region','Property Type'])
 
# total sales, average sale price, and return on investment (ROI) for each region and property type
stats = group.agg(
     total_sales=pd.NamedAgg(column="Sale Price", aggfunc="sum"),
     avg_sale_price=pd.NamedAgg(column="Sale Price", aggfunc="mean"),
     avg_roi=pd.NamedAgg(column="ROI", aggfunc="mean")
)
 
stats
total_salesavg_sale_priceavg_roi
RegionProperty Type
NortheastCommercial478204953543414.7193180.569722
Industrial441306378545496.1409150.613538
Residential445531386544659.3960880.542753
SoutheastCommercial438267095552669.7288780.585154
Industrial484245809547170.4056500.702109
Residential467827932553642.5230770.693283
SouthwestCommercial487230479567867.6911420.720646
Industrial468987824552400.2638400.602655
Residential468436616547879.0830410.626235
WestCommercial426445537551675.9857700.631625
Industrial444369230528381.9619500.493232
Residential418329316526863.1183880.569282

Data Visualization:

Create a dashboard to visualize the top-performing regions and property types, and identify areas with the highest growth potential.

Looking at the dashboard below, here are some key findings:

  • Southeast (industrial and residential) and Southwest commercial have the best ROI
  • Northeast commercial is the most expensive real estate sold but it has a very low ROI

Take note of all the MatPlotLib code I used to customize the plot, see how you can make use of it on your next visualization.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(13, 5))
fig.subplots_adjust(hspace=0.25) ## Create space between plots
 
# Chart 1
stats[['avg_roi']].sort_values(by='avg_roi').plot.barh(figsize=(13, 5))
 
# Chart 2
stats[['total_sales']].sort_values(by='total_sales').head(3).plot.bar(ax=axes[0], legend=False)
 
# Chart 3
stats[['avg_sale_price']].sort_values(by='avg_sale_price').head(3).plot.bar(ax=axes[1], legend=False)
 
## customize viz ##
 
# Move the x-axis to the top of the plot
axes[0].spines['bottom'].set_position(('axes', 0.5))
axes[1].spines['bottom'].set_position(('axes', 0.5))
 
# hide ticks
axes[0].tick_params(axis='x', which='both', length=0)
axes[1].tick_params(axis='x', which='both', length=0)
axes[0].spines['bottom'].set_visible(False)
axes[1].spines['bottom'].set_visible(False)
 
# hide labels
axes[0].set_xlabel('')
axes[1].set_xlabel('')
 
# Rotate the x-axis labels 90 degrees and change text to white
axes[0].tick_params(axis='x', rotation=90, labelcolor='white')
axes[1].tick_params(axis='x', rotation=90, labelcolor='white')
 
# set titles to top two charts
axes[0].set_title('Total Sales')
axes[1].set_title('Average Sale Price');

Summary:

The tutorial demonstrates how to extract insights from a real estate sales dataset using Pandas. It starts by creating a sample dataset and then performs data wrangling, analysis, and visualization to identify top-performing regions and property types.

Key Takeaways:

  • Extracted the quarter and year from the ‘Date’ column and created a new ‘Sale Quarter’ column.
  • Converted numerical columns to strings for proper formatting.
  • Calculated the return on investment (ROI) for each property.
  • Grouped data by region and property type to calculate total sales, average sale price, and average ROI.
  • Created a dashboard with three charts to visualize top-performing regions and property types.
  • Customized the plots using MatPlotLib to enhance visualization.