Note

Description:

You’re a data scientist at a leading travel company, and you’ve been tasked with analyzing customer booking behavior to inform marketing strategies. Your team has collected data on customer bookings, including demographics, travel dates, and accommodation types. However, the data is scattered across multiple sources and needs to be cleaned, merged, and transformed before analysis can begin.

Tasks:

  • Merge: Combine the customer demographics and booking data into a single DataFrame.
  • Feature Engineering: Take the merged data from Task #1 and reshape it to create a customer-centric view.
  • Group and Aggregate: Group the data by region and accommodation type, and calculate key metrics such as total booking value, average booking value, and count of bookings. Then, sort and rank the results to identify top-performing regions and accommodation types.
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

The data is designed to mimic real-world travel industry data, with a focus on customer booking behavior and preferences.

customer_demographics:

  • customer_id (unique identifier)
  • age (integer)
  • region (categorical)

booking_data:

  • booking_id (unique identifier)
  • customer_id (foreign key referencing customer_demographics)
  • travel_date (datetime)
  • accommodation_type (categorical)
  • booking_value (integer)
  • booking_status (categorical)
# set the seed
np.random.seed(0)
 
# customer demographics
customer_demographics = pd.DataFrame({
    'customer_id': np.arange(1, 10001),
    'age': np.random.randint(18, 65, 10000),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 10000)
})
 
# booking data
booking_data = pd.DataFrame({
    'booking_id': np.arange(1, 50001),
    'customer_id': np.random.randint(1, 10000, 50000),
    'travel_date': pd.date_range('2022-01-01', periods=50000),
    'accommodation_type': np.random.choice(['Hotel', 'Hostel', 'Vacation Rental'], 50000)
})
 
# additional columns
booking_data['booking_value'] = np.random.randint(100, 1000, 50000)
booking_data['booking_status'] = np.random.choice(['Confirmed', 'Cancelled'], 50000)

Let us make sure the data types are correct for both of the dataframes.

customer_demographics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   customer_id  10000 non-null  int32 
 1   age          10000 non-null  int32 
 2   region       10000 non-null  object
dtypes: int32(2), object(1)
memory usage: 156.4+ KB

And here is the information for the booking_data.

booking_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   booking_id          50000 non-null  int32         
 1   customer_id         50000 non-null  int32         
 2   travel_date         50000 non-null  datetime64[ns]
 3   accommodation_type  50000 non-null  object        
 4   booking_value       50000 non-null  int32         
 5   booking_status      50000 non-null  object        
dtypes: datetime64[ns](1), int32(3), object(2)
memory usage: 1.7+ MB

Merge:

Combine the customer demographics and booking data into a single DataFrame.

For the merge:

  • The main dataset is booking_data and any additional columns will come from the customer_demographics dataframe. We accomplish this by using a left join.
  • The common column we will merge on is called “customer_id”.
# merge the dataframes
new_df = booking_data.merge(customer_demographics, how='left', on='customer_id')
new_df.head()
booking_idcustomer_idtravel_dateaccommodation_typebooking_valuebooking_statusageregion
0139452022-01-01Hotel964Confirmed44East
1281652022-01-02Hostel603Cancelled55South
2378632022-01-03Vacation Rental395Confirmed62South
3467752022-01-04Hostel566Cancelled30South
4524852022-01-05Hotel776Cancelled24North

Feature Engineering

Take the merged data from Task #1 and reshape it to create a customer-centric view.

To reshape the data, let us follow the directions below:

Use the pivot_table function to create a new DataFrame with the following structure:

Index:

  • customer_id

Columns:

  • age (from customer_demographics)
  • region (from customer_demographics)
  • total_bookings (count of bookings per customer)
  • total_booking_value (sum of booking values per customer)
  • favorite_accommodation_type (mode of accommodation types per customer)

This will create a customer-centric view with summarized booking information for each customer.

pivot = new_df.pivot_table(index='customer_id',
			  values=['booking_id', 'booking_value', 'accommodation_type'],
			  aggfunc={'booking_id': 'count', 
					   'booking_value': 'sum', 
					   'accommodation_type': lambda x: x.mode().iloc[0]})

Now all we need to do is rename the columns.

# the json format makes it easy to rename the columns
pivot = pivot.rename(columns={'booking_id': 'total_bookings', 
                      'booking_value': 'total_booking_value', 
                      'accommodation_type': 'favorite_accommodation_type'})
 
pivot.head()
favorite_accommodation_typetotal_bookingstotal_booking_value
customer_id
1Hotel62959
2Hotel53108
3Vacation Rental32043
4Vacation Rental42568
5Hotel93356

Group and Aggregate

Group the data by region and accommodation type, and calculate key metrics such as total booking value, average booking value, and count of bookings. Then, sort and rank the results to identify top-performing regions and accommodation types.

# create group object
group = new_df.groupby(['region','accommodation_type'])
 
# calculate total booking value, average booking value, and count of bookings
metrics = group.agg(
     total_booking_value=pd.NamedAgg(column="booking_value", aggfunc="sum"),
     avg_booking_value=pd.NamedAgg(column="booking_value", aggfunc="mean"),
     count_of_bookings=pd.NamedAgg(column="booking_value", aggfunc="count")
)
 
metrics.head()
total_booking_valueavg_booking_valuecount_of_bookings
regionaccommodation_type
EastHostel2353019550.5425834274
Hotel2320282545.0509754257
Vacation Rental2326865553.8835994201
NorthHostel2356863548.3627274298
Hotel2309074552.6744854178

We can sort the data to identify which region/accommodation_type had the most bookings.

According to the data, Hostels in the North region had the highest count of bookings.

metrics.sort_values(by='count_of_bookings', ascending=False)
total_booking_valueavg_booking_valuecount_of_bookings
regionaccommodation_type
NorthHostel2356863548.3627274298
EastHostel2353019550.5425834274
Hotel2320282545.0509754257
Vacation Rental2326865553.8835994201
NorthHotel2309074552.6744854178
SouthHotel2290785549.7444204167
WestHotel2294852552.4439094154
NorthVacation Rental2267620548.5292694134
WestVacation Rental2279817552.0138014130
SouthVacation Rental2244830548.1880344095
WestHostel2220775546.9889164060
SouthHostel2224943549.0974834052

We can also try to plot the data to visually attempt to uncover patterns.

From the charts below, what I can spot pretty easily is that price is an important factor. We can see places like the East Hotel and the North Hostel have a healthy number of bookings. My best guess is that the low price is convincing people to book. What do you think?

# this section is used to dynamically create the subplot row and column
num_columns = len(metrics.columns)
num_rows = (num_columns + 1) // 2  # we want to only have 2 columns of plots
num_cols = 2
 
# you can adjust the figsize to your preference here
fig, axs = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(12,4*len(metrics.columns)))
 
# create space between plots
fig.subplots_adjust(hspace=2.0)
 
# this is to avoid having to deal with dynamically determining the rows,column on the loop below
axs = axs.flatten()
 
# loop through each column and create a bar chart
for i, column in enumerate(metrics.columns):
    metrics[column].sort_values().plot.barh(ax=axs[i])
    axs[i].set_title(f'{column}')
 
# hide any unused axes
for i in range(len(metrics.columns), len(axs)):
    axs[i].axis('off')
 
fig.tight_layout()
plt.show()

Summary

This tutorial guides you through analyzing customer booking behavior for a travel company using Pandas, covering tasks such as merging and reshaping data, grouping and aggregating data, and visualizing results to identify trends and patterns. By the end of the tutorial, you’ll have a comprehensive understanding of how to clean, transform, and analyze data to inform marketing strategies.

Key Takeaways:

  • Merging data from multiple sources using pd.merge()
  • Using pivot_table() to reshape data and create a customer-centric view
  • Grouping and aggregating data using groupby() and agg()
  • Calculating key metrics such as total booking value, average booking value, and count of bookings
  • Sorting and ranking data to identify top-performing regions and accommodation types
  • Visualizing data using bar charts to uncover patterns and trends