Note

Description:

You are a data scientist at an e-commerce company. You have two datasets: orders and products. Your task is to merge these datasets and calculate the average order value for each region.

Task:

  • Merge the orders and products datasets on the ProductID column.
  • Calculate the total order value for each order by summing the product prices.
  • Calculate the average order value for each region.

Bonus Question:

  • What is the average order value only including products with a price greater than 25.64
# import libraries
import pandas as pd
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

TheĀ ordersĀ table contains information about each order, such as the region it was made in and the customer who made it, while theĀ productsĀ table contains information about each product, including its price and the order it was part of. The two tables are linked by the ProductID column, which allows you to match products to the orders they were part of.

Orders:
  • OrderID (int): Unique identifier for each order
  • Region (str): Geographic region where the order was made (e.g. North, South, East, West)
  • CustomerID (int): Unique identifier for each customer
  • OrderDate (str): Date the order was made (in YYYY-MM-DD format)
  • ProductID (int): Foreign key referencing the ProductID in the Products table
Products:
  • ProductID (int): Unique identifier for each product
  • ProductPrice (float): Price of each product
orders = pd.DataFrame({
'OrderID': [3, 3, 3, 5, 5],
'Region': ['North', 'South', 'East', 'West', 'North'],
'CustomerID': [1, 2, 3, 4, 5],
'ProductID': [1, 1, 2, 3, 4],
'OrderDate': ['2022-01-01', '2022-01-15', '2022-02-01', '2022-03-01', '2022-04-01']
})
 
orders.head()
OrderIDRegionCustomerIDProductIDOrderDate
03North112022-01-01
13South212022-01-15
23East322022-02-01
35West432022-03-01
45North542022-04-01
# make sure the datatypes look good
orders.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   OrderID     5 non-null      int64 
 1   Region      5 non-null      object
 2   CustomerID  5 non-null      int64 
 3   ProductID   5 non-null      int64 
 4   OrderDate   5 non-null      object
dtypes: int64(3), object(2)
memory usage: 332.0+ bytes

I want to convert the date column from string to datetime. We want our dates to be date objects and not treated as strings.

orders['OrderDate'] = pd.to_datetime(orders['OrderDate'] )
 
orders.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   OrderID     5 non-null      int64         
 1   Region      5 non-null      object        
 2   CustomerID  5 non-null      int64         
 3   ProductID   5 non-null      int64         
 4   OrderDate   5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 332.0+ bytes
products = pd.DataFrame({
'ProductID': [1, 2, 3, 4, 5],
'ProductPrice': [10.99, 5.99, 7.99, 12.99, 8.99]
})
 
products.head()
ProductIDProductPrice
0110.99
125.99
237.99
3412.99
458.99
# make sure datatypes look ok
 
products.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   ProductID     5 non-null      int64  
 1   ProductPrice  5 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 212.0 bytes

Merge the two dataframes

Thanks to the common column named ProductID, merging the dataframes is a simple task.

# Merge the orders and products datasets on the ProductID column.
df = orders.merge(products, on='ProductID')
 
df
OrderIDRegionCustomerIDProductIDOrderDateProductPrice
03North112022-01-0110.99
13South212022-01-1510.99
23East322022-02-015.99
35West432022-03-017.99
45North542022-04-0112.99

Calculate the total order value for each order by summing the product prices.

I started by creating a basic group object and then getting the sum of theĀ ProductPriceĀ column. I then realized I needed to get these two numbers back into the original dataframe.

I immediately thought about using transform, but to be honest, my mind is drawing a blank this morningā€¦

# create group object
group = df.groupby('OrderID')
 
# get order value for group
group['ProductPrice'].sum(numeric_only=True)
OrderID
3    27.97
5    20.98
Name: ProductPrice, dtype: float64

Ok, letā€™s start by converting the series into a proper dataframe.

agg = pd.DataFrame(group['ProductPrice'].sum(numeric_only=True)).reset_index()
 
agg
OrderIDProductPrice
0327.97
1520.98

Since we are going to be merging the newĀ aggĀ dataframe, we should probably rename the ProductPrice column to something else so it doesnā€™t clash with the other dataframe.

Since this column represents the total order value, that is what we should call it.

# rename columns
 
agg.columns = ['OrderID', 'TotalOrderValue']

Now we could have stopped after we calculated the sum on the group. We did get the total order value for each OrderID (27.50 for OderID 3 and 20.98 for OrderID 5). At the same time my mind said, how do we then add this information back to the original dataframe?

df.merge(agg, on='OrderID')
OrderIDRegionCustomerIDProductIDOrderDateProductPriceTotalOrderValue
03North112022-01-0110.9927.97
13South212022-01-1510.9927.97
23East322022-02-015.9927.97
35West432022-03-017.9920.98
45North542022-04-0112.9920.98

Can you solve the BONUS question?

  • What is the average order value only including products with a price greater thanĀ 25.64