Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- š„ Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
You are a data scientist at an e-commerce company. You have two datasets: orders and products. Your task is to merge these datasets and calculate the average order value for each region.
Task:
- Merge the orders and products datasets on the ProductID column.
- Calculate the total order value for each order by summing the product prices.
- Calculate the average order value for each region.
Bonus Question:
- What is the average order value only including products with a price greater than 25.64
The Data
TheĀ ordersĀ table contains information about each order, such as the region it was made in and the customer who made it, while theĀ productsĀ table contains information about each product, including its price and the order it was part of. The two tables are linked by the ProductID column, which allows you to match products to the orders they were part of.
Orders:
- OrderID (int): Unique identifier for each order
- Region (str): Geographic region where the order was made (e.g. North, South, East, West)
- CustomerID (int): Unique identifier for each customer
- OrderDate (str): Date the order was made (in YYYY-MM-DD format)
- ProductID (int): Foreign key referencing the ProductID in the Products table
Products:
- ProductID (int): Unique identifier for each product
- ProductPrice (float): Price of each product
OrderID | Region | CustomerID | ProductID | OrderDate | |
---|---|---|---|---|---|
0 | 3 | North | 1 | 1 | 2022-01-01 |
1 | 3 | South | 2 | 1 | 2022-01-15 |
2 | 3 | East | 3 | 2 | 2022-02-01 |
3 | 5 | West | 4 | 3 | 2022-03-01 |
4 | 5 | North | 5 | 4 | 2022-04-01 |
I want to convert the date column from string to datetime. We want our dates to be date objects and not treated as strings.
ProductID | ProductPrice | |
---|---|---|
0 | 1 | 10.99 |
1 | 2 | 5.99 |
2 | 3 | 7.99 |
3 | 4 | 12.99 |
4 | 5 | 8.99 |
Merge the two dataframes
Thanks to the common column named ProductID, merging the dataframes is a simple task.
OrderID | Region | CustomerID | ProductID | OrderDate | ProductPrice | |
---|---|---|---|---|---|---|
0 | 3 | North | 1 | 1 | 2022-01-01 | 10.99 |
1 | 3 | South | 2 | 1 | 2022-01-15 | 10.99 |
2 | 3 | East | 3 | 2 | 2022-02-01 | 5.99 |
3 | 5 | West | 4 | 3 | 2022-03-01 | 7.99 |
4 | 5 | North | 5 | 4 | 2022-04-01 | 12.99 |
Calculate the total order value for each order by summing the product prices.
I started by creating a basic group object and then getting the sum of theĀ ProductPriceĀ column. I then realized I needed to get these two numbers back into the original dataframe.
I immediately thought about using transform, but to be honest, my mind is drawing a blank this morningā¦
Ok, letās start by converting the series into a proper dataframe.
OrderID | ProductPrice | |
---|---|---|
0 | 3 | 27.97 |
1 | 5 | 20.98 |
Since we are going to be merging the newĀ aggĀ dataframe, we should probably rename the ProductPrice column to something else so it doesnāt clash with the other dataframe.
Since this column represents the total order value, that is what we should call it.
Now we could have stopped after we calculated the sum on the group. We did get the total order value for each OrderID (27.50 for OderID 3 and 20.98 for OrderID 5). At the same time my mind said, how do we then add this information back to the original dataframe?
OrderID | Region | CustomerID | ProductID | OrderDate | ProductPrice | TotalOrderValue | |
---|---|---|---|---|---|---|---|
0 | 3 | North | 1 | 1 | 2022-01-01 | 10.99 | 27.97 |
1 | 3 | South | 2 | 1 | 2022-01-15 | 10.99 | 27.97 |
2 | 3 | East | 3 | 2 | 2022-02-01 | 5.99 | 27.97 |
3 | 5 | West | 4 | 3 | 2022-03-01 | 7.99 | 20.98 |
4 | 5 | North | 5 | 4 | 2022-04-01 | 12.99 | 20.98 |
Can you solve the BONUS question?
- What is the average order value only including products with a price greater thanĀ 25.64