Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- 🔥 Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description
You’re a data analyst for a large agricultural company that produces and distributes fresh produce to supermarkets across the country. The company has recently expanded its operations and needs your help to clean and prepare its dataset of produce shipments for analysis. Your task is to wrangle the dataset to make it suitable for analysis and visualization.
Tasks
- Data Transformation: Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.
- Data Categorization: Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).
- Data Enrichment: Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4
The Data
The data represents a sample of shipments from farms to supermarkets across the country, with varying types of produce and shipment characteristics.
Columns:
- produce_type: The type of produce (Apples, Bananas, Carrots, Oranges, Lettuce)
- shipment_date: The date the shipment was sent, with inconsistent formatting
- farm_id: The ID of the farm that produced the shipment
- supermarket_id: The ID of the supermarket that received the shipment
- shipment_weight: The weight of the shipment in pounds
- revenue: The revenue generated by the shipment in dollars
- shipping_time: The time it took for the shipment to arrive in days
- delayed: A boolean indicating whether the shipment was delayed or lost
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | |
---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True |
Data types inspection:
Looking at the output below, we can see the only column that does not have the correct data type is the shipment_date column. If you take a close look at how we created the data, we added extra spaces to similate bad data in this column.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 produce_type 10000 non-null object
1 shipment_date 10000 non-null object
2 farm_id 10000 non-null object
3 supermarket_id 10000 non-null object
4 shipment_weight 10000 non-null float64
5 revenue 10000 non-null float64
6 shipping_time 10000 non-null float64
7 delayed 10000 non-null bool
dtypes: bool(1), float64(3), object(4)
memory usage: 556.8+ KB
Data Transformation:
Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.
Let us use the pd.to_datetime
method to clean up this column.
Since we have spaces added to the string, we first need to remove those using the strip()
method before attempting to convert them to dates.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 produce_type 10000 non-null object
1 shipment_date 10000 non-null datetime64[ns]
2 farm_id 10000 non-null object
3 supermarket_id 10000 non-null object
4 shipment_weight 10000 non-null float64
5 revenue 10000 non-null float64
6 shipping_time 10000 non-null float64
7 delayed 10000 non-null bool
dtypes: bool(1), datetime64[ns](1), float64(3), object(3)
memory usage: 556.8+ KB
Here are the steps to create a new column named “Year_Month”.
- pull out the year and convert to a string
- pull out the month and convert to a string
- add a dash ”-” in between the two strings
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | Year_Month | |
---|---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True | 2022-1 |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True | 2022-1 |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True | 2022-1 |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False | 2022-1 |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True | 2022-1 |
Data Categorization:
Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).
**dict.fromkeys()
is a Python expression that creates a new dictionary from an iterable (like a list or tuple) and assigns a default value to each key.
For this task, I am going to create a Python dictionary that will categorize the “produce_type” column.
Let us start by explaining how I created the dictionary:
dict.fromkeys()
is a method that creates a new dictionary from an iterable.- The ** operator is used to unpack the iterable and pass it as arguments to the
fromkeys()
method.
For example, if you have a list:
Using **dict.fromkeys()
:
Creates a new dictionary:
In the context of the code below, **dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables')
creates a dictionary with each value in [‘Carrots’, ‘Lettuce’] as a key and ‘vegetables’ as the corresponding value.
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | Year_Month | produce_category | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True | 2022-1 | vegetables |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True | 2022-1 | fruits |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True | 2022-1 | fruits |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False | 2022-1 | fruits |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True | 2022-1 | fruits |
Data Enrichment:
Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).
For this task, we will use the same strategy of mapping a dictionary to the dataframe.
produce_type | shipment_date | farm_id | supermarket_id | shipment_weight | revenue | shipping_time | delayed | Year_Month | produce_category | season | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Lettuce | 2022-01-01 | farm1 | market3 | 522.434901 | 253.916357 | 4.193670 | True | 2022-1 | vegetables | winter |
1 | Apples | 2022-01-02 | farm3 | market2 | 129.924851 | 359.323176 | 2.133125 | True | 2022-1 | fruits | winter |
2 | Oranges | 2022-01-03 | farm4 | market4 | 148.890313 | 509.488722 | 6.331273 | True | 2022-1 | fruits | winter |
3 | Oranges | 2022-01-04 | farm3 | market1 | 260.623032 | 728.175679 | 2.599539 | False | 2022-1 | fruits | winter |
4 | Oranges | 2022-01-05 | farm1 | market3 | 878.212644 | 533.657207 | 5.697890 | True | 2022-1 | fruits | winter |
Summary
In this tutorial, you learned how to clean and transform data, create new columns based on existing data, and map values to categories using dictionaries and Pandas functions. You now have a solid understanding of data wrangling and analysis techniques, which will serve as a great foundation for your future work with data.
Key Takeaways:
- Data Cleaning: Use
pd.to_datetime()
to convert date columns to a standard format. - Data Transformation: Use
dt.year
anddt.month
to extract year and month from a datetime column. - Data Categorization: Use
dict.fromkeys()
to create a dictionary for mapping values andmap()
to apply the mapping to a column. - Data Enrichment: Use
dt.month
to extract the month and map it to a season using a dictionary.