Note

Description

You’re a data analyst for a large agricultural company that produces and distributes fresh produce to supermarkets across the country. The company has recently expanded its operations and needs your help to clean and prepare its dataset of produce shipments for analysis. Your task is to wrangle the dataset to make it suitable for analysis and visualization.

Tasks

  • Data Transformation: Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.
  • Data Categorization: Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).
  • Data Enrichment: Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).
# import libraries
import pandas as pd
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

The data represents a sample of shipments from farms to supermarkets across the country, with varying types of produce and shipment characteristics.

Columns:

  • produce_type: The type of produce (Apples, Bananas, Carrots, Oranges, Lettuce)
  • shipment_date: The date the shipment was sent, with inconsistent formatting
  • farm_id: The ID of the farm that produced the shipment
  • supermarket_id: The ID of the supermarket that received the shipment
  • shipment_weight: The weight of the shipment in pounds
  • revenue: The revenue generated by the shipment in dollars
  • shipping_time: The time it took for the shipment to arrive in days
  • delayed: A boolean indicating whether the shipment was delayed or lost
# set the seed
np.random.seed(0)
 
# generate the data
data = {
    'produce_type': np.random.choice(['Apples', 'Bananas', 'Carrots', 'Oranges', 'Lettuce'], size=10000),
    'shipment_date': pd.date_range('2022-01-01', periods=10000, freq='D').astype(str) + np.random.choice(['', ' ', '  ', '\t'], size=10000),
    'farm_id': np.random.choice(['farm1', 'farm2', 'farm3', 'farm4', 'farm5'], size=10000),
    'supermarket_id': np.random.choice(['market1', 'market2', 'market3', 'market4', 'market5'], size=10000),
    'shipment_weight': np.random.uniform(100, 1000, size=10000),
    'revenue': np.random.uniform(100, 1000, size=10000),
    'shipping_time': np.random.uniform(1, 10, size=10000),
    'delayed': np.random.choice([True, False], size=10000)
}
 
df = pd.DataFrame(data)
df.head()
produce_typeshipment_datefarm_idsupermarket_idshipment_weightrevenueshipping_timedelayed
0Lettuce2022-01-01farm1market3522.434901253.9163574.193670True
1Apples2022-01-02farm3market2129.924851359.3231762.133125True
2Oranges2022-01-03farm4market4148.890313509.4887226.331273True
3Oranges2022-01-04farm3market1260.623032728.1756792.599539False
4Oranges2022-01-05farm1market3878.212644533.6572075.697890True

Data types inspection:

Looking at the output below, we can see the only column that does not have the correct data type is the shipment_date column. If you take a close look at how we created the data, we added extra spaces to similate bad data in this column.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   produce_type     10000 non-null  object 
 1   shipment_date    10000 non-null  object 
 2   farm_id          10000 non-null  object 
 3   supermarket_id   10000 non-null  object 
 4   shipment_weight  10000 non-null  float64
 5   revenue          10000 non-null  float64
 6   shipping_time    10000 non-null  float64
 7   delayed          10000 non-null  bool   
dtypes: bool(1), float64(3), object(4)
memory usage: 556.8+ KB

Data Transformation:

Convert the shipment_date column to a standard datetime format and extract the month and year of each shipment.


Let us use the pd.to_datetime method to clean up this column.

Since we have spaces added to the string, we first need to remove those using the strip() method before attempting to convert them to dates.

df['shipment_date'] = pd.to_datetime(df['shipment_date'].str.strip())
 
# we now have a date object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   produce_type     10000 non-null  object        
 1   shipment_date    10000 non-null  datetime64[ns]
 2   farm_id          10000 non-null  object        
 3   supermarket_id   10000 non-null  object        
 4   shipment_weight  10000 non-null  float64       
 5   revenue          10000 non-null  float64       
 6   shipping_time    10000 non-null  float64       
 7   delayed          10000 non-null  bool          
dtypes: bool(1), datetime64[ns](1), float64(3), object(3)
memory usage: 556.8+ KB

Here are the steps to create a new column named “Year_Month”.

  • pull out the year and convert to a string
  • pull out the month and convert to a string
  • add a dash ”-” in between the two strings
# new column composed of the year and the month
df['Year_Month'] = df['shipment_date'].dt.year.astype(str) + '-' + df['shipment_date'].dt.month.astype(str)
 
df.head()
produce_typeshipment_datefarm_idsupermarket_idshipment_weightrevenueshipping_timedelayedYear_Month
0Lettuce2022-01-01farm1market3522.434901253.9163574.193670True2022-1
1Apples2022-01-02farm3market2129.924851359.3231762.133125True2022-1
2Oranges2022-01-03farm4market4148.890313509.4887226.331273True2022-1
3Oranges2022-01-04farm3market1260.623032728.1756792.599539False2022-1
4Oranges2022-01-05farm1market3878.212644533.6572075.697890True2022-1

Data Categorization:

Create a new column produce_category based on the produce_type, categorizing produce into fruits (Apples, Bananas, Oranges) and vegetables (Carrots, Lettuce).

**dict.fromkeys() is a Python expression that creates a new dictionary from an iterable (like a list or tuple) and assigns a default value to each key.

For this task, I am going to create a Python dictionary that will categorize the “produce_type” column.

Let us start by explaining how I created the dictionary:

  • dict.fromkeys() is a method that creates a new dictionary from an iterable.
  • The ** operator is used to unpack the iterable and pass it as arguments to the fromkeys() method.

For example, if you have a list:

my_list = ['apple', 'banana', 'cherry']

Using **dict.fromkeys():

my_dict = **dict.fromkeys(my_list, 'fruit')

Creates a new dictionary:

my_dict = {'apple': 'fruit', 'banana': 'fruit', 'cherry': 'fruit'}

In the context of the code below, **dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables') creates a dictionary with each value in [‘Carrots’, ‘Lettuce’] as a key and ‘vegetables’ as the corresponding value.

# map the values
mapping = {**dict.fromkeys(['Apples', 'Bananas', 'Oranges'], 'fruits'), 
           **dict.fromkeys(['Carrots', 'Lettuce'], 'vegetables')}
 
# apply the map to the produce_type column
df['produce_category'] = df['produce_type'].map(mapping)
df.head()
produce_typeshipment_datefarm_idsupermarket_idshipment_weightrevenueshipping_timedelayedYear_Monthproduce_category
0Lettuce2022-01-01farm1market3522.434901253.9163574.193670True2022-1vegetables
1Apples2022-01-02farm3market2129.924851359.3231762.133125True2022-1fruits
2Oranges2022-01-03farm4market4148.890313509.4887226.331273True2022-1fruits
3Oranges2022-01-04farm3market1260.623032728.1756792.599539False2022-1fruits
4Oranges2022-01-05farm1market3878.212644533.6572075.697890True2022-1fruits

Data Enrichment:

Create a new column season based on the month of the shipment, categorizing shipments into spring (March to May), summer (June to August), fall (September to November), and winter (December to February).

For this task, we will use the same strategy of mapping a dictionary to the dataframe.

# the trick is to use numbers instead of dates
season_map = {**dict.fromkeys([12,1,2], 'winter'), 
              **dict.fromkeys(range(3, 6), 'spring'), 
              **dict.fromkeys(range(6, 9), 'summer'), 
              **dict.fromkeys(range(9, 12), 'fall')}
 
# apply map
df['season'] = df['shipment_date'].dt.month.map(season_map)
df.head()
produce_typeshipment_datefarm_idsupermarket_idshipment_weightrevenueshipping_timedelayedYear_Monthproduce_categoryseason
0Lettuce2022-01-01farm1market3522.434901253.9163574.193670True2022-1vegetableswinter
1Apples2022-01-02farm3market2129.924851359.3231762.133125True2022-1fruitswinter
2Oranges2022-01-03farm4market4148.890313509.4887226.331273True2022-1fruitswinter
3Oranges2022-01-04farm3market1260.623032728.1756792.599539False2022-1fruitswinter
4Oranges2022-01-05farm1market3878.212644533.6572075.697890True2022-1fruitswinter

Summary

In this tutorial, you learned how to clean and transform data, create new columns based on existing data, and map values to categories using dictionaries and Pandas functions. You now have a solid understanding of data wrangling and analysis techniques, which will serve as a great foundation for your future work with data.

Key Takeaways:

  • Data Cleaning: Use pd.to_datetime() to convert date columns to a standard format.
  • Data Transformation: Use dt.year and dt.month to extract year and month from a datetime column.
  • Data Categorization: Use dict.fromkeys() to create a dictionary for mapping values and map() to apply the mapping to a column.
  • Data Enrichment: Use dt.month to extract the month and map it to a season using a dictionary.