Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- š„ Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
Youāre a data analyst for a non-profit organization, and youāve been tasked with cleaning up a messy dataset of donations. The data is a bit of a disaster, with missing values, duplicates, and inconsistent formatting. Your mission is to use your Pandas skills to wrangle the data into shape.
Tasks:
- Clean up the Mess:Ā Remove duplicates, handle missing values, and ensure data types are correct.
- Standardize the data:Ā Normalize the āDonation Amountā column and convert theĀ
date
Ā column to a standard format. - Data quality check:Ā Identify and correct any inconsistent or invalid data.
The Data
The columns below represent information about individual donations, the date they were made, and the campaign that drove the donation. The goal is to clean, transform, and prepare this data for analysis.
Hereās a breakdown of what each column in the sample data represents:
- Donor ID:Ā A unique identifier for each donor
- Donation Amount:Ā The amount donated by each donor ( initially in a mix of numeric and string formats, requiring cleanup)
- Date:Ā The date each donation was made
- Campaign:Ā The marketing campaign or channel that led to the donation
Important Note about theĀ Donation Amount
Ā Column:
The logic below will generate a mix of:
- Numeric values (e.g., 10.50, 500.00)
- String values with words (e.g., ā10 thousandā, ā5 dollars and 25 centsā)
- String values with currency symbols (e.g., ā
$50
ā, ā$1000
ā)
Your task will be to clean up this column by converting all values to a standard numeric format, handling the various string formats, and dealing with any potential errors or inconsistencies. Good luck!
donor_id | date | campaign | donation_amount | |
---|---|---|---|---|
0 | 685 | Invalid Date | Unknown | NaN |
1 | 560 | 2022-03-07T00:00:00.000000000 | 6 dollars and 98 cents | |
2 | 630 | 2022-11-08T00:00:00.000000000 | Social Media | 76 thousand |
3 | 193 | 2022-03-25T00:00:00.000000000 | NaN | |
4 | 836 | 2022-04-07T00:00:00.000000000 | $81 | |
ā¦ | ā¦ | ā¦ | ā¦ | ā¦ |
9995 | 426 | Invalid Date | Social Media | $81 |
9996 | 891 | 2022-04-18T00:00:00.000000000 | Unknown | NaN |
9997 | 778 | 2022-08-24T00:00:00.000000000 | Event | $81 |
9998 | 974 | 2022-10-07T00:00:00.000000000 | $81 | |
9999 | 74 | 2022-07-09T00:00:00.000000000 | Event | NaN |
10000 rows Ć 4 columns |
Letās start by looking at the datatypes.
As you can expect, Pandas is treating all of the columns as strings. Let the clean up process begin.
<class āpandas.core.frame.DataFrameā> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- ----- 0 donor_id 10000 non-null object 1 date 10000 non-null object 2 campaign 10000 non-null object 3 donation_amount 6666 non-null object dtypes: object(4) memory usage: 312.6+ KB
Clean up the Mess:
Remove duplicates, handle missing values, and ensure data types are correct.
If we assume that we will not be able to get the correct donation amounts, we might as well remove those rows from the data.
<class āpandas.core.frame.DataFrameā> Index: 6666 entries, 1 to 9998 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- ----- 0 donor_id 6666 non-null object 1 date 6666 non-null object 2 campaign 6666 non-null object 3 donation_amount 6666 non-null object dtypes: object(4) memory usage: 260.4+ KB
The marketing manager told us to replace any missing dates with ā1970-01-01ā so we can identify these and deal with them later.
donor_id | date | campaign | donation_amount | |
---|---|---|---|---|
5 | 764 | Invalid Date | Social Media | 77.55500350452421 |
10 | 278 | Invalid Date | Event | 76 thousand |
20 | 487 | Invalid Date | Event | 77.55500350452421 |
25 | 850 | Invalid Date | 76 thousand | |
35 | 710 | Invalid Date | Unknown | 6 dollars and 98 cents |
Here is where we set the dates toĀ 1970-01-01
.
donor_id | date | campaign | donation_amount | |
---|---|---|---|---|
1 | 560 | 2022-03-07T00:00:00.000000000 | 6 dollars and 98 cents | |
2 | 630 | 2022-11-08T00:00:00.000000000 | Social Media | 76 thousand |
4 | 836 | 2022-04-07T00:00:00.000000000 | $81 | |
5 | 764 | 1970-01-01 | Social Media | 77.55500350452421 |
7 | 360 | 2022-10-20T00:00:00.000000000 | Unknown | 76 thousand |
ā¦ | ā¦ | ā¦ | ā¦ | ā¦ |
9992 | 308 | 2022-01-13T00:00:00.000000000 | 77.55500350452421 | |
9994 | 694 | 2022-07-21T00:00:00.000000000 | Event | 6 dollars and 98 cents |
9995 | 426 | 1970-01-01 | Social Media | $81 |
9997 | 778 | 2022-08-24T00:00:00.000000000 | Event | $81 |
9998 | 974 | 2022-10-07T00:00:00.000000000 | $81 | |
6666 rows Ć 4 columns |
Although we successfully converted the strings into dates, the date column remains in string format.
<class āpandas.core.frame.DataFrameā> Index: 6666 entries, 1 to 9998 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- ----- 0 donor_id 6666 non-null object 1 date 6666 non-null object 2 campaign 6666 non-null object 3 donation_amount 6666 non-null object dtypes: object(4) memory usage: 260.4+ KB
Convert string column to a datetime object.
1 2022-03-07
2 2022-11-08
4 2022-04-07
5 1970-01-01
7 2022-10-20
ā¦
9992 2022-01-13
9994 2022-07-21
9995 1970-01-01
9997 2022-08-24
9998 2022-10-07
Name: date, Length: 6666, dtype: datetime64[ns]
This morning, for some reason I canāt get these datatypes to behaveā¦ the code below did not work.
<class āpandas.core.frame.DataFrameā> Index: 6666 entries, 1 to 9998 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- ----- 0 donor_id 6666 non-null object 1 date 6666 non-null object 2 campaign 6666 non-null object 3 donation_amount 6666 non-null object dtypes: object(4) memory usage: 260.4+ KB
We can also take care of the Donor ID pretty easily.
This also did not workā¦
<class āpandas.core.frame.DataFrameā> Index: 6666 entries, 1 to 9998 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- ----- 0 donor_id 6666 non-null object 1 date 6666 non-null object 2 campaign 6666 non-null object 3 donation_amount 6666 non-null object dtypes: object(4) memory usage: 260.4+ KB
This did the trick for me to get the date types to be represented correctly.
<class āpandas.core.frame.DataFrameā> Index: 6666 entries, 1 to 9998 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- -----
0 donor_id 6666 non-null Int64
1 date 6666 non-null datetime64[ns]
2 campaign 6666 non-null string
3 donation_amount 6666 non-null string
dtypes: Int64(1), datetime64ns, string(2)
memory usage: 266.9 KB
Donation Amount Cleanup
- Remove the dollar sign
- Apply a custom function to convert the values to a numeric format
1 6 dollars and 98 cents 2 76 thousand 4 81 5 77.55500350452421 7 76 thousand Name: donation_amount, dtype: string
1 6.98 2 76000 4 81 5 77.55500350452421 7 76000 Name: donation_amount, dtype: string
Now letās fix the datatype for the donation amount.
<class āpandas.core.frame.DataFrameā> Index: 6666 entries, 1 to 9998 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- -----
0 donor_id 6666 non-null Int64
1 date 6666 non-null datetime64[ns]
2 campaign 6666 non-null string
3 donation_amount 6666 non-null float64
dtypes: Int64(1), datetime64ns, float64(1), string(1)
memory usage: 266.9 KB
OK, so we have taken care of a lot here.
- The donor_id column is now in integer format
- The date column is now in the correct format
- The donation_amount column has been successfully cleaned up and converted to the correct numeric format
donor_id | date | campaign | donation_amount | |
---|---|---|---|---|
1 | 560 | 2022-03-07 | 6.980000 | |
2 | 630 | 2022-11-08 | Social Media | 76000.000000 |
4 | 836 | 2022-04-07 | 81.000000 | |
5 | 764 | 1970-01-01 | Social Media | 77.555004 |
7 | 360 | 2022-10-20 | Unknown | 76000.000000 |
8 | 10 | 2022-06-18 | Social Media | 81.000000 |
10 | 278 | 1970-01-01 | Event | 76000.000000 |
11 | 755 | 2022-02-02 | Event | 76000.000000 |
13 | 600 | 2022-09-21 | Social Media | 6.980000 |
14 | 71 | 2022-05-25 | Unknown | 6.980000 |
16 | 601 | 2022-12-29 | Event | 76000.000000 |
17 | 397 | 2022-01-31 | Event | 81.000000 |
19 | 706 | 2022-12-05 | Social Media | 76000.000000 |
20 | 487 | 1970-01-01 | Event | 77.555004 |
22 | 88 | 2022-07-05 | Social Media | 6.980000 |
23 | 175 | 2022-07-24 | 81.000000 | |
25 | 850 | 1970-01-01 | 76000.000000 | |
26 | 678 | 2022-06-20 | Event | 76000.000000 |
28 | 846 | 2022-05-20 | Unknown | 77.555004 |
29 | 73 | 2022-08-31 | 6.980000 |
donor_id | date | donation_amount | |
---|---|---|---|
count | 6666.0 | 6666 | 6666.000000 |
mean | 501.192319 | 2012-01-04 18:13:04.158415872 | 19297.706620 |
min | 1.0 | 1970-01-01 00:00:00 | 6.980000 |
25% | 255.0 | 2022-01-25 00:00:00 | 77.555004 |
50% | 499.0 | 2022-05-24 00:00:00 | 77.555004 |
75% | 755.0 | 2022-09-13 00:00:00 | 76000.000000 |
max | 999.0 | 2022-12-31 00:00:00 | 76000.000000 |
std | 288.740445 | NaN | 33034.244560 |
Data Gaze
I am going to recommend you get this data into Microsoft Excel and do a quick glance. Excel does a much better job at letting you analyze the data on your nice and big monitor.