Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- š„ Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
Imagine youāre a data analyst at a utility company. Your team needs your help to analyze energy consumption data from smart meters. They want you to identify trends in energy usage, peak consumption periods, and areas with high energy demand. The data is in a large CSV file, and they need someone with your Pandas skills to wrangle it and extract valuable insights.
Tasks:
- Task 1:Ā Energy Usage PatternsĀ - Find the top 5 days with the highest average energy consumption.
- Task 2:Ā Peak HoursĀ - Identify the peak consumption hours of the day and calculate the average energy usage during those hours.
- Task 3:Ā High-Demand AreasĀ - Determine the top area (zip code) with the highest total energy consumption.
The Data
The code below generates a dataset of 100,000 energy consumption records with the following columns:
- date:Ā The date of energy consumption
- time:Ā The time of energy consumption
- zip_code:Ā The zip code of the area where energy consumption was recorded
- energy_consumption:Ā The amount of energy consumed (in kilowatt-hours, kWh) during the specified date and time in the given zip code area
The energy_consumption column contains the followingĀ patterns:
- For zip code ā12345ā, a normal distribution with a mean of 20 and a standard deviation of 5.
- For zip code ā67890ā, a normal distribution with a mean of 15 and a standard deviation of 3.
- For zip code ā34567ā, a normal distribution with a mean of 30 and a standard deviation of 7.
date | time | zip_code | energy_consumption | |
---|---|---|---|---|
0 | 2022-06-22 | 18:00 | 67890 | 13.514508 |
1 | 2022-02-17 | 20:00 | 34567 | 20.695992 |
2 | 2022-04-28 | 11:00 | 34567 | 30.532627 |
3 | 2022-07-12 | 09:00 | 67890 | 15.940379 |
4 | 2022-11-20 | 07:00 | 12345 | 29.480771 |
Letās take a look at the datatypes of our shiny new dataframe.
TheĀ timeĀ column is currently a string, we will see if this causes us any problems.
<class āpandas.core.frame.DataFrameā> RangeIndex: 100000 entries, 0 to 99999 Data columns (total 4 columns):
Column Non-Null Count Dtype
`--- ------ -------------- -----
0 date 100000 non-null datetime64[ns]
1 time 100000 non-null object
2 zip_code 100000 non-null object
3 energy_consumption 100000 non-null float64
dtypes: datetime64ns, float64(1), object(2)
memory usage: 3.1+ MB
Energy Usage Patterns
Find the top 5 days with the highest average energy consumption.
energy_consumption | |
---|---|
date | |
2022-11-29 | 23.297248 |
2022-10-23 | 23.097614 |
2022-02-07 | 23.032625 |
2022-11-01 | 22.909447 |
2022-06-27 | 22.842946 |
Notice I calculated the averages by grouping all of the dates, but what if the question was actually asking for the days (Monday, Tuesday, ā¦)?
Luckily, we can pull out the name of the day and group by it. Follow along to see how.
TheĀ groupbyĀ method is very flexible, and it allows us to not only pass column names but actual functions as shown below.
date | day_name | energy_consumption | |
---|---|---|---|
332 | 2022-11-29 | Tuesday | 23.297248 |
295 | 2022-10-23 | Sunday | 23.097614 |
37 | 2022-02-07 | Monday | 23.032625 |
304 | 2022-11-01 | Tuesday | 22.909447 |
177 | 2022-06-27 | Monday | 22.842946 |
Now, you may be tempted to only group by the name of the day (Monday, Tuesday, ā¦), but realize that you will get a different result.
You would be asking Pandas to group all of the Mondays and calculate the average and thisĀ IS NOT THE SAMEĀ as getting the average, on a particular date.
energy_consumption | |
---|---|
date | |
Monday | 21.810044 |
Thursday | 21.735327 |
Tuesday | 21.665705 |
Friday | 21.664956 |
Saturday | 21.656552 |
Peak Hours
Identify the peak consumption hours of the day and calculate the average energy usage during those hours.
Remember I said we might have issues with the datatype of theĀ timeĀ column? Well, I guess I was wrong.
energy_consumption | |
---|---|
time | |
11:00 | 21.865782 |
17:00 | 21.833443 |
13:00 | 21.829379 |
23:00 | 21.821426 |
00:00 | 21.782811 |
Here, we plot those times and clearly see the spikes that represent those high consumption periods.
Can You Solve the Last Question?
- High-Demand Areas - Determine the top area (zip code) with the highest total energy consumption.