Note
- This tutorial is also available on nbviewer.
- Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Let's practice Pandas using the GroupBy function.
Import Libraries
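A minimal import cell for this notebook; pandas for the dataframes, plus NumPy, which comes in handy later for the random buy/sell flags:

```python
import pandas as pd
import numpy as np

# Print the version you are running; outputs in this tutorial
# may differ slightly across pandas versions
print(pd.__version__)
```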
The Data
Here is the csv data if you want to follow along:
Read CSV file
For those following along, note that the code below will most likely return an error for you, since the file Test_9_17_Python.csv is not on your computer.
Here are two options for you:

- Use `raw = pd.read_clipboard(sep=',')` to read the data from your clipboard
- Manually create the csv and save it in the same location as this notebook
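A third option is to build the DataFrame inline. The snippet below recreates the first few rows shown in this tutorial (the real Test_9_17_Python.csv has 22 rows covering symbols A, B, C, and E):

```python
import pandas as pd

# Recreate a small slice of the tutorial's data by hand;
# extend the lists if you want the full 22-row dataset
data = {
    'Date': ['1/1/2013', '1/2/2013', '1/3/2013', '1/4/2013', '1/5/2013'],
    'Symbol': ['A', 'A', 'A', 'A', 'A'],
    'Volume': [0, 200, 1200, 1001, 1300],
}
df = pd.DataFrame(data)
print(df.head())
```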
As you can see we are working with daily trading volume for various stocks over a period of time, from January 1, 2013, to May 28, 2013.
Here are the stock Symbols: A, B, C, and E
| | Date | Symbol | Volume |
|---|---|---|---|
| 0 | 1/1/2013 | A | 0 |
| 1 | 1/2/2013 | A | 200 |
| 2 | 1/3/2013 | A | 1200 |
| 3 | 1/4/2013 | A | 1001 |
| 4 | 1/5/2013 | A | 1300 |
At times I like to make a copy of the original dataframe in case I have to go back to the original data. Yes, we can just recreate the data, but at times I'd rather have it handy.

The copy() function creates a truly independent copy of the data, so changes to the new object won't affect the original.
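A quick demonstration of that independence:

```python
import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A'], 'Volume': [0, 200]})

# copy() returns a deep copy by default, so mutating the
# backup leaves the original dataframe untouched
backup = df.copy()
backup['Volume'] = 0

print(df['Volume'].tolist())  # original is unchanged: [0, 200]
```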
Fix Date Column
Whenever you import data from basically anywhere, please, please check your data types. You want to make sure the data is in the correct format before you begin to work with it.
As you can see below, the Date column, which represents our dates, came in as a string. We need to fix this.
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 22 entries, 0 to 21
    Data columns (total 3 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   Date    22 non-null     object
     1   Symbol  22 non-null     object
     2   Volume  22 non-null     int64
    dtypes: int64(1), object(2)
    memory usage: 660.0+ bytes
Pandas to the rescue!
We can automagically convert our dates, currently represented as strings, into datetime objects. Note that it's not always this easy; expect to battle date strings in the future.
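The conversion is one line; pandas infers the month/day/year format from strings like these:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2013', '1/2/2013', '1/3/2013']})

# Convert the string column to datetime64[ns]; for messier data
# you can pass an explicit format, e.g. format='%m/%d/%Y'
df['Date'] = pd.to_datetime(df['Date'])
print(df['Date'].dtype)
```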
We now have all our columns looking good.
- Our dates are datetime objects
- The symbols are strings
- The trading volumes are integers
    Date      datetime64[ns]
    Symbol            object
    Volume             int64
    dtype: object
Add a New Column
Let's randomly assign a buy or sell flag to each row in this dataset. This will give us more data to work with and show off some of Pandas' GroupBy capabilities.
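One way to do this is NumPy's random choice (a sketch; the original notebook may have used a different random scheme, and the seed here is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Symbol': ['A'] * 5,
                   'Volume': [0, 200, 1200, 1001, 1300]})

# Seed for reproducibility, then pick 'buy' or 'sell' for every row
rng = np.random.default_rng(42)
df['flag'] = rng.choice(['buy', 'sell'], size=len(df))
print(df.head())
```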
| | Date | Symbol | Volume | flag |
|---|---|---|---|---|
| 0 | 1/1/2013 | A | 0 | buy |
| 1 | 1/2/2013 | A | 200 | sell |
| 2 | 1/3/2013 | A | 1200 | buy |
| 3 | 1/4/2013 | A | 1001 | sell |
| 4 | 1/5/2013 | A | 1300 | buy |
Group by one column
At a high level, the GroupBy method will:
- Group your data by one or more columns
- And then perform some action on each group (like calculate the sum, mean, etc.)
In the example below, we are grouping by the column named Symbol and then adding all the values found in the Volume column for that group.
As a side note: if my dataframe has non-numeric columns (in this case, the flag column), I have been adding numeric_only=True to avoid the Pandas warning message. It's kind of annoying.
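Putting that together, using the column names from this tutorial's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Symbol': ['A', 'A', 'B', 'B'],
    'Volume': [100, 200, 300, 400],
    'flag':   ['buy', 'sell', 'buy', 'sell'],
})

# numeric_only=True skips the non-numeric 'flag' column,
# so sum() only aggregates Volume
totals = df.groupby('Symbol').sum(numeric_only=True)
print(totals)
```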
Symbol | Volume |
---|---|
A | 5051 |
B | 4830 |
C | 101800 |
E | 30800 |
Let's do a Deep Dive
When you use the groupby function in Pandas, you are grouping your data by one or more columns. These columns are called the "grouping columns" or "index columns".

In our example above we grouped by the column Symbol:
group = df.groupby('Symbol')
- The Symbol column becomes the grouping column

When you apply a function to this grouped DataFrame using apply, the function is currently applied to both the grouping columns and the data columns.

- in this case, the grouping column is Symbol
- in this case, the data column is Volume

However, in future versions of Pandas, the grouping columns will be excluded from the operation by default. This means that the function will only be applied to the data columns (Volume in this example), and not to the grouping column named Symbol.

This change is being made to prevent unexpected behavior and to make the apply function more predictable. If you want to include the grouping columns in the operation, you will need to explicitly select them or pass include_groups=True.
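A sketch of a version-proof way around the deprecation (the include_groups parameter landed in pandas 2.2; selecting the data column before apply sidesteps the issue on any version):

```python
import pandas as pd

df = pd.DataFrame({'Symbol': ['A', 'A', 'B'],
                   'Volume': [100, 200, 300]})

# Selecting ['Volume'] first means apply only ever sees the
# data column, never the grouping column, so no warning fires
result = df.groupby('Symbol')['Volume'].apply(lambda s: s.sum())
print(result)
```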
Symbol | Volume |
---|---|
A | 5051 |
B | 4830 |
C | 101800 |
E | 30800 |
Let's mess around with the transform method. Unlike the previous example, transform allows us to keep the original shape of our dataframe.
To avoid any issues with the non-numeric flag column, I excluded it while creating the group object. Yes, I cheated a bit here.
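Selecting the Volume column up front is one way to exclude the non-numeric flag column; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Symbol': ['A', 'A', 'B'],
    'Volume': [100, 200, 300],
})

# transform returns a result aligned to df's original index,
# so the per-group sum is broadcast back to every row
df['addition'] = df.groupby('Symbol')['Volume'].transform('sum')
print(df)
```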
| | Date | Symbol | Volume | flag | addition |
|---|---|---|---|---|---|
| 0 | 1/1/2013 | A | 0 | buy | 5051 |
| 1 | 1/2/2013 | A | 200 | sell | 5051 |
| 2 | 1/3/2013 | A | 1200 | buy | 5051 |
| 3 | 1/4/2013 | A | 1001 | sell | 5051 |
| 4 | 1/5/2013 | A | 1300 | buy | 5051 |
| 5 | 1/6/2013 | A | 1350 | sell | 5051 |
| 6 | 3/8/2013 | B | 500 | buy | 4830 |
| 7 | 3/9/2013 | B | 1150 | sell | 4830 |
| 8 | 3/10/2013 | B | 1180 | buy | 4830 |
| 9 | 3/11/2013 | B | 2000 | sell | 4830 |
Here we get fancy!
The function below named test will do the following:

- filter rows where flag is equal to 'sell'
- filter rows where Volume is greater than 1000
So we only return the rows that meet the two conditions above.
| Symbol | | Date | Volume | flag | addition |
|---|---|---|---|---|---|
| A | 3 | 2013-01-04 | 1001 | sell | 5051 |
| | 5 | 2013-01-06 | 1350 | sell | 5051 |
| B | 7 | 2013-03-09 | 1150 | sell | 4830 |
| | 9 | 2013-03-11 | 2000 | sell | 4830 |
| C | 11 | 2013-01-06 | 45000 | sell | 101800 |
| E | 13 | 2013-05-20 | 1300 | sell | 30800 |
| | 17 | 2013-05-24 | 8000 | sell | 30800 |
| | 19 | 2013-05-26 | 1900 | sell | 30800 |
| | 21 | 2013-05-28 | 1900 | sell | 30800 |
Group by two columns
The examples below show you how to group by multiple columns.
We are not going to go through new material, we are just grouping by 2 columns. You got this!
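Grouping by two columns just means passing a list of column names:

```python
import pandas as pd

df = pd.DataFrame({
    'Symbol': ['A', 'A', 'A', 'B'],
    'flag':   ['buy', 'sell', 'buy', 'buy'],
    'Volume': [100, 200, 300, 400],
})

# Passing a list groups by both columns; the result is
# indexed by a (Symbol, flag) MultiIndex
two = df.groupby(['Symbol', 'flag']).sum(numeric_only=True)
print(two)
```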
| Symbol | flag | Volume | addition |
|---|---|---|---|
| A | buy | 2500 | 15153 |
| | sell | 2551 | 15153 |
| B | buy | 1680 | 9660 |
| | sell | 3150 | 9660 |
| C | buy | 56800 | 203600 |
| | sell | 45000 | 101800 |
| E | buy | 16800 | 123200 |
| | sell | 14000 | 154000 |
| | Date | Symbol | Volume | flag | addition |
|---|---|---|---|---|---|
| 0 | 2013-01-01 | A | 0 | buy | 2500 |
| 1 | 2013-01-02 | A | 200 | sell | 2551 |
| 2 | 2013-01-03 | A | 1200 | buy | 2500 |
| 3 | 2013-01-04 | A | 1001 | sell | 2551 |
| 4 | 2013-01-05 | A | 1300 | buy | 2500 |
| Symbol | flag | | Date | Volume | addition |
|---|---|---|---|---|---|
| A | buy | 2 | 2013-01-03 | 1200 | 2500 |
| | | 4 | 2013-01-05 | 1300 | 2500 |
| | sell | 3 | 2013-01-04 | 1001 | 2551 |
| | | 5 | 2013-01-06 | 1350 | 2551 |
| B | buy | 8 | 2013-03-10 | 1180 | 1680 |
| | sell | 7 | 2013-03-09 | 1150 | 3150 |
| | | 9 | 2013-03-11 | 2000 | 3150 |
| C | buy | 10 | 2013-01-05 | 56600 | 56800 |
| | sell | 11 | 2013-01-06 | 45000 | 45000 |
| E | buy | 14 | 2013-05-21 | 1700 | 16800 |
| | | 16 | 2013-05-23 | 2100 | 16800 |
| | | 18 | 2013-05-25 | 12000 | 16800 |
| | sell | 13 | 2013-05-20 | 1300 | 14000 |
| | | 17 | 2013-05-24 | 8000 | 14000 |
| | | 19 | 2013-05-26 | 1900 | 14000 |
| | | 21 | 2013-05-28 | 1900 | 14000 |