Note

Let's practice Pandas using the GroupBy function.

Import Libraries

import pandas as pd
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1

The Data


Here is the csv data if you want to follow along:

Date,Symbol,Volume
1/1/2013,A,0
1/2/2013,A,200
1/3/2013,A,1200
1/4/2013,A,1001
1/5/2013,A,1300
1/6/2013,A,1350
3/8/2013,B,500
3/9/2013,B,1150
3/10/2013,B,1180
3/11/2013,B,2000
1/5/2013,C,56600
1/6/2013,C,45000
1/7/2013,C,200
5/20/2013,E,1300
5/21/2013,E,1700
5/22/2013,E,900
5/23/2013,E,2100
5/24/2013,E,8000
5/25/2013,E,12000
5/26/2013,E,1900
5/27/2013,E,1000
5/28/2013,E,1900
 

Read CSV file

For those who are following along, note that the code below will most likely return an error for you (the file Test_9_17_Python.csv is not on your computer).

Here are two options for you:

  • Use the raw = pd.read_clipboard(sep=',') method
  • Manually create the CSV file and save it in the same location as this notebook
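A third option, not from the original notebook but handy if you want something fully self-contained, is to read the CSV text straight from a string using io.StringIO from the standard library:

import io
import pandas as pd
 
# paste the CSV text from above between the triple quotes
csv_text = """Date,Symbol,Volume
1/1/2013,A,0
1/2/2013,A,200
1/3/2013,A,1200"""  # ...continue with the remaining rows
 
# read_csv accepts any file-like object, so StringIO stands in for the file on disk
raw = pd.read_csv(io.StringIO(csv_text))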

As you can see, we are working with daily trading volume for various stocks from January 1, 2013, to May 28, 2013.

Here are the stock Symbols: A, B, C, and E

raw = pd.read_csv('Test_9_17_Python.csv')
raw.head()
       Date Symbol  Volume
0  1/1/2013      A       0
1  1/2/2013      A     200
2  1/3/2013      A    1200
3  1/4/2013      A    1001
4  1/5/2013      A    1300

At times I like to make a copy of the original dataframe just in case I have to go back to the original data. Yes, we could just recreate the data, but at times I'd rather have it handy.

The copy() function allows you to create a truly independent copy of the data, so changes to the new object won't affect the original.

df = raw.copy()
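A quick way to convince yourself of this (my own throwaway check, not part of the original walkthrough) is to edit a copy and confirm the original is untouched:

# modify a throwaway copy and confirm the original dataframe is unaffected
tmp = raw.copy()
tmp.loc[0, 'Volume'] = 999
print(raw.loc[0, 'Volume'])   # still 0, because tmp is an independent copy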

Fix Date Column

Whenever you import data from basically anywhere, please check your data types. You want to make sure the data is in the correct format before you begin to work with it.

As you can see below, the Date column came in as a string (dtype object). We need to fix this.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    22 non-null     object
 1   Symbol  22 non-null     object
 2   Volume  22 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 660.0+ bytes

Pandas to the rescue!

We can automagically convert our dates, currently represented as strings, into datetime objects. Note that it's not always this easy, so expect to battle date strings in the future.

df['Date'] = pd.to_datetime(df['Date'])
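If Pandas ever guesses wrong (or the conversion feels slow on a large file), you can spell out the format yourself. This is just a hedged alternative; the format string below assumes month/day/year, which is what this particular file uses:

# alternative to the line above: parse explicitly as month/day/year
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')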

We now have all our columns looking good.

  • Our dates are datetime objects
  • The symbols are strings
  • The trading volumes are integers
df.dtypes

Date      datetime64[ns]
Symbol            object
Volume             int64
dtype: object

Add a New Column

Let's assign a buy or sell flag to each row in this dataset (the code below simply alternates between the two). This will give us more data to work with and show off some of Pandas' GroupBy capabilities.

# here are our buy and sell options
pool = ['buy','sell']
 
# create a list of flags with the correct length
# Since the list has 2 items, I just repeated it for half the number of rows in the dataframe. Can you think of a better solution? (one possible answer is sketched after the output below)
pool = pool*int(len(df)/2)
 
# how many strings did we just create?
print('We created ' + str(len(pool)) + ' strings')
 
pool
We created 22 strings

['buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell',
 'buy',
 'sell']
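One possible answer to the question in the comment above: repeating the list only works when the dataframe has an even number of rows. Here is a sketch (my own, using the standard library) that handles any length, with a genuinely random alternative commented out:

from itertools import cycle, islice
 
# alternate buy/sell for exactly len(df) rows, even when the row count is odd
pool = list(islice(cycle(['buy', 'sell']), len(df)))
 
# or, for a genuinely random flag per row:
# import numpy as np
# pool = np.random.choice(['buy', 'sell'], size=len(df)).tolist()
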
# add the new column
df['flag'] = pool
 
# print top 5 rows
df.head()
       Date Symbol  Volume  flag
0  1/1/2013      A       0   buy
1  1/2/2013      A     200  sell
2  1/3/2013      A    1200   buy
3  1/4/2013      A    1001  sell
4  1/5/2013      A    1300   buy

Group by one column

At a high level, the GroupBy method will:

  • Group your data by one or more columns
  • And then perform some action on each group (like calculate the sum, mean, etc.)

In the example below, we are grouping by the column named Symbol and then adding all the values found in the Volume column for each group.

As a side note: if my dataframe has non-numeric columns (in this case it's the flag column), I have been adding numeric_only=True to avoid the Pandas warning message. It's kind of annoying otherwise.

# create group object
group = df.groupby('Symbol')
 
# perform a sum function on the group
group.sum(numeric_only=True)
        Volume
Symbol        
A         5051
B         4830
C       101800
E        30800
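A more compact spelling (my sketch, not the notebook's code) is to select the Volume column before aggregating, which keeps the non-numeric flag column out of the sum and makes numeric_only unnecessary:

# select just the Volume column, then sum within each Symbol group
df.groupby('Symbol')['Volume'].sum()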

Let's do a Deep Dive

When you use the groupby function in Pandas, you are grouping your data by one or more columns. These columns are called the "grouping columns" or "index columns".

In our example above we grouped by the column Symbol:

group = df.groupby('Symbol')

  • The ā€œSymbolā€ column becomes the grouping column

When you apply a function to this grouped DataFrame using apply, the function is currently applied to both the grouping columns and the data columns.

  • in this case, the grouping column is ā€œSymbolā€
  • in this case, the data column is ā€œVolumeā€

However, in future versions of Pandas, the grouping columns will be excluded from the operation by default. This means that the function will only be applied to the data columns (ā€œVolumeā€ in this example), and not to the grouping column named ā€œSymbolā€.

This change is being made to prevent unexpected behavior and to make the apply function more predictable. If you want to include the grouping columns in the operation, you will need to explicitly select them or pass include_groups=True.

def suma(group):
    ''' perform a sum on the group '''
    return group.sum(numeric_only=True)
 
group.apply(suma, include_groups=False)
        Volume
Symbol        
A         5051
B         4830
C       101800
E        30800

Let's mess around with the transform method. Unlike the previous example, transform allows us to keep the original shape of our dataframe.

To avoid any issues with the non-numeric flag column, I excluded it while creating the group object. Yes, I cheated a bit here.

# create group object
group = df[['Symbol','Volume']].groupby('Symbol')
 
def suma(group):
    ''' perform a sum on the group '''
    return group.sum(numeric_only=True)
 
# create a new column
df['addition'] = group.transform(suma)['Volume']
df.head(10)
        Date Symbol  Volume  flag  addition
0   1/1/2013      A       0   buy      5051
1   1/2/2013      A     200  sell      5051
2   1/3/2013      A    1200   buy      5051
3   1/4/2013      A    1001  sell      5051
4   1/5/2013      A    1300   buy      5051
5   1/6/2013      A    1350  sell      5051
6   3/8/2013      B     500   buy      4830
7   3/9/2013      B    1150  sell      4830
8  3/10/2013      B    1180   buy      4830
9  3/11/2013      B    2000  sell      4830
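For a plain sum like this, transform also accepts the name of a built-in aggregation as a string, so the same column can be built without defining suma at all. A hedged shortcut rather than the approach used above:

# same result as the custom suma function, using the built-in 'sum' aggregation
df['addition'] = df.groupby('Symbol')['Volume'].transform('sum')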

Here we get fancy!

The function below named test will do the following:

  • filter rows where flag is equal to "sell"
  • filter rows where Volume is greater than 1000

So we only return the rows that meet the two conditions above.

group = df.groupby('Symbol')
 
def test(group):
    # with include_groups=False the Symbol column is excluded, so within each group
    # iloc[1] is the Volume column and iloc[2] is the flag column
    mask1 = group.apply(lambda x: x.iloc[2]=='sell' and x.iloc[1]>1000, axis=1)
    return group[mask1]
 
group.apply(test, include_groups=False)
                  Date  Volume  flag  addition
Symbol                                        
A      3    2013-01-04    1001  sell      5051
       5    2013-01-06    1350  sell      5051
B      7    2013-03-09    1150  sell      4830
       9    2013-03-11    2000  sell      4830
C      11   2013-01-06   45000  sell    101800
E      13   2013-05-20    1300  sell     30800
       17   2013-05-24    8000  sell     30800
       19   2013-05-26    1900  sell     30800
       21   2013-05-28    1900  sell     30800
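Worth noting: because both conditions are row-level (nothing here depends on a per-group statistic), the same rows can be pulled with a plain boolean mask on the dataframe, no groupby required. This sketch is mine, not part of the original walkthrough:

# equivalent row-level filter without groupby (the result just lacks the Symbol group index)
df[(df['flag'] == 'sell') & (df['Volume'] > 1000)]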

Group by two columns

The examples below show you how to group by multiple columns.

We are not going to cover new material here; we are just grouping by two columns. You got this!

# create group object, remember to pass a list when grouping by multiple columns
group = df.groupby(['Symbol', 'flag'])
 
# perform a sum function on the group
group.sum(numeric_only=True)
              Volume  addition
Symbol flag                   
A      buy      2500     15153
       sell     2551     15153
B      buy      1680      9660
       sell     3150      9660
C      buy     56800    203600
       sell    45000    101800
E      buy     16800    123200
       sell    14000    154000
# apply a function to the multi column group
def suma(group):
    return group['Volume'].sum(numeric_only=True)
 
group.apply(suma, include_groups=False)
Symbol  flag
A       buy      2500
        sell     2551
B       buy      1680
        sell     3150
C       buy     56800
        sell    45000
E       buy     16800
        sell    14000
dtype: int64
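If you only care about the Volume column, agg with a column-to-function mapping gives the same numbers as the apply above while returning a regular DataFrame. Another sketch of mine:

# sum only the Volume column for each (Symbol, flag) pair
df.groupby(['Symbol', 'flag']).agg({'Volume': 'sum'})
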
# use transform
group = df[['Symbol','Volume','flag']].groupby(['Symbol', 'flag'])
 
def suma(group):
    return group.sum(numeric_only=True)
 
df['addition'] = group.transform(suma)['Volume']
df.head()
        Date Symbol  Volume  flag  addition
0 2013-01-01      A       0   buy      2500
1 2013-01-02      A     200  sell      2551
2 2013-01-03      A    1200   buy      2500
3 2013-01-04      A    1001  sell      2551
4 2013-01-05      A    1300   buy      2500
# filter away
group = df.groupby(['Symbol', 'flag'])
 
def test(group):
    # with include_groups=False the Symbol and flag columns are excluded,
    # so within each group iloc[1] is the Volume column
    mask1 = group.apply(lambda x: x.iloc[1]>1000, axis=1)
    return group[mask1]
 
group.apply(test, include_groups=False)
                         Date  Volume  addition
Symbol flag                                    
A      buy  2      2013-01-03    1200      2500
            4      2013-01-05    1300      2500
       sell 3      2013-01-04    1001      2551
            5      2013-01-06    1350      2551
B      buy  8      2013-03-10    1180      1680
       sell 7      2013-03-09    1150      3150
            9      2013-03-11    2000      3150
C      buy  10     2013-01-05   56600     56800
       sell 11     2013-01-06   45000     45000
E      buy  14     2013-05-21    1700     16800
            16     2013-05-23    2100     16800
            18     2013-05-25   12000     16800
       sell 13     2013-05-20    1300     14000
            17     2013-05-24    8000     14000
            19     2013-05-26    1900     14000
            21     2013-05-28    1900     14000