Note

Description:

Imagine you’re a fintech professional working for a leading investment firm. Your task is to analyze trading data to identify market trends and optimize investment strategies.

Tasks:

  • Reshape and Pivot: Reshape the data to create a pivot table that shows the average open_price and close_price for each stock_symbol on a monthly basis. Use the trade_date column to extract the month.
  • Data Merging: Merge the trading_data with another dataset, stock_info, which contains additional information about each stock. The stock_info dataset has the following columns: stock_symbol, sector, and industry. Merge the two datasets on the stock_symbol column and create a new column, sector_average, which calculates the average close_price for each sector-industry.
  • Feature Engineering: Create a new column, price_change, which calculates the daily percentage change in close_price for each stock. Then, create another column, trend, which categorizes the price_change into three groups: ‘Up’ (above 1%), ‘Down’ (below -1%), and ‘Neutral’ (between -1% and 1%).
# import libraries
import pandas as pd
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

The dataset, trading_data, contains 1,830 rows of trading information for various stocks over a year.

Columns:

stock_symbol: Unique stock identifier
trade_date: Date of the trade
open_price: Stock price at market open
high_price: Highest stock price of the day
low_price: Lowest stock price of the day
close_price: Stock price at market close
volume: Number of shares traded

# set the seed
np.random.seed(0)
 
# generate trading data
trade_date = pd.date_range('2023-01-01', '2024-01-01', freq='D')
stock_symbols = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'FB']
open_price = np.random.uniform(100, 500, size=(len(trade_date), len(stock_symbols)))
high_price = open_price + np.random.uniform(0, 50, size=(len(trade_date), len(stock_symbols)))
low_price = open_price - np.random.uniform(0, 50, size=(len(trade_date), len(stock_symbols)))
close_price = open_price + np.random.uniform(-20, 20, size=(len(trade_date), len(stock_symbols)))
volume = np.random.randint(10000, 100000, size=(len(trade_date), len(stock_symbols)))
 
trading_data = pd.DataFrame({
    'trade_date': np.repeat(trade_date, len(stock_symbols)),
    'stock_symbol': np.tile(stock_symbols, len(trade_date)),
    'open_price': open_price.flatten(),
    'high_price': high_price.flatten(),
    'low_price': low_price.flatten(),
    'close_price': close_price.flatten(),
    'volume': volume.flatten()
})
 
# generate stock info data
stock_info = pd.DataFrame({
    'stock_symbol': stock_symbols,
    'sector': ['Tech', 'Tech', 'Tech', 'Tech', 'Tech'],
    'industry': ['Software', 'Internet', 'Software', 'E-commerce', 'Social Media']
})

Let us take a look at the data and the data types.

trading_data.head()
   trade_date stock_symbol  open_price  high_price   low_price  close_price  volume
0  2023-01-01         AAPL  319.525402  335.210154  292.730235   331.208465   59105
1  2023-01-01         GOOG  386.075747  411.696171  340.855887   392.998029   12378
2  2023-01-01         MSFT  341.105350  356.190429  326.361387   338.673429   82360
3  2023-01-01         AMZN  317.953273  361.044423  302.510066   318.140688   30510
4  2023-01-01           FB  269.461920  311.678270  238.102758   285.304010   22898
trading_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1830 entries, 0 to 1829
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   trade_date    1830 non-null   datetime64[ns]
 1   stock_symbol  1830 non-null   object        
 2   open_price    1830 non-null   float64       
 3   high_price    1830 non-null   float64       
 4   low_price     1830 non-null   float64       
 5   close_price   1830 non-null   float64       
 6   volume        1830 non-null   int32         
dtypes: datetime64[ns](1), float64(4), int32(1), object(1)
memory usage: 93.1+ KB
stock_info.head()
  stock_symbol sector      industry
0         AAPL   Tech      Software
1         GOOG   Tech      Internet
2         MSFT   Tech      Software
3         AMZN   Tech    E-commerce
4           FB   Tech  Social Media
stock_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   stock_symbol  5 non-null      object
 1   sector        5 non-null      object
 2   industry      5 non-null      object
dtypes: object(3)
memory usage: 252.0+ bytes

Reshape and Pivot

Reshape the data to create a pivot table that shows the average open_price and close_price for each stock_symbol on a monthly basis. Use the trade_date column to extract the month.

To extract the year and month from a datetime column, we can use the dt.to_period('M') accessor as shown below.

trading_data['trade_date'].dt.to_period('M')
0       2023-01
1       2023-01
2       2023-01
3       2023-01
4       2023-01
         ...   
1825    2024-01
1826    2024-01
1827    2024-01
1828    2024-01
1829    2024-01
Name: trade_date, Length: 1830, dtype: period[M]
# reshape the data via pivot_table
pivot = trading_data.pivot_table(
    index=trading_data['trade_date'].dt.to_period('M'),
    columns='stock_symbol',
    values=['open_price', 'close_price'],
    aggfunc='mean'
)
pivot.head()
             close_price                                                      open_price
stock_symbol        AAPL        AMZN          FB        GOOG        MSFT            AAPL        AMZN          FB        GOOG        MSFT
trade_date
2023-01       296.842894  340.198596  280.793326  292.595610  308.192058      298.096047  338.740435  277.524525  294.751026  303.295615
2023-02       310.812102  296.270349  299.107653  294.990639  290.411256      309.474246  294.099067  301.515984  295.286629  292.067761
2023-03       280.517658  305.165263  306.077719  305.371521  257.166686      281.888576  305.294154  310.372706  305.474469  256.246475
2023-04       262.306455  316.037693  340.514381  286.895555  311.488378      264.836217  316.830908  345.158919  285.310857  309.797786
2023-05       272.245808  275.654902  305.448578  304.926896  283.436121      273.720308  276.731507  302.508553  308.299429  284.021319
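The pivot's columns form a MultiIndex of (value, stock_symbol), which reads awkwardly when you scan across a row. One option, sketched below on a small hypothetical frame with made-up symbols and prices, is to swap the column levels so each symbol groups its own open/close pair:

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the symbols and prices here are made up.
np.random.seed(0)
dates = pd.date_range('2023-01-01', periods=90, freq='D')
symbols = ['AAPL', 'GOOG']
df = pd.DataFrame({
    'trade_date': np.repeat(dates, len(symbols)),
    'stock_symbol': np.tile(symbols, len(dates)),
    'open_price': np.random.uniform(100, 500, len(dates) * len(symbols)),
    'close_price': np.random.uniform(100, 500, len(dates) * len(symbols)),
})

pivot = df.pivot_table(
    index=df['trade_date'].dt.to_period('M'),
    columns='stock_symbol',
    values=['open_price', 'close_price'],
    aggfunc='mean'
)

# Swap the (value, symbol) column levels so each symbol carries its own
# open/close pair, then sort for a stable left-to-right order.
tidy = pivot.swaplevel(axis=1).sort_index(axis=1)
```

Whether the swapped layout is clearer depends on how you read the table; both orderings contain the same numbers.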

Data Merging

Merge the trading_data with another dataset, stock_info, which contains additional information about each stock. The stock_info dataset has the following columns: stock_symbol, sector, and industry. Merge the two datasets on the stock_symbol column and create a new column, sector_average, which calculates the average close_price for each sector-industry.

# merge the two dataframes
df = trading_data.merge(stock_info, on='stock_symbol')
df.head()
   trade_date stock_symbol  open_price  high_price   low_price  close_price  volume sector      industry
0  2023-01-01         AAPL  319.525402  335.210154  292.730235   331.208465   59105   Tech      Software
1  2023-01-01         GOOG  386.075747  411.696171  340.855887   392.998029   12378   Tech      Internet
2  2023-01-01         MSFT  341.105350  356.190429  326.361387   338.673429   82360   Tech      Software
3  2023-01-01         AMZN  317.953273  361.044423  302.510066   318.140688   30510   Tech    E-commerce
4  2023-01-01           FB  269.461920  311.678270  238.102758   285.304010   22898   Tech  Social Media

Notice that I used transform in order to add the group averages back to the original dataframe.

# create group object
group = df.groupby(['sector','industry'])
 
# calculate the average close_price for each sector
df['sector_average'] = group['close_price'].transform('mean')
df.head()
   trade_date stock_symbol  open_price  high_price   low_price  close_price  volume sector      industry  sector_average
0  2023-01-01         AAPL  319.525402  335.210154  292.730235   331.208465   59105   Tech      Software      292.958217
1  2023-01-01         GOOG  386.075747  411.696171  340.855887   392.998029   12378   Tech      Internet      298.679561
2  2023-01-01         MSFT  341.105350  356.190429  326.361387   338.673429   82360   Tech      Software      292.958217
3  2023-01-01         AMZN  317.953273  361.044423  302.510066   318.140688   30510   Tech    E-commerce      305.923249
4  2023-01-01           FB  269.461920  311.678270  238.102758   285.304010   22898   Tech  Social Media      317.740507

If you want to make sure the new column named “sector_average” has the correct values, you can do a quick check against the group means below.

group['close_price'].mean()
sector  industry    
Tech    E-commerce      305.923249
        Internet        298.679561
        Social Media    317.740507
        Software        292.958217
Name: close_price, dtype: float64
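transform is the most direct route, but the same result can be reached by aggregating once and merging the averages back onto every row. A minimal sketch on a toy frame with assumed prices (not the generated dataset):

```python
import pandas as pd

# Toy frame standing in for the merged df; the prices are assumed values.
df = pd.DataFrame({
    'sector': ['Tech'] * 4,
    'industry': ['Software', 'Software', 'Internet', 'Internet'],
    'close_price': [100.0, 120.0, 200.0, 220.0],
})

# Aggregate once, rename, then merge the averages back onto every row.
averages = (
    df.groupby(['sector', 'industry'], as_index=False)['close_price']
      .mean()
      .rename(columns={'close_price': 'sector_average'})
)
out = df.merge(averages, on=['sector', 'industry'])
```

The merge route produces an intermediate table you can inspect, while transform keeps everything in one step; for large frames the two are broadly equivalent.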

Feature Engineering

Create a new column, price_change, which calculates the daily percentage change in close_price for each stock. Then, create another column, trend, which categorizes the price_change into three groups: ‘Up’ (above 1%), ‘Down’ (below -1%), and ‘Neutral’ (between -1% and 1%).

# sort the data
sorted_df = trading_data.sort_values(by=['stock_symbol','trade_date'])
 
# create group object
group = sorted_df.groupby('stock_symbol')
 
# calculate the price_change for the column close_price
sorted_df['pct_change'] = group['close_price'].transform(lambda x: x.pct_change())
sorted_df.head()
    trade_date stock_symbol  open_price  high_price   low_price  close_price  volume  pct_change
0   2023-01-01         AAPL  319.525402  335.210154  292.730235   331.208465   59105         NaN
5   2023-01-02         AAPL  358.357645  374.130903  333.098067   373.464608   11176    0.127582
10  2023-01-03         AAPL  416.690015  451.576435   377.748453  409.149364   99861    0.095551
15  2023-01-04         AAPL  134.851720  173.809416    93.355852  141.976877   96561   -0.652995
20  2023-01-05         AAPL  491.447337  497.050931   489.991647  471.466975   11666    2.320731
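As an aside, GroupBy objects expose pct_change directly, so the lambda inside transform is not strictly required. A small sketch with made-up close prices for two symbols:

```python
import pandas as pd

# Made-up close prices for two symbols, already sorted by symbol and date.
df = pd.DataFrame({
    'stock_symbol': ['AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG'],
    'close_price': [100.0, 110.0, 99.0, 200.0, 210.0],
})

# GroupBy exposes pct_change directly; the first row of each group is NaN
# because there is no previous day to compare against.
df['pct_change'] = df.groupby('stock_symbol')['close_price'].pct_change()
```

Because the change is computed per group, the first GOOG row is NaN rather than a comparison against the last AAPL close.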

The bins for ‘Up’ (above 1%), ‘Down’ (below -1%), and ‘Neutral’ (between -1% and 1%) would be:

  • Down: (-∞, -0.01]
  • Neutral: (-0.01, 0.01]
  • Up: (0.01, ∞)
bins = [-float('inf'), -0.01, 0.01, float('inf')]
labels = ['Down', 'Neutral', 'Up']
 
# create the new column named "trend"
sorted_df['trend'] = pd.cut(sorted_df['pct_change'], bins=bins, labels=labels)
sorted_df.head()
    trade_date stock_symbol  open_price  high_price   low_price  close_price  volume  pct_change  trend
0   2023-01-01         AAPL  319.525402  335.210154  292.730235   331.208465   59105         NaN    NaN
5   2023-01-02         AAPL  358.357645  374.130903  333.098067   373.464608   11176    0.127582     Up
10  2023-01-03         AAPL  416.690015  451.576435   377.748453  409.149364   99861    0.095551     Up
15  2023-01-04         AAPL  134.851720  173.809416    93.355852  141.976877   96561   -0.652995   Down
20  2023-01-05         AAPL  491.447337  497.050931   489.991647  471.466975   11666    2.320731     Up
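A quick sanity check on the result is to count how many rows land in each category. The sketch below runs the same binning on a handful of made-up percentage changes:

```python
import pandas as pd

# A handful of made-up daily percentage changes; None stands in for the
# first day of a symbol, where no change can be computed.
pct = pd.Series([0.05, -0.02, 0.004, None, -0.009, 0.011])

bins = [-float('inf'), -0.01, 0.01, float('inf')]
labels = ['Down', 'Neutral', 'Up']
trend = pd.cut(pct, bins=bins, labels=labels)

# value_counts excludes the NaN input by default, so the counts only
# cover rows that actually fall into a bin.
counts = trend.value_counts()
```

Note the bin edges: with pd.cut's default right-closed intervals, a change of exactly -1% falls in 'Down' and exactly +1% falls in 'Neutral', matching the interval notation listed above.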

Summary:

The tutorial demonstrated how to analyze trading data using Pandas. It covered reshaping and pivoting data, merging datasets, and feature engineering.

Key Takeaways:

  • How to reshape and pivot data using pivot_table and dt.to_period.
  • How to merge datasets using merge and create new columns using transform.
  • How to calculate daily percentage changes using pct_change and categorize changes using pd.cut.
  • How to use groupby to calculate group averages and add them back to the original dataframe.
  • How to sort data by multiple columns using sort_values.
  • How to use bins and labels to categorize data into groups.