Note

Description: You are a data analyst at a school, and you need to analyze the performance of students in a mathematics exam. The school has provided you with a dataset containing the scores of 50 students in three subjects: Algebra, Geometry, and Calculus.

Your task is to:

  • Load the dataset into a Pandas DataFrame
  • Calculate the average score for each subject
  • Identify the top 5 students with the highest overall score (average of all three subjects)
  • Create a new column to indicate whether each student passed or failed the exam (passing score is 80 or higher)
# import libraries
import pandas as pd
import numpy as np
import sys
 
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Numpy version ' + np.__version__)
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4

The Data

The dataset contains information about the performance of 50 students in a mathematics exam, covering three subjects: Algebra, Geometry, and Calculus.

Columns:

  • Student_ID: A unique identifier for each student (integer).
  • Algebra: The student’s score in Algebra (integer).
  • Geometry: The student’s score in Geometry (integer).
  • Calculus: The student’s score in Calculus (integer).
Student_ID	Algebra	Geometry	Calculus
1	85	90	78
2	92	88	95
3	78	75	82
4	95	92	89
5	88	85	92
6	76	78	85
7	89	91	96
8	83	80	88
9	90	95	92
10	81	83	86
11	86	89	93
12	93	96	91
13	84	82	89
14	91	94	97
15	79	81	87
16	87	86	90
17	94	97	95
18	82	84	91
19	96	98	99
20	80	79	83
21	98	99	100
22	75	77	81
23	92	93	98
24	89	92	95
25	77	76	80
26	95	96	98
27	81	82	87
28	90	91	94
29	78	80	84
30	97	99	100
31	84	86	90
32	91	94	97
33	76	78	82
34	88	90	93
35	82	84	89
36	96	98	100
37	85	87	92
38	79	81	86
39	93	95	98
40	80	82	85
41	87	89	94
42	94	96	99
43	83	85	91
44	92	94	97
45	77	79	83
46	90	92	95
47	86	88	93
48	98	100	100
49	81	83	88
50	95	97	99
df = pd.read_clipboard()
df
Student_IDAlgebraGeometryCalculus
01859078
12928895
23787582
34959289
45888592
56767885
67899196
78838088
89909592
910818386
1011868993
1112939691
1213848289
1314919497
1415798187
1516878690
1617949795
1718828491
1819969899
1920807983
20219899100
2122757781
2223929398
2324899295
2425777680
2526959698
2627818287
2728909194
2829788084
29309799100
3031848690
3132919497
3233767882
3334889093
3435828489
35369698100
3637858792
3738798186
3839939598
3940808285
4041878994
4142949699
4243838591
4344929497
4445777983
4546909295
4647868893
474898100100
4849818388
4950959799
# make sure data types look good
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Student_ID  50 non-null     int64
 1   Algebra     50 non-null     int64
 2   Geometry    50 non-null     int64
 3   Calculus    50 non-null     int64
dtypes: int64(4)
memory usage: 1.7 KB
# calculate the average score for each subject
df['median'] = df[['Algebra', 'Geometry', 'Calculus']].median(axis=1) # I usually choose the median over the average
 
# identify the top 5 students with the highest overall score (average of all three subjects)
df.sort_values(by='median', ascending=False).head()
Student_IDAlgebraGeometryCalculusmedian
474898100100100.0
2930979910099.0
2021989910099.0
181996989998.0
3536969810098.0
# create a new column to indicate whether each student passed or failed the exam (passing score is 80 or higher)
df['passFail'] = df['median'].apply(lambda x: "pass" if x >= 80 else "fail")
df
Student_IDAlgebraGeometryCalculusmedianpassFail
0185907885.0pass
1292889592.0pass
2378758278.0fail
3495928992.0pass
4588859288.0pass
5676788578.0fail
6789919691.0pass
7883808883.0pass
8990959292.0pass
91081838683.0pass
101186899389.0pass
111293969193.0pass
121384828984.0pass
131491949794.0pass
141579818781.0pass
151687869087.0pass
161794979595.0pass
171882849184.0pass
181996989998.0pass
192080798380.0pass
2021989910099.0pass
212275778177.0fail
222392939893.0pass
232489929592.0pass
242577768077.0fail
252695969896.0pass
262781828782.0pass
272890919491.0pass
282978808480.0pass
2930979910099.0pass
303184869086.0pass
313291949794.0pass
323376788278.0fail
333488909390.0pass
343582848984.0pass
3536969810098.0pass
363785879287.0pass
373879818681.0pass
383993959895.0pass
394080828582.0pass
404187899489.0pass
414294969996.0pass
424383859185.0pass
434492949794.0pass
444577798379.0fail
454690929592.0pass
464786889388.0pass
474898100100100.0pass
484981838883.0pass
495095979997.0pass

BONUS

def pass_fail(score):
    if score >= 80:
        return "pass"
    else:
        return "fail"
 
# calculate pass fail on all 3 subjects
label = df[['Algebra', 'Geometry', 'Calculus']].map(pass_fail)
label.head()
AlgebraGeometryCalculus
0passpassfail
1passpasspass
2failfailpass
3passpasspass
4passpasspass
# merge the two dataframes
df.merge(right=label,left_index=True, right_index=True)
AlgebraGeometryCalculus
0passpassfail
1passpasspass
2failfailpass
3passpasspass
4passpasspass

Summary:

The tutorial demonstrated how to analyze a dataset of student performance in a mathematics exam using Pandas. It covered importing libraries, loading data, checking data types, calculating average scores, identifying top performers, and creating new columns to indicate pass/fail status.

Key Takeaways:

  • Loading data from clipboard into a Pandas DataFrame
  • Checking data types and info using df.info() and df.head()
  • Calculating median/average scores using df.median() or df.mean()
  • Creating new columns using df['new_column']
  • Sorting data using df.sort_values() and selecting top rows using df.head()
  • Applying conditional logic using apply() and lambda functions (e.g., pass/fail status)
  • Merging DataFrames using df.merge()
  • Using map function to apply pass/fail logic to individual subjects