Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- 🔥 Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description: You are a data analyst for a movie streaming service. You have been tasked with analyzing a dataset of movie ratings to determine which genres are the most popular among users.
The dataset contains the following columns:
- user_id: Unique identifier for each user
- movie_id: Unique identifier for each movie
- rating: Rating given by the user to the movie (on a scale of 1-5)
- genre: Genre of the movie (e.g. Action, Comedy, Drama, etc.)
Your task is to:
- Load the dataset into a Pandas DataFrame
- Group the data by genre and calculate the average rating for each genre
- Sort the results in descending order by average rating
Data:
You can use the following sample data to get started:
user_id,movie_id,rating,genre
1,101,4,Action
1,102,3,Comedy
2,101,5,Action
2,103,4,Drama
3,102,2,Comedy
3,104,5,Action
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
user_id | movie_id | rating | genre | |
---|---|---|---|---|
0 | 1 | 101 | 4 | Action |
1 | 1 | 102 | 3 | Comedy |
2 | 2 | 101 | 5 | Action |
3 | 2 | 103 | 4 | Drama |
4 | 3 | 102 | 2 | Comedy |
5 | 3 | 104 | 5 | Action |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 6 non-null int64
1 movie_id 6 non-null int64
2 rating 6 non-null int64
3 genre 6 non-null object
dtypes: int64(3), object(1)
memory usage: 324.0+ bytes
genre
Action 4.666667
Comedy 2.500000
Drama 4.000000
Name: rating, dtype: float64
I decided to place it all in one line. Yes, it is a bit ugly.
Here is what I did:
- I decided to merge the Series that contains the average ratings with the original dataframe via the column named genre
- I renamed the Series so the column names were clear
- I finally sorted the values descending
user_id | movie_id | rating | average_rating | |
---|---|---|---|---|
genre | ||||
Action | 1 | 101 | 4 | 4.666667 |
Action | 2 | 101 | 5 | 4.666667 |
Action | 3 | 104 | 5 | 4.666667 |
Drama | 2 | 103 | 4 | 4.000000 |
Comedy | 1 | 102 | 3 | 2.500000 |
Comedy | 3 | 102 | 2 | 2.500000 |
If all you needed to see was the averages…
genre
Action 4.666667
Drama 4.000000
Comedy 2.500000
Name: rating, dtype: float64
Summary:
This tutorial guided you through the analysis of a movie ratings dataset using Pandas. It covered loading data, grouping by genre, calculating average ratings, merging data, and sorting results.
Key Takeaways:
- How to load data from a clipboard into a Pandas DataFrame using
pd.read_clipboard()
- Understanding data types using
df.info()
- Grouping data by a column (genre) using
df.groupby()
- Calculating the average rating for each group using
group['rating'].mean()
- Merging data from a Series into the original DataFrame using
df.merge()
ordf.set_index().merge()
- Renaming columns using
rename()
- Sorting data in descending order using
sort_values()