Note
- This tutorial is also available on nbviewer, offering an alternative platform for your learning convenience.
- 🔥 Free Pandas Course: https://hedaro.gumroad.com/l/tqqfq
Description:
Oh man, have you heard about the latest struggles at Telco Telecom Company? Their customer service team is overwhelmed with calls and emails from frustrated customers experiencing dropped calls, slow data speeds, and unexpected charges on their bills. The team needs help analyzing customer data to identify trends, resolve issues efficiently, and improve overall customer satisfaction.
Tasks:
- Trend Analysis: Identify the top 5 reasons for customer complaints and create a bar chart to visualize the results.
- Issue Resolution: Write a function to categorize customer complaints by severity (high, medium, low) based on keywords in the issue description.
- Phone Number Cleanup: Clean up malformed phone numbers in the dataset by extracting the numeric portion and formatting it as XXX-XXX-XXXX.
Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Pandas version 2.2.1
Numpy version 1.26.4
The Data
The dataset contains 10,000 customer records with information about their phone usage, issues they’ve experienced (such as dropped calls or slow data), and severity of those issues. The data also includes malformed phone numbers that need to be cleaned and standardized.
Columns:
- customer_id: Unique customer identifier
- name: Customer name
- phone_number: Customer phone number (some malformed)
- issue_description: Brief description of the customer’s issue
- issue_severity: Severity level of the issue (High, Medium, Low). Note: column created in Task #2
- calls: Number of calls made by the customer
- texts: Number of texts sent by the customer
- data_usage: Amount of data used by the customer (in MB)
customer_id | name | phone_number | issue_description | calls | texts | data_usage | |
---|---|---|---|---|---|---|---|
0 | 0 | Customer 0 | 555-795-4375 | Poor reception | 82 | 543 | 2696 |
1 | 1 | Customer 1 | 555-946-5970 | Poor reception | 98 | 22 | 6368 |
2 | 2 | Customer 2 | 555-818-9002 | Slow data | 1 | 973 | 2280 |
3 | 3 | Customer 3 | 555-388-6985 | Unexpected charge | 88 | 196 | 2595 |
4 | 4 | Customer 4 | 555-915-7855 | Poor reception | 40 | 583 |
Let us take a look at the data types.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customer_id 10000 non-null int32
1 name 10000 non-null object
2 phone_number 10000 non-null object
3 issue_description 10000 non-null object
4 calls 10000 non-null int32
5 texts 10000 non-null int32
6 data_usage 10000 non-null int32
dtypes: int32(4), object(3)
memory usage: 390.8+ KB
Trend Analysis:
Identify the top 5 reasons for customer complaints and create a bar chart to visualize the results.
From the results below, we end up with the “Other” category as the most frequent customer complaint. If you want to dig a little deeper, you will have to determine how this category is created.
Ask these questions:
- Am I able to see the details of the call that ended up in the “Other” category?
- Is there a specific timeframe that these calls were placed in?
- Is there a team member that may have some background on the “Other” calls?
You may have some further clean up of the data, depending on what you are able to find.
issue_description
Other 2020
Unexpected charge 2015
Poor reception 1996
Slow data 1992
Dropped call 1977
Name: count, dtype: int64
Now, let us create the bar chart. We will also sort the results so it makes it easier to see the most frequent customer complaints.
Issue Resolution:
Write a function to categorize customer complaints by severity (high, medium, low) based on keywords in the issue description.
Notes on the function named get_severity
:
**dict.fromkeys()
is a Python expression that creates a new dictionary from an iterable (like a list or tuple) and assigns a default value to each key.
Let us start by explaining how I created the dictionary:
dict.fromkeys()
is a method that creates a new dictionary from an iterable.- The ** operator is used to unpack the iterable and pass it as arguments to the
fromkeys()
method.
Below is the created dictionary.
{'other': 'low',
'reception': 'medium',
'data': 'medium',
'unexpected': 'high',
'charge': 'high',
'dropped': 'high'}
Here is an example run of the function. Assume the string being sent to the function is “slow data”.
- The mapping dict is created
- We start to loop through the keys of the dict
- The first key is “other”, we make sure it is lower case and we ask, is “other” in the string “slow data”?
- The answer is no, and we move on to the next key
- The second key is “reception”, we make sure it is lower case and we ask, is “reception” in the string “slow data”?
- The answer is no, and we move on to the next key
- The third key is “data”, we make sure it is lower case and we ask, is “data” in the string “slow data”?
- The answer is yes
- We then return the dict value of “data”, which in this case is “medium”
customer_id | name | phone_number | issue_description | calls | texts | data_usage | issue_severity | |
---|---|---|---|---|---|---|---|---|
0 | 0 | Customer 0 | 555-795-4375 | Poor reception | 82 | 543 | 2696 | medium |
1 | 1 | Customer 1 | 555-946-5970 | Poor reception | 98 | 22 | 6368 | medium |
2 | 2 | Customer 2 | 555-818-9002 | Slow data | 1 | 973 | 2280 | medium |
3 | 3 | Customer 3 | 555-388-6985 | Unexpected charge | 88 | 196 | 2595 | high |
4 | 4 | Customer 4 | 555-915-7855 | Poor reception | 40 | 583 | 8574 | medium |
Phone Number Cleanup:
Clean up malformed phone numbers in the dataset by extracting the numeric portion and formatting it as XXX-XXX-XXXX.
For example: “555-123-4567 ext 101” should become “555-123-4567”
I copied the “phone_number” column into Excel via pd.to_clipboard()
to visually verify that the phone numbers where in fact clean. I noticed some of my strings were of length 13 and that was due to an extra space at the end of the strings.
I noticed that some strings had extra spaces at the end, so I added .strip()
to the return statement to handle this.
How would you have solved the extra space at the end of the string?
customer_id | name | phone_number | issue_description | calls | texts | data_usage | issue_severity | |
---|---|---|---|---|---|---|---|---|
0 | 0 | Customer 0 | 555-795-4375 | Poor reception | 82 | 543 | 2696 | medium |
1 | 1 | Customer 1 | 555-946-5970 | Poor reception | 98 | 22 | 6368 | medium |
2 | 2 | Customer 2 | 555-818-9002 | Slow data | 1 | 973 | 2280 | medium |
3 | 3 | Customer 3 | 555-388-6985 | Unexpected charge | 88 | 196 | 2595 | high |
4 | 4 | Customer 4 | 555-915-7855 | Poor reception | 40 | 583 | 8574 | medium |
Summary
The tasks in this tutorial covered data manipulation, data cleaning, and data visualization, which are essential skills for working with real-world datasets.
What you learned:
- How to perform trend analysis by counting the frequency of issue descriptions and visualizing the results in a bar chart
- How to write a function to categorize customer complaints by severity based on keywords in the issue description
- How to clean up malformed phone numbers in the dataset using regular expressions
- How to apply functions to Pandas DataFrames using the
apply()
method - How to create new columns in a DataFrame based on the values of another column
- How to use the
value_counts()
method to get a frequency count of a column - How to use the
plot.barh()
method to create a horizontal bar chart - How to use the
re.sub()
method to replace patterns in strings using regular expressions