Every day, we generate a huge amount of text online, but analyzing this text data isn't an easy task: converting text into structured information that a machine can analyze is complex. In recent years, text mining has become a lot more accessible to data analysts, developers, and data scientists.
Text mining is the process of analyzing text to gather useful information. In today's world, text has become the most common means of communication, and text mining generally refers to analyzing large bodies of natural-language text and detecting usage patterns to extract useful information.
Now let's discuss how to analyze WhatsApp chat messages and plot a pie chart visualizing the percentage of texts that have a positive, negative, or neutral connotation or sentiment. Along the way, we will learn how to analyze WhatsApp chats using Python.
Text Mining is just a fancy term for deriving super-awesome patterns and drawing amazing inferences from Textual Data. Just like you can look at an image and infer that it is of a baby or of a 77-year-old woman, we can do the same with texts. Let’s dive straight into code to understand the text mining concept.
Note: The code and a Jupyter notebook are available on GitHub, as always!
If you have ever emailed a WhatsApp chat to yourself (or someone else, if that's how you roll), you may have noticed that it includes a handful of details that can be used to analyze a group of texts as well as the entire chat history. For those who do not know what I am talking about, here's a brief description: WhatsApp allows you to export/share your chat history with a contact as a .txt file, and it has the following structure:
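It looks something like this (an illustrative sample: the names, dates, and messages here are made up, but the "date, time - sender: message" shape matches a real export):

10/09/16, 4:09 PM - Person 1: Are you free today?
10/09/16, 4:10 PM - Person 2: In an hour, maybe?
10/09/16, 4:15 PM - Person 1: <Media omitted>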
We are going to take advantage of this data, so let's take a look at mining various patterns from chats.
Like any other project, there are a few dependencies (libraries) that you would need to have in order to analyze WhatsApp chats using Python. If you face any problem installing any of these dependencies, feel free to leave a comment!
On my own development system, I use Anaconda (with which I am not affiliated), as it provides seamless virtual environments in Python and also has pre-compiled libraries in its repositories.
Assuming you have pip installed, fire up a terminal and enter the following command:
pip install numpy matplotlib nltk
Or, if you prefer conda, you can run:
conda install numpy matplotlib nltk
To install nltk’s data, run:
python -m nltk.downloader all
After you have installed the libraries, we can go ahead and start mining! (There won’t be gold.)
Create a new directory — you can call it whatever you like, but for me, it is WhatsApp-chat-analysis — and open it in your favorite code editor. While you are at it, open WhatsApp on your smartphone and export your chat with any of your contacts (preferably someone you talk to a lot). Detailed instructions on how to do that can be found after an intelligent Google search or by navigating to the following:
WhatsApp FAQ - Saving your chat history
Place the .txt file in your project’s root and proceed.
Those were just the raw materials, and like any food recipe, we need raw materials before we can make anything worthwhile. Now that we have gathered them all, let's load them into the mixer and make something awesome.
The first step is to load the chat file in Python. The easiest way to read a file in Python is to use the open utility and call the read method on the returned File object.
chat = open("chat-with-amfa.txt")
chatText = chat.read()  # read its contents
chat.close()
Next comes the task of cleaning the text to remove artifacts that we aren’t going to need. Here’s a function that reads a text file, removes the artifacts and returns a list of lines extracted from the text file.
As you'll find in any other tutorial that does text mining, we make use of regular expressions — or RegExp for short — to find recurring patterns in the text and substitute them in one go. This makes the code more comprehensible and helps us avoid unnecessary loops. (If you are a beginner, think about how we could do what we want without RegExp.)
Let's first declare two RegExp patterns that we would like to find and remove from every line of the text file. We then precompile them and save the compiled RegExps in the regexMedia and regexDate variables. These two variables now act as ready-to-use shredders: all you have to do is give them paper, and they'll make inedible spaghetti in no time! If you are having trouble understanding RegExp, you can go to https://regexr.com/3s301 and try changing it bit by bit to see how the matches change.
Next, we "substitute", as implied by the sub method, all occurrences of the pattern with an empty string. You can think of it as a conditional and automated eraser.
Then, we simply split the chat file into separate lines and remove all lines that are empty. (Can you reason about why the file, at this time, would have empty lines?)
The complete code, with some schezwan sauce and decoration, is as below:
You can copy and paste that code in a file named utilities.py, but you’ll learn a lot about the language if you typed it yourself.
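Since the embedded gist isn't reproduced here, below is a minimal sketch of what utilities.py might contain. The exact patterns behind regexMedia and regexDate, and the function name cleanText, are assumptions reconstructed from the prose and the export format shown above; the real code lives in the project's repository:

import re

# Matches the "<Media omitted>" placeholder WhatsApp inserts for media files (assumed pattern)
regexMedia = re.compile(r"<Media omitted>")
# Matches the leading "date, time - sender: " metadata on each line (assumed pattern)
regexDate = re.compile(r"\d+/\d+/\d+,\s\d+:\d+\s[AP]M\s-\s[^:]+:\s")

def cleanText(filename):
    """Read a chat export, strip artifacts, and return a list of non-empty lines."""
    chat = open(filename)
    chatText = chat.read()
    chat.close()
    # Substitute all occurrences of each pattern with an empty string
    chatText = regexMedia.sub("", chatText)
    chatText = regexDate.sub("", chatText)
    # Split into lines and drop the empty ones left behind by the substitutions
    return [line for line in chatText.splitlines() if line.strip()]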
nltk comes pre-bundled with a sentiment analyzer (VADER) that was pre-trained on state-of-the-art textual datasets, and we can use it to analyze the tone of each sentence and mark it as one of these: Negative, Positive, or Neutral.
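Here's roughly how that analyzer is used on a single sentence (a minimal sketch; the example text is made up):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("This tutorial is awesome!")
# scores is a dict of the form {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}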
As always, we clean the chat text using the function we defined in utilities.py, initialize helper variables, and start processing. We then handle each line in the chat file, each representing one text, one by one. You can see that we are again using RegExp, this time to get rid of texts that consist only of an emoji. In most cases this works, but it might fail when a regular text is preceded by an emoji. I look forward to a more sophisticated hack to get rid of this! PRs will always be welcome.
Moving on, we use our GreatSentimentAnalyzer to assign scores to the current text in question. Then, we find the field — "neg", "pos" or "neu" — which has the maximum score and increment the counter keeping track of it, using one of Python's lambda functions. There are simpler ways to do it, so feel free to replace the lambda function with something else, keeping the end goal in mind.
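That step might look something like this (a sketch; the counters dictionary is an assumed name):

scores = sia.polarity_scores(line)
# Pick whichever of 'neg', 'neu', 'pos' scored the highest
sentiment = max(("neg", "neu", "pos"), key=lambda field: scores[field])
counters[sentiment] += 1  # counters starts as {"neg": 0, "neu": 0, "pos": 0}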
After that, we apply some mathematics and boom-shaw-shey-done! The results are logged in the terminal, as well as plotted using matplotlib.
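The "mathematics" is just converting counts to percentages, and the plot boils down to something like this (a sketch, assuming the counters dictionary above and matplotlib.pyplot imported as plt):

total = sum(counters.values())
for field, count in counters.items():
    print(f"{field}: {count / total:.1%}")
plt.pie(list(counters.values()), labels=list(counters.keys()), autopct="%1.1f%%")
plt.show()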
Sentiment Analysis — Chat with Amfa
I have omitted the full plotting code here for simplicity, but you can grab the source from the project's repository on GitHub, along with a Jupyter Notebook.
Next, we will use the metadata we previously discarded and draw awesome inferences from it to study and analyze WhatsApp chat messages using Python.
How many times has it occurred to you that if you had texted that person an hour earlier, you probably would have gotten a reply? We are going to use Python to answer that question. (And yes, there will be probabilities!) Once you're done with this, you'll have a working system capable of plotting the distribution of texts exchanged by you and someone else, divided into 1-hour intervals.
Without further delay, let’s lift this baby off the ground!
I know, I know! But there’s nothing we can do about that, right?
pip install matplotlib
That’s it. Python is a powerful language and you’ll see how we can do basic text manipulation with what the language provides by default. Isn’t that heaven?
Go to or make a directory named WHATEVER-YOU-LIKE and open it in your favorite editor. Of course, you'll need a text file containing your WhatsApp chat with someone in order to complete this tutorial. Detailed instructions on how to export it can be found here:
WhatsApp FAQ - Saving your chat history
Once you have that .txt file, you are ready to move on. In a file named timing.py (or whatever), follow the steps below:
First things first, we gotta summon the libraries we are going to use:
import re
import sys
import matplotlib.pyplot as plt
If you remember, in the first part of this series, we used RegExp to match and filter out the Time and Date metadata available in an exported chat file. We are going to do a similar thing this time with only one exception: We won’t discard that information.
We need sys to be able to parse command-line arguments and execute our script on a chat file by providing its name.
And, you guessed it right! We need matplotlib for plotting.
First, we need to split the chat file into lines. To do this, we will use the splitlines() method attached to every string.
def split_text(filename):
    """ Split file contents by newline. """
    chat = open(filename)
    chatText = chat.read()
    return chatText.splitlines()
We open a file, read its contents and then split the file by a newline character. (Can you think of how the splitlines() method is implemented under the hood?)
Wrapping up tasks like these in a method is considered good practice and it makes your code easier to comprehend and maintain.
Since WhatsApp stores messages using the 12-hour format, we will have to use two buckets, AM and PM, to collect the time information attached to every text.
def distributeByAmPm(linesText):
    AM, PM = [], []
    # RegExp to extract Time information
    timeRegex = re.compile(r"(\d+(:)\d+)(\s)(\w+)")
    for index, line in enumerate(linesText):
        matches = re.findall(timeRegex, line)
        if len(matches) > 0:
            # match now contains ('6:10', ':', ' ', 'PM')
            match = matches[0]
            if "AM" in match:
                AM.append(match[0])
            else:
                PM.append(match[0])
    return AM, PM
We need to pass the split lines we got in the previous step to distributeByAmPm to get the distributed buckets. Inside it, we first compile a RegExp that will match the time string in every text. As a brief reminder, this is how a text looks in the exported file:
10/09/16, 4:10 PM - Person 2: In an hour, maybe?
And that’s how the RegExp maps the pattern to the original string:
RegExp at work.
Later, we simply use an if statement to correctly distribute the time strings into AM and PM buckets. The two buckets are then returned.
Here, we’ll first create a skeleton container — a dictionary — to contain the 1-hour intervals. The key will represent the hour, and the value will represent the number of texts shared within that interval. To illustrate, this is how it’ll look, eventually:
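Something like this, with made-up counts:

time_groups = {
    "0": 12,   # 12 texts exchanged between 12 AM and 1 AM
    "1": 9,
    # ... one entry per hour ...
    "23": 30,
}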
First, let’s create the skeleton container with this fairly straightforward code:
time_groups = {}
for i in range(24):
    time_groups[str(i)] = 0  # skeleton container
Then, we loop through each bucket — AM & PM — and increment the correct count in our time_groups container.
# if the hour is in AM
for time in AM:
    current_hour = get_hour(time)
    if current_hour == 12:
        current_hour = 0  # Because we represent 12 AM as 0 in our container
    add_to_time_groups(current_hour)
For all time strings in the AM bucket, we first grab the current_hour (ignoring the minute information, because we are grouping by hour). Then, if the hour is 12, we set it to 0 to map 12 AM → 0 on the clock.
The two helper functions, get_hour and add_to_time_groups are defined below:
def add_to_time_groups(current_hour):
    current_hour = str(current_hour)
    time_groups[current_hour] += 1

def get_hour(time):
    return int(time.split(":")[0])
We follow a similar procedure for PM, with just one caveat: we add 12 to every hour (except 12 PM) to map 1 PM → 13 on the clock, and so on.
# Similarly for PM
for time in PM:
    current_hour = get_hour(time)
    if current_hour > 12:
        continue  # In case it has captured something else, ignore
    if current_hour != 12:
        current_hour += 12
    add_to_time_groups(current_hour)
That’s it! That completes this step. I am leaving “wrapping this into a function” as an exercise to the reader. Check the Source code on GitHub to see how it would look, if you get stuck!
Now that we have defined almost every single bit of our API, let’s define an analyze method that we can call each time we want to run a chat file through our defined pipeline.
def analyze(name):
    splitted_lines = split_text(name + '.txt')
    # Distribute into AM and PM
    AM, PM = distributeByAmPm(splitted_lines)
    # Now group time into 1-hour intervals
    time_groups = groupByHour(AM, PM)
    # Plot
    plot_graph(time_groups, name)

Notice how clean it looks? Try method-ifying your existing code and see how it goes! The plot_graph method is as below:

def plot_graph(time_groups, name):
    plt.bar(range(len(time_groups)), list(time_groups.values()), align='center')
    plt.xticks(range(len(time_groups)), list(time_groups.keys()))
    # Add xlabel, ylabel, etc. yourself!
    plt.show()
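For the command below to work, the script also needs a tiny entry point that reads the chat name from the command line (that's why we imported sys earlier). A sketch:

if __name__ == "__main__":
    analyze(sys.argv[1])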
Fire up a terminal and type (the file name may vary):
# python file-name.py chat-file-name
$ python timing.py amfa
Timing Analysis — Chat with Amfa
The massive towers at the 21, 22, 23, 0, 1, and 2 intervals indicate that Amfa is usually an active texter late in the day, and that it's highly unlikely I'll get a reply at 10 in the morning. Isn't that cool?
We have learned how simple and easy it is to define complex constructs and tasks in Python, and how RegExp can save our souls a lot of trouble. We used the GreatSentimentAnalyzer to analyze WhatsApp chat messages. We also learned that method-ifying our code is great practice, and that Amfa is a night owl, not a morning person!
As always, the entire code is on GitHub and your shenanigans are always welcome!