Introduction to Text Mining in WhatsApp Chats using Python- Part 1

Abhishek Soni
Blog
09th Aug, 2018
Introduction to Text Mining in WhatsApp Chats using Python- Part 1

Sad or glad? Let’s find out!

I won't bore you with a lengthy and banal introduction. In this post, we are going to analyze WhatsApp chat messages and plot a pie chart visualizing the percentage of texts that have a Positive, Negative, or Neutral connotation or Sentiment to them. Ready to start?

Text Mining is just a fancy term for deriving super-awesome patterns and drawing amazing inferences from Textual Data.


1. Introduction to Text Mining in WhatsApp Chats

Text Mining is just a fancy term for deriving super-awesome patterns and drawing amazing inferences from Textual Data. Just as you can look at an image and infer that it shows a baby or a 77-year-old woman, we can do the same with texts. Feel free to hop on the rocket and dive straight into the code. The code, along with a Jupyter notebook, is available on GitHub, as always!

 

2. Mining patterns from Chats

If you have ever emailed a WhatsApp chat to yourself (or someone else, if that’s how you roll), you may have noticed that it includes a handful of details that can be used to analyze a group of texts as well as the entire chat history. For those of you who do not know what I am talking about, here’s a brief description: WhatsApp lets you export/share your chat history with a contact as a .txt file, and it has the following structure:

10/09/16, 6:10 PM - Person 1: Person 2?
10/09/16, 7:10 PM - Person 2: Yes?
10/09/16, 8:10 PM - Person 1: How are you?
10/09/16, 9:10 PM - Person 2: Idk. You?
10/09/16, 10:10 PM - Person 1: I just got a new phone
10/09/16, 11:10 PM - Person 2: Which one?
10/09/16, 12:10 AM - Person 1: I showed you?
10/09/16, 1:10 AM - Person 2: Cool
10/09/16, 2:10 AM - Person 1: Tes
10/09/16, 3:10 AM - Person 1: Yes
10/09/16, 4:10 AM - Person 1: You wanna write?
10/09/16, 5:10 AM - Person 1: If you have time.
10/09/16, 6:10 AM - Person 2: Not right now
10/09/16, 7:10 AM - Person 1: All right
10/09/16, 8:10 AM - Person 1: Do you think I should eat turkey?
10/09/16, 9:10 AM - Person 1: It's been a week
10/09/16, 10:10 AM - Person 1: We had a ten hour gaming session
10/09/16, 11:10 AM - Person 1: And it wasn't easy
10/09/16, 12:10 PM - Person 1: To be honest
10/09/16, 1:10 PM - Person 1: You should come over?
10/09/16, 2:10 PM - Person 2: Yes
10/09/16, 3:10 PM - Person 1: When?
10/09/16, 4:10 PM - Person 2: In an hour, maybe?
10/09/16, 5:10 PM - Person 2: See you!

We are going to take advantage of this data, so let’s dive in!
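
Each line of the export follows a fixed shape, so it can be pulled apart with a single regular expression. Here is a minimal sketch (the pattern and group names are my own, and it ignores multi-line messages):

```python
import re

# One exported line breaks into four parts: date, time, sender, and message.
# This pattern is illustrative only; real exports vary by locale and app version.
LINE = re.compile(
    r"^(?P<date>\d+/\d+/\d+), (?P<time>\d+:\d+ [AP]M) - (?P<sender>[^:]+): (?P<message>.*)$"
)

m = LINE.match("10/09/16, 6:10 PM - Person 1: How are you?")
print(m.group("sender"), "|", m.group("message"))  # → Person 1 | How are you?
```

We won't use this exact pattern below, but it shows why the format is so convenient to mine.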


3. Project Setup and Prerequisites

Like any other project, there are a few dependencies (libraries) you will need in order to analyze WhatsApp chats using Python. If you face any problems installing any of these dependencies, feel free to leave a comment!


On my own development system, I use Anaconda (with which I am not affiliated) as it provides seamless virtual environments in Python and also ships pre-compiled libraries in its repositories. 

Libraries to install for text mining


Assuming you have pip installed, fire up a terminal and enter the following command:

pip install numpy matplotlib nltk

Or, if you prefer conda:

conda install numpy matplotlib nltk

To install nltk’s data, run:
python -m nltk.downloader all
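
The all collection is quite large; if you only plan to follow this post, the sentiment step needs just the VADER lexicon (an assumption based on the analyzer we use below):

```shell
python -m nltk.downloader vader_lexicon
```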

After you have installed the libraries, we can go ahead and start mining! (There won’t be gold.)

4. Code

Create a new directory — you can call it whatever you like, but for me it is whatsapp-chat-analysis — and open it in your favorite code editor. While you are at it, open WhatsApp on your smartphone and export your chat with any of your contacts (preferably someone you talk to a lot). Detailed instructions on how to do that are a quick Google search away.

Place the .txt file in your project’s root and proceed.

Step 1. Loading the chat data into Python and pre-processing it

Those were just the raw materials, and like any food recipe, we need raw materials before we can make anything worthwhile. Now that we have gathered them, let’s load them into the mixer and make something awesome.

The first step is to load the chat file into Python. The easiest way to read a file in Python is to use the open built-in and call the read method on the returned file object.

chat = open("chat-with-amfa.txt", encoding="utf-8") # chats contain emoji, so be explicit
chatText = chat.read() # read its contents
chat.close()
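
As an aside, a with block does the same job and closes the file automatically, even if read raises. A small sketch, where the tiny sample file is created purely for illustration:

```python
from pathlib import Path

# Create a tiny stand-in for a real export (illustration only)
Path("chat-with-amfa.txt").write_text(
    "10/09/16, 6:10 PM - Person 1: Hello\n", encoding="utf-8"
)

# The context manager closes the file for us, even on errors
with open("chat-with-amfa.txt", encoding="utf-8") as chat:
    chatText = chat.read()
```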


Next comes the task of cleaning the text to remove artifacts that we aren’t going to need. Here’s a function that reads a text file, removes the artifacts and returns a list of lines extracted from the text file.

As you’ll find in any other tutorial that does Text Mining, we make use of regular expressions — or RegExp for short — to find recurring patterns in the text and substitute them in one go. It makes the code more comprehensible and easy to understand, as well as helps us avoid unnecessary loops. (If you are a beginner, think about how we can do what we want without RegExp.)

import re

mediaPattern = r"(\<Media omitted\>)" # Because it serves no purpose
regexMedia = re.compile(mediaPattern, flags=re.M)

dateAndTimepattern = r"(\d+\/\d+\/\d+)(,)(\s)(\d+:\d+)(\s)(\w+)(\s)(-)(\s\w+)*(:)"
regexDate = re.compile(dateAndTimepattern, flags=re.M)


Let’s first declare two RegExp patterns that we would like to find and remove from every line of the text file. We then precompile them and save the compiled patterns in the regexMedia and regexDate variables. These two variables now act as ready-to-use shredders: all you have to do is feed them paper and they’ll make inedible spaghetti in no time! If you are having trouble understanding RegExp, go to https://regexr.com/3s301 and try changing the pattern bit by bit to see how the matches change.
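
You can also sanity-check the date pattern without leaving your editor by running it against one sample line (the sample text here is made up):

```python
import re

dateAndTimepattern = r"(\d+\/\d+\/\d+)(,)(\s)(\d+:\d+)(\s)(\w+)(\s)(-)(\s\w+)*(:)"
regexDate = re.compile(dateAndTimepattern, flags=re.M)

# The whole prefix, including the sender's name, is consumed by the pattern
cleaned = regexDate.sub("", "10/09/16, 6:10 PM - Person 1: Hello").strip()
print(cleaned)  # → Hello
```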

https://regexr.com/3s301 — RegExr
Next, we “substitute”, as the sub method’s name implies, all occurrences of each pattern with an empty string. You can think of it as a conditional and automated eraser.

"""
Removes the matches and
replace them with an empty string
"""
chatText = regexMedia.sub("", chatText)
chatText = regexDate.sub("", chatText)

lines = []

for line in chatText.splitlines():
    if line.strip() != "": # If it's empty, we don't need it
        lines.append(line.strip())


Then, we simply split the chat file into separate lines and remove all lines that are empty. (Can you reason about why the file, at this time, would have empty lines?)
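
As a hint to the question above: a message that was only an attachment turns into an empty line once both patterns have done their work. A tiny made-up example:

```python
import re

regexMedia = re.compile(r"(\<Media omitted\>)", flags=re.M)
regexDate = re.compile(
    r"(\d+\/\d+\/\d+)(,)(\s)(\d+:\d+)(\s)(\w+)(\s)(-)(\s\w+)*(:)", flags=re.M
)

# A media-only message: the date prefix and the placeholder both get erased
line = "10/09/16, 6:10 PM - Person 1: <Media omitted>"
stripped = regexMedia.sub("", regexDate.sub("", line))
print(repr(stripped.strip()))  # an empty string, which the filter then discards
```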

The complete code, with some schezwan sauce and decoration, is as below:

import re

mediaPattern = r"(\<Media omitted\>)" # Because it serves no purpose
regexMedia = re.compile(mediaPattern, flags=re.M)

dateAndTimepattern = r"(\d+\/\d+\/\d+)(,)(\s)(\d+:\d+)(\s)(\w+)(\s)(-)(\s\w+)*(:)"
regexDate = re.compile(dateAndTimepattern, flags=re.M)

def cleanText(filename):
    chat = open(filename, encoding="utf-8")
    chatText = chat.read()
    chat.close()

    # Example of the prefix the date pattern removes: 01/09/17, 11:34 PM - Amfa:

    """
    Removes the matches and
    replace them with an empty string
    """
    chatText = regexMedia.sub("", chatText)
    chatText = regexDate.sub("", chatText)

    lines = []

    for line in chatText.splitlines():
        if line.strip() != "": # If it's empty, we don't need it
            lines.append(line.strip())

    return lines


You can copy and paste that code into a file named utilities.py, but you’ll learn a lot more about the language if you type it yourself.

Step 2. Sad or Glad — A Novel

import sys
import re
import matplotlib.pyplot as plt
import nltk
from utilities import cleanText
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentiment_analyzer = SentimentIntensityAnalyzer() # Our Great Sentiment Analyzer

def analyze(name):
    linesList = cleanText(name + '.txt')
    neutral, negative, positive = 0, 0, 0

    for index, sentence in enumerate(linesList):
        print("Processing {0:.0f}%".format(index * 100 / len(linesList)))
       
        # Ignore texts that don't start with a word character (e.g. emoji)
        if not re.match(r'^\w', sentence):
            continue
       
        scores = sentiment_analyzer.polarity_scores(sentence)
       
        # We don't need that component
        scores.pop('compound', None)
       
        maxAttribute = max(scores, key=lambda k: scores[k])

        if maxAttribute == "neu":
            neutral += 1
        elif maxAttribute == "neg":
            negative += 1
        else:
            positive += 1

    total = neutral + negative + positive
    print("Negative: {0:.1f}% | Neutral: {1:.1f}% | Positive: {2:.1f}%".format(
        negative*100/total, neutral*100/total, positive*100/total))
   
    # Plot
    #### Code Omitted ####

analyze(sys.argv[1])

nltk comes bundled with VADER, a lexicon- and rule-based sentiment analyzer tuned for social-media-style text, and we can use it to analyze the tone of each sentence and mark it as one of these: Negative, Positive, or Neutral.

As always, we clean the chat text using the function we defined in utilities.py, initialize the helper variables, and start processing. We then process each line in the chat file, each of which represents one text, one by one. Next, you can see that we are again using RegExp, this time to skip texts that begin with an emoji. In most cases this works, but it will also skip a text whose actual content merely follows a leading emoji. I look forward to a more sophisticated hack that gets rid of this limitation! PRs will always be welcome.
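
For intuition, Python’s \w matches Unicode word characters (letters, digits, and underscore) but not emoji, which is what lets such a simple regex distinguish the two. A small sketch, with made-up sample strings:

```python
import re

def starts_with_word_char(sentence):
    # \w matches Unicode letters, digits, and underscore, but not emoji
    return re.match(r'^\w', sentence) is not None

print(starts_with_word_char("Hello there"))    # → True
print(starts_with_word_char("🎉 party time"))  # → False
```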



Moving on, we use our GreatSentimentAnalyzer to assign scores to the text in question. Then we find the field with the maximum score ("neg", "pos", or "neu") and increment the counter keeping track of it. We use a Python lambda function as the key to max to find that field. There are simpler ways to do it, so feel free to replace the lambda with something else, keeping the end goal in mind.
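
One such simpler alternative: a dict’s own get method is already a function from key to score, so it can serve as the key directly (the scores below are made up for illustration):

```python
# Hypothetical VADER-style scores, invented for this example
scores = {"neg": 0.1, "neu": 0.7, "pos": 0.2}

# dict.get replaces the lambda; both pick the key with the largest value
assert max(scores, key=scores.get) == max(scores, key=lambda k: scores[k])
print(max(scores, key=scores.get))  # → neu
```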

After that, we apply some mathematics and boom-shaw-shey-done! The results are logged in the terminal, as well as plotted using matplotlib.

Sentiment Analysis — Chat with Amfa

I have omitted the code for plotting the results here for simplicity but you can grab the source from the project’s repository on GitHub, along with a Jupyter Notebook.


Conclusion

In this post, we used the GreatSentimentAnalyzer to study and analyze WhatsApp chat messages using Python, and damn, wasn’t it fun? Here is the project link for review.

In the next post, we are going to use the metadata we deleted and draw awesome inferences from that. Stay tuned! (Hint: Time is important, isn’t it?)
