
Introduction to Text Mining: WhatsApp Chats (Part 2)


Abhishek Soni
11th Feb, 2019


How many times has it occurred to you that maybe, just maybe, if you had texted that person an hour ago, you would have gotten a reply? What we are going to do today is use Python to answer that question. (And yes, there will be probabilities!) Let’s learn to analyze WhatsApp chats using Python.

Once you’re done with this tutorial, you’ll have a working system capable of plotting the distribution of texts exchanged by you and someone else, divided into 1-hour intervals.

Note: If you haven’t checked out Part 1 yet, you can find it here: Introduction to Text Mining in WhatsApp Chats using Python- Part 1

Without further delay, let’s lift this baby off the ground!

Dependencies (heaven?)

I know, I know! But there’s nothing we can do about that, right?

If you followed the previous tutorial, you already have the dependencies. If you are starting afresh, fire up a terminal and go:

pip install matplotlib

That’s it. Python is a powerful language and you’ll see how we can do basic text manipulation with what the language provides by default. Isn’t that heaven?

Code

Go to or make a directory named WHATEVER-YOU-LIKE and open it in your favorite editor. Of course, you’ll need a text file containing your WhatsApp chat with someone in order to complete this tutorial. Detailed instructions on how to save your chat history as a text file can be found here:

WhatsApp FAQ - Saving your chat history (faq.whatsapp.com)

Once you have that .txt file, you are ready to move on. In a file named timing.py (or whatever you like), follow the steps below:

Step 1. Import dependencies

First things first, we gotta summon the libraries we are going to use:

import re
import sys
import matplotlib.pyplot as plt

If you remember, in the first part of this series, we used RegExp to match and filter out the Time and Date metadata available in an exported chat file. We are going to do a similar thing this time with only one exception: We won’t discard that information.

We need sys to be able to parse command-line arguments and execute our script on a chat file by providing its name.

And, you guessed it right! We need matplotlib for plotting.

Step 2. Read and split file contents by new line

To do this, we will use the splitlines method attached to every string.

def split_text(filename):
  """
  Split file contents by newline.
  """
  # The with-block closes the file for us, even if reading fails
  with open(filename, encoding="utf-8") as chat:
    chatText = chat.read()
  return chatText.splitlines()

We open a file, read its contents and then split the file by a newline character. (Can you think of how the splitlines method is implemented under the hood?)
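As a hint for that question, here is a small comparison of splitlines with a plain split on "\n". The sample text is made up for illustration. It shows two things splitlines takes care of: Windows-style "\r\n" endings and the trailing empty string.

```python
# Made-up sample text mixing Unix (\n) and Windows (\r\n) line endings
text = "first line\nsecond line\r\nthird line\n"

by_splitlines = text.splitlines()
by_split = text.split("\n")

print(by_splitlines)  # ['first line', 'second line', 'third line']
print(by_split)       # ['first line', 'second line\r', 'third line', '']
```

This is why splitlines is the safer choice for a file that may have been exported on a different platform.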

Wrapping up tasks like these in a method is considered good practice and it makes your code easier to comprehend and maintain.

Step 3. Distribute chats into AM and PM buckets

Since WhatsApp stores messages using the 12-hour format, we will use two buckets, AM and PM, to collect the time information attached to every text.

def distributeByAmPm(linesText):
  AM, PM = [], []
  # RegExp to extract Time information (raw string keeps \d and \s intact)
  timeRegex = re.compile(r"(\d+(:)\d+)(\s)(\w+)")
  for line in linesText:
      matches = timeRegex.findall(line)
      if len(matches) > 0:
          # match now contains ('6:10', ':', ' ', 'PM')
          match = matches[0]
          if "AM" in match:
              AM.append(match[0])
          else:
              PM.append(match[0])
  return AM, PM

We need to pass the split lines we got in the previous step to distributeByAmPm to get the distributed buckets. Inside it, we first compile a RegExp that will match the time string in every text. As a brief reminder, this is what a line looks like in the exported file:

10/09/16, 4:10 PM - Person 2: In an hour, maybe?

And that’s how the RegExp maps the pattern to the original string:

Figure: RegExp at work.
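To see the capture groups for yourself, here is a quick check you can run in a Python shell. It uses the same pattern and the sample line from above. The variable names are just for illustration.

```python
import re

# Same pattern as in distributeByAmPm, written as a raw string
time_regex = re.compile(r"(\d+(:)\d+)(\s)(\w+)")

line = "10/09/16, 4:10 PM - Person 2: In an hour, maybe?"
matches = time_regex.findall(line)

# One tuple per match: (time, colon, whitespace, AM/PM marker)
print(matches)  # [('4:10', ':', ' ', 'PM')]
```

The first element of the tuple is the time string we store, and the last one tells us which bucket it belongs to.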

Later, we simply use an if statement to correctly distribute the time strings into AM and PM buckets. The two buckets are then returned.

Step 4. Group time strings into 1-hour intervals

Here, we’ll first create a skeleton container, a dictionary, to hold the 1-hour intervals. Each key represents an hour (stored as a string, "0" through "23"), and each value represents the number of texts shared within that hour. Eventually, every hour will map to a count; for example, the key "21" will hold the number of texts sent between 9 PM and 10 PM.

First, let’s create the skeleton container with this fairly straightforward code:

time_groups = {}
for i in range(24):
 time_groups[str(i)] = 0  # skeleton container

Then, we loop through each bucket — AM & PM — and increment the correct count in our time_groups container.

# if the hour is in AM
for time in AM:
    current_hour = get_hour(time)
    if current_hour == 12:
        current_hour = 0  # Because we represent 12AM as 0 in our container
    add_to_time_groups(current_hour)  # inside the loop, so every text is counted

For all time strings in the AM bucket, we first grab the current_hour (and ignore the minute information because we are grouping by hour). Then, if the hour is 12, we set it to 0 to map 12 AM → 0 on the clock.

The two helper functions, get_hour and add_to_time_groups, are defined below:

def add_to_time_groups(current_hour):
    # Keys in time_groups are strings, so convert before indexing
    current_hour = str(current_hour)
    time_groups[current_hour] += 1

def get_hour(time):
    # '6:10' -> 6
    return int(time.split(":")[0])

We follow a similar procedure for PM with just one caveat: we add 12 to every hour (except 12 PM) to map 1 PM → 13 on the clock, and so on.

# Similarly for PM
for time in PM:
    current_hour = get_hour(time)
    if current_hour > 12:
        continue  # In case it has captured something else, ignore
    if current_hour != 12:
        current_hour += 12
    add_to_time_groups(current_hour)

That’s it! That completes this step. I am leaving “wrapping this into a function” as an exercise to the reader. Check the Source code on GitHub to see how it would look, if you get stuck!
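If you want to compare notes, here is one possible way to wrap the step into a function. The name groupByHour matches the call used in Step 5, but the body is only a sketch assembled from the snippets above, and the version on GitHub may differ. One assumption of mine: the AM loop also skips hours above 12, mirroring the guard in the PM loop.

```python
def get_hour(time):
    """Return the hour part of an 'H:MM' string as an int."""
    return int(time.split(":")[0])


def groupByHour(AM, PM):
    """Count texts per hour on a 24-hour clock, keyed by '0'..'23'."""
    time_groups = {str(hour): 0 for hour in range(24)}  # skeleton container

    for time in AM:
        current_hour = get_hour(time)
        if current_hour > 12:
            continue  # assumption: not a valid 12-hour time, ignore
        if current_hour == 12:
            current_hour = 0  # 12 AM is hour 0
        time_groups[str(current_hour)] += 1

    for time in PM:
        current_hour = get_hour(time)
        if current_hour > 12:
            continue  # captured something else, ignore
        if current_hour != 12:
            current_hour += 12  # 1 PM -> 13, ..., 11 PM -> 23
        time_groups[str(current_hour)] += 1

    return time_groups
```

Keeping time_groups local to the function means each call starts from a fresh container, so analyzing two chats in a row does not mix their counts.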

Now that we have defined almost every single bit of our API, let’s define an analyze method that we can call each time we want to run a chat file through our defined pipeline.

Step 5. Analyze!

def analyze(name):
  splitted_lines = split_text(name + '.txt')
  # Distribute into AM and PM
  AM, PM = distributeByAmPm(splitted_lines)
  # Now group time into 1-hour intervals
  time_groups = groupByHour(AM, PM)
  # Plot
  plot_graph(time_groups, name)

Notice how clean it looks? Try method-ifying your existing code and see how it goes!

The plot_graph method is as below:

def plot_graph(time_groups, name):
  plt.bar(range(len(time_groups)), list(time_groups.values()), align='center')
  plt.xticks(range(len(time_groups)), list(time_groups.keys()))
  # Add xlabel, ylabel, etc. yourself!
  plt.show()

Step 6. RUN

Fire up a terminal and type (the file name may vary):

# python file-name.py chat-file-name
$ python timing.py amfa
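The command above passes the chat name as a command-line argument, which is why we imported sys in Step 1. The listings in this post do not show that wiring, so here is one possible sketch. The names main and chat_name_from_args are hypothetical, and analyze is assumed to be the function from Step 5.

```python
import sys


def chat_name_from_args(argv):
    """Return the chat name given on the command line, or None if missing."""
    if len(argv) < 2:
        return None
    return argv[1]


def main(argv, analyze):
    """Run analyze on the chat named in argv; return a process exit code."""
    name = chat_name_from_args(argv)
    if name is None:
        print("usage: python timing.py chat-file-name", file=sys.stderr)
        return 1
    analyze(name)  # analyze appends '.txt' itself, so pass the bare name
    return 0


# At the bottom of timing.py you could then write:
# if __name__ == "__main__":
#     sys.exit(main(sys.argv, analyze))
```

Passing analyze in as a parameter is just a convenience that makes the wiring easy to try out on its own.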

Figure: Timing Analysis, Chat with Amfa

The massive towers at hours 21, 22, 23, 0, 1, and 2 indicate that Amfa is usually an active texter late at night, so it’s highly unlikely I’ll get a reply at 10 in the morning. Isn’t that cool?
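The intro promised probabilities, and the counts get us most of the way there. As a rough sketch, dividing each hour’s count by the total gives the share of texts exchanged in that hour. Note that hourly_shares is a hypothetical helper, not part of the original code, and the sample counts below are made up. Treat the result as an empirical share of messages, not a true probability of getting a reply.

```python
def hourly_shares(time_groups):
    """Convert per-hour counts into fractions of all texts."""
    total = sum(time_groups.values())
    if total == 0:
        # No texts at all: avoid dividing by zero
        return {hour: 0.0 for hour in time_groups}
    return {hour: count / total for hour, count in time_groups.items()}


# Made-up counts for three hours, purely for illustration
sample_counts = {"10": 1, "22": 3, "23": 4}
shares = hourly_shares(sample_counts)
print(shares)  # {'10': 0.125, '22': 0.375, '23': 0.5}
```

You could pass the resulting dictionary straight to plot_graph to plot shares instead of raw counts.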

Conclusion

In this tutorial, we learned how simple and easy it is to define complex constructs and tasks in Python and how RegExp can save our souls a lot of trouble. We also learned that method-ifying our code is great practice, and that Amfa is a night owl and so not a morning person!

As always, the entire code is on GitHub and your shenanigans are always welcome!


Abhishek Soni

Blog author

I am a passionate Software Developer with experience in implementing a variety of software projects. (e.g., An Android App with Speed Reading capabilities, or an Android App that uses ML to check whether that kid drew the letter A correctly)


Website : https://github.com/abhisheksoni27
