
Text Analytics on #coronavirus trends in Twitter using Python

In my last post, I worked on connecting to Twitter and collecting tweets on #Corona using Tweepy. Now that we have collected some tweets, let's work on cleaning and analyzing that text step by step.

1. Import Libraries: 
Let's import all the required libraries first.

import tweepy as tw

# For preprocessing
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from wordcloud import WordCloud

# For plots
import matplotlib.pyplot as plt

# Download the NLTK data used below (only needed once)
nltk.download('punkt')
nltk.download('stopwords')

# Build the English stopword set for filtering later
stop_words = set(stopwords.words('english'))

2. Initialize and connect to Twitter: 
Now, let's connect to the Twitter account using the respective keys and access tokens.

# Initialize API credentials
consumer_key = '################'
consumer_secret = '######################'
access_token = '############################'
access_token_secret = '#####################'

# Connect to Twitter through the API
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

3. Collect Tweets:
Now that we are connected to Twitter, we will look for English tweets on #Coronavirus. Let's collect 1,000 tweets and work on text analytics, as sketched below.
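The collection code itself isn't shown in this excerpt; a minimal sketch using Tweepy's Cursor might look like this (note that the search endpoint is api.search in older Tweepy releases and api.search_tweets in newer ones; the retweet filter is my assumption):

# Collect 1,000 English tweets on #Coronavirus, skipping retweets
search_words = "#Coronavirus -filter:retweets"
tweets = list(tw.Cursor(api.search,
                        q=search_words,
                        lang="en",
                        tweet_mode="extended").items(1000))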
The variable tweets is a list object containing the full tweet data: text, links, emojis, pictures, etc.


4. Cleaning and Preprocessing:
Now, we need to clean all the tweets to extract some meaningful words. One way to do this is sketched below.
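The original cleaning code isn't reproduced in this excerpt; here is a minimal sketch of one common approach (the regex patterns and the minimum word length are my assumptions, not necessarily the original steps), which builds the corpus list used in the next section:

corpus = []
for tweet in tweets:
    text = tweet.full_text                    # raw tweet text (tweet_mode="extended")
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"@\w+|#", "", text)        # drop mentions and '#' symbols
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # keep letters and whitespace only
    tokens = word_tokenize(text.lower())      # lowercase and tokenize
    corpus += [w for w in tokens if w not in stop_words and len(w) > 2]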

5. Text Analytics:
We have a corpus, and we can work on some basic text analytics techniques. First, let's take a look at what the corpus looks like at this point.

If we check the length of the corpus, the result is 9105 for this run.
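For example (assuming corpus is the flat list of cleaned tokens built above):

print(corpus[:10])   # peek at the first few tokens
print(len(corpus))   # 9105 for this run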
Can we get a count of each distinct word from the corpus? Let's do it!
A. Frequency Distribution:
Our aim is to find out the frequency of each distinct word from the corpus. Then we will take a look at the most used terms.
from nltk.probability import FreqDist
fdist = FreqDist(corpus)
print(fdist.most_common(100))
Checking my results, 'Trump' is trending again.
Let's plot the 30 most used terms in today's tweets on #coronavirus.
fdist.plot(30,cumulative=False)
plt.show()

Now, let's try some other tools. For visuals, my favorite is the word cloud.

B. Wordcloud:
Before creating the word cloud, let's check the frequency count in a different way.
import operator  # To sort dictionary values

# Join the words into a single string, then split on whitespace
words = ' '.join(corpus)
words = words.split()

# Create an empty dictionary
data = dict()

# Get the frequency of each word, where the word is the key and the count is the value
for word in words:
    word = word.lower()
    data[word] = data.get(word, 0) + 1

# Sort the dictionary in reverse order to print the most used terms first
print(dict(sorted(data.items(), key=operator.itemgetter(1), reverse=True)))


Again, as expected, the output shows that 'Trump' is the most frequently used term in the tweets. Now, let's create a word cloud with the 100 most used words.

word_cloud = WordCloud(width=800, height=800, background_color='white', max_words=100)
word_cloud.generate_from_frequencies(data)

# Plot the word cloud image
plt.figure(figsize=(10, 8), edgecolor='k')
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Okay, Trump, Lockdown, Pandemic - that pretty much sums up the current trend across the whole world! Next, I will try to figure out people's sentiment during this crisis period and will work on sentiment analysis using tweets on #coronavirus.

Till then... stay safe!

