Sep 22, 2018

Example 7: Creating Word Count

Using Text from Online Book


Word count helps in finding how many times particular words are used in a document, book, email, tweet, etc., for example as a first step towards sentiment analysis. This blog does not talk about sentiment analysis; it only focuses on finding the most common words in a pool of words and their counts.
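As a minimal illustration of the idea (a hypothetical snippet, separate from the book example below), Python's collections.Counter can tally the words in any string:

```python
from collections import Counter

# A tiny sample text (for illustration only)
text = "the quick brown fox jumps over the lazy dog the fox"

# Split into lowercase words and count occurrences of each
counts = Counter(text.lower().split())

print(counts.most_common(2))  # → [('the', 3), ('fox', 2)]
```

The book example that follows does the same thing at a larger scale, with punctuation cleaning and stop-word filtering added.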

The following Python example creates a word count from the text of an online book found at https://www.gutenberg.org/, then extracts the twenty most frequently occurring words. Finally, it plots those twenty words in a graph to visualise their comparative presence. The following steps are followed:
  1. Create a data structure to store the words from the source and the number of occurrences of the words.
  2. Read in each word from the source file, make it lower case and remove punctuation. (Optionally, skip common words).
    For each remaining word, add the word to the data structure and update word count
  3. Extract the twenty most frequently occurring words from the data structure and print them, along with their frequencies.
  4. Plot the word frequency count in a graph using matplotlib.
Create a text file named "tale-of-two-cities.txt" from the online book. Also create a text file named "stopwords.txt" containing the words to be skipped, one per line.
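If you do not have a stop-word list handy, a minimal stopwords.txt can be written out in a couple of lines (the word list below is only a small illustrative sample; a fuller English stop-word list would give better results):

```python
import os

# A small sample stop-word list (illustrative only; extend it for real use)
sample_stopwords = ["the", "a", "an", "and", "of", "to", "in", "it", "is", "was"]

# Write one stop word per line into ./MyData/stopwords.txt
os.makedirs("./MyData", exist_ok=True)
with open("./MyData/stopwords.txt", "w", encoding="utf8") as f:
    f.write("\n".join(sample_stopwords))
```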

# Simple word count
# Using text from the book - A Tale of Two Cities, by Charles Dickens
# Finally plotting frequency of the most common words


import collections
import matplotlib.pyplot as plt

file=open("./MyData/tale-of-two-cities.txt", encoding="utf8")
stopwords = set(line.strip() for line in open('./MyData/stopwords.txt'))

# Creating the data structure as a dictionary
wordcount = {}

# Building the dictionary: for every word in the file, add it to the
# dictionary if it doesn't exist yet; if it does, increment its count.

for word in file.read().lower().split():
    # Strip punctuation characters from the word
    for ch in '.,\'“"?;:/()!-@':
        word = word.replace(ch, '')
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

# After building the word count, wrap it in a Counter to get the top n words.
wds = collections.Counter(wordcount)
#print(wds.most_common(10))

# Creating two Lists for x and y axis for plotting
x = []
y = []
for word, count in wds.most_common(20):
    print(word, ': ', count)
    x.append(word)
    y.append(count)
#print(x)
#print(y)


# Plotting the graph
x1 = range(len(y))
plt.figure(figsize=(10,10))
plt.xticks(x1, x, rotation=45)
plt.plot(x1,y,'*')
plt.show()

file.close()
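For comparison, the cleaning and counting above can be compressed using str.translate and Counter directly (a sketch of the same logic, shown here on an inline sample sentence and sample stop words so it runs on its own):

```python
import collections

# Translation table that deletes the same punctuation characters
# stripped by the replace() calls above
strip_table = str.maketrans('', '', '.,\'“"?;:/()!-@')

sample = 'It was the best of times, it was the worst of times.'
stopwords = {'it', 'was', 'the', 'of'}

# Lowercase, split, strip punctuation, drop stop words, and count
words = (w.translate(strip_table) for w in sample.lower().split())
wds = collections.Counter(w for w in words if w and w not in stopwords)

print(wds.most_common(2))  # → [('times', 2), ('best', 1)]
```

The behaviour is the same as the loop version; translate simply applies all the character removals in one pass.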


Result of the word count from the online book, showing the twenty most common words and their counts:

said :  659
mr :  620
one :  427
lorry :  322
upon :  290
will :  290
defarge :  268
man :  265
little :  264
time :  247
hand :  241
now :  233
miss :  226
two :  214
know :  213
good :  202
looked :  193
long :  187
made :  185
never :  185


Plotting the word frequency count in a graph:
Twenty most common words vs frequency

Next, check WordCloud in this link: WordCount in Python