Using Text from an Online Book
Word counting tells you how many times particular words are used in a document, book, email, tweet, etc., which is a common first step in sentiment analysis. This blog does not cover sentiment analysis itself. It focuses only on finding the most common words in a pool of words and their counts.
- Create a data structure to store the words from the source and the number of occurrences of the words.
- Read in each word from the source file, make it lower case and remove punctuation. (Optionally, skip common words).
- For each remaining word, add the word to the data structure and update its count.
- Extract the most frequently occurring words from the data structure and print them, along with their frequencies.
- Plot the word frequency count in a graph using matplotlib.
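The counting step above boils down to the standard dictionary idiom in Python; a minimal sketch with a few sample words:

```python
# Minimal sketch of the counting step: a plain dict mapping word -> count.
counts = {}
for word in ["tale", "of", "two", "cities", "of", "two"]:
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'tale': 1, 'of': 2, 'two': 2, 'cities': 1}
```

`dict.get(word, 0)` returns the current count, or 0 the first time a word is seen, so new and existing words are handled in one line.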
```python
# Simple word count
# Using text from the book A Tale of Two Cities, by Charles Dickens
# Finally, plotting the frequency of the most common words
import collections
import matplotlib.pyplot as plt

file = open("./MyData/tale-of-two-cities.txt", encoding="utf8")
stopwords = set(line.strip() for line in open('./MyData/stopwords.txt'))

# Creating the data structure as a dictionary
wordcount = {}

# Building the dictionary: for every word in the file, add it if it
# doesn't exist yet; if it already exists, increment its count.
for word in file.read().lower().split():
    # Strip punctuation from the word
    for ch in [".", ",", "'", "“", '"', "?", ";", ":", "/", "(", ")", "!", "-", "@"]:
        word = word.replace(ch, "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

# After building the counts, sort them and take the first n words.
wds = collections.Counter(wordcount)
#print(wds.most_common(10))

# Creating two lists for the x and y axes of the plot
x = []
y = []
for word, count in wds.most_common(20):
    print(count, ': ', word)
    x.append(word)
    y.append(count)

# Plotting the graph
x1 = range(len(y))
plt.figure(figsize=(10, 10))
plt.xticks(x1, x, rotation=45)
plt.plot(x1, y, '*')
plt.show()
file.close()
```
Result of the word count from the online book, with the twenty most common words and their counts:
said : 659
mr : 620
one : 427
lorry : 322
upon : 290
will : 290
defarge : 268
man : 265
little : 264
time : 247
hand : 241
now : 233
miss : 226
two : 214
know : 213
good : 202
looked : 193
long : 187
made : 185
never : 185
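As a design note, `collections.Counter` can also build these counts directly, and `str.translate` removes all punctuation in one pass instead of chained `replace` calls. A self-contained sketch, with a short sample sentence and a tiny stopword set standing in for the book text and the stopwords file:

```python
import collections
import string

text = "It was the best of times, it was the worst of times."
stopwords = {"it", "was", "the", "of"}  # tiny illustrative stopword set

# str.maketrans/translate strips every ASCII punctuation character in one pass
table = str.maketrans("", "", string.punctuation)
words = (w.translate(table) for w in text.lower().split())
counts = collections.Counter(w for w in words if w and w not in stopwords)

print(counts.most_common(3))  # [('times', 2), ('best', 1), ('worst', 1)]
```

The `if w` guard also drops tokens that become empty after punctuation removal.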
Plotting the word frequency count in a graph:
Twenty most common words vs frequency
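For frequency counts, a bar chart often reads better than the `'*'` point markers used above. A sketch with a few of the counts printed earlier hard-coded for illustration (the `Agg` backend and the output filename `word-frequency.png` are assumptions so the snippet runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt

# A few of the counts printed above, hard-coded for illustration
words = ["said", "mr", "one", "lorry", "upon"]
counts = [659, 620, 427, 322, 290]

plt.figure(figsize=(8, 5))
plt.bar(range(len(words)), counts)
plt.xticks(range(len(words)), words, rotation=45)
plt.ylabel("Frequency")
plt.title("Most common words in A Tale of Two Cities")
plt.tight_layout()
plt.savefig("word-frequency.png")
```

Swapping `plt.savefig(...)` for `plt.show()` displays the chart interactively instead.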
Next, check out WordCloud in this link: WordCount in Python