Aug 22, 2020

Example 21: WordCloud in Python

A WordCloud (or tag cloud) is a visual representation of text data from a document or any other text container. It displays a set of words, with the importance of each word shown by its font size or color. The importance of a word is measured by how often it occurs in the text, so the cloud gives a quick impression of what the writer emphasizes. We can use a WordCloud for restaurant reviews, product reviews, tweets, emails, or any similar textual material. The format is useful for quickly perceiving the most prominent terms in the content.
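To make the idea of "importance by frequency" concrete, here is a minimal sketch that counts word occurrences with the standard library only. The sample sentence is made up for illustration, and the simple regular-expression tokenizer is an assumption, not what the wordcloud library does internally.

```python
from collections import Counter
import re

# Hypothetical sample text, not taken from the wine dataset
sample = "Ripe fruit and soft tannins. Fruit flavors linger; the fruit is ripe."

# Lowercase the text and split it into alphabetic words
words = re.findall(r"[a-z']+", sample.lower())

# Count how often each word occurs
counts = Counter(words)

# The most frequent words are the ones a word cloud would draw largest
top_words = counts.most_common(3)
print(top_words)
```

In a word cloud, "fruit" would be drawn largest here because it occurs most often.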

The following example demonstrates the WordCloud technique in Python on a worldwide wine review dataset named winemag-data-130k-v2.csv. We use the libraries numpy, pandas, pillow (PIL), and wordcloud, plus matplotlib for plotting.

import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
%matplotlib inline

# Load in the dataframe
df = pd.read_csv("../MLData/winemag-data-130k-v2.csv", index_col=0)
df.head()

First, some initial checks on the text data in the wine review CSV file. We begin by looking at how many unique wine varieties the dataset contains: first the list of unique varieties, then their count.

# Check the unique wine varieties
df.variety.unique()

# How many unique varieties are there?
len(df.variety.unique())

# First ten unique varieties
df.variety.unique()[0:10]
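As a side note, pandas also offers nunique() and value_counts() for this kind of check. The sketch below uses a small made-up dataframe (the name toy and its contents are hypothetical) to show how they differ from len(unique()) when a value is missing.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine review dataframe
toy = pd.DataFrame({
    "variety": ["Pinot Noir", "Chardonnay", "Pinot Noir", "Riesling", None],
    "country": ["US", "France", "France", "Germany", "US"],
})

# unique() keeps a missing value as its own entry, so len() counts it too
n_unique_with_missing = len(toy["variety"].unique())

# nunique() counts distinct values and ignores missing ones by default
n_varieties = toy["variety"].nunique()

# value_counts() lists each distinct value with its number of rows
variety_counts = toy["variety"].value_counts()

print(n_unique_with_missing, n_varieties)
print(variety_counts)
```

If a column contains missing values, the count from len(unique()) is therefore one higher than the number of real categories.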

# Print some basic information about the dataset
print("There are {} observations and {} features in this dataset. \n".format(df.shape[0],df.shape[1]))
print("There are {} types of wine in this dataset such as {}... \n".format(len(df.variety.unique()),
                                                                           ", ".join(df.variety.unique()[0:5])))
print("There are {} countries producing wine in this dataset such as {}... \n".format(len(df.country.unique()),
                                                                                      ", ".join(df.country.unique()[0:5])))


The dataset has a column called "points", which holds a numeric rating on a scale of 1 to 100 derived from the reviewers' assessment of the quality of each wine. We will use this column to compare wines across countries.

df[["country", "variety", "points"]].head()

# Groupby by country
country = df.groupby("country")


# Summary statistic of all countries
country.describe().head()


Now let's plot some figures. First, let's see the number of wines per country in the dataset using country.size().

# Plot the number of wines by country
plt.figure(figsize=(20,8))
country.size().sort_values(ascending=False).plot.bar()
plt.title("Number of Wines per Country", fontsize="18")
plt.xticks(rotation=75)
plt.xlabel("Country of Origin", fontsize="14")
plt.ylabel("Number of Wines", fontsize="14")
plt.show()


Now, let's plot the "average wine price" and "average wine points" among the countries.

# Select the 5 countries with the highest average wine price
# (numeric_only=True skips text columns, which recent pandas versions cannot average)
country.mean(numeric_only=True).sort_values(by="price", ascending=False).head()


# Select the 5 countries with the highest average points
country.mean(numeric_only=True).sort_values(by="points", ascending=False).head()
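The group-then-average step can be tried on a small made-up dataframe. This is only a sketch: the toy data is hypothetical, and numeric_only=True is passed because on recent pandas versions mean() on a grouped dataframe that still contains text columns raises an error.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine review dataframe
toy = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy"],
    "variety": ["Pinot Noir", "Merlot", "Chardonnay", "Gamay", "Nebbiolo"],
    "points": [90, 86, 92, 88, 89],
    "price": [40.0, 20.0, 60.0, 30.0, 35.0],
})

# Average only the numeric columns (points, price) within each country
avg = toy.groupby("country").mean(numeric_only=True)

# Rank the countries by average price, highest first
ranked = avg.sort_values(by="price", ascending=False)
print(ranked)
```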


# Plot the average wine price of each country, highest first
plt.figure(figsize=(20,10))

#country.max().sort_values(by="points", ascending=False)["points"].plot.bar()
country.mean(numeric_only=True).sort_values(by="price", ascending=False)["price"].plot.bar()
plt.title("Average Wine Price per Country", fontsize="18")
plt.xticks(rotation=75)
plt.xlabel("Country of Origin", fontsize="14")
plt.ylabel("Average Wine Price", fontsize="14")
plt.show()


Now, let's move on to the main part, i.e., the WordCloud. We take the text from the description column, which holds the review comments.

?WordCloud

# Start with one review:
text = df.description[0]


# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)


# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()



Another variant of the WordCloud design can be produced by setting the maximum font size, the maximum number of words, and the background color.

# lower max_font_size, change the maximum number of word and lighten the background:
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()



We can save the WordCloud image to a file for later use.

# Save the image in the MLData folder:
wordcloud.to_file("../MLData/word_cloud_first_review.png")

# Combine all reviews into one large text (len() of a string counts characters):
text = " ".join(review for review in df.description)
print("There are {} characters in the combination of all reviews.".format(len(text)))
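Note that len() applied to a string returns the number of characters, not the number of words. A minimal sketch with two made-up reviews shows the difference; splitting on whitespace is a rough approximation of a word count, not a proper tokenizer.

```python
# Two hypothetical reviews, joined the same way as the dataset reviews
reviews = ["Ripe and fruity.", "Crisp, dry and fresh."]
sample_text = " ".join(reviews)

# Number of characters, including spaces and punctuation
n_chars = len(sample_text)

# Rough number of words, obtained by splitting on whitespace
n_words = len(sample_text.split())

print("characters:", n_chars)
print("words:", n_words)
```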


We can create and use a list of stopwords that we do not want to appear in the WordCloud.

# Create stopword list:
stopwords = set(STOPWORDS)
stopwords.update(["drink", "now", "wine", "flavor", "flavors"])
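To see what a stopword list does, the sketch below filters words by hand before counting them. The stopword set and the sample sentence are made up for illustration, and this is not how the wordcloud library implements its filtering.

```python
from collections import Counter

# Hypothetical stopword set, much smaller than the STOPWORDS set of the library
custom_stopwords = {"the", "and", "is", "a", "this", "wine", "drink", "now"}

# Hypothetical review text, already lowercased
sample = "this wine is ripe and the fruit is ripe drink now"

# Keep only the words that are not in the stopword set
kept = [word for word in sample.split() if word not in custom_stopwords]

# Only the remaining words compete for space in the cloud
counts = Counter(kept)
print(counts)
```

Without the filter, frequent but uninformative words such as "is" or "wine" would dominate the picture.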


# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)


# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


Masking is a great way to draw the WordCloud in a required shape. Since these are wine reviews, let's try to mask the WordCloud in the shape of a wine bottle and glass. For this, we first need a masking image containing the shape that we want the WordCloud to appear in.

This is important: for the steps below, the masking image should be a grayscale image with a single channel rather than three RGB channels. You can download a single-channel grayscale image from the web, or you can create your own from a coloured image using the pillow library in Python.

# Convert a downloaded image to one channel image using pillow package
from PIL import Image
img = Image.open('../MLData/wine.png').convert('L')
img.save('../MLData/wine_mask.png')
img

Now check the generated mask image.

wine_mask = np.array(Image.open("../MLData/wine_mask.png"))
wine_mask
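The effect of convert('L') can be checked on a tiny in-memory image, so no file is needed. This is a sketch with a made-up 4x4 picture: it shows that the RGB image becomes a three-dimensional array while the grayscale version becomes a two-dimensional one.

```python
import numpy as np
from PIL import Image

# Hypothetical 4x4 RGB image: white background with a black 2x2 square
rgb = Image.new("RGB", (4, 4), color=(255, 255, 255))
for x in range(1, 3):
    for y in range(1, 3):
        rgb.putpixel((x, y), (0, 0, 0))

# Convert to a single-channel ("L" mode) grayscale image
gray = rgb.convert("L")

# Compare the array shapes before and after the conversion
rgb_array = np.array(rgb)
gray_array = np.array(gray)
print(rgb_array.shape)
print(gray_array.shape)
```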

The mask image marks areas with black ("0") pixels. WordCloud, however, treats only pure white ("255") pixels as masked out and draws words everywhere else, so we transform the mask so that every "0" value becomes "255", as shown below:

def transform_format(val):
    if val == 0:
        return 255
    else:
        return val


# Transform your mask into a new one that will work with the function:
transformed_wine_mask = np.ndarray((wine_mask.shape[0],wine_mask.shape[1]), np.int32)
for i in range(len(wine_mask)):
    transformed_wine_mask[i] = list(map(transform_format, wine_mask[i]))

# Check the expected result of your mask
transformed_wine_mask
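The same transformation can also be written without an explicit loop. The sketch below is an alternative, not the method used in this example: it applies np.where to a small made-up array instead of the real mask.

```python
import numpy as np

# Hypothetical 2x3 grayscale mask with some 0-valued pixels
mask = np.array([[0, 120, 255],
                 [0, 0, 200]], dtype=np.uint8)

# Replace every 0 with 255 and keep all other values unchanged
transformed = np.where(mask == 0, 255, mask)
print(transformed)
```

On a large mask image this vectorized form is typically much faster than mapping a Python function over every row.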


We can now generate the new WordCloud shape by passing the transformed mask image to WordCloud, as follows:

# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=transformed_wine_mask,
               stopwords=stopwords, contour_width=3, contour_color='firebrick')
# Generate a wordcloud
wc.generate(text)
# store to file (under a new name, so the source image wine.png is not overwritten)
wc.to_file("../MLData/wine_wordcloud.png")
# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()


Now, let's create another WordCloud with all US review comments, shaped like the US flag! First we join all review comments per country, then we use the joined US reviews for our task. All other steps are the same as in the example above.

country.size().sort_values(ascending=False).head(10)

# Join all reviews of each country:
usa = " ".join(review for review in df[df["country"]=="US"].description)
fra = " ".join(review for review in df[df["country"]=="France"].description)
ita = " ".join(review for review in df[df["country"]=="Italy"].description)
spa = " ".join(review for review in df[df["country"]=="Spain"].description)
por = " ".join(review for review in df[df["country"]=="Portugal"].description)
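Instead of writing one line per country, the reviews could also be joined for all countries at once. The sketch below shows this with groupby on a made-up dataframe; the name toy and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine review dataframe
toy = pd.DataFrame({
    "country": ["US", "France", "US", "Italy"],
    "description": ["Ripe cherry.", "Crisp apple.", "Smoky oak.", "Dried herbs."],
})

# Join the descriptions of each country into one string per country
joined = toy.groupby("country")["description"].agg(" ".join)

# The result is a Series indexed by country
print(joined["US"])
```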


usa

# Convert a downloaded USA flag to one channel image using pillow package
from PIL import Image
img = Image.open('../MLData/us_flag.png').convert('L')
img.save('../MLData/us_flag_rgb.png')
img



# us_mask = np.array(img)
us_mask = np.array(Image.open("../MLData/us_flag_rgb.png"))
us_mask


def transform_format(val):
    if val == 0:
        return 255
    else:
        return val


# Transform your mask into a new one that will work with the function:
transformed_us_mask = np.ndarray((us_mask.shape[0], us_mask.shape[1]), np.int32)
for i in range(len(us_mask)):
    transformed_us_mask[i] = list(map(transform_format, us_mask[i]))


# Check the expected result of your mask
transformed_us_mask

# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=transformed_us_mask, 
               stopwords=stopwords, contour_width=3, contour_color='firebrick')
# Generate a wordcloud from the US reviews only
wc.generate(usa)

# store to file
wc.to_file("../MLData/us_wine.png")

# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()



That's all! Hope you enjoyed it.