Sep 30, 2018

Example 8: Sentiment Analysis of Tweets from Twitter API

Saving Tweets in CSV file with Tweet Sentiments


Sentiment Analysis is the process of understanding and extracting human feelings from text data. Defined more technically, it is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

Companies harness the power of social media through sentiment analysis: contextual mining of text that identifies and extracts subjective information from source material, helping a business understand the social sentiment for its brand, product, or service while monitoring online conversations.

One of the easiest places to find sentiment is in tweets, where users express their reactions to a news item, a situation, a product, a movie, or anything else under the sun. The Twitter API lets us read these tweets from our own code.

First of all, I created a Twitter developer app with my login ID at https://developer.twitter.com and obtained the consumer_key, consumer_secret, access_token, and access_token_secret that are used to access the Twitter API.

We then use the tweepy package for accessing the Twitter API in Python. For Python versions prior to 3.7, install tweepy simply with the usual pip command: pip install tweepy
For Python 3.7, install the latest tweepy directly from git as: pip install git+https://github.com/tweepy/tweepy.git

The TextBlob python library helps us process textual data and perform sentiment analysis with just a few lines of code. It provides a simple API for diving into common Natural Language Processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. Install the TextBlob python library with pip as:
pip install textblob
python -m textblob.download_corpora

Note: The Microsoft Visual C++ Compiler Package for Python 2.7 may be required if you encounter a build error while installing textblob on Python 2.7.

The TextBlob API returns sentiment as a named tuple of the form Sentiment(polarity=0.0, subjectivity=0.0). Polarity measures how positive or negative the text is, ranging from -1.0 (negative) through 0.0 (neutral) to +1.0 (positive). Subjectivity measures how much of an opinion the text is versus how factual, ranging from 0.0 (objective) to 1.0 (subjective).
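For example, a quick check of what TextBlob returns (a minimal sketch, assuming textblob and its corpora are installed as above; the sentences are made up purely for illustration):

from textblob import TextBlob

# An opinionated sentence: expect positive polarity and high subjectivity
print(TextBlob("I love this movie, it is wonderful!").sentiment)
# A purely factual sentence: typically Sentiment(polarity=0.0, subjectivity=0.0)
print(TextBlob("The meeting is at 10 am on Monday.").sentiment)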

The sample code I tested is shown below. Before running it, I created a blank csv file named twitter_dataset.csv and put it in the directory mentioned in the code.

Note: The tweepy package (at the time of writing) throws an error with Python 3.7, because "async" became a reserved keyword in Python 3.7 and can no longer be used as an argument name. To fix this, open tweepy's streaming.py and rename the async argument to async_. That fixed the error for me.

import tweepy
from textblob import TextBlob
import csv

#API keys and token for using the Twitter API
consumer_key = "#####E5UbboNCf3O9Z3N#####"
consumer_secret = "#####VbADAblJus2tF60JNX#####TQBV99kgHKpCvnC2b#####"
access_token = "#####8578-#####dqm2p2UtIp8Uu#####d4cQCO4RmqHK#####"
access_token_secret = "#####m0UYXecfEm6eDPEh#####LLCZmorNEQ8rcM#####"

#Authenticating with Twitter for the above keys
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

'''
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)
'''


tweet_search_topic = input("Enter the topic in twitter to analyse? : ")
public_tweets = api.search(tweet_search_topic)

#Writing the tweets in a csv file
with open('./MyData/twitter_dataset.csv', mode='w', newline='', encoding="utf-8") as tweets_file:
    tweet_writer = csv.writer(tweets_file, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    #Labelling the columns for the tweet dataset
    tweet_writer.writerow(['Tweet', 'Author', 'Date', 'Polarity', 'Subjectivity'])
    #Analyzing each tweet from the tweets and writing it in the csv file
    for tweet in public_tweets:
        tweet_text = tweet.text
        tweet_user = tweet.user.name
        tweet_created_at = tweet.created_at
        tweet_sentiment = TextBlob(tweet_text).sentiment
        tweet_polarity = tweet_sentiment.polarity
        tweet_subjectivity = tweet_sentiment.subjectivity
        print(tweet_text, tweet_user, tweet_created_at)
        print(tweet_sentiment)
        tweet_writer.writerow([tweet_text, tweet_user, tweet_created_at, tweet_polarity, tweet_subjectivity])


The Results:

On running the code, it asks for the topic in Twitter to analyse. I enter, say, "Putin".

It finds and prints tweets on Putin like this, along with their sentiment values:

@BigTinyBird @FoxNews @WhiteHouse @POTUS So, the mission was to get Putin to love America?
And it worked?
Wow.... h… https://t.co/ipLahQfYnE Max 2018-10-08 17:55:48
Sentiment(polarity=0.3, subjectivity=0.8)

... ... ...

It also writes the tweets to the csv file twitter_dataset.csv, which we can use for further detailed analysis.
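For instance, the saved file can be loaded back with pandas for such analysis (a minimal sketch, assuming pandas is installed; the column names match the header written above):

import pandas as pd

tweets_df = pd.read_csv('./MyData/twitter_dataset.csv')
# Average polarity and subjectivity across the collected tweets
print(tweets_df[['Polarity', 'Subjectivity']].mean())
# Rough breakdown of positive / negative / neutral tweets by polarity sign
print((tweets_df['Polarity'] > 0).sum(), "positive,",
      (tweets_df['Polarity'] < 0).sum(), "negative,",
      (tweets_df['Polarity'] == 0).sum(), "neutral")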



Sep 22, 2018

Example 7: Creating Word Count

Using Text from Online Book


Word count helps in finding how many times particular words are used in a document, book, email, tweet, etc., for example as a step towards sentiment analysis. This post does not talk about sentiment analysis; it only focuses on finding the most common words in a pool of words and their counts.

The following Python example builds a word count from the text of an online book found at https://www.gutenberg.org/, then extracts the twenty most frequently occurring words. Finally it plots those twenty words in a graph to visualise their comparative presence. The following steps are followed:
  1. Create a data structure to store the words from the source and the number of occurrences of the words.
  2. Read in each word from the source file, make it lower case and remove punctuation. (Optionally, skip common words).
    For each remaining word, add the word to the data structure and update word count
  3. Extract the top twenty most frequently occurring words from the data structure and print them, along with their frequencies.
  4. Plot the word frequency count in a graph using matplotlib.
Create a text file named "tale-of-two-cities.txt" from the online book. Also create a text file named "stopwords.txt" containing the common (stop) words to be skipped, one per line, as sketched below.
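If you do not already have a stop-word list at hand, a small placeholder stopwords.txt can be generated like this (the words below are only an illustrative sample; real stop-word lists are much longer):

# Writing a small, hypothetical stopwords.txt with one word per line
common_words = ["the", "and", "a", "an", "of", "to", "in", "it", "is",
                "was", "that", "he", "she", "his", "her", "with", "i"]
with open('./MyData/stopwords.txt', 'w') as f:
    f.write("\n".join(common_words))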

# Simple Word Cloud wordcount
# Using text from the book - A Tale of Two Cities, by Charles Dickens
# Finally plotting frequency of the most common words


import collections
import matplotlib.pyplot as plt

file=open("./MyData/tale-of-two-cities.txt", encoding="utf8")
stopwords = set(line.strip() for line in open('./MyData/stopwords.txt'))

# Creating the data structure as a dictionary mapping each word to its count
wordcount = {}

# Instantiating dictionary: For every word in the file, add to dictionary if it doesn't exist.
# If it exists in the dictionary, increment the wordcount.

for word in file.read().lower().split():
    word = word.replace(".", "")    # All cleanings
    word = word.replace(",", "")
    word = word.replace("\'", "")
    word = word.replace('“', '')
    word = word.replace('"', '')
    word = word.replace("?", "")
    word = word.replace(";", "")
    word = word.replace(":", "")
    word = word.replace("/", "")
    word = word.replace("(", "")
    word = word.replace(")", "")
    word = word.replace("!", "")
    word = word.replace("-", "")
    word = word.replace("@", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

# After building the word counts, sort them and return the first n words.
wds = collections.Counter(wordcount)
#print(wds.most_common(10))

# Creating two Lists for x and y axis for plotting
x = []
y = []
for word, count in wds.most_common(20):
    print(count, ': ', word)
    x.append(word)
    y.append(count)
#print(x)
#print(y)


# Plotting the graph
x1 = range(len(y))
plt.figure(figsize=(10,10))
plt.xticks(x1, x, rotation=45)
plt.plot(x1,y,'*')
plt.show()

file.close()
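
As an aside, collections.Counter can also build the counts directly from the cleaned words; a roughly equivalent sketch of the counting step (it reuses the stopwords set from above and strips standard ASCII punctuation only):

# Alternative counting step using Counter directly
import collections
import string

with open("./MyData/tale-of-two-cities.txt", encoding="utf8") as f:
    text = f.read().lower().translate(str.maketrans('', '', string.punctuation))
wds = collections.Counter(w for w in text.split() if w not in stopwords)
print(wds.most_common(20))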


Result from the online book, showing the twenty most common words and their counts:

said :  659
mr :  620
one :  427
lorry :  322
upon :  290
will :  290
defarge :  268
man :  265
little :  264
time :  247
hand :  241
now :  233
miss :  226
two :  214
know :  213
good :  202
looked :  193
long :  187
made :  185
never :  185


Plotting the word frequency count in a graph:
Twenty most common words vs frequency

Next, check out WordCloud in this link: WordCount in Python

Sep 20, 2018

Example 6: Reading a File in Python

Different Ways to Read Data from File


There are different ways to read the content of a text file. While Python simplifies the process of reading text files, at times it can still be tricky. The following tested examples summarise several ways to read the content of a text file from a folder.

The .txt file is taken from the online book - A Tale of Two Cities, by Charles Dickens, found at https://www.gutenberg.org/.

Reading a File Line by Line

# Reading a File Line by Line
file = "./MyData/tale-of-two-cities.txt"
filehandle = open(file, 'r', encoding='utf8')

while True:
    line = filehandle.readline()
    if not line:
        break
    print(line)

filehandle.close()


Reading a File as Chunks of Lines

# Reading a File as Chunks of Lines
from itertools import islice

file = "./MyData/tale-of-two-cities.txt"
filehandle = open(file, 'r', encoding='utf8')

num_of_lines = 6

with filehandle as input_file:
    line_cache = islice(input_file, num_of_lines)

    for current_line in line_cache:
        print(current_line)

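The snippet above reads only the first chunk of six lines. To walk through the whole file chunk by chunk, the islice call can be repeated in a loop; a minimal sketch along the same lines:

# Reading a File in repeated Chunks of Lines
from itertools import islice

file = "./MyData/tale-of-two-cities.txt"
num_of_lines = 6

with open(file, 'r', encoding='utf8') as input_file:
    while True:
        line_cache = list(islice(input_file, num_of_lines))
        if not line_cache:
            break
        # Process one chunk of up to num_of_lines lines
        print(''.join(line_cache))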


Reading a Specific Line from a File

# Reading a Specific Line from a File
file = "./MyData/tale-of-two-cities.txt"
filehandle = open(file, 'r', encoding='utf8')

line_number = 300
print("line %i of %s is:\n" % (line_number, file))

with filehandle as input_file:
    current_line = 1
    for line in input_file:
        if current_line == line_number:
            print(line)
            break
        current_line += 1



Reading the Entire File at Once using the function read()

# Reading the Entire File at Once
file = "./MyData/tale-of-two-cities.txt"
filehandle = open(file, 'r', encoding='utf8')

with filehandle as input_file:
    content = filehandle.read()
    print(content)



Reading the Entire File at Once using the function readlines()

# Reading the Entire File at Once
file = "./MyData/tale-of-two-cities.txt"
filehandle = open(file, 'r', encoding='utf8')

with filehandle as input_file:
    content = filehandle.readlines()
    for line in content:
        print(line)




Reference:
https://stackabuse.com/reading-files-with-python/


Sep 16, 2018

Example 5: Simple Clustering and Visualisations

Using European Soccer Dataset from Kaggle


One of the important aspects of any data-centric analysis is visualising the data, both in terms of its physical row-column contents and of statistical graphical plots, to get the required insights. We have the European Soccer Database, containing data on more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016. This example uses this dataset to show some simple steps for:
  • Exploring a dataset obtained from some source,
  • Cleaning or pre-processing the dataset for analytical use,
  • Predicting player performance using basic statistics,
  • Grouping similar players into clusters using Machine Learning, and
  • Plotting some statistical parameters and clusterings for visualisation.
The soccer dataset, downloaded from the Kaggle site, comes in .sqlite format, so the built-in sqlite3 module in Python 3 is used for interacting with the local relational database. The modules pandas and numpy are used for data ingestion and manipulation, matplotlib for data visualisation, and specific methods from sklearn for Machine Learning. A custom module customplot (listed at the end of the page) is also used for plotting the clusters.

The code is run in the PyCharm IDE, although a Jupyter notebook is more convenient for executing it section by section. Since I am using PyCharm, I have commented out some lines with #; delete the # to run those lines.
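
Before running the full script, it can also help to peek into the .sqlite file and see which tables it contains; a minimal sketch (assuming the same database.sqlite path used below):

import sqlite3
import pandas as pd

db = sqlite3.connect('./MyData/database.sqlite')
# List all tables in the database; Player_Attributes is the one used below
print(pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", db))
db.close()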

import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
from customplot import *

# Step 1: Ingesting data from the European Soccer Database downloaded from Kaggle
db = sqlite3.connect('./MyData/database.sqlite')
df = pd.read_sql_query("select * from Player_Attributes", db)

# Step 2: Checking the data content
# ---------------------------------
#print(df.head())

print(df.columns)
#dsc = df.describe().transpose()
#print(dsc.head())

# Step 3: Data Cleaning: Removing rows with missing data
# ------------------------------------------------------
#isnull = df.isnull().any().any(), df.shape
#isnull = df.isnull().sum(axis=0)
#print(isnull)

rows = df.shape[0]
print("\nNo of rows before dropping nulls:", rows)    # Gives 183978 rows
# Dropping the rows with null values

df = df.dropna()
rows_new = df.shape[0]
print("No of rows after dropping nulls:", rows_new)     # Gives 180354 rows
#isnull = df.isnull().any().any(), df.shape
#print(isnull)   # Gives (False, (180354, 42))

rows_dropped = rows - rows_new
print("No of rows with nulls dropped:", rows_dropped)  # Gives 3624, so 3624 rows dropped with null values
df = df.reindex(np.random.permutation(df.index))
#print(df.head())

# Step 4: Decide on a target to analyse, say a player's 'overall_rating'
# Create a list of potential features whose correlation with 'overall_rating' we want to measure
# ----------------------------------------------------------------------------------------------

potentialFeatures = ['acceleration', 'curve', 'free_kick_accuracy', 'ball_control', 'shot_power', 'stamina']
print(35*'-')
for f in potentialFeatures:
    related = df['overall_rating'].corr(df[f])
    print("%s: %f" % (f, related))
# Gives this result:
#acceleration: 0.243998
#curve: 0.357566
#free_kick_accuracy: 0.349800
#ball_control: 0.443991
#shot_power: 0.428053
#stamina: 0.325606
# So 'overall_rating' is most highly correlated with 'ball_control' (0.44) and 'shot_power' (0.43).

# Broadening the list of potential Features that can have correlation with 'overall_rating'

cols = [ 'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']

correlations = [df['overall_rating'].corr(df[f]) for f in cols]
#print(len(cols), len(correlations))

# Step 5: Data Visualization for correlation of the Features with 'overall_rating'
# Create a function for plotting a dataframe with string columns and numeric values
# ---------------------------------------------------------------------------------

def plot_dataframe(df, x_label, y_label): 
    color='blue'
    fig = plt.gcf()    # Using gcf plot of matplotlib
    fig.set_size_inches(21, 4)
    plt.ylabel(y_label, fontsize=14)
    plt.xlabel(x_label, fontsize=14)   

    ax = df.correlation.plot(linewidth=3.3, color=color)
    ax.set_xticks(df.index)
    ax.set_xticklabels(df.attributes, rotation=45)
    plt.grid(True)
    plt.show()

df2 = pd.DataFrame({'attributes': cols, 'correlation': correlations})
plot_dataframe(df2, 'Player\'s Attributes', 'Player\'s Overall Rating')

# Create Clusters with top features as per expert's choice that are essential for a player
# ----------------------------------------------------------------------------------------

select5features = ['gk_kicking', 'potential', 'marking', 'interceptions', 'standing_tackle']
df_select = df[select5features].copy(deep = True)
#print(df_select.head())
# Perform scaling on the dataframe containing the selected features

data = scale(df_select)
# Define number of clusters
noOfClusters = 4
# Train a Machine Learning model
model = KMeans(init='k-means++', n_clusters=noOfClusters, n_init=20).fit(data)

print(30*'_')
print("\nNo of players in each cluster")
print(30*'_')
kmclu = pd.value_counts(model.labels_, sort=False)
print(kmclu)

# Create a composite dataframe for plotting using custom function declared in customplot.py
# -----------------------------------------------------------------------------------------

P = pd_centers(featuresUsed=select5features, centers=model.cluster_centers_)
parallel_plot(P)
plt.show()


Textual Results:

Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
       'potential', 'preferred_foot', 'attacking_work_rate',
       'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes'],
      dtype='object')

No of rows before dropping nulls: 183978
No of rows after dropping nulls: 180354
No of rows with nulls dropped: 3624
-----------------------------------
acceleration: 0.243998
curve: 0.357566
free_kick_accuracy: 0.349800
ball_control: 0.443991
shot_power: 0.428053
stamina: 0.325606
_____________________________

No of players in each cluster
_____________________________
0    47071
1    43286
2    73628
3    16369
dtype: int64


Graphical Results:

1. On Correlation of Features:

Player's "Overall Ratings Correlated with Features" vs "Overall Ratings"

Analysis of Findings:

Now it is time to analyze what we plotted, the most important step in any Data Science analysis. We answer questions such as:

Qtn: Suppose you have to predict a player's overall rating. Which 5 player attributes would you ask for?

Ans: To answer this question, we need to find the five features with the highest correlation coefficients. We want to judge a player on his playing ability in the field. So, after looking at the above plot, we would ask for the following 5 player attributes:
    - short_passing
    - long_passing
    - ball_control
    - shot_power
    - vision

We ignore the highly correlated attributes "potential" and "reactions", as these do not directly reflect a player's playing ability in the field; a quick programmatic check is sketched below.
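
The same shortlist can also be read off programmatically from the correlation dataframe df2 built in the code above; a small sketch (the list of ignored attributes is just our choice from the discussion):

# Top 5 attributes by correlation with 'overall_rating', excluding the ignored ones
ignored = ['potential', 'reactions']
top5 = (df2[~df2['attributes'].isin(ignored)]
        .sort_values('correlation', ascending=False)
        .head(5))
print(top5)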

2. On Clustering of Features:

Plot of Clusters of the five Features selected by an imaginary expert

Analysis of Findings:

We can identify similar groups as follows:
  • Groups 0 and 2 are very similar in potential, marking, interceptions, and standing_tackle, but differ widely in gk_kicking - these players could coach each other on gk_kicking, where they differ.
  • Groups 1 and 3 are somewhat similar to each other in gk_kicking, marking, interceptions, and standing_tackle, but differ in potential.

Details on the Custom Plot:

Custom functions in customplot.py used for above Cluster plot 

def pd_centers(featuresUsed, centers):
    from itertools import cycle, islice
    #from pandas.tools.plotting import parallel_coordinates
    from pandas.plotting import parallel_coordinates
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    colNames = list(featuresUsed)
    colNames.append('prediction')

    # Zip with a column called 'prediction' (index)
    Z = [np.append(A, index) for index, A in enumerate(centers)]

    # Convert to pandas for plotting
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P

def parallel_plot(data):
    from itertools import cycle, islice
    #from pandas.tools.plotting import parallel_coordinates
    from pandas.plotting import parallel_coordinates
    import matplotlib.pyplot as plt

    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    plt.figure(figsize=(15,8)).gca().axes.set_ylim([-2.5,+2.5])
    parallel_coordinates(data, 'prediction', color = my_colors, marker='o')


Sep 4, 2018

Example 4: Comparing Multiple Classifiers

Understanding scikit-learn Classification


This example extends Example 3 by implementing the main classification models in the scikit-learn module and comparing their prediction scores side by side. The program also shows the use of the score method of the classifiers themselves instead of the accuracy_score function from sklearn.metrics used in Examples 2 and 3.
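
For reference, a classifier's score method returns its mean accuracy on the given test data, i.e. the same number accuracy_score gives for its predictions; a tiny sketch of the equivalence (the toy data here is made up purely for illustration):

from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Toy data, just to show that score() equals accuracy_score()
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]
clf = GaussianNB().fit(X_toy, y_toy)
print(clf.score(X_toy, y_toy))                      # mean accuracy
print(accuracy_score(y_toy, clf.predict(X_toy)))    # same value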

# Comparing multiple Classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
import pandas as pd

classifiers = {
    "KNN(3)"        : KNeighborsClassifier(3),
    "Linear SVM"    : SVC(kernel="linear"),
    "RBF SVM"       : SVC(gamma=2, C=1),
    "Gaussian Proc" : GaussianProcessClassifier(1.0 * RBF(1.0)),
    "Decision Tree" : DecisionTreeClassifier(max_depth=7),
    "Random Forest" : RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0),
    "AdaBoost"      : AdaBoostClassifier(),
    "Neural Net"    : MLPClassifier(alpha=1),
    "Naive Bayes"   : GaussianNB(),
    "QDA"           : QuadraticDiscriminantAnalysis()
}

file = open('./MyData/gender_data.csv')
data = pd.read_csv(file)

X = data.drop(['gender'], axis=1)
y = data['gender']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)

for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test) * 100
    print("{:<15}| score = {:.2f}".format(name, score))


On running this code, it prints each classifier's name alongside its test score.
The most difficult part of solving a machine learning problem is often finding the right classifier or algorithm, as different estimators are better suited to different types of data and different problems. In that sense, the classification scores printed side by side can help in selecting the optimum model.