Sep 16, 2018

Example 5: Simple Clustering and Visualisations

Using European Soccer Dataset from Kaggle


One of the important aspect of any data-centric analysis is visualisation the data, both in term of physical row-column contents or statistical graphical plots, to get the required insights for analysis. We have the European Soccer Database containing data of more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016. This example uses this dataset to show some simple steps for:
  • Exploring a dataset obtained from some source,
  • Cleaning or pre-processing the dataset for analytical use,
  • Some steps for predicting player performance using basic statistics, and
  • Some steps for grouping similar clusters using Machine Learning
  • Plotting some statistical parameters and clusterings for Visualisation
The soccer dataset, downloaded from the Kaggle site, is in the form of .sqlite format. Hence the in-built sqllite3 module in Python 3 is used for interacting with the local relational database. The modules pandas and numpy are used for data ingestion and manipulation, matplotlib for data visualisation, and specific methods from sklearn for Machine Learning. Also a custom function customplot (mentioned at the end of the page) is used for cluster plotting purpose.

The codes are run using PyCharm IDE, although Jupyter notebook is more helpful to execute section by section. Since I am using PyCharm, I have used # to stop a line from executing; you may delete the # to run that line.

import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
from customplot import *

# Step 1: Injesting data from European Soccer Database downloaded from Kaggle
db = sqlite3.connect('./MyData/database.sqlite')
df = pd.read_sql_query("select * from Player_Attributes", db)

# Step 2: Checking the data content
# ---------------------------------
#print(df.head())

print(df.columns)
#dsc = df.describe().transpose()
#print(dsc.head())

# Step 3: Data Cleaning: Removing rows with missing data
# ------------------------------------------------------
#isnull = df.isnull().any().any(), df.shape
#isnull = df.isnull().sum(axis=0)
#print(isnull)

rows = df.shape[0]
print("\nNo of rows before dropping nulls:", rows)    # Gives 183978 rows
# Dropping the rows with null values

df = df.dropna()
rows_new = df.shape[0]
print("No of rows after dropping nulls:", rows_new)     # Gives 180354 rows
#isnull = df.isnull().any().any(), df.shape
#print(isnull)   # Gives (False, (180354, 42))

rows_dropped = rows - rows_new
print("No of rows with nulls dropped:", rows_dropped)  # Gives 3624, so 3624 rows dropped with null values
df = df.reindex(np.random.permutation(df.index))
#print(df.head())

# Step 4: Decide on a required case say analyse on 'overall_rating' of a player
# Create a list of potential Features that you want to measure correlation with 'overall_rating'
# ----------------------------------------------------------------------------------------------

potentialFeatures = ['acceleration', 'curve', 'free_kick_accuracy', 'ball_control', 'shot_power', 'stamina']
print(35*'-')
for f in potentialFeatures:
    related = df['overall_rating'].corr(df[f])
    print("%s: %f" % (f, related))
# Gives this result:
#acceleration: 0.243998
#curve: 0.357566
#free_kick_accuracy: 0.349800
#ball_control: 0.443991
#shot_power: 0.428053
#stamina: 0.325606
# So 'overall_rating' is higher correlated with 'ball_control' (0.44) and 'shot_power' (0.43).

# Broadening the list of potential Features that can have correlation with 'overall_rating'

cols = [ 'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']

correlations = [df['overall_rating'].corr(df[f]) for f in cols]
#print(len(cols), len(correlations))

# Step 5: Data Visualization for correlation of the Features with 'overall_rating'
# Create a function for plotting a dataframe with string columns and numeric values
# ---------------------------------------------------------------------------------

def plot_dataframe(df, x_label, y_label): 
    color='blue'
    fig = plt.gcf()    # Using gcf plot of matplotlib
    fig.set_size_inches(21, 4)
    plt.ylabel(y_label, fontsize=14)
    plt.xlabel(x_label, fontsize=14)   

    ax = df.correlation.plot(linewidth=3.3, color=color)
    ax.set_xticks(df.index)
    ax.set_xticklabels(df.attributes, rotation=45)
    plt.grid(True)
    plt.show()

df2 = pd.DataFrame({'attributes': cols, 'correlation': correlations})
plot_dataframe(df2, 'Player\'s Attributes', 'Player\'s Overall Rating')

# Create Clusters with top features as per expert's choice that are essential for a player
# ----------------------------------------------------------------------------------------

select5features = ['gk_kicking', 'potential', 'marking', 'interceptions', 'standing_tackle']
df_select = df[select5features].copy(deep = True)
#print(df_select.head())
# Perform scaling on the dataframe containing the selected features

data = scale(df_select)
# Define number of clusters
noOfClusters = 4
# Train a Machine Learning model
model = KMeans(init='k-means++', n_clusters=noOfClusters, n_init=20).fit(data)

print(30*'_')
print("\nNo of players in each cluster")
print(30*'_')
kmclu = pd.value_counts(model.labels_, sort=False)
print(kmclu)

# Create a composite dataframe for plotting using custom function declared in customplot.py
# -----------------------------------------------------------------------------------------

P = pd_centers(featuresUsed=select5features, centers=model.cluster_centers_)
pplt = parallel_plot(P)
plt.show(pplt)


Textual Results:

Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
       'potential', 'preferred_foot', 'attacking_work_rate',
       'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes'],
      dtype='object')

No of rows before dropping nulls: 183978
No of rows after dropping nulls: 180354
No of rows with nulls dropped: 3624
-----------------------------------
acceleration: 0.243998
curve: 0.357566
free_kick_accuracy: 0.349800
ball_control: 0.443991
shot_power: 0.428053
stamina: 0.325606
_____________________________

No of players in each cluster
_____________________________
0    47071
1    43286
2    73628
3    16369
dtype: int64


Graphical Results:

1. On Correlation of Features:

Player's "Overall Ratings Correlated with Features" vs "Overall Ratings"

Analysis of Findings:

Now it is time to analyze what we plotted, the most important step in any Data Science analysis. We answer questions such as:

Qtn: Suppose you have to predict a player's overall rating. Which 5 player attributes would you ask for?

Ans: To anwer this question, we need to find which are the five features with highest correlation coefficients. We want to judge a player based on his playing capability in the field. So, after looking at the above plot, we would ask for the following 5 player attributes:
    - short_passing
    - long_passing
    - ball_control
    - shot_power
    - vision

We are ignoring higher value attributes "potential" and "reactions", as these do not actually show a player's playing capability in the field.

2. On Clustering of Features:

Plot of Clusters of the five Features selected by an imaginary expert

Analysis of Findings:

We can identify similar groups as follows:
  • Two groups are very similar except in gk_kicking - these players can coach each other on gk_kicking, where they differ.
    Group 0 and 2 have similarity in potential, marking, interceptions, and standing_tackle, except in gk_kicking, where they differ wide apart.
  • Two groups are somewhat similar to each other except in potential.
    Group 1 and 3 are somewhat similar to each other in gk_kicking, marking, interceptions, and standing_tackle, except in potential.

Details on the Custom Plot:

Custom functions in customplot.py used for above Cluster plot 

def pd_centers(featuresUsed, centers):
    from itertools import cycle, islice
    #from pandas.tools.plotting import parallel_coordinates
    from pandas.plotting import parallel_coordinates
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

    colNames = list(featuresUsed)
    colNames.append('prediction')

    # Zip with a column called 'prediction' (index)
    Z = [np.append(A, index) for index, A in enumerate(centers)]

    # Convert to pandas for plotting
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P

def parallel_plot(data):
    from itertools import cycle, islice
    #from pandas.tools.plotting import parallel_coordinates
    from pandas.plotting import parallel_coordinates
    import matplotlib.pyplot as plt

    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    plt.figure(figsize=(15,8)).gca().axes.set_ylim([-2.5,+2.5])
    parallel_coordinates(data, 'prediction', color = my_colors, marker='o')