Aug 25, 2018

Example 3: Gender Classifier Comparison

Comparing different Gender Classifier models 


This example extends Example 2 by comparing four different Gender Classifier models from the sklearn module, namely DecisionTreeClassifier, Support Vector Classifier (SVC), Linear Perceptron, and KNeighborsClassifier. Please see the respective imports at the top of the program.


from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Reading .csv file from a folder using pandas
data = pd.read_csv('./MyData/gender_data.csv')

# Checking the input dataset
print('The dataset looks like:\n', data.head())
print('... ...\nNo. of rows and columns:', data.shape)
#print(data.info())

# Creating two parts, X with attributes and y with genders
X = data.drop(['gender'], axis=1)
y = data['gender']

# Creating two random datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)

# Trying different Classifier models
model_tree = tree.DecisionTreeClassifier()
model_svc = SVC(gamma='auto')  # Support Vector Classifier
model_per = Perceptron(max_iter=10000)  # Add a max_iter value to remove the warning
model_KNeigh = KNeighborsClassifier()

# Training the above models with the training data: X_train, y_train
model_tree.fit(X_train, y_train)
model_svc.fit(X_train, y_train)
model_per.fit(X_train, y_train)
model_KNeigh.fit(X_train, y_train)

# Testing the above models with the test data: X_test, y_test
predict_tree = model_tree.predict(X_test)
acc_tree = accuracy_score(y_test, predict_tree) * 100
print('\nAccuracy for DecisionTreeClassifier: {0:.3f}'.format(acc_tree))

predict_svc = model_svc.predict(X_test)
acc_svc = accuracy_score(y_test, predict_svc) * 100
print('Accuracy for Support Vector Classifier: {0:.3f}'.format(acc_svc))

predict_per = model_per.predict(X_test)
acc_per = accuracy_score(y_test, predict_per) * 100
print('Accuracy for Linear Perceptron model: {0:.3f}'.format(acc_per))

predict_KNeigh = model_KNeigh.predict(X_test)
acc_KNeigh = accuracy_score(y_test, predict_KNeigh) * 100
print('Accuracy for KNeighborsClassifier: {0:.3f}'.format(acc_KNeigh))

# Selecting the best classifier among Decision Tree, SVC, Perceptron, and KNeighbors
index = np.argmax([acc_tree, acc_svc, acc_per, acc_KNeigh])
Compare_clfs = {0: 'DecisionTreeClassifier', 1: 'Support Vector Classifier', 2: 'Linear Perceptron', 3: 'KNeighborsClassifier'}
print('\nBest Gender Classifier is: {}'.format(Compare_clfs[index]))


On running this code, the result looks like this:

The dataset looks like:
    height  weight  shoe size  gender
0     172      85         90    male
1     161      64         64  female
2     183      75         64    male
3     191      82         61  female
4     178      56         84    male
... ...
No. of rows and columns: (500, 4)

Accuracy for DecisionTreeClassifier: 80.000
Accuracy for Support Vector Classifier: 20.000
Accuracy for Linear Perceptron model: 40.000
Accuracy for KNeighborsClassifier: 40.000


Best Gender Classifier is: DecisionTreeClassifier


Since the training and testing datasets are created by the random splitter train_test_split from sklearn.model_selection, the models can show different accuracies whenever the split changes (for example, with a different random_state or a larger test_size), so a different model may come out on top in different test runs. In the test run above, DecisionTreeClassifier performed better than the others.
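
As a rough illustration of this point, here is a minimal sketch (not part of the program above; the number of splits and the test_size are arbitrary choices) that averages each model's accuracy over several random splits instead of judging them on a single 5-row test set. It assumes X, y and the four model objects defined in the program above.

models = {'DecisionTreeClassifier': model_tree,
          'Support Vector Classifier': model_svc,
          'Linear Perceptron': model_per,
          'KNeighborsClassifier': model_KNeigh}

for name, clf in models.items():
    scores = []
    for seed in range(10):  # ten different random splits (arbitrary choice)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        clf.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    print('{0}: mean accuracy {1:.3f}'.format(name, np.mean(scores)))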

Aug 22, 2018

Example 2: Gender Classifier with CSV File

Gender Classifier with .csv Dataset


This example extends Example 1 by reading the data directly from a .csv file in a folder, which looks like the table shown in the image below. It uses the Python module pandas to read the physical .csv file. The pandas module, which provides easy-to-use data structures and data analysis tools, is installed through pip at the command prompt as:
...>pip install pandas
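
For reference, the first few lines of gender_data.csv would look roughly like this (reconstructed from the data.head() output shown later in this post; your own file may of course differ):

height,weight,shoe size,gender
179,82,45,male
187,72,44,male
180,61,40,male
156,56,38,female
162,64,42,female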


In the last example we tested the prediction capacity of the model with a small hard-coded test dataset X_test, which was an array. In this example we use the train_test_split function of the sklearn module to split the dataset randomly into separate training and testing sets for the model.
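
To make the splitting step concrete, here is a tiny standalone sketch with a made-up 100-row frame (not the real gender_data.csv) showing how test_size controls the share of rows kept aside for testing:

from sklearn.model_selection import train_test_split
import pandas as pd

# A made-up 100-row frame standing in for the real dataset
demo = pd.DataFrame({'height': range(150, 250), 'weight': range(50, 150),
                     'shoe size': range(30, 130), 'gender': ['male', 'female'] * 50})
X_demo = demo.drop(['gender'], axis=1)
y_demo = demo['gender']

# test_size=0.05 keeps 5% of the rows aside for testing; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.05, random_state=1)
print(X_tr.shape, X_te.shape)  # (95, 3) (5, 3)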

Additionally, this example shows the Accuracy Score and Confusion Matrix of the DecisionTreeClassifier() model.
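
As a quick refresher, this small sketch with made-up labels shows what accuracy_score and confusion_matrix return, which should make the output at the end of this post easier to read:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ['male', 'female', 'female', 'male', 'male']
y_pred = ['male', 'female', 'male', 'male', 'female']

print(accuracy_score(y_true, y_pred))  # 0.6 -> 3 of the 5 labels match
# Rows are actual classes, columns are predicted classes,
# in sorted label order: ['female', 'male']
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]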

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pandas as pd

# Reading the dataset from a csv file using pandas
data = pd.read_csv('./MyData/gender_data.csv')
print(data.head())
print('Original dataset shape: ', data.shape)

# Creating two parts, X with attributes and y with genders
X = data.drop(['gender'], axis=1)
y = data['gender']

# Creating two random datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)
#print(X_train.head())
#print(y_train.head())
#print('Train Dataset shape: ', X_train.shape)
#print('Train Dataset shape: ', y_train.shape)
#print(X_test.head())
#print(y_test.head())
#print('Test Dataset shape: ', X_test.shape)
#print('Test Dataset shape: ', y_test.shape)


model = tree.DecisionTreeClassifier()

model.fit(X_train, y_train)

predict = model.predict(X_test)

print("\nThe test data:\n", X_test, "\nbelongs to:")
print(predict)

# Finding prediction accuracy of the model
acc = accuracy_score(y_test, predict)
print("\nAccuracy Score:", acc)

# Finding Confusion matrix
conf = pd.DataFrame(
    confusion_matrix(y_test, predict),
    columns=['Predicted female', 'Predicted male'],
    index=['Actual female', 'Actual male']
)
print("\nConfusion Matrix:\n", conf)


On running this code, the result looks like this:

   height  weight  shoe size  gender
0     179      82         45    male
1     187      72         44    male
2     180      61         40    male
3     156      56         38  female
4     162      64         42  female
Original dataset shape:  (500, 4)

The test data:
      height  weight  shoe size
304     163      88         90
340     184      80         79
47      159      70         39
67      161      77         57
479     174      61         58
belongs to:
['female' 'male' 'female' 'female' 'male']

Accuracy Score: 0.4

Confusion Matrix:
                 Predicted female  Predicted male
Actual female                   1               1
Actual male                     2               1


In the above example, in the section that creates the random training and testing datasets, a number of print statements are commented out. Remove the # from any line you want to run to inspect the training and testing datasets.

Aug 10, 2018

Example 1: Simple Gender Classifier

Gender Classifier with Array Dataset


A simple "Hello World" kind of Machine Learning example to show how it works. Suppose we have a dataset containing height, weight, shoe size, and respective genders of customers of a shoe shop as shown in the image below. From this data we can train the computer to predict the genders of future customers using Machine Learning algorithms.
Training the machine with a training dataset to learn to identify a male and female customer based on some features

To feed this data to the Machine Learning algorithm, we need to prepare it in a format the algorithm can use. We first divide the data into two parts - features and labels. Features contain the attributes that describe an object, while labels identify the object itself. In our case a customer is an object with the features (or attributes) height, weight, and shoe size. A particular combination of these attributes determines whether the customer's label (or gender) is male or female. So, once the computer learns which combinations of height, weight, and shoe size values belong to which gender, it can predict the gender of future customers as male or female just by reading their height, weight, and shoe size details.

So, for the above example, we divide the dataset into features and labels as the following arrays:

X = [[179, 82, 45], [187, 72, 44], [180, 61, 40], [156, 56, 38], [162, 64, 42], [...], [...], [...]]
y = ['male', 'male', 'male', 'female', 'female', ..., ..., ...]

Here, X denotes the features and y denotes the labels.
Please note that it is a general convention to represent features with an uppercase X because the feature values are independent, and to represent the label with a lowercase y because the label values are dependent.

We then feed the X and y data to a classifier algorithm, or model, as shown in the Python program below. The program does two things:
  1. First, it trains the DecisionTreeClassifier() model with the X and y data so that it learns which kinds of attribute values - height, weight, and shoe size - belong to a male and which to a female.
  2. Second, once the computer has learned the attributes of male and female customers, we test its prediction capability with some new attribute values to check whether each customer is male or female.
Alright! Let's start. The code goes like this. Please see the descriptive comments in the program to understand the code.
Note: The DecisionTreeClassifier() model belongs to the sklearn module, which provides simple and efficient tools for data mining and data analysis; it is installed through pip at the command prompt as: ...>pip install -U scikit-learn

# Gender Classifier: Machine Learning for classifying gender
# Example with hard coded data

from sklearn import tree

# Training Dataset: [height, weight, shoe_size] of customers
X = [[179, 82, 45], [187, 72, 44], [180, 61, 40], [156, 56, 38], [162, 64, 42], [189, 91, 48], [172, 63, 40],
     [175, 69, 42], [161, 56, 38], [173, 77, 44], [184, 88, 45], [181, 80, 44], [177, 70, 43], [160, 60, 38],
     [154, 54, 40], [166, 65, 40], [190, 90, 47], [175, 64, 39], [177, 70, 40], [159, 63, 40], [171, 75, 42],
     [181, 85, 43], [180, 83, 46], [188, 74, 45], [179, 62, 41], [156, 56, 38], [162, 64, 42], [189, 91, 48]]
# Gender of the customers
y = ['male', 'male', 'male', 'female', 'female', 'male', 'female',
     'female', 'female', 'male', 'male', 'male', 'female', 'male',
     'female', 'female', 'male', 'female', 'male', 'female', 'female',
     'male', 'male', 'male', 'male', 'female', 'female', 'male']

# Using the Classifier model
model = tree.DecisionTreeClassifier()

# Training the machine with our dataset
model_training = model.fit(X, y)

# Testing gender prediction with some new data
X_test = [[160, 60, 42], [187, 73, 40], [175, 65, 38]]
prediction = model.predict(X_test)

# Providing the prediction result
print("The given new data: \n",
X_test, "\nbelongs to:")
print(prediction)


On running this code, the result looks like this:

The new data:
 [[160, 60, 42], [187, 73, 40], [175, 65, 38]]
belongs to:
['female' 'male' 'male']

So, based on the training dataset, the computer predicts that the test data represent a female, a male, and a male customer.

This is a simple demonstration of how Machine Learning with Python works.