Aug 25, 2018

Example 3: Gender Classifiers Comparison

Comparing different Gender Classifier models 


This example extends Example 2 by comparing four gender classifier models from the sklearn package: DecisionTreeClassifier, Support Vector Classifier (SVC), the linear Perceptron model, and KNeighborsClassifier. The corresponding imports appear at the top of the program.


from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Reading the .csv file from a folder using pandas
data = pd.read_csv('./MyData/gender_data.csv')

# Checking the input dataset
print('The dataset looks like:\n', data.head())
print('... ...\nNo. of rows and column:', data.shape)
# data.info()

# Creating two parts, X with attributes and y with genders
X = data.drop(['gender'], axis=1)
y = data['gender']

# Creating two random datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)

# Trying different Classifier models
model_tree = tree.DecisionTreeClassifier()
model_svc = SVC(gamma='auto')  # Support Vector Classifier
model_per = Perceptron(max_iter=10000)  # Add a max_iter value to remove the warning
model_KNeigh = KNeighborsClassifier()

# Training the above models with the training data: X_train, y_train
model_tree.fit(X_train, y_train)
model_svc.fit(X_train, y_train)
model_per.fit(X_train, y_train)
model_KNeigh.fit(X_train, y_train)

# Testing the above models with the test data: X_test, y_test
predict_tree = model_tree.predict(X_test)
acc_tree = accuracy_score(y_test, predict_tree) * 100
print('\nAccuracy for DecisionTreeClassifier: {0:.3f}'.format(acc_tree))

predict_svc = model_svc.predict(X_test)
acc_svc = accuracy_score(y_test, predict_svc) * 100
print('Accuracy for Support Vector Classifier: {0:.3f}'.format(acc_svc))

predict_per = model_per.predict(X_test)
acc_per = accuracy_score(y_test, predict_per) * 100
print('Accuracy for Linear Perceptron model: {0:.3f}'.format(acc_per))

predict_KNeigh = model_KNeigh.predict(X_test)
acc_KNeigh = accuracy_score(y_test, predict_KNeigh) * 100
print('Accuracy for KNeighborsClassifier: {0:.3f}'.format(acc_KNeigh))

# Selecting the best classifier from Decision Tree, SVC, Perceptron, and KNeighbors
# (on a tie, np.argmax returns the first of the tied models)
index = np.argmax([acc_tree, acc_svc, acc_per, acc_KNeigh])
Compare_clfs = {0: 'DecisionTreeClassifier', 1: 'Support Vector Classifier', 2: 'Linear Perceptron', 3: 'KNeighborsClassifier'}
print('\nBest Gender Classifier is: {}'.format(Compare_clfs[index]))


Running this code produces output like the following:

The dataset looks like:
    height  weight  shoe size  gender
0     172      85         90    male
1     161      64         64  female
2     183      75         64    male
3     191      82         61  female
4     178      56         84    male
... ...
No. of rows and column: (500, 4)

Accuracy for DecisionTreeClassifier: 80.000
Accuracy for Support Vector Classifier: 20.000
Accuracy for Linear Perceptron model: 40.000
Accuracy for KNeighborsClassifier: 40.000

Best Gender Classifier is: DecisionTreeClassifier


Because train_test_split from sklearn.model_selection assigns rows to the training and testing datasets at random, the accuracies depend on the split: a different random_state value, or a dataset with a different number of rows, can give different accuracy levels and so show a different model as the best one. With test_size=0.01 on 500 rows, the test set contains only 5 rows, so each accuracy can only move in steps of 20 points. In the above test run, DecisionTreeClassifier came out ahead of the other three models.
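
To see how much the winner depends on the split, the comparison can be repeated over several random_state values. The following is only a rough sketch under stated assumptions: gender_data.csv is not reproduced here, so the program builds a hypothetical stand-in table of 500 random rows with the same column names, and the helper names make_models and best_model_for_seed are made up for this illustration. The decision tree is also given random_state=0 (not in the original program) so that repeated runs of the sketch behave the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: random stand-in for gender_data.csv (500 rows, same column names)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'height': rng.integers(150, 200, size=500),
    'weight': rng.integers(50, 100, size=500),
    'shoe size': rng.integers(60, 95, size=500),
    'gender': rng.choice(['male', 'female'], size=500),
})
X = data.drop(['gender'], axis=1)
y = data['gender']


def make_models():
    """Return fresh, untrained copies of the four classifiers (hypothetical helper)."""
    return {
        'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
        'Support Vector Classifier': SVC(gamma='auto'),
        'Linear Perceptron': Perceptron(max_iter=10000),
        'KNeighborsClassifier': KNeighborsClassifier(),
    }


def best_model_for_seed(seed):
    """Split with the given seed, train every model, return (winner, scores)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.01, random_state=seed)
    scores = {}
    for name, model in make_models().items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test)) * 100
    winner = max(scores, key=scores.get)  # first model with the top score
    return winner, scores


# Repeat the comparison for five different splits of the same data
results = {seed: best_model_for_seed(seed) for seed in range(5)}
for seed, (winner, scores) in results.items():
    print('random_state={}: best = {}, accuracies = {}'.format(seed, winner, scores))
```

Each split leaves only 5 rows for testing, so every accuracy in the printed dictionaries is a multiple of 20. If the reported best model changes from one random_state to the next, the ranking is probably reflecting the split more than the models.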