Comparing different Gender Classifier models
This example extends Example 2 by comparing four classifier models from the sklearn module: DecisionTreeClassifier, the Support Vector Classifier (SVC), the linear Perceptron model, and KNeighborsClassifier. The corresponding modules are imported at the top of the program.
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Reading the .csv file from a folder using pandas (read_csv opens and closes the file itself)
data = pd.read_csv('./MyData/gender_data.csv')
# Checking the input dataset
print('The dataset looks like:\n', data.head())
print('... ...\nNo. of rows and column:', data.shape)
#data.info()  # Uncomment to print the column types and non-null counts
# Creating two parts, X with attributes and y with genders
X = data.drop(['gender'], axis=1)
y = data['gender']
# Creating two random datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)
# Trying different Classifier models
model_tree = tree.DecisionTreeClassifier()
model_svc = SVC(gamma='auto') # Support Vector Classifier
model_per = Perceptron(max_iter=10000) # Add a max_iter value to remove the warning
model_KNeigh = KNeighborsClassifier()
# Training the above models with the training data: X_train, y_train
model_tree.fit(X_train, y_train)
model_svc.fit(X_train, y_train)
model_per.fit(X_train, y_train)
model_KNeigh.fit(X_train, y_train)
# Testing the above models with the test data: X_test, y_test
predict_tree = model_tree.predict(X_test)
acc_tree = accuracy_score(y_test, predict_tree) * 100
print('\nAccuracy for DecisionTreeClassifier: {0:.3f}'.format(acc_tree))
predict_svc = model_svc.predict(X_test)
acc_svc = accuracy_score(y_test, predict_svc) * 100
print('Accuracy for Support Vector Classifier: {0:.3f}'.format(acc_svc))
predict_per = model_per.predict(X_test)
acc_per = accuracy_score(y_test, predict_per) * 100
print('Accuracy for Linear Perceptron model: {0:.3f}'.format(acc_per))
predict_KNeigh = model_KNeigh.predict(X_test)
acc_KNeigh = accuracy_score(y_test, predict_KNeigh) * 100
print('Accuracy for KNeighborsClassifier: {0:.3f}'.format(acc_KNeigh))
# Selecting the best classifier from Decision Tree, SVC, Perceptron, and KNeighbors
index = np.argmax([acc_tree, acc_svc, acc_per, acc_KNeigh])
Compare_clfs = {0: 'DecisionTreeClassifier', 1: 'Support Vector Classifier', 2: 'Linear Perceptron', 3: 'KNeighborsClassifier'}
print('\nBest Gender Classifier is: {}'.format(Compare_clfs[index]))
Running this code produces output like this:
The dataset looks like:
   height  weight  shoe size  gender
0     172      85         90    male
1     161      64         64  female
2     183      75         64    male
3     191      82         61  female
4     178      56         84    male
... ...
No. of rows and column: (500, 4)
Accuracy for DecisionTreeClassifier: 80.000
Accuracy for Support Vector Classifier: 20.000
Accuracy for Linear Perceptron model: 40.000
Accuracy for KNeighborsClassifier: 40.000
Best Gender Classifier is: DecisionTreeClassifier
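The accuracy values in this output are all multiples of 20. That follows from the split: with 500 rows and test_size=0.01, only 5 rows are held back for testing, so each model can only score 0, 20, 40, 60, 80, or 100 percent. The short check below is a sketch that uses placeholder arrays of the same shape instead of the real gender_data.csv file (the zero-filled X and the alternating labels in y are assumptions made only for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the example dataset:
# 500 rows and 3 attributes. The values are dummies; only the row count matters.
X = np.zeros((500, 3))
y = np.array(['male', 'female'] * 250)

# Same split settings as in the program above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)

print('Training rows:', len(X_train))
print('Test rows:', len(X_test))

# With n test rows, accuracy can only take the values 0, 1/n, 2/n, ..., 1
possible = [100 * k / len(X_test) for k in range(len(X_test) + 1)]
print('Possible accuracy values (%):', possible)
```

A larger test_size, for example 0.2, would leave more rows for testing and give a finer-grained accuracy figure.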
Because train_test_split from sklearn.model_selection divides the rows into training and testing sets at random, the accuracy of each model depends on which rows land in each set. The call above fixes random_state=1, so the split is the same on every run; with a different random_state value, or a different number of rows, the models can reach different accuracy levels, and a different model may come out ahead. In the test run above, DecisionTreeClassifier was the best of the four.
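To see how much the result depends on the random split, one option is to repeat the split with several random_state values and look at the spread of the accuracy. The sketch below does this for the decision tree only. It does not read gender_data.csv; it uses a synthetic dataset from make_classification as a stand-in (an assumption), so the numbers it prints will not match the run shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the gender dataset (an assumption): 500 rows, 3 attributes
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

scores = []
for seed in range(10):
    # A different random_state gives a different train/test split each time
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.01, random_state=seed)
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)) * 100)

print('Accuracy per split:', scores)
print('Mean: {0:.1f}, standard deviation: {1:.1f}'.format(np.mean(scores), np.std(scores)))
```

The same loop can be wrapped around the other three models to check whether the ranking between them stays the same from one split to the next.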