Aug 22, 2018

Example 2: Gender Classifier with CSV File

Gender Classifier with .csv Dataset


This example extends Example 1 by reading the data directly from a .csv file in a folder; the file looks like the table shown in the image below. It uses a Python module called pandas to read the .csv file. The pandas module, which provides easy-to-use data structures and data analysis tools, is installed through pip at the command prompt:
...>pip install pandas
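
To check that pandas is installed and to see what read_csv returns, here is a minimal sketch that does not need the real file. It assumes a small in-memory stand-in (the hypothetical csv_text variable) whose rows are copied from the first five rows printed in this example's output; io.StringIO simply lets read_csv treat that text like a file.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the .csv file (not the real dataset).
# The five rows are the ones shown by data.head() in this example's output.
csv_text = """height,weight,shoe size,gender
179,82,45,male
187,72,44,male
180,61,40,male
156,56,38,female
162,64,42,female
"""

# read_csv accepts a file path or a file-like object such as StringIO
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())
print('Shape:', data.shape)            # (rows, columns)
print('Columns:', list(data.columns))
```

The first line of the text becomes the column names, and each following line becomes one row of the resulting DataFrame.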


In the last example we tested the model's predictions on a small hard-coded dataset, X_test, which was an array. In this example we use the train_test_split function from sklearn to split the dataset randomly into a training set, used to fit the model, and a testing set, used to evaluate it.
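
The split can be tried on its own before touching the real data. The sketch below assumes a hypothetical ten-row toy table (the toy DataFrame and its values are made up for illustration and are not the gender_data.csv file) and holds out 20% of the rows for testing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'height': [179, 187, 180, 156, 162, 170, 165, 175, 158, 183],
    'weight': [82, 72, 61, 56, 64, 68, 59, 77, 52, 85],
    'gender': ['male', 'male', 'male', 'female', 'female',
               'male', 'female', 'male', 'female', 'male'],
})

X_toy = toy.drop(['gender'], axis=1)   # attributes
y_toy = toy['gender']                  # labels

# random_state fixes the shuffle so the same split is produced on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1)

print('Train shapes:', X_tr.shape, y_tr.shape)
print('Test shapes: ', X_te.shape, y_te.shape)

# Rows and labels stay paired: both test parts carry the same row indexes
print(list(X_te.index) == list(y_te.index))
```

Every row ends up in exactly one of the two parts, and the attributes and the labels are shuffled together so each row keeps its own label.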

Additionally, this example shows the Accuracy Score and Confusion Matrix of the DecisionTreeClassifier() model.
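
As a quick illustration of these two measures, here is a small sketch on hand-written labels. The lists y_true and y_hat are hypothetical values chosen for illustration, and passing labels= to confusion_matrix is one way to fix the order of the rows and columns so they can be named after the classes.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true = ['female', 'male', 'male', 'female', 'male']
y_hat = ['female', 'male', 'female', 'female', 'male']

# Fraction of predictions that match the true labels
acc = accuracy_score(y_true, y_hat)

# Rows are actual classes, columns are predicted classes, in the given order
labels = ['female', 'male']
cm = confusion_matrix(y_true, y_hat, labels=labels)

table = pd.DataFrame(
    cm,
    index=['Actual ' + name for name in labels],
    columns=['Predicted ' + name for name in labels],
)

print('Accuracy Score:', acc)
print(table)
```

Four of the five predictions match, and the single mistake (a male predicted as female) appears in the 'Actual male' row under 'Predicted female'. The full listing for this example applies the same two functions to the predictions of the model.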

from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pandas as pd

# Reading the dataset from a csv file using pandas
# (read_csv takes the path directly, so no file handle is left open)
data = pd.read_csv('./MyData/gender_data.csv')
print(data.head())
print('Original dataset shape: ', data.shape)

# Creating two parts, X with attributes and y with genders
X = data.drop(['gender'], axis=1)
y = data['gender']

# Creating two random datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)
#print(X_train.head())
#print(y_train.head())
#print('Train Dataset shape: ', X_train.shape)
#print('Train Dataset shape: ', y_train.shape)
#print(X_test.head())
#print(y_test.head())
#print('Test Dataset shape: ', X_test.shape)
#print('Test Dataset shape: ', y_test.shape)


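# Creating the decision tree model
model = tree.DecisionTreeClassifier()

# Training the model on the training dataset
model.fit(X_train, y_train)

# Predicting the genders for the testing dataset
predict = model.predict(X_test)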

print("\nThe test data:\n", X_test, "\nbelongs to:")
print(predict)

# Finding prediction accuracy of the model
acc = accuracy_score(y_test, predict)
print("\nAccuracy Score:", acc)

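# Finding Confusion matrix
# Note: confusion_matrix sorts the class labels, so the 'False' row and column
# count 'female' and the 'True' row and column count 'male'
conf = pd.DataFrame(
    confusion_matrix(y_test, predict),
    columns=['Predicted False', 'Predicted True'],
    index=['Actual False', 'Actual True']
)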
print("\nConfusion Matrix:\n", conf)


Running this code produces the following output:

   height  weight  shoe size  gender
0     179      82         45    male
1     187      72         44    male
2     180      61         40    male
3     156      56         38  female
4     162      64         42  female
Original dataset shape:  (500, 4)

The test data:
      height  weight  shoe size
304     163      88         90
340     184      80         79
47      159      70         39
67      161      77         57
479     174      61         58
belongs to:
['female' 'male' 'female' 'female' 'male']

Accuracy Score: 0.4

Confusion Matrix:
               Predicted False  Predicted True
Actual False                1               1
Actual True                 2               1
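
The Accuracy Score can be checked by hand against the Confusion Matrix: correct predictions sit on the diagonal, so the accuracy is the diagonal sum divided by the total count. A minimal sketch, with the four counts copied from the output above into a plain list (the variable names are only illustrative):

```python
# Confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes
matrix = [
    [1, 1],
    [2, 1],
]

# Correct predictions are on the diagonal (actual class == predicted class)
correct = sum(matrix[i][i] for i in range(len(matrix)))

# Every test row is counted exactly once somewhere in the matrix
total = sum(sum(row) for row in matrix)

accuracy = correct / total
print('Correct:', correct, 'of', total)
print('Accuracy:', accuracy)
```

This gives 2 correct predictions out of 5 test rows, which matches the Accuracy Score of 0.4 printed by accuracy_score.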


In the above example, in the section that generates the random training and testing datasets, several print statements are commented out. Remove the # from any line you want to run in order to inspect the training and testing datasets.
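
The shapes those print lines should report can be estimated beforehand. The sketch below assumes that scikit-learn rounds the number of test rows up to a whole row when test_size is a fraction (this is our understanding of its behaviour and is worth confirming in the documentation); the row count and test_size are the ones used in this example.

```python
import math

n_rows = 500        # rows in the original dataset, from data.shape
n_features = 3      # height, weight, shoe size
test_size = 0.01    # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print('X_train shape:', (n_train, n_features))
print('y_train shape:', (n_train,))
print('X_test shape: ', (n_test, n_features))
print('y_test shape: ', (n_test,))
```

The five test rows agree with the five rows listed under "The test data" in the output. A test set this small makes the Accuracy Score vary a lot between runs with different random_state values.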