Gender Classifier with .csv Dataset
This example extends Example 1 by reading the data directly from a .csv file in a folder; the file looks like the table shown in the image below. It uses the Python module pandas to read the .csv file. pandas, which provides easy-to-use data structures and data analysis tools, is installed with pip from the command prompt:
...>pip install pandas
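To make the expected file layout concrete, here is a minimal sketch that reads the same kind of data from an in-memory string instead of a file on disk. The column names and the five rows are taken from the output shown later in this example; the variable names csv_text and sample are only illustrative.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for gender_data.csv; the header and rows
# mirror the first five records printed in this example's output.
csv_text = """height,weight,shoe size,gender
179,82,45,male
187,72,44,male
180,61,40,male
156,56,38,female
162,64,42,female
"""

# read_csv accepts any file-like object, so StringIO works like an open file
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())
print('Sample shape:', sample.shape)
```

A real file with this header row and one comma-separated record per line should load the same way when its path is passed to pd.read_csv.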
In the last example, we tested the model's predictions with a small hard-coded dataset, X_test, which was an array. In this example, we use the train_test_split function from sklearn to split the dataset randomly into a training set and a testing set, so that the model is trained on one and tested on the other.
Additionally, this example shows the Accuracy Score and Confusion Matrix of the DecisionTreeClassifier() model.
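Before the full listing, the random split can be sketched on a small toy dataset. The ten rows below are made up for illustration (they are not from gender_data.csv), and the names toy, X_toy and y_toy are hypothetical; only train_test_split and its test_size and random_state parameters come from the example itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 10 made-up rows with the same column layout
toy = pd.DataFrame({
    'height': [179, 187, 180, 156, 162, 170, 165, 190, 158, 175],
    'weight': [82, 72, 61, 56, 64, 70, 58, 90, 52, 77],
    'shoe size': [45, 44, 40, 38, 42, 41, 39, 46, 37, 43],
    'gender': ['male', 'male', 'male', 'female', 'female',
               'male', 'female', 'male', 'female', 'male'],
})
X_toy = toy.drop(['gender'], axis=1)   # attributes
y_toy = toy['gender']                  # labels

# test_size=0.2 holds out 20% of the 10 rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
print('Train shape:', X_tr.shape)
print('Test shape:', X_te.shape)
# Attributes and labels are split on the same row indices
print(X_te.index.equals(y_te.index))
```

With test_size=0.01 on the 500-row dataset, the same call holds out 5 rows, which is why five predictions appear in the output of the full example.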
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import pandas as pd
# Reading the dataset from a csv file using pandas
# pd.read_csv accepts the path directly, so no file handle is left open
data = pd.read_csv('./MyData/gender_data.csv')
print(data.head())
print('Original dataset shape: ', data.shape)
# Creating two parts, X with attributes and y with genders
X = data.drop(['gender'], axis=1)
y = data['gender']
# Creating two random datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)
# Uncomment any of the lines below to inspect the split
#print(X_train.head())
#print(y_train.head())
#print('X_train shape: ', X_train.shape)
#print('y_train shape: ', y_train.shape)
#print(X_test.head())
#print(y_test.head())
#print('X_test shape: ', X_test.shape)
#print('y_test shape: ', y_test.shape)
# Training a decision tree on the training set and predicting the test set
model = tree.DecisionTreeClassifier()
model.fit(X_train, y_train)
predict = model.predict(X_test)
print("\nThe test data:\n", X_test, "\nbelongs to:")
print(predict)
# Finding prediction accuracy of the model
acc = accuracy_score(y_test, predict)
print("\nAccuracy Score:", acc)
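# Finding Confusion matrix
# Rows and columns follow the sorted class labels,
# so 'False' corresponds to 'female' and 'True' to 'male'
conf = pd.DataFrame(
    confusion_matrix(y_test, predict),
    columns=['Predicted False', 'Predicted True'],
    index=['Actual False', 'Actual True']
)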
print("\nConfusion Matrix:\n", conf)
Running this code produces output like the following:
   height  weight  shoe size  gender
0     179      82         45    male
1     187      72         44    male
2     180      61         40    male
3     156      56         38  female
4     162      64         42  female
Original dataset shape:  (500, 4)

The test data:
      height  weight  shoe size
304     163      88         90
340     184      80         79
47      159      70         39
67      161      77         57
479     174      61         58
belongs to:
['female' 'male' 'female' 'female' 'male']

Accuracy Score: 0.4

Confusion Matrix:
               Predicted False  Predicted True
Actual False                1               1
Actual True                 2               1
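The accuracy score and the confusion matrix describe the same five predictions, and the link between them can be checked by hand. The sketch below uses the predictions printed above; the order of the actual labels is not shown in the output, so the actual list is a hypothetical ordering chosen only to be consistent with the confusion matrix. scikit-learn orders the matrix by sorted class label, so 'False' stands for 'female' and 'True' for 'male' here.

```python
# Hypothetical actual labels, ordered to agree with the confusion matrix;
# the predictions are the ones printed in the output above.
actual    = ['female', 'female', 'male', 'male', 'male']
predicted = ['female', 'male', 'female', 'female', 'male']

labels = sorted(set(actual))          # ['female', 'male']
matrix = [[0, 0], [0, 0]]             # rows: actual, columns: predicted
for a, p in zip(actual, predicted):
    matrix[labels.index(a)][labels.index(p)] += 1

# Correct predictions sit on the diagonal of the matrix
correct = matrix[0][0] + matrix[1][1]
accuracy = correct / len(actual)
print(matrix)    # [[1, 1], [2, 1]]
print(accuracy)  # 0.4
```

Two of the five test rows lie on the diagonal, which gives the accuracy of 2 / 5 = 0.4 reported by accuracy_score.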
In the example above, the section that creates the random training and testing datasets contains several print statements that are commented out. Remove the # from any line you want to run to inspect the training and testing datasets.