Gender Classifier with Array Dataset
A simple "Hello World" kind of Machine Learning example to show how it works. Suppose we have a dataset containing height, weight, shoe size, and respective genders of customers of a shoe shop as shown in the image below. From this data we can train the computer to predict the genders of future customers using Machine Learning algorithms.
Training the machine with a training dataset to learn to identify a male and female customer based on some features
To feed this data to the Machine Learning algorithm, we need to prepare the data in a format that can be used in the algorithm. We first divide the data in two parts - features and labels. Features contain the attributes that can be used to describe an object. Labels define the object itself. In our case customer is an object with the features (or attributes) - height, weight, and shoe size. A particular combination of these attributes defines whether the customer label (or gender) is male or female. So, once the computer learns the nature of combination of height, weight, and shoe size values that belongs to a specific gender of the customer, it can predict the gender of future customers as male or female by reading the height, weight, and shoe size details.
So for the above example, we divide the dataset into features and labels as following arrays:
X = [[179, 82, 45], [187, 72, 44], [180, 61, 40], [156, 56, 38], [162, 64, 42], [...], [...], [...]]
y = ['male', 'male', 'male', 'female', 'female', ..., ..., ...]
Where, X denotes features and y denotes labels.
Please note that it is a general convention that features are represented in capital X because feature values are independent. And label is represented in small y because label values are dependent.
We then feed the X and y data to a classifier algorithm or model as shown in the Python program below. The program does two things:
- First, trains the model DecisionTreeClassifier() with the X and y data to learn which kind of attributes - height, weight, and shoe size, could belongs to male or female.
- Second, once the computer learns about the attributes of male and female customers, we test its capability of prediction with a some attributes to check whether a customer is male or female.
Note: The DecisionTreeClassifier() model belongs to the sklearn module, a simple and efficient tools for data mining and data analysis, is installed through the pip package in command prompt as: ...>pip install -U scikit-learn
# Gender Classifier: Machine Learning for classifying gender
# Example with hard coded data
from sklearn import tree
# Training Dataset: [height, weight, shoe_size] of customers
X = [[179, 82, 45], [187, 72, 44], [180, 61, 40], [156, 56, 38], [162, 64, 42], [189, 91, 48], [172, 63, 40],
[175, 69, 42], [161, 56, 38], [173, 77, 44], [184, 88, 45], [181, 80, 44], [177, 70, 43], [160, 60, 38],
[154, 54, 40], [166, 65, 40], [190, 90, 47], [175, 64, 39], [177, 70, 40], [159, 63, 40], [171, 75, 42],
[181, 85, 43], [180, 83, 46], [188, 74, 45], [179, 62, 41], [156, 56, 38], [162, 64, 42], [189, 91, 48]]
# Gender of the customers
y = ['male', 'male', 'male', 'female', 'female', 'male', 'female',
'female', 'female', 'male', 'male', 'male', 'female', 'male',
'female', 'female', 'male', 'female', 'male', 'female', 'female',
'male', 'male', 'male', 'male', 'female', 'female', 'male']
# Using the Classifier model
model = tree.DecisionTreeClassifier()
# Training the machine with our dataset
model_training = model.fit(X, y)
# Testing gender prediction with some new data
X_test = [[160, 60, 42], [187, 73, 40], [175, 65, 38],]
prediction = model.predict(X_test)
# Providing the prediction result
print("The given new data: \n", X_test, "\nbelongs to:")
print(prediction)
On running this code, the result comes as this:
The new data:
[[160, 60, 42], [187, 73, 40], [175, 65, 38]]
belongs to:
['female' 'male' 'male']
So the based on the training dataset, the computer predicts the test dataset to represent female, male, and male customer.
This is a simple demonstration on how Machine Learning through Python works.