Machine Learning from the Titanic Disaster
This example walks through a complete survival-prediction analysis of the infamous Titanic sinking. The analysis is carried out in a Jupyter Notebook.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class passengers.
In this analysis we try to understand what sorts of people were likely to survive, and we apply the tools of machine learning to predict which passengers survived the tragedy.
The training and testing datasets with the passenger details are downloaded from the Kaggle site. The training dataset contains the survival outcome along with the passenger details, whereas the testing dataset does not contain the survival outcome; that is what we have to predict.
We use the Pandas and NumPy libraries to analyse the training and testing CSV datasets.
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
train_data = pd.read_csv('./MLData/01_Titanic_Survivals/train.csv')
test_data = pd.read_csv('./MLData/01_Titanic_Survivals/test.csv')
full_data = [train_data, test_data]
train_data.info()
print()
test_data.info()
In [2]:
train_data.head(10)
Out[2]:
In [3]:
test_data.head(3)
Out[3]:
In [4]:
train_data.describe()
Out[4]:
In [5]:
test_data.describe()
Out[5]:
In [6]:
train_data.isnull().sum()
Out[6]:
In [7]:
test_data.isnull().sum()
Out[7]:
Checking the relation between passenger class (Pclass) and survival (Survived) in the training dataset.
In [8]:
print(train_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())
In [9]:
print(train_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean())
Column SibSp means Number of Siblings/Spouses Aboard
Column Parch means Number of Parents/Children Aboard
With the number of siblings/spouses (SibSp) and the number of parents/children (Parch) we can create a new feature called FamilySize.
Selecting data by row numbers (.iloc)
Selecting data by label or by a conditional statement (.loc)
(See the sketch below for both selection styles.)
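A minimal sketch of the two selection styles on the training dataset loaded above (the column choices here are only illustrative):

# Positional selection with .iloc: first three rows, first four columns
print(train_data.iloc[0:3, 0:4])
# Label/conditional selection with .loc: Name and Survived of first-class passengers
print(train_data.loc[train_data['Pclass'] == 1, ['Name', 'Survived']].head())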
In [10]:
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
print(train_data[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())
Since FamilySize counts the passenger as well, a passenger with FamilySize greater than 1 is not alone;
a passenger with FamilySize equal to 1 is travelling alone.
In [11]:
for dataset in full_data:
    dataset['IsNotAlone'] = 0
    # FamilySize counts the passenger, so anyone with FamilySize > 1 is not alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsNotAlone'] = 1
print(train_data[['IsNotAlone', 'Survived']].groupby(['IsNotAlone'], as_index=False).mean())
The column Embarked means Port of Embarkation.
Embarkation is the process of loading passengers onto a ship or an airplane.
Values are: C = Cherbourg, Q = Queenstown, S = Southampton
(The Embarked column has some missing values, so we fill those with the most frequent value ('S'), found with the following code.)
In [12]:
train_data['Embarked'].value_counts()
Out[12]:
In [13]:
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
print(train_data[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())
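As an aside, instead of hard-coding 'S' the most frequent port can be looked up programmatically; a minimal sketch, assuming the same train_data and full_data as above:

# mode() returns the most frequent value(s); take the first one ('S' for this dataset)
most_common_port = train_data['Embarked'].mode()[0]
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna(most_common_port)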
Can Fare play a role in survival?
Fare also has some missing values, which we replace with the median value, and we then categorize it into 4 ranges.
Here we use the qcut function from the pandas library. qcut is a quantile-based discretization function: it discretizes a variable into equal-sized buckets based on rank or on sample quantiles.
In [14]:
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train_data['Fare'].median())
train_data['CategoricalFare'] = pd.qcut(train_data['Fare'], 4)
print(train_data[['CategoricalFare', 'Survived']].groupby(['CategoricalFare'], as_index=False).mean())
Age could play a role in survival: children and the elderly may have been given preference for the lifeboats and so were more likely to survive.
But this feature has plenty of missing values. To fill them we generate random ages between (mean - std. dev.) and (mean + std. dev.), and then categorize the ages into 5 ranges.
Use the cut function from the pandas library when you need to segment and sort data values into bins. It is also useful for going from a continuous variable to a categorical variable (see the sketch below contrasting cut and qcut).
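A minimal sketch of the difference between the two, using a small made-up Series: cut splits the value range into equal-width bins, while qcut splits it into (roughly) equal-frequency bins.

values = pd.Series([1, 2, 3, 4, 5, 100])
# cut: two bins of equal width over the range 1..100, so five values land in the first bin
print(pd.cut(values, 2).value_counts())
# qcut: two bins of equal frequency, so three values land in each bin
print(pd.qcut(values, 2).value_counts())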
In [15]:
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    # Assign with .loc to avoid the chained-assignment pitfall
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
train_data['CategoricalAge'] = pd.cut(train_data['Age'], 5)
print(train_data[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())
In [16]:
def get_title(name):
    # Raw string for the regex; capture the word that precedes a literal dot
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
print(pd.crosstab(train_data['Title'], train_data['Sex']))
A passenger's title could also be a parameter that influences survival.
Inside the Name feature we can find each passenger's title.
Python's built-in re module provides support for regular expressions (regexes).
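A minimal sketch of the pattern used in get_title above, on a made-up name that follows the dataset's format:

sample_name = 'Braund, Mr. Owen Harris'
# Match a space, capture the following letters, and require a literal dot after them
match = re.search(r' ([A-Za-z]+)\.', sample_name)
print(match.group(1))  # prints: Mr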
In [17]:
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
print(train_data[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
In [18]:
print(full_data)
In [19]:
for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
    # Mapping Titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    # Mapping Fare
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    # Mapping Age
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4

# Feature Selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'FamilySize']
train_data = train_data.drop(drop_elements, axis=1)
train_data = train_data.drop(['CategoricalAge', 'CategoricalFare'], axis=1)
test_data = test_data.drop(drop_elements, axis=1)
print(train_data.head(10))
train_data = train_data.values
test_data = test_data.values
In [20]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
classifiers = [
KNeighborsClassifier(3),
SVC(gamma=2, C=1), # probability=True
DecisionTreeClassifier(max_depth=7),
RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0),
AdaBoostClassifier(),
GradientBoostingClassifier(),
GaussianNB(),
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis(),
LogisticRegression(random_state=42, solver='lbfgs')]
log_cols = ["Classifier", "Accuracy"]
log = pd.DataFrame(columns=log_cols)
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
X = train_data[:, 1:]
y = train_data[:, 0]
acc_dict = {}
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for clf in classifiers:
        name = clf.__class__.__name__
        clf.fit(X_train, y_train)
        train_predictions = clf.predict(X_test)
        acc = accuracy_score(y_test, train_predictions)
        if name in acc_dict:
            acc_dict[name] += acc
        else:
            acc_dict[name] = acc
for clf in acc_dict:
    acc_dict[clf] = acc_dict[clf] / 10.0
    log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append is deprecated
plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
Out[20]:
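The loop above averages each classifier's accuracy over the 10 stratified splits by hand. A minimal sketch of the same idea using scikit-learn's cross_val_score, shown here for a single classifier with the sss splitter defined above:

from sklearn.model_selection import cross_val_score

# Mean accuracy of one candidate over the same 10 stratified splits
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=sss, scoring='accuracy')
print(scores.mean())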
In [21]:
candidate_classifier = GradientBoostingClassifier()
candidate_classifier.fit(train_data[:, 1:], train_data[:, 0])
result = candidate_classifier.predict(test_data)
print(result)
The Final Result:
The prediction results above can be lined up with the passenger list in the test dataset to see each passenger's predicted outcome, where 1 means survived and 0 means not survived.
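To turn these predictions into a Kaggle submission, they are usually paired with the passengers' ids. A minimal sketch, assuming the same file path used at the start; PassengerId was dropped during feature selection, so it is re-read from the original CSV:

# Recover the ids that were dropped earlier and pair them with the predictions
passenger_ids = pd.read_csv('./MLData/01_Titanic_Survivals/test.csv')['PassengerId']
submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': result.astype(int)})
submission.to_csv('submission.csv', index=False)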