Machine Learning from the Titanic Disaster
This example walks through a complete survival-prediction analysis of the infamous Titanic sinking. The analysis is carried out in a Jupyter Notebook.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class passengers.
In this analysis we try to understand what sorts of people were likely to survive, and we apply the tools of machine learning to predict which passengers survived the tragedy.
The training and testing datasets with the passenger details are downloaded from the Kaggle site. The training dataset contains the survival outcome along with the passenger details, whereas the testing dataset does not contain the survival outcome; that is what we have to predict.
We use the Pandas and NumPy libraries to analyse the training and testing CSV datasets.
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
train_data = pd.read_csv('./MLData/01_Titanic_Survivals/train.csv')
test_data = pd.read_csv('./MLData/01_Titanic_Survivals/test.csv')
full_data = [train_data, test_data]
train_data.info()
print()
test_data.info()
In [2]:
train_data.head(10)
Out[2]:
In [3]:
test_data.head(3)
Out[3]:
In [4]:
train_data.describe()
Out[4]:
In [5]:
test_data.describe()
Out[5]:
In [6]:
train_data.isnull().sum()
Out[6]:
In [7]:
test_data.isnull().sum()
Out[7]:
Checking the relation between passenger class (Pclass) and survival (Survived) in the training dataset.
In [8]:
print(train_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())
In [9]:
print(train_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean())
Column SibSp means Number of Siblings/Spouses Aboard
Column Parch means Number of Parents/Children Aboard
With the number of siblings/spouses (SibSp) and the number of parents/children (Parch) we can create a new feature called FamilySize.
Selecting data by row numbers (.iloc)
Selecting data by label or by a conditional statement (.loc)
(See the sketch below for both selection styles.)
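A minimal sketch of the two selection styles on the training dataset loaded above (the column choices here are only illustrative):

# Positional selection with .iloc: first three rows, first four columns
print(train_data.iloc[0:3, 0:4])
# Label/conditional selection with .loc: Name and Survived of first-class passengers
print(train_data.loc[train_data['Pclass'] == 1, ['Name', 'Survived']].head())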
In [10]:
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
print(train_data[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())
Since FamilySize counts the passenger as well, a passenger with FamilySize greater than 1 is not alone;
a passenger with FamilySize equal to 1 is travelling alone.
In [11]:
for dataset in full_data:
    dataset['IsNotAlone'] = 0
    # FamilySize counts the passenger, so anyone with FamilySize > 1 is not alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsNotAlone'] = 1
print(train_data[['IsNotAlone', 'Survived']].groupby(['IsNotAlone'], as_index=False).mean())
The column Embarked means Port of Embarkation.
Embarkation is the process of loading passengers onto a ship or an airplane.
Values are: C = Cherbourg, Q = Queenstown, S = Southampton
(The Embarked column has some missing values, so we fill those with the most frequent value ('S'), found with the following code.)
In [12]:
train_data['Embarked'].value_counts()
Out[12]:
In [13]:
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
print(train_data[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())
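As an aside, instead of hard-coding 'S' the most frequent port can be looked up programmatically; a minimal sketch, assuming the same train_data and full_data as above:

# mode() returns the most frequent value(s); take the first one ('S' for this dataset)
most_common_port = train_data['Embarked'].mode()[0]
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna(most_common_port)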
Can Fare play a role in survival?
Fare also has some missing values, which we replace with the median value, and we then categorize it into 4 ranges.
Here we use the qcut function from the pandas library. qcut is a quantile-based discretization function: it discretizes a variable into equal-sized buckets based on rank or on sample quantiles.
In [14]:
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train_data['Fare'].median())
train_data['CategoricalFare'] = pd.qcut(train_data['Fare'], 4)
print(train_data[['CategoricalFare', 'Survived']].groupby(['CategoricalFare'], as_index=False).mean())
Age could play a role in survival: children and the elderly may have been given preference for the lifeboats and so were more likely to survive.
But this feature has plenty of missing values. To fill them we generate random ages between (mean - std. dev.) and (mean + std. dev.), and then categorize the ages into 5 ranges.
Use the cut function from the pandas library when you need to segment and sort data values into bins. It is also useful for going from a continuous variable to a categorical variable (see the sketch below contrasting cut and qcut).
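A minimal sketch of the difference between the two, using a small made-up Series: cut splits the value range into equal-width bins, while qcut splits it into (roughly) equal-frequency bins.

values = pd.Series([1, 2, 3, 4, 5, 100])
# cut: two bins of equal width over the range 1..100, so five values land in the first bin
print(pd.cut(values, 2).value_counts())
# qcut: two bins of equal frequency, so three values land in each bin
print(pd.qcut(values, 2).value_counts())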
In [15]:
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    # Assign with .loc to avoid the chained-assignment pitfall
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
train_data['CategoricalAge'] = pd.cut(train_data['Age'], 5)
print(train_data[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())
In [16]:
def get_title(name):
    # Raw string for the regex; capture the word that precedes a literal dot
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
print(pd.crosstab(train_data['Title'], train_data['Sex']))
A passenger's title could also be a parameter that influences survival.
Inside the Name feature we can find each passenger's title.
Python's built-in re module provides support for regular expressions (regexes).
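A minimal sketch of the pattern used in get_title above, on a made-up name that follows the dataset's format:

sample_name = 'Braund, Mr. Owen Harris'
# Match a space, capture the following letters, and require a literal dot after them
match = re.search(r' ([A-Za-z]+)\.', sample_name)
print(match.group(1))  # prints: Mr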
In [17]:
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
print(train_data[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())
In [18]:
print(full_data)
In [19]:
for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)
    # Mapping Titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    # Mapping Fare
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    # Mapping Age
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4

# Feature Selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'FamilySize']
train_data = train_data.drop(drop_elements, axis=1)
train_data = train_data.drop(['CategoricalAge', 'CategoricalFare'], axis=1)
test_data = test_data.drop(drop_elements, axis=1)
print(train_data.head(10))
train_data = train_data.values
test_data = test_data.values
In [20]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
classifiers = [
KNeighborsClassifier(3),
SVC(gamma=2, C=1), # probability=True
DecisionTreeClassifier(max_depth=7),
RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0),
AdaBoostClassifier(),
GradientBoostingClassifier(),
GaussianNB(),
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis(),
LogisticRegression(random_state=42, solver='lbfgs')]
log_cols = ["Classifier", "Accuracy"]
log = pd.DataFrame(columns=log_cols)
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
X = train_data[:, 1:]
y = train_data[:, 0]
acc_dict = {}
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    for clf in classifiers:
        name = clf.__class__.__name__
        clf.fit(X_train, y_train)
        train_predictions = clf.predict(X_test)
        acc = accuracy_score(y_test, train_predictions)
        if name in acc_dict:
            acc_dict[name] += acc
        else:
            acc_dict[name] = acc
for clf in acc_dict:
    acc_dict[clf] = acc_dict[clf] / 10.0
    log_entry = pd.DataFrame([[clf, acc_dict[clf]]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append is deprecated
plt.xlabel('Accuracy')
plt.title('Classifier Accuracy')
sns.set_color_codes("muted")
sns.barplot(x='Accuracy', y='Classifier', data=log, color="b")
Out[20]:
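The loop above averages each classifier's accuracy over the 10 stratified splits by hand. A minimal sketch of the same idea using scikit-learn's cross_val_score, shown here for a single classifier with the sss splitter defined above:

from sklearn.model_selection import cross_val_score

# Mean accuracy of one candidate over the same 10 stratified splits
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=sss, scoring='accuracy')
print(scores.mean())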
In [21]:
candidate_classifier = GradientBoostingClassifier()
candidate_classifier.fit(train_data[:, 1:], train_data[:, 0])
result = candidate_classifier.predict(test_data)
print(result)
The Final Result:
The prediction results above can be lined up with the passenger list in the test dataset to see each passenger's predicted outcome, where 1 means survived and 0 means not survived.
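To turn these predictions into a Kaggle submission, they are usually paired with the passengers' ids. A minimal sketch, assuming the same file path used at the start; PassengerId was dropped during feature selection, so it is re-read from the original CSV:

# Recover the ids that were dropped earlier and pair them with the predictions
passenger_ids = pd.read_csv('./MLData/01_Titanic_Survivals/test.csv')['PassengerId']
submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': result.astype(int)})
submission.to_csv('submission.csv', index=False)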