In 1912, the RMS Titanic, the largest ship afloat at the time, sank after colliding with an iceberg. Of the 2224 passengers and crew aboard, 1502 died.

In this project, we will explore the training dataset (train) from Kaggle. This dataset contains demographic and passenger information for 891 of the 2224 passengers and crew aboard. The most interesting question is: which features made people more likely to survive the sinking? And based on the available features, can we build a classification algorithm that reasonably predicts survival?


I will start my analysis by exploring individual features, and combinations of features, to see how they correlate with survival. To make the analysis vivid, I will use the interactive plotting library plotly (hover over the plots for interactivity). Finally, I will build logistic regression and random forest models to predict survival, and evaluate their accuracy.

In the dataframe, each row represents a passenger on the Titanic, and each column represents some information about them. Let's take a look at what the columns represent:

  • Survived: Outcome of survival (0 = No; 1 = Yes)
  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  • Name: Name of passenger
  • Sex: Sex of the passenger
  • Age: Age of the passenger (Some entries contain NaN)
  • SibSp: Number of siblings and spouses of the passenger aboard
  • Parch: Number of parents and children of the passenger aboard
  • Ticket: Ticket number of the passenger
  • Fare: Fare paid by the passenger
  • Cabin: Cabin number of the passenger (Some entries contain NaN)
  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

Getting started

In [1]:
# Import pandas and numpy
import pandas as pd
import numpy as np
In [2]:
# Import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import figure_factory as FF 
In [3]:
# Load the titanic train dataset to create dataFrame
# From https://www.kaggle.com/c/titanic/data
train_data = "data/train.csv"
train = pd.read_csv(train_data)
In [4]:
#Print the `head` of the dataframe
train.head()
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Understanding the data

Before we move on to the actual analysis, we will use the pandas .shape attribute and .describe() method to understand our data better. We will also examine how well individual features like Sex, Age, Pclass, Fare, and port of embarkation predict survival.

In [5]:
print(train.shape)
(891, 12)
In [6]:
train.describe()
Out[6]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

What does the distribution of survival look like?

In [7]:
# 0 = deceased, 1 = survived
print(train["Survived"].value_counts())
0    549
1    342
Name: Survived, dtype: int64
In [8]:
# Passengers that survived vs passengers that passed away
deceased = train["Survived"].value_counts(normalize = True)[0]
survived = train["Survived"].value_counts(normalize = True)[1]

x0 = ['deceased', 'survived']
y0 = [deceased, survived]

data = [go.Bar(
        x=x0,
        y=y0
    )]
layout = go.Layout(autosize = False, width = 300, height = 400,
              yaxis = dict(title = 'Normalized counts'),
              title = 'Distribution of survival')
fig0 = go.Figure(data = data, layout = layout)
py.iplot(fig0)
Out[8]:

The majority of passengers (61.6%) did not survive the sinking.

Who was more likely to survive: females or males?

In [9]:
# Normalized male survival
male_survival = train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True)
# Normalized female survival
female_survival = train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True)

# Survival by Sex
x0 = ['male', 'female']
y0 = [male_survival[1], female_survival[1]]
data = [go.Bar(
        x=x0,
        y=y0
    )]
layout = go.Layout(autosize = False, width = 300, height = 400,
              yaxis = dict(title = 'Survival Rates'),
              title = 'Survival by Sex')
fig1 = go.Figure(data = data, layout = layout)
py.iplot(fig1)
Out[9]:

Examining the survival statistics, 74.2% of the females in the dataset survived the sinking, whereas only 18.9% of the males did.

What does the distribution of age look like among survivors and non-survivors?

It's logical to think that children were saved first, so age could be another variable that predicts survival. We will handle the missing values in the Age column in a later section, Clean and format the data. For now, missing values have been excluded from the plot.

In [10]:
#Age distribution of those who passed away
ages_deceased = train["Age"][train["Survived"] == 0]

#Age distribution of survivors
ages_survived = train["Age"][train["Survived"] == 1]

#Boxplot to show age distribution of deceased vs survived
trace_deceased = go.Box(x = ages_deceased, name = "deceased")
trace_survived = go.Box(x = ages_survived, name = "survived")
survival_by_age_data = [trace_deceased, trace_survived]
layout = go.Layout(xaxis = dict(title = 'Age'),title = "Survival by Age", 
                   width = 600, height = 400)
fig2 = go.Figure(data=survival_by_age_data, layout=layout)
py.iplot(fig2)
Out[10]:

The age distribution for those who survived is shifted more towards the left. Albeit modestly, age does seem to correlate with survival. We will test this assumption further by creating a 'Child' column.
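As a rough numerical check (a minimal sketch; pandas skips the missing ages automatically when computing the mean):

# Mean age of the deceased (0) vs survivors (1); NaN ages are ignored by mean()
print(train.groupby('Survived')['Age'].mean())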

How does survival rate change across Pclass?

It's also logical to think that passenger class might affect the outcome, as first class cabins were closer to the deck of the ship.

In [11]:
# Normalized Pclass survival
Pclass1 = train["Survived"][train["Pclass"] == 1].value_counts(normalize = True)
Pclass2 = train["Survived"][train["Pclass"] == 2].value_counts(normalize = True)
Pclass3 = train["Survived"][train["Pclass"] == 3].value_counts(normalize = True)

# Survival by Pclass- Barplot
x0 = ['Pclass 1', 'Pclass 2', 'Pclass 3']
y0 = [Pclass1[1], Pclass2[1], Pclass3[1]]

data = [go.Bar(
        x=x0,
        y=y0
    )]
layout = go.Layout(autosize = False, width = 400, height = 400,
              yaxis = dict(title = 'Survival Rates'),
              title = 'Survival by Pclass')
fig3 = go.Figure(data = data, layout = layout)
py.iplot(fig3)
Out[11]:

Examining the survival statistics, survival rates decreased with class (Pclass 1 > Pclass 2 > Pclass 3): 63%, 47.3%, and 24.2% of Pclass 1, Pclass 2, and Pclass 3 passengers survived, respectively.

What does the distribution of fare look like among survivors and non-survivors?

Fare is highly correlated with Pclass, so it could be another variable that influences survival.
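As a quick check on that correlation claim (a minimal sketch; Pclass is an ordinal code, so the Pearson coefficient is only a rough indicator):

# Rough check: correlation between Fare and Pclass
print(train[['Fare', 'Pclass']].corr())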

In [12]:
#Fare paid by those who passed away
fares_deceased = train["Fare"][train["Survived"] == 0]

#Fare paid by survivors
fares_survived = train["Fare"][train["Survived"] == 1]

#Survival by fare - Boxplot
trace0 = go.Box(x = fares_deceased, name = "deceased")
trace1 = go.Box(x = fares_survived, name = "survived")
fare_by_survival_data = [trace0, trace1]
layout = go.Layout(xaxis = dict(title = 'Fare'),title = "Survival by Fare",
                   width = 600, height = 400)
fig4 = go.Figure(data=fare_by_survival_data, layout=layout)
py.iplot(fig4)
Out[12]:

The fare distribution for those who survived is shifted more towards the right: on the whole, survivors paid higher fares than non-survivors.
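Because fares are heavily skewed, the median is a safer summary than the mean. A minimal sketch comparing the two groups:

# Median fare of the deceased (0) vs survivors (1)
print(train.groupby('Survived')['Fare'].median())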

Does the port of embarkation play a role?

We will handle the missing values in the 'Embarked' column in a later section, Clean and format the data. For now, missing values have been excluded from the plot to examine the data without bias.

In [13]:
# Normalized survival by port of embarkation
S = train["Survived"][train["Embarked"] == "S"].value_counts(normalize = True)
C = train["Survived"][train["Embarked"] == "C"].value_counts(normalize = True)
Q = train["Survived"][train["Embarked"] == "Q"].value_counts(normalize = True)

# Survival by Embarked - Barplot
x0 = ['S', 'C', 'Q']
y0 = [S[1], C[1], Q[1]]

data = [go.Bar(
        x=x0,
        y=y0
    )]
layout = go.Layout(autosize = False, width = 400, height = 400,
              yaxis = dict(title = 'Survival Rates'),
              title = 'Survival by Embarked')
fig5 = go.Figure(data = data, layout = layout)
py.iplot(fig5)
Out[13]:

Multiple-variable (2D) explorations

We will now explore combinations of variables to see how well they correlate with survival.

How did survival rate vary by class and gender?

In [14]:
# Normalized Pclass survival by gender
Pclass1_male = train["Survived"][(train["Pclass"] == 1) & 
               (train["Sex"] == "male")].value_counts(normalize = True)
Pclass2_male = train["Survived"][(train["Pclass"] == 2) & 
               (train["Sex"] == "male")].value_counts(normalize = True)
Pclass3_male = train["Survived"][(train["Pclass"] == 3) & 
               (train["Sex"] == "male")].value_counts(normalize = True)

Pclass1_female = train["Survived"][(train["Pclass"] == 1) & 
                    (train["Sex"] == "female")].value_counts(normalize = True)
Pclass2_female = train["Survived"][(train["Pclass"] == 2) &
                    (train["Sex"] == "female")].value_counts(normalize = True)
Pclass3_female = train["Survived"][(train["Pclass"] == 3) & 
                    (train["Sex"] == "female")].value_counts(normalize = True)

# Survival by Class and Gender- Grouped Barplot
trace0 = go.Bar(
    x=['Pclass 1', 'Pclass 2', 'Pclass 3'],
    y=[Pclass1_male[1], Pclass2_male[1], Pclass3_male[1]],
    name='male'
)
trace1 = go.Bar(
    x=['Pclass 1', 'Pclass 2', 'Pclass 3'],
    y=[Pclass1_female[1], Pclass2_female[1], Pclass3_female[1]],
    name='female'
)

data = [trace0, trace1]
layout = go.Layout(autosize = False, width = 500, height = 400,
    barmode='group',yaxis = dict(title = 'Survival Rates'),
                   title = 'Survival by Class and Gender')

fig6 = go.Figure(data=data, layout=layout)
py.iplot(fig6)
Out[14]:

In each Pclass, females were more likely to survive than their male counterparts. For both males and females, passengers in higher classes had higher survival rates.
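The same comparison can be produced in one line with a pivot table (a minimal sketch; at this point Sex is still the raw string column):

# Survival rate broken down by class and sex
print(train.pivot_table(values='Survived', index='Pclass', columns='Sex', aggfunc='mean'))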

Was there chivalry at work - women and children first?

We saw that age influenced survival, and it is logical to think that children were saved first. We will create a new column with a categorical variable, Child, which takes the value 1 for ages < 10 and 0 for ages >= 10.

In [15]:
# Create the column Child and assign 1 to passengers under 10, 
# 0 to those 10 or older and NaN if age is NaN
def is_child(age):
    """Defines what age is considered a child"""
    if age < 10:
        return float(1)
    elif age >= 10:
        return float(0)
    else:
        return float('NaN')
# apply the function to 'Age' column of the dataframe
train['Child'] = train['Age'].apply(is_child)
In [16]:
# Print normalized Survival Rates for passengers under 10
children = train['Survived'][train['Child'] == 1].value_counts(normalize = True)

# Print normalized Survival Rates for passengers 10 or older
adult = train['Survived'][train['Child'] == 0].value_counts(normalize = True)

# Plot survival of children vs adults
x0=['children', 'adult']
y0=[children[1], adult[1]]

data = [go.Bar(
        x=x0,
        y=y0
    )]
layout = go.Layout(autosize = False, width = 300, height = 400,
              yaxis = dict(title = 'Survival Rates'),
              title = 'Children vs Adults')
fig7 = go.Figure(data = data, layout = layout)
py.iplot(fig7)
Out[16]:
In [17]:
# Normalised survival by sex and age
male_child = train["Survived"][(train["Sex"] == 'male') & 
                               (train["Child"] == 1)].value_counts(normalize = True)
male_adult = train["Survived"][(train["Sex"] == 'male') & 
                               (train["Child"] == 0)].value_counts(normalize = True)

female_child = train["Survived"][(train["Sex"] == 'female') & 
                                 (train["Child"] == 1)].value_counts(normalize = True)
female_adult = train["Survived"][(train["Sex"] == 'female') & 
                                 (train["Child"] == 0)].value_counts(normalize = True)

trace0 = go.Bar(
    x=['male', 'female'],
    y=[male_child[1], female_child[1]],
    name='child (<10)'
)
trace1 = go.Bar(
    x=['male', 'female'],
    y=[male_adult[1], female_adult[1]],
    name='adult'
)

data = [trace0, trace1]
layout = go.Layout(autosize = False, width = 500, height = 400,
                   barmode='group',
                   yaxis = dict(title = 'Survival Rates'),
                   title = 'Women and Children First')

fig8 = go.Figure(data=data, layout=layout)
py.iplot(fig8)
Out[17]:

There was chivalry at work - women and children first.

Clean and format the train data

Until now, we have been examining the effect of one or two variables on survival. Machine learning algorithms automate this task by using multiple features to output a classification model, or classifier. A classifier will be more exhaustive and more precise than our manual exploration above. Before we move on to classification algorithms, we have to clean the data so that we can take maximum advantage of all the relevant features.

The data isn't perfectly clean, as we saw with train.describe() earlier: there are some missing values. Also, .describe() only showed the numeric columns.

We don't want to remove rows with missing data, as more data helps us train our algorithm better. We also don't want to get rid of non-numeric columns; non-numeric columns like 'Sex', as we saw, are very important in predicting survival.
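As a quick inventory before cleaning, we can count the missing values in every column and summarize the non-numeric columns that .describe() skipped. A minimal sketch:

# Number of missing (NaN) values per column
print(train.isnull().sum())

# Summary of the non-numeric (object) columns that .describe() omitted
print(train.describe(include=['object']))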

Missing data - Age

The Age column has missing values (NaN): its count is 714, whereas the other columns have a count of 891. We will impute the missing values in the Age column with the median of the column. The median age before imputation is 28, which lies right at the peak of the distribution.

In [18]:
age_bf_imputation = train['Age'].dropna()

# Impute the missing value with the median
train['Age'] = train['Age'].fillna(train['Age'].median())

hist_data = [age_bf_imputation, train['Age']]

group_labels = ['Before imputation', 'After imputation']
colors = ['#333F44', '#37AA9C']

# Create distplot
fig9 = FF.create_distplot(hist_data, group_labels, 
                          show_hist=False, colors=colors)

#Add title
fig9['layout'].update(title='Age distribution')

# Plot
#iplot(fig9, validate=False)
py.iplot(fig9)
Out[18]:
In [19]:
# Confirm missing values have been taken care of
train['Age'].isnull().values.any()
Out[19]:
False

Missing data - Embarked

In [20]:
print(train['Embarked'].unique())
['S' 'C' 'Q' nan]

The Embarked column also has missing values (nan). We impute them with the most common port of embarkation, Southampton (S).
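Before hard-coding 'S', it is worth confirming that Southampton really is the most common port; the same value could also be obtained programmatically with .mode(). A minimal, read-only sketch:

# Passenger counts per port, and the most frequent port
print(train['Embarked'].value_counts())
print(train['Embarked'].mode()[0])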

In [21]:
embarked_bf_imputation = go.Histogram(
    x=train['Embarked'].dropna(),
    name='Before Imputation',
)

# Impute the Embarked variable
train['Embarked'] = train['Embarked'].fillna('S')

embarked_af_imputation = go.Histogram(
    x=train['Embarked'],
    name = 'After Imputation'
)

embarked_bf_af_imputation = [embarked_bf_imputation, embarked_af_imputation]

#Layout
layout = go.Layout(autosize = False, width = 500, height = 400, bargap = 0.5,
                  barmode='group',
                  title = 'Passenger distribution by Port of Embarkation')

fig10 = go.Figure(data=embarked_bf_af_imputation, layout = layout)

#Plot
#iplot(fig10)
py.iplot(fig10)
Out[21]:

Notice the increase in the passenger count at Southampton (S) from 644 to 646 after imputation.

In [22]:
# Confirm missing values from Embarked column have been taken care of
train['Embarked'].isnull().values.any()
Out[22]:
False

Convert non-numeric columns - Sex and Embarked

The Sex and Embarked variables are categorical but in non-numeric format. We have to convert the non-numeric columns to numeric ones so that the classifier can handle them. To do so, we find the unique classes in each non-numeric column and encode each class as a unique integer.

In [23]:
print(train['Sex'].unique())
['male' 'female']
In [24]:
### Convert categorical variable Sex to integer form
sex_to_integers = {'male':0, 'female':1}
train['Sex'] = train['Sex'].apply(sex_to_integers.get)
In [25]:
print(train['Embarked'].unique())
['S' 'C' 'Q']
In [26]:
#Convert categorical variable Embarked to integer form
embarked_to_integers = {'S':0, 'C':1, 'Q':2}
train['Embarked'] = train['Embarked'].apply(embarked_to_integers.get)

How accurately can we predict survival based on the available features?

Logistic Regression

In our case, the dependent variable is categorical: we only care about two outcomes, 0 (deceased) or 1 (survived). Logistic regression passes a linear combination of the features through the logistic (sigmoid) function, squeezing the output into the range 0 to 1, which we can read as a probability of survival.
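At its core, the model computes a linear score from the features and passes it through the sigmoid. A minimal sketch of just that mapping (the coefficients and feature values below are made up purely for illustration):

# Sigmoid: maps any real-valued score to a value between 0 and 1
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear score from two features, squeezed into (0, 1)
z = 0.5 * 1 + (-0.2) * 30 + 1.0   # made-up coefficients and feature values
print(sigmoid(z))                 # reads as a survival probability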

Sklearn has a class for logistic regression that we can use.

In [27]:
# Import the `LogisticRegression` and cross validation
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

# The features we'll use to predict survival
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm
logreg = LogisticRegression(random_state=1)

# Compute the accuracy score for all the cross validation folds.  
scores = model_selection.cross_val_score(logreg, train[predictors], train["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print "Accuracy and the 95% confidence interval of the estimate are: {0:.3f} (+/- {0:.2f})".format( \
       scores.mean(), scores.std() * 2)
Accuracy and the 95% confidence interval of the estimate are: 0.788 (+/- 0.79)

One metric used to evaluate a classification algorithm is accuracy. Accuracy measures the fraction of passengers whose outcome the classifier predicts correctly. We obtained a cross-validated accuracy of 78.8%.
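In other words, accuracy is just the number of correct predictions divided by the total number of predictions. A minimal sketch of that computation using out-of-fold predictions from cross_val_predict (in the same model_selection module):

# Out-of-fold predictions for every passenger, then the fraction predicted correctly
predictions = model_selection.cross_val_predict(logreg, train[predictors], train["Survived"], cv=3)
print((predictions == train["Survived"]).mean())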

Random Forest

Random forest fits multiple (fairly deep) classification trees on the training set, each with slightly randomized input data and slightly randomized split points, and averages their predictions to improve accuracy and control over-fitting.

In [28]:
# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

#Build our forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 25, random_state = 1)
forest.fit( train[predictors], train["Survived"])
feature_importances = forest.feature_importances_
In [29]:
#Plot the importance of each feature
feature_data = [go.Bar(
            x=predictors,
            y=feature_importances
    )]

feature_layout = go.Layout(autosize = False, width = 400, height = 400,
                  yaxis = dict(title = 'Importance'),
                  title = 'Importance of features')
fig11 = go.Figure(data = feature_data, layout = feature_layout)
#iplot(fig11)
py.iplot(fig11)
Out[29]:
In [30]:
# Compute the accuracy score for all the cross validation folds. 
# One fold per passenger (leave-one-out style cross validation)
kf = model_selection.KFold(n_splits=train.shape[0])
scores = model_selection.cross_val_score(forest, train[predictors], train["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print "Accuracy and the 95% confidence interval of the estimate are: {0:.3f} (+/- {0:.2f})".format( \
       scores.mean(), scores.std() * 2)
Accuracy and the 95% confidence interval of the estimate are: 0.834 (+/- 0.83)

Conclusion

We have built a useful classifier for predicting the survival of passengers aboard the RMS Titanic. We also saw that Sex, Fare, Age, and Pclass are the four most important features in determining who survived the sinking. The accuracy of our random forest classifier is 83.4%. Note, however, that we evaluated the classifier only on the training dataset (via cross-validation); it will be interesting to see how well it performs on the test dataset. It is also important to evaluate classifiers with other metrics, namely precision and recall. Precision measures how many of the passengers predicted to survive actually survived, whereas recall measures how many of the actual survivors were correctly identified.
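Both metrics are available in scikit-learn and can be computed from cross-validated predictions. A minimal sketch, assuming the forest, predictors, and train objects defined above:

# Precision: of the passengers predicted to survive, how many actually survived?
# Recall: of the passengers who actually survived, how many were predicted to survive?
from sklearn.metrics import precision_score, recall_score
predictions = model_selection.cross_val_predict(forest, train[predictors], train["Survived"], cv=3)
print(precision_score(train["Survived"], predictions))
print(recall_score(train["Survived"], predictions))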

We can improve the accuracy of our algorithm by engineering new features. Feature engineering involves creatively combining existing variables to create new ones. For example, we could extract each passenger's title from their name (was any particular title more likely to survive?), or compute family size from SibSp and Parch (did having more women and children in the family make the whole family more likely to survive?).
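A minimal sketch of both ideas (the regular expression for titles and the FamilySize definition are common conventions for this dataset, not something dictated by the data itself):

# Extract the title (Mr, Mrs, Miss, ...) from the Name column
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Family size: siblings/spouses + parents/children + the passenger themselves
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1

print(train.groupby('Title')['Survived'].mean())
print(train.groupby('FamilySize')['Survived'].mean())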

There are limitations with the dataset too. There were several missing values in the 'Age' and 'Embarked' columns; we made our best approximation to fill them in, but that approximation might bias the predictions. In addition, the dataset contains information about only 891 of the 2224 passengers and crew aboard. Even when combined with the 418 passengers in the test dataset, the numbers still don't add up to 2224. Also, the current data doesn't distinguish between passengers and crew.

