Project: Predicting Titanic survival

In this project, we will be working with the very famous Titanic dataset from Kaggle.

Highlights:

  • Exploratory data analysis
  • SimpleImputer to impute null values
  • OneHotEncoder to encode categorical variables
  • Defining pipelines
  • Cross validation scores
  • LogisticRegression model
  • RandomForestClassifier model
  • XGBClassifier model
  • XGBRFClassifier model

#Imports
import numpy as np
import pandas as pd
#Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
#Data loading
titanic = pd.read_csv('train.csv')
#Glimpse of data
titanic.head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
#drop the rows where target has missing values
titanic.dropna(axis=0,subset=['Survived'],inplace=True)
#Data information
titanic.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 90.5+ KB

We have both numerical and non-numerical data to deal with. We will have to convert the categorical features to numerical ones before we can fit our model.
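
As a quick sanity check, we can list the column dtypes to see which features are non-numerical (a minimal sketch; nothing here feeds into the model):

#Column dtypes and the names of the non-numerical (object) columns
print(titanic.dtypes)
print(titanic.select_dtypes(include='object').columns.tolist())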

#Describe numerical features
titanic.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Observations:

  • PassengerId holds no significant meaning and will be dropped.
  • Pclass is an ordinal variable, with most passengers in 3rd class.
  • Age has null values; we will have to fill them in with some value.
  • SibSp is also an ordinal variable, with most passengers travelling without siblings or spouses.
  • Parch is also an ordinal variable, with most passengers travelling without parents or children.
  • The maximum fare appears to be an outlier (see the quick check below).
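
A quick, optional check of a couple of these observations (a minimal sketch; nothing here feeds into the model):

#Class distribution and how far the maximum fare sits above the upper quantiles
print(titanic['Pclass'].value_counts())
print(titanic['Fare'].quantile([0.75, 0.99]))
print(titanic['Fare'].max())
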
#Describe categorical features
titanic.describe(include=['O'])
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Thomas, Master. Assad Alexander male 347082 C23 C25 C27 S
freq 1 577 7 4 644

Observations:

  • We can assume that the Name column holds no significance for a passenger's survival, so we will drop it.
  • Sex will be included in our predictions.
  • It looks like some passengers were sharing tickets (see the check below). We will drop this column as well, as it doesn't show any obvious significance.
  • Cabin has a lot of null values, so we will drop it as well.
  • Embarked will be used in the predictions.
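
A small sketch to back up the shared-ticket and missing-value observations:

#How many tickets are used by more than one passenger, and null counts per column
shared = titanic['Ticket'].value_counts()
print((shared > 1).sum(), 'tickets are used by more than one passenger')
print(titanic.isnull().sum())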

Exploratory data analysis

#Lets check for null values
plt.figure(figsize=(10,5))
sns.heatmap(titanic.isnull(),cbar=False,yticklabels=False,cmap='viridis')

[Figure: heatmap of missing values per column]

[col for col in titanic.columns if titanic[col].isnull().any()]

    ['Age', 'Cabin', 'Embarked']
#We will drop the Cabin column as it has a lot of null values. Dropping all unnecessary columns
titanic.drop(['Cabin','PassengerId','Name','Ticket'],axis=1,inplace=True)

We will impute the Age and Embarked columns later.

sns.countplot(x=titanic['Survived'],data=titanic,hue='Sex')

[Figure: count of survivors by sex]

It looks like there were considerably more female survivors than male survivors.
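
To put a number on this, a small sketch of the survival rate grouped by sex:

#Fraction of passengers who survived, by sex
titanic.groupby('Sex')['Survived'].mean()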

sns.countplot(x=titanic['Survived'],data=titanic,hue='Embarked')

[Figure: count of survivors by port of embarkation]

sns.countplot(x=titanic['Survived'],data=titanic,hue='Pclass')

[Figure: count of survivors by passenger class]

It looks like passengers from 3rd class died in the greatest numbers, while the survival rate in 1st class is somewhat higher than in the other classes.
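
The same kind of quick check quantifies the class effect:

#Fraction of passengers who survived, by passenger class
titanic.groupby('Pclass')['Survived'].mean()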

sns.distplot(titanic['Age'].dropna(),kde=False,color='darkred',bins=30)

[Figure: distribution of passenger ages]

g = sns.FacetGrid(titanic,col='Survived',row='Sex')
g.map(plt.hist,'Age')

[Figure: age histograms split by survival and sex]

More infants survived, and the 80-year-old passenger also survived.
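
A quick check of both claims (a minimal sketch; the cut-off of 5 years for "infants" is an arbitrary choice):

#Survival rate of very young children, and the oldest passenger's record
print(titanic[titanic['Age'] < 5]['Survived'].mean())
print(titanic.loc[titanic['Age'].idxmax(), ['Age', 'Survived']])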

g = sns.FacetGrid(titanic,col='Survived',row='Pclass')
g.map(sns.countplot,'Embarked')

[Figure: embarkation counts split by survival and passenger class]

titanic['Fare'].hist(bins=30,color='green')

[Figure: histogram of fares]

Pipelines

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
#Separate target from predictors
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']
#Lets separate the numerical and categorical columns
num_cols = [col for col in X.columns if titanic[col].dtype in ['int64','float64']]
cat_cols = [col for col in X.columns if titanic[col].dtype == 'object']
#Numerical transformer
num_transformer = SimpleImputer(strategy='mean')
#Categorical transformer
cat_transformer = Pipeline(steps=[
    ('impute',SimpleImputer(strategy='most_frequent')),
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse=False))
])
preprocessor = ColumnTransformer(transformers=[
    ('num',num_transformer,num_cols),
    ('cat',cat_transformer,cat_cols)
])
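
Before wiring the preprocessor into a model, it can be worth a quick sanity check that it produces a purely numerical matrix (a minimal sketch; the number of columns depends on how many one-hot categories are generated):

#Fit the preprocessor on the predictors and inspect the transformed feature matrix
X_prep = preprocessor.fit_transform(X)
print(X_prep.shape)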

Get Model

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
model1 = LogisticRegression(solver='liblinear',random_state=1)
model2 = RandomForestClassifier(n_estimators=250,random_state=1)
#Model1 pipeline
final_pipeline_1 = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model1)
])
#Model2 pipeline
final_pipeline_2 = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model2)
])

Cross-Validation scores

from sklearn.model_selection import cross_val_score
scores_1 = cross_val_score(final_pipeline_1,X,y,cv=5,scoring='accuracy')
scores_1.mean()

    0.7912852282814269
scores_2 = cross_val_score(final_pipeline_2,X,y,cv=5,scoring='accuracy')
scores_2.mean()

    0.8115290623015141

We have achieved an accuracy of around 81% with the RandomForest model.
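
To see how stable these estimates are across folds, we could also look at the individual fold scores and their spread (a minimal sketch):

#Per-fold accuracies and their standard deviation for the RandomForest pipeline
print(scores_2)
print(scores_2.std())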

XGBoost

from xgboost import XGBClassifier,XGBRFClassifier
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)
model_XGBClassifier = XGBClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=4)
model_XGBRFClassifier = XGBRFClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=4)
#Pipeline #1
pipeline_XGB = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model_XGBClassifier)
])
#Pipeline #2
pipeline_XGBRF = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model_XGBRFClassifier)
])
#Fit parameters for early stopping (note: these are defined here but never passed
#to cross_val_score below, so they have no effect on the scores that follow)
params = {'model__early_stopping_rounds':5,
          'model__verbose':False,
          'model__eval_set':[(X_valid,y_valid)]}
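
If we did want early stopping, one approach is to preprocess the validation set ourselves and pass the fit parameters through the pipeline with the model__ prefix. This is a minimal sketch, assuming an XGBoost version that still accepts early_stopping_rounds as a fit parameter:

#The eval_set must be transformed with the same preprocessor the pipeline uses,
#because the XGBoost step only ever sees preprocessed features
X_valid_prep = preprocessor.fit(X_train, y_train).transform(X_valid)
pipeline_XGB.fit(X_train, y_train,
                 model__early_stopping_rounds=5,
                 model__eval_set=[(X_valid_prep, y_valid)],
                 model__verbose=False)
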
scores_XGB = cross_val_score(pipeline_XGB,X_train,y_train,cv=5,scoring='accuracy')
scores_XGB.mean()

    0.8020450232017508
scores_XGBRF = cross_val_score(pipeline_XGBRF,X_train,y_train,cv=5,scoring='accuracy')
scores_XGBRF.mean()

    0.8356706923083909

Using the RandomForest flavor of XGBoost, we achieved an accuracy of around 83.5%. Note that the XGBoost scores were computed on the 80% training split, so they are not directly comparable to the earlier full-data cross-validation scores.
