Project: Predicting Titanic survival

In this project, we will be working with the very famous Titanic dataset from Kaggle.

Highlights:

  • Exploratory data analysis
  • SimpleImputer to impute null values
  • OneHotEncoder to encode categorical variables
  • Defining pipelines
  • Cross validation scores
  • LogisticRegression model
  • RandomForestClassifier model
  • XGBClassifier model
  • XGBRFClassifier model

#Imports
import numpy as np
import pandas as pd
#Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
#Data loading
titanic = pd.read_csv('train.csv')
#Glimpse of data
titanic.head(2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
#drop the rows where target has missing values
titanic.dropna(axis=0,subset=['Survived'],inplace=True)
#Data information
titanic.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 90.5+ KB

We have both numerical and non-numerical data to deal with. We will have to convert the categorical features to numerical ones before we can fit our model.
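
As a quick sanity check, we can list the column dtypes to see which features are non-numerical (a minimal sketch; nothing here feeds into the model):

#Column dtypes and the names of the non-numerical (object) columns
print(titanic.dtypes)
print(titanic.select_dtypes(include='object').columns.tolist())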

#Describe numerical features
titanic.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Observations:

  • PassengerId holds no significant meaning and will be dropped.
  • Pclass is an ordinal variable, with most passengers in 3rd class.
  • Age has null values; we will have to fill them in with some value.
  • SibSp is also an ordinal variable, with most passengers travelling without siblings or spouses.
  • Parch is also an ordinal variable, with most passengers travelling without parents or children.
  • The maximum fare appears to be an outlier (see the quick check below).
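
A quick, optional check of a couple of these observations (a minimal sketch; nothing here feeds into the model):

#Class distribution and how far the maximum fare sits above the upper quantiles
print(titanic['Pclass'].value_counts())
print(titanic['Fare'].quantile([0.75, 0.99]))
print(titanic['Fare'].max())
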
#Describe categorical features
titanic.describe(include=['O'])
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Thomas, Master. Assad Alexander male 347082 C23 C25 C27 S
freq 1 577 7 4 644

Observations:

  • We can assume that the Name column holds no significance for a passenger's survival, so we will drop it.
  • Sex will be included in our predictions.
  • It looks like some passengers were sharing tickets (see the check below). We will drop this column as well, as it doesn't show any obvious significance.
  • Cabin has a lot of null values, so we will drop it as well.
  • Embarked will be used in the predictions.
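
A small sketch to back up the shared-ticket and missing-value observations:

#How many tickets are used by more than one passenger, and null counts per column
shared = titanic['Ticket'].value_counts()
print((shared > 1).sum(), 'tickets are used by more than one passenger')
print(titanic.isnull().sum())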

Exploratory data analysis

#Lets check for null values
plt.figure(figsize=(10,5))
sns.heatmap(titanic.isnull(),cbar=False,yticklabels=False,cmap='viridis')

[Figure: heatmap of missing values per column]

[col for col in titanic.columns if titanic[col].isnull().any()]

    ['Age', 'Cabin', 'Embarked']
#We will drop the Cabin column as it has a lot of null values. Dropping all unnecessary columns
titanic.drop(['Cabin','PassengerId','Name','Ticket'],axis=1,inplace=True)

We will impute the Age and Embarked columns later.

sns.countplot(x=titanic['Survived'],data=titanic,hue='Sex')

[Figure: count of survivors by sex]

It looks like there were considerably more female survivors than male survivors.
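
To put a number on this, a small sketch of the survival rate grouped by sex:

#Fraction of passengers who survived, by sex
titanic.groupby('Sex')['Survived'].mean()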

sns.countplot(x=titanic['Survived'],data=titanic,hue='Embarked')

[Figure: count of survivors by port of embarkation]

sns.countplot(x=titanic['Survived'],data=titanic,hue='Pclass')

[Figure: count of survivors by passenger class]

It looks like passengers from 3rd class died in the greatest numbers, while the survival rate in 1st class is somewhat higher than in the other classes.
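
The same kind of quick check quantifies the class effect:

#Fraction of passengers who survived, by passenger class
titanic.groupby('Pclass')['Survived'].mean()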

sns.distplot(titanic['Age'].dropna(),kde=False,color='darkred',bins=30)

[Figure: distribution of passenger ages]

g = sns.FacetGrid(titanic,col='Survived',row='Sex')
g.map(plt.hist,'Age')

[Figure: age histograms split by survival and sex]

More infants survived, and the 80-year-old passenger also survived.
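
A quick check of both claims (a minimal sketch; the cut-off of 5 years for "infants" is an arbitrary choice):

#Survival rate of very young children, and the oldest passenger's record
print(titanic[titanic['Age'] < 5]['Survived'].mean())
print(titanic.loc[titanic['Age'].idxmax(), ['Age', 'Survived']])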

g = sns.FacetGrid(titanic,col='Survived',row='Pclass')
g.map(sns.countplot,'Embarked')

[Figure: embarkation counts split by survival and passenger class]

titanic['Fare'].hist(bins=30,color='green')

[Figure: histogram of fares]

Pipelines

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
#Separate target from predictors
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']
#Lets separate the numerical and categorical columns
num_cols = [col for col in X.columns if titanic[col].dtype in ['int64','float64']]
cat_cols = [col for col in X.columns if titanic[col].dtype == 'object']
#Numerical transformer
num_transformer = SimpleImputer(strategy='mean')
#Categorical transformer
cat_transformer = Pipeline(steps=[
    ('impute',SimpleImputer(strategy='most_frequent')),
    ('onehot',OneHotEncoder(handle_unknown='ignore',sparse=False))
])
preprocessor = ColumnTransformer(transformers=[
    ('num',num_transformer,num_cols),
    ('cat',cat_transformer,cat_cols)
])
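
Before wiring the preprocessor into a model, it can be worth a quick sanity check that it produces a purely numerical matrix (a minimal sketch; the number of columns depends on how many one-hot categories are generated):

#Fit the preprocessor on the predictors and inspect the transformed feature matrix
X_prep = preprocessor.fit_transform(X)
print(X_prep.shape)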

Get Model

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
model1 = LogisticRegression(solver='liblinear',random_state=1)
model2 = RandomForestClassifier(n_estimators=250,random_state=1)
#Model1 pipeline
final_pipeline_1 = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model1)
])
#Model2 pipeline
final_pipeline_2 = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model2)
])

Cross-Validation scores

from sklearn.model_selection import cross_val_score
scores_1 = cross_val_score(final_pipeline_1,X,y,cv=5,scoring='accuracy')
scores_1.mean()

    0.7912852282814269
scores_2 = cross_val_score(final_pipeline_2,X,y,cv=5,scoring='accuracy')
scores_2.mean()

    0.8115290623015141

We have achieved an accuracy of around 81% with the RandomForest model.
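
To see how stable these estimates are across folds, we could also look at the individual fold scores and their spread (a minimal sketch):

#Per-fold accuracies and their standard deviation for the RandomForest pipeline
print(scores_2)
print(scores_2.std())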

XGBoost

from xgboost import XGBClassifier,XGBRFClassifier
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)
model_XGBClassifier = XGBClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=4)
model_XGBRFClassifier = XGBRFClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=4)
#Pipeline #1
pipeline_XGB = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model_XGBClassifier)
])
#Pipeline #2
pipeline_XGBRF = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('model',model_XGBRFClassifier)
])
#Fit parameters for early stopping (note: these are defined here but never passed
#to cross_val_score below, so they have no effect on the scores that follow)
params = {'model__early_stopping_rounds':5,
          'model__verbose':False,
          'model__eval_set':[(X_valid,y_valid)]}
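
If we did want early stopping, one approach is to preprocess the validation set ourselves and pass the fit parameters through the pipeline with the model__ prefix. This is a minimal sketch, assuming an XGBoost version that still accepts early_stopping_rounds as a fit parameter:

#The eval_set must be transformed with the same preprocessor the pipeline uses,
#because the XGBoost step only ever sees preprocessed features
X_valid_prep = preprocessor.fit(X_train, y_train).transform(X_valid)
pipeline_XGB.fit(X_train, y_train,
                 model__early_stopping_rounds=5,
                 model__eval_set=[(X_valid_prep, y_valid)],
                 model__verbose=False)
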
scores_XGB = cross_val_score(pipeline_XGB,X_train,y_train,cv=5,scoring='accuracy')
scores_XGB.mean()

    0.8020450232017508
scores_XGBRF = cross_val_score(pipeline_XGBRF,X_train,y_train,cv=5,scoring='accuracy')
scores_XGBRF.mean()

    0.8356706923083909

Using the RandomForest flavor of XGBoost, we achieved an accuracy of around 83.5%. Note that the XGBoost scores were computed on the 80% training split, so they are not directly comparable to the earlier full-data cross-validation scores.
