Project: Predicting heart diseases

In this project, we use a k-nearest neighbors (KNN) model to predict heart disease. The data is a heart disease dataset from Kaggle, loaded below from heart.csv.

Highlights:

  • Using StandardScaler to scale the features
  • Defining Pipelines
  • Using KNN model for predictions
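
Before the walkthrough, here is a minimal sketch of the idea behind KNN: a new point gets the majority label of its k closest training points. The function name `knn_predict` and the toy data are made up for illustration, and plain Euclidean distance is assumed. scikit-learn's `KNeighborsClassifier` follows the same idea with a more efficient neighbor search.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Label x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features, two well-separated classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([4.0, 4.0]), k=3))  # 1
```

Because the vote depends entirely on distances, the choice of k and the scale of each feature both affect the result, which is why the project scales the features and tunes n_neighbors.
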
#Imports
import numpy as np
import pandas as pd
#Load data
df = pd.read_csv('heart.csv')
#Glimpse of data
df.head(3)

       age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
    0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
    1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
    2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
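
The features in these rows sit on very different scales, which matters for a distance-based model like KNN. As a rough illustration, the sketch below takes five of the numeric columns from the first two rows shown above and computes how much each one contributes to the squared Euclidean distance between them. The choice of columns is an assumption made for the example, not part of the original analysis.

```python
import numpy as np

# Values copied from rows 0 and 1 of df.head(3); column subset chosen for illustration
features = ["age", "trestbps", "chol", "thalach", "oldpeak"]
row0 = np.array([63, 145, 233, 150, 2.3])
row1 = np.array([37, 130, 250, 187, 3.5])

# Per-feature share of the squared Euclidean distance between the two rows
squared_diff = (row0 - row1) ** 2
share = squared_diff / squared_diff.sum()

for name, s in zip(features, share):
    print(f"{name:>8}: {s:.1%} of the squared distance")
```

On raw values, thalach accounts for the largest share and oldpeak for well under 1%, simply because of the units involved. Standardizing each feature to zero mean and unit variance puts them on a comparable footing before distances are computed.
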
#Dropping the rows where target has null value
df.dropna(axis=0,subset=['target'],inplace=True)
#Separate predictors and target
X = df.drop('target',axis=1)
y = df['target']
#check for null values
[col for col in df.columns if df[col].isnull().any()]

    []
#Imports
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
#Define pipeline
my_pipline = Pipeline(steps=[
    ('scaler',StandardScaler()),
    ('model',KNeighborsClassifier())
])
#Calculate cross validation scores
scores = cross_val_score(my_pipline,X,y,cv=5,scoring='accuracy')
scores.mean()

    0.8150819672131148
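
One reason to wrap the scaler and the model in a Pipeline is that cross_val_score then refits the scaler on the training part of every fold, so the held-out part never influences the scaling statistics. The sketch below spells that out by hand and compares it with cross_val_score. It runs on synthetic data from `make_classification` as a stand-in for heart.csv, and the names `X_demo`, `y_demo`, and `pipe` are assumptions made so the example is self-contained.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier()),
])

# Manual cross-validation: scaler and model are fit on the training fold only
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_pipe = clone(pipe)
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_pipe.score(X_demo[test_idx], y_demo[test_idx]))

# The same procedure in one call
auto_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')

print(np.allclose(manual_scores, auto_scores))
```

The two sets of scores should agree, since for a classifier an integer cv is treated as an unshuffled stratified k-fold split.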

With the default n_neighbors=5 and no parameter tuning, the model reaches a mean cross-validated accuracy of about 81.5%. Let's tune the n_neighbors parameter and see what results we can achieve.

mean_scores = {}
for i in range(1,50):
    my_pipline = Pipeline(steps=[
        ('scaler',StandardScaler()),
        ('model',KNeighborsClassifier(n_neighbors=i))
    ])
    scores = cross_val_score(my_pipline,X,y,cv=5,scoring='accuracy')
    mean_scores[i] = scores.mean()
#Find the n_neighbors value with the highest mean accuracy
max(mean_scores,key=mean_scores.get)

    28
#Replugging that value into the model and re-calculating the cross-validation scores
my_pipline = Pipeline(steps=[
    ('scaler',StandardScaler()),
    ('model',KNeighborsClassifier(n_neighbors=28))
])
scores = cross_val_score(my_pipline,X,y,cv=5,scoring='accuracy')
scores.mean()

    0.8348633879781421

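With n_neighbors tuned to 28, the mean cross-validated accuracy rises to about 83.5%, up from 81.5% with the default. Because the same folds were used both to choose n_neighbors and to report the score, this figure is likely somewhat optimistic.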
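
The manual search over n_neighbors can also be written with scikit-learn's GridSearchCV, which runs the same cross-validated comparison and keeps the best setting. The sketch below is one possible way to do it, not the method used above. It runs on synthetic data from `make_classification` as a stand-in for heart.csv, so the best value it reports will not match the 28 found on the real data. The names `X_demo`, `y_demo`, `pipe`, and `search` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier()),
])

# 'model__n_neighbors' targets the n_neighbors parameter of the 'model' step
search = GridSearchCV(
    pipe,
    param_grid={'model__n_neighbors': list(range(1, 50))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data with the selected n_neighbors, and `search.cv_results_` holds the per-setting scores that the loop stored in mean_scores.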
