Project: Predicting breast cancer

In this project, we are going to use scikit-learn's built-in breast cancer (Wisconsin diagnostic) dataset to predict whether a tumor is malignant or benign.

Highlights:

  • Using a Support Vector Machine (SVM) model for predictions
  • Using GridSearchCV to tune hyperparameters
# Imports
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# Creating data frame
cancer_features = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
# Glimpse of data
cancer_features.head(2)
       mean radius  mean texture  mean perimeter  mean area  ...  worst concave points  worst symmetry  worst fractal dimension
    0        17.99         10.38           122.8     1001.0  ...                0.2654          0.4601                  0.11890
    1        20.57         17.77           132.9     1326.0  ...                0.1860          0.2750                  0.08902

    [2 rows x 30 columns]
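In this dataset the target is encoded as 0 = malignant and 1 = benign. As a quick, minimal check (using only the cancer object loaded above, plus NumPy), we can print the label names and the class balance:

import numpy as np
# Label names, indexed by target value: index 0 = malignant, index 1 = benign
print(cancer['target_names'])
# Class balance: number of samples per target value
print(np.bincount(cancer['target']))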

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# Defining features X and target y
X = cancer_features
y = cancer['target']
# Creating an 80/20 train/validation split
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=1)

First, we are going to train and validate the model without any hyperparameter tuning. This will make it easier to appreciate what GridSearchCV adds, especially with a Support Vector Machine model.

# Instantiate model
model = SVC()
model.fit(X_train, y_train)

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False)
predictions = model.predict(X_valid)
from sklearn.metrics import classification_report
print(classification_report(y_valid, predictions))

                  precision    recall  f1-score   support

               0       0.00      0.00      0.00        42
               1       0.63      1.00      0.77        72
    
       micro avg       0.63      0.63      0.63       114
       macro avg       0.32      0.50      0.39       114
    weighted avg       0.40      0.63      0.49       114
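The zeros for class 0 are easier to read as a confusion matrix; here is a minimal sketch using the predictions computed above:

from sklearn.metrics import confusion_matrix
# Rows are true classes (0 = malignant, 1 = benign), columns are predicted classes
print(confusion_matrix(y_valid, predictions))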

You can see that the SVM model predicted everything as the benign class. This is a common failure mode for an untuned SVM on unscaled features: the default RBF kernel parameters are a poor match for feature ranges this wide, so the model collapses onto the majority class. Let us now use GridSearchCV to tune the hyperparameters and see the difference in our scores.

from sklearn.model_selection import GridSearchCV
# Parameter grid: C is the regularization parameter (larger values fit the
# training data more tightly), gamma is the RBF kernel coefficient
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
# Instantiate the grid search (verbose=3 prints a progress line for every fit)
grid = GridSearchCV(SVC(), param_grid, verbose=3)
# Fit the grid, running cross-validation for every parameter combination
grid.fit(X_train, y_train)

    GridSearchCV(cv='warn', error_score='raise-deprecating',
           estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False),
           fit_params=None, iid='warn', n_jobs=None,
           param_grid={'C': [1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
           pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
           scoring=None, verbose=3)
# Check the best values of parameters
grid.best_params_

    {'C': 10, 'gamma': 0.0001}
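Besides the winning parameter combination, the fitted grid also exposes its mean cross-validated score and the refit estimator; a small sketch (the exact score depends on the cross-validation splits, so no output is shown):

# Mean cross-validated accuracy of the best parameter combination
print(grid.best_score_)
# The estimator refit on the full training set with the best parameters
print(grid.best_estimator_)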
# Generate predictions using the tuned hyperparameter values
grid_predictions = grid.predict(X_valid)
print(classification_report(y_valid, grid_predictions))

                  precision    recall  f1-score   support
    
               0       0.97      0.90      0.94        42
               1       0.95      0.99      0.97        72
    
       micro avg       0.96      0.96      0.96       114
       macro avg       0.96      0.95      0.95       114
    weighted avg       0.96      0.96      0.96       114

You can clearly see the difference: with tuned hyperparameters, the SVM model now performs well, with an accuracy of around 96%.
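For completeness, that headline number can be computed directly from the same validation predictions; a minimal sketch using scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score
# Fraction of validation samples classified correctly by the tuned model
print(accuracy_score(y_valid, grid_predictions))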
