Project: Predicting breast cancer

In this project, we are going to use scikit-learn's built-in breast cancer (Wisconsin diagnostic) dataset to predict whether a tumor is malignant or benign.

Highlights:

  • Using a Support Vector Machine (SVM) model for predictions
  • Using GridSearchCV to tune hyperparameters
# Imports
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# Creating data frame
cancer_features = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
# Glimpse of data
cancer_features.head(2)
       mean radius  mean texture  mean perimeter  mean area  ...  worst concave points  worst symmetry  worst fractal dimension
    0        17.99         10.38           122.8     1001.0  ...                0.2654          0.4601                  0.11890
    1        20.57         17.77           132.9     1326.0  ...                0.1860          0.2750                  0.08902

    [2 rows x 30 columns]
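In this dataset the target is encoded as 0 = malignant and 1 = benign. As a quick, minimal check (using only the cancer object loaded above, plus NumPy), we can print the label names and the class balance:

import numpy as np
# Label names, indexed by target value: index 0 = malignant, index 1 = benign
print(cancer['target_names'])
# Class balance: number of samples per target value
print(np.bincount(cancer['target']))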

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# Defining features X and target y
X = cancer_features
y = cancer['target']
# Creating an 80/20 train/validation split
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=1)

First, we are going to train and validate the model without any hyperparameter tuning. This will make it easier to appreciate what GridSearchCV adds, especially with a Support Vector Machine model.

# Instantiate model
model = SVC()
model.fit(X_train, y_train)

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False)
predictions = model.predict(X_valid)
from sklearn.metrics import classification_report
print(classification_report(y_valid, predictions))

                  precision    recall  f1-score   support

               0       0.00      0.00      0.00        42
               1       0.63      1.00      0.77        72
    
       micro avg       0.63      0.63      0.63       114
       macro avg       0.32      0.50      0.39       114
    weighted avg       0.40      0.63      0.49       114
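The zeros for class 0 are easier to read as a confusion matrix; here is a minimal sketch using the predictions computed above:

from sklearn.metrics import confusion_matrix
# Rows are true classes (0 = malignant, 1 = benign), columns are predicted classes
print(confusion_matrix(y_valid, predictions))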

You can see that the SVM model predicted everything as the benign class. This is a common failure mode for an untuned SVM on unscaled features: the default RBF kernel parameters are a poor match for feature ranges this wide, so the model collapses onto the majority class. Let us now use GridSearchCV to tune the hyperparameters and see the difference in our scores.

from sklearn.model_selection import GridSearchCV
# Parameter grid: C is the regularization parameter (larger values fit the
# training data more tightly), gamma is the RBF kernel coefficient
param_grid = {'C': [1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
# Instantiate the grid search (verbose=3 prints a progress line for every fit)
grid = GridSearchCV(SVC(), param_grid, verbose=3)
# Fit the grid, running cross-validation for every parameter combination
grid.fit(X_train, y_train)

    GridSearchCV(cv='warn', error_score='raise-deprecating',
           estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False),
           fit_params=None, iid='warn', n_jobs=None,
           param_grid={'C': [1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
           pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
           scoring=None, verbose=3)
# Check the best values of parameters
grid.best_params_

    {'C': 10, 'gamma': 0.0001}
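Besides the winning parameter combination, the fitted grid also exposes its mean cross-validated score and the refit estimator; a small sketch (the exact score depends on the cross-validation splits, so no output is shown):

# Mean cross-validated accuracy of the best parameter combination
print(grid.best_score_)
# The estimator refit on the full training set with the best parameters
print(grid.best_estimator_)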
# Generate predictions using the tuned hyperparameter values
grid_predictions = grid.predict(X_valid)
print(classification_report(y_valid, grid_predictions))

                  precision    recall  f1-score   support
    
               0       0.97      0.90      0.94        42
               1       0.95      0.99      0.97        72
    
       micro avg       0.96      0.96      0.96       114
       macro avg       0.96      0.95      0.95       114
    weighted avg       0.96      0.96      0.96       114

You can clearly see the difference: with tuned hyperparameters, the SVM model now performs well, with an accuracy of around 96%.
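For completeness, that headline number can be computed directly from the same validation predictions; a minimal sketch using scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score
# Fraction of validation samples classified correctly by the tuned model
print(accuracy_score(y_valid, grid_predictions))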
