Project: Predicting Amount spent

10 minute read

For this project, we are going to use this dataset from Kaggle. This data is of ecommerce customers and their usage on app vs website. Our mission is to comeup with a model that will predict the ‘Yearly Amount Spent’ by the customers based on the given features.

Highlights:

  • Exploratory data analysis
  • Creating pipeline
  • Using LinearRegression model
  • Using cross-validation scores to measure model performance
#Imports
import numpy as np
import pandas as pd
#Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
#import data
ecommerce = pd.read_csv('ecommerce')
#Glimpse of data
ecommerce.head(2)
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
0 mstephenson@fernandez.com 835 Frank Tunnel\nWrightmouth, MI 82180-9605 Violet 34.497268 12.655651 39.577668 4.082621 587.951054
1 hduke@hotmail.com 4547 Archer Common\nDiazchester, CA 06566-8576 DarkGreen 31.926272 11.109461 37.268959 2.664034 392.204933
#Data info
ecommerce.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 500 entries, 0 to 499
    Data columns (total 8 columns):
    Email                   500 non-null object
    Address                 500 non-null object
    Avatar                  500 non-null object
    Avg. Session Length     500 non-null float64
    Time on App             500 non-null float64
    Time on Website         500 non-null float64
    Length of Membership    500 non-null float64
    Yearly Amount Spent     500 non-null float64
    dtypes: float64(5), object(3)
    memory usage: 31.3+ KB
#describe numerical features
ecommerce.describe()
Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 33.053194 12.052488 37.060445 3.533462 499.314038
std 0.992563 0.994216 1.010489 0.999278 79.314782
min 29.532429 8.508152 33.913847 0.269901 256.670582
25% 32.341822 11.388153 36.349257 2.930450 445.038277
50% 33.082008 11.983231 37.069367 3.533975 498.887875
75% 33.711985 12.753850 37.716432 4.126502 549.313828
max 36.139662 15.126994 40.005182 6.922689 765.518462

Observation:

  • Our features seem to be normally distributed as mean is very close to median values
  • The feature data is spread very close to the mean as standard deviation is very low
#Drop the non-numerical columns
ecommerce.drop(['Address','Avatar','Email'],axis=1,inplace=True)
#Check for null values
[col for col in ecommerce.columns if ecommerce[col].isnull().any()]

    []

Seems like there are no null values. Makes our life easier :)

Exploratory data analysis

Let’s explore our data a bit

sns.distplot(ecommerce['Avg. Session Length'],label='Avg. session length')
sns.distplot(ecommerce['Time on App'],label='Time on app')
sns.distplot(ecommerce['Time on Website'],label='Time on Website')
sns.distplot(ecommerce['Length of Membership'],label='Length of membership')
plt.legend()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

    <matplotlib.legend.Legend at 0x167f8cb3c50>

png

Distribution plot above proves our first observation that features are normally distributed.

plt.figure(figsize=(10,5))
sns.boxenplot(data=ecommerce.drop('Yearly Amount Spent',axis=1))
plt.tight_layout()

png

Above Boxplot proves our second observation that the features data spread is very close to the mean.

#Pair plot
sns.pairplot(ecommerce)

    <seaborn.axisgrid.PairGrid at 0x167fb0f9a58>

png

There seems to be a very strong relationship b/w length of membership and our target label.

sns.heatmap(ecommerce.corr(),cmap='magma_r',annot=True)

    <matplotlib.axes._subplots.AxesSubplot at 0x167fbcf1748>

png

Heatmap proves that relationship by showing us the pearson’s r value of 0.81. There also seems to be some relation between Time on App and our target variable.

sns.jointplot(x='Length of Membership',y='Time on App',data=ecommerce,kind='hex')

    <seaborn.axisgrid.JointGrid at 0x167fc21a940>

png

sns.lmplot(x='Length of Membership',y='Yearly Amount Spent',data=ecommerce)

    <seaborn.axisgrid.FacetGrid at 0x167fc7c0668>

png

Defining pipeline

X = ecommerce.drop('Yearly Amount Spent',axis=1)
y = ecommerce['Yearly Amount Spent']
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
my_pipeline = make_pipeline(LinearRegression())

Using cross-validation scores

from sklearn.model_selection import cross_val_score
cv_scores = -1 * cross_val_score(my_pipeline,X,y,cv=5,scoring='neg_mean_absolute_error')
cv_scores.mean()

    7.944690345653413

Leave a comment