Project: Predicting Iowa house prices


In this project, we will predict house prices in Iowa. Let’s get started.
We will follow the steps described in SML: Supervised Machine Learning workflow.
The dataset we are going to use is the Iowa house prices dataset from Kaggle.
Highlights:

Exploratory data analysis using Pandas
Visualizing data using matplotlib and seaborn
Imputing null values
Training DecisionTreeRegressor and retraining with different max_leaf_nodes values
Applying the concept of Bias-Variance Tradeoff
Fitting and validating RandomForestRegressor and comparing the results with that of DecisionTreeRegressor

Define Problem

Since we are to predict the house prices, it is going to be a regression problem.

Acquire Data

Go ahead and download the dataset from the Kaggle link above

Import Data

#Importing necessary libraries
import numpy as np
import pandas as pd

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

#Reading csv
iowa = pd.read_csv("train.csv")

iowa.head()

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

5 rows × 81 columns

Exploratory Data Analysis

Let’s explore our data a bit and see what we have in hand.

len(iowa.columns)
  81

There are 81 columns. To keep this project easy to follow, we will use only the following features: LotFrontage, LotArea, Utilities, BldgType, HouseStyle, YearBuilt, 1stFlrSF, 2ndFlrSF, BedroomAbvGr, YrSold, SaleType. The target is SalePrice.

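features = ['LotFrontage','LotArea','Utilities','BldgType','HouseStyle','YearBuilt','1stFlrSF','2ndFlrSF','BedroomAbvGr','YrSold','SaleType']
target = 'SalePrice'
#Feature Dataframe (an explicit copy, so later edits do not trigger SettingWithCopyWarning)
iowa_feat = iowa[features].copy()
#Target Series
iowa_tar = iowa[target]
#Keep only the selected features and the target in the working dataframe
iowa = pd.concat([iowa_feat,iowa_tar],axis=1)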

iowa_feat.head(3)

	LotFrontage	LotArea	Utilities	BldgType	HouseStyle	YearBuilt	1stFlrSF	2ndFlrSF	BedroomAbvGr	YrSold	SaleType
0	65.0	8450	AllPub	1Fam	2Story	2003	856	854	3	2008	WD
1	80.0	9600	AllPub	1Fam	1Story	1976	1262	0	3	2007	WD
2	68.0	11250	AllPub	1Fam	2Story	2001	920	866	3	2008	WD

#get the idea of number of rows and columns and type of data in each
iowa_feat.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1460 entries, 0 to 1459
    Data columns (total 11 columns):
    LotFrontage     1201 non-null float64
    LotArea         1460 non-null int64
    Utilities       1460 non-null object
    BldgType        1460 non-null object
    HouseStyle      1460 non-null object
    YearBuilt       1460 non-null int64
    1stFlrSF        1460 non-null int64
    2ndFlrSF        1460 non-null int64
    BedroomAbvGr    1460 non-null int64
    YrSold          1460 non-null int64
    SaleType        1460 non-null object
    dtypes: float64(1), int64(6), object(4)
    memory usage: 125.5+ KB

iowa_feat.describe()

	LotFrontage	LotArea	YearBuilt	1stFlrSF	2ndFlrSF	BedroomAbvGr	YrSold
count	1201.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000
mean	70.049958	10516.828082	1971.267808	1162.626712	346.992466	2.866438	2007.815753
std	24.284752	9981.264932	30.202904	386.587738	436.528436	0.815778	1.328095
min	21.000000	1300.000000	1872.000000	334.000000	0.000000	0.000000	2006.000000
25%	59.000000	7553.500000	1954.000000	882.000000	0.000000	2.000000	2007.000000
50%	69.000000	9478.500000	1973.000000	1087.000000	0.000000	3.000000	2008.000000
75%	80.000000	11601.500000	2000.000000	1391.250000	728.000000	3.000000	2009.000000
max	313.000000	215245.000000	2010.000000	4692.000000	2065.000000	8.000000	2010.000000

Observations:

LotFrontage
- Has only 1201 non-null values out of 1460, which means that 259 values are missing. We will handle that in the next stage
- Has a mean of about 70 ft.
- Has a standard deviation of around 24 ft.
- The max value (313 ft) looks like an outlier, as 75% of the data is within 80 ft.
LotArea
- Seems to be a highly right-skewed distribution: the mean (10517) is above the median (9478.5) and the max is 215245
- Max value is definitely an outlier
YearBuilt
- Max is 2010, which means that no house in the data was built after 2010. This is consistent with YrSold, which also ends in 2010
1stFlrSF
- Seems to be reasonably well distributed: the mean (1163) and the median (1087) are fairly close
2ndFlrSF
- The median is 0, which means that at least 50% of the houses have no second floor
BedroomAbvGr
- Min is 0 bedrooms above ground
YrSold
- Houses were sold from 2006 to 2010
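The outlier remarks above can be checked with a simple rule of thumb. The sketch below is an illustration rather than part of the original notebook: it applies the common 1.5 × IQR rule to the quartiles printed by describe(). The helper name iqr_fences is made up for this example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maxima copied from the describe() output above
summary = {
    "LotFrontage": {"q1": 59.0, "q3": 80.0, "max": 313.0},
    "LotArea": {"q1": 7553.5, "q3": 11601.5, "max": 215245.0},
}

for name, s in summary.items():
    lower, upper = iqr_fences(s["q1"], s["q3"])
    print(f"{name}: fences=({lower}, {upper}), max above upper fence: {s['max'] > upper}")
```

Under this rule both maxima lie far above the upper fence, which supports treating them as outliers.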

Observing Categorical Variables

iowa_feat.describe(include=['O'])

	Utilities	BldgType	HouseStyle	SaleType
count	1460	1460	1460	1460
unique	2	5	8	9
top	AllPub	1Fam	1Story	WD
freq	1459	1220	726	1267

Utilities: - Almost all of the houses (1459 of 1460) belong to a single category, so this column carries almost no information about house prices. We may drop it in the next stage
BldgType: - About 84% of the data points (1220 of 1460) fall in a single category here as well. We might drop this column as well
HouseStyle: - The data is spread across multiple categories here; the most common one covers about 50% of the houses (726 of 1460). We will include this in our predictions
SaleType: - About 87% of the data points (1267 of 1460) fall in a single category. We will drop this column as well
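The share of the dominant category is simply freq divided by count in the table above. As a rough sketch, the same figure can be computed directly with value_counts(normalize=True). The toy dataframe and the helper name top_category_share are assumptions made for this example, not part of the original notebook.

```python
import pandas as pd

def top_category_share(df):
    """Share of rows taken by the most frequent value of each non-numeric column."""
    cat_cols = df.select_dtypes(exclude="number").columns
    return pd.Series(
        {col: df[col].value_counts(normalize=True).iloc[0] for col in cat_cols}
    )

# Toy data standing in for the real columns
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSeWa"],
    "HouseStyle": ["1Story"] * 5 + ["2Story"] * 3 + ["SLvl"] * 2,
    "LotArea": range(10),
})

shares = top_category_share(toy)
print(shares)
```

On the real data, calling top_category_share(iowa_feat) should give the same ratios as freq / count in the describe table.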

Visualizing Data

Here we will visualize our categorical and numerical variables and confirm some of the descriptive observations that we made above

sns.countplot(data=iowa,x='Utilities')
<matplotlib.axes._subplots.AxesSubplot at 0x249df9fff60>

[Figure: count plot of Utilities]

This confirms our above observation that almost all of the data points belong to a single category of Utilities.

sns.countplot(data=iowa,x='BldgType')
<matplotlib.axes._subplots.AxesSubplot at 0x249df7dff98>

[Figure: count plot of BldgType]

This confirms our observation that most of the data (about 84%) is in a single ‘BldgType’ category, so the feature is unlikely to have much impact on the predicted price. We will drop this as well.

sns.countplot(data=iowa,x='HouseStyle')
<matplotlib.axes._subplots.AxesSubplot at 0x249dc69dba8>

[Figure: count plot of HouseStyle]

This confirms our observation about ‘HouseStyle’: the houses are spread across several categories. We will include it in our predictions.

sns.countplot(data=iowa,x='SaleType')
<matplotlib.axes._subplots.AxesSubplot at 0x249dc61f390>

[Figure: count plot of SaleType]

This confirms our observation that the large majority of the data points (about 87%) fall under a single category in ‘SaleType’.

sns.boxplot(data=iowa,x='LotFrontage',palette='rainbow')
<matplotlib.axes._subplots.AxesSubplot at 0x249df89ff98>

[Figure: box plot of LotFrontage]

We were right about our observation that 313 is an outlier. The middle 50% of the data falls between about 59 and 80 ft.

plt.figure(figsize=(15,5))
sns.boxplot(data=iowa,x='LotArea',palette='rainbow')
plt.tight_layout()

[Figure: box plot of LotArea]

plt.figure(figsize=(15,5))
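#Note: distplot is deprecated since seaborn 0.11; sns.histplot plus sns.rugplot is the newer equivalent
sns.distplot(a=iowa['LotArea'],bins=100,kde=False,rug=True)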
plt.tight_layout()

[Figure: histogram of LotArea with rug plot]

The boxplot and histogram of ‘LotArea’ above confirm our observation that the distribution is highly right skewed. The max value is also an outlier.
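Skewness can also be expressed as a number with pandas' Series.skew(). The sketch below is only an illustration: it uses the five-number summary of LotArea from describe() as a tiny stand-in sample (an assumption, not the full column) and shows how a log1p transform, which is often used to make such plots easier to read, reduces the skew.

```python
import numpy as np
import pandas as pd

# Five-number summary of LotArea from describe(), used here as a toy sample
lot_area = pd.Series([1300.0, 7553.5, 9478.5, 11601.5, 215245.0])

raw_skew = lot_area.skew()             # strongly positive: long right tail
log_skew = np.log1p(lot_area).skew()   # the log transform compresses the tail

print(f"skew raw: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

On the real data the same two calls would be iowa['LotArea'].skew() and np.log1p(iowa['LotArea']).skew(). Tree-based models do not need this transform, so it is mainly useful for visualization.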

iowa['1stFlrSF'].plot(kind='hist',bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x249e513c0f0>

[Figure: histogram of 1stFlrSF]

This confirms our observation about ‘1stFlrSF’. The values are well spread rather than concentrated in a single bin. We will include this feature in our predictions.

iowa['2ndFlrSF'].plot(kind='hist',bins=30)
<matplotlib.axes._subplots.AxesSubplot at 0x249e4c23c88>

[Figure: histogram of 2ndFlrSF]

This confirms our observation about ‘2ndFlrSF’. Most of the data points have a value of 0.

iowa['BedroomAbvGr'].value_counts().sort_index().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x249e50f6e48>

[Figure: bar plot of BedroomAbvGr counts]

iowa['YrSold'].value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x249e5382240>

[Figure: bar plot of YrSold counts]

sns.pairplot(data=iowa)
<seaborn.axisgrid.PairGrid at 0x249c7ebcbe0>

[Figure: pair plot of the selected features and SalePrice]

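#Correlate numeric columns only; text columns such as HouseStyle raise an error in recent pandas versions
sns.heatmap(iowa.select_dtypes(include='number').corr(),cmap='magma_r',annot=True)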
<matplotlib.axes._subplots.AxesSubplot at 0x249eb4f1438>

[Figure: correlation heatmap of the numeric columns]

YrSold has a Pearson’s r of almost 0 with SalePrice. We will drop this column as well.
LotFrontage shows some correlation with SalePrice, so we will keep it and fill in its missing values.
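Reading the values off a heatmap can be tedious. As a small sketch, the correlations with the target can also be listed as a sorted column. The helper name correlation_with_target and the toy dataframe are assumptions made for this example.

```python
import pandas as pd

def correlation_with_target(df, target):
    """Pearson's r between each numeric column and the target, strongest first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data: 1stFlrSF is perfectly linear in SalePrice, YrSold is only weakly related
toy = pd.DataFrame({
    "1stFlrSF": [800, 1000, 1200, 1400, 1600],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "HouseStyle": ["1Story", "2Story", "1Story", "SLvl", "2Story"],
    "SalePrice": [100000, 125000, 150000, 175000, 200000],
})

r = correlation_with_target(toy, "SalePrice")
print(r)
```

On the real data, correlation_with_target(iowa, 'SalePrice') should show YrSold near the bottom of the list.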

Data Cleaning, Data Completing, Feature Engineering

Let’s clean the data by dropping the columns we decided to drop in the analysis above.

#Reassign instead of using inplace=True, which can warn or fail when iowa_feat is a slice of iowa
iowa_feat = iowa_feat.drop(['Utilities','BldgType','SaleType','YrSold'],axis=1)

Let’s complete the data now by filling in the null values.

#check for null values in our features
plt.figure(figsize=(10,5))
sns.heatmap(iowa_feat.isnull(),cbar=False,yticklabels=False,cmap='viridis')
<matplotlib.axes._subplots.AxesSubplot at 0x249eca01c18>

[Figure: heatmap of null values per feature]

So, only LotFrontage has null values. Let’s complete them by filling them with the average value. Do remember that there are a lot of ways to fill in missing values, but for now, I will just fill them with the mean value of the column.

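#Assign the result back; fillna(inplace=True) on a selected column may not update the dataframe in recent pandas
iowa_feat['LotFrontage'] = iowa_feat['LotFrontage'].fillna(iowa_feat['LotFrontage'].mean())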

#check for null values again
plt.figure(figsize=(10,5))
sns.heatmap(iowa_feat.isnull(),cbar=False,yticklabels=False,cmap='viridis')
<matplotlib.axes._subplots.AxesSubplot at 0x249ef63db00>

[Figure: heatmap of null values after imputation]

Great, now that we don’t have any null values, let’s proceed.
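One of the other ways to fill in missing values is scikit-learn’s SimpleImputer. The sketch below is an alternative, not what this notebook does: it learns the mean from the training rows only and reuses it for the test rows, so no information from the test set leaks into the fill value. The toy frames are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for a train/test split of LotFrontage
train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 100.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the mean from the training rows only
test_filled = imputer.transform(test)        # reuses that same mean for the test rows

print(imputer.statistics_)
print(train_filled.ravel(), test_filled.ravel())
```

The missing test value is filled with the training mean rather than with a value computed from the test rows.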

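Now, as part of feature engineering, we need to convert the one remaining categorical feature, HouseStyle, to numerical values. pd.get_dummies one-hot encodes it, and drop_first=True drops the first category to avoid a redundant column.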

iowa_feat = pd.get_dummies(iowa_feat,drop_first=True)
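To see what this call does, here is a minimal sketch on a toy dataframe (the four rows are made up for the example). Numeric columns pass through unchanged, while the text column is replaced by one indicator column per category except the first.

```python
import pandas as pd

# Toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "HouseStyle": ["2Story", "1Story", "2Story", "SLvl"],
})

encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
print(encoded)
```

The dropped category (‘1Story’ here) is represented implicitly by zeros in all the remaining HouseStyle columns.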

Get Model

We will be experimenting with two models in this project, DecisionTreeRegressor and RandomForestRegressor.

from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()

Train/Fit Model

Now, for this step we first need to split our dataset into training and testing sets. Note that when you download the dataset from Kaggle, it already comes split into train and test files, so you would not strictly have to do this yourself.
But I want to show you how it’s done. Remember, you always train your model with the training dataset and test it with the test dataset. Never validate/test your model with the data it was trained on.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iowa_feat, iowa_tar, test_size=0.3, random_state=101)
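As a quick sanity check of what this split produces, the sketch below repeats the call on placeholder arrays with the same number of rows as our dataset (1460). The arrays are stand-ins, not the real features. It also shows that a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (1460)
X = np.arange(1460).reshape(-1, 1)
y = np.arange(1460)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=101)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.3, random_state=101)

print(len(X_tr), len(X_te))          # sizes of the training and test parts
print(np.array_equal(X_te, X_te2))   # same seed, same rows in the test part
```

With test_size=0.3, roughly 30% of the rows are held out for testing and the rest are used for training.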

#Let's fit our model with the training dataset
dtr.fit(X_train,y_train)
  DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
        max_leaf_nodes=None, min_impurity_decrease=0.0,
        min_impurity_split=None, min_samples_leaf=1,
        min_samples_split=2, min_weight_fraction_leaf=0.0,
        presort=False, random_state=None, splitter='best')

Test Model

predictions = dtr.predict(X_test)
#we always test the model with testing dataset

Validate Model

Let’s check the performance of our model.

from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test,predictions)
30866.23515981735

This means that our predictions are, on average, around 31k USD away from the actual values in y_test.
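Mean absolute error is just the average of the absolute differences between the actual and predicted values. A minimal hand-written version, with made-up prices for illustration, looks like this:

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [200000, 150000, 300000]      # made-up prices
predicted = [210000, 140000, 330000]   # made-up predictions

# Absolute errors are 10000, 10000 and 30000, so the result is their average
print(mean_absolute_error_manual(actual, predicted))
```

This follows the same definition that sklearn.metrics.mean_absolute_error uses, so both should agree on the same inputs.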

Now we can improve on this by using the concept of the bias-variance tradeoff. If you look at the workflow steps, you will see that once we validate our model, we either get a new model or retrain the same model with different parameters to get better predictions. In this project, we will do both. Let’s first retrain our DecisionTreeRegressor model with different max_leaf_nodes values and find the best one.

max_leaf_nodes = [2,5,10,15,50,100,500,1000]
mae = []
for n in max_leaf_nodes:
    dtr = DecisionTreeRegressor(max_leaf_nodes=n)
    dtr.fit(X_train,y_train)
    predictions = dtr.predict(X_test)
    mae.append(mean_absolute_error(y_test,predictions))

print(mae)
[46104.2537919054,
59606292774,
912703601985,
35800783275,
79881073222,
847258634916,
47982211852,
888127853883]

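So we get the best result at max_leaf_nodes = 100, where the MAE reaches its lowest value of 27758 USD. With too few leaves the tree underfits (high bias), and with too many it starts to overfit the training data (high variance).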
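The tradeoff becomes easier to see when the training error is printed next to the validation error. The sketch below uses synthetic data, not the Iowa dataset, and fixes random_state=0 purely so that the run is repeatable. Both choices are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Iowa dataset): a noisy nonlinear target
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(400, 2))
y = 50 * np.sin(X[:, 0]) + 10 * X[:, 1] + rng.normal(0, 10, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = [2, 5, 10, 50, 100, 250]
scores = {}
for n in candidates:
    tree = DecisionTreeRegressor(max_leaf_nodes=n, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[n] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),    # training error
        mean_absolute_error(y_val, tree.predict(X_val)),  # validation error
    )

# Pick the candidate with the lowest validation error
best_n = min(scores, key=lambda n: scores[n][1])
for n, (train_mae, val_mae) in scores.items():
    print(f"{n:>4} leaves  train MAE={train_mae:7.2f}  val MAE={val_mae:7.2f}")
print("best max_leaf_nodes on validation data:", best_n)
```

The training error keeps shrinking as leaves are added, while the validation error typically stops improving at some point. That point is the value we are looking for.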

Fitting and Validating a different model

We will now train and test a RandomForestRegressor model and compare the results with those of a plain DecisionTreeRegressor model.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()

rf.fit(X_train,y_train)
    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
               max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
               oob_score=False, random_state=None, verbose=0, warm_start=False)

predictions_rf = rf.predict(X_test)

mean_absolute_error(y_test,predictions_rf)
24607.287214611868

We can see that even without any tuning, the random forest’s MAE (about 24607 USD) is lower than the minimum we achieved with a single DecisionTreeRegressor (27758 USD).

#Retraining our model with different parameters
estimators = [5,10,50,100,200,250,300,350,400]
for estimator in estimators:
    rf = RandomForestRegressor(n_estimators=estimator)
    rf.fit(X_train,y_train)
    predictions_rf = rf.predict(X_test)
    print(round(mean_absolute_error(y_test,predictions_rf)))

    24511.0
    23998.0
    23360.0
    23318.0
    23421.0
    23258.0
    23392.0
    23430.0
    23302.0

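Retraining our model with more trees gives even better results. At n_estimators=250, we got an MAE of 23258 USD. Note that beyond about 50 trees the values differ by only a couple of hundred USD, which may simply be run-to-run variation, since no random_state is set.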
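Because a random forest is randomized, two runs with the same settings can give slightly different scores. A minimal sketch of how a fixed random_state removes that noise is shown below. It uses synthetic data rather than the Iowa dataset, and the seeds 42 and 7 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the Iowa dataset)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] * 5 + X[:, 1] ** 2 + rng.normal(0, 1, size=200)

def fit_predict(seed):
    """Fit a small forest with a fixed seed and predict on the training inputs."""
    forest = RandomForestRegressor(n_estimators=20, random_state=seed)
    return forest.fit(X, y).predict(X)

same_seed_equal = np.allclose(fit_predict(42), fit_predict(42))
diff_seed_equal = np.allclose(fit_predict(42), fit_predict(7))
print(same_seed_equal, diff_seed_equal)
```

Passing the same random_state to every model in the n_estimators loop would make the comparison between different numbers of trees more reliable.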


Muzammil Iftikhar
