Effect of an outlier on Regression Line

A short article showing how a single outlier in the direction of your independent variable can change the regression line and the correlation.

# Necessary imports
import pandas as pd
import seaborn as sns
# Let's define a small hand-picked dataset to prove our point
df = pd.DataFrame({'x': [1, 4, 5, 8, 10, 13], 'y': [3, 5, 8, 10, 15, 20]})
df
    x   y
0   1   3
1   4   5
2   5   8
3   8  10
4  10  15
5  13  20
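# Let's plot the regression line between x and y, where x is the independent variable and y is the dependent variable
# (keyword arguments are used because recent seaborn releases no longer accept x, y and data positionally)
sns.lmplot(x='x', y='y', data=df)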
[Figure: scatter plot of x and y with the fitted regression line]

# Let's check the correlation between x and y
df.corr()
          x         y
x  1.000000  0.981795
y  0.981795  1.000000
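
By default, `df.corr()` reports the Pearson correlation coefficient. As a sanity check, the same number can be reproduced from the definition (the covariance of x and y divided by the product of their spreads). The snippet below is a minimal sketch using NumPy; the helper name `pearson_r` is made up for this illustration and is not part of pandas or seaborn.

```python
import numpy as np

# Same six points as in the DataFrame above
x = np.array([1, 4, 5, 8, 10, 13], dtype=float)
y = np.array([3, 5, 8, 10, 15, 20], dtype=float)


def pearson_r(a, b):
    """Pearson correlation computed directly from its definition (illustrative helper)."""
    da = a - a.mean()  # deviations of a from its mean
    db = b - b.mean()  # deviations of b from its mean
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())


r = pearson_r(x, y)
print(round(r, 6))  # should agree with the off-diagonal value from df.corr()
```

Because the formula is built from deviations around the means, a single point that sits far from the mean of x carries a lot of weight in the result.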

We can see that the correlation between x and y is very strong. Let us now place an outlier in the direction of the independent variable (x) and see its effect on the correlation value.

# Let's place an outlier along the x-axis
df = pd.DataFrame({'x': [1, 4, 5, 8, 10, 13, 100],
                   'y': [3, 5, 8, 10, 15, 20, 5]})
df
     x   y
0    1   3
1    4   5
2    5   8
3    8  10
4   10  15
5   13  20
6  100   5
sns.lmplot(x='x', y='y', data=df)
[Figure: scatter plot of x and y with the fitted regression line, now including the outlier at x = 100]

df.corr()
          x         y
x  1.000000 -0.211966
y -0.211966  1.000000
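
The correlation is only half of the story; the fitted line itself also changes. As a rough sketch, the least-squares line can be fitted with and without the extra point using `np.polyfit`, on the assumption that it matches the ordinary least-squares fit that `sns.lmplot` draws by default. The variable names below are chosen just for this example.

```python
import numpy as np

# Original six points
x = np.array([1, 4, 5, 8, 10, 13], dtype=float)
y = np.array([3, 5, 8, 10, 15, 20], dtype=float)

# Same data with the outlier (100, 5) appended
x_out = np.append(x, 100.0)
y_out = np.append(y, 5.0)

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope_clean, intercept_clean = np.polyfit(x, y, 1)
slope_out, intercept_out = np.polyfit(x_out, y_out, 1)

print(f"without outlier: y = {slope_clean:.3f} * x + {intercept_clean:.3f}")
print(f"with outlier:    y = {slope_out:.3f} * x + {intercept_out:.3f}")
```

Without the outlier the slope is clearly positive; with it, the slope turns slightly negative, which is the same sign flip seen in the correlation.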

The relationship has gone from a very strong positive correlation (0.98) to a weak negative one (-0.21), all because of a single point. Hence, it is always a good idea to investigate outliers, especially in small datasets. They may point to a potential opportunity; in the worst case, you can drop them altogether, since they can adversely affect the performance of your regression model.
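
One possible way to start such an investigation is to compute the leverage of each point, which measures how far its x value lies from the mean of x. The sketch below uses the standard leverage formula for simple linear regression; the `2 * p / n` cutoff is a common rule of thumb and is an assumption here, not something this article prescribes.

```python
import numpy as np

# x values from the dataset that includes the outlier
x = np.array([1, 4, 5, 8, 10, 13, 100], dtype=float)
n = len(x)

# Leverage for simple linear regression:
# h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2
dx = x - x.mean()
leverage = 1.0 / n + dx ** 2 / (dx ** 2).sum()

# Rule-of-thumb cutoff (assumption): 2 * p / n, with p = 2 fitted coefficients
threshold = 2 * 2 / n
flagged = np.where(leverage > threshold)[0]

print(np.round(leverage, 3))
print("high-leverage rows:", flagged)
```

On this dataset only the last row (x = 100) exceeds the cutoff, so it is the natural candidate to inspect before deciding whether to keep or drop it.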
