Project: Analyzing San Francisco Salaries Data

18 minute read

In this project, we will analyze San Francisco employee’s salaries data that is available on Kaggle. We will learn:

Data exploration using Pandas
Converting object columns to numeric
Finding null values
Imputing null values using both fillna method and SimpleImputer

#Imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#Import data from the csv file
salaries = pd.read_csv('Salaries.csv')

#Glimpse of data
salaries.head()

	Id	EmployeeName	JobTitle	BasePay	OvertimePay	OtherPay	Benefits	TotalPay	TotalPayBenefits	Year	Notes	Agency	Status
0	1	NATHANIEL FORD	GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY	167411	0	400184	NaN	567595.43	567595.43	2011	NaN	San Francisco	NaN
1	2	GARY JIMENEZ	CAPTAIN III (POLICE DEPARTMENT)	155966	245132	137811	NaN	538909.28	538909.28	2011	NaN	San Francisco	NaN
2	3	ALBERT PARDINI	CAPTAIN III (POLICE DEPARTMENT)	212739	106088	16452.6	NaN	335279.91	335279.91	2011	NaN	San Francisco	NaN
3	4	CHRISTOPHER CHONG	WIRE ROPE CABLE MAINTENANCE MECHANIC	77916	56120.7	198307	NaN	332343.61	332343.61	2011	NaN	San Francisco	NaN
4	5	PATRICK GARDNER	DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)	134402	9737	182235	NaN	326373.19	326373.19	2011	NaN	San Francisco	NaN

#Data information
salaries.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 148654 entries, 0 to 148653
    Data columns (total 13 columns):
    Id                  148654 non-null int64
    EmployeeName        148654 non-null object
    JobTitle            148654 non-null object
    BasePay             148049 non-null object
    OvertimePay         148654 non-null object
    OtherPay            148654 non-null object
    Benefits            112495 non-null object
    TotalPay            148654 non-null float64
    TotalPayBenefits    148654 non-null float64
    Year                148654 non-null int64
    Notes               0 non-null float64
    Agency              148654 non-null object
    Status              38119 non-null object
    dtypes: float64(3), int64(2), object(8)
    memory usage: 14.7+ MB

#Before we start analyzing the data. Lets see if we have any null values anywhere. We might have to take care of them first.
plt.figure(figsize=(10,6))
sns.heatmap(salaries.isnull(),cmap='viridis',yticklabels=False,cbar=False)

    <matplotlib.axes._subplots.AxesSubplot at 0x242869b5b00>

png

#Find the columns where null value exists and then count the null values
{col:salaries[col].isnull().sum() for col in salaries.columns if salaries[col].isnull().any()}

    {'BasePay': 605, 'Benefits': 36159, 'Notes': 148654, 'Status': 110535}

Looks like we have to drop the Notes and Status column as there are alot of null values.

salaries.drop(['Notes','Status'],axis=1,inplace=True)

Let’s now focus on the BasePay and Benefits column. We will convert both the columns to ‘float64’ type. However, both the columns contain ‘Not Provided’ value which we will convert to nan so that it will be treated with other nan values.

salaries['BasePay'][salaries['BasePay']=='Not Provided'] = np.nan

salaries['BasePay'] = pd.to_numeric(salaries['BasePay'])

salaries['BasePay'].dtype

    dtype('float64')

Let’s repeat the same for ‘Benefits’ column.

salaries['Benefits'][salaries['Benefits']=='Not Provided'] = np.nan

salaries['Benefits'] = pd.to_numeric(salaries['Benefits'])

salaries['BasePay'].dtype

    dtype('float64')

Now that both the columns are numeric, we can take care of the null values.

salaries['Benefits'].fillna(value=0,inplace=True)

We will fill the null values of BasePay column with the mean of that column. However, i want to show you another method of taking care of null values i.e. SimpleImputer

from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy='mean')

salaries['BasePay'] = pd.DataFrame(si.fit_transform(np.array(salaries['BasePay']).reshape(-1,1)))

#Lets check if we have any null values in BasePay column now
salaries['BasePay'].isnull().any()

    False

Now that we have taken care of the null values, lets explore the data.

#Avg BasePay
salaries['BasePay'].mean()

    66325.44884050524

#Job title of DAVID FRANKLIN
salaries[salaries['EmployeeName'] == 'DAVID FRANKLIN']['JobTitle']

    14    BATTALION CHIEF, (FIRE DEPARTMENT)
    Name: JobTitle, dtype: object

#How much does DAVID FRANKLIN make
salaries[salaries['EmployeeName'] == 'DAVID FRANKLIN']['TotalPayBenefits']

    14    286347.05
    Name: TotalPayBenefits, dtype: float64

#Name of highest paid person
salaries[salaries['TotalPayBenefits']==salaries['TotalPayBenefits'].max()]['EmployeeName']

    0    NATHANIEL FORD
    Name: EmployeeName, dtype: object

#Name and pay of lowest paid person
salaries[salaries['TotalPayBenefits']==salaries['TotalPayBenefits'].min()][['EmployeeName','TotalPayBenefits']]

	EmployeeName	TotalPayBenefits
148653	Joe Lopez	-618.13

#Avg BasePay of all employees per year from 2011-2014
salaries.groupby(by='Year').mean()['BasePay']

    Year
    2011    63595.956517
    2012    65436.406857
    2013    69576.866579
    2014    66564.396851
    Name: BasePay, dtype: float64

#Number of Unique job titles
salaries['JobTitle'].nunique()

    2159

#Top 5 most common jobs
salaries['JobTitle'].value_counts().head()

    Transit Operator                7036
    Special Nurse                   4389
    Registered Nurse                3736
    Public Svc Aide-Public Works    2518
    Police Officer 3                2421
    Name: JobTitle, dtype: int64

#Number of Job titles with only one occurence in 2013
sum(salaries[salaries['Year']==2013]['JobTitle'].value_counts()==1)

    202

#To view those 202 Job titles
salaries.groupby(by=['Year','JobTitle']).count().loc[2013,'Id'][salaries.groupby(by=['Year','JobTitle']).count().loc[2013,'Id']==1]

    JobTitle
    Acupuncturist                     1
    Adm, SFGH Medical Center          1
    Administrative Analyst I          1
    Administrative Analyst II         1
    Administrator, DPH                1
    Airport Communications Officer    1
    Airport Mechanical Maint Sprv     1
    Animal Care Asst Supv             1
    Animal Care Supervisor            1
    Animal Control Supervisor         1
    Arborist Technician Supv II       1
    Area Sprv Parks, Squares & Fac    1
    Asphalt Plant Supervisor 1        1
    Assessor                          1
    Assistant Director, Probate       1
    Assistant Industrial Hygienist    1
    Assistant Inspector               1
    Assistant Inspector 2             1
    Assistant Law Librarian           1
    Assistant Power House Operator    1
    Assistant Sheriff                 1
    Assoc Musm Cnsrvt, AAM            1
    Asst Chf, Bur Clm Invest&Admin    1
    Asst Dir of Clinical Svcs 1       1
    Asst Dir, Log Cabin Rnch          1
    Asst Director, Juvenile Hall      1
    Asst Superintendent Rec           1
    Attorney, Tax Collector           1
    Auto Body & Fender Wrk Sprv 1     1
    Baker                             1
                                     ..
    Special Assistant 21              1
    Specialist in Aging 2             1
    Sprv Adult Prob Ofc (SFERS)       1
    Sr Employee Asst Counselor        1
    Sr General Utility Mechanic       1
    Sr Light Rail Veh Equip Eng       1
    Sr Medical Transcriber Typist     1
    Sr Opers Mgr                      1
    Statistician                      1
    Sup Ct Admin Secretary            1
    Sup Welfare Fraud Investigator    1
    Supervising Parts Storekeeper     1
    Supply Room Attendant             1
    Telecommunications Tech Supv      1
    Track Maint Supt, Muni Railway    1
    Traf Signal Electrician Sup II    1
    Traffic Sign Manager              1
    Traffic Signal Operator           1
    Training Coordinator              1
    Training Technician               1
    Transit Paint Shop Sprv1          1
    Treasurer                         1
    Trnst Power Line Wrk Sprv 2       1
    Undersheriff                      1
    Vet Laboratory Technologist       1
    Victim & Witness Technician       1
    Water Meter Shop Supervisor 1     1
    Wharfinger 1                      1
    Window Cleaner Supervisor         1
    Wire Rope Cable Maint Sprv        1
    Name: Id, Length: 202, dtype: int64

#How many people have the word Chief in their job title
len(salaries[salaries['JobTitle'].apply(lambda x:'chief' in x.lower())])

    627

#Is there a relation between length of Jobtitle and Salary
salaries['title_length'] = salaries['JobTitle'].apply(len)

salaries[['title_length','TotalPayBenefits']].corr()

	title_length	TotalPayBenefits
title_length	1.000000	-0.036878
TotalPayBenefits	-0.036878	1.000000

Seems there is no correlation between the two variables.

Share on

Twitter Facebook LinkedIn

Muzammil Iftikhar

Project: Analyzing San Francisco Salaries Data

Share on

Leave a comment

You may also enjoy

Flask+Pipenv+Postgres+Docker+Nginx+uWSGI

Project: Predicting breast cancer

Webscraping sites with infinite scroll

Webscraping using scrapy