Project: Analyzing San Francisco Salaries Data

18 minute read

In this project, we will analyze San Francisco employee’s salaries data that is available on Kaggle. We will learn:

  • Data exploration using Pandas
  • Converting object columns to numeric
  • Finding null values
  • Imputing null values using both fillna method and SimpleImputer
#Imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Import data from the csv file
salaries = pd.read_csv('Salaries.csv')
#Glimpse of data
salaries.head()
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411 0 400184 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966 245132 137811 NaN 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739 106088 16452.6 NaN 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916 56120.7 198307 NaN 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134402 9737 182235 NaN 326373.19 326373.19 2011 NaN San Francisco NaN
#Data information
salaries.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 148654 entries, 0 to 148653
    Data columns (total 13 columns):
    Id                  148654 non-null int64
    EmployeeName        148654 non-null object
    JobTitle            148654 non-null object
    BasePay             148049 non-null object
    OvertimePay         148654 non-null object
    OtherPay            148654 non-null object
    Benefits            112495 non-null object
    TotalPay            148654 non-null float64
    TotalPayBenefits    148654 non-null float64
    Year                148654 non-null int64
    Notes               0 non-null float64
    Agency              148654 non-null object
    Status              38119 non-null object
    dtypes: float64(3), int64(2), object(8)
    memory usage: 14.7+ MB
#Before we start analyzing the data. Lets see if we have any null values anywhere. We might have to take care of them first.
plt.figure(figsize=(10,6))
sns.heatmap(salaries.isnull(),cmap='viridis',yticklabels=False,cbar=False)

    <matplotlib.axes._subplots.AxesSubplot at 0x242869b5b00>

png

#Find the columns where null value exists and then count the null values
{col:salaries[col].isnull().sum() for col in salaries.columns if salaries[col].isnull().any()}

    {'BasePay': 605, 'Benefits': 36159, 'Notes': 148654, 'Status': 110535}

Looks like we have to drop the Notes and Status column as there are alot of null values.

salaries.drop(['Notes','Status'],axis=1,inplace=True)

Let’s now focus on the BasePay and Benefits column. We will convert both the columns to ‘float64’ type. However, both the columns contain ‘Not Provided’ value which we will convert to nan so that it will be treated with other nan values.

salaries['BasePay'][salaries['BasePay']=='Not Provided'] = np.nan
salaries['BasePay'] = pd.to_numeric(salaries['BasePay'])
salaries['BasePay'].dtype

    dtype('float64')

Let’s repeat the same for ‘Benefits’ column.

salaries['Benefits'][salaries['Benefits']=='Not Provided'] = np.nan
salaries['Benefits'] = pd.to_numeric(salaries['Benefits'])
salaries['BasePay'].dtype

    dtype('float64')

Now that both the columns are numeric, we can take care of the null values.

salaries['Benefits'].fillna(value=0,inplace=True)

We will fill the null values of BasePay column with the mean of that column. However, i want to show you another method of taking care of null values i.e. SimpleImputer

from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='mean')
salaries['BasePay'] = pd.DataFrame(si.fit_transform(np.array(salaries['BasePay']).reshape(-1,1)))
#Lets check if we have any null values in BasePay column now
salaries['BasePay'].isnull().any()

    False

Now that we have taken care of the null values, lets explore the data.

#Avg BasePay
salaries['BasePay'].mean()

    66325.44884050524
#Job title of DAVID FRANKLIN
salaries[salaries['EmployeeName'] == 'DAVID FRANKLIN']['JobTitle']

    14    BATTALION CHIEF, (FIRE DEPARTMENT)
    Name: JobTitle, dtype: object
#How much does DAVID FRANKLIN make
salaries[salaries['EmployeeName'] == 'DAVID FRANKLIN']['TotalPayBenefits']

    14    286347.05
    Name: TotalPayBenefits, dtype: float64
#Name of highest paid person
salaries[salaries['TotalPayBenefits']==salaries['TotalPayBenefits'].max()]['EmployeeName']

    0    NATHANIEL FORD
    Name: EmployeeName, dtype: object
#Name and pay of lowest paid person
salaries[salaries['TotalPayBenefits']==salaries['TotalPayBenefits'].min()][['EmployeeName','TotalPayBenefits']]
EmployeeName TotalPayBenefits
148653 Joe Lopez -618.13
#Avg BasePay of all employees per year from 2011-2014
salaries.groupby(by='Year').mean()['BasePay']

    Year
    2011    63595.956517
    2012    65436.406857
    2013    69576.866579
    2014    66564.396851
    Name: BasePay, dtype: float64

#Number of Unique job titles
salaries['JobTitle'].nunique()

    2159
#Top 5 most common jobs
salaries['JobTitle'].value_counts().head()

    Transit Operator                7036
    Special Nurse                   4389
    Registered Nurse                3736
    Public Svc Aide-Public Works    2518
    Police Officer 3                2421
    Name: JobTitle, dtype: int64
#Number of Job titles with only one occurence in 2013
sum(salaries[salaries['Year']==2013]['JobTitle'].value_counts()==1)

    202
#To view those 202 Job titles
salaries.groupby(by=['Year','JobTitle']).count().loc[2013,'Id'][salaries.groupby(by=['Year','JobTitle']).count().loc[2013,'Id']==1]

    JobTitle
    Acupuncturist                     1
    Adm, SFGH Medical Center          1
    Administrative Analyst I          1
    Administrative Analyst II         1
    Administrator, DPH                1
    Airport Communications Officer    1
    Airport Mechanical Maint Sprv     1
    Animal Care Asst Supv             1
    Animal Care Supervisor            1
    Animal Control Supervisor         1
    Arborist Technician Supv II       1
    Area Sprv Parks, Squares & Fac    1
    Asphalt Plant Supervisor 1        1
    Assessor                          1
    Assistant Director, Probate       1
    Assistant Industrial Hygienist    1
    Assistant Inspector               1
    Assistant Inspector 2             1
    Assistant Law Librarian           1
    Assistant Power House Operator    1
    Assistant Sheriff                 1
    Assoc Musm Cnsrvt, AAM            1
    Asst Chf, Bur Clm Invest&Admin    1
    Asst Dir of Clinical Svcs 1       1
    Asst Dir, Log Cabin Rnch          1
    Asst Director, Juvenile Hall      1
    Asst Superintendent Rec           1
    Attorney, Tax Collector           1
    Auto Body & Fender Wrk Sprv 1     1
    Baker                             1
                                     ..
    Special Assistant 21              1
    Specialist in Aging 2             1
    Sprv Adult Prob Ofc (SFERS)       1
    Sr Employee Asst Counselor        1
    Sr General Utility Mechanic       1
    Sr Light Rail Veh Equip Eng       1
    Sr Medical Transcriber Typist     1
    Sr Opers Mgr                      1
    Statistician                      1
    Sup Ct Admin Secretary            1
    Sup Welfare Fraud Investigator    1
    Supervising Parts Storekeeper     1
    Supply Room Attendant             1
    Telecommunications Tech Supv      1
    Track Maint Supt, Muni Railway    1
    Traf Signal Electrician Sup II    1
    Traffic Sign Manager              1
    Traffic Signal Operator           1
    Training Coordinator              1
    Training Technician               1
    Transit Paint Shop Sprv1          1
    Treasurer                         1
    Trnst Power Line Wrk Sprv 2       1
    Undersheriff                      1
    Vet Laboratory Technologist       1
    Victim & Witness Technician       1
    Water Meter Shop Supervisor 1     1
    Wharfinger 1                      1
    Window Cleaner Supervisor         1
    Wire Rope Cable Maint Sprv        1
    Name: Id, Length: 202, dtype: int64
#How many people have the word Chief in their job title
len(salaries[salaries['JobTitle'].apply(lambda x:'chief' in x.lower())])

    627
#Is there a relation between length of Jobtitle and Salary
salaries['title_length'] = salaries['JobTitle'].apply(len)
salaries[['title_length','TotalPayBenefits']].corr()
title_length TotalPayBenefits
title_length 1.000000 -0.036878
TotalPayBenefits -0.036878 1.000000

Seems there is no correlation between the two variables.

Leave a comment