Analyzing Differences in Salary Among Employees

William Kyeremateng

Introduction

This project uses various SQL functions to explore a mulitnational organization's database to determine the differences that exist among its employees in terms of salary. Using the company's staff database, this analysis will attempt to answer the following questions.

  • Which region pays the highest salary?

  • Is there a gap in salaries based on gender? Is the gender wage gap, to the extent that it exists, consistent with years of service?

  • Are the gender wage gaps consistent across all locations?

  • For each department, is there a substantial gap between the highest and lowest earners in terms of salary and length of service?

The database used in this analysis was created using this script. The database contains three tables which provide various information about the company's employees, including name, gender, department and division they work in, the country and region they work, hire date and salary.

Part 1: Establish the Database Connection

In [2]:
%load_ext sql
from sqlalchemy import create_engine
import psycopg2
In [3]:
%sql postgresql://postgres:KW88193m@localhost:5432/staff
engine = create_engine('postgresql://postgres:KW88193m@localhost/staff')

Part 2: Data Examination and Cleaning

2.1. Inspect the Tables¶

In [4]:
%%sql

SELECT * FROM employee
LIMIT 5;
 * postgresql://postgres:***@localhost:5432/staff
5 rows affected.
Out[4]:
employee_id first_name last_name email hire_date department gender salary region_id
1 Berrie Manueau bmanueau0@dion.ne.jp 2006-04-20 Sports F 154864 4
2 Aeriell McNee amcnee1@google.es 2009-01-26 Tools F 56752 3
3 Sydney Symonds ssymonds2@hhs.gov 2010-05-17 Clothing F 95313 4
4 Avrom Rowantree None 2014-08-02 Phones & Tablets M 119674 7
5 Feliks Morffew fmorffew4@a8.net 2003-01-14 Computers M 55307 5
In [7]:
%%sql

SELECT * FROM department
ORDER BY 1;
 * postgresql://postgres:***@localhost:5432/staff
24 rows affected.
Out[7]:
department division
Automotive Hardware
Beauty Fashion
Books Entertainment
Camping & Fishing Outdoors
Children Clothing Kids
Clothing Home
Computers Electronics
Cosmetics Fashion
Decor Home
Device Repair Electronics
First Aid Health
Furniture Home
Games Entertainment
Garden Outdoors
Grocery Home
Jewelry Fashion
Movies Entertainment
Music Entertainment
Pharmacy Health
Phones & Tablets Electronics
Sports Outdoors
Tools Hardware
Toys Kids
Vitamins Health
In [8]:
%%sql

SELECT * FROM region
 * postgresql://postgres:***@localhost:5432/staff
7 rows affected.
Out[8]:
region_id region country
1 Southwest United States
2 Northeast United States
3 Northwest United States
4 Central Asia
5 East Asia Asia
6 Quebec Canada
7 Nova Scotia Canada

2.2. Check to make sure tables can be joined¶

a. Check to see if the employee and department tables correspond and can be joined.

In [22]:
%%sql

SELECT e.department, division FROM employee e
LEFT JOIN department d
ON e.department = d.department
GROUP BY 2, 1
ORDER BY 2 DESC
LIMIT 5;
 * postgresql://postgres:***@localhost:5432/staff
5 rows affected.
Out[22]:
department division
Sports Outdoors
Garden Outdoors
Camping & Fishing Outdoors
Security Other
Maintenance Other

From the table above, it can be seen that four departments - Camping, Maintenance, Plumbing, and Security - are not assigned to any division. It appears they are included in the employee table but not in the department table.

Investigate and add them to the department table, assigning them to divisions where possible.

  • Camping: In the department table, you have Camping & Fishing as the department, assigned to the Outdoors division (see the table in Section 2.2). Hence, in the employee table, I change Camping to Camping & Fishing as shown below.
In [12]:
%%sql

UPDATE employee
    SET department = 'Camping & Fishing'
    WHERE department = 'Camping';
 * postgresql://postgres:***@localhost:5432/staff
0 rows affected.
Out[12]:
[]
In [16]:
%%sql

SELECT * FROM employee
WHERE department = 'Camping & Fishing'
LIMIT 2;
 * postgresql://postgres:***@localhost:5432/staff
2 rows affected.
Out[16]:
employee_id first_name last_name email hire_date department gender salary region_id
82 Dani Diter dditer29@huffingtonpost.com 2006-01-13 Camping & Fishing M 102516 2
86 Doll Blondel dblondel2d@list-manage.com 2007-03-30 Camping & Fishing F 136773 7
  • Maintenance, Plumbing, and Security: Add Plumbing to the department table under the 'Hardware' division. Also add Security and Maintenance to the department table under a new division called Other.
In [18]:
%%sql

INSERT INTO department VALUES ('Plumbing', 'Hardware');
 * postgresql://postgres:***@localhost:5432/staff
1 rows affected.
Out[18]:
[]
In [19]:
%%sql


INSERT INTO department VALUES ('Security', 'Other')
 * postgresql://postgres:***@localhost:5432/staff
1 rows affected.
Out[19]:
[]
In [20]:
%%sql


INSERT INTO department VALUES ('Maintenance', 'Other')
 * postgresql://postgres:***@localhost:5432/staff
1 rows affected.
Out[20]:
[]

b. Check to make sure employee & region tables can be joined.

In [23]:
%%sql

SELECT e.region_id, region, country FROM employee e
LEFT JOIN region r
ON e.region_id = r.region_id
GROUP BY 1, 2, 3
ORDER BY 1;
 * postgresql://postgres:***@localhost:5432/staff
7 rows affected.
Out[23]:
region_id region country
1 Southwest United States
2 Northeast United States
3 Northwest United States
4 Central Asia
5 East Asia Asia
6 Quebec Canada
7 Nova Scotia Canada

Part 3: Analyze Employee Count

Before analyzing the differences in salary among employees, I take a look at the distribution of the count of employees based on gender country, region and division to determine where there are substantial differences.

3.1. What is the total number of employees?¶

In [24]:
%%sql

SELECT COUNT(DISTINCT(employee_id)) AS total_employees
FROM employee;
 * postgresql://postgres:***@localhost:5432/staff
1 rows affected.
Out[24]:
total_employees
1000

3.2. Breakdown total employees by gender, country, region, and division¶

a) By Gender: There is a near 50-50 split between male and female employees

In [10]:
%%sql

SELECT
   CASE WHEN gender = 'M' THEN 'Male'
   ELSE 'Female' END As gender,
   COUNT(employee_id) AS employee_count
   FROM employee
GROUP BY 1
ORDER BY employee_count;
 * postgresql://postgres:***@localhost:5432/staff
2 rows affected.
Out[10]:
gender employee_count
Male 499
Female 501

b) By Country: Most of the employees are in the U.S., followed by Canada, and then Asia

In [28]:
%%sql

SELECT country, COUNT(employee_id) AS employee_count
FROM
    (SELECT * FROM employee e
     LEFT JOIN region r
     ON e.region_id = r.region_id) AS sub
GROUP BY 1
ORDER BY 2 DESC;
 * postgresql://postgres:***@localhost:5432/staff
3 rows affected.
Out[28]:
country employee_count
United States 438
Canada 298
Asia 264

c) By Region: Nova Scotia and Southwest have the joint-most highest (152), while East Asia has the least (127).

In [30]:
%%sql

SELECT region, country, COUNT(employee_id) AS employee_count
FROM
    (SELECT * FROM employee e
     LEFT JOIN region r
     ON e.region_id = r.region_id) AS sub
GROUP BY 2,1
ORDER BY 3 DESC;
 * postgresql://postgres:***@localhost:5432/staff
7 rows affected.
Out[30]:
region country employee_count
Southwest United States 152
Nova Scotia Canada 152
Quebec Canada 146
Northwest United States 145
Northeast United States 141
Central Asia 137
East Asia Asia 127

d) By Division: Home(159) and Entertainment(158) have the highest number of employees; Other has the least (14).

In [32]:
%%sql

SELECT division, COUNT(*) AS employee_count
FROM
    (SELECT *
     FROM employee e
     LEFT JOIN department d
     ON e.department = d.department) AS sub
GROUP BY 1
ORDER BY 2 DESC;
 * postgresql://postgres:***@localhost:5432/staff
9 rows affected.
Out[32]:
division employee_count
Home 159
Entertainment 158
Electronics 133
Health 133
Fashion 120
Outdoors 111
Kids 94
Hardware 78
Other 14

Part 4: Length of Service

Also, I take a look at who the longest serving employees are across all locations and by division.

4.1. Who are the 5 longest serving employees; how long have they been at the company?¶

In [33]:
%%sql

SELECT first_name, last_name, department,
       hire_date, AGE(hire_date) AS years_served
FROM employee
ORDER BY 4
LIMIT 5;
 * postgresql://postgres:***@localhost:5432/staff
5 rows affected.
Out[33]:
first_name last_name department hire_date years_served
Cassandra Hoston Beauty 2003-01-01 7365 days, 0:00:00
Norbie Bleasdille First Aid 2003-01-01 7365 days, 0:00:00
Rora Brumfitt Children Clothing 2003-01-12 7355 days, 0:00:00
Feliks Morffew Computers 2003-01-14 7353 days, 0:00:00
Cecilius Cottey Vitamins 2003-01-20 7347 days, 0:00:00

4.2. Who are the longest serving employees in each division? Which departments do they work in?¶

In [39]:
%%sql

WITH div_hiredate AS (
     SELECT division, min(hire_date) AS hire_date
     FROM employee as e
     JOIN department as d
     ON e.department = d.department
     GROUP BY 1
     ORDER BY 2) 

SELECT DISTINCT ON (first_name)
    first_name, last_name, division, 
    hire_date, AGE(hire_date) AS years_served
FROM
    (SELECT first_name, last_name, e.department,
     d.division, d.hire_date, AGE(e.hire_date) AS years_served
     FROM div_hiredate d
     INNER JOIN employee AS e
     ON e.hire_date = d.hire_date
     ORDER BY 5) sub
ORDER BY 1, 4 DESC;
 * postgresql://postgres:***@localhost:5432/staff
9 rows affected.
Out[39]:
first_name last_name division hire_date years_served
Arvy Rivilis Outdoors 2003-03-16 7286 days, 0:00:00
Cassandra Hoston Health 2003-01-01 7365 days, 0:00:00
Corabel Scarse Other 2004-03-24 6913 days, 0:00:00
Dorothea Addicott Home 2003-02-27 7307 days, 0:00:00
Feliks Morffew Electronics 2003-01-14 7353 days, 0:00:00
Maryellen Westnedge Hardware 2003-04-19 7252 days, 0:00:00
Nonnah Dict Entertainment 2003-03-08 7294 days, 0:00:00
Norbie Bleasdille Health 2003-01-01 7365 days, 0:00:00
Rora Brumfitt Kids 2003-01-12 7355 days, 0:00:00

Part 5: Employee Salary

In ranking salaries, I use median salary instead of average salary. Median is more robust than average, and not easily influenced by outliers.

5.1. Which region pays employees the most salary?¶

Since median instead of average to assess and compare regions based on salary since it is a more robust measure of central tendency.

In [16]:
%%sql

SELECT country, region, 
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employee e
JOIN region r
ON e.region_id = r.region_id
GROUP BY 1, 2
ORDER BY 3 DESC;
 * postgresql://postgres:***@localhost:5432/staff
7 rows affected.
Out[16]:
country region median_salary
United States Northeast 95398.0
United States Northwest 92218.0
Canada Nova Scotia 90359.0
Asia Central 89961.0
United States Southwest 88624.5
Asia East Asia 86895.0
Canada Quebec 85188.5

Based on median salary, the Northeast region pays the most, with Quebec paying the least.

5.2. Is there a gap in earnings based on gender? If so, how wide is the gap? Is that gap explained by gap in time on the job?¶

Males have higher median salaries than females, even though, the females have been on the job almost a year longer.

In [19]:
%%sql

SELECT CASE WHEN gender = 'M' THEN 'Male' ELSE 'Female' END AS gender,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY AGE(hire_date)) AS median_hiredate
FROM employee
GROUP BY gender
ORDER BY median_salary;
 * postgresql://postgres:***@localhost:5432/staff
2 rows affected.
Out[19]:
gender median_salary median_hiredate
Female 88572.0 4964 days, 0:00:00
Male 89967.0 4653 days, 0:00:00

5.3. Do men make more than women across all regions?¶

Females have higher median salary than Males in East Asia, Nova Scotia, and Quebec. in the other four regions, males are paid more than females.

In [22]:
%%sql

SELECT country, region, 
       CASE WHEN gender = 'M' THEN 'Male' ELSE 'Female' END AS gender,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) median_salary
FROM employee e
INNER JOIN region r
ON e.region_id = r.region_id
GROUP BY 2, 1, 3
ORDER BY 2;
 * postgresql://postgres:***@localhost:5432/staff
14 rows affected.
Out[22]:
country region gender median_salary
Asia Central Female 89706.5
Asia Central Male 91990.0
Asia East Asia Female 92213.5
Asia East Asia Male 83332.0
United States Northeast Female 88572.0
United States Northeast Male 96051.5
United States Northwest Female 91323.0
United States Northwest Male 93499.5
Canada Nova Scotia Female 94635.0
Canada Nova Scotia Male 89252.0
Canada Quebec Female 85965.5
Canada Quebec Male 82548.5
United States Southwest Female 76326.0
United States Southwest Male 94976.5

5.4. Who are the top earners in each division and how long have they been at the company?¶

In [28]:
%%sql

SELECT first_name, last_name, sub2.division, department,
       sub2.salary, AGE(hire_date) AS years_served
FROM
        (SELECT division, MAX(salary) AS salary
         FROM
                (SELECT first_name, last_name, e.department,
                        division, hire_date, salary
                 FROM employee e
                 LEFT JOIN department d
                 ON e.department = d.department) sub1
        GROUP BY division
        ORDER BY salary DESC) sub2
LEFT JOIN employee e
ON sub2.salary = e.salary
ORDER BY 5 DESC;
 * postgresql://postgres:***@localhost:5432/staff
9 rows affected.
Out[28]:
first_name last_name division department salary years_served
Jacklyn Finlator Home Clothing 166976 7006 days, 0:00:00
Carissa Gabbitus Entertainment Music 166765 2360 days, 0:00:00
Riley Yellowlea Outdoors Camping & Fishing 166569 5831 days, 0:00:00
Lauren Bertenshaw Health Pharmacy 166016 6172 days, 0:00:00
Michail Huggard Electronics Device Repair 164355 6638 days, 0:00:00
Hendrik Garbutt Fashion Jewelry 163794 5254 days, 0:00:00
Zeke Keyworth Kids Toys 163688 5064 days, 0:00:00
Mill Makey Hardware Automotive 162522 4447 days, 0:00:00
Kincaid Bader Other Security 162233 4505 days, 0:00:00

5.5. For each department, what is the difference in salary and years of service between the highest earner and lowest earner?¶

In the Vitamins department, the employee with the highest salary was hired about 13 years after the lowest earner. Yet, the lowest earner make $142,002 less than the highest earner. Such a disparity in hire date and salary can only be attributed to position at the company. Most likely the highest earner was hired for a higher position than the lowest earner.

In [9]:
%%sql

CREATE VIEW new_employee AS
SELECT first_name, last_name, department, hire_date, salary,
       FIRST_VALUE(salary) OVER (PARTITION BY department
                                 ORDER BY salary DESC) AS highest_salary,
       salary - FIRST_VALUE(salary) OVER (PARTITION BY department
                                          ORDER BY salary DESC) AS salarydiff,
       FIRST_VALUE(hire_date) OVER (PARTITION BY department
                                    ORDER BY salary DESC) AS hiredate_highestsalary,
       AGE(hire_date, FIRST_VALUE(hire_date) OVER (PARTITION BY department
                                                   ORDER BY salary DESC))AS hiredate_diff
FROM employee;

SELECT * FROM new_employee;

SELECT new_employee.department,
       new_employee.salarydiff,
       hiredate_diff
FROM new_employee
INNER JOIN
    (SELECT department, MIN(salarydiff) AS salarydiff
    FROM new_employee
    GROUP BY department) b
ON new_employee.department = b.department
AND new_employee.salarydiff = b.salarydiff
ORDER BY hiredate_diff;
 * postgresql://postgres:***@localhost:5432/staff
Done.
1000 rows affected.
27 rows affected.
Out[9]:
department salarydiff hiredate_diff
Vitamins -142002 -4749 days, 0:00:00
First Aid -134472 -3696 days, 0:00:00
Security -125592 -2417 days, 0:00:00
Jewelry -136918 -2057 days, 0:00:00
Plumbing -94718 -1656 days, 0:00:00
Beauty -140792 -1420 days, 0:00:00
Phones & Tablets -131695 -1388 days, 0:00:00
Automotive -132770 -1313 days, 0:00:00
Grocery -143969 -1311 days, 0:00:00
Computers -141886 -1141 days, 0:00:00
Music -145989 -897 days, 0:00:00
Movies -133623 -360 days, 0:00:00
Cosmetics -140170 -286 days, 0:00:00
Furniture -134568 -234 days, 0:00:00
Tools -133090 -146 days, 0:00:00
Maintenance -114099 180 days, 0:00:00
Games -129632 264 days, 0:00:00
Device Repair -143691 355 days, 0:00:00
Sports -143925 641 days, 0:00:00
Garden -141035 749 days, 0:00:00
Clothing -137981 775 days, 0:00:00
Camping & Fishing -139822 1559 days, 0:00:00
Children Clothing -135387 1876 days, 0:00:00
Toys -136110 2770 days, 0:00:00
Books -118012 2958 days, 0:00:00
Decor -132825 3244 days, 0:00:00
Pharmacy -145474 3730 days, 0:00:00