This project uses various SQL functions to explore a mulitnational organization's database to determine the differences that exist among its employees in terms of salary. Using the company's staff database, this analysis will attempt to answer the following questions.
Which region pays the highest salary?
Is there a gap in salaries based on gender? Is the gender wage gap, to the extent that it exists, consistent with years of service?
Are the gender wage gaps consistent across all locations?
For each department, is there a substantial gap between the highest and lowest earners in terms of salary and length of service?
The database used in this analysis was created using this script. The database contains three tables which provide various information about the company's employees, including name, gender, department and division they work in, the country and region they work, hire date and salary.
%load_ext sql
from sqlalchemy import create_engine
import psycopg2
%sql postgresql://postgres:KW88193m@localhost:5432/staff
engine = create_engine('postgresql://postgres:KW88193m@localhost/staff')
%%sql
SELECT * FROM employee
LIMIT 5;
* postgresql://postgres:***@localhost:5432/staff 5 rows affected.
| employee_id | first_name | last_name | hire_date | department | gender | salary | region_id | |
|---|---|---|---|---|---|---|---|---|
| 1 | Berrie | Manueau | bmanueau0@dion.ne.jp | 2006-04-20 | Sports | F | 154864 | 4 |
| 2 | Aeriell | McNee | amcnee1@google.es | 2009-01-26 | Tools | F | 56752 | 3 |
| 3 | Sydney | Symonds | ssymonds2@hhs.gov | 2010-05-17 | Clothing | F | 95313 | 4 |
| 4 | Avrom | Rowantree | None | 2014-08-02 | Phones & Tablets | M | 119674 | 7 |
| 5 | Feliks | Morffew | fmorffew4@a8.net | 2003-01-14 | Computers | M | 55307 | 5 |
%%sql
SELECT * FROM department
ORDER BY 1;
* postgresql://postgres:***@localhost:5432/staff 24 rows affected.
| department | division |
|---|---|
| Automotive | Hardware |
| Beauty | Fashion |
| Books | Entertainment |
| Camping & Fishing | Outdoors |
| Children Clothing | Kids |
| Clothing | Home |
| Computers | Electronics |
| Cosmetics | Fashion |
| Decor | Home |
| Device Repair | Electronics |
| First Aid | Health |
| Furniture | Home |
| Games | Entertainment |
| Garden | Outdoors |
| Grocery | Home |
| Jewelry | Fashion |
| Movies | Entertainment |
| Music | Entertainment |
| Pharmacy | Health |
| Phones & Tablets | Electronics |
| Sports | Outdoors |
| Tools | Hardware |
| Toys | Kids |
| Vitamins | Health |
%%sql
SELECT * FROM region
* postgresql://postgres:***@localhost:5432/staff 7 rows affected.
| region_id | region | country |
|---|---|---|
| 1 | Southwest | United States |
| 2 | Northeast | United States |
| 3 | Northwest | United States |
| 4 | Central | Asia |
| 5 | East Asia | Asia |
| 6 | Quebec | Canada |
| 7 | Nova Scotia | Canada |
a. Check to see if the employee and department tables correspond and can be joined.
%%sql
SELECT e.department, division FROM employee e
LEFT JOIN department d
ON e.department = d.department
GROUP BY 2, 1
ORDER BY 2 DESC
LIMIT 5;
* postgresql://postgres:***@localhost:5432/staff 5 rows affected.
| department | division |
|---|---|
| Sports | Outdoors |
| Garden | Outdoors |
| Camping & Fishing | Outdoors |
| Security | Other |
| Maintenance | Other |
From the table above, it can be seen that four departments - Camping, Maintenance, Plumbing, and Security - are not assigned to any division. It appears they are included in the employee table but not in the department table.
Investigate and add them to the department table, assigning them to divisions where possible.
Camping:
In the department table, you have Camping & Fishing as the department, assigned to the Outdoors division (see the table in Section 2.2). Hence, in the employee table, I change Camping to Camping & Fishing as shown below.%%sql
UPDATE employee
SET department = 'Camping & Fishing'
WHERE department = 'Camping';
* postgresql://postgres:***@localhost:5432/staff 0 rows affected.
[]
%%sql
SELECT * FROM employee
WHERE department = 'Camping & Fishing'
LIMIT 2;
* postgresql://postgres:***@localhost:5432/staff 2 rows affected.
| employee_id | first_name | last_name | hire_date | department | gender | salary | region_id | |
|---|---|---|---|---|---|---|---|---|
| 82 | Dani | Diter | dditer29@huffingtonpost.com | 2006-01-13 | Camping & Fishing | M | 102516 | 2 |
| 86 | Doll | Blondel | dblondel2d@list-manage.com | 2007-03-30 | Camping & Fishing | F | 136773 | 7 |
Maintenance, Plumbing, and Security: Add Plumbing to the department table under the 'Hardware' division. Also add Security and Maintenance to the department table under a new division called Other.%%sql
INSERT INTO department VALUES ('Plumbing', 'Hardware');
* postgresql://postgres:***@localhost:5432/staff 1 rows affected.
[]
%%sql
INSERT INTO department VALUES ('Security', 'Other')
* postgresql://postgres:***@localhost:5432/staff 1 rows affected.
[]
%%sql
INSERT INTO department VALUES ('Maintenance', 'Other')
* postgresql://postgres:***@localhost:5432/staff 1 rows affected.
[]
b. Check to make sure employee & region tables can be joined.
%%sql
SELECT e.region_id, region, country FROM employee e
LEFT JOIN region r
ON e.region_id = r.region_id
GROUP BY 1, 2, 3
ORDER BY 1;
* postgresql://postgres:***@localhost:5432/staff 7 rows affected.
| region_id | region | country |
|---|---|---|
| 1 | Southwest | United States |
| 2 | Northeast | United States |
| 3 | Northwest | United States |
| 4 | Central | Asia |
| 5 | East Asia | Asia |
| 6 | Quebec | Canada |
| 7 | Nova Scotia | Canada |
Before analyzing the differences in salary among employees, I take a look at the distribution of the count of employees based on gender country, region and division to determine where there are substantial differences.
%%sql
SELECT COUNT(DISTINCT(employee_id)) AS total_employees
FROM employee;
* postgresql://postgres:***@localhost:5432/staff 1 rows affected.
| total_employees |
|---|
| 1000 |
a) By Gender: There is a near 50-50 split between male and female employees
%%sql
SELECT
CASE WHEN gender = 'M' THEN 'Male'
ELSE 'Female' END As gender,
COUNT(employee_id) AS employee_count
FROM employee
GROUP BY 1
ORDER BY employee_count;
* postgresql://postgres:***@localhost:5432/staff 2 rows affected.
| gender | employee_count |
|---|---|
| Male | 499 |
| Female | 501 |
b) By Country: Most of the employees are in the U.S., followed by Canada, and then Asia
%%sql
SELECT country, COUNT(employee_id) AS employee_count
FROM
(SELECT * FROM employee e
LEFT JOIN region r
ON e.region_id = r.region_id) AS sub
GROUP BY 1
ORDER BY 2 DESC;
* postgresql://postgres:***@localhost:5432/staff 3 rows affected.
| country | employee_count |
|---|---|
| United States | 438 |
| Canada | 298 |
| Asia | 264 |
c) By Region: Nova Scotia and Southwest have the joint-most highest (152), while East Asia has the least (127).
%%sql
SELECT region, country, COUNT(employee_id) AS employee_count
FROM
(SELECT * FROM employee e
LEFT JOIN region r
ON e.region_id = r.region_id) AS sub
GROUP BY 2,1
ORDER BY 3 DESC;
* postgresql://postgres:***@localhost:5432/staff 7 rows affected.
| region | country | employee_count |
|---|---|---|
| Southwest | United States | 152 |
| Nova Scotia | Canada | 152 |
| Quebec | Canada | 146 |
| Northwest | United States | 145 |
| Northeast | United States | 141 |
| Central | Asia | 137 |
| East Asia | Asia | 127 |
d) By Division: Home(159) and Entertainment(158) have the highest number of employees; Other has the least (14).
%%sql
SELECT division, COUNT(*) AS employee_count
FROM
(SELECT *
FROM employee e
LEFT JOIN department d
ON e.department = d.department) AS sub
GROUP BY 1
ORDER BY 2 DESC;
* postgresql://postgres:***@localhost:5432/staff 9 rows affected.
| division | employee_count |
|---|---|
| Home | 159 |
| Entertainment | 158 |
| Electronics | 133 |
| Health | 133 |
| Fashion | 120 |
| Outdoors | 111 |
| Kids | 94 |
| Hardware | 78 |
| Other | 14 |
Also, I take a look at who the longest serving employees are across all locations and by division.
%%sql
SELECT first_name, last_name, department,
hire_date, AGE(hire_date) AS years_served
FROM employee
ORDER BY 4
LIMIT 5;
* postgresql://postgres:***@localhost:5432/staff 5 rows affected.
| first_name | last_name | department | hire_date | years_served |
|---|---|---|---|---|
| Cassandra | Hoston | Beauty | 2003-01-01 | 7365 days, 0:00:00 |
| Norbie | Bleasdille | First Aid | 2003-01-01 | 7365 days, 0:00:00 |
| Rora | Brumfitt | Children Clothing | 2003-01-12 | 7355 days, 0:00:00 |
| Feliks | Morffew | Computers | 2003-01-14 | 7353 days, 0:00:00 |
| Cecilius | Cottey | Vitamins | 2003-01-20 | 7347 days, 0:00:00 |
%%sql
WITH div_hiredate AS (
SELECT division, min(hire_date) AS hire_date
FROM employee as e
JOIN department as d
ON e.department = d.department
GROUP BY 1
ORDER BY 2)
SELECT DISTINCT ON (first_name)
first_name, last_name, division,
hire_date, AGE(hire_date) AS years_served
FROM
(SELECT first_name, last_name, e.department,
d.division, d.hire_date, AGE(e.hire_date) AS years_served
FROM div_hiredate d
INNER JOIN employee AS e
ON e.hire_date = d.hire_date
ORDER BY 5) sub
ORDER BY 1, 4 DESC;
* postgresql://postgres:***@localhost:5432/staff 9 rows affected.
| first_name | last_name | division | hire_date | years_served |
|---|---|---|---|---|
| Arvy | Rivilis | Outdoors | 2003-03-16 | 7286 days, 0:00:00 |
| Cassandra | Hoston | Health | 2003-01-01 | 7365 days, 0:00:00 |
| Corabel | Scarse | Other | 2004-03-24 | 6913 days, 0:00:00 |
| Dorothea | Addicott | Home | 2003-02-27 | 7307 days, 0:00:00 |
| Feliks | Morffew | Electronics | 2003-01-14 | 7353 days, 0:00:00 |
| Maryellen | Westnedge | Hardware | 2003-04-19 | 7252 days, 0:00:00 |
| Nonnah | Dict | Entertainment | 2003-03-08 | 7294 days, 0:00:00 |
| Norbie | Bleasdille | Health | 2003-01-01 | 7365 days, 0:00:00 |
| Rora | Brumfitt | Kids | 2003-01-12 | 7355 days, 0:00:00 |
In ranking salaries, I use median salary instead of average salary. Median is more robust than average, and not easily influenced by outliers.
Since median instead of average to assess and compare regions based on salary since it is a more robust measure of central tendency.
%%sql
SELECT country, region,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employee e
JOIN region r
ON e.region_id = r.region_id
GROUP BY 1, 2
ORDER BY 3 DESC;
* postgresql://postgres:***@localhost:5432/staff 7 rows affected.
| country | region | median_salary |
|---|---|---|
| United States | Northeast | 95398.0 |
| United States | Northwest | 92218.0 |
| Canada | Nova Scotia | 90359.0 |
| Asia | Central | 89961.0 |
| United States | Southwest | 88624.5 |
| Asia | East Asia | 86895.0 |
| Canada | Quebec | 85188.5 |
Based on median salary, the Northeast region pays the most, with Quebec paying the least.
Males have higher median salaries than females, even though, the females have been on the job almost a year longer.
%%sql
SELECT CASE WHEN gender = 'M' THEN 'Male' ELSE 'Female' END AS gender,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY AGE(hire_date)) AS median_hiredate
FROM employee
GROUP BY gender
ORDER BY median_salary;
* postgresql://postgres:***@localhost:5432/staff 2 rows affected.
| gender | median_salary | median_hiredate |
|---|---|---|
| Female | 88572.0 | 4964 days, 0:00:00 |
| Male | 89967.0 | 4653 days, 0:00:00 |
Females have higher median salary than Males in East Asia, Nova Scotia, and Quebec. in the other four regions, males are paid more than females.
%%sql
SELECT country, region,
CASE WHEN gender = 'M' THEN 'Male' ELSE 'Female' END AS gender,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) median_salary
FROM employee e
INNER JOIN region r
ON e.region_id = r.region_id
GROUP BY 2, 1, 3
ORDER BY 2;
* postgresql://postgres:***@localhost:5432/staff 14 rows affected.
| country | region | gender | median_salary |
|---|---|---|---|
| Asia | Central | Female | 89706.5 |
| Asia | Central | Male | 91990.0 |
| Asia | East Asia | Female | 92213.5 |
| Asia | East Asia | Male | 83332.0 |
| United States | Northeast | Female | 88572.0 |
| United States | Northeast | Male | 96051.5 |
| United States | Northwest | Female | 91323.0 |
| United States | Northwest | Male | 93499.5 |
| Canada | Nova Scotia | Female | 94635.0 |
| Canada | Nova Scotia | Male | 89252.0 |
| Canada | Quebec | Female | 85965.5 |
| Canada | Quebec | Male | 82548.5 |
| United States | Southwest | Female | 76326.0 |
| United States | Southwest | Male | 94976.5 |
%%sql
SELECT first_name, last_name, sub2.division, department,
sub2.salary, AGE(hire_date) AS years_served
FROM
(SELECT division, MAX(salary) AS salary
FROM
(SELECT first_name, last_name, e.department,
division, hire_date, salary
FROM employee e
LEFT JOIN department d
ON e.department = d.department) sub1
GROUP BY division
ORDER BY salary DESC) sub2
LEFT JOIN employee e
ON sub2.salary = e.salary
ORDER BY 5 DESC;
* postgresql://postgres:***@localhost:5432/staff 9 rows affected.
| first_name | last_name | division | department | salary | years_served |
|---|---|---|---|---|---|
| Jacklyn | Finlator | Home | Clothing | 166976 | 7006 days, 0:00:00 |
| Carissa | Gabbitus | Entertainment | Music | 166765 | 2360 days, 0:00:00 |
| Riley | Yellowlea | Outdoors | Camping & Fishing | 166569 | 5831 days, 0:00:00 |
| Lauren | Bertenshaw | Health | Pharmacy | 166016 | 6172 days, 0:00:00 |
| Michail | Huggard | Electronics | Device Repair | 164355 | 6638 days, 0:00:00 |
| Hendrik | Garbutt | Fashion | Jewelry | 163794 | 5254 days, 0:00:00 |
| Zeke | Keyworth | Kids | Toys | 163688 | 5064 days, 0:00:00 |
| Mill | Makey | Hardware | Automotive | 162522 | 4447 days, 0:00:00 |
| Kincaid | Bader | Other | Security | 162233 | 4505 days, 0:00:00 |
In the Vitamins department, the employee with the highest salary was hired about 13 years
after the lowest earner. Yet, the lowest earner make $142,002 less than the highest earner.
Such a disparity in hire date and salary can only be attributed to position at the company.
Most likely the highest earner was hired for a higher position than the lowest earner.
%%sql
CREATE VIEW new_employee AS
SELECT first_name, last_name, department, hire_date, salary,
FIRST_VALUE(salary) OVER (PARTITION BY department
ORDER BY salary DESC) AS highest_salary,
salary - FIRST_VALUE(salary) OVER (PARTITION BY department
ORDER BY salary DESC) AS salarydiff,
FIRST_VALUE(hire_date) OVER (PARTITION BY department
ORDER BY salary DESC) AS hiredate_highestsalary,
AGE(hire_date, FIRST_VALUE(hire_date) OVER (PARTITION BY department
ORDER BY salary DESC))AS hiredate_diff
FROM employee;
SELECT * FROM new_employee;
SELECT new_employee.department,
new_employee.salarydiff,
hiredate_diff
FROM new_employee
INNER JOIN
(SELECT department, MIN(salarydiff) AS salarydiff
FROM new_employee
GROUP BY department) b
ON new_employee.department = b.department
AND new_employee.salarydiff = b.salarydiff
ORDER BY hiredate_diff;
* postgresql://postgres:***@localhost:5432/staff Done. 1000 rows affected. 27 rows affected.
| department | salarydiff | hiredate_diff |
|---|---|---|
| Vitamins | -142002 | -4749 days, 0:00:00 |
| First Aid | -134472 | -3696 days, 0:00:00 |
| Security | -125592 | -2417 days, 0:00:00 |
| Jewelry | -136918 | -2057 days, 0:00:00 |
| Plumbing | -94718 | -1656 days, 0:00:00 |
| Beauty | -140792 | -1420 days, 0:00:00 |
| Phones & Tablets | -131695 | -1388 days, 0:00:00 |
| Automotive | -132770 | -1313 days, 0:00:00 |
| Grocery | -143969 | -1311 days, 0:00:00 |
| Computers | -141886 | -1141 days, 0:00:00 |
| Music | -145989 | -897 days, 0:00:00 |
| Movies | -133623 | -360 days, 0:00:00 |
| Cosmetics | -140170 | -286 days, 0:00:00 |
| Furniture | -134568 | -234 days, 0:00:00 |
| Tools | -133090 | -146 days, 0:00:00 |
| Maintenance | -114099 | 180 days, 0:00:00 |
| Games | -129632 | 264 days, 0:00:00 |
| Device Repair | -143691 | 355 days, 0:00:00 |
| Sports | -143925 | 641 days, 0:00:00 |
| Garden | -141035 | 749 days, 0:00:00 |
| Clothing | -137981 | 775 days, 0:00:00 |
| Camping & Fishing | -139822 | 1559 days, 0:00:00 |
| Children Clothing | -135387 | 1876 days, 0:00:00 |
| Toys | -136110 | 2770 days, 0:00:00 |
| Books | -118012 | 2958 days, 0:00:00 |
| Decor | -132825 | 3244 days, 0:00:00 |
| Pharmacy | -145474 | 3730 days, 0:00:00 |