Source: www.redbubble.com Photo Credit: Nino Marcutti/Alamy Stock

1. Introduction

In mid-2020, New York City (NYC) became the epicenter of the global COVID-19 pandemic as its residents were forced to shelter in place and economic activity came to a grinding halt. The City Department of Health and Mental Hygiene (DOHMH) tracks and provide data on the number the daily and aggregate number of COVID cases in NYC on its Github repository. However, since DOHMH only provides raw data (in CSV format), it makes it difficult to digest and detect case trends in the city.

This work seeks to extract, transform and analyze and the daily and aggregate cases reported by DOHMH. It uses visuals to depict trends in COVID-19 infections, hospitalizations and deaths across the city. It also examines case trends among boroughs, demographics, and neighborhoods to understand which group is being impacted the most by the pandemic.

The analysis will be updated at the beginning of each week as new data become available to allow for continuous monitoring of COVID-19 trends in NYC. This work is also accompanied by this dashboard hosted on Tableau Public.

2. Data

NYC DOHMH publishes an open source COVID-19 database on its Github repository. The database, which is updated daily, contains numerous tables that provides details about COVID cases, testing and vaccinations. This analyses uses uses three data sets from the repository, namely data-by-day, data-by-group and data-by-modzcta. Below are brief descriptions of each of the data sets.

  • data-by-day: Provides a daily summary of all Covid cases, hospitalizations and deaths that happened in the City as a whole, and by borough.

  • data-by-group: Provides a breakdown of total number of cases, hospitalizations and death by different demograpics, including borough, age, gender, and race.

  • data-by-modzcta: Gives a breakdown of aggregate cases by neighborhood and modified zip code. This data can be used to map COVID cases and deaths by neighborhood when combined with the MODZCTA shape files (can be downloaded from DOHMH’s Github or NYC Open Data Portal).

In addition to the three highlighted above, the analysis also extracts and uses shapefile data from the City’s Open Data Portal to map COVID cases in neighborhoods.

Now, let us extract and load the aforementioned data sets (from DOHMH GitHub page and NYC Open Data Portal) and get them ready for the analysis.

## Daily Data
download.file(
  "https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/data-by-day.csv", 
  destfile = "./data/daily.csv")
daily <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/daily.csv")

## Group Data
download.file(
  "https://raw.githubusercontent.com/nychealth/coronavirus-data/master/totals/by-group.csv",
  destfile = "./data/group.csv")
group <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/group.csv")

## Modzcta (Zip Code) Data
download.file(
  "https://raw.githubusercontent.com/nychealth/coronavirus-data/master/totals/data-by-modzcta.csv",
  destfile = "./data/modzcta.csv")
modzcta <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/modzcta.csv")


## Shape Files: Create folder to store files, download zip, unzip and load
library(plyr)

dir <- "/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files"

dir.create("Shape Files")
do.call(file.remove, list(list.files(path = dir, full.names = TRUE)))
## [1] TRUE TRUE TRUE TRUE TRUE
download.file(
  "https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=Shapefile", 
  destfile = "/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files/modzcta_zip.zip")

zip <- list.files(path = dir, pattern = "*.zip", full.names = T)

ldply(.data = zip, .fun = unzip, exdir = dir)
modzcta_shp <- list.files(path = dir, pattern = "*.shp", full.names = T) %>%
               st_read()
## Reading layer `geo_export_21e58705-0848-4164-b459-ba7b3c19f63e' from data source `/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files/geo_export_21e58705-0848-4164-b459-ba7b3c19f63e.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 178 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS:  WGS84(DD)
detach(package:plyr, unload = TRUE)

The tables below show the first few rows of each data set.

Data-by-Day

head(daily)

Data-by-Group

head(group)

Data-by-Modzcta

head(modzcta)

3. Data Examination

In this section, I examine the data sets to identify what needs to cleaned.

3.1. Data Structure and Summary

The tables below depict the structure and summary of the three COVID data sets.

Daily Data

daily %>% 
  skim() %>% 
  kbl() %>% 
  kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>% 
  scroll_box(width = "100%", height = "300px")
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character date_of_interest 0 1 10 10 0 1082 0 NA NA NA NA NA NA NA NA
numeric CASE_COUNT 0 1 NA NA NA NA NA 2498.6672828 5031.1521389 0 622.00 1515.0 2829.00 55004 ▇▁▁▁▁
numeric PROBABLE_CASE_COUNT 0 1 NA NA NA NA NA 488.2412200 630.0617799 0 93.00 371.5 659.75 5882 ▇▁▁▁▁
numeric HOSPITALIZED_COUNT 0 1 NA NA NA NA NA 173.6062847 245.7844853 0 48.25 103.5 187.50 1840 ▇▁▁▁▁
numeric DEATH_COUNT 0 1 NA NA NA NA NA 35.5859519 76.4503582 0 7.00 13.0 30.75 598 ▇▁▁▁▁
numeric PROBABLE_DEATH_COUNT 0 1 NA NA NA NA NA 5.8890943 24.5521343 0 0.00 1.0 2.00 240 ▇▁▁▁▁
numeric CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 2496.5942699 4619.5491798 0 646.50 1569.0 2849.75 39497 ▇▁▁▁▁
numeric ALL_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 2984.1598891 5167.6817532 0 808.50 1973.5 3591.00 43954 ▇▁▁▁▁
numeric HOSP_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 173.4824399 241.4562361 0 48.25 105.0 186.00 1662 ▇▁▁▁▁
numeric DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 35.5582255 75.5599806 0 8.00 12.0 31.00 566 ▇▁▁▁▁
numeric ALL_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 41.4426987 98.7883718 0 8.00 13.0 33.00 775 ▇▁▁▁▁
numeric BX_CASE_COUNT 0 1 NA NA NA NA NA 412.7874307 939.3131444 0 82.00 210.5 433.75 10560 ▇▁▁▁▁
numeric BX_PROBABLE_CASE_COUNT 0 1 NA NA NA NA NA 96.3641405 144.5544961 0 14.00 65.0 129.00 1575 ▇▁▁▁▁
numeric BX_HOSPITALIZED_COUNT 0 1 NA NA NA NA NA 37.3974122 56.8463622 0 9.00 20.0 39.00 390 ▇▁▁▁▁
numeric BX_DEATH_COUNT 0 1 NA NA NA NA NA 6.7125693 16.0237915 0 1.00 2.0 5.00 132 ▇▁▁▁▁
numeric BX_PROBABLE_DEATH_COUNT 0 1 NA NA NA NA NA 1.1441774 5.0716885 0 0.00 0.0 0.00 46 ▇▁▁▁▁
numeric BX_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 412.4251386 843.7401703 0 83.50 236.0 456.00 7480 ▇▁▁▁▁
numeric BX_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 96.2347505 132.8572658 0 15.00 73.0 134.75 1094 ▇▁▁▁▁
numeric BX_ALL_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 508.6719039 968.2219943 0 109.00 307.0 595.00 8574 ▇▁▁▁▁
numeric BX_HOSPITALIZED_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 37.3632163 55.4264291 0 9.00 21.0 38.00 358 ▇▁▁▁▁
numeric BX_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 6.7236599 15.6961309 0 1.00 2.0 5.00 117 ▇▁▁▁▁
numeric BX_ALL_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 7.8650647 20.4752123 0 1.00 2.0 5.00 158 ▇▁▁▁▁
numeric BK_CASE_COUNT 0 1 NA NA NA NA NA 753.3373383 1484.5024539 0 215.25 462.0 846.00 16665 ▇▁▁▁▁
numeric BK_PROBABLE_CASE_COUNT 0 1 NA NA NA NA NA 133.5905730 176.0871495 0 30.00 100.0 173.75 1906 ▇▁▁▁▁
numeric BK_HOSPITALIZED_COUNT 0 1 NA NA NA NA NA 52.5609982 72.2766071 0 17.00 31.0 54.00 555 ▇▁▁▁▁
numeric BK_DEATH_COUNT 0 1 NA NA NA NA NA 11.0970425 23.5985310 0 2.00 4.0 10.00 201 ▇▁▁▁▁
numeric BK_PROBABLE_DEATH_COUNT 0 1 NA NA NA NA NA 1.9833641 8.4983375 0 0.00 0.0 1.00 92 ▇▁▁▁▁
numeric BK_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 752.7107209 1369.5307037 0 231.75 475.0 850.25 11587 ▇▁▁▁▁
numeric BK_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 133.4242144 163.4622405 0 29.00 106.0 175.00 1213 ▇▁▁▁▁
numeric BK_ALL_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 886.1497227 1521.7549061 0 264.00 586.0 1034.50 12787 ▇▁▁▁▁
numeric BK_HOSPITALIZED_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 52.5341959 70.6597378 0 17.00 32.0 53.00 490 ▇▁▁▁▁
numeric BK_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 11.0998152 23.1532564 0 2.00 4.0 9.00 178 ▇▁▁▁▁
numeric BK_ALL_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 13.0785582 31.0390271 0 3.00 4.0 11.00 252 ▇▁▁▁▁
numeric MN_CASE_COUNT 0 1 NA NA NA NA NA 458.1256932 912.2602953 0 107.00 284.0 492.00 9113 ▇▁▁▁▁
numeric MN_PROBABLE_CASE_COUNT 0 1 NA NA NA NA NA 90.1210721 113.9212735 0 19.00 69.0 124.00 972 ▇▁▁▁▁
numeric MN_HOSPITALIZED_COUNT 0 1 NA NA NA NA NA 26.3049908 35.9481505 0 7.00 16.0 30.00 273 ▇▁▁▁▁
numeric MN_DEATH_COUNT 0 1 NA NA NA NA NA 4.8724584 9.9687949 0 1.00 2.0 5.00 92 ▇▁▁▁▁
numeric MN_PROBABLE_DEATH_COUNT 0 1 NA NA NA NA NA 0.8040665 3.2174452 0 0.00 0.0 0.00 33 ▇▁▁▁▁
numeric MN_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 457.7088725 831.6672105 0 120.00 303.0 481.50 6394 ▇▁▁▁▁
numeric MN_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 89.9852126 107.4760467 0 19.00 74.0 128.00 766 ▇▁▁▁▁
numeric MN_ALL_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 547.6931608 932.6721680 0 150.00 369.0 600.25 7160 ▇▁▁▁▁
numeric MN_HOSPITALIZED_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 26.2735675 34.9485021 0 7.00 17.0 30.00 228 ▇▁▁▁▁
numeric MN_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 4.8641405 9.6639669 0 1.00 2.0 4.00 73 ▇▁▁▁▁
numeric MN_ALL_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 5.6580407 12.6055020 0 1.00 2.0 4.00 100 ▇▁▁▁▁
numeric QN_CASE_COUNT 0 1 NA NA NA NA NA 697.4195933 1417.2913827 0 148.00 401.5 792.25 15224 ▇▁▁▁▁
numeric QN_PROBABLE_CASE_COUNT 0 1 NA NA NA NA NA 135.1423290 170.0577474 0 23.00 100.0 192.75 1609 ▇▁▁▁▁
numeric QN_HOSPITALIZED_COUNT 0 1 NA NA NA NA NA 48.7643253 76.0411356 0 13.00 27.0 51.75 609 ▇▁▁▁▁
numeric QN_DEATH_COUNT 0 1 NA NA NA NA NA 10.6645102 24.1619041 0 2.00 4.0 9.00 202 ▇▁▁▁▁
numeric QN_PROBABLE_DEATH_COUNT 0 1 NA NA NA NA NA 1.6996303 7.2754310 0 0.00 0.0 1.00 68 ▇▁▁▁▁
numeric QN_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 696.8604436 1310.3066818 0 152.00 421.5 822.00 11551 ▇▁▁▁▁
numeric QN_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 134.9602588 159.5306984 0 22.00 103.5 192.00 1219 ▇▁▁▁▁
numeric QN_ALL_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 831.8262477 1455.8981674 0 193.25 537.0 1043.25 12689 ▇▁▁▁▁
numeric QN_HOSPITALIZED_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 48.7375231 74.6385769 0 13.00 28.0 51.00 562 ▇▁▁▁▁
numeric QN_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 10.6580407 23.7696990 0 2.00 4.0 9.00 177 ▇▁▁▁▁
numeric QN_ALL_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 12.3576710 30.6005052 0 2.00 4.0 9.75 240 ▇▁▁▁▁
numeric SI_CASE_COUNT 0 1 NA NA NA NA NA 176.1543438 329.4849669 0 44.00 110.0 196.00 3720 ▇▁▁▁▁
numeric SI_PROBABLE_CASE_COUNT 0 1 NA NA NA NA NA 32.9602588 36.1776030 0 6.00 26.0 49.00 316 ▇▁▁▁▁
numeric SI_HOSPITALIZED_COUNT 0 1 NA NA NA NA NA 10.6561922 11.6878181 0 3.00 7.0 14.00 83 ▇▂▁▁▁
numeric SI_DEATH_COUNT 0 1 NA NA NA NA NA 2.2375231 3.8281731 0 0.00 1.0 3.00 34 ▇▁▁▁▁
numeric SI_PROBABLE_DEATH_COUNT 0 1 NA NA NA NA NA 0.2578558 0.9737813 0 0.00 0.0 0.00 9 ▇▁▁▁▁
numeric SI_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 176.0083179 302.4864687 0 44.00 115.0 195.00 2687 ▇▁▁▁▁
numeric SI_PROBABLE_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 32.9038817 33.6238375 0 6.00 27.0 48.75 233 ▇▃▁▁▁
numeric SI_ALL_CASE_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 208.9057301 330.9701999 0 52.25 148.0 247.75 2907 ▇▁▁▁▁
numeric SI_HOSPITALIZED_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 10.6404806 11.1542522 0 4.00 8.0 13.00 72 ▇▂▁▁▁
numeric SI_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 2.2208872 3.5488149 0 1.00 1.0 2.00 26 ▇▁▁▁▁
numeric SI_ALL_DEATH_COUNT_7DAY_AVG 0 1 NA NA NA NA NA 2.4889094 4.3554348 0 1.00 1.0 2.00 34 ▇▁▁▁▁
numeric INCOMPLETE 0 1 NA NA NA NA NA 394.5101664 4891.1861858 0 0.00 0.0 0.00 60980 ▇▁▁▁▁

Group Data

group %>% 
  skim() %>% 
  kbl() %>% 
  kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>% 
  scroll_box(width = "100%", height = "300px")
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character group 0 1.0000000 3 9 0 6 0 NA NA NA NA NA NA NA NA
character subgroup 0 1.0000000 0 22 1 27 0 NA NA NA NA NA NA NA NA
numeric CONFIRMED_CASE_RATE 1 0.9629630 NA NA NA NA NA 30794.5458 4695.5503 19244.90 28100.112 30944.675 33354.9950 40029.78 ▁▃▇▅▃
numeric CASE_RATE 1 0.9629630 NA NA NA NA NA 36815.0992 5716.9932 22835.38 33413.850 37064.940 39714.1850 47519.76 ▁▃▇▅▃
numeric HOSPITALIZED_RATE 1 0.9629630 NA NA NA NA NA 2470.8927 2029.8561 246.80 1462.935 2388.545 2618.1925 10815.51 ▇▇▁▁▁
numeric DEATH_RATE 3 0.8888889 NA NA NA NA NA 665.3563 834.5807 2.70 362.535 547.320 611.0375 4287.67 ▇▁▁▁▁
numeric CONFIRMED_CASE_COUNT 1 0.9629630 NA NA NA NA NA 598269.3462 550194.4185 100789.00 274767.000 439960.000 710746.2500 2703558.00 ▇▃▁▁▁
numeric PROBABLE_CASE_COUNT 1 0.9629630 NA NA NA NA NA 116982.3846 107761.1985 18804.00 57311.500 89939.500 143118.0000 528277.00 ▇▂▁▁▁
numeric CASE_COUNT 1 0.9629630 NA NA NA NA NA 715251.7308 657731.5113 119593.00 332605.500 532428.000 857971.2500 3231835.00 ▇▃▁▁▁
numeric HOSPITALIZED_COUNT 1 0.9629630 NA NA NA NA NA 46710.0385 43283.2998 1705.00 17588.500 38038.500 59746.2500 206476.00 ▇▅▁▁▁
numeric DEATH_COUNT 3 0.8888889 NA NA NA NA NA 12327.9583 10791.6671 46.00 5301.500 9434.500 19860.2500 45023.00 ▇▂▅▁▁

Modzcta Data

modzcta %>% 
  skim() %>% 
  kbl() %>% 
  kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>% 
  scroll_box(width = "100%", height = "300px")
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character NEIGHBORHOOD_NAME 0 1 7 59 0 162 0 NA NA NA NA NA NA NA NA
character BOROUGH_GROUP 0 1 5 13 0 5 0 NA NA NA NA NA NA NA NA
character label 0 1 5 12 0 177 0 NA NA NA NA NA NA NA NA
numeric MODIFIED_ZCTA 0 1 NA NA NA NA NA 10810.37853 5.781733e+02 10001.00000 10301.00000 11109.00000 11361.00000 11697.00000 ▇▃▁▇▇
numeric lat 0 1 NA NA NA NA NA 40.72555 8.364830e-02 40.50777 40.67082 40.72644 40.77643 40.89951 ▁▅▇▇▃
numeric lon 0 1 NA NA NA NA NA -73.91881 9.965940e-02 -74.24227 -73.97870 -73.92405 -73.84698 -73.71091 ▁▁▇▆▃
numeric COVID_CONFIRMED_CASE_COUNT 0 1 NA NA NA NA NA 14647.78531 8.263303e+03 1042.00000 8478.00000 13583.00000 20598.00000 34197.00000 ▆▇▆▅▂
numeric COVID_PROBABLE_CASE_COUNT 0 1 NA NA NA NA NA 2896.77966 1.558973e+03 207.00000 1809.00000 2651.00000 4040.00000 7249.00000 ▅▇▅▃▁
numeric COVID_CASE_COUNT 0 1 NA NA NA NA NA 17544.56497 9.698150e+03 1298.00000 10334.00000 16093.00000 24555.00000 40446.00000 ▆▇▆▅▂
numeric COVID_CONFIRMED_CASE_RATE 0 1 NA NA NA NA NA 31207.45887 4.677745e+03 19173.54000 28168.04000 30583.42000 33709.48000 48736.97000 ▁▇▆▁▁
numeric COVID_CASE_RATE 0 1 NA NA NA NA NA 37588.85480 5.226694e+03 24408.00000 34105.08000 36753.27000 40176.56000 59073.89000 ▁▇▅▁▁
numeric POP_DENOMINATOR 0 1 NA NA NA NA NA 47100.66136 2.615157e+04 2972.12000 27180.77000 42737.28000 66856.31000 110369.78000 ▅▇▅▃▂
numeric COVID_CONFIRMED_DEATH_COUNT 0 1 NA NA NA NA NA 213.62712 1.536210e+02 0.00000 92.00000 172.00000 316.00000 792.00000 ▇▅▃▁▁
numeric COVID_PROBABLE_DEATH_COUNT 0 1 NA NA NA NA NA 35.41243 2.665445e+01 0.00000 15.00000 28.00000 48.00000 109.00000 ▇▇▃▂▂
numeric COVID_DEATH_COUNT 0 1 NA NA NA NA NA 249.03955 1.783268e+02 1.00000 111.00000 202.00000 370.00000 895.00000 ▇▅▃▂▁
numeric COVID_CONFIRMED_DEATH_RATE 0 1 NA NA NA NA NA 430.18492 1.901605e+02 0.00000 331.16000 425.81000 530.95000 1320.81000 ▃▇▃▁▁
numeric COVID_DEATH_RATE 0 1 NA NA NA NA NA 501.75514 2.218349e+02 11.42000 384.33000 490.91000 624.82000 1543.51000 ▃▇▃▁▁
numeric PERCENT_POSITIVE 0 1 NA NA NA NA NA 24.90169 4.653173e+00 7.90000 22.67000 25.47000 27.23000 36.37000 ▁▁▆▇▁
numeric TOTAL_COVID_TESTS 0 1 NA NA NA NA NA 54017.92090 2.947936e+04 4261.00000 31000.00000 48913.00000 75764.00000 129681.00000 ▆▇▆▃▂

From the tables above, the only things that needs to be addressed is missing values in the group data.

3.2. Missing Data

From the data summary tables, only the group data has missing values. Let us check again to make sure.

Daily Data

daily %>% missmap(main = "Missing vs. Observed Values")

Group Data

group %>% missmap(main = "Missing vs. Observed Values")

Modzcta Data

modzcta %>% missmap(main = "Missing vs. Observed Values")

As depicted by the charts, about four percent of the observations in the group data are missing. After further review, it is clear that all the missing observations are from the Age group category under group column. In Section 4.1, I re-code the age groups under the subgroup.

4. Data Wrangling

To get the ready for the analysis, I proceed to clean and manipulate them by executing the following actions.

  • group data: I combined some of the age categories into one to remove the missing values. I also added corrected borough name from StateIsland to Staten Island.

  • daily data: I changed the data type of date-of-interest variable from character to date.

4.1. Consolidate ‘0-17’ age group

Under the Age group category, the 0-17 group has three sub-groupings (0-4, 5-12, 12-17). However, the DEATH_RATE & DEATH_COUNT statistics are only provided for the 0-17 age group. Besides DEATH_RATE and DEATH_COUNT, the other COVID statistics are only provided for the age sub-categories and not the main 0-17 category. This creates missing values in the rows containing the age categories as shown below.

group %>% slice(1:6)

To handle the missing data, I use the rollsumr function in R to aggregate statistics for the three sub-categories (0-4, 5-12, 12-17) under the main category (0-17). The sub-categories are subsequently deleted from the table.

group[4, c(3:5, 7:10)] <- rollsumr(group[1:3, c(3:5, 7:10)], k=3) 

group <- group %>% slice(-c(1:3))

Now, check to see if the re-coding took care of the missing values in the group data.

group %>% missmap(main = "Observed vs. Missing Values")

4.2. Clean the Staten Island subgroup

In the group table, ‘Staten Island’ is written as StatenIsland as shown in the table below.

group %>% 
  filter(group == "Borough") %>% 
  group_by(subgroup) %>%
  select(subgroup) %>% 
  count() %>% 
  kbl(col.names = c("Borough", "Count")) %>% 
  kable_classic_2(c("striped", "hovered"),
                  html_font = "calibri")
Borough Count
Bronx 1
Brooklyn 1
Manhattan 1
Queens 1
StatenIsland 1

I clean it by adding a white space to correct the name of the Borough, as show below.

group <- group %>% 
  mutate(subgroup = case_when(
                            subgroup == "StatenIsland" ~ "Staten Island",
                            TRUE ~ subgroup)) 


group %>% 
  filter(group == "Borough") %>% 
  group_by(subgroup) %>%
  select(subgroup) %>% 
  count() %>% 
  kbl(col.names = c("Borough", "Count")) %>% 
  kable_classic_2(c("striped", "hovered"),
                  html_font = "calibri")
Borough Count
Bronx 1
Brooklyn 1
Manhattan 1
Queens 1
Staten Island 1

4.3. Change data types for date_of_interest

In the daily data, the date_of_interest column is stored as a string variable. I change it to a date variable.

daily <- daily %>% 
        mutate(date_of_interest = as.Date(date_of_interest, 
                                          format = "%m/%d/%Y"))


class(daily$date_of_interest) 
## [1] "Date"

5. Analyzing Citywide Impact

This section analyzes the daily and total number of COVID cases in the City as a whole as of 2023-02-14.

5.1. Citywide: Total Cases

As of 2023-02-14, NYC had recorded approximately a total of 3.23 million COVID cases since the first case was recorded in early 2020. More than 206,400 of those cases have led to hospitalizations, with over 45,000 people losing their lives.

group %>% 
  filter(group == "Citywide") %>% 
  summarise(`Total Infections` = sum(CASE_COUNT),
            `Total Hospitalizations` = sum(HOSPITALIZED_COUNT),
            `Total Deaths` = sum(DEATH_COUNT)) %>% 
  kbl(align = "lcr",
      format.args = list(big.mark = ",")) %>% 
  kable_classic_2()
Total Infections Total Hospitalizations Total Deaths
3,231,835 206,476 45,023

The charts below show the trends in daily Citywide cases since the beginning of the pandemic.

Infections

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~CASE_COUNT, 
          type = 'bar', name = 'Infections Count', marker = list(color = "#5c89c9")) %>% 
  add_trace(X = ~date_of_interest, y = ~CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = '7-Day Avg. Infections',
            marker = list(color = "#1d3557"), line = list(color = "#1e3557")) %>%
  layout(xaxis = list(title = "Date"),
         yaxis = list(title = "Number of People"),
         title = "Daily Citywide Infections")

Hospitalizations

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~HOSPITALIZED_COUNT,
          type = 'bar', name = 'Hosp. Count', marker = list(color = "#e9d8a6")) %>% 
  add_trace(x = ~date_of_interest, y = ~DEATH_COUNT_7DAY_AVG,
            type = 'scatter', mode = 'line', name = '7-Day Avg. Hosp.',
            marker = list(color = "735d1d"), line = list(color = "#735d1d")) %>% 
  layout(xaxis = list(title = NULL),
         yaxis = list(title = "No. of Hospitalizations"),
         title = "Daily Citywide Hospitalizations")

Deaths

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~DEATH_COUNT,
          type = 'bar', name = 'Death Count', marker = list(color = "#f5aaa2")) %>% 
  add_trace(x = ~date_of_interest, y = ~DEATH_COUNT_7DAY_AVG,
            type = 'scatter', mode = 'line', name = '7-Day Avg. Deaths', 
            marker = list(color = "#ae2012"), line = list(color = "#ae2012")) %>% 
  layout(xaxis = list(title = ""), 
         yaxis = list(title = "No. of Deaths"),
         title = "Daily Citywide Deaths")

The charts above show that NYC reached the peak of infection in the beginning of 2022, during the Omicron wave. While there have been three waves in hospitalizations and deaths, most of the hospitalizations and deaths occurred during the initial wave of infections (between March and April of 2020). The availability of vaccines during the Omicron wave appear to have helped reduce the number of hospitalizations and deaths around that time.

5.2. Citywide: New Cases

The table below shows the number of new infections, hospitalizations and deaths recorded on 2023-02-14 - the latest date we have record for.

daily %>% filter(date_of_interest == max(daily$date_of_interest)) %>% 
  group_by(date_of_interest) %>% 
  rename(Date = date_of_interest) %>% 
  summarise(Infections = sum(CASE_COUNT),
            Hospitalizations = sum(HOSPITALIZED_COUNT),
            Deaths = sum(DEATH_COUNT)) %>% 
  kbl(align = "lccr",
      format.args = list(big.mark = ",")) %>% 
  kable_classic_2()
Date Infections Hospitalizations Deaths
2023-02-14 752 0 5

6. Analyzing COVID Impact by Borough

This section disaggregates the daily and total number of COVID cases among the five NYC boroughs.

6.1. Total Cases by Borough

The chart below shows the total number of COVID cases by borough. Because we are looking at raw numbers (and not adjusted for population), densely populated boroughs will show more infections, hospitalizations and deaths. In this case, Brooklyn, the most densely populated borough in NYC, has had more infections, hospitalizations and deaths compared to the other four boroughs.

pal1 <- c("#1e3557", "#e9d8a6", "#ae2012")

group %>% 
  filter(group == "Borough") %>% 
  group_by(subgroup) %>% 
  rename(Borough = subgroup) %>% 
  summarise(`1. Infections` = sum(CASE_COUNT),
            `2. Hospitalizations` = sum(HOSPITALIZED_COUNT),
            `3. Deaths` = sum(DEATH_COUNT)) %>% 
  pivot_longer(cols = 2:4, names_to = "Indicator", values_to = "Case Count") %>% 
  plot_ly(x = ~ Borough, y = ~`Case Count`, type = 'bar', color = ~Indicator, colors = pal1) %>% 
  layout(title = "Total Number of Cases by Borough", barmode = 'stack',
         xaxis=list(title = ""), yaxis=list(title = "Number of People"))

6.2. Daily Average Cases by Borough

The charts below show the trends in the daily average infections, hospitalizations and deaths per borough.

Average Infections by Borough

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~BX_CASE_COUNT_7DAY_AVG, 
          type = 'scatter', mode = 'line', name = 'Bronx') %>% 
  add_trace(x = ~date_of_interest, y= ~BK_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Brookyln') %>% 
  add_trace(x = ~date_of_interest, y= ~MN_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Manhattan') %>% 
  add_trace(x = ~date_of_interest, y= ~QN_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Queens') %>% 
  add_trace(x = ~date_of_interest, y= ~SI_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Staten Island') %>% 
  layout(xaxis = list(title = ""),
         yaxis = list(title = ""),
         title = "7-Day Avg. Infections by Borough")

Average Hospitalizations by Borough

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~ BX_HOSPITALIZED_COUNT_7DAY_AVG, 
          type = 'scatter', mode = 'line', name = 'Bronx') %>% 
  add_trace(x = ~date_of_interest, y= ~BK_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Brookyln') %>% 
  add_trace(x = ~date_of_interest, y= ~MN_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Manhattan') %>% 
  add_trace(x = ~date_of_interest, y= ~QN_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Queens') %>% 
  add_trace(x = ~date_of_interest, y= ~SI_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Staten Island') %>% 
  layout(xaxis = list(title = "Date"),
         yaxis = list(title = "Number of Cases"),
         title = "7-Day Avg. Hospitalizations by Borough")

Average Deaths by Borough

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~BX_DEATH_COUNT_7DAY_AVG, 
          type = 'scatter', mode = 'line', name = 'Bronx') %>% 
  add_trace(x = ~date_of_interest, y= ~BK_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Brookyln') %>% 
  add_trace(x = ~date_of_interest, y= ~MN_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Manhattan') %>% 
  add_trace(x = ~date_of_interest, y= ~QN_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Queens') %>% 
  add_trace(x = ~date_of_interest, y= ~SI_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Staten Island') %>% 
  layout(xaxis = list(title = ""),
         yaxis = list(title = ""),
         title = "7-Day Avg. Deaths by Borough")

The charts above show that daily infections, hospitalizations and deaths have consistently been highest in Brooklyn and Queen.

6.3. Share of Infections that Turned into Hospitalizations and Deaths?

Let us look at the share of infections in each borough that turn in hospitalizations and deaths. This will help us understand which boroughs are having the most severe symptoms from COVID.

group %>% filter(group == "Borough") %>% 
  group_by(subgroup) %>% 
  summarise(
    `Hospitalized as % of Cases` = (sum(HOSPITALIZED_COUNT)/sum(CASE_COUNT))*100,
    `Deaths as % of Cases` = (sum(DEATH_COUNT)/sum(CASE_COUNT))*100
                                     ) %>% 
plot_ly(x=~subgroup, y=~`Hospitalized as % of Cases`, type = 'bar', name = '% Hospitalized',
        marker = list(color = "#e9d8a6")) %>% 
  add_trace(x=~subgroup, y=~`Deaths as % of Cases`, name = '% Died', marker = list(color = "#ae2012")) %>% 
  layout(title = "Share of Infections that Lead to Hopitalizations & Deaths",
         barmode = 'group',
         xaxis = list(title=''), 
         yaxis = list(title='Share of Infections',
                      ticksuffix = "%"))

The chart above indicates that, even though Brooklyn has had the highest number of COVID cases (see section 6.1), the Bronx has seen the largest share of its cases lead to hospitalizations and deaths. This may be because, although less populated than Brooklyn, the Bronx has a lot more people (per capita) living with underlying medical conditions that exacerbate the effects of COVID. For example, the Bronx is known to have one of the highest asthma hospitalization rate in the New York State.1

6.4. Which Boroughs Have Been Hit the Hardest?

Section 6.1 shows Brooklyn has the highest number of cases, hospitalizations and deaths among all boroughs. This makes sense since Brooklyn is the most populous of the five boroughs. However, to be able to compare boroughs to determine which one has been severely affected, we have to adjust for population. Hence, let us use the rates (per 100,000) statistics to plot total infections, hospitalizations and deaths.

group %>% filter(group == "Borough") %>% 
  group_by(subgroup) %>% 
  rename(Borough = subgroup) %>% 
  summarise(`1. Infection Rate` = sum(CASE_RATE),
            `2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
            `3. Death Rate` = sum(DEATH_RATE)) %>%
  pivot_longer(cols = 2:4, names_to = "Indicator", values_to = "Rate (per 100K)") %>% 
  plot_ly(x = ~ Borough, y = ~`Rate (per 100K)`, type = 'bar', 
          color = ~Indicator, colors = pal1) %>% 
  layout(title = "Case Rate by Borough", barmode = 'stack',
         xaxis=list(title = ""), yaxis=list(title = "Rate (Per 100K)"))

The chart above indicates that after adjusting for population, Staten Island - the least populated borough - has the highest rate of infections. The Bronx, on the hand, has the highest rate of hospitalizations and deaths.

7. Analysing by Age Group

This section details how COVID-19 has impacted NYC residents of different age groups. The data set breaks down age into eight categories - 0-17, 18-24, 24-34, 35-44, 45-54, 55-64, 65-74, and 75+.

7.1. Case, Hospitalization and Death Rates

The first tab shows the infection, hospitalization and death rates (per 100,000) for the various age groups. The second tab shows hospitalization and death rates as a share of case rates.

Case, Hospitalization and Death Rates

pal2 = c("#007a8b", "#224b5e", "#94b594", "#edc775",  "#e09351", "#df7e66", "#b75347", "#6d2f20")

group %>% 
  filter(group == "Age group") %>% 
  group_by(subgroup) %>% 
  rename(`Age Group` = subgroup) %>% 
  summarise(`1. Infection Rate` = sum(CASE_RATE),
            `2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
            `3. Death Rate` = sum(DEATH_RATE)) %>%
  mutate_if(is.numeric, ~(./sum(.))) %>% 
  pivot_longer(cols = 2:4,
             names_to = "Statistics",
             values_to = "Rate (per 100K)") %>% 
  ggplot(aes(x = Statistics,
             y = `Rate (per 100K)`,
             fill = `Age Group`), border = "white") +
  geom_bar(position = 'stack', stat = 'identity') +
  scale_fill_manual(values = pal2) +
  scale_y_continuous(label = percent) +
  theme_minimal() +
  labs(x = NULL,
       y = "Share of Rate (per 100K)",
       title = "Case Rate (per 100K) byAge Group")

Share of Cases that Lead to Hospitalization or Death

group %>% 
  filter(group == "Age group") %>% 
  group_by(subgroup) %>% 
  rename(`Age Group` = subgroup) %>% 
  summarise(
    `Hospitalized as % of Cases` = (sum(HOSPITALIZED_COUNT)/sum(CASE_COUNT))*100,
    `Deaths as % of Cases` = (sum(DEATH_COUNT)/sum(CASE_COUNT))*100
            ) %>% 
  plot_ly(x = ~`Age Group`, y = ~`Hospitalized as % of Cases`, type = 'bar', 
          name = '% Hospitalized', marker = list(color = "#e9d8a6")) %>% 
  add_trace(x = ~`Age Group`, y = ~`Deaths as % of Cases`, name = '% Died',
            marker = list(color = "#ae2012")) %>% 
  layout(title = "Share of Infections that Lead to Hopitalizations & Deaths",
         barmode = 'group',
         xaxis = list(title='Age Group'), 
         yaxis = list(title='Share of Infection Rate',
                      ticksuffix = "%"))

The charts in this section indicate that, while young people (under 45 years) are infected at higher rates than any other age group, only a small share are hospitalized and they barely any die from the virus. On the other hand, seniors, especially those 75 year and over, tend to be hospitalized and die at the highest rate even though they have the lowest infection rates. This is consistent with reports that COVID is much more deadly among seniors.

8. Analysis by Race/Ethnicity

This section details how COVID has affected people of different racial and ethnicity background. The data sets breaks race/ethnicity into four categories - Asian/Pacific-Islander, Black/African-American, Hispani/Latino and White.

8.1. Case, Hospitalization and Death Rates

The first tab shows the infections, hospitalizations and deaths rates (per 100,000) for each race/ethnicity.

The second tab shows hospitalization and death rates as a share of case rates.

Case, Hospitalization and Death Rates

library(MetBrewer)

group %>% 
  filter(group == "Race") %>% 
  group_by(subgroup) %>% 
  rename(`Race/Ethnicity` = subgroup) %>% 
  summarise(`1. Infection Rate` = sum(CASE_RATE),
            `2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
            `3. Death Rate` = sum(DEATH_RATE)) %>% 
pivot_longer(cols = 2:4,
             names_to = "Indicator",
             values_to = "Rate (per 100K)") %>% 
  ggplot(aes(x = Indicator,
             y = `Race/Ethnicity`,
             fill = `Rate (per 100K)`)) +
  geom_tile() +
  scale_fill_gradientn(colors = met.brewer("Hokusai1", type = "continuous")) +
  theme_minimal() +
  labs(x = NULL, y= NULL, title = "Case Rate (per 100K) by Race/Ethnicity")

Share of Cases that Cause Hospitalization or Death

group %>% 
  filter(group == "Race") %>% 
  group_by(subgroup) %>% 
  rename(Race = subgroup) %>% 
  summarise(
    `Hospitalized as % of Cases` = (sum(HOSPITALIZED_RATE)/sum(CASE_RATE))*100,
    `Deaths as % of Cases` = (sum(DEATH_RATE)/sum(CASE_RATE))*100
            ) %>% 
  plot_ly(x = ~Race, y = ~`Hospitalized as % of Cases`, type = 'bar', 
          name = '% Hospitalized', marker = list(color = "#e9d8a6")) %>% 
  add_trace(x = ~Race, y = ~`Deaths as % of Cases`, name = '% Died',
            marker = list(color = "#ae2012")) %>% 
  layout(title = "Share of Infections that Lead to Hopitalizations & Deaths",
         barmode = 'group',
         xaxis = list(title = ""), 
         yaxis = list(title = 'Share of Infection Rate',
         ticksuffix = "%"))

The two charts indicate that, while African-Americans have one of the lowest infections rates, they tend to be hospitalized or die from the virus at the highest rates.

9. Map: COVID-19 Cases by Neighborhood

In this section, I use choropleth maps to visualize and compare infection and death rates (per 100K) among NYC neighborhoods.

To create the maps, i merge the modzcta dataframe (which disaggregates total COVID cases by zip code and neighborhoods) and the modzcta shapefile.

modzcta_merge <- 
  merge(modzcta_shp, modzcta, by.x = "modzcta", by.y = "MODIFIED_ZCTA", all = T)

Infection Rate (per 100K) by Neighborhood

modzcta_merge %>% 
  ggplot() +
  geom_sf(aes(fill = COVID_CASE_RATE),
            color = "white",
            lwd = 0.2) +
    scale_fill_gradientn(
      name = "Infection Rate (per 100K)",
      colors = met.brewer("Pillement", type = "continuous")
    ) +
  theme_void() +
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid = element_line(color = "transparent")
          )

Death Rate (Per 100K) by Neighborhood

modzcta_merge %>% 
  ggplot() +
  geom_sf(aes(fill = COVID_DEATH_RATE),
            color = "white",
            lwd = 0.2) +
    scale_fill_gradientn(
      name = "Death Rate (per 100K)",
      colors = met.brewer("OKeeffe2", type = "continuous")
    ) +
  theme_void() +
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid = element_line(color = "transparent")
          )

10. Conclusion

The following are the trends in observed reported COVID cases in NYC as of 2023-02-14.

  • Infections peaked in January 2022, during the Omicron wave.

  • However, hospitalizations and deaths reached their peaks during the first wave of the pandemic (April 2020). Because of the availability of vaccines, the Omicron wave did not cause as much hospitalization and was not as deadly as the 2020 wave of infections.

  • Because of the size of its population, Brooklyn has seen the highest number of infections, hospitalizations and deaths since the beginning of the pandemic compared to the other boroughs.

    • However, when you adjust for population, Staten Island has the highest rate of infection (per 100K people), while the Bronx has had the highest rate of hospitalization and death rates.
    • The Bronx has seen the largest share of all cases lead to hospitalization (8 percent) and death (1.6 percent).
  • Brooklyn and Queens have consistently averaged the highest number of infections, hospitalizations and deaths since the beginning of the pandemic per day.

  • In terms of age, young people under 45 years have the highest rate of infection. Yet, seniors over 65 years tend to be hospitalized and die at the highest rates.

    • Those over 76 years have seen the largest share of their cases lead to hospitalization (34.3 percent) and deaths (13.6 percent).
  • Even though African-Americans have one of the lowest infection rates, they tend to be hospitalized and die at higher rates compared other races/ethnicities.

    • African-Americans have seen the largest share of their cases lead to hospitalization (9.6 percent) and deaths (3.3 percent).