Photo Credit:
Nino Marcutti/Alamy Stock
In mid-2020, New York City (NYC) became the epicenter of the global COVID-19 pandemic as its residents were forced to shelter in place and economic activity came to a grinding halt. The City Department of Health and Mental Hygiene (DOHMH) tracks and provide data on the number the daily and aggregate number of COVID cases in NYC on its Github repository. However, since DOHMH only provides raw data (in CSV format), it makes it difficult to digest and detect case trends in the city.
This work seeks to extract, transform and analyze and the daily and aggregate cases reported by DOHMH. It uses visuals to depict trends in COVID-19 infections, hospitalizations and deaths across the city. It also examines case trends among boroughs, demographics, and neighborhoods to understand which group is being impacted the most by the pandemic.
The analysis will be updated at the beginning of each week as new data become available to allow for continuous monitoring of COVID-19 trends in NYC. This work is also accompanied by this dashboard hosted on Tableau Public.
NYC DOHMH publishes an open source COVID-19 database on its Github
repository. The database, which is updated daily, contains numerous
tables that provides details about COVID cases, testing and
vaccinations. This analyses uses uses three data sets from the
repository, namely data-by-day, data-by-group
and data-by-modzcta. Below are brief descriptions of each
of the data sets.
data-by-day: Provides a daily
summary of all Covid cases, hospitalizations and deaths that happened in
the City as a whole, and by borough.
data-by-group: Provides a breakdown
of total number of cases, hospitalizations and death by different
demograpics, including borough, age, gender, and race.
data-by-modzcta: Gives a breakdown
of aggregate cases by neighborhood and modified zip code. This data can
be used to map COVID cases and deaths by neighborhood when combined with
the MODZCTA shape files (can be downloaded from DOHMH’s Github or NYC
Open Data Portal).
In addition to the three highlighted above, the analysis also extracts and uses shapefile data from the City’s Open Data Portal to map COVID cases in neighborhoods.
Now, let us extract and load the aforementioned data sets (from DOHMH GitHub page and NYC Open Data Portal) and get them ready for the analysis.
## Daily Data
download.file(
"https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/data-by-day.csv",
destfile = "./data/daily.csv")
daily <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/daily.csv")
## Group Data
download.file(
"https://raw.githubusercontent.com/nychealth/coronavirus-data/master/totals/by-group.csv",
destfile = "./data/group.csv")
group <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/group.csv")
## Modzcta (Zip Code) Data
download.file(
"https://raw.githubusercontent.com/nychealth/coronavirus-data/master/totals/data-by-modzcta.csv",
destfile = "./data/modzcta.csv")
modzcta <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/modzcta.csv")
## Shape Files: Create folder to store files, download zip, unzip and load
library(plyr)
dir <- "/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files"
dir.create("Shape Files")
do.call(file.remove, list(list.files(path = dir, full.names = TRUE)))
## [1] TRUE TRUE TRUE TRUE TRUE
download.file(
"https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=Shapefile",
destfile = "/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files/modzcta_zip.zip")
zip <- list.files(path = dir, pattern = "*.zip", full.names = T)
ldply(.data = zip, .fun = unzip, exdir = dir)
modzcta_shp <- list.files(path = dir, pattern = "*.shp", full.names = T) %>%
st_read()
## Reading layer `geo_export_21e58705-0848-4164-b459-ba7b3c19f63e' from data source `/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files/geo_export_21e58705-0848-4164-b459-ba7b3c19f63e.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 178 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS: WGS84(DD)
detach(package:plyr, unload = TRUE)
The tables below show the first few rows of each data set.
head(daily)
head(group)
head(modzcta)
In this section, I examine the data sets to identify what needs to cleaned.
The tables below depict the structure and summary of the three COVID data sets.
daily %>%
skim() %>%
kbl() %>%
kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>%
scroll_box(width = "100%", height = "300px")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | date_of_interest | 0 | 1 | 10 | 10 | 0 | 1082 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 2498.6672828 | 5031.1521389 | 0 | 622.00 | 1515.0 | 2829.00 | 55004 | ▇▁▁▁▁ |
| numeric | PROBABLE_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 488.2412200 | 630.0617799 | 0 | 93.00 | 371.5 | 659.75 | 5882 | ▇▁▁▁▁ |
| numeric | HOSPITALIZED_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 173.6062847 | 245.7844853 | 0 | 48.25 | 103.5 | 187.50 | 1840 | ▇▁▁▁▁ |
| numeric | DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 35.5859519 | 76.4503582 | 0 | 7.00 | 13.0 | 30.75 | 598 | ▇▁▁▁▁ |
| numeric | PROBABLE_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 5.8890943 | 24.5521343 | 0 | 0.00 | 1.0 | 2.00 | 240 | ▇▁▁▁▁ |
| numeric | CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 2496.5942699 | 4619.5491798 | 0 | 646.50 | 1569.0 | 2849.75 | 39497 | ▇▁▁▁▁ |
| numeric | ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 2984.1598891 | 5167.6817532 | 0 | 808.50 | 1973.5 | 3591.00 | 43954 | ▇▁▁▁▁ |
| numeric | HOSP_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 173.4824399 | 241.4562361 | 0 | 48.25 | 105.0 | 186.00 | 1662 | ▇▁▁▁▁ |
| numeric | DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 35.5582255 | 75.5599806 | 0 | 8.00 | 12.0 | 31.00 | 566 | ▇▁▁▁▁ |
| numeric | ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 41.4426987 | 98.7883718 | 0 | 8.00 | 13.0 | 33.00 | 775 | ▇▁▁▁▁ |
| numeric | BX_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 412.7874307 | 939.3131444 | 0 | 82.00 | 210.5 | 433.75 | 10560 | ▇▁▁▁▁ |
| numeric | BX_PROBABLE_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 96.3641405 | 144.5544961 | 0 | 14.00 | 65.0 | 129.00 | 1575 | ▇▁▁▁▁ |
| numeric | BX_HOSPITALIZED_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 37.3974122 | 56.8463622 | 0 | 9.00 | 20.0 | 39.00 | 390 | ▇▁▁▁▁ |
| numeric | BX_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 6.7125693 | 16.0237915 | 0 | 1.00 | 2.0 | 5.00 | 132 | ▇▁▁▁▁ |
| numeric | BX_PROBABLE_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 1.1441774 | 5.0716885 | 0 | 0.00 | 0.0 | 0.00 | 46 | ▇▁▁▁▁ |
| numeric | BX_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 412.4251386 | 843.7401703 | 0 | 83.50 | 236.0 | 456.00 | 7480 | ▇▁▁▁▁ |
| numeric | BX_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 96.2347505 | 132.8572658 | 0 | 15.00 | 73.0 | 134.75 | 1094 | ▇▁▁▁▁ |
| numeric | BX_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 508.6719039 | 968.2219943 | 0 | 109.00 | 307.0 | 595.00 | 8574 | ▇▁▁▁▁ |
| numeric | BX_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 37.3632163 | 55.4264291 | 0 | 9.00 | 21.0 | 38.00 | 358 | ▇▁▁▁▁ |
| numeric | BX_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 6.7236599 | 15.6961309 | 0 | 1.00 | 2.0 | 5.00 | 117 | ▇▁▁▁▁ |
| numeric | BX_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 7.8650647 | 20.4752123 | 0 | 1.00 | 2.0 | 5.00 | 158 | ▇▁▁▁▁ |
| numeric | BK_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 753.3373383 | 1484.5024539 | 0 | 215.25 | 462.0 | 846.00 | 16665 | ▇▁▁▁▁ |
| numeric | BK_PROBABLE_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 133.5905730 | 176.0871495 | 0 | 30.00 | 100.0 | 173.75 | 1906 | ▇▁▁▁▁ |
| numeric | BK_HOSPITALIZED_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 52.5609982 | 72.2766071 | 0 | 17.00 | 31.0 | 54.00 | 555 | ▇▁▁▁▁ |
| numeric | BK_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 11.0970425 | 23.5985310 | 0 | 2.00 | 4.0 | 10.00 | 201 | ▇▁▁▁▁ |
| numeric | BK_PROBABLE_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 1.9833641 | 8.4983375 | 0 | 0.00 | 0.0 | 1.00 | 92 | ▇▁▁▁▁ |
| numeric | BK_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 752.7107209 | 1369.5307037 | 0 | 231.75 | 475.0 | 850.25 | 11587 | ▇▁▁▁▁ |
| numeric | BK_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 133.4242144 | 163.4622405 | 0 | 29.00 | 106.0 | 175.00 | 1213 | ▇▁▁▁▁ |
| numeric | BK_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 886.1497227 | 1521.7549061 | 0 | 264.00 | 586.0 | 1034.50 | 12787 | ▇▁▁▁▁ |
| numeric | BK_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 52.5341959 | 70.6597378 | 0 | 17.00 | 32.0 | 53.00 | 490 | ▇▁▁▁▁ |
| numeric | BK_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 11.0998152 | 23.1532564 | 0 | 2.00 | 4.0 | 9.00 | 178 | ▇▁▁▁▁ |
| numeric | BK_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 13.0785582 | 31.0390271 | 0 | 3.00 | 4.0 | 11.00 | 252 | ▇▁▁▁▁ |
| numeric | MN_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 458.1256932 | 912.2602953 | 0 | 107.00 | 284.0 | 492.00 | 9113 | ▇▁▁▁▁ |
| numeric | MN_PROBABLE_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 90.1210721 | 113.9212735 | 0 | 19.00 | 69.0 | 124.00 | 972 | ▇▁▁▁▁ |
| numeric | MN_HOSPITALIZED_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 26.3049908 | 35.9481505 | 0 | 7.00 | 16.0 | 30.00 | 273 | ▇▁▁▁▁ |
| numeric | MN_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 4.8724584 | 9.9687949 | 0 | 1.00 | 2.0 | 5.00 | 92 | ▇▁▁▁▁ |
| numeric | MN_PROBABLE_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 0.8040665 | 3.2174452 | 0 | 0.00 | 0.0 | 0.00 | 33 | ▇▁▁▁▁ |
| numeric | MN_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 457.7088725 | 831.6672105 | 0 | 120.00 | 303.0 | 481.50 | 6394 | ▇▁▁▁▁ |
| numeric | MN_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 89.9852126 | 107.4760467 | 0 | 19.00 | 74.0 | 128.00 | 766 | ▇▁▁▁▁ |
| numeric | MN_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 547.6931608 | 932.6721680 | 0 | 150.00 | 369.0 | 600.25 | 7160 | ▇▁▁▁▁ |
| numeric | MN_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 26.2735675 | 34.9485021 | 0 | 7.00 | 17.0 | 30.00 | 228 | ▇▁▁▁▁ |
| numeric | MN_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 4.8641405 | 9.6639669 | 0 | 1.00 | 2.0 | 4.00 | 73 | ▇▁▁▁▁ |
| numeric | MN_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 5.6580407 | 12.6055020 | 0 | 1.00 | 2.0 | 4.00 | 100 | ▇▁▁▁▁ |
| numeric | QN_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 697.4195933 | 1417.2913827 | 0 | 148.00 | 401.5 | 792.25 | 15224 | ▇▁▁▁▁ |
| numeric | QN_PROBABLE_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 135.1423290 | 170.0577474 | 0 | 23.00 | 100.0 | 192.75 | 1609 | ▇▁▁▁▁ |
| numeric | QN_HOSPITALIZED_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 48.7643253 | 76.0411356 | 0 | 13.00 | 27.0 | 51.75 | 609 | ▇▁▁▁▁ |
| numeric | QN_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 10.6645102 | 24.1619041 | 0 | 2.00 | 4.0 | 9.00 | 202 | ▇▁▁▁▁ |
| numeric | QN_PROBABLE_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 1.6996303 | 7.2754310 | 0 | 0.00 | 0.0 | 1.00 | 68 | ▇▁▁▁▁ |
| numeric | QN_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 696.8604436 | 1310.3066818 | 0 | 152.00 | 421.5 | 822.00 | 11551 | ▇▁▁▁▁ |
| numeric | QN_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 134.9602588 | 159.5306984 | 0 | 22.00 | 103.5 | 192.00 | 1219 | ▇▁▁▁▁ |
| numeric | QN_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 831.8262477 | 1455.8981674 | 0 | 193.25 | 537.0 | 1043.25 | 12689 | ▇▁▁▁▁ |
| numeric | QN_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 48.7375231 | 74.6385769 | 0 | 13.00 | 28.0 | 51.00 | 562 | ▇▁▁▁▁ |
| numeric | QN_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 10.6580407 | 23.7696990 | 0 | 2.00 | 4.0 | 9.00 | 177 | ▇▁▁▁▁ |
| numeric | QN_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 12.3576710 | 30.6005052 | 0 | 2.00 | 4.0 | 9.75 | 240 | ▇▁▁▁▁ |
| numeric | SI_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 176.1543438 | 329.4849669 | 0 | 44.00 | 110.0 | 196.00 | 3720 | ▇▁▁▁▁ |
| numeric | SI_PROBABLE_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 32.9602588 | 36.1776030 | 0 | 6.00 | 26.0 | 49.00 | 316 | ▇▁▁▁▁ |
| numeric | SI_HOSPITALIZED_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 10.6561922 | 11.6878181 | 0 | 3.00 | 7.0 | 14.00 | 83 | ▇▂▁▁▁ |
| numeric | SI_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 2.2375231 | 3.8281731 | 0 | 0.00 | 1.0 | 3.00 | 34 | ▇▁▁▁▁ |
| numeric | SI_PROBABLE_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 0.2578558 | 0.9737813 | 0 | 0.00 | 0.0 | 0.00 | 9 | ▇▁▁▁▁ |
| numeric | SI_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 176.0083179 | 302.4864687 | 0 | 44.00 | 115.0 | 195.00 | 2687 | ▇▁▁▁▁ |
| numeric | SI_PROBABLE_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 32.9038817 | 33.6238375 | 0 | 6.00 | 27.0 | 48.75 | 233 | ▇▃▁▁▁ |
| numeric | SI_ALL_CASE_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 208.9057301 | 330.9701999 | 0 | 52.25 | 148.0 | 247.75 | 2907 | ▇▁▁▁▁ |
| numeric | SI_HOSPITALIZED_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 10.6404806 | 11.1542522 | 0 | 4.00 | 8.0 | 13.00 | 72 | ▇▂▁▁▁ |
| numeric | SI_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 2.2208872 | 3.5488149 | 0 | 1.00 | 1.0 | 2.00 | 26 | ▇▁▁▁▁ |
| numeric | SI_ALL_DEATH_COUNT_7DAY_AVG | 0 | 1 | NA | NA | NA | NA | NA | 2.4889094 | 4.3554348 | 0 | 1.00 | 1.0 | 2.00 | 34 | ▇▁▁▁▁ |
| numeric | INCOMPLETE | 0 | 1 | NA | NA | NA | NA | NA | 394.5101664 | 4891.1861858 | 0 | 0.00 | 0.0 | 0.00 | 60980 | ▇▁▁▁▁ |
group %>%
skim() %>%
kbl() %>%
kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>%
scroll_box(width = "100%", height = "300px")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | group | 0 | 1.0000000 | 3 | 9 | 0 | 6 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | subgroup | 0 | 1.0000000 | 0 | 22 | 1 | 27 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | CONFIRMED_CASE_RATE | 1 | 0.9629630 | NA | NA | NA | NA | NA | 30794.5458 | 4695.5503 | 19244.90 | 28100.112 | 30944.675 | 33354.9950 | 40029.78 | ▁▃▇▅▃ |
| numeric | CASE_RATE | 1 | 0.9629630 | NA | NA | NA | NA | NA | 36815.0992 | 5716.9932 | 22835.38 | 33413.850 | 37064.940 | 39714.1850 | 47519.76 | ▁▃▇▅▃ |
| numeric | HOSPITALIZED_RATE | 1 | 0.9629630 | NA | NA | NA | NA | NA | 2470.8927 | 2029.8561 | 246.80 | 1462.935 | 2388.545 | 2618.1925 | 10815.51 | ▇▇▁▁▁ |
| numeric | DEATH_RATE | 3 | 0.8888889 | NA | NA | NA | NA | NA | 665.3563 | 834.5807 | 2.70 | 362.535 | 547.320 | 611.0375 | 4287.67 | ▇▁▁▁▁ |
| numeric | CONFIRMED_CASE_COUNT | 1 | 0.9629630 | NA | NA | NA | NA | NA | 598269.3462 | 550194.4185 | 100789.00 | 274767.000 | 439960.000 | 710746.2500 | 2703558.00 | ▇▃▁▁▁ |
| numeric | PROBABLE_CASE_COUNT | 1 | 0.9629630 | NA | NA | NA | NA | NA | 116982.3846 | 107761.1985 | 18804.00 | 57311.500 | 89939.500 | 143118.0000 | 528277.00 | ▇▂▁▁▁ |
| numeric | CASE_COUNT | 1 | 0.9629630 | NA | NA | NA | NA | NA | 715251.7308 | 657731.5113 | 119593.00 | 332605.500 | 532428.000 | 857971.2500 | 3231835.00 | ▇▃▁▁▁ |
| numeric | HOSPITALIZED_COUNT | 1 | 0.9629630 | NA | NA | NA | NA | NA | 46710.0385 | 43283.2998 | 1705.00 | 17588.500 | 38038.500 | 59746.2500 | 206476.00 | ▇▅▁▁▁ |
| numeric | DEATH_COUNT | 3 | 0.8888889 | NA | NA | NA | NA | NA | 12327.9583 | 10791.6671 | 46.00 | 5301.500 | 9434.500 | 19860.2500 | 45023.00 | ▇▂▅▁▁ |
modzcta %>%
skim() %>%
kbl() %>%
kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>%
scroll_box(width = "100%", height = "300px")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | NEIGHBORHOOD_NAME | 0 | 1 | 7 | 59 | 0 | 162 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | BOROUGH_GROUP | 0 | 1 | 5 | 13 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | label | 0 | 1 | 5 | 12 | 0 | 177 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | MODIFIED_ZCTA | 0 | 1 | NA | NA | NA | NA | NA | 10810.37853 | 5.781733e+02 | 10001.00000 | 10301.00000 | 11109.00000 | 11361.00000 | 11697.00000 | ▇▃▁▇▇ |
| numeric | lat | 0 | 1 | NA | NA | NA | NA | NA | 40.72555 | 8.364830e-02 | 40.50777 | 40.67082 | 40.72644 | 40.77643 | 40.89951 | ▁▅▇▇▃ |
| numeric | lon | 0 | 1 | NA | NA | NA | NA | NA | -73.91881 | 9.965940e-02 | -74.24227 | -73.97870 | -73.92405 | -73.84698 | -73.71091 | ▁▁▇▆▃ |
| numeric | COVID_CONFIRMED_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 14647.78531 | 8.263303e+03 | 1042.00000 | 8478.00000 | 13583.00000 | 20598.00000 | 34197.00000 | ▆▇▆▅▂ |
| numeric | COVID_PROBABLE_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 2896.77966 | 1.558973e+03 | 207.00000 | 1809.00000 | 2651.00000 | 4040.00000 | 7249.00000 | ▅▇▅▃▁ |
| numeric | COVID_CASE_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 17544.56497 | 9.698150e+03 | 1298.00000 | 10334.00000 | 16093.00000 | 24555.00000 | 40446.00000 | ▆▇▆▅▂ |
| numeric | COVID_CONFIRMED_CASE_RATE | 0 | 1 | NA | NA | NA | NA | NA | 31207.45887 | 4.677745e+03 | 19173.54000 | 28168.04000 | 30583.42000 | 33709.48000 | 48736.97000 | ▁▇▆▁▁ |
| numeric | COVID_CASE_RATE | 0 | 1 | NA | NA | NA | NA | NA | 37588.85480 | 5.226694e+03 | 24408.00000 | 34105.08000 | 36753.27000 | 40176.56000 | 59073.89000 | ▁▇▅▁▁ |
| numeric | POP_DENOMINATOR | 0 | 1 | NA | NA | NA | NA | NA | 47100.66136 | 2.615157e+04 | 2972.12000 | 27180.77000 | 42737.28000 | 66856.31000 | 110369.78000 | ▅▇▅▃▂ |
| numeric | COVID_CONFIRMED_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 213.62712 | 1.536210e+02 | 0.00000 | 92.00000 | 172.00000 | 316.00000 | 792.00000 | ▇▅▃▁▁ |
| numeric | COVID_PROBABLE_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 35.41243 | 2.665445e+01 | 0.00000 | 15.00000 | 28.00000 | 48.00000 | 109.00000 | ▇▇▃▂▂ |
| numeric | COVID_DEATH_COUNT | 0 | 1 | NA | NA | NA | NA | NA | 249.03955 | 1.783268e+02 | 1.00000 | 111.00000 | 202.00000 | 370.00000 | 895.00000 | ▇▅▃▂▁ |
| numeric | COVID_CONFIRMED_DEATH_RATE | 0 | 1 | NA | NA | NA | NA | NA | 430.18492 | 1.901605e+02 | 0.00000 | 331.16000 | 425.81000 | 530.95000 | 1320.81000 | ▃▇▃▁▁ |
| numeric | COVID_DEATH_RATE | 0 | 1 | NA | NA | NA | NA | NA | 501.75514 | 2.218349e+02 | 11.42000 | 384.33000 | 490.91000 | 624.82000 | 1543.51000 | ▃▇▃▁▁ |
| numeric | PERCENT_POSITIVE | 0 | 1 | NA | NA | NA | NA | NA | 24.90169 | 4.653173e+00 | 7.90000 | 22.67000 | 25.47000 | 27.23000 | 36.37000 | ▁▁▆▇▁ |
| numeric | TOTAL_COVID_TESTS | 0 | 1 | NA | NA | NA | NA | NA | 54017.92090 | 2.947936e+04 | 4261.00000 | 31000.00000 | 48913.00000 | 75764.00000 | 129681.00000 | ▆▇▆▃▂ |
From the tables above, the only things that needs to be addressed is
missing values in the group data.
From the data summary tables, only the group data has
missing values. Let us check again to make sure.
daily %>% missmap(main = "Missing vs. Observed Values")
group %>% missmap(main = "Missing vs. Observed Values")
modzcta %>% missmap(main = "Missing vs. Observed Values")
As depicted by the charts, about four percent of the observations in
the group data are missing. After further review, it is
clear that all the missing observations are from the
Age group category under group column. In
Section 4.1, I re-code the age groups under the
subgroup.
To get the ready for the analysis, I proceed to clean and manipulate them by executing the following actions.
group data: I combined some of the
age categories into one to remove the missing values. I also added
corrected borough name from StateIsland to
Staten Island.
daily data: I changed the data type
of date-of-interest variable from character to
date.
Under the Age group category, the 0-17
group has three sub-groupings (0-4, 5-12, 12-17). However, the
DEATH_RATE & DEATH_COUNT statistics are
only provided for the 0-17 age group. Besides
DEATH_RATE and DEATH_COUNT, the other COVID
statistics are only provided for the age sub-categories and not the main
0-17 category. This creates missing values in the rows
containing the age categories as shown below.
group %>% slice(1:6)
To handle the missing data, I use the rollsumr function
in R to aggregate statistics for the three sub-categories (0-4, 5-12,
12-17) under the main category (0-17). The sub-categories are
subsequently deleted from the table.
group[4, c(3:5, 7:10)] <- rollsumr(group[1:3, c(3:5, 7:10)], k=3)
group <- group %>% slice(-c(1:3))
Now, check to see if the re-coding took care of the missing values in
the group data.
group %>% missmap(main = "Observed vs. Missing Values")
Staten Island subgroupIn the group table, ‘Staten Island’ is written as
StatenIsland as shown in the table below.
group %>%
filter(group == "Borough") %>%
group_by(subgroup) %>%
select(subgroup) %>%
count() %>%
kbl(col.names = c("Borough", "Count")) %>%
kable_classic_2(c("striped", "hovered"),
html_font = "calibri")
| Borough | Count |
|---|---|
| Bronx | 1 |
| Brooklyn | 1 |
| Manhattan | 1 |
| Queens | 1 |
| StatenIsland | 1 |
I clean it by adding a white space to correct the name of the Borough, as show below.
group <- group %>%
mutate(subgroup = case_when(
subgroup == "StatenIsland" ~ "Staten Island",
TRUE ~ subgroup))
group %>%
filter(group == "Borough") %>%
group_by(subgroup) %>%
select(subgroup) %>%
count() %>%
kbl(col.names = c("Borough", "Count")) %>%
kable_classic_2(c("striped", "hovered"),
html_font = "calibri")
| Borough | Count |
|---|---|
| Bronx | 1 |
| Brooklyn | 1 |
| Manhattan | 1 |
| Queens | 1 |
| Staten Island | 1 |
date_of_interestIn the daily data, the date_of_interest column is stored
as a string variable. I change it to a date variable.
daily <- daily %>%
mutate(date_of_interest = as.Date(date_of_interest,
format = "%m/%d/%Y"))
class(daily$date_of_interest)
## [1] "Date"
This section analyzes the daily and total number of COVID cases in the City as a whole as of 2023-02-14.
As of 2023-02-14, NYC had recorded approximately a total of 3.23 million COVID cases since the first case was recorded in early 2020. More than 206,400 of those cases have led to hospitalizations, with over 45,000 people losing their lives.
group %>%
filter(group == "Citywide") %>%
summarise(`Total Infections` = sum(CASE_COUNT),
`Total Hospitalizations` = sum(HOSPITALIZED_COUNT),
`Total Deaths` = sum(DEATH_COUNT)) %>%
kbl(align = "lcr",
format.args = list(big.mark = ",")) %>%
kable_classic_2()
| Total Infections | Total Hospitalizations | Total Deaths |
|---|---|---|
| 3,231,835 | 206,476 | 45,023 |
The charts below show the trends in daily Citywide cases since the beginning of the pandemic.
daily %>%
plot_ly(x = ~date_of_interest, y = ~CASE_COUNT,
type = 'bar', name = 'Infections Count', marker = list(color = "#5c89c9")) %>%
add_trace(X = ~date_of_interest, y = ~CASE_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = '7-Day Avg. Infections',
marker = list(color = "#1d3557"), line = list(color = "#1e3557")) %>%
layout(xaxis = list(title = "Date"),
yaxis = list(title = "Number of People"),
title = "Daily Citywide Infections")
daily %>%
plot_ly(x = ~date_of_interest, y = ~HOSPITALIZED_COUNT,
type = 'bar', name = 'Hosp. Count', marker = list(color = "#e9d8a6")) %>%
add_trace(x = ~date_of_interest, y = ~DEATH_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = '7-Day Avg. Hosp.',
marker = list(color = "735d1d"), line = list(color = "#735d1d")) %>%
layout(xaxis = list(title = NULL),
yaxis = list(title = "No. of Hospitalizations"),
title = "Daily Citywide Hospitalizations")
daily %>%
plot_ly(x = ~date_of_interest, y = ~DEATH_COUNT,
type = 'bar', name = 'Death Count', marker = list(color = "#f5aaa2")) %>%
add_trace(x = ~date_of_interest, y = ~DEATH_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = '7-Day Avg. Deaths',
marker = list(color = "#ae2012"), line = list(color = "#ae2012")) %>%
layout(xaxis = list(title = ""),
yaxis = list(title = "No. of Deaths"),
title = "Daily Citywide Deaths")
The charts above show that NYC reached the peak of infection in the beginning of 2022, during the Omicron wave. While there have been three waves in hospitalizations and deaths, most of the hospitalizations and deaths occurred during the initial wave of infections (between March and April of 2020). The availability of vaccines during the Omicron wave appear to have helped reduce the number of hospitalizations and deaths around that time.
The table below shows the number of new infections, hospitalizations and deaths recorded on 2023-02-14 - the latest date we have record for.
daily %>% filter(date_of_interest == max(daily$date_of_interest)) %>%
group_by(date_of_interest) %>%
rename(Date = date_of_interest) %>%
summarise(Infections = sum(CASE_COUNT),
Hospitalizations = sum(HOSPITALIZED_COUNT),
Deaths = sum(DEATH_COUNT)) %>%
kbl(align = "lccr",
format.args = list(big.mark = ",")) %>%
kable_classic_2()
| Date | Infections | Hospitalizations | Deaths |
|---|---|---|---|
| 2023-02-14 | 752 | 0 | 5 |
This section disaggregates the daily and total number of COVID cases among the five NYC boroughs.
The chart below shows the total number of COVID cases by borough. Because we are looking at raw numbers (and not adjusted for population), densely populated boroughs will show more infections, hospitalizations and deaths. In this case, Brooklyn, the most densely populated borough in NYC, has had more infections, hospitalizations and deaths compared to the other four boroughs.
pal1 <- c("#1e3557", "#e9d8a6", "#ae2012")
group %>%
filter(group == "Borough") %>%
group_by(subgroup) %>%
rename(Borough = subgroup) %>%
summarise(`1. Infections` = sum(CASE_COUNT),
`2. Hospitalizations` = sum(HOSPITALIZED_COUNT),
`3. Deaths` = sum(DEATH_COUNT)) %>%
pivot_longer(cols = 2:4, names_to = "Indicator", values_to = "Case Count") %>%
plot_ly(x = ~ Borough, y = ~`Case Count`, type = 'bar', color = ~Indicator, colors = pal1) %>%
layout(title = "Total Number of Cases by Borough", barmode = 'stack',
xaxis=list(title = ""), yaxis=list(title = "Number of People"))
The charts below show the trends in the daily average infections, hospitalizations and deaths per borough.
daily %>%
plot_ly(x = ~date_of_interest, y = ~BX_CASE_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Bronx') %>%
add_trace(x = ~date_of_interest, y= ~BK_CASE_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Brookyln') %>%
add_trace(x = ~date_of_interest, y= ~MN_CASE_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Manhattan') %>%
add_trace(x = ~date_of_interest, y= ~QN_CASE_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Queens') %>%
add_trace(x = ~date_of_interest, y= ~SI_CASE_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Staten Island') %>%
layout(xaxis = list(title = ""),
yaxis = list(title = ""),
title = "7-Day Avg. Infections by Borough")
daily %>%
plot_ly(x = ~date_of_interest, y = ~ BX_HOSPITALIZED_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Bronx') %>%
add_trace(x = ~date_of_interest, y= ~BK_HOSPITALIZED_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Brookyln') %>%
add_trace(x = ~date_of_interest, y= ~MN_HOSPITALIZED_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Manhattan') %>%
add_trace(x = ~date_of_interest, y= ~QN_HOSPITALIZED_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Queens') %>%
add_trace(x = ~date_of_interest, y= ~SI_HOSPITALIZED_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Staten Island') %>%
layout(xaxis = list(title = "Date"),
yaxis = list(title = "Number of Cases"),
title = "7-Day Avg. Hospitalizations by Borough")
daily %>%
plot_ly(x = ~date_of_interest, y = ~BX_DEATH_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Bronx') %>%
add_trace(x = ~date_of_interest, y= ~BK_DEATH_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Brookyln') %>%
add_trace(x = ~date_of_interest, y= ~MN_DEATH_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Manhattan') %>%
add_trace(x = ~date_of_interest, y= ~QN_DEATH_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Queens') %>%
add_trace(x = ~date_of_interest, y= ~SI_DEATH_COUNT_7DAY_AVG,
type = 'scatter', mode = 'line', name = 'Staten Island') %>%
layout(xaxis = list(title = ""),
yaxis = list(title = ""),
title = "7-Day Avg. Deaths by Borough")
The charts above show that daily infections, hospitalizations and deaths have consistently been highest in Brooklyn and Queen.
Section 6.1 shows Brooklyn has the highest number of cases, hospitalizations and deaths among all boroughs. This makes sense since Brooklyn is the most populous of the five boroughs. However, to be able to compare boroughs to determine which one has been severely affected, we have to adjust for population. Hence, let us use the rates (per 100,000) statistics to plot total infections, hospitalizations and deaths.
group %>% filter(group == "Borough") %>%
group_by(subgroup) %>%
rename(Borough = subgroup) %>%
summarise(`1. Infection Rate` = sum(CASE_RATE),
`2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
`3. Death Rate` = sum(DEATH_RATE)) %>%
pivot_longer(cols = 2:4, names_to = "Indicator", values_to = "Rate (per 100K)") %>%
plot_ly(x = ~ Borough, y = ~`Rate (per 100K)`, type = 'bar',
color = ~Indicator, colors = pal1) %>%
layout(title = "Case Rate by Borough", barmode = 'stack',
xaxis=list(title = ""), yaxis=list(title = "Rate (Per 100K)"))
The chart above indicates that after adjusting for population, Staten Island - the least populated borough - has the highest rate of infections. The Bronx, on the hand, has the highest rate of hospitalizations and deaths.
This section details how COVID-19 has impacted NYC residents of
different age groups. The data set breaks down age into eight categories
- 0-17, 18-24, 24-34,
35-44, 45-54, 55-64,
65-74, and 75+.
The first tab shows the infection, hospitalization and death rates (per 100,000) for the various age groups. The second tab shows hospitalization and death rates as a share of case rates.
pal2 = c("#007a8b", "#224b5e", "#94b594", "#edc775", "#e09351", "#df7e66", "#b75347", "#6d2f20")
group %>%
filter(group == "Age group") %>%
group_by(subgroup) %>%
rename(`Age Group` = subgroup) %>%
summarise(`1. Infection Rate` = sum(CASE_RATE),
`2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
`3. Death Rate` = sum(DEATH_RATE)) %>%
mutate_if(is.numeric, ~(./sum(.))) %>%
pivot_longer(cols = 2:4,
names_to = "Statistics",
values_to = "Rate (per 100K)") %>%
ggplot(aes(x = Statistics,
y = `Rate (per 100K)`,
fill = `Age Group`), border = "white") +
geom_bar(position = 'stack', stat = 'identity') +
scale_fill_manual(values = pal2) +
scale_y_continuous(label = percent) +
theme_minimal() +
labs(x = NULL,
y = "Share of Rate (per 100K)",
title = "Case Rate (per 100K) byAge Group")
The charts in this section indicate that, while young people (under 45 years) are infected at higher rates than any other age group, only a small share are hospitalized and they barely any die from the virus. On the other hand, seniors, especially those 75 year and over, tend to be hospitalized and die at the highest rate even though they have the lowest infection rates. This is consistent with reports that COVID is much more deadly among seniors.
This section details how COVID has affected people of different
racial and ethnicity background. The data sets breaks race/ethnicity
into four categories - Asian/Pacific-Islander,
Black/African-American, Hispani/Latino and
White.
The first tab shows the infections, hospitalizations and deaths rates (per 100,000) for each race/ethnicity.
The second tab shows hospitalization and death rates as a share of case rates.
library(MetBrewer)
group %>%
filter(group == "Race") %>%
group_by(subgroup) %>%
rename(`Race/Ethnicity` = subgroup) %>%
summarise(`1. Infection Rate` = sum(CASE_RATE),
`2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
`3. Death Rate` = sum(DEATH_RATE)) %>%
pivot_longer(cols = 2:4,
names_to = "Indicator",
values_to = "Rate (per 100K)") %>%
ggplot(aes(x = Indicator,
y = `Race/Ethnicity`,
fill = `Rate (per 100K)`)) +
geom_tile() +
scale_fill_gradientn(colors = met.brewer("Hokusai1", type = "continuous")) +
theme_minimal() +
labs(x = NULL, y= NULL, title = "Case Rate (per 100K) by Race/Ethnicity")
The two charts indicate that, while African-Americans have one of the lowest infections rates, they tend to be hospitalized or die from the virus at the highest rates.
In this section, I use choropleth maps to visualize and compare infection and death rates (per 100K) among NYC neighborhoods.
To create the maps, i merge the modzcta dataframe (which
disaggregates total COVID cases by zip code and neighborhoods) and the
modzcta shapefile.
modzcta_merge <-
merge(modzcta_shp, modzcta, by.x = "modzcta", by.y = "MODIFIED_ZCTA", all = T)
modzcta_merge %>%
ggplot() +
geom_sf(aes(fill = COVID_CASE_RATE),
color = "white",
lwd = 0.2) +
scale_fill_gradientn(
name = "Infection Rate (per 100K)",
colors = met.brewer("Pillement", type = "continuous")
) +
theme_void() +
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5)) +
theme(panel.grid = element_line(color = "transparent")
)
modzcta_merge %>%
ggplot() +
geom_sf(aes(fill = COVID_DEATH_RATE),
color = "white",
lwd = 0.2) +
scale_fill_gradientn(
name = "Death Rate (per 100K)",
colors = met.brewer("OKeeffe2", type = "continuous")
) +
theme_void() +
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5)) +
theme(panel.grid = element_line(color = "transparent")
)
The following are the trends in observed reported COVID cases in NYC as of 2023-02-14.
Infections peaked in January 2022, during the Omicron wave.
However, hospitalizations and deaths reached their peaks during the first wave of the pandemic (April 2020). Because of the availability of vaccines, the Omicron wave did not cause as much hospitalization and was not as deadly as the 2020 wave of infections.
Because of the size of its population, Brooklyn has seen the highest number of infections, hospitalizations and deaths since the beginning of the pandemic compared to the other boroughs.
Brooklyn and Queens have consistently averaged the highest number of infections, hospitalizations and deaths since the beginning of the pandemic per day.
In terms of age, young people under 45 years have the highest rate of infection. Yet, seniors over 65 years tend to be hospitalized and die at the highest rates.
Even though African-Americans have one of the lowest infection rates, they tend to be hospitalized and die at higher rates compared other races/ethnicities.