Tracking COVID-19 Cases in NYC

1. Introduction

In mid-2020, New York City (NYC) became the epicenter of the global COVID-19 pandemic as its residents were forced to shelter in place and economic activity came to a grinding halt. The City Department of Health and Mental Hygiene (DOHMH) tracks and provide data on the number the daily and aggregate number of COVID cases in NYC on its Github repository. However, since DOHMH only provides raw data (in CSV format), it makes it difficult to digest and detect case trends in the city.

This work seeks to extract, transform and analyze and the daily and aggregate cases reported by DOHMH. It uses visuals to depict trends in COVID-19 infections, hospitalizations and deaths across the city. It also examines case trends among boroughs, demographics, and neighborhoods to understand which group is being impacted the most by the pandemic.

The analysis will be updated at the beginning of each week as new data become available to allow for continuous monitoring of COVID-19 trends in NYC. This work is also accompanied by this dashboard hosted on Tableau Public.

2. Data

NYC DOHMH publishes an open source COVID-19 database on its Github repository. The database, which is updated daily, contains numerous tables that provides details about COVID cases, testing and vaccinations. This analyses uses uses three data sets from the repository, namely data-by-day, data-by-group and data-by-modzcta. Below are brief descriptions of each of the data sets.

data-by-day: Provides a daily summary of all Covid cases, hospitalizations and deaths that happened in the City as a whole, and by borough.
data-by-group: Provides a breakdown of total number of cases, hospitalizations and death by different demograpics, including borough, age, gender, and race.
data-by-modzcta: Gives a breakdown of aggregate cases by neighborhood and modified zip code. This data can be used to map COVID cases and deaths by neighborhood when combined with the MODZCTA shape files (can be downloaded from DOHMH’s Github or NYC Open Data Portal).

In addition to the three highlighted above, the analysis also extracts and uses shapefile data from the City’s Open Data Portal to map COVID cases in neighborhoods.

Now, let us extract and load the aforementioned data sets (from DOHMH GitHub page and NYC Open Data Portal) and get them ready for the analysis.

## Daily Data
download.file(
  "https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/data-by-day.csv", 
  destfile = "./data/daily.csv")
daily <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/daily.csv")

## Group Data
download.file(
  "https://raw.githubusercontent.com/nychealth/coronavirus-data/master/totals/by-group.csv",
  destfile = "./data/group.csv")
group <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/group.csv")

## Modzcta (Zip Code) Data
download.file(
  "https://raw.githubusercontent.com/nychealth/coronavirus-data/master/totals/data-by-modzcta.csv",
  destfile = "./data/modzcta.csv")
modzcta <- read.csv("/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/data/modzcta.csv")


## Shape Files: Create folder to store files, download zip, unzip and load
library(plyr)

dir <- "/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files"

dir.create("Shape Files")
do.call(file.remove, list(list.files(path = dir, full.names = TRUE)))

## [1] TRUE TRUE TRUE TRUE TRUE

download.file(
  "https://data.cityofnewyork.us/api/geospatial/pri4-ifjk?method=export&format=Shapefile", 
  destfile = "/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files/modzcta_zip.zip")

zip <- list.files(path = dir, pattern = "*.zip", full.names = T)

ldply(.data = zip, .fun = unzip, exdir = dir)

modzcta_shp <- list.files(path = dir, pattern = "*.shp", full.names = T) %>%
               st_read()

## Reading layer `geo_export_21e58705-0848-4164-b459-ba7b3c19f63e' from data source `/Users/aly_will_mac/Desktop/OLD PC/WILL/LEARNING/1. ALL PROJECTS/R-NYC-COVID-Analysis/Shape Files/geo_export_21e58705-0848-4164-b459-ba7b3c19f63e.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 178 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS:  WGS84(DD)

detach(package:plyr, unload = TRUE)

The tables below show the first few rows of each data set.

Data-by-Day

head(daily)

Data-by-Group

head(group)

Data-by-Modzcta

head(modzcta)

3. Data Examination

In this section, I examine the data sets to identify what needs to cleaned.

3.1. Data Structure and Summary

The tables below depict the structure and summary of the three COVID data sets.

Daily Data

daily %>% 
  skim() %>% 
  kbl() %>% 
  kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>% 
  scroll_box(width = "100%", height = "300px")

skim_type	skim_variable	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	date_of_interest	1	10	10	0	1082	0	NA	NA	NA	NA	NA	NA	NA	NA
numeric	CASE_COUNT	1	NA	NA	NA	NA	NA	2498.6672828	5031.1521389	0	622.00	1515.0	2829.00	55004	▇▁▁▁▁
numeric	PROBABLE_CASE_COUNT	1	NA	NA	NA	NA	NA	488.2412200	630.0617799	0	93.00	371.5	659.75	5882	▇▁▁▁▁
numeric	HOSPITALIZED_COUNT	1	NA	NA	NA	NA	NA	173.6062847	245.7844853	0	48.25	103.5	187.50	1840	▇▁▁▁▁
numeric	DEATH_COUNT	1	NA	NA	NA	NA	NA	35.5859519	76.4503582	0	7.00	13.0	30.75	598	▇▁▁▁▁
numeric	PROBABLE_DEATH_COUNT	1	NA	NA	NA	NA	NA	5.8890943	24.5521343	0	0.00	1.0	2.00	240	▇▁▁▁▁
numeric	CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	2496.5942699	4619.5491798	0	646.50	1569.0	2849.75	39497	▇▁▁▁▁
numeric	ALL_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	2984.1598891	5167.6817532	0	808.50	1973.5	3591.00	43954	▇▁▁▁▁
numeric	HOSP_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	173.4824399	241.4562361	0	48.25	105.0	186.00	1662	▇▁▁▁▁
numeric	DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	35.5582255	75.5599806	0	8.00	12.0	31.00	566	▇▁▁▁▁
numeric	ALL_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	41.4426987	98.7883718	0	8.00	13.0	33.00	775	▇▁▁▁▁
numeric	BX_CASE_COUNT	1	NA	NA	NA	NA	NA	412.7874307	939.3131444	0	82.00	210.5	433.75	10560	▇▁▁▁▁
numeric	BX_PROBABLE_CASE_COUNT	1	NA	NA	NA	NA	NA	96.3641405	144.5544961	0	14.00	65.0	129.00	1575	▇▁▁▁▁
numeric	BX_HOSPITALIZED_COUNT	1	NA	NA	NA	NA	NA	37.3974122	56.8463622	0	9.00	20.0	39.00	390	▇▁▁▁▁
numeric	BX_DEATH_COUNT	1	NA	NA	NA	NA	NA	6.7125693	16.0237915	0	1.00	2.0	5.00	132	▇▁▁▁▁
numeric	BX_PROBABLE_DEATH_COUNT	1	NA	NA	NA	NA	NA	1.1441774	5.0716885	0	0.00	0.0	0.00	46	▇▁▁▁▁
numeric	BX_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	412.4251386	843.7401703	0	83.50	236.0	456.00	7480	▇▁▁▁▁
numeric	BX_PROBABLE_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	96.2347505	132.8572658	0	15.00	73.0	134.75	1094	▇▁▁▁▁
numeric	BX_ALL_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	508.6719039	968.2219943	0	109.00	307.0	595.00	8574	▇▁▁▁▁
numeric	BX_HOSPITALIZED_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	37.3632163	55.4264291	0	9.00	21.0	38.00	358	▇▁▁▁▁
numeric	BX_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	6.7236599	15.6961309	0	1.00	2.0	5.00	117	▇▁▁▁▁
numeric	BX_ALL_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	7.8650647	20.4752123	0	1.00	2.0	5.00	158	▇▁▁▁▁
numeric	BK_CASE_COUNT	1	NA	NA	NA	NA	NA	753.3373383	1484.5024539	0	215.25	462.0	846.00	16665	▇▁▁▁▁
numeric	BK_PROBABLE_CASE_COUNT	1	NA	NA	NA	NA	NA	133.5905730	176.0871495	0	30.00	100.0	173.75	1906	▇▁▁▁▁
numeric	BK_HOSPITALIZED_COUNT	1	NA	NA	NA	NA	NA	52.5609982	72.2766071	0	17.00	31.0	54.00	555	▇▁▁▁▁
numeric	BK_DEATH_COUNT	1	NA	NA	NA	NA	NA	11.0970425	23.5985310	0	2.00	4.0	10.00	201	▇▁▁▁▁
numeric	BK_PROBABLE_DEATH_COUNT	1	NA	NA	NA	NA	NA	1.9833641	8.4983375	0	0.00	0.0	1.00	92	▇▁▁▁▁
numeric	BK_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	752.7107209	1369.5307037	0	231.75	475.0	850.25	11587	▇▁▁▁▁
numeric	BK_PROBABLE_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	133.4242144	163.4622405	0	29.00	106.0	175.00	1213	▇▁▁▁▁
numeric	BK_ALL_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	886.1497227	1521.7549061	0	264.00	586.0	1034.50	12787	▇▁▁▁▁
numeric	BK_HOSPITALIZED_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	52.5341959	70.6597378	0	17.00	32.0	53.00	490	▇▁▁▁▁
numeric	BK_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	11.0998152	23.1532564	0	2.00	4.0	9.00	178	▇▁▁▁▁
numeric	BK_ALL_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	13.0785582	31.0390271	0	3.00	4.0	11.00	252	▇▁▁▁▁
numeric	MN_CASE_COUNT	1	NA	NA	NA	NA	NA	458.1256932	912.2602953	0	107.00	284.0	492.00	9113	▇▁▁▁▁
numeric	MN_PROBABLE_CASE_COUNT	1	NA	NA	NA	NA	NA	90.1210721	113.9212735	0	19.00	69.0	124.00	972	▇▁▁▁▁
numeric	MN_HOSPITALIZED_COUNT	1	NA	NA	NA	NA	NA	26.3049908	35.9481505	0	7.00	16.0	30.00	273	▇▁▁▁▁
numeric	MN_DEATH_COUNT	1	NA	NA	NA	NA	NA	4.8724584	9.9687949	0	1.00	2.0	5.00	92	▇▁▁▁▁
numeric	MN_PROBABLE_DEATH_COUNT	1	NA	NA	NA	NA	NA	0.8040665	3.2174452	0	0.00	0.0	0.00	33	▇▁▁▁▁
numeric	MN_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	457.7088725	831.6672105	0	120.00	303.0	481.50	6394	▇▁▁▁▁
numeric	MN_PROBABLE_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	89.9852126	107.4760467	0	19.00	74.0	128.00	766	▇▁▁▁▁
numeric	MN_ALL_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	547.6931608	932.6721680	0	150.00	369.0	600.25	7160	▇▁▁▁▁
numeric	MN_HOSPITALIZED_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	26.2735675	34.9485021	0	7.00	17.0	30.00	228	▇▁▁▁▁
numeric	MN_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	4.8641405	9.6639669	0	1.00	2.0	4.00	73	▇▁▁▁▁
numeric	MN_ALL_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	5.6580407	12.6055020	0	1.00	2.0	4.00	100	▇▁▁▁▁
numeric	QN_CASE_COUNT	1	NA	NA	NA	NA	NA	697.4195933	1417.2913827	0	148.00	401.5	792.25	15224	▇▁▁▁▁
numeric	QN_PROBABLE_CASE_COUNT	1	NA	NA	NA	NA	NA	135.1423290	170.0577474	0	23.00	100.0	192.75	1609	▇▁▁▁▁
numeric	QN_HOSPITALIZED_COUNT	1	NA	NA	NA	NA	NA	48.7643253	76.0411356	0	13.00	27.0	51.75	609	▇▁▁▁▁
numeric	QN_DEATH_COUNT	1	NA	NA	NA	NA	NA	10.6645102	24.1619041	0	2.00	4.0	9.00	202	▇▁▁▁▁
numeric	QN_PROBABLE_DEATH_COUNT	1	NA	NA	NA	NA	NA	1.6996303	7.2754310	0	0.00	0.0	1.00	68	▇▁▁▁▁
numeric	QN_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	696.8604436	1310.3066818	0	152.00	421.5	822.00	11551	▇▁▁▁▁
numeric	QN_PROBABLE_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	134.9602588	159.5306984	0	22.00	103.5	192.00	1219	▇▁▁▁▁
numeric	QN_ALL_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	831.8262477	1455.8981674	0	193.25	537.0	1043.25	12689	▇▁▁▁▁
numeric	QN_HOSPITALIZED_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	48.7375231	74.6385769	0	13.00	28.0	51.00	562	▇▁▁▁▁
numeric	QN_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	10.6580407	23.7696990	0	2.00	4.0	9.00	177	▇▁▁▁▁
numeric	QN_ALL_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	12.3576710	30.6005052	0	2.00	4.0	9.75	240	▇▁▁▁▁
numeric	SI_CASE_COUNT	1	NA	NA	NA	NA	NA	176.1543438	329.4849669	0	44.00	110.0	196.00	3720	▇▁▁▁▁
numeric	SI_PROBABLE_CASE_COUNT	1	NA	NA	NA	NA	NA	32.9602588	36.1776030	0	6.00	26.0	49.00	316	▇▁▁▁▁
numeric	SI_HOSPITALIZED_COUNT	1	NA	NA	NA	NA	NA	10.6561922	11.6878181	0	3.00	7.0	14.00	83	▇▂▁▁▁
numeric	SI_DEATH_COUNT	1	NA	NA	NA	NA	NA	2.2375231	3.8281731	0	0.00	1.0	3.00	34	▇▁▁▁▁
numeric	SI_PROBABLE_DEATH_COUNT	1	NA	NA	NA	NA	NA	0.2578558	0.9737813	0	0.00	0.0	0.00	9	▇▁▁▁▁
numeric	SI_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	176.0083179	302.4864687	0	44.00	115.0	195.00	2687	▇▁▁▁▁
numeric	SI_PROBABLE_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	32.9038817	33.6238375	0	6.00	27.0	48.75	233	▇▃▁▁▁
numeric	SI_ALL_CASE_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	208.9057301	330.9701999	0	52.25	148.0	247.75	2907	▇▁▁▁▁
numeric	SI_HOSPITALIZED_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	10.6404806	11.1542522	0	4.00	8.0	13.00	72	▇▂▁▁▁
numeric	SI_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	2.2208872	3.5488149	0	1.00	1.0	2.00	26	▇▁▁▁▁
numeric	SI_ALL_DEATH_COUNT_7DAY_AVG	1	NA	NA	NA	NA	NA	2.4889094	4.3554348	0	1.00	1.0	2.00	34	▇▁▁▁▁
numeric	INCOMPLETE	1	NA	NA	NA	NA	NA	394.5101664	4891.1861858	0	0.00	0.0	0.00	60980	▇▁▁▁▁

Group Data

group %>% 
  skim() %>% 
  kbl() %>% 
  kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>% 
  scroll_box(width = "100%", height = "300px")

skim_type	skim_variable	n_missing	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	group	0	1.0000000	3	9	0	6	0	NA	NA	NA	NA	NA	NA	NA	NA
character	subgroup	0	1.0000000	0	22	1	27	0	NA	NA	NA	NA	NA	NA	NA	NA
numeric	CONFIRMED_CASE_RATE	1	0.9629630	NA	NA	NA	NA	NA	30794.5458	4695.5503	19244.90	28100.112	30944.675	33354.9950	40029.78	▁▃▇▅▃
numeric	CASE_RATE	1	0.9629630	NA	NA	NA	NA	NA	36815.0992	5716.9932	22835.38	33413.850	37064.940	39714.1850	47519.76	▁▃▇▅▃
numeric	HOSPITALIZED_RATE	1	0.9629630	NA	NA	NA	NA	NA	2470.8927	2029.8561	246.80	1462.935	2388.545	2618.1925	10815.51	▇▇▁▁▁
numeric	DEATH_RATE	3	0.8888889	NA	NA	NA	NA	NA	665.3563	834.5807	2.70	362.535	547.320	611.0375	4287.67	▇▁▁▁▁
numeric	CONFIRMED_CASE_COUNT	1	0.9629630	NA	NA	NA	NA	NA	598269.3462	550194.4185	100789.00	274767.000	439960.000	710746.2500	2703558.00	▇▃▁▁▁
numeric	PROBABLE_CASE_COUNT	1	0.9629630	NA	NA	NA	NA	NA	116982.3846	107761.1985	18804.00	57311.500	89939.500	143118.0000	528277.00	▇▂▁▁▁
numeric	CASE_COUNT	1	0.9629630	NA	NA	NA	NA	NA	715251.7308	657731.5113	119593.00	332605.500	532428.000	857971.2500	3231835.00	▇▃▁▁▁
numeric	HOSPITALIZED_COUNT	1	0.9629630	NA	NA	NA	NA	NA	46710.0385	43283.2998	1705.00	17588.500	38038.500	59746.2500	206476.00	▇▅▁▁▁
numeric	DEATH_COUNT	3	0.8888889	NA	NA	NA	NA	NA	12327.9583	10791.6671	46.00	5301.500	9434.500	19860.2500	45023.00	▇▂▅▁▁

Modzcta Data

modzcta %>% 
  skim() %>% 
  kbl() %>% 
  kable_classic_2(c("striped", "hovered"), html_font = "Calibri") %>% 
  scroll_box(width = "100%", height = "300px")

skim_type	skim_variable	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	NEIGHBORHOOD_NAME	1	7	59	0	162	0	NA	NA	NA	NA	NA	NA	NA	NA
character	BOROUGH_GROUP	1	5	13	0	5	0	NA	NA	NA	NA	NA	NA	NA	NA
character	label	1	5	12	0	177	0	NA	NA	NA	NA	NA	NA	NA	NA
numeric	MODIFIED_ZCTA	1	NA	NA	NA	NA	NA	10810.37853	5.781733e+02	10001.00000	10301.00000	11109.00000	11361.00000	11697.00000	▇▃▁▇▇
numeric	lat	1	NA	NA	NA	NA	NA	40.72555	8.364830e-02	40.50777	40.67082	40.72644	40.77643	40.89951	▁▅▇▇▃
numeric	lon	1	NA	NA	NA	NA	NA	-73.91881	9.965940e-02	-74.24227	-73.97870	-73.92405	-73.84698	-73.71091	▁▁▇▆▃
numeric	COVID_CONFIRMED_CASE_COUNT	1	NA	NA	NA	NA	NA	14647.78531	8.263303e+03	1042.00000	8478.00000	13583.00000	20598.00000	34197.00000	▆▇▆▅▂
numeric	COVID_PROBABLE_CASE_COUNT	1	NA	NA	NA	NA	NA	2896.77966	1.558973e+03	207.00000	1809.00000	2651.00000	4040.00000	7249.00000	▅▇▅▃▁
numeric	COVID_CASE_COUNT	1	NA	NA	NA	NA	NA	17544.56497	9.698150e+03	1298.00000	10334.00000	16093.00000	24555.00000	40446.00000	▆▇▆▅▂
numeric	COVID_CONFIRMED_CASE_RATE	1	NA	NA	NA	NA	NA	31207.45887	4.677745e+03	19173.54000	28168.04000	30583.42000	33709.48000	48736.97000	▁▇▆▁▁
numeric	COVID_CASE_RATE	1	NA	NA	NA	NA	NA	37588.85480	5.226694e+03	24408.00000	34105.08000	36753.27000	40176.56000	59073.89000	▁▇▅▁▁
numeric	POP_DENOMINATOR	1	NA	NA	NA	NA	NA	47100.66136	2.615157e+04	2972.12000	27180.77000	42737.28000	66856.31000	110369.78000	▅▇▅▃▂
numeric	COVID_CONFIRMED_DEATH_COUNT	1	NA	NA	NA	NA	NA	213.62712	1.536210e+02	0.00000	92.00000	172.00000	316.00000	792.00000	▇▅▃▁▁
numeric	COVID_PROBABLE_DEATH_COUNT	1	NA	NA	NA	NA	NA	35.41243	2.665445e+01	0.00000	15.00000	28.00000	48.00000	109.00000	▇▇▃▂▂
numeric	COVID_DEATH_COUNT	1	NA	NA	NA	NA	NA	249.03955	1.783268e+02	1.00000	111.00000	202.00000	370.00000	895.00000	▇▅▃▂▁
numeric	COVID_CONFIRMED_DEATH_RATE	1	NA	NA	NA	NA	NA	430.18492	1.901605e+02	0.00000	331.16000	425.81000	530.95000	1320.81000	▃▇▃▁▁
numeric	COVID_DEATH_RATE	1	NA	NA	NA	NA	NA	501.75514	2.218349e+02	11.42000	384.33000	490.91000	624.82000	1543.51000	▃▇▃▁▁
numeric	PERCENT_POSITIVE	1	NA	NA	NA	NA	NA	24.90169	4.653173e+00	7.90000	22.67000	25.47000	27.23000	36.37000	▁▁▆▇▁
numeric	TOTAL_COVID_TESTS	1	NA	NA	NA	NA	NA	54017.92090	2.947936e+04	4261.00000	31000.00000	48913.00000	75764.00000	129681.00000	▆▇▆▃▂

From the tables above, the only things that needs to be addressed is missing values in the group data.

3.2. Missing Data

From the data summary tables, only the group data has missing values. Let us check again to make sure.

Daily Data

daily %>% missmap(main = "Missing vs. Observed Values")

Group Data

group %>% missmap(main = "Missing vs. Observed Values")

Modzcta Data

modzcta %>% missmap(main = "Missing vs. Observed Values")

As depicted by the charts, about four percent of the observations in the group data are missing. After further review, it is clear that all the missing observations are from the Age group category under group column. In Section 4.1, I re-code the age groups under the subgroup.

4. Data Wrangling

To get the ready for the analysis, I proceed to clean and manipulate them by executing the following actions.

group data: I combined some of the age categories into one to remove the missing values. I also added corrected borough name from StateIsland to Staten Island.
daily data: I changed the data type of date-of-interest variable from character to date.

4.1. Consolidate ‘0-17’ age group

Under the Age group category, the 0-17 group has three sub-groupings (0-4, 5-12, 12-17). However, the DEATH_RATE & DEATH_COUNT statistics are only provided for the 0-17 age group. Besides DEATH_RATE and DEATH_COUNT, the other COVID statistics are only provided for the age sub-categories and not the main 0-17 category. This creates missing values in the rows containing the age categories as shown below.

group %>% slice(1:6)

To handle the missing data, I use the rollsumr function in R to aggregate statistics for the three sub-categories (0-4, 5-12, 12-17) under the main category (0-17). The sub-categories are subsequently deleted from the table.

group[4, c(3:5, 7:10)] <- rollsumr(group[1:3, c(3:5, 7:10)], k=3) 

group <- group %>% slice(-c(1:3))

Now, check to see if the re-coding took care of the missing values in the group data.

group %>% missmap(main = "Observed vs. Missing Values")

4.2. Clean the `Staten Island` subgroup

In the group table, ‘Staten Island’ is written as StatenIsland as shown in the table below.

group %>% 
  filter(group == "Borough") %>% 
  group_by(subgroup) %>%
  select(subgroup) %>% 
  count() %>% 
  kbl(col.names = c("Borough", "Count")) %>% 
  kable_classic_2(c("striped", "hovered"),
                  html_font = "calibri")

Borough	Count
Bronx	1
Brooklyn	1
Manhattan	1
Queens	1
StatenIsland	1

I clean it by adding a white space to correct the name of the Borough, as show below.

group <- group %>% 
  mutate(subgroup = case_when(
                            subgroup == "StatenIsland" ~ "Staten Island",
                            TRUE ~ subgroup)) 


group %>% 
  filter(group == "Borough") %>% 
  group_by(subgroup) %>%
  select(subgroup) %>% 
  count() %>% 
  kbl(col.names = c("Borough", "Count")) %>% 
  kable_classic_2(c("striped", "hovered"),
                  html_font = "calibri")

Borough	Count
Bronx	1
Brooklyn	1
Manhattan	1
Queens	1
Staten Island	1

4.3. Change data types for `date_of_interest`

In the daily data, the date_of_interest column is stored as a string variable. I change it to a date variable.

daily <- daily %>% 
        mutate(date_of_interest = as.Date(date_of_interest, 
                                          format = "%m/%d/%Y"))


class(daily$date_of_interest)

## [1] "Date"

5. Analyzing Citywide Impact

This section analyzes the daily and total number of COVID cases in the City as a whole as of 2023-02-14.

5.1. Citywide: Total Cases

As of 2023-02-14, NYC had recorded approximately a total of 3.23 million COVID cases since the first case was recorded in early 2020. More than 206,400 of those cases have led to hospitalizations, with over 45,000 people losing their lives.

group %>% 
  filter(group == "Citywide") %>% 
  summarise(`Total Infections` = sum(CASE_COUNT),
            `Total Hospitalizations` = sum(HOSPITALIZED_COUNT),
            `Total Deaths` = sum(DEATH_COUNT)) %>% 
  kbl(align = "lcr",
      format.args = list(big.mark = ",")) %>% 
  kable_classic_2()

Total Infections	Total Hospitalizations	Total Deaths
3,231,835	206,476	45,023

The charts below show the trends in daily Citywide cases since the beginning of the pandemic.

Infections

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~CASE_COUNT, 
          type = 'bar', name = 'Infections Count', marker = list(color = "#5c89c9")) %>% 
  add_trace(X = ~date_of_interest, y = ~CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = '7-Day Avg. Infections',
            marker = list(color = "#1d3557"), line = list(color = "#1e3557")) %>%
  layout(xaxis = list(title = "Date"),
         yaxis = list(title = "Number of People"),
         title = "Daily Citywide Infections")

Hospitalizations

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~HOSPITALIZED_COUNT,
          type = 'bar', name = 'Hosp. Count', marker = list(color = "#e9d8a6")) %>% 
  add_trace(x = ~date_of_interest, y = ~DEATH_COUNT_7DAY_AVG,
            type = 'scatter', mode = 'line', name = '7-Day Avg. Hosp.',
            marker = list(color = "735d1d"), line = list(color = "#735d1d")) %>% 
  layout(xaxis = list(title = NULL),
         yaxis = list(title = "No. of Hospitalizations"),
         title = "Daily Citywide Hospitalizations")

Deaths

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~DEATH_COUNT,
          type = 'bar', name = 'Death Count', marker = list(color = "#f5aaa2")) %>% 
  add_trace(x = ~date_of_interest, y = ~DEATH_COUNT_7DAY_AVG,
            type = 'scatter', mode = 'line', name = '7-Day Avg. Deaths', 
            marker = list(color = "#ae2012"), line = list(color = "#ae2012")) %>% 
  layout(xaxis = list(title = ""), 
         yaxis = list(title = "No. of Deaths"),
         title = "Daily Citywide Deaths")

The charts above show that NYC reached the peak of infection in the beginning of 2022, during the Omicron wave. While there have been three waves in hospitalizations and deaths, most of the hospitalizations and deaths occurred during the initial wave of infections (between March and April of 2020). The availability of vaccines during the Omicron wave appear to have helped reduce the number of hospitalizations and deaths around that time.

5.2. Citywide: New Cases

The table below shows the number of new infections, hospitalizations and deaths recorded on 2023-02-14 - the latest date we have record for.

daily %>% filter(date_of_interest == max(daily$date_of_interest)) %>% 
  group_by(date_of_interest) %>% 
  rename(Date = date_of_interest) %>% 
  summarise(Infections = sum(CASE_COUNT),
            Hospitalizations = sum(HOSPITALIZED_COUNT),
            Deaths = sum(DEATH_COUNT)) %>% 
  kbl(align = "lccr",
      format.args = list(big.mark = ",")) %>% 
  kable_classic_2()

Date	Infections	Hospitalizations	Deaths
2023-02-14	752	0	5

6. Analyzing COVID Impact by Borough

This section disaggregates the daily and total number of COVID cases among the five NYC boroughs.

6.1. Total Cases by Borough

The chart below shows the total number of COVID cases by borough. Because we are looking at raw numbers (and not adjusted for population), densely populated boroughs will show more infections, hospitalizations and deaths. In this case, Brooklyn, the most densely populated borough in NYC, has had more infections, hospitalizations and deaths compared to the other four boroughs.

pal1 <- c("#1e3557", "#e9d8a6", "#ae2012")

group %>% 
  filter(group == "Borough") %>% 
  group_by(subgroup) %>% 
  rename(Borough = subgroup) %>% 
  summarise(`1. Infections` = sum(CASE_COUNT),
            `2. Hospitalizations` = sum(HOSPITALIZED_COUNT),
            `3. Deaths` = sum(DEATH_COUNT)) %>% 
  pivot_longer(cols = 2:4, names_to = "Indicator", values_to = "Case Count") %>% 
  plot_ly(x = ~ Borough, y = ~`Case Count`, type = 'bar', color = ~Indicator, colors = pal1) %>% 
  layout(title = "Total Number of Cases by Borough", barmode = 'stack',
         xaxis=list(title = ""), yaxis=list(title = "Number of People"))

6.2. Daily Average Cases by Borough

The charts below show the trends in the daily average infections, hospitalizations and deaths per borough.

Average Infections by Borough

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~BX_CASE_COUNT_7DAY_AVG, 
          type = 'scatter', mode = 'line', name = 'Bronx') %>% 
  add_trace(x = ~date_of_interest, y= ~BK_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Brookyln') %>% 
  add_trace(x = ~date_of_interest, y= ~MN_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Manhattan') %>% 
  add_trace(x = ~date_of_interest, y= ~QN_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Queens') %>% 
  add_trace(x = ~date_of_interest, y= ~SI_CASE_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Staten Island') %>% 
  layout(xaxis = list(title = ""),
         yaxis = list(title = ""),
         title = "7-Day Avg. Infections by Borough")

Average Hospitalizations by Borough

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~ BX_HOSPITALIZED_COUNT_7DAY_AVG, 
          type = 'scatter', mode = 'line', name = 'Bronx') %>% 
  add_trace(x = ~date_of_interest, y= ~BK_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Brookyln') %>% 
  add_trace(x = ~date_of_interest, y= ~MN_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Manhattan') %>% 
  add_trace(x = ~date_of_interest, y= ~QN_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Queens') %>% 
  add_trace(x = ~date_of_interest, y= ~SI_HOSPITALIZED_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Staten Island') %>% 
  layout(xaxis = list(title = "Date"),
         yaxis = list(title = "Number of Cases"),
         title = "7-Day Avg. Hospitalizations by Borough")

Average Deaths by Borough

daily %>% 
  plot_ly(x = ~date_of_interest, y = ~BX_DEATH_COUNT_7DAY_AVG, 
          type = 'scatter', mode = 'line', name = 'Bronx') %>% 
  add_trace(x = ~date_of_interest, y= ~BK_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Brookyln') %>% 
  add_trace(x = ~date_of_interest, y= ~MN_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Manhattan') %>% 
  add_trace(x = ~date_of_interest, y= ~QN_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Queens') %>% 
  add_trace(x = ~date_of_interest, y= ~SI_DEATH_COUNT_7DAY_AVG, 
            type = 'scatter', mode = 'line', name = 'Staten Island') %>% 
  layout(xaxis = list(title = ""),
         yaxis = list(title = ""),
         title = "7-Day Avg. Deaths by Borough")

The charts above show that daily infections, hospitalizations and deaths have consistently been highest in Brooklyn and Queen.

6.3. Share of Infections that Turned into Hospitalizations and Deaths?

Let us look at the share of infections in each borough that turn in hospitalizations and deaths. This will help us understand which boroughs are having the most severe symptoms from COVID.

group %>% filter(group == "Borough") %>% 
  group_by(subgroup) %>% 
  summarise(
    `Hospitalized as % of Cases` = (sum(HOSPITALIZED_COUNT)/sum(CASE_COUNT))*100,
    `Deaths as % of Cases` = (sum(DEATH_COUNT)/sum(CASE_COUNT))*100
                                     ) %>% 
plot_ly(x=~subgroup, y=~`Hospitalized as % of Cases`, type = 'bar', name = '% Hospitalized',
        marker = list(color = "#e9d8a6")) %>% 
  add_trace(x=~subgroup, y=~`Deaths as % of Cases`, name = '% Died', marker = list(color = "#ae2012")) %>% 
  layout(title = "Share of Infections that Lead to Hopitalizations & Deaths",
         barmode = 'group',
         xaxis = list(title=''), 
         yaxis = list(title='Share of Infections',
                      ticksuffix = "%"))

The chart above indicates that, even though Brooklyn has had the highest number of COVID cases (see section 6.1), the Bronx has seen the largest share of its cases lead to hospitalizations and deaths. This may be because, although less populated than Brooklyn, the Bronx has a lot more people (per capita) living with underlying medical conditions that exacerbate the effects of COVID. For example, the Bronx is known to have one of the highest asthma hospitalization rate in the New York State.¹

6.4. Which Boroughs Have Been Hit the Hardest?

Section 6.1 shows Brooklyn has the highest number of cases, hospitalizations and deaths among all boroughs. This makes sense since Brooklyn is the most populous of the five boroughs. However, to be able to compare boroughs to determine which one has been severely affected, we have to adjust for population. Hence, let us use the rates (per 100,000) statistics to plot total infections, hospitalizations and deaths.

group %>% filter(group == "Borough") %>% 
  group_by(subgroup) %>% 
  rename(Borough = subgroup) %>% 
  summarise(`1. Infection Rate` = sum(CASE_RATE),
            `2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
            `3. Death Rate` = sum(DEATH_RATE)) %>%
  pivot_longer(cols = 2:4, names_to = "Indicator", values_to = "Rate (per 100K)") %>% 
  plot_ly(x = ~ Borough, y = ~`Rate (per 100K)`, type = 'bar', 
          color = ~Indicator, colors = pal1) %>% 
  layout(title = "Case Rate by Borough", barmode = 'stack',
         xaxis=list(title = ""), yaxis=list(title = "Rate (Per 100K)"))

The chart above indicates that after adjusting for population, Staten Island - the least populated borough - has the highest rate of infections. The Bronx, on the hand, has the highest rate of hospitalizations and deaths.

7. Analysing by Age Group

This section details how COVID-19 has impacted NYC residents of different age groups. The data set breaks down age into eight categories - 0-17, 18-24, 24-34, 35-44, 45-54, 55-64, 65-74, and 75+.

7.1. Case, Hospitalization and Death Rates

The first tab shows the infection, hospitalization and death rates (per 100,000) for the various age groups. The second tab shows hospitalization and death rates as a share of case rates.

Case, Hospitalization and Death Rates

pal2 = c("#007a8b", "#224b5e", "#94b594", "#edc775",  "#e09351", "#df7e66", "#b75347", "#6d2f20")

group %>% 
  filter(group == "Age group") %>% 
  group_by(subgroup) %>% 
  rename(`Age Group` = subgroup) %>% 
  summarise(`1. Infection Rate` = sum(CASE_RATE),
            `2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
            `3. Death Rate` = sum(DEATH_RATE)) %>%
  mutate_if(is.numeric, ~(./sum(.))) %>% 
  pivot_longer(cols = 2:4,
             names_to = "Statistics",
             values_to = "Rate (per 100K)") %>% 
  ggplot(aes(x = Statistics,
             y = `Rate (per 100K)`,
             fill = `Age Group`), border = "white") +
  geom_bar(position = 'stack', stat = 'identity') +
  scale_fill_manual(values = pal2) +
  scale_y_continuous(label = percent) +
  theme_minimal() +
  labs(x = NULL,
       y = "Share of Rate (per 100K)",
       title = "Case Rate (per 100K) byAge Group")

Share of Cases that Lead to Hospitalization or Death

group %>% 
  filter(group == "Age group") %>% 
  group_by(subgroup) %>% 
  rename(`Age Group` = subgroup) %>% 
  summarise(
    `Hospitalized as % of Cases` = (sum(HOSPITALIZED_COUNT)/sum(CASE_COUNT))*100,
    `Deaths as % of Cases` = (sum(DEATH_COUNT)/sum(CASE_COUNT))*100
            ) %>% 
  plot_ly(x = ~`Age Group`, y = ~`Hospitalized as % of Cases`, type = 'bar', 
          name = '% Hospitalized', marker = list(color = "#e9d8a6")) %>% 
  add_trace(x = ~`Age Group`, y = ~`Deaths as % of Cases`, name = '% Died',
            marker = list(color = "#ae2012")) %>% 
  layout(title = "Share of Infections that Lead to Hopitalizations & Deaths",
         barmode = 'group',
         xaxis = list(title='Age Group'), 
         yaxis = list(title='Share of Infection Rate',
                      ticksuffix = "%"))

The charts in this section indicate that, while young people (under 45 years) are infected at higher rates than any other age group, only a small share are hospitalized and they barely any die from the virus. On the other hand, seniors, especially those 75 year and over, tend to be hospitalized and die at the highest rate even though they have the lowest infection rates. This is consistent with reports that COVID is much more deadly among seniors.

8. Analysis by Race/Ethnicity

This section details how COVID has affected people of different racial and ethnicity background. The data sets breaks race/ethnicity into four categories - Asian/Pacific-Islander, Black/African-American, Hispani/Latino and White.

8.1. Case, Hospitalization and Death Rates

The first tab shows the infections, hospitalizations and deaths rates (per 100,000) for each race/ethnicity.

The second tab shows hospitalization and death rates as a share of case rates.

Case, Hospitalization and Death Rates

library(MetBrewer)

group %>% 
  filter(group == "Race") %>% 
  group_by(subgroup) %>% 
  rename(`Race/Ethnicity` = subgroup) %>% 
  summarise(`1. Infection Rate` = sum(CASE_RATE),
            `2. Hospitalization Rate` = sum(HOSPITALIZED_RATE),
            `3. Death Rate` = sum(DEATH_RATE)) %>% 
pivot_longer(cols = 2:4,
             names_to = "Indicator",
             values_to = "Rate (per 100K)") %>% 
  ggplot(aes(x = Indicator,
             y = `Race/Ethnicity`,
             fill = `Rate (per 100K)`)) +
  geom_tile() +
  scale_fill_gradientn(colors = met.brewer("Hokusai1", type = "continuous")) +
  theme_minimal() +
  labs(x = NULL, y= NULL, title = "Case Rate (per 100K) by Race/Ethnicity")

Share of Cases that Cause Hospitalization or Death

group %>% 
  filter(group == "Race") %>% 
  group_by(subgroup) %>% 
  rename(Race = subgroup) %>% 
  summarise(
    `Hospitalized as % of Cases` = (sum(HOSPITALIZED_RATE)/sum(CASE_RATE))*100,
    `Deaths as % of Cases` = (sum(DEATH_RATE)/sum(CASE_RATE))*100
            ) %>% 
  plot_ly(x = ~Race, y = ~`Hospitalized as % of Cases`, type = 'bar', 
          name = '% Hospitalized', marker = list(color = "#e9d8a6")) %>% 
  add_trace(x = ~Race, y = ~`Deaths as % of Cases`, name = '% Died',
            marker = list(color = "#ae2012")) %>% 
  layout(title = "Share of Infections that Lead to Hopitalizations & Deaths",
         barmode = 'group',
         xaxis = list(title = ""), 
         yaxis = list(title = 'Share of Infection Rate',
         ticksuffix = "%"))

The two charts indicate that, while African-Americans have one of the lowest infections rates, they tend to be hospitalized or die from the virus at the highest rates.

9. Map: COVID-19 Cases by Neighborhood

In this section, I use choropleth maps to visualize and compare infection and death rates (per 100K) among NYC neighborhoods.

To create the maps, i merge the modzcta dataframe (which disaggregates total COVID cases by zip code and neighborhoods) and the modzcta shapefile.

modzcta_merge <- 
  merge(modzcta_shp, modzcta, by.x = "modzcta", by.y = "MODIFIED_ZCTA", all = T)

Infection Rate (per 100K) by Neighborhood

modzcta_merge %>% 
  ggplot() +
  geom_sf(aes(fill = COVID_CASE_RATE),
            color = "white",
            lwd = 0.2) +
    scale_fill_gradientn(
      name = "Infection Rate (per 100K)",
      colors = met.brewer("Pillement", type = "continuous")
    ) +
  theme_void() +
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid = element_line(color = "transparent")
          )

Death Rate (Per 100K) by Neighborhood

modzcta_merge %>% 
  ggplot() +
  geom_sf(aes(fill = COVID_DEATH_RATE),
            color = "white",
            lwd = 0.2) +
    scale_fill_gradientn(
      name = "Death Rate (per 100K)",
      colors = met.brewer("OKeeffe2", type = "continuous")
    ) +
  theme_void() +
  theme(plot.title.position = 'plot', 
        plot.title = element_text(hjust = 0.5)) +
  theme(panel.grid = element_line(color = "transparent")
          )

10. Conclusion

The following are the trends in observed reported COVID cases in NYC as of 2023-02-14.

Infections peaked in January 2022, during the Omicron wave.
However, hospitalizations and deaths reached their peaks during the first wave of the pandemic (April 2020). Because of the availability of vaccines, the Omicron wave did not cause as much hospitalization and was not as deadly as the 2020 wave of infections.
Because of the size of its population, Brooklyn has seen the highest number of infections, hospitalizations and deaths since the beginning of the pandemic compared to the other boroughs.
- However, when you adjust for population, Staten Island has the highest rate of infection (per 100K people), while the Bronx has had the highest rate of hospitalization and death rates.
- The Bronx has seen the largest share of all cases lead to hospitalization (8 percent) and death (1.6 percent).
Brooklyn and Queens have consistently averaged the highest number of infections, hospitalizations and deaths since the beginning of the pandemic per day.
In terms of age, young people under 45 years have the highest rate of infection. Yet, seniors over 65 years tend to be hospitalized and die at the highest rates.
- Those over 76 years have seen the largest share of their cases lead to hospitalization (34.3 percent) and deaths (13.6 percent).
Even though African-Americans have one of the lowest infection rates, they tend to be hospitalized and die at higher rates compared other races/ethnicities.
- African-Americans have seen the largest share of their cases lead to hospitalization (9.6 percent) and deaths (3.3 percent).

Tracking COVID-19 Cases in NYC

William Kyeremateng

February 17, 2023

1. Introduction

2. Data

Data-by-Day

Data-by-Group

Data-by-Modzcta

3. Data Examination

3.1. Data Structure and Summary

Daily Data

Group Data

Modzcta Data

3.2. Missing Data

Daily Data

Group Data

Modzcta Data

4. Data Wrangling

4.1. Consolidate ‘0-17’ age group

4.2. Clean the Staten Island subgroup

4.3. Change data types for date_of_interest

5. Analyzing Citywide Impact

5.1. Citywide: Total Cases

Infections

Hospitalizations

Deaths

5.2. Citywide: New Cases

6. Analyzing COVID Impact by Borough

6.1. Total Cases by Borough

6.2. Daily Average Cases by Borough

Average Infections by Borough

Average Hospitalizations by Borough

Average Deaths by Borough

6.3. Share of Infections that Turned into Hospitalizations and Deaths?

6.4. Which Boroughs Have Been Hit the Hardest?

7. Analysing by Age Group

7.1. Case, Hospitalization and Death Rates

Case, Hospitalization and Death Rates

Share of Cases that Lead to Hospitalization or Death

8. Analysis by Race/Ethnicity

8.1. Case, Hospitalization and Death Rates

Case, Hospitalization and Death Rates

Share of Cases that Cause Hospitalization or Death

9. Map: COVID-19 Cases by Neighborhood

Infection Rate (per 100K) by Neighborhood

Death Rate (Per 100K) by Neighborhood

10. Conclusion

4.2. Clean the `Staten Island` subgroup

4.3. Change data types for `date_of_interest`