To support New York City managers' plans for workload and resource distribution, it is important to understand where and when issues will arise across the city. NYC receives roughly 2 million non-emergency service requests (or complaints) each year via 311, the city’s non-emergency service line. The city is committed to resolving each complaint in a timely manner, which means it must anticipate which service requests will be received and activate and support the appropriate responding city agency. As NYC is a large city with very diverse neighborhoods, the types of requests, the volume of those requests, and the time in which they are resolved are likely to vary greatly across areas. To help city managers prepare for the future, we visualize and analyze historical 311 data to identify trends in call volume, response times, and geographic concentration of issues.
There is a lot of data! NYC 311 has over 10 years' worth of service request data, updated daily via NYC Open Data (see link below), which currently includes over 20 million unique records. Each record has 40 fields that roughly cover time, location, responsible city agency, complaint type, community board, and complaint method (online, phone, etc.). Most of the data is categorical; however, a number of continuous variables may be derived.
To augment this data we combine it with data from the 2018 U.S. Census American Community Survey 5-year summary (see link and library below). The demographic data will hopefully help us see whether complaint trends are related to specific demographic factors. We use zip codes to define geographic groups, as that is the clearest geographic grouping in the 311 data that still provides a significant number of distinct values. As it turns out, the Census does not regularly use zip codes, and therefore only estimates zip code approximations for the 5-year summary. These are known as ZIP Code Tabulation Areas (ZCTAs). After removing zip codes that do not appear in both sets of data, and removing zip codes with a population of zero, we are left with 184 unique NYC zip codes.
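The zip code filtering described above amounts to a set intersection followed by a population check. The following is a minimal sketch using only the Python standard library; the function name `usable_zip_codes`, the input shapes, and the sample values are illustrative assumptions, not the project's actual code.

```python
def usable_zip_codes(complaint_zips, zcta_population):
    """Return zip codes present in both datasets with a nonzero population.

    complaint_zips: iterable of zip code strings taken from the 311 records.
    zcta_population: dict mapping ZCTA string -> estimated total population.
    (Both input shapes are assumptions made for this sketch.)
    """
    in_both = set(complaint_zips) & set(zcta_population)
    return sorted(z for z in in_both if zcta_population[z] > 0)


# Made-up sample values, for illustration only.
complaints = ["11101", "10001", "11101", "99999", "10004"]
population = {"11101": 30000, "10001": 25000, "10004": 0, "10002": 75000}

kept = usable_zip_codes(complaints, population)
print(kept)  # ['10001', '11101']
```

Here `99999` is dropped because it has no Census match, `10002` because it has no complaints, and `10004` because its population is zero.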
For this analysis we also reduced the dataset to the 6 most common agencies and removed incomplete records, records with dummy values, and records with nonsensical values (e.g. a closed date earlier than the created date). After cleaning and down-selecting, our initial dataset comprises 24 variables. Select variables include:
Select Variables in Consideration (see full list in link below)

Variable | Description | Data type | Unique values | Source |
---|---|---|---|---|
Created date | Date service request was created | Datetime - year-month-day hour:min:sec | 11,822,621 | NYC 311 |
Agency | City agency responsible for the service request | Categorical - string - ex. 'NYPD' or 'HPD' | 6 | NYC 311 |
Complaint Type | This is the first level of a hierarchy identifying the topic of the incident or condition. Complaint Type may have a corresponding Descriptor or may stand alone. | Categorical - string - ex. 'Noise - Residential', or 'Heat/Hot Water' | 162 | NYC 311 |
Borough | NYC Borough where request submitted | Categorical - string - ex. 'BRONX', or 'BROOKLYN' | 6 | NYC 311 |
Zip code | 5-digit zip code derived from the reported incident zip and compared to available zip codes (with population > 0) provided by the Census | Categorical - string - ex. '11101' | 184 | NYC 311 and US Census |
Total population | Total estimated population within the ZCTA | Continuous - integer | 184 | US Census |
Median Income | Estimated median income within the ZCTA | Continuous - integer | 183 | US Census |
HS or above | Estimated % with high school degree or above | Continuous - integer | 184 | US Census |
Response time | Duration of time between service request created date and closed date - in days | Continuous - integer | >10 million | Calculated from 311 data |
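The derived response time and the "closed date earlier than created date" filter can be sketched as follows. This is a hedged illustration using only the standard library: the record layout, the field names (`created`, `closed`, `response_days`), and the choice to truncate to whole days are assumptions, and the sample records are made up.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # year-month-day hour:min:sec


def response_time_days(created, closed):
    """Return whole days between created and closed, or None if unusable.

    A missing date, or a closed date earlier than the created date,
    marks the record as invalid and yields None.
    """
    if not created or not closed:
        return None
    start = datetime.strptime(created, DATE_FORMAT)
    end = datetime.strptime(closed, DATE_FORMAT)
    if end < start:
        return None
    return (end - start).days  # truncates partial days (an assumption)


# Made-up records, for illustration only.
records = [
    {"agency": "NYPD", "created": "2019-01-01 08:00:00", "closed": "2019-01-01 10:30:00"},
    {"agency": "HPD", "created": "2019-01-01 08:00:00", "closed": "2019-01-04 09:00:00"},
    {"agency": "HPD", "created": "2019-01-05 08:00:00", "closed": "2019-01-02 08:00:00"},
    {"agency": "DOT", "created": "2019-01-05 08:00:00", "closed": ""},
]

cleaned = []
for record in records:
    days = response_time_days(record["created"], record["closed"])
    if days is not None:
        cleaned.append({**record, "response_days": days})

print([r["response_days"] for r in cleaned])  # [0, 3]
```

The third record is dropped because it closes before it opens, and the fourth because it has no closed date.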
We used the OSEMN (Obtain, Scrub, Explore, Model, INterpret) framework for planning and executing our project. The primary questions we sought to answer were:
- Can we reliably predict daily service request volume?
- Can we reliably predict service request response times?
- Are there noticeable differences in service request types, volumes, or responses across different areas (boroughs or zip codes)?
To predict daily service request volume and response times we compared the results from a number of modeling approaches including Seasonal Autoregressive Integrated Moving Average (SARIMA), Random Forest Regression, Linear Regression, and Long Short-Term Memory (LSTM) neural networks.
To identify and analyze geographic trends we visualized the data and conducted hypothesis tests.
Before moving into our ML analysis predicting response times and call volumes, we explored and visualized the data to observe a few high-level trends. We observed that different complaint types and areas of the city show very different trends. This ultimately led to the useful insight that our models need to be pointed at fairly specific slices of the data. Since our models in this instance are univariate, that's not too surprising. Model performance was not as strong when attempting to predict something like city-wide response time. Considering that we were looking at ~30 different types of complaints from across a very diverse metropolitan area, again, that's not too surprising. The response to a noise complaint is likely more timely than the response to something like a pothole or a dead tree, as the issues occur over different time frames and call for very different responses.
For the sake of simplicity, in this project we modeled only city-wide trends and trends concerning the most popular complaint types. But first we were interested to see how demographics and geography may relate to how residents experience response times or types of issues.
Zip codes in the top income range experience statistically significantly different response times than zip codes in lower income ranges

Our first observation was that the median income of a zip code does appear to have some relationship with response times for residential noise complaints, as there is a statistically significant difference in mean response times across income ranges. Further analysis is required here, as a number of other factors likely contribute more directly to the mean response time for these different areas, such as volume of complaints or geographic concentration/dispersion, but it is an interesting first insight.
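The write-up does not say which hypothesis test was used. As one possible illustration of testing a difference in mean response times between two income groups, here is a minimal sketch of a two-sided permutation test using only the standard library. The function name, the group names, and the sample values are all hypothetical, and this is a swapped-in technique rather than the project's actual method.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up mean response times in days, for illustration only.
top_income = [1, 2, 3, 4, 5]
lower_income = [10, 11, 12, 13, 14]

p_value = permutation_p_value(top_income, lower_income)
print(p_value < 0.05)
```

A small p-value only indicates that the group means differ more than random relabeling would explain; it says nothing about why, which is the caveat raised above about confounding factors such as complaint volume.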
Most common complaint type by zip code

Another interesting thing to observe was which complaints occurred most frequently in each zip code. The most interesting insight was that although heat and hot water complaints are the second most common complaint type by volume (by a large margin), they are the most common complaint in only ~8 of the ~160 zip codes we analyzed. It would be interesting to research this further to determine why these complaints are so tightly clustered and how this may be addressed.
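Finding the leading complaint type in each zip code, and then counting how many zip codes each type leads, can be sketched with `collections.Counter`. This is a minimal illustration: the function name `top_complaint_by_zip`, the `(zip_code, complaint_type)` pair layout, and the sample records are assumptions.

```python
from collections import Counter, defaultdict


def top_complaint_by_zip(records):
    """Map each zip code to its most frequent complaint type.

    records: iterable of (zip_code, complaint_type) pairs (an assumed shape).
    Ties resolve to the complaint type seen first for that zip code.
    """
    counts = defaultdict(Counter)
    for zip_code, complaint_type in records:
        counts[zip_code][complaint_type] += 1
    return {z: c.most_common(1)[0][0] for z, c in counts.items()}


# Made-up records, for illustration only.
records = [
    ("11101", "Noise - Residential"),
    ("11101", "Heat/Hot Water"),
    ("11101", "Noise - Residential"),
    ("10453", "Heat/Hot Water"),
    ("10453", "Heat/Hot Water"),
    ("10001", "Noise - Residential"),
]

top = top_complaint_by_zip(records)
zips_led_by = Counter(top.values())
print(zips_led_by["Noise - Residential"], zips_led_by["Heat/Hot Water"])  # 2 1
```

In this toy data both complaint types have the same total volume, yet one leads in two zip codes and the other in only one, which is the kind of geographic clustering described above.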
LSTM slightly outperforms SARIMA with an RMSE of 449 vs 459

When it came to modeling, we used an LSTM and SARIMA to construct univariate models of call volumes and response times based on historical data. The LSTM only slightly outperformed SARIMA, with a margin of error of roughly 10% when predicting call volumes and response times. As mentioned previously, the models become more accurate the more you refine the data selection down to specific complaint types and geographies.
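For reference, the RMSE used to compare the two models is the square root of the mean squared forecast error. The sketch below shows the calculation on made-up numbers; `forecast_a` and `forecast_b` are hypothetical stand-ins for two models' predictions and do not reproduce the project's results.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up daily call volumes and forecasts, for illustration only.
actual = [5000, 5200, 5100, 4000]
forecast_a = [5100, 5200, 5000, 4100]  # three moderate misses
forecast_b = [5200, 5200, 5100, 4000]  # one large miss

print(round(rmse(actual, forecast_a), 1))  # 86.6
print(rmse(actual, forecast_b))            # 100.0
```

Because errors are squared before averaging, RMSE penalizes a single large miss more heavily than several smaller ones, which matters when forecasts feed staffing decisions.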
To support firmer recommendations, we recommend that further refined modeling be done and that a persistent modeling pipeline be developed that updates on a daily basis. It is clear from our outcomes that the most useful and reliably predicted city-wide metric is total call volume. This model (or combination of models) can be used to project and plan for flexible staffing needs and contingent workforces. Additionally, the city may conduct further research into the driving factors behind the observed differences across boroughs and zip codes, and use the findings to suggest and prioritize future public works projects and public funding requests.
- Multi-step LSTM Modeling
- Multivariate LSTM Modeling
- K-Means Cluster Analysis and Demographic Trends
- Interactive Mapping and Dashboarding