---
title: "Baltimore Crime"
output:
  flexdashboard::flex_dashboard:
    theme: cerulean
    source_code: https://github.com/crsimmons1/capstone
---
```{r setup, include=FALSE}
#loading in needed packages
library(tidyverse)
library(DT)
library(plotly)
library(zoo)
library(ggmap) # NOTE: to run this file, register your computer with an API key through https://cloud.google.com/maps-platform/
library(ggrepel)
library(ggthemes)
```
Introduction to the Data {data-orientation=rows data-icon="fas fa-info-circle"}
===============================================================================
Row {data-height=700}
---------------------
### **Data**
The Baltimore crime data was retrieved from Data.gov and is updated every Monday; the original dataset is available [here](https://data.baltimorecity.gov/). Each row represents one offense, with information such as the location, the type of offense, and the neighborhood where the offense took place. The table below shows the first 50 rows of the final dataset used for this analysis.
```{r, echo=FALSE, warning=FALSE}
balt_clean <- read.csv("baltimore_cleaned.csv") #reading in baltimore dataset
headed <- head(balt_clean, n=50) # viewing first 50 rows
datatable(headed, options=list(bPaginate=FALSE))
```
Row {data-height=300}
--------------------
### **Added Variables**
Added Variables include:
1. **Weather Data**: precipitation, snow, snow depth, and average temperature
2. **Socioeconomic Data**: the unemployment rate for every month from 2015 to 2019
### *Image retrieved from www.baltimorepolice.org*
![](https://www.baltimorepolice.org/sites/default/files/New_BPD_Official_Seal_2017.png)
Graphs {data-orientation=rows data-icon="fas fa-chart-area"}
======================================
Row {.tabset}
------------
### Time Series Graph
```{r, echo=FALSE}
#Total Crime over years by season- load in ggthemes package
#adding season variable
yq <- as.yearqtr(as.yearmon(balt_clean$CrimeDate, "%m/%d/%Y") + 1/12) # converting dates to year-quarters, shifted one month so quarters line up with seasons
balt_clean$Season <- factor(format(yq, "%q"),
levels=1:4,
labels=c("Winter","Spring", "Summer", "Fall"))
tot_c <- balt_clean %>% group_by(CrimeDate, Season) %>% summarise(Count = n()) # obtaining daily total crime count
tot_c$CrimeDate <- as.Date(tot_c$CrimeDate, format="%m/%d/%Y")
fig0 <- ggplot(tot_c) + # plotting crime count
geom_line(aes(CrimeDate, Count, group=1)) +
scale_x_date(date_labels="%b %Y", date_breaks="1 year") +
ggtitle("Total Crime Throughout the Years") +
theme_economist()+
theme(plot.title=element_text(hjust=0.5))
ggplotly(fig0, dynamicTicks = TRUE) %>% # animating visualization with ggplotly
rangeslider() %>%
layout(hovermode = "x")
```
### Line Graph of Daypart
```{r,echo=FALSE}
daypart_crime <- balt_clean %>% group_by(DayPart, Year) %>% summarise(count = n()) # obtaining the crime count for each part of day per year
ggplot(data=daypart_crime, aes(x=Year, y=count)) + # plotting day-part counts
geom_line(aes(linetype=DayPart)) +
directlabels::geom_dl(aes(label=DayPart), method="smart.grid") +
theme_minimal() +
theme(legend.position="none") +
ylab("Total Crime")+
ggtitle("Total Crime Over the Years by Time of Day in Baltimore")
```
### Bar Graph of Districts
```{r,echo=FALSE}
district1 <- balt_clean %>% # getting violent and nonviolent count for each district
group_by(District, Violent, Nonviolent) %>%
summarise(Count=n()) %>%
subset(District != "UNKNOWN") %>%
mutate(Type=(Violent==1)) %>%
mutate(Type= replace(Type, Violent==1, "Violent")) %>%
mutate(Type = replace(Type, Nonviolent==1, "Non-Violent"))
ggplot(district1, aes(x= District, y=Count, group=Type, color=Type, fill=Type)) + #plotting counts
coord_flip()+
geom_bar(stat="identity", position=position_dodge()) +
theme_minimal() +
geom_text(aes(label=Count), position=position_dodge(0.9), hjust=-0.2) +
scale_y_continuous(limits=c(0,35000)) +
xlab("Districts of Baltimore") +
ggtitle("Crime Count Throughout the Districts")
```
### Line Graph of Seasons
```{r,echo=FALSE}
library(gghighlight)
yq <- as.yearqtr(as.yearmon(balt_clean$CrimeDate, "%m/%d/%Y") + 1/12) # changing date
balt_clean$Season <- factor(format(yq, "%q"), levels=1:4, labels=c("Winter","Spring", "Summer", "Fall")) # adding season variable
try2 <- balt_clean %>% # obtaining violent and nonviolent counts every year per season
group_by(Year, Season) %>%
count(Violent, Nonviolent) %>%
mutate(ViolentorNot = (Violent == 1))
plot2 <- ggplot(try2, aes(x=Year, y=n, color=Season)) + # plotting violent and nonviolent counts per year per season
geom_line(aes(group=interaction(Season, ViolentorNot))) +
scale_color_discrete(labels=c('Summer', "Fall", "Spring", "Winter")) +
ggtitle("Yearly Count of Violent vs. Non-Violent Offenses by the Seasons") +
ylab("Count of Offenses") + theme_classic() +
geom_text(x=2018.65, y=9280, label="Non-Violent", color="black", show.legend=FALSE) +
geom_text(x=2018.8, y=4870, label="Violent", color="black", show.legend=FALSE)
ggplotly(plot2, tooltip=c("y")) #wrapping plot with plotly function to animate
```
Row
------
### Percent of Violent Crimes
```{r, echo=FALSE}
library(flexdashboard)
crimeperday <- balt_clean %>% # obtaining value for valueBox
count(Violent)
val <- round(sum(crimeperday$n[crimeperday$Violent == 1]) / sum(crimeperday$n) * 100, 2) # percent of all recorded crimes that were violent
valueBox(val, icon="fas fa-percent") # generating valueBox
```
### Average Unemployment Rate
```{r, echo=FALSE}
val2 <- round(mean(balt_clean$Unemployment), 2) # obtaining value for valueBox
valueBox(val2, icon="fas fa-exclamation-circle") # generating valueBox
```
### District with the Most Violent Crimes
```{r, echo=FALSE}
manip <- balt_clean %>% #obtaining value for valueBox
group_by(District) %>%
subset(Violent==1) %>%
summarise(count=n())
max_dist <- manip[manip$count==max(manip$count), "District"] # storing value
valueBox(value=max_dist$District, icon="fas fa-map-marked-alt") # generating valueBox
```
Analysis {data-icon="fas fa-align-justify"}
===========================================
Columns
--------
Note: The tabs to the right include relatively in-depth explanations of the analyses that were run. For summaries of the models, see the Results page.
### Motivation
**Objectives**
Our primary goal was to convey Baltimore’s crime data in a way that is helpful to the Baltimore Police Department. Our secondary goals were:
* To predict what type of crime occurred (violent or nonviolent)
* To predict how many crimes will occur in a given day
* To create visualizations that present interesting, informative findings
**Motivations**
Predicting and analyzing crime is a field that has been around for a long time, and for good reason. Identifying the motivations and situations that lead to a crime occurring could make for a safer, more secure world. The "routine activity approach" is a theory that emphasizes the circumstances in which a crime occurs, rather than the traits of the perpetrator, to analyze crime trends (Cohen & Felson, 1979).
There is evidence that property crimes are driven by pleasant weather, which is consistent with this approach (Hipp et al., 2004). Property crimes are nonviolent crimes that involve the theft or destruction of someone else’s property.
There is also empirical evidence that crime rates increase in hotter years, and that crime is more prevalent in hotter parts of the year. Furthermore, the effect of temperature is stronger on violent crimes than it is on nonviolent crimes (Anderson, 1987). Conversely, there is a strong positive effect of unemployment on property crimes, while the evidence is much weaker for violent crimes (Raphael & Ebmer, 2001).
For this reason, crimes in the dataset were divided into two broad categories: violent and nonviolent. Predicting which category a single crime falls into provides insight into which factors are the most influential on that type of crime occurring.
Predicting the amount of daily crime can be useful for many reasons, the most important of which is staffing. By taking into account the weather, the current unemployment rate, or last week's numbers, the Baltimore Police Department might be able to use their officers’ time more effectively by staffing only the officers they need. It may also reveal which factors are the most important.
Interesting data visualizations make it easier to convey complicated trends and patterns in a straightforward fashion. Clear, easily understood graphics are especially important given the primary audience for these findings, the BPD.
**Sources:**
* Anderson, C. A. (1987). Temperature and aggression: Effects on quarterly, yearly, and city rates of violent and nonviolent crime. Journal of personality and social psychology, 52(6), 1161.
* Cohen, L. E., & Felson, M. (1979). Social change and crime rate trends: A routine activity approach. American sociological review, 588-608.
* Hipp, J. R., Curran, P. J., Bollen, K. A., & Bauer, D. J. (2004). Crimes of opportunity or crimes of emotion? Testing two explanations of seasonal change in crime. Social Forces, 82(4), 1333-1372.
* Newberg, L. (2005). [Some Useful Statistics Definitions.](https://www.cs.rpi.edu/~leen/misc-publications/SomeStatDefs.html)
* Raphael, S., & Winter-Ebmer, R. (2001). Identifying the effect of unemployment on crime. The Journal of Law and Economics, 44(1), 259-283.
Column {.tabset .tabset-fade}
-----------------------------
### Logistic Regression
**Building Model 1:**
* We initially formed a logistic regression model including all of our predictors (except for Hour, which violated the multicollinearity assumption) to determine which were insignificant. The output showed that all were significant except for the predictors Snow and Snow Depth.
* We then split the data 80:20 into training and testing data. We ran logistic regression to form a model based on the training data set, not including Hour, Snow, and Snow Depth.
* Then we used the model to predict values based on our testing data. Our cutoff probability for a positive value, or a violent crime occurring, was 50%.
* Using the output probabilities, we considered a probability greater than or equal to 50% to be a "1" (violent crime occurred) and a probability less than 50% to be a "0" (nonviolent crime occurred). A minimal sketch of this workflow is shown after this list.
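The chunk below is a rough sketch of this split-fit-predict workflow, not the exact code used in the analysis; predictor names such as `Month`, `Day`, `Precipitation`, and `Temperature` are placeholders for the corresponding columns in `balt_clean`.
```{r, eval=FALSE}
# Sketch only: predictor names other than Violent, District, DayPart, and
# Unemployment are placeholders for the actual column names.
set.seed(123)
idx   <- sample(nrow(balt_clean), size = 0.8 * nrow(balt_clean)) # 80:20 split
train <- balt_clean[idx, ]
test  <- balt_clean[-idx, ]

# Logistic regression on the training data, leaving out Hour, Snow, and Snow Depth
fit1 <- glm(Violent ~ Month + Day + District + DayPart +
              Precipitation + Temperature + Unemployment,
            data = train, family = binomial)
# car::vif(fit1) # a multicollinearity check like the one that led us to drop Hour

# Predicted probabilities on the testing data; 0.5 cutoff for a "violent" call
probs <- predict(fit1, newdata = test, type = "response")
pred  <- ifelse(probs >= 0.5, 1, 0)
table(Predicted = pred, Actual = test$Violent) # confusion matrix
```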
**Building Model 2:**
* We continued with our logistic regression model evaluation, but this time used an automated variable selection process to further tune the variables within our model.
* After running stepwise regression, the variable Day was removed in addition to the variables already excluded from the first model.
* Following the process of our first model, we then split the data 80:20 into training and testing data and ran logistic regression on the training data.
* We then predicted values using testing data, which output probabilities that a violent crime would occur.
* To improve on our first model, we lowered the threshold for a positive result, or a "1", from 50% to 35%, so that the model is more likely to flag a potential violent crime. A sketch of the selection step is shown after this list.
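For illustration, the automated selection step could look like the sketch below (again with placeholder predictor names); `step()` drops terms by AIC, which in our case removed Day.
```{r, eval=FALSE}
# Sketch only: stepwise selection for Model 2, starting from the full model
full <- glm(Violent ~ Year + Month + Day + District + DayPart +
              Precipitation + Temperature + Unemployment,
            data = train, family = binomial)
fit2 <- step(full, direction = "both", trace = FALSE) # stepwise selection by AIC

# Lower the positive-class cutoff from 0.50 to 0.35 to increase sensitivity
probs2 <- predict(fit2, newdata = test, type = "response")
pred2  <- ifelse(probs2 >= 0.35, 1, 0)
table(Predicted = pred2, Actual = test$Violent)
```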
See the exact metrics of both confusion matrix outputs below.
**Model Comparison**
* From model 1 to model 2, the threshold of a positive output was lowered to increase sensitivity rate (thus decreasing type II error) since we want to be as accurate as possible when it comes to predicting if a violent crime will occur or not.
* Overall, the second model is a better fit for our goal of predicting whether a violent crime will occur, even though its accuracy rate is lower than that of our initial model.
Metrics| Model 1 | Model 2
------------- | ------------- | -------------
**AUC** | 0.58 | 0.59
**Accuracy** | 70% | 65%
**Specificity Rate** | 99.93% | 79%
**Type I Error** | 0.07% | 20%
**Sensitivity Rate** | 0.1% | 31%
**Type II Error** | 99.9% | 69%
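For reference, the rates in this table come from each model's confusion matrix. A minimal sketch of the calculation, assuming a 2x2 table `cm` like the one produced by `table(Predicted = pred, Actual = test$Violent)` above:
```{r, eval=FALSE}
cm <- table(Predicted = pred, Actual = test$Violent)
sensitivity <- cm["1", "1"] / sum(cm[, "1"]) # true positive rate; Type II error = 1 - sensitivity
specificity <- cm["0", "0"] / sum(cm[, "0"]) # true negative rate; Type I error = 1 - specificity
accuracy    <- sum(diag(cm)) / sum(cm)
```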
### Tree Models
**Initial Model**
* We split the dataset in half to create a training dataset for building the model and a testing dataset for evaluating it.
* The temperature average is the most important variable.
* For a day with an average temperature below 46.5 °F, the next most significant variable was unemployment, followed by year and snow depth.
* For a day with an average temperature above 46.5 °F, Month was the next most significant variable, followed by unemployment.
**After Bagging**
* For bagging, all of the predictors are considered at each split of each bootstrapped tree.
* We generated 2000 trees in total.
* Because it is hard to interpret a large number of trees, we obtained an overall summary of the importance of each predictor using the RSS.
* Temperature average is the most important predictor, followed by unemployment, month, and year.
**After Random Forest**
* When using a random forest to build a regression tree, the usual choice is to consider p/3 predictors at each split. In this case that would be 2.33, so we tried both 2 and 3; using 2 gave a better test MSE, so we went with 2. We again generated 2000 trees.
**After Boosting**
* Boosting is another way to improve the predictive performance of tree-based methods on test data by building trees sequentially, where each tree is built using the information provided by the previous trees.
* We also generated 2000 trees. A sketch of the bagging, random forest, and boosting fits is shown after this list.
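A minimal sketch of the three ensemble fits, assuming a hypothetical data frame of daily crime counts split into `daily_train` and `daily_test` with a `TotalCrime` response (not the exact code used here):
```{r, eval=FALSE}
# Sketch only: daily_train / daily_test and TotalCrime are assumed names
library(randomForest)
library(gbm)

# Bagging: all 7 predictors considered at each split (mtry = p)
bag_fit <- randomForest(TotalCrime ~ ., data = daily_train,
                        mtry = 7, ntree = 2000, importance = TRUE)

# Random forest: only 2 predictors considered at each split (about p/3)
rf_fit <- randomForest(TotalCrime ~ ., data = daily_train,
                       mtry = 2, ntree = 2000, importance = TRUE)

# Boosting: 2000 trees built sequentially
boost_fit <- gbm(TotalCrime ~ ., data = daily_train,
                 distribution = "gaussian", n.trees = 2000)

# Test MSE for the random forest fit
rf_pred <- predict(rf_fit, newdata = daily_test)
mean((rf_pred - daily_test$TotalCrime)^2)
```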
**Model Comparison**
The best model is boosting, since it has the lowest test MSE.
Model | Test MSE | % Variance Explained
------------- | ------------- | -------------
**Initial** | 22.93837 |
**Bagging** | 18.83332 | 28.98%
**Random Forest** | 17.33683 | 33.77%
**Boosting** | 16.98436 |
### Linear Regression
**Model Building**
* The primary goal of running linear regression was to get an overall sense of our data.
* We used the `lm()` function to create a model with the chosen variables; a sketch of the fit is shown after the equation below.
* The significant variables were 'Month', 'Temperature', and 'Precipitation'.
* The Adjusted R-squared was 0.3041.
* The final model:
$$\text{Total Crime} = 1506.4027 - 0.7054\,\text{Year} + 1.3848\,\text{Month} + 0.0863\,\text{Day} \\ + 0.5819\,\text{Temperature} - 6.8273\,\text{Precipitation} + 1.0690\,\text{Unemployment}$$
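A minimal sketch of this fit, using the same hypothetical daily-count data frame (here `daily`) and placeholder column names as above:
```{r, eval=FALSE}
# Sketch only: daily and its column names are assumed, not the exact code used
lm_fit <- lm(TotalCrime ~ Year + Month + Day + Temperature +
               Precipitation + Unemployment, data = daily)
summary(lm_fit) # coefficients as in the equation above; adjusted R-squared 0.3041
```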
### Time Series
**Model Building**
* There is a clear overall trend and seasonal trend which must be removed to satisfy the seasonality assumption.
* The periodogram indicates that the data should be smoothed. A Daniell kernel with L=15 was used to determine the period. On the smoothed periodogram, there are spikes at around 0.15 and 0.29, indicating a frequency of about 0.15 and a period of about 6.67, so m=7.
* To remove the trends, the first difference was taken, in addition to the seasonal difference with a period of 7.
* From there, several possible parameter values for a SARIMA model were chosen from the ACF and PACF plots.
* For the nonseasonal AR component, note that the PACF is significant for the first few lags. This indicates that p = 1, 2, 3, or 4.
* For the nonseasonal MA component, note that there is a decay present in the first few lags of the PACF plot, and that the ACF is significant at lags 1 and 3. Lag 3 might be a false positive, or it might be genuinely significant, so q = 1 or 3.
* For the seasonal AR component, note that the ACF has a decay at lag 7, and the PACF is significant at several multiples of 7, so P = 1, 2, 3, 4, or 5.
* For the seasonal MA component, note that there is a decay present after each multiple of 7 in the PACF plot, and that the ACF is significant at lag 7, so Q = 1.
* Of the many models tried, only two had significant higher-order coefficients as well as satisfactory diagnostic plots.
* Model 1: SARIMA(1,1,1)x(0,1,1)7
* Model 2: SARIMA(0,1,3)x(0,1,1)7
* Both models were fitted to the data through Dec. 31, 2019, and then used to forecast the following 31 days. A sketch of the smoothing, fitting, and forecasting steps is shown after this list.
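A minimal sketch of those steps, assuming a hypothetical daily series of total crime counts (not the exact code used in the analysis):
```{r, eval=FALSE}
# Sketch only: daily and TotalCrime are assumed names for the daily-count data
daily_ts <- ts(daily$TotalCrime, frequency = 7)

# Smoothed periodogram with a Daniell kernel (L = 15) to identify the period
spec.pgram(daily_ts, kernel = kernel("daniell", 15), taper = 0)

# Model 1: SARIMA(1,1,1)x(0,1,1)[7]
fit_m1 <- arima(daily_ts, order = c(1, 1, 1),
                seasonal = list(order = c(0, 1, 1), period = 7))

# Model 2: SARIMA(0,1,3)x(0,1,1)[7]
fit_m2 <- arima(daily_ts, order = c(0, 1, 3),
                seasonal = list(order = c(0, 1, 1), period = 7))

# Forecast 31 days (January 2020) with approximate 95% prediction bounds
fc    <- predict(fit_m2, n.ahead = 31)
upper <- fc$pred + 1.96 * fc$se
lower <- fc$pred - 1.96 * fc$se
```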
**Model Comparison**
Both models had prediction intervals that contained all of the real values for January 2020, and their accuracy metrics are nearly identical. The AIC and AICc select model 2, while the BIC and the point-forecast error metrics slightly favor model 1; on balance, model 2 was judged the better fit.
Metrics| Model 1 | Model 2
------------- | ------------- | -------------
**AIC** | 8.449309 | 8.448679
**AICc** | 8.449316 | 8.448692
**BIC** | 8.461382 | 8.463771
**MAE** | 12.21906 | 12.22173
**RMSE** | 14.5193 | 14.53334
**ME** | 0.4678299 | -0.7078554
**MPE** | -2.50532 | -2.73881
**MAPE** | 12.08573 | 12.11812
Results {data-icon="fas fa-clone"}
===================================
Column
-------
### Summary
Our analysis focused on two types of predictive modeling: predicting the category of the crime, and predicting the total number of crimes. We summarize the results of these methods here.
For more detail on the methodology of these methods, see the tabs on the Analysis Page.
**Predicting Violent vs. Nonviolent Crime** <br>
Crimes in the dataset were divided into two broad categories: *violent* and *nonviolent*. Predicting which category a single crime falls into provides insight into which factors are the most influential on that type of crime occurring. To do this we used logistic regression.
*Logistic Regression* <br>
After checking that logistic regression's assumptions were met, we conducted two versions of logistic regression. The second version was created to improve upon the first by balancing the specificity and sensitivity rates, as well as to make the model more interpretable. Our second version significantly improved upon our first model, and we were able to determine that the most significant predictors of whether a violent crime will occur are the month, district, precipitation, average temperature, unemployment rate, and part of day. Lastly, we would like to improve upon these models by finding additional datasets with predictors that may have a greater influence on whether a crime will occur, so that the models could predict violent crime occurrences more accurately and give the Baltimore PD a better understanding of how to allocate its resources.
**Predicting Daily Crime Count** <br>
Predicting the amount of daily crime can be useful for many reasons, the most important of which is staffing. By taking into account the weather, the current unemployment rate, or last week's numbers, the Baltimore Police Department might be able to use their officers' time more effectively by staffing only the officers they need. It may also reveal which factors are the most important. To do this we used linear regression, tree-based models, and time series predictive modeling.
*Decision Tree* <br>
For the decision tree, we first built an initial tree. This tree is visually easy to read but is very vulnerable to changes in the dataset: every time we draw a new training dataset, the tree looks different. To overcome this, we borrowed the idea of bootstrapping (random forest and bagging). For each method, 2000 trees were created and their predictions averaged. The methods differ in how variables are chosen for each split: bagging considers all of them, while random forest only considers a subset. In our case, a random forest using only two predictors at each split performed significantly better. The last method we used was boosting, which is similar to bagging and random forests except that the trees are built sequentially using information provided by previous trees. Boosting gave us the best model performance. After the analysis, we could confirm that the average temperature was the most significant variable in determining the total number of crimes each day. The next most important variables were unemployment rate and month, followed by precipitation and year.
*Time Series* <br>
Time series models use prior values of a quantity to forecast future ones, often taking into account repetitive cycles in the data, referred to as seasonality. In order to perform time series analysis, the overall trend and seasonality must be removed. This was done by subtracting the prior day's value from the current value, which is called taking the first difference. To remove the seasonality, in this case a weekly cycle, the value from a week ago is subtracted as well. This data was then used to successfully forecast values into the month of January. This tells us that the weekly cycle of crime is important – that is, we can successfully use past values to predict future ones with no further information.
*Linear Regression* <br>
See tab to the right.
Columns {.tabset .tabset-fade}
--------------------------------
### Logistic Regression
**Key Variables**
* Month
* District
* Precipitation
* Average temperature
* Unemployment
* Part of day
**Interpretation**
* In the end, the most significant variables for predicting whether or not a violent crime will occur in Baltimore are month, district, precipitation, average temperature, unemployment, and part of day.
**Future Improvements**
* We would run more iterations of the model to try to further balance sensitivity, specificity and accuracy.
* If we were to have more time, we would try to search for more datasets that include more variables regarding human behavior that could have a correlation with crime, such as:
* Additional indicators of economic condition, such as poverty level, cost of living, or consumer price index.
* Information by district, such as literacy rates, socioeconomic levels, demographics (ethnic and racial makeup, age composition, gender composition).
* Information by neighborhood, such as population density, degree of urbanization, median income, youth concentration, and crime reporting practices of the residents
* Information about the offender provided by the BPD, such as age or gender
* Whether the crime occurred during a holiday, festival, and/or school vacation period.
### Tree Models
**Variable Importance Ranked**
1. Average temperature
2. Unemployment
3. Month
4. Precipitation
5. Year
6. Snow Depth
7. Snow
**Interpretation**
* The average temperature of the day is the most important variable in predicting total number of crimes, and unemployment is a close second.
* The next important variable turns out to be Month, followed by precipitation.
* Year, Snow, and Snow depth are relatively unimportant.
**Future Improvements**
If we were to have more time, we would try to search for more datasets that include more variables regarding human behavior that could have a correlation with crime, such as:
* Additional indicators of economic condition, such as:
* Poverty level
* Cost of living
* Consumer price index.
* Information by district, such as:
* Literacy rates
* Socioeconomic levels
* Demographics (ethnic and racial makeup, age composition, gender composition).
* Information by neighborhood, such as:
* Population density
* Degree of urbanization
* Median income
* Youth concentration
* Crime reporting practices of the residents
* Information about the offender provided by the BPD, such as age or gender
* Whether the crime occurred during a holiday, festival, and/or school vacation period.
### Linear Regression
**Key Variables**
* Month
* Precipitation
* Average temperature
**Interpretation**
* The most significant predictors of the daily total crime count are the month, precipitation, and average temperature.
* The Adjusted R-squared indicates that the model does not do a good job of predicting the outcome, but this is expected, as it is hard to predict people's behavior with only a few variables.
**Future Improvements**
If we were to have more time, we would try to search for more datasets that include more variables regarding human behavior that could have a correlation with crime, such as:
* Additional indicators of economic condition, such as:
* Poverty level
* Cost of living
* Consumer price index.
* Information by district, such as:
* Literacy rates
* Socioeconomic levels
* Demographics (ethnic and racial makeup, age composition, gender composition).
* Information by neighborhood, such as:
* Population density
* Degree of urbanization
* Median income
* Youth concentration
* Crime reporting practices of the residents
* Information about the offender provided by the BPD, such as age or gender
* Whether the crime occurred during a holiday, festival, and/or school vacation period.
### Time Series
**Key Trends**
* Overall yearly trend
* Weekly cycle
**Interpretation**
* The weekly cycle of crime is important – that is, we can successfully use past values to predict future ones with no further information.
**Future Improvements**
* Both models performed reasonably well; however, a more accurate model might exist. In particular, lowering the standard error would provide narrower prediction intervals.
* Additional algorithms might yield more insight, such as:
* Holt-Winters exponential smoothing is used for univariate time series with trend and seasonal components, which makes it a good fit since the daily crime count is a single (scalar) series.
* Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX) is an extension of SARIMA modeling, and includes exogenous variables.
* Another possibility is the use of multiple seasonal trends. No viable models with a lag of 365 were found, but it might be that the weekly trend was obscuring the yearly trend. Modeling with both the yearly and weekly trends might produce better results.
Map Visualizations {data-orientation=columns data-icon="fas fa-map"}
====================================================================
Column {.tabset .tabset-fade data-width=600}
----------------------------------------------
### Violent Crime in the Northeast District
```{r,echo=FALSE, warning=FALSE}
library(leaflet)
com_case <- balt_clean[complete.cases(balt_clean),] #subsetting data to rows without na
ne <- com_case %>% # obtaining the mean location for neighborhoods in the northeast district with violent crimes
filter(District == "NORTHEAST" & Violent == 1) %>%
group_by(Neighborhood) %>%
summarise(count = n(), meanl = mean(Longitude), meanla = mean(Latitude))
ne <- ne[complete.cases(ne),] # keeping only rows without NAs
leaflet() %>% addTiles() %>% #plotting location using leaflet
addCircles(data=ne, ~meanl, ~meanla, radius = ~count, popup=paste("Neighborhood:", ne$Neighborhood, "<br>",
"Violent Crime Count:",ne$count), color="blue")
```
### 2019 Crime & the Top 5 Dangerous Neighborhoods of 2020
```{r, echo=FALSE, warning=FALSE}
lab <- data.frame(read.csv("neighborhood_balt.csv")) #loading in top 5 neighborhood data of 2020
lab$location <- as.character(lab$location) # changing class
lab$rank <- as.character(lab$rank) # changing class
lab$labels <- paste(lab$rank, lab$location, sep="-") # creating new label for visualization
oneyear <- balt_clean %>% filter(Year == 2019) %>% #obtaining violent and nonviolent crimes of 2019
mutate(Type=(Violent==1)) %>%
mutate(Type= replace(Type, Violent==1, "Violent")) %>%
mutate(Type = replace(Type, Nonviolent==1, "Non Violent"))
oneyear <- oneyear[complete.cases(oneyear),] # extracting rows with no na's
map1 <- ggmap(get_map("Baltimore", zoom=12)) + #plotting using ggmap
stat_density2d(data=oneyear,
aes(x=Longitude, y=Latitude, fill=..level.., alpha=..level..),
bins=30,
geom="polygon") +
geom_density2d(data = oneyear,
aes(x = Longitude, y = Latitude), size = 0.3)+
geom_point(aes(x=longitude, y=latitude),
data=lab,
size=2,
color="red") +
geom_label_repel(aes(longitude, latitude, label=labels),
hjust=0,
nudge_x = 0.1,
direction="y",
data=lab,
size=1.5,
family="Times") +
facet_wrap(~Type) +
ggtitle("Crime in 2019") +
theme(plot.title=element_text(hjust=0.5)) +
guides(alpha=FALSE)
map1 #prints ggmap visualization
```
Column {.tabset .tabset-fade data-width=400}
---------------------------------------------
### Explanation for Visualization 1
<br>
**Description** <br>
From the analysis, the *Northeast* district had the highest number of violent crimes. The visualization to the left depicts the violent crime count for each neighborhood in the northeast district, with the size of each circle proportional to that neighborhood's violent crime count.
<br>
**Take Away** <br>
The neighborhoods with the highest violent crime counts are shown in the table below.
```{r, echo=FALSE}
table <- data.frame("Neighborhood" = c("Frankford", "Belair-Edison",
                                       "Coldstream Homestead"),
                    "Violent Crime Count" = c(1639, 1600, 1044),
                    check.names = FALSE) # keep the readable column name for display
library(knitr)
kable(table)
```
**How to Use** <br>
To view a neighborhood and its violent crime count on the map, click on its circle.
Zoom in and out of the map to click on the smaller circles.
Drag the map at the desired zoom level to view the violent crime counts of the surrounding neighborhoods.
<br>
**What Can Be Done in the Future?** <br>
Further analyses could be performed to look into why these neighborhoods have the highest number of violent crimes for the northeast region of Baltimore.
### Explanation for Visualization 2
<br>
**Description** <br>
The visualizations to the left depict density maps of violent and nonviolent crime across all of Baltimore in 2019, along with labels for the top 5 most dangerous neighborhoods of Baltimore in 2020. The article from which the neighborhood data was extracted is available [here.](https://www.roadsnacks.net/these-are-the-10-worst-baltimore-neighborhoods/)
From the start, one main goal of this analysis was to predict crime using the most relevant information available. Comparing crime levels in 2019 with the top 5 dangerous neighborhoods of 2020 gives insight into how well the data might predict crime, and also reveals how crime trends can shift within the span of a single year.
<br>
**Take Away** <br>
From the graphs, one can see that the volume of violent crimes exceeded the volume of nonviolent crimes in 2019. Additionally, most crime, whether violent or nonviolent, occurred near the central district of Baltimore. This conclusion differs slightly from the analysis result that the northeast district had the most violent crimes over the years 2015-2019. One reason for the difference could be the natural shift of crime trends from year to year.
Three of the five top dangerous neighborhoods of 2020 lie within the lighter colored areas which represent the higher crime counts. Dundalk and the Fairfield Area, the top two most dangerous neighborhoods, do not overlap with any of the data in this dataframe, which is surprising. Keep in mind, the location of these two neighborhoods could cover a wider area than represented on the map.
<br>
**What Can Be Done in the Future?** <br>
Further analyses could be performed to map all of the years from 2015 to 2019 to see how crime trends have changed.