---
title: "Exploratory Analysis of Professional MMA Data"
author: "Nicky Sanders"
date: "`r Sys.Date()`"
always_allow_html: true
output:
  html_document: default
  pdf_document: default
header-includes:
  - \usepackage{geometry}
  - \geometry{left=2cm,right=2cm,top=2cm,bottom=2cm}
---
```{r setup, include=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
```{r, echo=FALSE}
library(pacman)
pacman::p_load(readr, here, nortest, corrplot, kernlab, fastDummies, magrittr, forcats, ggfortify, kableExtra, janitor, dplyr, stringr, ggplot2, RColorBrewer, knitr, AssocTests, vcd, tidyverse, DT)
```
## Ask (1)
### Exploring MMA Weight Classes and Fight Outcomes over time
- Topics to explore:
  - When did the rise in MMA begin? What were some galvanizing events for the sport?
  - What weight classes produce the most KO/TKO outcomes?
  - Is there any difference between the Male/Female divisions?
  - How many significant strikes can a fighter expect to absorb before being KO'd again?
  - Does time off between fights affect the chance of being KO'd again?
- What metrics will you use to measure your data to achieve your objective? Who are the stakeholders?
  - Time measured from the night of the KO to the night of the next KO.
  - Significant strikes absorbed from the night of the KO to the night of the next KO.
- Audience:
  - The analysis is for the fighters, coaches, fans, and leaders of Pro MMA organizations.
- How this data will help the fighters and coaches make decisions:
  - Fighters and coaches will be more confident when deciding whether to return to fighting after a knockout.
- Topic of exploration:
  - The evolution of fight outcomes as the sport progresses
  - The after-effects of MMA fighters being KO'd.
## Prepare (2) and Process (3)
### Data Source
- Data sourced from Kaggle [here](https://www.kaggle.com/datasets/danmcinerney/mma-differentials-and-elo)
- The data comes from a reliable source and is original, comprehensive, current, and cited.
### Using Google BigQuery
- The original data contains 530 variables; using Google BigQuery and a SQL script, we engineered a few additional variables.
```sql
## Create a temp table adding sig_strikes_absorbed and total_strikes_absorbed
WITH master_table AS (
  SELECT *,
    (SELECT sig_strikes_landed
     FROM `capstone-project-mma.mma_master.mma_fighter_data` t2
     WHERE t1.opponent = t2.fighter AND t1.fight_url = t2.fight_url) AS sig_strikes_absorbed,
    (SELECT total_strikes_landed
     FROM `capstone-project-mma.mma_master.mma_fighter_data` t2
     WHERE t1.opponent = t2.fighter AND t1.fight_url = t2.fight_url) AS total_strikes_absorbed
  FROM `capstone-project-mma.mma_master.mma_fighter_data` t1
)
## Create 'days_since_last_knockout'
SELECT *
FROM (
  SELECT
    mt.*,
    IF(mt.result = 1,
       0,  ## win (mt.result = 1): days_since_last_knockout = 0 for that observation
       DATE_DIFF(
         mt.date,
         (SELECT MAX(date) FROM master_table t2
          WHERE t2.fighter = mt.fighter AND t2.result = 0 AND t2.date < mt.date),
         DAY)  ## otherwise: days between the current fight (outer query's date) and the
               ## fighter's most recent earlier row with result = 0 (subquery's date).
               ## The subquery filters on result only, not on method.
    ) AS days_since_last_knockout
  FROM master_table mt
)
ORDER BY date, fight_url, fighter
```
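The two strikes-absorbed subqueries amount to a self-join: each fighter's row is matched to the opponent's row from the same fight. Below is a minimal sketch of the same idea in R with dplyr, assuming one row per fighter per fight. The `fights` data frame and its values are made up for illustration and are not rows from the Kaggle data.

```r
library(dplyr)

# Hypothetical toy data: one row per fighter per fight
fights <- data.frame(
  fight_url = c("fight-1", "fight-1", "fight-2", "fight-2"),
  fighter = c("A", "B", "A", "C"),
  opponent = c("B", "A", "C", "A"),
  sig_strikes_landed = c(30, 12, 5, 41)
)

# View each row from the other corner: what `fighter` landed is what
# that fighter's opponent absorbed
opponent_side <- fights %>%
  transmute(fight_url,
            opponent = fighter,
            sig_strikes_absorbed = sig_strikes_landed)

# Attach the opponent's landed strikes to each fighter's row
fights_absorbed <- fights %>%
  left_join(opponent_side, by = c("fight_url", "opponent"))

fights_absorbed
```

Joining on both `fight_url` and `opponent` keeps rematches between the same two fighters from being mixed up, which is the role the `fight_url` condition plays in the SQL.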
### Further organizing in R
- Organizing and trimming data
```{r, echo=FALSE}
mma = read_csv("mma_data_filtered.csv", show_col_types = FALSE)
mma_master_df = as.data.frame(mma)
cols_to_keep = c('date', 'result', 'fighter', 'opponent', 'division', 'method',
'sig_strikes_landed', 'sig_strikes_attempts', 'sig_strikes_absorbed',
'total_strikes_landed', 'total_strikes_attempts', 'total_strikes_absorbed',
'days_since_last_knockout', 'age', 'total_comp_time', 'round', 'time', 'fight_url')
mma_df = mma_master_df[, cols_to_keep]
```
- Engineering additional fields to filter on later
```{r}
# Adding column 'sex' to define whether fighters are male or female
mma_df <- mma_df %>%
mutate(sex = ifelse(grepl("Women's", division), "Female", "Male"))
# Adding column 'weightclass' corresponding to max weight in division
mma_df = mma_df %>%
mutate(weightclass = case_when(
division == "Women's Strawweight" ~ 115,
division == "Women's Flyweight" ~ 125,
division == "Women's Bantamweight" ~ 135,
division == "Women's Featherweight" ~ 145,
division == "Flyweight" ~ 125,
division == "Bantamweight" ~ 135,
division == "Featherweight" ~ 145,
division == "Lightweight" ~ 155,
division == "Welterweight" ~ 170,
division == "Middleweight" ~ 185,
division == "Light Heavyweight" ~ 205,
division == "Heavyweight" ~ 265,
TRUE ~ 0
))
# `unique(mma_df$method)` returns all different outcomes of a fight:
# ("KO/TKO" "SUB" "DRAW" "U-DEC" "S-DEC" "M-DEC" "DQ")
# We combine all three types of decisions into one "DEC" category in 'method_new' variable
mma_df$method_new = ifelse(mma_df$method %in% c("U-DEC", "S-DEC", "M-DEC"), "DEC", mma_df$method)
```
Adding a 'sex' field makes it easier to split the data into men's and women's fights. The 'division' column is convenient for reading and discussion, but mapping each division to its maximum allowable weight turns a categorical variable into a numeric one and enables numeric analysis; any division not listed in the mapping falls through to 0. We also group the three types of decision outcomes into one category ('DEC'), since that level of granularity is not needed here.
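The division-to-weight mapping can also be written as a named lookup vector, which makes the fallback for unmatched divisions explicit. This is a minimal base R sketch; `weight_limits` and `division_to_weight()` are hypothetical names introduced here, and only a few divisions are listed to keep it short.

```r
# Hypothetical lookup table: division name -> upper weight limit in lbs
weight_limits <- c(
  "Women's Strawweight" = 115,
  "Flyweight"           = 125,
  "Lightweight"         = 155,
  "Heavyweight"         = 265
)

# Map division names to weight limits; divisions missing from the table get 0,
# mirroring the `TRUE ~ 0` fallback in the case_when() call
division_to_weight <- function(division) {
  w <- unname(weight_limits[division])
  w[is.na(w)] <- 0
  w
}

division_to_weight(c("Flyweight", "Catch Weight", "Heavyweight"))
```

Rows that end up with a weight of 0 need to be excluded before any numeric analysis, which is what the later `filter(weightclass != 0)` calls do.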
## Analyze (4) and Share (5)
## Why the surge in fights in the 2000s?
### Notable events 2000-2005:
- Fighters debuting in 2000:
  - Anderson Silva
  - Tito Ortiz
  - Fabricio Werdum
  - Wanderlei Silva
- Fighters debuting in 2001:
  - B.J. Penn
  - Matt Serra
- Fighters debuting in 2002:
  - Robbie Lawler
- Fighters debuting in 2003:
  - Nick Diaz
- Fighters debuting in 2004:
  - Georges St-Pierre
  - Michael Bisping
  - Nate Diaz
- Historic TUF 1 Finale: Forrest Griffin vs. Stephan Bonnar (2005)
- Fighters debuting in 2005:
  - Forrest Griffin
  - Stephan Bonnar
  - Chael Sonnen
  - Rashad Evans
```{r, echo=FALSE, fig.width=10, fig.height=7}
ggplot(data = mma_df) +
geom_bar(mapping = aes(x = cut(date, "12 months"), fill=method_new)) +
scale_x_discrete(labels = function(x) format(as.Date(x), "%Y")) +
theme(axis.text.x = element_text(angle = 45)) +
scale_y_continuous(limits = c(0, 1220)) +
annotate("text", x = 7, y = 340, label = "Multiple future stars debut:\nAnderson Silva (2000)\nB.J. Penn (2001)\nGeorges St-Pierre & Nate Diaz (2004)", color = "black", size = 3, angle = 0, hjust = 0.5, vjust = 0.5) +
geom_segment(aes(x = 7.2, y = 260, xend = 7, yend = 130), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
geom_segment(aes(x = 7.2, y = 250, xend = 8, yend = 110), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
geom_segment(aes(x = 7.2, y = 250, xend = 10.85, yend = 110), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
annotate("text", x = 10, y = 560, label = "TUF 1 Finale (2005):\nForrest Griffin vs. Stephan Bonnar", color = "black", size = 3, angle = 0, hjust = 0.5, vjust = 0.5) +
geom_segment(aes(x = 10, y = 510, xend = 12, yend = 200), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
annotate("text", x = 15, y = 810, label = "Conor McGregor &\n Jon Jones debut (2008)", color = "black", size = 3, angle = 0, hjust = 0.5, vjust = 0.5) +
geom_segment(aes(x = 15, y = 760, xend = 15, yend = 480), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
annotate("text", x = 27, y = 1220, label = "COVID-19 Pandemic\nBegins", color = "black", size = 3, angle = 0, hjust = 0.5, vjust = 0.5) +
geom_segment(aes(x = 27, y = 1180, xend = 27, yend = 1000), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
labs(title = "MMA Fights by Year and Outcome", x = "Year", y = "Number of Fights", fill = "Fight Outcome")
```
With so many future stars debuting between 2000 and 2008, one can see how the upward trajectory of their careers in the following years coincides with the growing popularity of the sport.
The year 2008 saw the debuts of two of arguably the most popular fighters the sport has ever seen: Conor McGregor and Jon Jones (the GOAT?). As their careers took off en route to multiple titles and title defenses, the number of fights continued to rise steeply. Surely the number of fights will approach a natural limit, right?
### Enter: Professional Female MMA.
```{r, echo=FALSE}
mma_df %>%
filter(sex == "Female") %>%
ggplot() +
geom_bar(mapping = aes(x = cut(date, "12 months"), fill=method_new)) +
scale_x_discrete(labels = function(x) format(as.Date(x), "%Y")) +
theme(axis.text.x = element_text(angle = 45)) +
scale_y_continuous(limits = c(0, 230)) +
annotate("text", x = 2, y = 135, label = "Ronda Rousey 3-Year\nChampionship Reign Begins", color = "black", size = 3, angle = 0, hjust = 0.5, vjust = 0.5) +
geom_segment(aes(x = 1.5, y = 125, xend = 1, yend = 45), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
annotate("text", x = 8, y = 210, label = "COVID-19 Pandemic\nBegins", color = "black", size = 3, angle = 0, hjust = 0.5, vjust = 0.5) +
geom_segment(aes(x = 8, y = 200, xend = 8, yend = 170), color = "black", arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
labs(title = "Female MMA Fights by Year and Outcome", x = "Year", y = "Number of Fights", fill = "Fight Outcome")
```
We can see that women's MMA fights first appear in the data in 2013. The inclusion of women in the sport increased the population of fighters and fans alike, creating demand for more events.
While there are multiple professional MMA leagues, they stage events in a similar fashion, with multiple fights per event. Each event comprises different "cards" (e.g., undercard, preliminary card, main card), each with several fights. The UFC has consistently held more events than any other organization, peaking in 2019 with a staggering 45 events. While there is essentially no off-season in mixed martial arts, there is an assumed hard ceiling of 52 events annually (one per week) and a limited number of weight classes within which fighters compete. Another constraint on the number of annual MMA fights is the existence of other major sporting events (e.g., the Super Bowl, the NBA Finals) that MMA organizations may avoid scheduling against to prevent diluted viewership.
## Men's MMA Analysis
### Fight Outcomes and Their Relationship with Divisions
```{r, echo=FALSE, fig.width=10, fig.height=7}
# looking at male fighter data
mma_df %>%
filter(sex == "Male") %>%
ggplot() +
geom_bar(mapping = aes(x = cut(date, "12 months"), fill=method_new)) +
scale_x_discrete(labels = function(x) format(as.Date(x), "%Y")) +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "Male MMA Fights by Year and Outcome", x = "Year",
y = "Number of Fights", fill = "Fight Outcome")
```
The chart above shows a fairly consistent mix of fight outcomes over the years, but does the overall distribution reflect that of each division? Let's first take a closer look at which divisions we have.
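Because the bars in a stacked count chart differ in height, outcome proportions are hard to compare across years by eye. One way to check them is to compute within-year shares directly. The sketch below uses a made-up six-row data frame (`toy` and its values are hypothetical) and base R only.

```r
# Hypothetical toy data: one row per fight with its year and outcome
toy <- data.frame(
  year   = c(2019, 2019, 2019, 2019, 2020, 2020),
  method = c("DEC", "DEC", "KO/TKO", "SUB", "DEC", "KO/TKO")
)

# Row-wise proportions: each year's outcome shares sum to 1
shares <- prop.table(table(toy$year, toy$method), margin = 1)
round(shares, 2)
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-year shares without the intermediate table.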
#### Proportion of all male fights by weightclass (division)
```{r}
# Super Heavyweight, Catch Weight, and Open Weight divisions represent a small
# portion of male MMA fights
mma_male_df = mma_df %>% filter(sex == "Male")
division_df = data.frame(Proportion = c(round(prop.table(table(mma_male_df$division)) * 100, 2)))
division_df %>%
arrange(Proportion) %>%
kable("pipe") %>%
kable_styling(full_width = FALSE, font_size = 12)
```
A ***'catch weight'*** is used when two fighters agree to fight at a nonstandard weight, for example when one fighter is unable to make weight or when both agree to meet at a weight between two established weight classes. Catch weight bouts make up less than 0.74% of the male MMA data.
Since the early 2000s, the ***Super Heavyweight*** and ***Open Weight*** divisions have not been recognized by most professional MMA organizations. In an Open Weight fight, the fighters can have completely different weights, which would confound a weight-based analysis. We'll filter this division out of our visual analysis.
Since the Super Heavyweight division is similarly no longer active and represents a mere 0.02% of our male MMA data, we'll also exclude it from the next visual analysis.
```{r, fig.width=10, fig.height=7}
# Arranging the divisions by ascending order, with weight increasing
mma_male_df$division = factor(mma_male_df$division,
levels = c("Flyweight", "Bantamweight", "Featherweight",
"Lightweight", "Welterweight", "Middleweight",
"Light Heavyweight", "Heavyweight"))
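# Plot of Men's MMA Fights by Year and Outcome, excluding the three (3) aforementioned divisions.
# Divisions not listed in `levels` became NA above; drop only those rows, since
# na.omit() would also drop any row with an NA in another column.
ggplot(data = mma_male_df %>% filter(!is.na(division))) +
  geom_bar(mapping = aes(x = cut(date, "12 months"), fill=method_new)) +
  facet_wrap(~division, ncol = 4, drop = TRUE) +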
scale_x_discrete(labels = function(x) format(as.Date(x), "%Y")) +
theme(axis.text.x = element_blank()) +
labs(title = "Men's MMA Fight Outcomes Over Time", x = "",
y = "Number of Fights", fill = "Fight Outcome")
```
#### Let's take a look at the KO/TKO rate in each men's division, plotted in ascending weightclass order
```{r, echo=FALSE, fig.width=10, fig.height=7, fig.align='left'}
mma_df %>%
filter(sex == "Male") %>%
filter(!division %in% c("Open Weight", "Super Heavyweight", "Catch Weight")) %>%
select(division, method_new, weightclass) %>%
group_by(division, weightclass) %>%
summarize(pct_ko_tko = (sum(method_new == 'KO/TKO'))/n()) %>%
arrange(pct_ko_tko) %>%
ggplot(aes(x = reorder(division, weightclass), y = pct_ko_tko)) +
geom_bar(stat = "identity", fill = "turquoise3") +
scale_y_continuous(limits = c(0, 1), labels = scales::percent_format()) +
labs(title = "Percentage of Male MMA Fights Won by KO/TKO by Weight Division",
x = "", y = "% of KO/TKO Outcomes")
```
The percentage of fights ending by KO/TKO increases as the divisions get heavier, with the Heavyweight division's KO/TKO rate double that of the Flyweight division. More weight = more power? Perhaps. Let's take a look at the women's divisions next.
## Women's MMA Analysis
### Fight Outcomes and Their Relationship with Divisions
```{r echo=FALSE}
# looking at female fighter data
mma_df %>%
filter(sex == "Female") %>%
ggplot() +
geom_bar(mapping = aes(x = cut(date, "12 months"), fill=method_new)) +
scale_x_discrete(labels = function(x) format(as.Date(x), "%Y")) +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "Female MMA Fights by Year and Outcome", x = "Year", y = "Number of Fights", fill = "Fight Outcome")
```
Compared to the men's plot above, women's fights have proportionally fewer KO/TKO finishes.
The women's weightclasses also mirror the lower range of the men's weightclasses.
### All weightclasses in professional MMA
```{r, echo=FALSE}
weightclass_df = data.frame(division = c("Women's Strawweight", "Women's Flyweight", "Women's Bantamweight", "Women's Featherweight", "Flyweight", "Bantamweight", "Featherweight", "Lightweight", "Welterweight", "Middleweight", "Light Heavyweight", "Heavyweight"), sex = c("Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male"), weight = c(115, 125, 135, 145, 125, 135, 145, 155, 170, 185, 205, 265))
weightclass_df$division = factor(weightclass_df$division, levels = weightclass_df$division[order(weightclass_df$weight)])
ggplot(weightclass_df, aes(x = weight, y = division, color = sex, fill = sex)) +
geom_point(stat = "identity", size = 3, shape = 21) +
geom_segment(x = weightclass_df$weight, xend = weightclass_df$weight, y = weightclass_df$division, yend = weightclass_df$division, size = 1.5) +
scale_x_continuous(limits = c(100, 300), breaks = seq(100, 300, by = 15)) +
labs(x = "Weight (lbs)", y = "", title = "Weight Limits of MMA Divisions") +
theme_minimal() +
theme(panel.spacing.y = unit(0.1, "lines"))
```
Note the four (4) female MMA divisions on the lower end of the x-axis. Female MMA divisions only share a weight
limit with the three (3) lightest male weightclasses. One (1) female weight class (Women's Strawweight) doesn't
even have a male counterpart. Given that the entire female fighting population overlaps with the lightest male
divisions, one might expect fewer KO/TKO finishes in women's divisions if also assuming more weight is associated
with more knockout power.
### Different women's results by weightclass (division)
```{r, echo=FALSE, fig.width=9, fig.height=6.5, fig.align='left'}
# Arranging the women's divisions by ascending order, with weight increasing
mma_female_df = mma_df %>% filter(sex == "Female")
mma_female_df$division = factor(mma_female_df$division,
levels = c("Women's Strawweight", "Women's Flyweight",
"Women's Bantamweight", "Women's Featherweight"))
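# Plot of Women's MMA Fights by Year and Outcome.
# Drop only rows whose division is NA; na.omit() would also drop any row
# with an NA in another column.
ggplot(data = mma_female_df %>% filter(!is.na(division))) +
  geom_bar(mapping = aes(x = cut(date, "12 months"), fill=method_new)) +
  facet_wrap(~division, ncol = 2, drop = TRUE) +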
scale_x_discrete(labels = function(x) format(as.Date(x), "%Y")) +
labs(title = "Women's MMA Fights by Year and Outcome", x = "Year",
y = "Number of Fights", fill = "Fight Outcome")
```
The plots are not as cut-and-dried as we'd like, especially with the women's featherweight division having so few observations. However, as a proportion of total fights, there does seem to be a moderate positive relationship between weightclass and KO/TKO finishes. The lightest weightclass, women's strawweight, has the smallest proportion of KO/TKO finishes relative to total fights.
### KO/TKO rate in each women's division, plotted in ascending weight
```{r, echo=FALSE}
mma_df %>%
filter(sex == "Female") %>%
select(division, method_new, weightclass) %>%
group_by(division, weightclass) %>%
summarize(pct_ko_tko = (sum(method_new == 'KO/TKO'))/n()) %>%
arrange(pct_ko_tko) %>%
ggplot(aes(x = reorder(division, weightclass), y = pct_ko_tko)) +
geom_bar(stat = "identity", fill = "salmon") +
scale_y_continuous(limits = c(0, 1), labels = scales::percent_format()) +
labs(title = "Percentage of Female MMA Fights Won by KO/TKO by Weight Division",
x = "", y = "% of KO/TKO Outcomes")
```
Plotting the percentage of Female MMA Fights Won by KO/TKO by weight division shows this moderately positive correlation more easily.
## Male and Female data
### Looking at correlation between weight of the fighters and likelihood of a knockout
```{r, echo=FALSE}
mma_df %>%
filter(weightclass != 0) %>%
group_by(weightclass) %>%
summarize(pct_ko_tko = mean(method_new == "KO/TKO")) %>%
ggplot() +
geom_point(mapping = aes(x = weightclass, y = pct_ko_tko), size = 3) +
scale_x_continuous(limits = c(100, 300), breaks = seq(100, 300, by = 15)) +
scale_y_continuous(limits = c(0, 1), labels = scales::percent_format()) +
labs(title = "Weight of All Fighters vs. KO/TKO Probability",
x = "Weight of Fighters", y = "% of KO/TKO Outcomes")
```
Combining the male and female data and plotting the weight of fighters vs. the percentage of KO/TKO Outcomes supports our initial thoughts of positive correlation. Let's see how men and women compare.
### Comparing male and female fight outcomes side-by-side
```{r, echo=FALSE}
mma_df_prop <- mma_df %>%
group_by(sex, method_new) %>%
summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
ungroup() %>%
mutate(method_new = reorder(method_new, -n)) # reorder the method_new variable based on frequency
ggplot(mma_df_prop, aes(x = method_new, y = prop, fill = sex)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Male and Female Fight Outcomes", x = "", y = "% of Total Fights", fill = "Gender")
```
The chart above calls attention to the differences in fight outcomes between male and female fights. Submissions, Draws, and Disqualifications are all similarly likely for both genders. The share of fights ending in a Decision is almost 20 percentage points higher among female fighters than among male fighters, and nearly that entire gap is offset by a correspondingly lower share of fights ending by KO/TKO.
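A gap between two shares can be read either as an absolute difference (percentage points) or as a relative difference (percent), and the two can differ a lot. The sketch below uses invented shares purely to show the arithmetic; `dec_share_female` and `dec_share_male` are hypothetical values, not figures from the data.

```r
# Hypothetical shares of fights ending in a decision (not values from the data)
dec_share_female <- 0.65
dec_share_male   <- 0.46

# Absolute gap, in percentage points
pp_gap <- (dec_share_female - dec_share_male) * 100

# Relative gap, in percent of the male share
rel_gap <- (dec_share_female / dec_share_male - 1) * 100

c(percentage_points = pp_gap, relative_percent = rel_gap)
```

A dodged bar chart of shares shows the absolute gap, so percentage points is the natural unit when reading the chart.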
Let's not forget our earlier graphs showing a positive relationship between the weight of a fighter and the probability that the fight ends in a KO/TKO. There are no female fighters above 145lbs, nor are there male fighters below 125lbs. Let's plot the data without the Women's Strawweight division (115lbs) and also remove all Men's weight classes above 145lbs.
```{r, echo=FALSE}
# Only plotting the divisions below (shown in pairs for convenience):
# Flyweight     | Women's Flyweight     | 125lbs
# Bantamweight  | Women's Bantamweight  | 135lbs
# Featherweight | Women's Featherweight | 145lbs
mma_df_prop <- mma_df %>%
filter(division %in% c("Women's Flyweight", "Women's Bantamweight" ,
"Women's Featherweight", "Flyweight", "Bantamweight" , "Featherweight")) %>%
group_by(sex, method_new) %>%
summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
ungroup() %>%
mutate(method_new = reorder(method_new, -n)) # reorder 'method_new' based on frequency
ggplot(mma_df_prop, aes(x = method_new, y = prop, fill = sex)) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Male and Female Fight Outcomes of Same-weight Divisions",
x = "Fight Outcome", y = "% of Total Fights", fill = "Gender")
```
Looking only at the weight classes whose weight limits are shared between genders yields a more similar distribution of fight outcomes. The gap between male and female fighters in fights ending by decision has shrunk in lockstep with the gap in fights ending by KO/TKO. The male distribution of fight outcomes is much more closely mirrored by the female distribution when controlling for the weight of the fighters.
## Cramer's V Correlation and PCA
#### Correlation, Association, or Causation?
```{r, warning=FALSE}
# compute Cramer's V correlation coefficient
mma_filtered_df = mma_df %>% filter(weightclass != 0)
assocstats(table(mma_filtered_df$division, mma_filtered_df$method_new))
```
#### Observations:
1) The results indicate a statistically significant association between the categorical variables 'division' and 'method_new'. The likelihood ratio test and Pearson test both yield large chi-square values (693.59 and 687.04, respectively) with 44 degrees of freedom (DoF) and a p-value of less than 0.001, indicating that the association between the two variables is unlikely to be due to chance (the p-value is computed from the chi-square statistic and the DoF).
2) The contingency coefficient of 0.224 suggests a moderate association between the two variables, and the Cramer's V value of 0.115 suggests a small effect size. Overall, these results suggest that the weight division a fighter belongs to might have some influence on how their fights end.
3) The Pearson chi-square test is used to test the null hypothesis that there is no association between the two variables. Its statistic measures the difference between the observed and expected frequencies in the contingency table.
4) The results show that the association between the two variables, while modest in strength, is unlikely to be due to chance.
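For reference, Cramer's V can be computed by hand from the Pearson chi-square statistic as V = sqrt(chi^2 / (n * k)), where n is the total count and k is the smaller of (rows - 1) and (columns - 1). Below is a minimal base R sketch on two made-up 2 x 2 tables; `cramers_v()` is a hypothetical helper written for illustration, not the function used in the analysis.

```r
# Cramer's V from a contingency table:
# V = sqrt(chi^2 / (n * min(rows - 1, cols - 1)))
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(chi2 / (n * k))
}

# Hypothetical 2 x 2 tables of counts
perfect     <- matrix(c(50, 0, 0, 50), nrow = 2)   # rows fully determine columns
independent <- matrix(c(25, 25, 25, 25), nrow = 2) # no association

cramers_v(perfect)
cramers_v(independent)
```

V ranges from 0 (no association) to 1 (perfect association). In the analysis above, the 44 degrees of freedom correspond to (12 - 1) x (5 - 1), so k = 4 for the division-by-outcome table.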
### Test Assumptions of Principal Component Analysis (PCA)
A common preprocessing step before PCA is to drop one column from each highly correlated pair so that redundant variables do not dominate the leading components. Plotting a correlation matrix helps identify those pairs.
```{r fig.width=6.5, fig.height=4.5, warning=FALSE}
num_cols = sapply(mma_df, is.numeric)
mma_numeric = mma_df[, num_cols]
other_cols = sapply(mma_df, function(x) !is.numeric(x))
other_cols = names(mma_df)[other_cols]
num_cols = colnames(mma_numeric)
mma_df_subset = mma_df[c(num_cols, other_cols)]
correlations = cor(mma_numeric)
# corrplot(correlations, type="full", method="number", tl.cex=0.7) #hiding initial correlation plot
num_cols_drop = c("sig_strikes_attempts", "total_strikes_attempts", "total_strikes_absorbed", "total_strikes_landed", "round")
mma_numeric = mma_numeric %>% select(-one_of(num_cols_drop))
new_corr = cor(mma_numeric)
corrplot(new_corr, type="full", method="number", tl.cex=0.8)
```
Variables are considered highly correlated if they have a correlation coefficient > 0.8. The initial correlation plot showed several pairs of variables with coefficients this high. Using domain knowledge of the data, we dropped the following variables: 'sig_strikes_attempts', 'total_strikes_attempts', 'total_strikes_absorbed', 'total_strikes_landed', and 'round'.
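The pairs above the 0.8 threshold can also be listed programmatically rather than read off the plot. This is a minimal base R sketch on made-up data: `toy` and its columns are hypothetical, and the sketch compares absolute correlations so that strong negative correlations are flagged too (an assumption, since the text only states "> 0.8").

```r
# Hypothetical toy data: y is nearly a multiple of x, z is unrelated to both
toy <- data.frame(
  x = 1:10,
  y = 2 * (1:10) + rep(c(0.1, -0.1), 5),
  z = c(5, 1, 4, 2, 3, 3, 2, 4, 1, 5)
)

cm <- cor(toy)

# Keep each pair once (upper triangle) where |r| exceeds the 0.8 threshold
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

high_pairs
```

Which member of each pair to drop remains a judgment call based on knowledge of the data.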
### Test for Normality with Anderson-Darling test
```{r}
ad_results = NULL
for (col in names(mma_numeric)) {
ad_results = cbind(ad_results, ad.test(mma_numeric[[col]])$p.value)
}
colnames(ad_results) = names(mma_numeric)
# View the p-values
datatable(ad_results)
```
All calculated p-values are below the significance level (0.05), so we reject the null hypothesis that the data are normally distributed.
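The same per-column test can be written more compactly with `sapply()`, which returns a named vector of p-values. The sketch below runs on simulated data, where `toy` and its two columns are hypothetical, and assumes the nortest package is installed.

```r
library(nortest)

set.seed(1)
# Hypothetical columns: one drawn from a normal, one from a skewed distribution
toy <- data.frame(
  normal_col = rnorm(500),
  skewed_col = rexp(500)
)

# One Anderson-Darling p-value per column, returned as a named vector
ad_p <- sapply(toy, function(x) ad.test(x)$p.value)
ad_p
```

With large samples, even small departures from normality produce very small p-values, so it helps to read the test alongside histograms.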
#### Visual evidence that data is not normally distributed with histograms
```{r, fig.align='left', echo=FALSE, fig.width=7, fig.height=6}
num_cols = colnames(mma_numeric)
# Create a 3 by 3 grid of plots
par(mfrow = c(3, 3))
# Loop through each numeric column and create a histogram
for (col in num_cols) {
hist(mma_numeric[[col]], main = col, xlab = col)
}
```
#### Test for Linearity with Scatterplots
```{r, fig.align='left'}
par(mfrow = c(6, 6), mar=c(2,2,2,2))
colors <- rainbow(length(num_cols)) # generate a vector of colors
for (i in 1:length(num_cols)) {
for (j in 1:length(num_cols)) {
if (i != j) {
col <- colors[i] # assign color to the variable
plot(mma_df_subset[[num_cols[i]]], mma_df_subset[[num_cols[j]]],
main = paste(num_cols[i], "vs", num_cols[j]),
xlab = num_cols[i], ylab = num_cols[j],
col = col) # use the color for the markers
}
}
}
```
#### Data is neither normally distributed nor linear
For good measure, histograms of each numeric variable provide visual support for the Anderson-Darling results: the data are NOT normally distributed. Additionally, the scatterplots show that the relationships between variables are not linear. We conclude that Principal Component Analysis is not appropriate for this data, so we move on to explore nonlinear (kernel) PCA.
### Kernel PCA
```{r}
drop_these_cols = c("result", "fight_url", "fighter", "method", "date",
                    "opponent", "round", "time", "weightclass")
mma_df_subset = mma_df_subset %>%
filter(!division %in% c("Open Weight", "Super Heavyweight", "Catch Weight"))
mma_df_test = mma_df_subset %>% select(-one_of(drop_these_cols))
# select categorical columns for one-hot encoding
cat_cols = c("division", "sex", "method_new")
# one-hot encode categorical columns
mma_df_test = fastDummies::dummy_cols(mma_df_test, select_columns = cat_cols)
# select and standardize the numeric columns as input for kernel PCA
mma_numeric = mma_df_test %>%
  select_if(is.numeric) %>%
  scale()
#set.seed(123)
#kpca_fit = kernlab::kpca(mma_numeric, kernel = "rbfdot", kpar = list(sigma = 0.1), features = 3)
#kpca_scores = as.data.frame(predict(kpca_fit, mma_numeric))
```
```{r}
# Visualize Kernel PCA scores with plot of principal components against each other
#pairs(kpca_scores, main = "Kernel PCA")
```
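Kernel PCA builds an n x n kernel matrix, so memory and run time grow quickly with the number of rows. Below is a minimal sketch of fitting it on a random subsample with kernlab. The built-in `iris` data stands in for the scaled numeric matrix, and the subsample size of 100 and two retained components are assumptions made for illustration.

```r
library(kernlab)

set.seed(123)
# Stand-in for the scaled numeric matrix; iris is used only for illustration
x <- scale(as.matrix(iris[, 1:4]))

# The kernel matrix is n x n, so fit on a random subsample of rows
rows <- sample(nrow(x), 100)
kpca_fit <- kpca(x[rows, ], kernel = "rbfdot",
                 kpar = list(sigma = 0.1), features = 2)

# Projections of the sampled rows onto the first two kernel components
scores <- rotated(kpca_fit)
dim(scores)
```

The resulting `scores` matrix can be passed to `pairs()` to plot the components against each other.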
#### Observations:
1) The plot does not show the data to be a good candidate for KPCA
### Further exploration:
- What could explain this general trend of heavier weightclasses seeing more KOs/TKOs proportional to the number of fights?
  - Heavier fighters >> more muscle mass >> more power >> greater chance of knockout?
- Is the number of sig_strikes absorbed correlated with fight outcome?
- Is the number of strikes thrown correlated with weight class? Fight outcome?
- Deeper dive into fighter performance over time
  - Exploring age of fighters and their success rate
  - Exploring total strikes absorbed over time vs. win %
- Other things to consider:
  - As fighters increase in weightclass, how much does their ability to absorb strikes without being KO'd/TKO'd change?
  - Does this trend hold true with female fighters?
  - Is the number of previous fights correlated with probability of KO/TKO?
  - Is the number of previous KOs/TKOs correlated with probability of KO/TKO?
  - Is the number of days since last KO/TKO correlated with probability of KO/TKO?
## Closing Thoughts (6)
### Guiding questions
Based on the analysis, we can conclude that the sport of MMA has grown substantially since its professional inception in 1994. One possible explanation is the incredible talent that debuted in the early 2000s and went on to dazzle soon-to-be MMA fighters and fans alike. Originally a men's sport, MMA realized more of its organic growth potential with the addition of women's divisions in the 2010s. With social media amplifying their reach, sports stars have never been more visible to fans. Where the pre-social media era saw a couple of superstars, the social media era has likely increased the reach of all stars in a sport. This increased reach, together with the star talent in the sport, could explain the rise of MMA we have seen.
While examining the different outcomes a fight can have, we noticed and explored the variance in the proportion of KO/TKO finishes across both weightclass and gender. The data showed a somewhat positive correlation between the weight of the fighter and the likelihood of a KO/TKO finish. Keeping that in mind, we compared Male and Female fight outcomes, which showed Submissions, Draws, and Disqualifications all to have a similar probability of occurring. Decision and KO/TKO were another story. The share of fights ending in a Decision is almost 20 percentage points higher among female fighters than among male fighters, with that difference offset by a lower share of KOs/TKOs. Remembering the somewhat positive correlation between a fighter's weight and the likelihood of a KO/TKO, we controlled for weight and plotted only the weightclasses whose weight limits are shared between men and women. This yielded a much more similar distribution of fight outcomes between the two genders.
Using a Pearson chi-square test, we found a statistically significant association between a fighter's division and the fight's outcome. Large chi-square values and a p-value of less than 0.001 indicate that the association between the two variables is unlikely to be due to chance. However, the Cramer's V coefficient of 0.115 suggests a small effect size. To conclude, the association between the weight division a fighter belongs to and the method of victory is modest in strength, but it is highly unlikely to be due to chance.
When testing the assumptions necessary for a Principal Component Analysis, the data proved to be neither normally distributed nor linear. We then took a nonlinear approach and explored Kernel PCA, for which the data did not appear to be a good candidate either.
### What additional data could be used to expand on our findings?
- The financials of different MMA (UFC) events would help in exploring the correlation between revenue earned and the fights on the card. This could aid in exploring which fight outcomes are most exciting and likely to generate more revenue over time, or perhaps which weightclasses are the most popular and drive the most pay-per-view buys.