---
title: "Course Project - Practical Machine Learning"
author: "Vinicio De Sola, MFE"
date: "May 10, 2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(caret)
library(elasticnet)
library(rpart.plot)
```
# Classifying the quality of the Unilateral Dumbbell Biceps Curl from wearable accelerometer data
## Executive Summary
In this project we use accelerometer data provided by [Groupware@LES][1] to classify how well the Unilateral Dumbbell Biceps Curl is performed, on an A to E scale. We split the provided training set 60/40 into a training and a validation set, and tune each model with 10-fold cross-validation. We fit a linear SVM, a boosted decision tree, and a random forest classifier, and keep the best one for the final predictions.
[1]: http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises
## Data
The data is already divided into a [training][2] and a [test][3] set. The outcome variable, `classe`, has five levels:

- **A**: Exactly according to the specification
- **B**: Throwing the elbows to the front
- **C**: Lifting the dumbbell only halfway
- **D**: Lowering the dumbbell only halfway
- **E**: Throwing the hips to the front

Only class A corresponds to a correctly executed exercise.
[2]: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
[3]: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
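The code below assumes both CSV files are already in the working directory. If they are not, here is a minimal download sketch using the URLs above (the `trainURL` and `testURL` names are only illustrative):

```{r download}
## Download the raw CSV files once, if they are not already present
trainURL <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
testURL  <- 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
if (!file.exists('pml-training.csv')) download.file(trainURL, 'pml-training.csv')
if (!file.exists('pml-testing.csv'))  download.file(testURL, 'pml-testing.csv')
```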
## Data cleaning and subsampling
We start by cleaning the data. First we load the two CSV files, turning all non-standard entries (`NA`, `#DIV/0!`, and empty strings) into `NA`. Then we keep only the variables with less than 5% missing values: although we would like to retain as much information as possible, columns that are almost entirely `NA` add more noise than signal, and the training set is large enough to afford dropping them. We also remove the first 7 variables (the row id, the user name, the timestamps, and the measurement windows), since they are not sensor measurements.
```{r data}
## Load the data
## All non-standard entries ('NA', '#DIV/0!', empty strings) become NA
## Set the seed as 1245 for reproducibility
set.seed(1245)
training <- read.csv('pml-training.csv', na.strings = c('NA', '#DIV/0!', ""))
testing  <- read.csv('pml-testing.csv',  na.strings = c('NA', '#DIV/0!', ""))
## Make sure the outcome is a factor (newer R versions no longer convert strings)
training$classe <- as.factor(training$classe)

## Data cleaning
## Keep only variables with less than 5% missing values,
## then drop the first 7 bookkeeping columns
training <- training %>% select_if(colSums(is.na(.)) < 0.05 * nrow(.)) %>%
  .[, -c(1:7)]
testing  <- testing  %>% select_if(colSums(is.na(.)) < 0.05 * nrow(.)) %>%
  .[, -c(1:7)]

## Divide the training set into a training and a validation split (60/40)
inTrain    <- createDataPartition(training$classe, p = 0.6, list = FALSE)
newTrain   <- training[inTrain, ]
validation <- training[-inTrain, ]

## Number of observations and variables to work with
dim(newTrain)
dim(validation)
dim(testing)

## Look at the distribution of the classes
plot(newTrain$classe, col = 'green', main = 'Distribution of classe',
     xlab = 'Class level', ylab = 'Frequency')
```
After cleaning we are left with 53 variables. The training split has 11776 observations and the validation split 7846, and the test set keeps the same columns (with `problem_id` in place of `classe`). The dependent variable looks reasonably balanced across the five classes.
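As a quick numeric check of that balance, a small sketch using the `newTrain` split created above:

```{r classBalance}
## Proportion of each classe level in the training split
round(prop.table(table(newTrain$classe)), 3)
```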
## Modeling
### SVM
Let's start with an SVM classifier with a linear kernel. We'll use 10-fold cross-validation for every model, since k = 10 is a reasonable middle ground for the bias-variance trade-off. We also center and scale the data, because SVMs are sensitive to the scale of the features. The cross-validation results over the tuning grid for the cost parameter C are plotted below.
```{r SVM, warning=FALSE, cache=TRUE}
set.seed(1245)
## 10-fold cross-validation, reused for all models
trControl <- trainControl(method = 'cv', number = 10)
## Tuning grid for the cost parameter (C must be strictly positive)
grid <- expand.grid(C = c(0.1, 1, 5, 10))
modelSVM <- train(classe ~ ., data = newTrain, method = 'svmLinear',
                  trControl = trControl, preProcess = c('center', 'scale'),
                  tuneGrid = grid)
plot(modelSVM, main = 'Cross-Validation SVM')
## Accuracy on the validation split
predicSVM <- predict(modelSVM, validation, type = 'raw')
confusionMatrix(predicSVM, validation$classe)
```
The validation accuracy is 0.7981, and the cross-validation curve is still rising at the upper end of the grid, suggesting that even larger values of C could help.
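One natural follow-up, sketched but not evaluated here to keep the knitting time reasonable, would be to widen the cost grid; the `widerGrid` and `modelSVMwide` names are only illustrative:

```{r SVMwider, eval=FALSE}
## Hypothetical follow-up: extend the cost grid beyond C = 10
widerGrid <- expand.grid(C = c(10, 25, 50, 100))
modelSVMwide <- train(classe ~ ., data = newTrain, method = 'svmLinear',
                      trControl = trControl, preProcess = c('center', 'scale'),
                      tuneGrid = widerGrid)
plot(modelSVMwide, main = 'Cross-Validation SVM (wider grid)')
```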
### Boosted Decision Tree
We use gradient boosting (caret method `gbm`) to improve on a single decision tree. As before, we apply 10-fold cross-validation, center and scale the predictors, and plot the cross-validation results used to select the hyper-parameters.
```{r GBM, warning=FALSE, cache=TRUE}
set.seed(1245)
trControl <- trainControl(method = 'cv', number = 10,
                          classProbs = TRUE, savePredictions = TRUE)
modelGBM <- train(classe ~ ., data = newTrain, method = 'gbm',
                  preProcess = c('center', 'scale'),
                  trControl = trControl, verbose = FALSE)
predicGBM <- predict(modelGBM, validation, type='raw')
plot(modelGBM, main='Cross-Validation GBM')
## Accuracy
confusionMatrix(predicGBM, validation$classe)
```
With boosting, the validation accuracy jumps to 0.9593, and the cross-validation plot shows how caret selects the tree depth and the number of boosting iterations.
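To see exactly which combination was chosen, we can inspect the fitted `modelGBM` object directly (a small sketch using caret's stored tuning results):

```{r GBMbest}
## Hyper-parameter combination selected by 10-fold cross-validation
modelGBM$bestTune
```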
### Random Forest
Again we use 10-fold cross-validation, center and scale the predictors, and plot the cross-validation results used to select the hyper-parameter (the number of randomly selected predictors, `mtry`).
```{r RF, warning=FALSE, cache=TRUE}
set.seed(1245)
trControl <- trainControl(method = 'cv', number = 10,
                          classProbs = TRUE, savePredictions = TRUE)
modelRF <- train(classe ~ ., data = newTrain, method = 'rf',
                 preProcess = c('center', 'scale'),
                 trControl = trControl, verbose = FALSE)
predicRF <- predict(modelRF, validation, type='raw')
plot(modelRF, main='Cross-Validation RF')
## Accuracy
confusionMatrix(predicRF, validation$classe)
```
Random forest gives another jump in accuracy, to 0.9934 on the validation set, so we select it as the final model. The cross-validation plot shows accuracy dropping as the number of randomly selected predictors grows, so the chosen value of `mtry` is the best in the grid. Barring major overfitting, which random forests can be prone to, we expect an out-of-sample accuracy of around 99%: by definition it should be somewhat lower than the training accuracy, but since the 0.9934 was measured on a validation set never used for training, it is a reasonable estimate.
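We can make that estimate explicit by pulling the accuracy out of the validation confusion matrix, a small sketch reusing `predicRF` from the chunk above:

```{r RFoos}
## Estimated out-of-sample error = 1 - validation accuracy
cmRF <- confusionMatrix(predicRF, validation$classe)
1 - as.numeric(cmRF$overall['Accuracy'])
```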
## Final Prediction
Here are the final predictions for the test set, using the random forest model:
```{r prediction}
prediTest <- predict(modelRF,testing)
prediTest
```
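For readability, the predictions can be paired with the test set identifiers, a small sketch assuming the `problem_id` column from `pml-testing.csv` survives the column filtering above (it has no missing values, so it should):

```{r predictionTable}
## Pair each test case identifier with its predicted classe
data.frame(problem_id = testing$problem_id, predicted = prediTest)
```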