---
title: "MovieLens Project Report"
author: "Tang Li"
date: "`r format(Sys.Date(), '%b %d, %Y')`"
output:
  pdf_document:
    latex_engine: lualatex
    toc: TRUE
    df_print: kable
    extra_dependencies: xcolor
mainfont: open-sans
monofont: inconsolata
linkcolor: violet
urlcolor: violet
toccolor: violet
---
```{r setup, include=FALSE}
if (!require(knitr)) install.packages(
"knitr", repos = "http://cran.us.r-project.org"
)
library(knitr)
options(digits = 3)
opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.align = "center",
out.width = "75%"
)
```
```{r fonts, include=FALSE}
if (!require(gfonts)) install.packages(
"gfonts", repos = "http://cran.us.r-project.org"
)
library(gfonts)
if (!dir.exists("fonts")) {
dir.create("fonts")
}
download_font(
id = "open-sans", output_dir = "fonts",
variants = c("regular", "italic", "bold")
)
download_font(
id = "inconsolata", output_dir = "fonts",
variants = c("regular", "italic", "bold")
)
```
--------------------------------------------------------------------------------
Note: All links are colored \textcolor{violet}{violet} without underlines. Feel
free to click them!
# Overview
## About this project
This machine learning project builds a movie recommendation system using the
[MovieLens 10M](#ref-movielens) dataset. The goal is to develop a machine
learning algorithm that predicts users' movie ratings with a root-mean-square
error (rmse) as low as possible.
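
For reference, over $N$ predictions the rmse between predicted ratings
$\hat{y}_i$ and true ratings $y_i$ is defined as (this matches the
`evaluate_rmse()` function defined later in this report):

$$\mathrm{rmse} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$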
Using the [recosystem library](#ref-recosystem), which supports parallel matrix
factorization, my recommendation algorithm predicts the ratings of the held-out
validation set with an rmse of 0.786.
## Describe MovieLens dataset
Note: This section is just a brief overview. I'll go into more detail in the
[Methods](#methods) section.
According to the [MovieLens readme](#ref-movielens), the MovieLens 10M dataset
contains 10,000,054 ratings of 10,681 movies by 71,567 users.
Each row of the dataset contains:
- `rating`: half or whole stars, ranging from 0.5 to 5
- `userId`: identifies a user; unique per user
- `movieId`: identifies a movie; unique per movie
- `title`: title of the movie, with the release year in parentheses
- `genres`: genres of the movie in a pipe-separated list, selected from a
total of [18 genres](#appx-movie-genres)
- `timestamp`: time when the user rated the movie
```{r diagram-dataset, echo=FALSE, out.width="30%"}
include_graphics("plots/dataset.png")
```
Immediately after downloading the MovieLens dataset, the data is split into edx
(90%) and validation (10%) sets.
- the edx set is used in model development
- the validation set is NOT used anywhere except when **testing the final
model**, so it acts as hidden data
To give a preview of the data visualizations to come, here is a histogram
showing the rating distribution of the edx set. We can see that whole-star
ratings are more common than half-star ratings.
```{r plot-edx, echo=FALSE}
include_graphics("plots/edx_hist.png")
```
## Key steps performed {#key-steps}
```{r diagram-steps, echo=FALSE, out.width="50%"}
include_graphics("plots/steps.png")
```
1. download and split the MovieLens dataset into edx and validation sets
2. explore and visualize the edx set
3. develop the **first** model
    - split the edx set into training and test sets
    - using the training set
        - **tune** to find the best tuning opts that achieve the minimum rmse
        - **train** with the best tuning opts
    - using the test set
        - **predict** ratings
        - **evaluate** rmse
> See [Modeling approach section](#modeling-approach) for more explanation.
>
> In recosystem, a model has 2 parts: the best tuning opts and the training
> data. The final model differs from the first model by **using the edx set
> instead of the training set**, because it has the full data.
>
> First model: training set + best tuning opts
>
> Final model: edx set + best tuning opts
4. develop the **final** model
    - using the edx set
        - **train** the final model with the best tuning opts
    - using the validation set
        - **predict** ratings
        - **evaluate** rmse
# Methods {#methods}
## Step 1: Prepare dataset
First, let's download and split the MovieLens dataset into edx and validation
sets. We install and load the libraries required to split the dataset, as well
as the recosystem library used in the recommendation algorithm. Thanks to the
semi/anti joins below, there won't be any user/movie in `validation` that's not
in `edx`.
Note: This download code is provided by the course.
```{r step1-download, results='hide', cache=TRUE}
if (!require(tidyverse)) install.packages(
"tidyverse", repos = "http://cran.us.r-project.org"
)
if (!require(caret)) install.packages(
"caret", repos = "http://cran.us.r-project.org"
)
if (!require(data.table)) install.packages(
"data.table", repos = "http://cran.us.r-project.org"
)
if (!require(recosystem)) install.packages(
"recosystem", repos = "http://cran.us.r-project.org"
)
library(tidyverse)
library(caret)
library(data.table)
library(recosystem)
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(
text = gsub(
"::", "\t",
readLines(unzip(dl, "ml-10M100K/ratings.dat"))
),
col.names = c("userId", "movieId", "rating", "timestamp")
)
movies <- str_split_fixed(
readLines(unzip(dl, "ml-10M100K/movies.dat")),
"\\::", 3
)
colnames(movies) <- c("movieId", "title", "genres")
# if using R 3.6 or earlier:
if (as.integer(R.version$major) <= 3 & as.double(R.version$minor) <= 6) {
movies <- as.data.frame(movies) %>%
mutate(
movieId = as.numeric(levels(movieId))[movieId],
title = as.character(title),
genres = as.character(genres)
)
}
# if using R 4.0 or later:
if (as.integer(R.version$major) >= 4) {
movies <- as.data.frame(movies) %>%
mutate(
movieId = as.numeric(movieId),
title = as.character(title),
genres = as.character(genres)
)
}
movielens <- left_join(ratings, movies, by = "movieId")
# if using R 3.5 or earlier, use `set.seed(1)`
if (as.integer(R.version$major) <= 3 & as.double(R.version$minor) <= 5) {
set.seed(1)
} else {
set.seed(1, sample.kind = "Rounding")
}
test_index <- createDataPartition(
y = movielens$rating, times = 1, p = 0.1, list = FALSE
)
edx <- movielens[-test_index, ]
temp <- movielens[test_index, ]
# Make sure userId and movieId in validation set are also in edx set
validation <- temp %>%
semi_join(edx, by = "movieId") %>%
semi_join(edx, by = "userId")
# Add rows removed from validation set back into edx set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
rm(dl, ratings, movies, test_index, temp, movielens, removed)
```
Because only the `userId`, `movieId`, and `rating` columns will be used, we keep
these 3 columns and remove the rest. Lastly, we save the `edx` and `validation`
dataframes to rda files, and remove `validation` to save memory. The validation
set rda file can be loaded back later when we need it.
```{r step1-prepare, results='hide'}
edx <- edx %>% select(userId, movieId, rating)
validation <- validation %>% select(userId, movieId, rating)
if (!dir.exists("rdas")) {
dir.create("rdas")
}
save(edx, file = "rdas/edx.rda")
save(validation, file = "rdas/validation.rda")
rm(validation)
```
Let's view `edx` by sampling a few rows.
```{r step1-edx-head}
set.seed(1)
edx %>% slice_sample(n = 6)
```
## Step 2: Describe dataset
### edx rating {#edx-rating}
Let's look at the summary statistics (min, Q1, median, mean, Q3, max) of
`edx$rating`.
```{r step2-rating-summary}
summary(edx$rating)
```
We can see that the minimum rating is 0.5, and half of the ratings are between 3
and 4 stars, so overall users are kind in their ratings!
Here is a ratings histogram, which shows that whole star ratings are more common
than half star ratings.
```{r step2-rating-histogram, echo=FALSE}
edx %>%
ggplot(aes(rating)) +
geom_bar() +
scale_x_continuous(breaks = seq(0.5, 5, 0.5)) +
labs(title = "Rating Distribution of edx set")
```
### Users and movies
There are 69878 users and 10677 movies are in `edx` set, and the sparsity is
1.21%.
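
To make the sparsity concrete: edx has roughly 9 million rows, so the fraction
of possible user-movie pairs that have a rating (what the last line of the code
chunk below computes) is about

$$\frac{\#\,\text{ratings}}{\#\,\text{users} \times \#\,\text{movies}}
\approx \frac{9.0\times10^{6}}{69878 \times 10677} \approx 0.0121 = 1.21\%$$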
```{r step2-sparsity}
n_distinct(edx$userId)
n_distinct(edx$movieId)
nrow(edx) / (n_distinct(edx$userId) * n_distinct(edx$movieId))
```
## Step 3: Develop the first model
### Create training and test sets
We can split the `edx` set into a `train` set (90%) and a `test` set (10%) using
the caret package. Thanks to the semi/anti joins below, there won't be any
user/movie in `test` that's not in `train`.
```{r step3-split, results='hide'}
set.seed(1)
test_index <- createDataPartition(
y = edx$rating, times = 1, p = 0.1, list = FALSE
)
train <- edx[-test_index, ]
temp <- edx[test_index, ]
# Make sure userId and movieId in test set are also in training set
test <- temp %>%
semi_join(train, by = "movieId") %>%
semi_join(train, by = "userId")
# Add rows removed from test set back into training set
removed <- anti_join(temp, test)
train <- rbind(train, removed)
rm(edx, test_index, temp, removed)
```
### Modeling approach {#modeling-approach}
#### What is a model?
Creating a model means developing an algorithm by analyzing a known dataset.
Machine learning is about using this model to predict useful information about a
new, unknown dataset.
#### Tune: generate the best recosystem algorithm
Previously in [Key Steps](#key-steps) we described that a model in recosystem
has 2 parts: training data + best tuning opts. Training data is the known
dataset, and best tuning opts is the algorithm part.
Although we don't know how recosystem does matrix factorization under the hood,
we can **create different recosystem algorithms by supplying different tuning
parameters**. This process is the tuning step, and its goal is to find the best
tuning opts: the optimal param and option values that result in the minimum
rmse, and thus generate the **best recosystem algorithm** *for our data*.
> The emphasis is on "for our data" because different tuning opts work best for
> different data. What we consider the best recosystem algorithm for movie
> recommendation is unlikely to work for song recommendation.
#### Train: create the model
Once we are satisfied with the rmse produced by the best tuning opts, we run the
algorithm on the training data to create a model. The model will contain the
matrices necessary for the predicting step.
#### Predict: use the model to predict the unknown
The predict step uses the matrices in the model to predict the ratings in the
unknown set. After getting the list of predicted ratings, we can calculate the
rmse by comparing it with the list of true ratings.
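
For intuition, a standard matrix factorization model (recosystem's internals may
differ in detail) learns a length-`dim` factor vector $p_u$ for each user $u$
and $q_v$ for each movie $v$; these vectors make up the matrices stored in the
model, and an unknown rating is predicted as

$$\hat{r}_{uv} = p_u^\top q_v$$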
#### recosystem functions
The tune, train, and predict functions are methods stored in the model object;
they use and update the information in the model. Each step is further explained
in the next sections.
| | |
|--------------------|-------------------|
| create a new model | `model <- Reco()` |
| tune | `model$tune()` |
| train | `model$train()` |
| predict | `model$predict()` |
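
Putting the table together, here is a minimal sketch of the whole recosystem
workflow (not evaluated here; the variable names match the ones used in the rest
of this report, where the DataSource objects are created in the next section):

```{r workflow-sketch, eval=FALSE}
model <- Reco()                                       # create a new model object
res <- model$tune(train_data, opts = best_tune_opts)  # grid-search the params
model$train(train_data, opts = res$min)               # train with the best opts
pred <- model$predict(test_data, out_memory())        # predict ratings in memory
```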
### Create the first model {#create-the-first-model}
Using `Reco()` function, we create a new model object.
```{r step3-create-model}
first_model <- Reco()
```
But the setup is not done yet, because there is a data conversion step.
To use our data `train` and `test` in recosystem, they need to be **converted
from dataframes to DataSource** objects. We can use the `data_memory()` function
for this. The DataSource objects contain the same data as the original
dataframes.
Let's add the **`_data`** suffix to the DataSource variable names to avoid
mix-ups. Here's a table that shows the variable names and the syntax to access
the user, movie, and rating columns.
| Name | Type | Access user | Access movie | Access rating |
|--------------|------------|----------------|----------------|----------------|
| `train` | dataframe | `$userId` | `$movieId` | `$rating` |
| `train_data` | DataSource | `@source[[1]]` | `@source[[2]]` | `@source[[3]]` |
| `test` | dataframe | `$userId` | `$movieId` | `$rating` |
| `test_data` | DataSource | `@source[[1]]` | `@source[[2]]` | `@source[[3]]` |
Note: The `index1` argument of `data_memory()` below indicates whether the ids
start at 0 or 1. Ours start at 1, so we set `index1 = TRUE`.
```{r step3-index1}
min(train$userId)
min(train$movieId)
```
```{r step3-convert-data}
train_data <- data_memory(
user_index = train$userId,
item_index = train$movieId,
rating = train$rating,
index1 = TRUE
)
test_data <- data_memory(
user_index = test$userId,
item_index = test$movieId,
rating = test$rating,
index1 = TRUE
)
rm(train, test)
```
### Tuning
The `$tune()` function uses k-fold cross validation to tune the model's
parameters.
**input**
- `train_data`: training data
- `opts`: parameters and options
    - params are tuned; tune will choose the best value for each param
    - options are NOT tuned; think of these as settings
**output**
- `min` (the **best tuning opts** mentioned earlier): the optimal combination,
with each param assigned a value, that results in the **minimum rmse**
- `res`: each combination and its rmse
#### Default tuning opts {#default-tuning-opts}
From the help file (type `?tune` in the console), we find recosystem's default
tuning opts. The default is used when `model$tune()` is called with no `opts`
argument. The code chunk below shows what the default looks like, and stores the
tuning opts in a variable.
The tune function will evaluate every combination and pick the one with the
minimum rmse, out of $2^6 = 64$ combinations (6 params, 2 choices each).
> For example, the first combination is:
> `dim = 10, costp_l1 = 0, costp_l2 = 0.01, costq_l1 = 0, costq_l2 = 0.01, lrate = 0.01`.
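
As a quick sanity check of that count (a throwaway computation, separate from
the model code):

```{r tune-grid-count}
# enumerate the default 2-choice grid of the 6 params and count the rows
nrow(expand.grid(
  dim = c(10, 20), costp_l1 = c(0, 0.1), costp_l2 = c(0.01, 0.1),
  costq_l1 = c(0, 0.1), costq_l2 = c(0.01, 0.1), lrate = c(0.01, 0.1)
))
```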
```{r step3-default}
best_tune_opts <- list(
# params
dim = c(10, 20),
costp_l1 = c(0, 0.1),
costp_l2 = c(0.01, 0.1),
costq_l1 = c(0, 0.1),
costq_l2 = c(0.01, 0.1),
lrate = c(0.01, 0.1),
# options
loss = "l2",
nfold = 5,
niter = 20,
nthread = 1,
nbin = 20,
nmf = FALSE,
verbose = FALSE,
progress = TRUE
)
```
Here's a table explaining what each parameter and option means.
> P in `costp` refers to users; Q in `costq` refers to movies in this case.
>
> l2 is related to rmse (squared error); l1 is related to MAE (mean absolute
> error).
| Parameters | What it is | Options | What it is |
|------------|------------|---------|------------|
| dim | number of factors for matrix factorization | nfold | number of folds in cross validation |
| costp_l1 | L1 user factors regularization cost | niter | number of iterations |
| costp_l2 | L2 user factors regularization cost | nthread | number of threads for parallel computing |
| costq_l1 | L1 movie factors regularization cost | nbin | number of bins; must be $>$ nthread |
| costq_l2 | L2 movie factors regularization cost | nmf | whether to perform non-negative matrix factorization |
| lrate | learning rate: step size in gradient descent | verbose | whether to print info in the console |
| loss | error function: loss = l2 for rmse | progress | whether to print a progress bar in the console |
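
For intuition, the cost params above enter the training objective roughly as
follows (this is the standard LIBMF-style objective that recosystem builds on;
the exact form in the source may differ in detail):

$$\min_{P,Q}\;\sum_{(u,v)}\Big[\big(r_{uv} - p_u^\top q_v\big)^2
+ \lambda_{p1}\lVert p_u\rVert_1 + \lambda_{p2}\lVert p_u\rVert_2^2
+ \lambda_{q1}\lVert q_v\rVert_1 + \lambda_{q2}\lVert q_v\rVert_2^2\Big]$$

where $\lambda_{p1}$, $\lambda_{p2}$ correspond to `costp_l1`, `costp_l2` (user
factors $p_u$) and $\lambda_{q1}$, $\lambda_{q2}$ to `costq_l1`, `costq_l2`
(movie factors $q_v$).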
#### Change the default opts
The default provides a nice starting point, but we want to customize the params
and options before calling the tune function for the first time. Yes, we will
call the tune function twice! The reasons are given below each code chunk.
```{r step3-change-options}
best_tune_opts$nthread <- 16
best_tune_opts$nbin <- 128
best_tune_opts$nmf <- TRUE
best_tune_opts$verbose <- TRUE
```
**Options**
| Option | Old value | New value | Reason for change |
|--------|-----------|-----------|-------------------|
| nthread | 1 | 16 | Take advantage of recosystem's parallel computation and speed up the process. |
| nbin | 20 | 128 | What nbin exactly means is unclear in recosystem's help file, but since nbin must be $>$ nthread, I just made them both bigger. |
| nmf | FALSE | TRUE | All our ratings are non-negative; we don't want any predicted ratings to be $<$ 0. |
| verbose | FALSE | TRUE | Outputs progress in the RStudio console (not in the report pdf, where it would take many pages!). |
**Params**
The 2 tuning performance concerns are efficiency and quality.
Tuning needs to be efficient because its intense computation can take a long
time, so we want to **run as few combinations as possible**. The changed set
only has $2^4 = 16$ combinations, compared to the default's [64
combinations](#default-tuning-opts). We test only 2 values per param (picking 3
values would result in $3^4 = 81$ combinations). We should also try to improve
the quality of each tuning round: with only 2 values to test per param, we
**shouldn't waste them on unlikely/bad guesses**.
Moreover, the goal of tuning is to find (or approach) the global minimum of rmse
in a few rounds, but it's easy to get stuck in a local minimum. We can avoid
this by starting out big **(testing values that are farther apart) in the first
call** of tuning, then narrowing down and testing the best value's close
neighbors in later rounds. More on this later!
> For example, the first combination is:
> `dim = 20, lrate = 0.1, costp_l1 = 0, costp_l2 = 0, costq_l1 = 0, costq_l2 = 0`
```{r step3-change-params}
best_tune_opts$dim <- c(20, 40)
best_tune_opts$lrate <- c(0.1, 0.2)
best_tune_opts$costp_l1 <- 0
best_tune_opts$costp_l2 <- c(0, 0.2)
best_tune_opts$costq_l1 <- 0
best_tune_opts$costq_l2 <- c(0, 0.2)
best_tune_opts
```
| Param | Old value | New value | Reason for change |
|-------|-----------|-----------|-------------------|
| dim | c(10, 20) | c(20, 40) | I believe that the more dimensions in matrix factorization, the better the result. Since movies span [18 genres](#appx-movie-genres), my guess is that we need at least 20 dimensions. |
| lrate | c(0.01, 0.1) | c(0.1, 0.2) | A learning-rate step size of 0.01 is not a good guess because it's almost 0 (barely any gradient descent), so we test 0.2 instead. |
| costp_l1, costq_l1 | c(0, 0.1) | 0 | l1 is related to mean absolute error, not rmse, so we won't tune these. |
| costp_l2, costq_l2 | c(0.01, 0.1) | c(0, 0.2) | l2 is related to rmse, and we test a bigger range. |
#### Tuning round 1
Now let's tune, and store the tune result in a variable.
```{r step3-tune1, results='hide'}
set.seed(1)
tune_output <- first_model$tune(train_data, opts = best_tune_opts)
```
`min` shows the combination that yields the minimum rmse, and `res` shows the
rmse of every combination. The minimum rmse is 0.801, which looks pretty good on
the first try!
```{r step3-tune1-output}
tune_output$min
tune_output$res
```
##### Analysis
We can visualize the full result better with 2 raster plots.
**dim-lrate plot**: Using the tidyverse, we `group_by` dim and lrate (4
combinations); because there are 4 rmses for each combination (the l2 factors
are not fixed), we take the minimum rmse in `summarize`. The code chunk below
shows the resulting dataframe, which can then be piped into ggplot.
**l2 factors plot**: similar to the first plot; `group_by` costp_l2 and
costq_l2, and `summarize` the min rmse.
```{r step3-tune1-group}
tune_output$res %>%
group_by(dim = factor(dim), lrate = factor(lrate)) %>%
summarize(rmse = min(loss_fun))
tune_output$res %>%
group_by(costp_l2 = factor(costp_l2), costq_l2 = factor(costq_l2)) %>%
summarize(rmse = min(loss_fun))
```
```{r step3-tune1-dim-lrate-raster, echo=FALSE}
tune_output$res %>%
group_by(dim = factor(dim), lrate = factor(lrate)) %>%
summarize(rmse = min(loss_fun)) %>%
ggplot(aes(dim, lrate, fill = rmse)) +
geom_raster() +
labs(
fill = "Min. rmse",
title = "Raster Plot of rmse with dim and lrate"
) +
scale_fill_distiller(palette = "Reds")
```
Looking at the color bar legend in the dim-lrate raster plot, the rmses roughly
range from 0.802 to 0.81 and don't vary much from each other. This suggests that
further tuning dim and lrate won't bring much improvement.
```{r step3-tune1-l2-raster, echo=FALSE}
tune_output$res %>%
group_by(costp_l2 = factor(costp_l2), costq_l2 = factor(costq_l2)) %>%
summarize(rmse = min(loss_fun)) %>%
ggplot(aes(costp_l2, costq_l2, fill = rmse)) +
geom_raster() +
labs(
fill = "Min. rmse",
title = "Raster Plot of rmse with l2 Cost Factors"
) +
scale_fill_distiller(palette = "Blues")
```
However, the color bar legend in the l2 cost factors plot shows that the rmses
roughly range from 0.8 to 0.88, which is a big difference. Notably, it's
`costq_l2` that significantly changes the rmses. Looking at the 2 tiles in the
first column (`costp_l2 = 0`), changing `costq_l2` from 0 to 0.2 reduces the
rmse by about 0.03. We will further tune `costq_l2` (the movie l2 regularization
cost) in the next round.
Before starting tuning round 2, let's make sure our tuning list variable is
synced with this round's `min` result.
```{r step3-tune1-update}
best_tune_opts$dim <- tune_output$min$dim
best_tune_opts$lrate <- tune_output$min$lrate
best_tune_opts$costp_l2 <- tune_output$min$costp_l2
best_tune_opts$costq_l2 <- tune_output$min$costq_l2
```
#### Tuning round 2
In tuning round 1's analysis, we chose to further tune `costq_l2`, the movie l2
cost factor. We can verify whether `costq_l2 = 0.2` yields the true minimum
rmse, or whether some neighbor of 0.2 results in an even lower rmse.
We choose the neighborhood $[0.05, 0.3]$ with a step size of 0.05, giving the
expanded list 0.05, 0.1, 0.15, 0.2, 0.25, 0.3. This time we can afford to test 6
values, because we are tuning only 1 param, so there are only 6 combinations!
The code chunk below updates `costq_l2` accordingly.
```{r step3-tune2-param}
best_tune_opts$costq_l2 <- seq(0.05, 0.3, 0.05)
```
Now let's tune, and see the result!
```{r step3-tune2, results='hide'}
set.seed(1)
tune_output <- first_model$tune(train_data, opts = best_tune_opts)
```
```{r step3-tune2-output}
tune_output$min
```
##### Analysis
The min rmse is 0.797, with `costq_l2 = 0.1`. We can plot the full result using
`geom_point`.
```{r step3-tune2-plot, echo=FALSE}
tune_output$res %>%
ggplot(aes(costq_l2, loss_fun)) +
geom_point() +
scale_x_continuous(breaks = best_tune_opts$costq_l2) +
labs(
y = "rmse",
title = "Plot of RMSE vs. Movie l2 Cost Factors"
)
```
We won't tune `costq_l2` again, because there isn't much room for improvement:
here the lowest rmse is only about 0.01 lower than the highest. We update and
print the best tuning opts variable, and the tuning stage is now complete!
```{r step3-tune2-update}
best_tune_opts$costq_l2 <- tune_output$min$costq_l2
best_tune_opts
rm(tune_output)
```
### Train
The `$train()` function reads the training data and creates a model containing
the matrices necessary for prediction later. We'll use `train_data` as our
training data.
**input**
- `train_data`: training data
- `out_model`: where the model is stored; set `NULL` to store it in memory
- `opts`: tuning params and options
**output**
- There is no return value, but `$model` will be populated.
The code chunk output below shows that `tr_rmse` (training rmse) gradually
decreases over `niter = 20` training iterations.
```{r step3-train}
set.seed(1)
first_model$train(train_data, opts = best_tune_opts)
```
#### Analysis
How does the model generate and improve this `tr_rmse`, and why is this rmse so
much better than the 0.797 we got in tuning? Training likely uses gradient
descent, and `tr_rmse` is measured on the very data the model is fit to, whereas
tuning estimated the rmse by cross-validation on held-out folds; still, we don't
really know exactly what recosystem does in its source code.
If more iterations mean a lower `tr_rmse`, why not train for many more, say 100
iterations? This is NOT a good idea, because **we don't want to over-train**
(overfit) the model on the training data: the model will eventually be evaluated
against the unknown test data. A hypothetical check is sketched below.
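
If we wanted to check this empirically (a hypothetical experiment, not run for
this report), we could retrain with increasing `niter` and watch the held-out
rmse; it typically flattens or worsens past some point even as `tr_rmse` keeps
falling:

```{r step3-niter-check, eval=FALSE}
for (n in c(20, 50, 100)) {
  opts_n <- best_tune_opts
  opts_n$niter <- n            # same best opts, more training iterations
  m <- Reco()
  m$train(train_data, opts = opts_n)
  p <- m$predict(test_data, out_memory())
  cat("niter =", n, "test rmse =",
      sqrt(mean((test_data@source[[3]] - p)^2)), "\n")
}
```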
### Predict
The `$predict()` function predicts unknown ratings in the testing data. We'll be
using `test_data` as our testing data.
**input**
- `test_data`: testing data
- `out_pred`: where to store the predicted ratings; use `out_memory()` to store
them in memory
**output**
- a vector of predicted ratings
> Note: The true ratings stored in `test_data` will be ignored by the predict
> function. Only the user and movie ids are used. From the help file (type
> `?predict` in the console):
>
> In `$predict()` ... the testing data have the same format as training data,
> except that the value (rating) column is not required, and **will be ignored
> if it is provided**.
Let's predict the ratings and look at the summary statistics.
```{r step3-predict}
set.seed(1)
pred_ratings <- first_model$predict(test_data, out_memory())
summary(pred_ratings)
```
The predicted ratings fall outside the star range $[0.5, 5]$. Plotting a
histogram, we can see a left-skewed bell curve. Unlike the edx distribution we
saw in the [data exploration](#edx-rating) section, where there are far more
whole-star ratings, our predicted ratings are continuous and **don't favor
whole-star ratings over half-star ones**.
```{r step3-predict-histogram, echo=FALSE}
tibble(rating = pred_ratings) %>%
ggplot(aes(rating)) +
geom_histogram(binwidth = 0.5) +
scale_x_continuous(breaks = seq(0, 6.5, 0.5)) +
labs(title = "Distribution of test set's Predicted Ratings")
```
Using the `bound_rating()` function defined below, we bound the predicted
ratings to $[0.5, 5]$ by setting any rating $<$ 0.5 to 0.5, and any rating $>$ 5
to 5. As we saw in the [data conversion](#create-the-first-model) section, the
test set's true ratings are stored as `test_data@source[[3]]`. We call
`evaluate_rmse()` with the true and predicted ratings. The evaluated rmse is
0.789.
```{r step3-rmse}
bound_rating <- function(ratings) {
  # clamp predictions to the valid star range [0.5, 5] (vectorized)
  pmin(pmax(ratings, 0.5), 5)
}
evaluate_rmse <- function(true, pred) {
  # root-mean-square error between true and predicted ratings
  sqrt(mean((true - pred)^2))
}
pred_ratings <- bound_rating(pred_ratings)
rmse_test <- evaluate_rmse(test_data@source[[3]], pred_ratings)
rmse_test
```
#### Analysis
Besides the rmse evaluation, there is much more to discover. For instance, we
can generate a boxplot comparing the predicted and true ratings. The model
predicts ratings with nearly the same IQR (interquartile range, the middle half)
as the true ratings. **However, the model is a little cautious about predicting
higher ratings**: the predicted median of about 3.5 is lower than the true
median of 4.
```{r step3-predict-boxplot, echo=FALSE, out.width="50%"}
rbind(
tibble(rating = pred_ratings, source = "predicted"),
tibble(rating = test_data@source[[3]], source = "true")
) %>%
ggplot(aes(source, rating, color = source)) +
geom_boxplot() +
guides(color = "none") +
labs(title = "Boxplot of test set's \n Predicted vs True Ratings")
```
We can use a jitter point plot to see, for each true star rating, how far off
our predicted ratings are.
> The plot is generated by sampling 5000 points (so the code finishes in a
> reasonable time). The alpha is 0.005, so if readers see one solid-colored dot,
> it is the result of hundreds of overlapping points.
```{r step3-predict-jitterplot, echo=FALSE}
set.seed(1)
tibble(true = test_data@source[[3]], pred = pred_ratings) %>%
slice_sample(n = 5000) %>%
ggplot(aes(true, pred, color = factor(true))) +
geom_jitter(alpha = 0.005) +  # jitter the discrete true ratings; low alpha reveals overlap
scale_x_continuous(breaks = seq(0.5, 5, 0.5)) +
scale_y_continuous(breaks = seq(0.5, 5, 0.5)) +
guides(color = "none") +
labs(x = "true", y = "predicted",
title = "test set's Predicted vs True Ratings")
```
Predicted ratings are continuous, but true ratings are whole/half stars. Could
we then round the continuous ratings to star ratings? Similar to standard
decimal rounding rules, a fractional part $<$ 0.25 rounds to 0, $[0.25, 0.75)$
rounds to 0.5, and $\geq$ 0.75 rounds to 1.
If we choose to round ratings to whole/half stars, then:
- Rounding one predicted rating **farther** from the true rating contributes a
**larger error** compared with no rounding. For example: true rating = 3.5,
predicted rating = 3.2, rounded rating = 3.
- Rounding one predicted rating **closer** to the true rating contributes a
**smaller error** compared with no rounding. For example: true rating = 4,
predicted rating = 3.8, rounded rating = 4.
Since **rmse penalizes larger errors much more than smaller errors**, one bad
rounding with a large error can outweigh many good roundings with small errors.
Rounding is only beneficial when many points are concentrated close around the
true rating and only a few points are far away. We do NOT see this shape in our
jitter plot, since many points are far away, so we won't round. The toy
computation below makes this concrete.
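
Here is a minimal numeric sketch with made-up ratings (not from our data)
showing one bad rounding outweighing several good ones:

```{r rounding-toy}
# hypothetical toy data: one prediction rounds away from the truth,
# three round exactly onto it
true <- c(3.5, 4.0, 4.0, 4.0)
pred <- c(3.2, 3.8, 3.8, 3.8)
round_half <- function(x) round(x * 2) / 2  # round to the nearest half star
sqrt(mean((true - pred)^2))               # unrounded rmse: ~0.23
sqrt(mean((true - round_half(pred))^2))   # rounded rmse: 0.25
```

Three of the four rounded predictions become exactly right, yet the single
prediction that rounds the wrong way (3.2 down to 3, away from the true 3.5)
still raises the overall rmse.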
Finally we clean up the variables and prepare for step 4, the final model.
```{r step3-cleanup}
rm(train_data, test_data, pred_ratings, first_model, rmse_test)
```
## Step 4: Develop the final model
We have finally reached the most exciting step: combining the previous tuning
list with the edx set to form the final model! I won't repeat much description
here, since the steps are almost the same as when we created the first model.
### Create final model
We convert `edx` and `validation` dataframes to recosystem DataSource objects
`edx_data` and `validation_data`. We create a new model object named
`final_model`.
```{r step4-create-model}
load("rdas/edx.rda")
load("rdas/validation.rda")
edx_data <- data_memory(
user_index = edx$userId,
item_index = edx$movieId,
rating = edx$rating,
index1 = TRUE
)
validation_data <- data_memory(
user_index = validation$userId,
item_index = validation$movieId,
rating = validation$rating,
index1 = TRUE
)
rm(edx, validation)
final_model <- Reco()
```
### Train
We skip the tuning stage for the final model, because we are satisfied with our
first model's tuning opts, which form the algorithm part. We are using the same
algorithm, but on the full edx set.
The training stage generates the matrices necessary for the predicting stage.
```{r step4-train}
set.seed(1)
final_model$train(edx_data, opts = best_tune_opts)