---
title: "Building a recommender system with R"
author: "Bart Boerman"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  pdf_document:
    citation_package: natbib
    fig_caption: yes
    keep_tex: yes
    latex_engine: pdflatex
    template: svm-latex-ms.tex
  word_document: default
fontfamily: mathpazo
fontsize: 11pt
geometry: margin=1in
keywords: recommendation, collaborative filtering, LIBMF, matrix factorization, R
endnote: no
spacing: double
abstract: "The main task of a recommender system is to predict the user's response to different options. GroupLens Research has collected and made available rating data sets from the MovieLens web site. A data set with 10 million rows is downloaded and split into a train (90%) and an 'unseen' test (10%) set for evaluation. This document is a deliverable of the first of two capstone projects in the 'Professional Certificate in Data Science' course hosted by Harvard University (HarvardX). I followed a CRISP-DM-like process: a short background study on content-based versus collaborative filtering, data analysis and preparation, machine learning model configuration, and evaluation of the results. The evaluation statistic is the root-mean-squared error (RMSE), which must be lower than 0.8775 to be rewarded with the maximum grade. This is achieved by applying a collaborative filtering method."
---
```{r setup, include=FALSE, warning=FALSE, message=FALSE}
# Document options
options(tinytex.verbose = TRUE)
knitr::opts_chunk$set(echo = FALSE)
# New page function
doc_pagebreak <- function() {
  if (knitr::is_latex_output())
    return("\\newpage")
  else
    return('<div style="page-break-before: always;" />')
}
# Figure options
options(figcap.prefix = "Figure", figcap.sep = ":", figcap.prefix.highlight = "**")
# Execute script in external file. Objects will be loaded into the environment.
source('predict-movie-ratings.R', echo = FALSE, verbose = FALSE)
```
# Problem description

Train a machine learning algorithm that uses the inputs in one subset to predict movie ratings in the validation set (unseen test data). The data used is the MovieLens dataset with 10 million ratings. The download script with basic data wrangling (combining files, splitting into train and test sets) is provided by edX.
# Methods

I performed the following tasks to design and test an R script which predicts movie ratings:

* Background study on popular approaches for designing a recommender system.
* Improve the script provided by edX so that the data is downloaded only when it is not already available.
* Basic exploratory data analysis.
* Use the 'recosystem' package for collaborative filtering.
* Parameter tuning, model training and evaluation on the training data.
* Test the model on the 'unseen' test data.
# Popular approaches for designing a recommender system

* **Content-based systems** analyse the characteristics of items. Similarity of items is determined by measuring the similarity in their properties. Content-based filtering recommends items based on a comparison between the content of the items and a user profile. The content of each item is represented as a set of descriptors or tags given to the item by users. The user profile is represented with the same descriptors and built up by analyzing the content of items which have been seen by the user.
* **Collaborative-filtering systems** analyse the relationships between users and items. The underlying assumption of the collaborative filtering approach is that if person A has the same opinion as person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.

I've chosen the collaborative-filtering approach, since the data, as specified within the capstone project, contains a limited set of metadata. For example, it does not contain user tags, nor does it contain information about the director, budget or cast of the movies.
\newpage
# Matrix Factorization

Collaborative-filtering systems use a user-item matrix as a basis to identify similarities between users and items. In the simplified example below the system will assume that Bart is most likely to rate movie 3 with a 6, since he agrees with Nuray on the other movies.
```{r example matrix, results='asis'}
User <- c("Bart", "Karel", "Nuray")
Movie_1 <- c(5, 2, 5)
Movie_2 <- c(1, 8, 1)
Movie_3 <- c("", "", 6)
as.matrix(cbind(User, Movie_1, Movie_2, Movie_3)) %>% 
  knitr::kable(format = "pandoc", caption = "User-item matrix")
```
Other common names for this task include ‘collaborative filtering’, ‘matrix completion’ and ‘matrix recovery’.
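As a sketch of the underlying idea, matrix factorization approximates the sparse user-item rating matrix by the product of two low-rank matrices of latent factors, so that each predicted rating is the inner product of a user vector and a movie vector. A common squared-error, L2-regularized formulation (which, to my understanding, corresponds to LIBMF's default real-valued setting) is

$$\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}} \left(r_{ui} - p_u^\top q_i\right)^2 + \lambda \left(\lVert p_u \rVert^2 + \lVert q_i \rVert^2\right), \qquad \hat{r}_{ui} = p_u^\top q_i,$$

where $\mathcal{K}$ is the set of observed (user, movie) pairs, $p_u$ and $q_i$ are the latent factor vectors of user $u$ and movie $i$, and $\lambda$ is a regularization parameter. The exact loss and regularization terms are configurable in LIBMF.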
# The MovieLens Dataset

GroupLens Research has collected and made available rating data sets from the MovieLens web site. Among others, a data set with 10 million rows, which we split into a train (90%) and an 'unseen' test (10%) set for evaluation.
## Description

The training dataset consists of `r dim(train.org)[1]` records and `r dim(train.org)[2]` columns with:

* `r n_distinct(train.org$userId)` anonymized user IDs.
* `r n_distinct(train.org$movieId)` movies with title and genre, using the real MovieLens id.
* `r n_distinct(train.org$genres)` unique genre combinations in the genres column.
* The ratings, made on a 5-star scale with half-star increments. Hence there are 10 possible outcomes of our prediction algorithm.
* The timestamp of the rating.

The data spans a period from `r min(as_datetime(train.org$timestamp, origin = "1970-01-01"))` until `r max(as_datetime(train.org$timestamp, origin = "1970-01-01"))`.
```{r example rows, results='asis'}
head(train.org, 6) %>% knitr::kable(format = "pandoc", caption = "Example rows")
```
Notes:

* The title contains the release year of the movie.
* The genres column contains multiple genres separated by a pipe.
* The timestamp of the rating represents the number of seconds since midnight Coordinated Universal Time (UTC) on January 1, 1970 (see the conversion sketch below).
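A minimal sketch (not evaluated) of converting this timestamp with 'lubridate', using the train.org object loaded by the accompanying script; the new column name is illustrative:

```{r timestamp sketch, echo=TRUE, eval=FALSE}
# Convert the Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) to a date-time.
train.org %>%
  mutate(rating_time = as_datetime(timestamp, origin = "1970-01-01")) %>%
  select(timestamp, rating_time) %>%
  head(3)
```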
## Missing values

The dataset contains `r sum(is.na(train.org))` missing values.
```{r show missing, echo=FALSE, warning=FALSE, error=FALSE}
# Which is `r round((sum(is.na(train.org))/sum(complete.cases(train.org)))*100,2)` percent.
train.org %>% 
  summarise_all(~ (sum(is.na(.)) / length(.)) * 100) %>% 
  round(., 2) %>% 
  knitr::kable(format = "pandoc", caption = "Percentage of missing values per column")
```
\newpage
## Suggestions for feature engineering when going content-based

A collaborative-filtering recommendation system uses a matrix with users, movies and ratings as values. Thus, feature engineering is not required. The engineering of the following derived features may add value to the data when building a content-based recommendation system (a sketch of some of them follows the list):

* Date attributes: year, month, day of week, hour.
* Release year (included in the title text).
* Number of genres per movie.
* One-hot encoding of genres.
* Mean, median and/or mode of the rating per genre, movie and user.
* Number of ratings per movie.
* Number of ratings per user.
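A minimal 'dplyr'/'stringr' sketch (not evaluated) of a few of these derived features; the new column names are illustrative, and the regular expression assumes each title ends with the release year in parentheses:

```{r feature engineering sketch, echo=TRUE, eval=FALSE}
library(tidyverse)
library(lubridate)

train_fe <- train.org %>%
  mutate(rating_time  = as_datetime(timestamp, origin = "1970-01-01"),
         rating_year  = year(rating_time),                                      # date attribute
         release_year = as.integer(str_extract(title, "(?<=\\()\\d{4}(?=\\)$)")),  # from title text
         n_genres     = str_count(genres, fixed("|")) + 1) %>%                  # genres per movie
  group_by(movieId) %>%
  mutate(movie_mean_rating = mean(rating),   # mean rating per movie
         movie_n_ratings   = n()) %>%        # number of ratings per movie
  ungroup()
```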
# Machine learning model based on LIBMF with R package 'recosystem'

I used the R package 'recosystem', which is a wrapper of the open source LIBMF library developed by the Machine Learning Group at National Taiwan University. It is developed for recommender systems based on collaborative filtering.
## Main features of LIBMF

The main features of LIBMF include:

* Solvers for real-valued matrix factorization, binary matrix factorization, and one-class matrix factorization.
* Parallel computation on a multi-core machine and other technical performance optimizations.
* Cross validation for parameter selection.
* Support for disk-level training, which greatly reduces memory usage.
\newpage
## Design decisions

I made the following design decisions to optimize RMSE and to overcome some minor shortcomings (a short sketch of the resulting post-processing follows the list):

* Instead of providing the original rating values, I decided to convert the ratings to integers after multiplying the original values by 2.
* The algorithm estimates the rating given by users to the movies rather than picking a rating out of the possible values ranging from 1 to 10. Hence, the outcome of the prediction must be rounded to the nearest integer for optimal RMSE.
* The algorithm predicts values outside the range of the provided integer values. Predictions below 1 are adjusted to 1 and predictions above 10 are adjusted to 10.
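A minimal sketch (not evaluated) of these post-processing steps, assuming a vector `pred` holds the raw model predictions on the doubled 1 to 10 scale:

```{r postprocess sketch, echo=TRUE, eval=FALSE}
pred <- round(pred)                # round to the nearest integer rating
pred <- pmin(pmax(pred, 1), 10)    # clip predictions to the valid 1-10 range
pred <- pred / 2                   # convert back to the original half-star scale
```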
## Process flow

The final model consists of the following steps (a minimal code sketch follows the list):

* Feature selection: select user ID, movie ID and rating.
* Data conversion: convert all features to integers. The rating was multiplied by 2 to put it on a 1 to 10 scale.
* Determine optimal parameters with cross validation.
* Train the model on 9,000k rows, which is 90% of the provided data.
* Predict on 1,000k unseen rows, which is 10% of the provided data.
* Correct predictions outside the 1 to 10 range.
* Convert ratings and predictions back to the original scale.
* Calculate RMSE: sqrt(mean((observed - predicted)^2)).
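A minimal 'recosystem' sketch of this flow (not evaluated). It assumes the train and validation data frames created by the accompanying script are named `train.org` and `test.org` and contain userId, movieId and rating columns; the tuning grid is illustrative, not the one actually used:

```{r recosystem sketch, echo=TRUE, eval=FALSE}
library(recosystem)
library(Metrics)

# Assumption: train.org / test.org hold userId, movieId and rating columns.
# Ratings are put on the doubled 1-10 integer scale (original rating * 2).
train_data <- data_memory(user_index = train.org$userId,
                          item_index = train.org$movieId,
                          rating     = as.integer(train.org$rating * 2),
                          index1     = TRUE)
test_data  <- data_memory(user_index = test.org$userId,
                          item_index = test.org$movieId,
                          index1     = TRUE)

r <- Reco()
# Cross-validated parameter selection over an illustrative grid.
tuned <- r$tune(train_data, opts = list(dim = c(20, 30), lrate = c(0.05, 0.1),
                                        costp_l2 = 0.01, costq_l2 = 0.01,
                                        niter = 10, nthread = 4))
r$train(train_data, opts = c(tuned$min, niter = 20, nthread = 4))

# Predict, then round, clip to the 1-10 range and convert back to the 0.5-5 scale.
pred <- r$predict(test_data, out_memory())
pred <- pmin(pmax(round(pred), 1), 10) / 2
rmse(test.org$rating, pred)
```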
\newpage
# Discussion

* I did not (yet) analyze the predictions on topics like distribution, bias and overfitting.
* Training on the 9,000k-row training dataset took a long but reasonable amount of time on my laptop (i7, SSD, 8 GB). I did not investigate the scalability of the model.
* For reasons of technical performance I evaluated just a few parameter settings during cross validation. Results might improve a bit with more effort towards parameter tuning.
* Absolute accuracy (predicting the exact rating out of 10 possible outcomes) is low, around 25%. The model gives a good estimate (RMSE of 0.7918), which can be used as an indication of user preferences towards an option (movie), but it fails to be precise.
# Conclusion

The main task of a recommender system is to predict the user's response to different options. My mission was to grasp the concepts of a collaborative-filtering based recommender system and to provide an example in R. This document and the accompanying script show that it is relatively simple to create a basic system which predicts movie ratings. The algorithm scores an RMSE of 0.7918. More than enough to be rewarded with the maximum grade.\
\
Collaborative filtering using LIBMF is easy to set up in both R and Python but has, as a method, the following cons:

* The so-called 'continuous cold start' when the system welcomes new movies and/or new users. By design it will not be able to provide reasonable predictions for these.
* Users within the system need to be active in providing ratings. Inactivity hides changes in their interests over time.
* The system needs to be retrained on the complete matrix after receiving new ratings. Hence, it has a performance penalty. One could consider training different models for different demographic and geographic segments.
\newpage
An optimal solution would be an ensemble of business rules and more extensive content-based systems. Other common tips are:

* Apply a popularity-based strategy. Trending products can be determined by what has been popular recently, demographically, regionally, seasonally, or at a certain time of day.
* Other user actions, like showing interest, viewing or buying, are potentially more robust because most users do not provide product ratings. These actions can be translated into a binary indicator which can be used as input for a collaborative-filtering based recommender system (see the sketch below).
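A hypothetical sketch of such a translation (not evaluated); the events table and its columns are illustrative and not part of the MovieLens data:

```{r binary indicator sketch, echo=TRUE, eval=FALSE}
# Hypothetical implicit-feedback log: one row per user action.
events <- tibble(userId  = c(1, 1, 2),
                 movieId = c(10, 12, 10),
                 action  = c("view", "purchase", "view"))

# Collapse actions into a binary indicator: 1 = user interacted with the movie.
# Absent user-movie pairs can be treated as implicit negatives by a one-class model.
binary_ratings <- events %>%
  distinct(userId, movieId) %>%
  mutate(rating = 1L)
```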
# References

## Resources consulted during background study

I read, and used, the following online content during the background study about building a recommender system in R:

* http://infolab.stanford.edu/~ullman/mmds/ch9.pdf
* https://www.csie.ntu.edu.tw/~cjlin/libmf/
* https://statr.me/2016/07/recommender-system-using-parallel-matrix-factorization/
* https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf
* https://medium.com/@cfpinela/recommender-systems-user-based-and-item-based-collaborative-filtering-5d5f375a127f
## Resources consulted for technical design

I recommend reading the following online posts, which I used as a basis for the technical solution:

* https://cran.r-project.org/web/packages/recosystem/recosystem.pdf
* https://www.r-bloggers.com/recosystem-recommender-system-using-parallel-matrix-factorization/
\newpage
## R packages used in software code

The following open source R packages are used in the software code which is described, or used, in this document:

* **'tidyverse'**, A collection of R packages for data science. Includes 'dplyr' for data wrangling tasks. Ships with 'ggplot2' for data visualization.
* **'lubridate'**, Provides date functions, e.g. extracting the month and year from a date.
* **'Metrics'**, Calculates performance metrics for model validation.
* **'caret'**, The caret package (short for Classification And Regression Training) contains functions to streamline the model training process for complex regression and classification problems. It provides a common interface to many R packages.
* **'recosystem'**, Recommender system using matrix factorization; an R wrapper of the LIBMF library. LIBMF is a matrix-factorization library for recommender systems. It is open source and developed by the Machine Learning Group at National Taiwan University.