Jeremy Howard recently taught the Fastai Deep Learning for Coders (Part 1) course. This course has a lesson on Collaborative Filtering where he uses MovieLens dataset to demonstrate models for predicting ratings of movies. This content is covered in videos of lecture 5 and lecture 6.
This notebook is an attempt to create a R version (using Reticulate package) of the MovieLens python notebook covered in the course. While we can have separate python and R chunks with interoperability using Reticulate, I have tried to do everything in R since it will be easier to use this as standalone R script also. This course uses a library fastai (written by Jeremy) which is a wrapper around PyTorch.
It will be helpful to listen to the lectures before going through this notebook since the concepts of the model and approach are discussed in the lecture and this notebook is just a replication attempt of the material from the course using R.
Initial Setup
# import R libraries
library(reticulate)
library(ggplot2)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(irlba)
Loading required package: Matrix
In this course, they talk about several cloud GPU options and provided a conda environment fastai in those environments. I am using the Paperspace setup that is covered in Lesson 1. Reshama Shaik has also listed detailed upto date instructions on getting set up in Paperspace. For setting up RStudio server in the same machine, I followed Cloud GPU setup section in TensorFlow for R site (except for the EC2 part).
Since the machine already comes with fastai conda environment, we first need to use the python that is part of the fastai environment and also use the fastai conda environment.
use_python("/home/paperspace/anaconda3/envs/fastai/bin/python", required = TRUE)
use_condaenv("fastai")
py_config()
python: /home/paperspace/anaconda3/envs/fastai/bin/python
libpython: /home/paperspace/anaconda3/envs/fastai/lib/libpython3.6m.so
pythonhome: /home/paperspace/anaconda3/envs/fastai:/home/paperspace/anaconda3/envs/fastai
version: 3.6.4 |Anaconda, Inc.| (default, Dec 21 2017, 21:42:08) [GCC 7.2.0]
numpy: /home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/numpy
numpy_version: 1.13.3
NOTE: Python version was forced by use_python function
main = import_main()
bi = import_builtins()
The fastai library is located in the folder ~/fastai. So first, we import fastai.learner and fastai.columndata (libraries used in the notebook) from the folder.
# get relevant python imports
fstai_learner = import_from_path("fastai.learner", "../../fastai")
fstai_coldata = import_from_path("fastai.column_data", "../../fastai")
These modules import several other modules. So the import command used in the python notebook is directly called below so that all the other modules are also available for use later.
py_run_string("
from fastai.learner import *
from fastai.column_data import *
")
Get Data
The ratings dataset has the ratings for different users and movies. The movies dataset has the movie title information.
datapath = "../../data/ml-latest-small/"
ratings = read.csv(paste0(datapath, "ratings.csv"), stringsAsFactors = FALSE)
head(ratings)
movies = read.csv(paste0(datapath, "movies.csv"), stringsAsFactors = FALSE)
head(movies)
Model
Each user is \(i\) represent by an embedding vector \(u_i\) consisting of n_factor
values and a user bias value \(ub_i\). Similarly a movie \(j\) is represented by an embedding vector \(m_j\) consisting of n_factor
values and movie bias value \(mb_j\). The model of rating \(r_{ij}\) given by user \(i\) to movie \(j\) is: \[ r_{ij} = u_i^Tv_j + b_i + m_j \]
Model Fitting
First, the set of user/movie combinations that would be used as validation set is determined. Then, the following parameters are set:
- weight decay (wd)
- Number of factors in embedding
n_factors
For few of the code snippets below, I have included the python code in comments to show the correspondence between python code and R code with reticulate. In most cases, I just had to do one of the following things to get things to work:
- Replace period in python code with dollar
- If it is a python variable or function, prefix a
py$
- Explicitly convert an R object into python object when passing to a function using
r_to_py
if needed.
- If a python function requires an integer as argument, then I ensured that an integer is passed as an argument (for example, passing
2L
instead of 2
)
# python code
# val_idxs = get_cv_idxs(len(ratings))
#
val_idxs = py$get_cv_idxs(nrow(ratings))
val_idxs = as.integer(val_idxs)
wd = 2e-4
n_factors = 50L
Next a data loader object cf
and a learner object learn
is created
# python code
# cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')
#
cf = py$CollabFilterDataset$from_csv(datapath, 'ratings.csv', 'userId', 'movieId', 'rating')
# python code
# learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)
#
learn = cf$get_learner(n_factors, val_idxs, 64L, opt_fn=py$optim$Adam)
We review the model
learn$model
EmbeddingDotBias(
(u): Embedding(671, 50)
(i): Embedding(9066, 50)
(ub): Embedding(671, 1)
(ib): Embedding(9066, 1)
)
The model has the following components:
- The vector of
n_factors = 50
for each of n_users = 671
is represented by a n_users x n_factors
embedding matrix u
.
- The vector of
n_factors = 50
for each of n_movies = 9066
is represented by a n_movies x n_factors
embedding matrix i
.
- Users have a bias vector
ub
of length n_users = 671
and movies have a bias vector ib
of length n_movies = 9066
.
Next we fit the model
In Jupyter notebook, a widget shows a nice output of progress. That output doesn’t render properly within RStudio and the html generated document has too much output. For now, I have turned off output and am just storing the final validation loss metric (MSE loss in this case)
paste0("MSE of validation set = ", round(mdlfit[[1]], 3))
[1] "MSE of validation set = 0.766"
Next we compare the predicted and actual ratings for the validation set
yval_preds = learn$predict()
yval=learn$data$val_y
dfplt = data.frame(yval = yval, yval_preds = yval_preds)
ggplot(dfplt) + geom_histogram(aes(x = yval_preds)) + facet_grid(yval ~ .) +
xlab("predicted ratings") + theme_bw()

Interpretation of Movie Bias
The model uses a sequential numeric id for users and movies, we first get the movie to id mapping from the data object cf
.
movie2idx = cf$item2idx
movie2idx = unlist(movie2idx)
head(movie2idx)
31 1029 1061 1129 1172 1263
0 1 2 3 4 5
idx2movie = as.numeric(names(movie2idx))
names(idx2movie) = movie2idx
head(idx2movie)
0 1 2 3 4 5
31 1029 1061 1129 1172 1263
Get the top 3000 movies that got the most ratings
topmovies = ratings %>% group_by(movieId) %>% summarize(cnt = n()) %>% arrange(desc(cnt)) %>% slice(1:3000)
topmoviesidx = movie2idx[as.character(topmovies$movieId)]
topmoviesidx = np_array(topmoviesidx)
Get the movie bias variable for the top 3000 movies
# python code
# m=learn.model
# movie_bias = to_np(m.ib(V(topMovieIdx)))
#
m = learn$model
movie_bias = py$to_np(m$ib(py$V(topmoviesidx)))
movie_bias[1:20]
[1] 0.8358478 0.9074793 1.3072114 0.8890176 0.8090852 0.4687574 0.8606437 0.5267966
[9] 1.0062566 0.6911108 0.7748715 0.5598993 0.5669060 0.8955097 0.7633992 0.7947082
[17] 0.2389306 0.6011345 0.2934838 0.7729112
topmovies$movie_bias = movie_bias[,1]
topmovies = left_join(topmovies, movies %>% select(movieId, title), by = "movieId")
Find the lowest rated movies (lowest values of movie bias).
topmovies %>% arrange(movie_bias) %>% slice(1:15)
Find the highest rated movies (highest values of movie bias)
topmovies %>% arrange(desc(movie_bias)) %>% select(title) %>% slice(1:15)
Interpretation of Embeddings
Get the embeddings of movies
# python code
# movie_emb = to_np(m.i(V(topMovieIdx)))
#
movie_emb = py$to_np(m$i(py$V(topmoviesidx)))
dim(movie_emb)
[1] 3000 50
Since there are 50 dimensions, a PCA is done to examine the first two principal components.
pcamdl = prcomp_irlba(movie_emb, n = 2)
loadings_df = data.frame(pcamdl$rotation)
scores_df = data.frame(pcamdl$x)
scores_df$movieidx = py_to_r(topmoviesidx)
scores_df$movieId = idx2movie[as.character(scores_df$movieidx)]
scores_df = inner_join(scores_df, movies %>% select(movieId, title), by = "movieId")
Check the movies that have highest and lowest scores for PC1.
scores_df %>% arrange(PC1) %>% slice(1:10) %>% select(title)
scores_df %>% arrange(desc(PC1)) %>% slice(1:10) %>% select(title)
Check the movies that have highest and lowest scores for PC2.
scores_df %>% arrange(PC2) %>% slice(1:10) %>% select(title)
scores_df %>% arrange(desc(PC2)) %>% slice(1:10) %>% select(title)
The above results could be used to assign some sort of meaning to the first 2 principal components. Next a scatter plot of PC1 vs PC2 with labels of movies is shown. This could be used to check which movies cluster together.
set.seed(12345)
scores_df_samp = scores_df %>% sample_n(50)
ggplot(scores_df_samp) + geom_point(aes(x = PC1, y = PC2)) +
geom_text(aes(x = PC1, y = PC2, label = title), size = 2.5, hjust = 0, nudge_x = 0.02) +
theme_bw()

Summary
Reticulate package is a great addition to R. Working through this example showed that it is not too hard to develop a R version of the analysis in python thanks to reticulate.890opkl,m
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /home/paperspace/anaconda3/envs/fastai/lib/libmkl_intel_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 irlba_2.3.2 Matrix_1.2-11 dplyr_0.7.4 ggplot2_2.2.1 reticulate_1.6
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 knitr_1.20 bindr_0.1.1 magrittr_1.5 munsell_0.4.3
[6] colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.1.6 stringr_1.3.0
[11] plyr_1.8.4 tools_3.4.3 grid_3.4.3 gtable_0.2.0 yaml_2.1.18
[16] lazyeval_0.2.1 assertthat_0.2.0 tibble_1.4.2 reshape2_1.4.3 glue_1.2.0
[21] labeling_0.3 stringi_1.1.6 compiler_3.4.3 pillar_1.1.0 scales_0.5.0
[26] jsonlite_1.5 pkgconfig_2.0.1
---
title: "Fastai Collaborative Filtering with R and Reticulate"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

Jeremy Howard recently taught the [Fastai Deep Learning for Coders (Part 1)](http://course.fast.ai/) course. This course has a lesson on Collaborative Filtering where he uses MovieLens dataset to demonstrate models for predicting ratings of movies. This content is covered in videos of [lecture 5](http://course.fast.ai/lessons/lesson5.html) and [lecture 6](http://course.fast.ai/lessons/lesson6.html). 

This notebook is an attempt to create a R version (using Reticulate package) of the MovieLens python [notebook](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb) covered in the  course.  While we can have separate python and R chunks with interoperability using Reticulate, I have tried to do everything in R since it will be easier to use this as standalone R script also. This course uses a library fastai (written by Jeremy) which is a wrapper around PyTorch. 

It will be helpful to listen to the lectures before going through this notebook since the concepts of the model and approach are discussed in the lecture and this notebook is just a replication attempt of the material from the course using R.

## Initial Setup

```{r}
# import R libraries
library(reticulate)
library(ggplot2)
library(dplyr)
library(irlba)
```

In this course, they talk about several cloud GPU options and provided a conda environment fastai in those environments. I am using the Paperspace setup that is covered in [Lesson 1](http://course.fast.ai/lessons/lesson1.html). Reshama Shaik has also listed detailed upto date [instructions](https://github.com/reshamas/fastai_deeplearn_part1/blob/master/tools/paperspace.md) on getting set up in Paperspace. For setting up RStudio server in the same machine, I followed Cloud GPU [setup section](https://tensorflow.rstudio.com/tools/cloud_gpu.html) in TensorFlow for R site (except for the EC2 part).

Since the machine already comes with fastai conda environment, we first need to use the python that is part of the fastai environment and also use the fastai conda environment.

```{r}

use_python("/home/paperspace/anaconda3/envs/fastai/bin/python", required = TRUE)
use_condaenv("fastai")
py_config()

main = import_main()
bi = import_builtins()
```

The fastai library is located in the folder ~/fastai. So first, we import fastai.learner and fastai.columndata (libraries used in the notebook) from the folder.

```{r}
# get relevant python imports
fstai_learner = import_from_path("fastai.learner", "../../fastai")
fstai_coldata = import_from_path("fastai.column_data", "../../fastai")
```

These modules import several other modules. So the import command used in the python notebook is directly called below so that all the other modules are also available for use later.

```{r}
py_run_string("
from fastai.learner import *
from fastai.column_data import *
              ")
```

## Get Data

The ratings dataset has the ratings for different users and movies. The movies dataset has the movie title information.

```{r}
datapath = "../../data/ml-latest-small/"

ratings = read.csv(paste0(datapath, "ratings.csv"), stringsAsFactors = FALSE)
head(ratings)

movies = read.csv(paste0(datapath, "movies.csv"), stringsAsFactors = FALSE)
head(movies)
```

## Model

Each user is $i$ represent by an embedding vector $u_i$ consisting of `n_factor` values and a user bias value $ub_i$. Similarly a movie $j$ is represented by an embedding vector $m_j$ consisting of `n_factor` values and movie bias value $mb_j$. The model of rating $r_{ij}$ given by user $i$ to movie $j$ is:
$$ r_{ij} = u_i^Tv_j + b_i + m_j $$

## Model Fitting

First, the set of user/movie combinations that would be used as validation set is determined. Then, the following parameters are set:

* weight decay (wd)
* Number of factors in embedding `n_factors`

For few of the code snippets below, I have included the python code in comments to show the correspondence between python code and R code with reticulate. In most cases, I just had to do one of the following things to get things to work:

* Replace period in python code with dollar
* If it is a python variable or function, prefix a `py$`
* Explicitly convert an R object into python object when passing to a function using `r_to_py` if needed.
* If a python function requires an integer as argument, then I ensured that an integer is passed as an argument (for example, passing `2L` instead of `2`)

```{r}
# python code
# val_idxs = get_cv_idxs(len(ratings))
#
val_idxs = py$get_cv_idxs(nrow(ratings))
val_idxs = as.integer(val_idxs)
wd = 2e-4
n_factors = 50L
```

Next a data loader object `cf` and a learner object `learn` is created
```{r}
# python code
# cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')
#
cf = py$CollabFilterDataset$from_csv(datapath, 'ratings.csv', 'userId', 'movieId', 'rating')

# python code
# learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)
#
learn = cf$get_learner(n_factors, val_idxs, 64L, opt_fn=py$optim$Adam)
```

We review the model
```{r}
learn$model
```

The model has the following components:

* The vector of `n_factors = 50` for each of `n_users = 671` is represented by a `n_users x n_factors` embedding matrix `u`.
* The vector of `n_factors = 50` for each of `n_movies = 9066` is represented by a `n_movies x n_factors` embedding matrix `i`.
* Users have a bias vector `ub` of length `n_users = 671` and movies have a bias vector `ib` of length `n_movies = 9066`. 

Next we fit the model
```{r results="hide"}
# python code
# learn.fit(1e-2, 2, wds=wd, cycle_len=1, cycle_mult=2)
#
mdlfit = learn$fit(1e-2, 2L, wds=wd, cycle_len=1L, cycle_mult=2L)
```

In Jupyter notebook, a widget shows a nice output of progress. That output doesn't render properly within RStudio and the html generated document has too much output. For now, I have turned off output and am just storing the final validation loss metric (MSE loss in this case)

```{r}
paste0("MSE of validation set = ", round(mdlfit[[1]], 3))
```

Next we compare the predicted and actual ratings for the validation set

```{r}
yval_preds = learn$predict()
yval=learn$data$val_y
```

```{r}
dfplt = data.frame(yval = yval, yval_preds = yval_preds)
ggplot(dfplt) + geom_histogram(aes(x = yval_preds)) + facet_grid(yval ~ .) + 
   xlab("predicted ratings") + theme_bw()
```

## Interpretation of Movie Bias
The model uses a sequential numeric id for users and movies, we first get the movie to id mapping from the data object `cf`. 
```{r}
movie2idx = cf$item2idx
movie2idx = unlist(movie2idx)
head(movie2idx)
```

```{r}
idx2movie = as.numeric(names(movie2idx))
names(idx2movie) = movie2idx
head(idx2movie)
```

Get the top 3000 movies that got the most ratings
```{r}
topmovies = ratings %>% group_by(movieId) %>% summarize(cnt = n()) %>% arrange(desc(cnt)) %>% slice(1:3000)
topmoviesidx = movie2idx[as.character(topmovies$movieId)]
topmoviesidx = np_array(topmoviesidx)
```

Get the movie bias variable for the top 3000 movies
```{r}
# python code
# m=learn.model
# movie_bias = to_np(m.ib(V(topMovieIdx)))
#
m = learn$model
movie_bias = py$to_np(m$ib(py$V(topmoviesidx)))
movie_bias[1:20]
```

```{r}
topmovies$movie_bias = movie_bias[,1]
topmovies = left_join(topmovies, movies %>% select(movieId, title), by = "movieId")
```

Find the lowest rated movies (lowest values of movie bias). 
```{r}
topmovies %>% arrange(movie_bias) %>% slice(1:15)
```

Find the highest rated movies (highest values of movie bias)
```{r}
topmovies %>% arrange(desc(movie_bias)) %>% select(title) %>% slice(1:15)
```

## Interpretation of Embeddings

Get the embeddings of movies
```{r}
# python code
# movie_emb = to_np(m.i(V(topMovieIdx)))
#
movie_emb = py$to_np(m$i(py$V(topmoviesidx)))
dim(movie_emb)
```

Since there are 50 dimensions, a PCA is done to examine the first two principal components. 

```{r}
pcamdl = prcomp_irlba(movie_emb, n = 2)
loadings_df = data.frame(pcamdl$rotation)
scores_df = data.frame(pcamdl$x)
```


```{r}
scores_df$movieidx = py_to_r(topmoviesidx)
scores_df$movieId = idx2movie[as.character(scores_df$movieidx)]
scores_df = inner_join(scores_df, movies %>% select(movieId, title), by = "movieId")
```

Check the movies that have highest and lowest scores for PC1.
```{r}
scores_df %>% arrange(PC1) %>% slice(1:10) %>% select(title)
scores_df %>% arrange(desc(PC1)) %>% slice(1:10) %>% select(title)
```

Check the movies that have highest and lowest scores for PC2.
```{r}
scores_df %>% arrange(PC2) %>% slice(1:10) %>% select(title)
scores_df %>% arrange(desc(PC2)) %>% slice(1:10) %>% select(title)
```

The above results could be used to assign some sort of meaning to the first 2 principal components. Next a scatter plot of PC1 vs PC2 with labels of movies is shown. This could be used to check which movies cluster together.

```{r}
set.seed(12345)
scores_df_samp = scores_df %>% sample_n(50)
ggplot(scores_df_samp) + geom_point(aes(x = PC1, y = PC2)) + 
    geom_text(aes(x = PC1, y = PC2, label = title), size = 2.5, hjust = 0, nudge_x = 0.02) +
     theme_bw()
```

## Summary

Reticulate package is a great addition to R. Working through this example showed that it is not too hard to develop a R version of the analysis in python thanks to reticulate.890opkl,m

```{r}
sessionInfo()
```