A recommendation system in R, applied with respect to the movielens database

Visualization of Clusters of Movies using distance metrics between movies (in terms of movie genre features) and visualized then as an adjacency Matrix under SNA visualization guidelines. Node size proportional to total degree.
Visualization of Clusters of Movies using distance metrics between movies (in terms of movie genre features) and visualized then as an adjacency Matrix under SNA visualization guidelines. Node size proportional to total degree.

Given a user preferences matrix, defining user preferences to items, the recommendation system seeks to generate estimated values for preference of any user to any item. In particular, this is important as a user may not likely have preferences stated for all items in store.

Similarly, given a set of common features across all items, it is possible them to create an auxiliary matrix which supports comparison between items.

So here we are with three possible matrices, Users2Items, Items2Users (its transpose), and Items2Features. With respect to the movielens dataset, users are reviewers and items are movies and our goal is to find which movies to recommend to users. As a result, our input matrices are Users2Movies, Movies2Users, and Movies2Features. Note that originally, these input matrices are very sparse as few ratings exist between users and items. This sparsity is represented in terms of missing values of NAs. Our goal is to replace those NAs with actual recommended values in a Netflix-like modus operandi similar, “given you liked this, you may want to see this” and “users similar to you, have also been watching this“.


NOTE: The movielens dataset was publicly made available by the Univ. of Minnesotta Movielens Research Group:

# Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
# Framework for Performing Collaborative Filtering. Proceedings of the
# 1999 Conference on Research and Development in Information
# Retrieval. Aug. 1999.
#
# GroupLens Research currently operates a movie recommender based on
# collaborative filtering:
#
#         http://www.movielens.org/


Caveat Emptor:

The following recommendation system is build on R and relies on the iterative refinement model of the Theta values for (users->features->users->features, etc.) as opposed to the full gradient descent application to the full set of users and features derivatives and JCOST.

  • The core of the system is found here: recommender_system.R
  • The loading of the data is found here:
    Movielens dataset loader
    Movielens dataset loader

    Screenshot - 04162015 - 09:27:15 PM


The basic idea behind the recommender system is to produce produce a recommended rating for each missing value. If there are few rating levels, then a customized multiclass classifier for the ratings for classes could be done, in particular if enough samples exist. However, because the user preference data is often so limited, a multivariate linear regression model is used to fit:

  • a set of THETA coefficients representing REVIEWER/USER PREFERENCES for MOVIE GENRES –  given previous ratings of a REVIEWER/USER across all of his/her movies as found in U2M given past ratings and the corresponding MOVIE FEATURES for each such movie.
Iterative convergence loop on user preferences and genre features
Iterative convergence loop on user preferences and genre features
  • a second set of THETA coefficients representing GENRE FEATURE VALUES for MOVIES – given previous ratings of a MOVIE across all of its reviewers as found in M2U given past ratings and the corresponding REVIEWER/USER PREFERENCES as features for each such reviewer.
Iterative convergence - dual step
Iterative convergence – dual step

Unlike in standard recommendation systems, we do not fit a joint JCOST for the above set of coefficients but rather iterate between fitting the first set and then, using the results to fit the second set, and then repeat until a convergence in both JCOST is achieved.

This has some benefits in debugging the gradient descent and monitoring convergence while for this dataset converges fast enough. So the steps are as follows:

  • Load and/or setup the initial version of these three matrices
    • Ratings: User2MovieRatings and Movie2UserRatings=t(User2MovieRatings)
    • Movie Features; Movie2GenreFeatures
    • User Preferences: U2GenrePreferences
  • Fit: Users * GenrePreferences = MovieRatings for EACH user and obtain the USER_THETA parameters GenrePreferences of EACH user. That is, the linear regressed model predicted values given the previous user ratings for other movies.
  • Fit: Movies * GenreFeatures = UserRatings for EACH movie and obtain the MOVIE_THETA parameters GenreFeatures for EACH movie.
  • Iterate over the above two steps until JCOST is acceptably low.

    Dual step, iteration kernel
    Dual step, iteration kernel
  • Now, given these rating matrices, we can obtain similarity between users as well as similarities between movies. We do this using two techniques, distance metrics between users allows to recognize closest user in terms of fitted UserPreferences. Moreover, given the identification of such user, we can identify the set of movies not seen by one but seen by the other and do recommendations based on this.
  • Example of predicted ratings using converged fitting of preferences and genre theta parameters.
    Example of predicted ratings using converged fitting of preferences and genre theta parameters.
  • Example of recommended ratings generated.
    Example of recommended ratings generated.

Once this implemented, the provided R scripts implement along the described process and produce output such as this:

A tabulation of similarity between users.
A tabulation of similarity between users.
A tabulation of similarity between movies.
A tabulation of similarity between movies.

Finally, as the distance metrics between users and movies also represents a edge adjacency, we can plot the resulting social network and clusters associated with either movies, users or the digraph of both as shown here:

plot_recommendation_network_collabfilt

Plotting of the Digraph between Users and Movies as a social network using the user to movie ratings as an edge adjacency matrix.
Plotting of the Digraph between Users and Movies as a social network using the user to movie ratings as an edge adjacency matrix.
Advertisements