Given a user preferences matrix, defining user preferences to items, the recommendation system seeks to generate estimated values for preference of any user to any item. In particular, this is important as a user may not likely have preferences stated for all items in store.
Similarly, given a set of common features across all items, it is possible them to create an auxiliary matrix which supports comparison between items.
So here we are with three possible matrices, Users2Items, Items2Users (its transpose), and Items2Features. With respect to the movielens dataset, users are reviewers and items are movies and our goal is to find which movies to recommend to users. As a result, our input matrices are Users2Movies, Movies2Users, and Movies2Features. Note that originally, these input matrices are very sparse as few ratings exist between users and items. This sparsity is represented in terms of missing values of NAs. Our goal is to replace those NAs with actual recommended values in a Netflix-like modus operandi similar, “given you liked this, you may want to see this” and “users similar to you, have also been watching this“.
NOTE: The movielens dataset was publicly made available by the Univ. of Minnesotta Movielens Research Group:
# Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
# Framework for Performing Collaborative Filtering. Proceedings of the
# 1999 Conference on Research and Development in Information
# Retrieval. Aug. 1999.
# GroupLens Research currently operates a movie recommender based on
# collaborative filtering:
The following recommendation system is build on R and relies on the iterative refinement model of the Theta values for (users->features->users->features, etc.) as opposed to the full gradient descent application to the full set of users and features derivatives and JCOST.
- The core of the system is found here: recommender_system.R
- The loading of the data is found here:
- The visualization of the data is found here:
- The metrics, convergence and utility functions are found here:
The basic idea behind the recommender system is to produce produce a recommended rating for each missing value. If there are few rating levels, then a customized multiclass classifier for the ratings for classes could be done, in particular if enough samples exist. However, because the user preference data is often so limited, a multivariate linear regression model is used to fit:
- a set of THETA coefficients representing REVIEWER/USER PREFERENCES for MOVIE GENRES – given previous ratings of a REVIEWER/USER across all of his/her movies as found in U2M given past ratings and the corresponding MOVIE FEATURES for each such movie.
- a second set of THETA coefficients representing GENRE FEATURE VALUES for MOVIES – given previous ratings of a MOVIE across all of its reviewers as found in M2U given past ratings and the corresponding REVIEWER/USER PREFERENCES as features for each such reviewer.
Unlike in standard recommendation systems, we do not fit a joint JCOST for the above set of coefficients but rather iterate between fitting the first set and then, using the results to fit the second set, and then repeat until a convergence in both JCOST is achieved.
This has some benefits in debugging the gradient descent and monitoring convergence while for this dataset converges fast enough. So the steps are as follows:
- Load and/or setup the initial version of these three matrices
- Ratings: User2MovieRatings and Movie2UserRatings=t(User2MovieRatings)
- Movie Features; Movie2GenreFeatures
- User Preferences: U2GenrePreferences
- Fit: Users * GenrePreferences = MovieRatings for EACH user and obtain the USER_THETA parameters GenrePreferences of EACH user. That is, the linear regressed model predicted values given the previous user ratings for other movies.
- Fit: Movies * GenreFeatures = UserRatings for EACH movie and obtain the MOVIE_THETA parameters GenreFeatures for EACH movie.
- Iterate over the above two steps until JCOST is acceptably low.
- Now, given these rating matrices, we can obtain similarity between users as well as similarities between movies. We do this using two techniques, distance metrics between users allows to recognize closest user in terms of fitted UserPreferences. Moreover, given the identification of such user, we can identify the set of movies not seen by one but seen by the other and do recommendations based on this.
Once this implemented, the provided R scripts implement along the described process and produce output such as this:
Finally, as the distance metrics between users and movies also represents a edge adjacency, we can plot the resulting social network and clusters associated with either movies, users or the digraph of both as shown here: