This chapter proposes an introduction to recommendation techniques and suggests some exercises and projects. We do not present a recommendation system in particular but rather focus on the general methodology. As an illustrative example, we will use the MovieLens data set to construct movie recommendations.
The chapter successively introduces recommendation, user-based collaborative filtering and item-based collaborative filtering. It discusses different methods parameterizations and evaluates their result with respect to the quality of the data set. We show how to generate recommendations using SQL queries on the MovieLens data set. Finally, we suggest some projects for students who want to investigate further the realm of recommendation systems.
Given a set of ratings of items by a set of users, a recommendation system produces a list of items for a particular user, possibly in a given context. Such systems are widely used in Web applications. For example, content sites like Yahoo! Movies (movies), Zagat (restaurants), LibraryThing (books), Pandora (music), StumbleUpon (website) suggest a list of items of interest by predicting the ratings of their users. E-commerce sites such as Amazon (books) or Netflix (movies) use recommendations to suggest new products to their users and construct bundle sales. Usually, they exploit the recent browsing history as a limited context. Finally, advertisement companies need to find a list of advertisements targeted for their users. Some of them, like Google AdSense, rely more on the context (e.g., keywords) than on an estimation of the user’s taste based on her/his recent browsing history. Nevertheless, techniques close to recommendation methodologies are successfully used, for example, by DoubleClick or Facebook ads.
One usually distinguishes two kinds of tasks on data: information retrieval and information filtering. Information retrieval is the problem of answering dynamic queries on static content. Typical examples are answering keyword queries on the Web or SQL queries on a database. The general method relies on data modeling, providing structure and semantics to the data, that is then organized using indexes. Information filtering is the problem of answering static queries on dynamic content. A typical example is the monitoring of Web server logs. The general method is to model the queries, which are then organized as filters. Under this general perspective, recommendation stands between information retrieval and information filtering: data (the set of ratings) varies slowly at the scale of a user but quickly at the scale of the system; queries (a user and possibly some context) depend on a few parameters, each having a wide domain.
Specifically, a recommendation system may either produce top-k ranking (list of “best” items) or prediction of ratings. The focus of the result may be generic (everyone receives the same recommendations), demographic (everyone in the same category receives the same recommendations) or personal. In the present chapter, we are mostly interested in personal recommendation. The context may rely on the user’s current activity or on her/his long-term interests
The information that serves as a basis to recommendation systems consists of the following components:
The ratings matrix is incomplete, being fed only by either acquiring data from the user (e.g., an item is bought, or a level of interest is explicitly collected), or by monitoring her/his activity (an item is visited, which gives some hint on the user’s interests). Recommendation is indeed the process of filling empty cells of the matrix with predicted ratings derived from the other sources of information, including known ratings.
This chapter uses SQL: we assume the reader familiar with the language. You will need
access to a relational database, for example by installing MYSQL on your computer: see
http://www.mysql.com. Here is a very brief introduction to MYSQL commands (refer to the
Web for information on any other SQL database systems). Assuming that you have an account on
the MYSQL server, the connection is established with:
mysql -h [servername] -P [port] -u [login] -p
The utility asks for your passwords and gives you access to the command-line interpreter. You may
directly type SQL commands, or execute command(s) stored in a file myCom.sql:
mysql> source myCom.sql;
We will play with the MovieLens (http://www.movielens.org) data set to generate recommendations of movies. The data set must first be imported in your database. Create the following tables and indexes (the SQL scripts can be found on the book’s site):
You can get the MovieLens 100K Ratings data set from http://www.grouplens.org/node/73. The files are respectively named u.data, u.item, and u.user. They are loaded in the database as follows
Table ratingsdata table now contains the list of ratings. Most of the computation presented further rely on its content. Table items and users contain respectively the list of movies and the list of users.
The quality of a given recommendation method highly depends on the quality of the input. It can be characterized by the support (number of users and items, and distribution of the number of ratings by users and by items) and by the rating quality (distribution of the ratings by user and by movies). Let us consider the support first, which can be determined by the following SQL commands:
We now examine the quality of the ratings with the following SQL queries:
Run the queries and examine the result. Can you determine the distribution law? What happens regarding the distribution of the average ratings by movies, compared to the natural expectation? Try to figure out what would be the normal curve for such an application, and explain the “picks” associated to each rounded rating. Also note the curve behavior for extreme values, and provide an explanation. Finally note that the distribution of ratings is not centered. Why?
As for most data analysis tasks, raw data has to be cleaned up during a preprocessing step. We will limit ourselves to the centering of the ratings distribution. This normalization makes easier the comparison of the users’ behavior. A more involved normalization would, among others, also correct the standard deviation. This is left as an exercise. The centering is obtained by the following query:
Global recommendation roughly retrieves the movies with the best average rating. The query is straightforward:
If you carefully look at the result, you should observe that items with the best average ratings are those with a very small support (only a few ratings are known). This is a classic problem in statistics: an estimation of the average cannot be accurate if the support is too small. Problems related to the low quality of the support are very common in recommendation. In practice, a safe rule is to base any estimation on at least ten measurements. How can you correct the query to obtain a better result? Write and run the corrected query.
The next query retrieves the 40 movies with the largest number of ratings.
Pick 20 of those movies (if possible those you know) and give them a rating using the command:
You may want to check your updates with:
We will use this table to compute some movie recommendations for you. Keep in mind that in a real recommendation system, one has to find recommendation for every user, so scaling is a real issue.
The collaborative filtering class of methods focuses on the ratings matrix and ignores the users or items description. It usually proceeds in two steps: first the correlation step determines a similarity between users (for the user-based approach) or between items (item-based), then the aggregation step predicts new rating from this similarity information.
In the case of user-based collaborative filtering (Figure 18.1), the correlation between a pair of users is computed by comparing their ratings. For simplicity (and efficiency), we only compute the correlation between you and all the other users. Then the ratings of these users for a given item are aggregated to predict the rating of the initial user for this item.
There exist several possible measures of correlations. Let Ui be the vector of ratings of user ui (seen as a line), then
The cosine correlation is obtained by the following query:
You can compare the ratings of user ui to yours with the following query:
You should observe that the estimation of the correlation is not accurate for pairs of users with a small support (i.e., users who rated only a few common items). How can you modify the correlation formula to take the support into account? You should in particular try the other formula suggested above. This should lead to the conclusion that there is a trade-off regarding the support: giving too much weight to the support may bias the result toward popular items, whereas simply ignoring it leads to a bad estimation quality.
We used the normalized table in the SQL commands. What could happen if we had used the initial, non-normalized data?
We keep the correlated users whose behavior is close to yours, into the sim table, using the following command:
Let r(ui,ik) be the rating prediction of user ui and item ik and let St(ui) be the user highly correlated with ui (the users that you put in the sim table). The following formula represent some possible ways of computing aggregated values:
The means aggregation is obtained by:
You probably want to remove the movies that you already know by adding the following filter to the where clause:
You may also want to see the movies you probably dislike, replacing desc by asc in the previous command.
For item-based collaborative filtering (Figure 18.2), we compute the correlation between any pairs of items by comparing their ratings. We then aggregate the ratings of a user for these items to predict the rating of this user for the initial item. To avoid too much computation time, you may only compute the correlation between all items and yours. Let Ik be the vector of ratings of item ik (seen as a column). You may use:
The following projects outline some suggestions to extend the basic recommendation scheme presented above.
So far, we limited the computation to recommendations for a single user. In general, recommendation systems attempt to provide recommendations to every of their users. Several methods can be envisaged to achieve scalability:
Any of these methods can be used as a starting point for a project aiming at scalable computation of the recommendations. Distribution is a suitable objective if you wish to investigate in the context of a practical project some of the main techniques described in the book. You could for instance design and experiment the computation of the correlation and aggregation indicators with the MAPREDUCE paradigm, taking the opportunity of implementing your functions in one of the systems that we present in other Putting into Practice chapters (e.g., HADOOP or COUCHDB).
Some recommendation methods are fully based on a probabilistic model. In general, they consist in choosing a probabilistic model of generation (e.g., using Markov Chains), followed by an estimation of the model parameters (e.g., using Expectation Maximization). The project can be conducted by finding academic references to model-based recommendation. You should then choose a probabilistic model of generation and use the standard statistics methods to estimate the ratings.
Many refinements can improve the recommendations obtained by the basic methods presented here. In particular, in some cases, content filtering, i.e., prediction of ratings given the description of items and users, provides some useful additional information. The description can also be used to increase diversity. For example, one may look for the list of items that maximize the sum of aggregated ratings under the constraint that the elements do not share all their attributes. The description of users and items are respectively in the files u.user and u.item of the MovieLens database. The imdb field of the item table can be used to get more attributes from the IMDB database.
Another standard improvement is to manage more precisely serendipity, i.e., to suggest items that are more risked. It may happen for instance that an item has been rated by only few users. If it turns out that all of them are enthusiastic, it may be worth proposing the item even if the support is low. For example, in user-based collaborative filtering the first aggregation function can be modified to base the means only on users who have produced ratings. It yields the same problem of trade-off between sparsity and noise
Taking context into account to filter the recommendation results is another interesting issue. For example, one may try to produce a recommendation for a given bag of keywords looked up in the attributes of the items. Explanation is another direction of improvement of recommendation (i.e., help the user to understand why s/he got this recommendation). The cold start problem (users or items with very few ratings) is also an important topic of research and can be easily experimented. Recommendation can benefit from interacting with the user to modify the recommendation process based on feedback. Finally, one may try to recommend to a group of users instead of a single user.
The project will try to improve the recommendation in some of these directions.