Recommender systems have a long history. The problem of deciding what content should be presented to a user matters to many companies: a good recommender brings in more revenue and, just as importantly in the long run, makes the user experience more delightful. It has also been one of the biggest motivators for the growth of machine learning. A famous milestone was the competition run by Netflix, which offered a million dollars to the team that built the best recommender system. That was in 2009, and in the decade since, a lot of different solutions have been proposed. Nowadays, Facebook, Spotify, and an endless list of other companies use proprietary algorithms to enhance their products. There are many posts on the web that cover recommender systems well. They usually start from the problem definition and then move to collaborative filtering and matrix factorization, techniques that seem to be an integral part of the toolbox. Here we won't have time to discuss every single detail of these techniques; we'll provide the essential information, focus more on intuition, and then move on to different approaches. We should note that a recommendation problem, apart from being treated as a distinct kind of problem, can (depending on the application) be framed as a classification, regression, or ranking problem. Many factors influence how you should look at it, but one distinctive feature of recommendation is the sparsity of the data, something it has in common with NLP applications. In the second part of the post we will see how to make use of it.

In this post, we'll start by sketching the problem. Then we'll briefly explain collaborative filtering and matrix factorization. Finally, we'll talk about the Factorization Machine, which, although introduced about a decade ago, has inspired a lot of new inventions and is still useful to know.

Basic definitions

Let’s start from the basics. Historically, recommender systems were motivated by movie recommendations.

So here, for the sake of argument, we will talk about users and movies, but everything generalizes to any recommendation application. Imagine that we have a database of users, a database of movies, and a database of ratings. We will present the first two tables later, when they are needed; for now, let's focus on the ratings matrix. Suppose ratings vary from 1 to 5. As you can see, the matrix R is only partially filled: not all users have watched and rated all the movies. Our task is to fill in these missing values with reasonable guesses.

We have N users and M movies, so the dimension of this matrix is N \times M. It is sparse because most users watch only a few films. The problem is to figure out which movie a user hasn't seen yet but would probably enjoy. If someone is a fan of horror movies, it is not a good idea to show him or her a comedy; at least we don't have any sign that it would be a good choice.

With the problem formulated in such a way, we can ask what to show user j. Let's use our intuition. Suppose he or she watched a movie (let's say k) and rated it highly. We would like our recommendation list to contain a movie that would be liked as much as k. A simple idea is to select the most similar movie. How can we measure similarity? Assume movies are similar if they are watched by the same users and rated similarly: if movie k and another movie m have some users in common who rated them similarly, movie m seems to be a good candidate for a recommendation. We could also look at this problem from a different perspective. We have some ratings for user j, so let's find the most similar user in terms of watched movies and recommend something that has been rated highly by the latter, provided they happen to have similar tastes (a common set of watched movies doesn't imply they were equally enjoyed by both users). Problems of this kind are the subject of collaborative filtering.

Collaborative Filtering

We ended the last section with a simple idea: search for the most similar movie or user in order to get an educated guess about the movie we should recommend. It seems that we would get even better results if we took all users into account and weighted their impact on the final guess according to how similar they are to the user we want predictions for.

This type of strategy is called Collaborative Filtering. In general, it can be divided into two subcategories:

  • User-User Collaborative Filtering
  • Item-Item Collaborative Filtering

Both approaches are analogous. The first one is easier to think about: we want to find ratings for users, i.e. we ask what rating a given user would give a movie if he or she had a chance to watch it. In the second approach, we just switch to thinking in terms of items. We'll only talk about the first approach. Suppose we want to estimate R_{ij}, i.e. the rating user i would give to movie j if he or she watched it. We use the following formula to find an estimate:

    \[\hat{R_{ij}} = \frac{\sum_{k \in \text{users that watched } j}w_{k}R_{kj}}{\sum_{k \in \text{users that watched } j}w_{k}}\]

In the preceding equation, w_{k} measures the similarity of user k to user i. There are a couple of ways to find a value for it. The only information we have at our disposal is the set of ratings, so we could, for example, compute the correlation coefficient between users. If so, it is important to remember that some users have zero or only a few watched films in common, so we should set some threshold (e.g. 5) on the minimum number of common movies before taking a user into account and computing w_{k}. The second thing is that different people perceive the numbers differently: for one person a "3" can mean a bad movie not worth the time, while for another person, who very rarely gives high scores, it can mean a good film. To remedy this problem we account for bias: instead of estimating the rating directly, we estimate the deviation from each user's average. If we denote by \bar{R_{i}} the average rating given by user i, we end up with the following equation:

    \[\hat{R_{ij}} = \bar{R_{i}} + \frac{\sum_{k \in \text{users that watched } j}w_{k}(R_{kj} - \bar{R_{k}})}{\sum_{k \in \text{users that watched } j}w_{k}}\]
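To make the formula concrete, here is a minimal sketch of this user–user prediction on a hypothetical 3 × 4 ratings matrix (all the numbers are made up; zeros mark missing ratings). We use the Pearson correlation for w_{k} and, as is commonly done, take absolute similarities in the denominator so that negative weights don't cancel out:

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are movies, 0 means "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)
rated = R > 0

def predict(R, rated, i, j):
    """Estimate R[i, j] as user i's mean rating plus a similarity-weighted
    average of the other users' mean-centered ratings of movie j."""
    means = np.array([R[u, rated[u]].mean() for u in range(R.shape[0])])
    num, den = 0.0, 0.0
    for k in range(R.shape[0]):
        if k == i or not rated[k, j]:
            continue
        common = rated[i] & rated[k]
        if common.sum() < 2:  # require some common movies before trusting w
            continue
        w = np.corrcoef(R[i, common], R[k, common])[0, 1]
        num += w * (R[k, j] - means[k])
        den += abs(w)
    return means[i] if den == 0 else means[i] + num / den

print(round(predict(R, rated, 1, 3), 2))  # → 4.17
```

With real data the threshold on common movies would be higher (e.g. 5), but with a 4-column toy matrix we use 2.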

Matrix factorization

In the matrix factorization approach we assume that users can be described by a set of factors, let's say K of them, and so can the movies. We can imagine that a user likes action movies, doesn't like horrors, likes seeing celebrities or good music, and so on. We don't create these features explicitly, because it would take an enormous amount of work and we probably wouldn't find all the relevant factors. Instead, we suppose the existence of hidden factors that are found by mathematical optimization. The assumption that the number of hidden factors is the same for users and movies may seem less convincing, but it makes the problem feasible and it doesn't hurt if we tune it correctly. Suppose we obtained these matrices somehow. Each row of the user matrix has K components corresponding to the hidden factors, or features, of a given user; the movie matrix is constructed similarly.

How can we tell whether these matrices are good? Their purpose is to model the ratings: an estimate is the dot product between a user vector and a movie vector.

    \[r_{ij} \approx u_{i}^{T}w_{j}\]

or in matrix form

    \[R \approx UW^{T}\]

We probably won't do an explicit matrix multiplication of U and W, because R is sparse, whereas U and W are dense, and so is the result of the multiplication. Now suppose that we need to present some content to user i. If the matrices we found are good, the process comes down to finding the movies whose dot products with the vector representation of the user are the biggest. There are some extensions to this simple model: you can add a global bias (\mu), a vector of user biases (b), and a vector of movie biases (c). In this case, you would have N + M + 1 more parameters to tune. The equation would then be as follows:

    \[r_{ij} \approx \mu + b_{i} + c_{j} + u_{i}^{T}w_{j}\]
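With such parameters in hand, recommending comes down to scoring all movies for a user and taking the best ones. Below is a minimal sketch of that scoring step; the parameters are hypothetical (randomly generated here, not fitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 4, 6, 2

# Hypothetical learned parameters of the biased model.
mu = 3.5                     # global bias
b = rng.normal(size=N)       # user biases
c = rng.normal(size=M)       # movie biases
U = rng.normal(size=(N, K))  # user factors
W = rng.normal(size=(M, K))  # movie factors

i = 0
scores = mu + b[i] + c + W @ U[i]    # predicted ratings of user i for all movies
top3 = np.argsort(scores)[::-1][:3]  # indices of the three best-scoring movies
print(top3)
```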

Let's think about how we can understand this approach more intuitively. Up to scalar factors, a dot product represents the cosine of the angle between two vectors (for more, check the Wikipedia page on cosine similarity).
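A quick numerical illustration with made-up vectors:

```python
import numpy as np

def cos(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([2.0, 1.0])
y = np.array([1.9, 1.2])
z = np.array([-1.0, 2.0])

print(cos(x, y))  # close to 1: small angle, similar directions
print(cos(x, z))  # 0: orthogonal vectors, nothing in common
```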

In other words, if we have similar vectors, the angle they create is small. Graphically we can represent it as below:

The relation "having a small angle" is very often transitive, i.e. if vectors x and y have a small angle, and so do y and z, then so do x and z. In a strict mathematical sense it doesn't have to be true, because we can define a small angle as being at most \theta and plot x, y, and z in such a way that angle(x, y) = \theta, angle(y, z) = \theta, and angle(x, z) > \theta. But here we just want to build intuition. Suppose we know the ratings of user i for movies j and m, and also the rating of user k for movie j. What is the rating for the unknown pair, i.e. user k and movie m?

          w_j   w_m
    u_i    5     1
    u_k    5     ?

We can say that 5 means a big value of the dot product between vectors, or a small angle. The same reasoning applies to the value 1, which in this case means a big angle. Let's transfer our intuition to the plot below.

In plain English, we could say that both users liked the same movie, so if one of them dislikes another movie, it is logical to expect that the user who hasn't watched it yet would dislike it too. This is the idea behind the collaborative filtering we discussed earlier.

Apart from that, you can regularize the weights of these matrices so that their performance on unseen data (which is almost all of the entries of the R matrix) is better. We will stick with the simple version. We haven't said yet how these matrices can be obtained; actually, there are a couple of ways. Since the problem is simple, you could derive the gradients yourself and update U and W in an alternating fashion. But the more convenient way is to let an existing library do the job for us. Last but not least, remember to keep some of the ratings as a test set; otherwise you won't know whether you accurately approximate the whole R matrix or just the entries that happen to be known.

import numpy as np
from tensorflow import keras


K = 2   # number of hidden factors
N = 30  # number of users
M = 10  # number of movies

user_input = keras.layers.Input(shape=(1,))
movie_input = keras.layers.Input(shape=(1,))

U = keras.layers.Embedding(N, K)(user_input)
W = keras.layers.Embedding(M, K)(movie_input)
UW = keras.layers.Flatten()(
    keras.layers.Dot(axes=2)([U, W]))

model = keras.Model([user_input, movie_input], UW)

# Toy example
# Below we have three ratings (embeddings are indexed from 0, whereas matrices from 1):
# R_{1, 2} = 5, R_{1, 6} = 5, R_{5, 9} = 3

users = np.array([0, 0, 4])
movies = np.array([1, 5, 8])
ratings = np.array([5, 5, 3])

model.compile(
    loss='mse',
    optimizer=keras.optimizers.SGD(learning_rate=0.05, momentum=0.9),
    metrics=['mse'],
)

model.fit([users, movies], ratings, batch_size=1, epochs=1)

estimated_U, estimated_W = model.get_weights()
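Once the embeddings are trained, an estimate for any user–movie pair is just the dot product of the corresponding rows of estimated_U and estimated_W. The sketch below uses random stand-ins for the trained tables (in the snippet above they come from model.get_weights()):

```python
import numpy as np

# Random stand-ins for the trained embedding tables
# (N = 30 users, M = 10 movies, K = 2 hidden factors).
estimated_U = np.random.default_rng(0).normal(size=(30, 2))
estimated_W = np.random.default_rng(1).normal(size=(10, 2))

# Predicted rating of user 0 for movie 1: dot product of the embedding rows.
r_hat = estimated_U[0] @ estimated_W[1]

# The full (dense) predicted ratings matrix, and the top-3 movies for user 0.
R_hat = estimated_U @ estimated_W.T
top3 = np.argsort(R_hat[0])[::-1][:3]
```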

Factorization Machine

The factorization machine is a very powerful model that appeared for the first time in this paper. So far we have ignored the fact that we might have more data than just the ratings. In our database each user is characterized by some features, and so is each movie.

To better explain what the Factorization Machine is, we must take a short detour. It is the way the article introduces the concept, and it is more natural; we'll later see how it connects to our application. We talked about hidden factors for users and for movies, and about the interpretation of their dot product, which was the rating or its estimate. Here, too, the simple operation of the dot product is of great importance, but now we look for hidden factors not for entities (like users or movies) but for features, in the machine learning sense. We know that the last statement can sound a bit vague; it will become clearer with an example. Suppose we have a simple dataset of houses: their prices, surface area, number of rooms, and a binary feature reflecting whether there is a garden. Suppose the price is our target, i.e. what we want to predict. A simple linear regression would have the following form:

    \[y = a + bx_{surface} + cx_{rooms} + dx_{garden}\]

We could introduce some features manually, for example a feature x_{rooms}x_{garden}, if we believe that having a garden somehow influences how we appraise the number of rooms. In other words, we could add some interactions. Let's add all of them:

    \[y = a + bx_{surface} + cx_{rooms} + dx_{garden} + ex_{surface}x_{rooms} + fx_{surface}x_{garden} + gx_{rooms}x_{garden}\]

Normally we would run an optimization algorithm and find values for a, b, …, g. The difference between this approach and the factorization machine is the way interactions are treated: they are modeled as dot products. Each feature gets a hidden vector, say with K components.

Having said that, the model assumes the following form:

    \[y = a + bx_{surface} + cx_{rooms} + dx_{garden} + v_{surface}^{T}v_{rooms}x_{surface}x_{rooms} + v_{surface}^{T}v_{garden}x_{surface}x_{garden} + v_{rooms}^{T}v_{garden}x_{rooms}x_{garden}\]
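To make this concrete, here is a tiny sketch of the factorization machine prediction for the house example. All parameter values (a, the linear weights, and the hidden vectors v) are made up for illustration, not fitted:

```python
import numpy as np

# Hypothetical fitted parameters for the house example (made up).
a = 10.0
w = {"surface": 2.0, "rooms": 5.0, "garden": 3.0}  # linear weights b, c, d
v = {"surface": np.array([0.5, 0.1]),              # hidden vectors, K = 2
     "rooms":   np.array([0.4, 0.2]),
     "garden":  np.array([0.1, 0.9])}

def fm_predict(x):
    """Bias + linear terms + dot-product-weighted pairwise interactions."""
    names = list(x)
    y = a + sum(w[n] * x[n] for n in names)
    for p in range(len(names)):
        for q in range(p + 1, len(names)):
            y += (v[names[p]] @ v[names[q]]) * x[names[p]] * x[names[q]]
    return y

print(fm_predict({"surface": 80.0, "rooms": 3.0, "garden": 1.0}))
```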

In the process of fitting, the hidden interaction vectors (apart from a, b, c, and d) will be found. Of course, we only wanted to introduce the model gently; this is probably not the best way to solve this particular problem. When are these interactions useful, then? As we saw in the previous section, the dot product measures the similarity between vectors, and this can be helpful to transfer knowledge (the transitive property). We saw that it works for users. What about features? Actually, we can formulate the problem in such a way that users are categorical features of our model.

Basically, we have a ratings matrix and we can treat each rating as an observation. In this case our label is the rating itself and we have two categorical features: the user and the movie. One-hot encoded, our feature vector is of length N + M.

If we wanted to include all interactions, we could have \binom{N + M}{2} of them. But with this encoding only one interaction is non-zero per observation: the one between the given user and the given movie. Our factorization machine then reduces to the matrix factorization model from the previous section.
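A sketch of that encoding; note that the movie index is shifted by N so that users and movies occupy disjoint slots of the feature vector:

```python
import numpy as np

N, M = 30, 10  # numbers of users and movies

def encode(user, movie):
    """One-hot encode a (user, movie) pair into a feature vector of length N + M."""
    x = np.zeros(N + M)
    x[user] = 1.0        # user slot
    x[N + movie] = 1.0   # movie slot, shifted by N
    return x

x = encode(0, 1)
print(np.nonzero(x)[0])  # the two active slots: user 0 and movie 1 (slot 31)
```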

In the article you can find a simple trick that makes training the model easy. We'll omit some details; here's the trick:

    \[\sum_{i=1}^{n}\sum_{j=i+1}^{n}v_{i}^{T}v_{j}x_{i}x_{j}=\frac{1}{2}\sum_{f=1}^{K}\left(\left(\sum_{i=1}^{n}v_{i, f}x_{i}\right)^{2} - \sum_{i=1}^{n}v_{i, f}^{2}x_{i}^{2}\right)\]

We can describe the math in plain English.

  • (\sum_{i=1}^{n}v_{i, f}x_{i})^{2} – the square of a linear combination of the rows. In a sparse setting (which is ours) the inner sum is simply the sum of the hidden vectors corresponding to the non-zero features.
  • \sum_{i=1}^{n}v_{i, f}^{2}x_{i}^{2} – the same interpretation as the previous bullet point, but the squaring is performed first.
  • \frac{1}{2}\sum_{f=1}^{K} – sums over all hidden components, collapsing the result to a scalar.
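We can sanity-check the identity numerically on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 3
V = rng.normal(size=(n, K))  # hidden vectors, one row per feature
x = rng.normal(size=n)       # feature values

# Brute-force sum over all pairs i < j.
brute = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# The O(nK) reformulation from the paper.
fast = 0.5 * np.sum((V.T @ x) ** 2 - (V ** 2).T @ x ** 2)

assert np.isclose(brute, fast)
```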

Below you can find the code, so you can see how easy it is to implement in TensorFlow.

import numpy as np
import tensorflow as tf
from tensorflow import keras


K = 2   # number of hidden factors
N = 30  # number of users
M = 10  # number of movies

user_input = keras.layers.Input(shape=(1,), dtype=tf.int32)
movie_input = keras.layers.Input(shape=(1,), dtype=tf.int32)
# Movie features occupy the slots after the N user slots,
# so movie indices are offset by N.
i = keras.layers.Concatenate()([user_input, movie_input + N])
V = keras.layers.Embedding(
    input_dim=N + M,
    output_dim=K,
    input_length=2,
)
rows = V(i)

first_component = tf.reduce_sum(rows, axis=1) ** 2
second_component = tf.reduce_sum(rows * rows, axis=1)
interaction_value = 0.5 * tf.reduce_sum(first_component - second_component, axis=1)

model = keras.Model([user_input, movie_input], interaction_value)

users = np.array([0, 0, 4])
movies = np.array([1, 5, 8])
ratings = np.array([5, 5, 3])

model.compile(
    loss='mse',
    optimizer=keras.optimizers.SGD(learning_rate=0.05, momentum=0.9),
    metrics=['mse'],
)
model.fit([users, movies], ratings, batch_size=1, epochs=1)

estimated_V = model.get_weights()[0]

Of course, it is not a good idea to omit the linear and bias terms. We'll add linear terms in the subsequent snippet (although for simplicity we'll omit the global bias). The real power of this model shows when we have sparse data and want to find generalizable values for the interactions between features; it might not be the best choice if we have a dense table filled with numbers. So far we haven't mentioned that we have additional data at our disposal. It is easy to imagine that our database of users, our database of movies, and even the ratings themselves (e.g. the time a rating was submitted) contain a lot of extra information. For the sake of argument, let's add one more feature to our problem. Little will change in the code, but we'll provide it for reference.

We want to keep this example simple, so let's add just one more feature: the production year.

    \[rating = \mu + a_{i} + b_{j} + c_{k} + v_{i}^{T}v_{j} + v_{i}^{T}v_{k} + v_{j}^{T}v_{k}\]

import numpy as np
import tensorflow as tf
from tensorflow import keras


K = 2   # number of hidden factors
N = 30  # number of users
M = 10  # number of movies
Y = 3   # number of years, from 2018 to 2020

user_input = keras.layers.Input(shape=(1,), dtype=tf.int32)
movie_input = keras.layers.Input(shape=(1,), dtype=tf.int32)
year_input = keras.layers.Input(shape=(1,), dtype=tf.int32)

# In the shared embedding V, users occupy slots [0, N), movies [N, N + M),
# and years [N + M, N + M + Y), hence the offsets.
i = keras.layers.Concatenate()([user_input, movie_input + N, year_input + N + M])

V = keras.layers.Embedding(
    input_dim=N + M + Y,
    output_dim=K,
    input_length=3,
)

a = keras.layers.Embedding(
    input_dim=N,
    output_dim=1,
    input_length=1,
)

b = keras.layers.Embedding(
    input_dim=M,
    output_dim=1,
    input_length=1,
)

c = keras.layers.Embedding(
    input_dim=Y,
    output_dim=1,
    input_length=1,
)
rows = V(i)

first_component = tf.reduce_sum(rows, axis=1) ** 2
second_component = tf.reduce_sum(rows * rows, axis=1)
interaction_value = 0.5 * tf.reduce_sum(
    first_component - second_component, axis=1, keepdims=True)

# Flatten the (batch, 1, 1) bias embeddings to (batch, 1) so the
# addition with the interaction term broadcasts correctly.
flatten = keras.layers.Flatten()
rating_prediction = (flatten(a(user_input)) + flatten(b(movie_input))
                     + flatten(c(year_input)) + interaction_value)

model = keras.Model([user_input, movie_input, year_input], rating_prediction)

users = np.array([0, 0, 4])
movies = np.array([1, 5, 8])
years = np.array([0, 0, 2])  # 2018, 2018, 2020
ratings = np.array([5, 5, 3])

model.compile(
    loss='mse',
    optimizer=keras.optimizers.SGD(learning_rate=0.05, momentum=0.9),
    metrics=['mse'],
)

model.fit([users, movies, years], ratings, batch_size=1, epochs=1)
estimated_V, estimated_a, estimated_b, estimated_c = model.get_weights()

Summary

This is the end of the first part of a short series devoted to recommendation systems. We have discussed classical approaches to the problem along with intuitions for why they might work. We transferred that intuition to a more complex model, the Factorization Machine, and showed that it is a good starting point for further improvement, for example by including more data. In the next post, we'll take a look at more advanced approaches. Hope to see you soon!