LivingDataLab - Collaberative filtering from scratch

1 Introduction

In this article we will look to build a collaberitive filtering model from scratch, using pure Pytorch and some support from the Fastai deep learning library. We will also look at the theory and mathematics behind collaberative filtering.

2 Dataset

We will use the MovieLens dataset, and a special subset curated by fastai of the 100,000 movies. This consists of 2 separate tables for ratings and movies, which we will join together.

# Download data
path = untar_data(URLs.ML_100k)

# Load ratings table
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()

	user	movie	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

# Load movie table
movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()

	movie	title
0	1	Toy Story (1995)
1	2	GoldenEye (1995)
2	3	Four Rooms (1995)
3	4	Get Shorty (1995)
4	5	Copycat (1995)

# Merge tables
ratings = ratings.merge(movies)
ratings.head()

	user	movie	rating	timestamp	title
0	196	242	3	881250949	Kolya (1996)
1	63	242	3	875747190	Kolya (1996)
2	226	242	5	883888671	Kolya (1996)
3	154	242	3	879138235	Kolya (1996)
4	306	242	5	876503793	Kolya (1996)

# Create dataloader
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

	user	title	rating
0	542	My Left Foot (1989)	4
1	422	Event Horizon (1997)	3
2	311	African Queen, The (1951)	4
3	595	Face/Off (1997)	4
4	617	Evil Dead II (1987)	1
5	158	Jurassic Park (1993)	5
6	836	Chasing Amy (1997)	3
7	474	Emma (1996)	3
8	466	Jackie Chan's First Strike (1996)	3
9	554	Scream (1996)	3

3 Theory

The key data here is the ratings i.e. the user-movie ratings, as we can see in the listing above. In collaberative filtering, an easier way to see this is as a user-item matrix, with movies as columns, users as rows, and cells as the ratings for each user-movie combination.

We can see here some cells are not filled in which are ratings we do not know, these are the values we would like to predict so we can know for each user which movie they would like.

So how might we approach this? If we imagine there are some reasons that effect peoples preferences, lets call them factors such as genre, actors etc then that might give us a basis to figure out which users would like each movie. What if we could represent these factors as a set of numbers? then we could represent each user and movie as a unique set of these numbers (or vectors) representing how much of each of the factors that user or movie represented.

Then we could say, we want each of these user and movie factors vectors when multipled to equal a rating. This would give us a basis to learn these factors, as we have ratings we know, and we could use these to estimate the ratings we don’t know. This approach of using movie vectors multipled by user vectors and summed up is known as the dot product and is the basis of matrix multiplication.

So we can randomly initialise these user and movie vectors, and learn the correct values for these that predict the ratings we know, using gradient descent.

So to do the dot product we could look up the index of each user and movie, then multiply the vectors. But neural networks don’t know how to look up using an index, they only multiply matrices together. However we can do a index looking up using matrix multiplication by using one-hot encoded vectors.

The matrix you index by multiplying by a one-hot encoded matrix, is called an embedding or embedding matrix. So our model will learn the values of these embedding matrices for the users and movies, using gradient descent.

It’s actually very easy to create a collaberative filtering model using fastai’s higher level methods - but we are going to explore doing this from scratch in this article.

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))

learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	0.937713	0.953276	00:11
1	0.838276	0.873933	00:11
2	0.717332	0.832581	00:11
3	0.592723	0.818247	00:11
4	0.476174	0.818869	00:11

4 Collaberative filtering - Model 1

We will now create our first collaberative filtering model from scratch. This will contain the embedding matrices for the users and movies, and will implement a method (in Pytorch this is normally the forward method) to do a dot product of these 2 matrices.

So the number of factors for each user and movie matrix will be determined when the model is initialised.

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)

So the input x to the model will be a tensor of whatever the batch size is multiplied by 2 - where the first column (x[:, 0]) contains the user IDs and the second column (x[:, 1]) contains the movie IDs. So the input essentially has 2 columns.

x,y = dls.one_batch()
x.shape

torch.Size([64, 2])

So we have defined our architecture and so can now create a learner to optimise the model. Because we are building the model from scratch we will use the Learner class to do this. We will use MSE as our loss function as this is a regression problem i.e. we are predicting a number, the rating.

n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
# Create model with 50 factors for users and movies each
model = DotProduct(n_users, n_movies, 50)
# Create Learner object
learn = Learner(dls, model, loss_func=MSELossFlat())

# Train model
learn.fit_one_cycle(5, 5e-3)

epoch	train_loss	valid_loss	time
0	1.336391	1.275613	00:09
1	1.111210	1.126141	00:09
2	0.988222	1.014545	00:09
3	0.844100	0.912820	00:09
4	0.813798	0.898948	00:09

5 Collaberative filtering - Model 2

So how can we improve the model? we know the predictions - the ratings: should be between 0-5. Perhaps we can help our model by ensuring the predictions are forced between these valid values? We can use a sigmoid function to do this.

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch	train_loss	valid_loss	time
0	0.985542	1.002896	00:10
1	0.869398	0.914294	00:10
2	0.673619	0.873486	00:10
3	0.480611	0.878555	00:10
4	0.381930	0.882388	00:10

6 Collaberative filtering - Model 3

So while that didn’t make a huge difference, there is more we can do to improve. At the moment by using our user and movie embedding matrices, this only gives us a sense of how a particular movie or user is described as specific values for these latent factors. What we don’t have is a way to indicate something general about a particular movie or user such as this person is really fussy, or this movie is generally good or not good.

We can encode this general skew for each movie and user by including a bias value for each, which we can add after we have done the dot product. So lets add bias to our model.

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch	train_loss	valid_loss	time
0	0.941588	0.955934	00:10
1	0.844541	0.865852	00:10
2	0.603601	0.862635	00:10
3	0.420309	0.883469	00:10
4	0.293037	0.890913	00:10

So this started much better, but then got worse! Why is this? This looks like a case of overfitting. So we can’t use data augmentation for this type of model, so we need some other way to stop the model fitting too much to the data i.e. some kind of regularization. One way to do this is with weight decay

7 Weight decay

So with weight decay, aka L2 regularization - adds an extra term to the loss function as the sum of all the weights squared. This will penalise our model for getting more complex than it needs to be i.e. overfitting, so this will encorage our model to have weights as small as possible the get the job done i.e. occams razor.

Why weights squared? The idea is the larger the model parameters are, the steeper the slope of the loss function. This can cause the model to focus too much on the data points in the training set. Adding weight decay will make training harder, but will force our model to be as simple as possible, less able to memorise the training data - and force it to generalise better.

Rather than calculate the sum of all weights squared, we take the derivative which is 2 x parameters and addd to our loss e.g.

parameters.grad += wd * 2 * parameters

Where wd is a factor we can control.

x = np.linspace(-2,2,100)
a_s = [1,2,5,10,50] 
ys = [a * x**2 for a in a_s]
_,ax = plt.subplots(figsize=(8,6))
for a,y in zip(a_s,ys): ax.plot(x,y, label=f'a={a}')
ax.set_ylim([0,5])
ax.legend();

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	0.928223	0.957245	00:11
1	0.886639	0.881928	00:10
2	0.771433	0.832266	00:11
3	0.597242	0.821840	00:11
4	0.506455	0.822054	00:10

8 Manual embeddings

So we used a pre-made Embeddings class to make our embedding matrices, but did’nt see how it works so lets make our own now. So we need a randomly initialised weight matrix for each. By default Pytorch tensors are not added as trainable parameters (think why, data are tensors also) so we need to create it in a particular way to make the embeddings trainable, using the nn.Parameter class.

# 
# Create tensor as parameter function, with random initialisation
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
    
# Create model with our manually created embeddings
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	0.923637	0.948116	00:12
1	0.869177	0.879707	00:11
2	0.731731	0.836616	00:12
3	0.590497	0.825614	00:11
4	0.484070	0.825161	00:11

9 Collaberative filtering - Model 4

Our models developed so far are not deep learining models, as they dont have many layers. To turn this into a deep learning model we need to take the results of the embedding lookup and concatenate those activations together - this will then give us instead a matrix that we can then pass through linear layers with activation functions (non-linearities) as we would in a deep learning model.

As we are concatinating embeddings rather than taking their dot product, the embedding matrices can have different sizes. Fastai has a handy function for reccomending optimal embedding sizes from the data.

embs = get_emb_sz(dls)
embs

[(944, 74), (1665, 102)]

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

model = CollabNN(*embs)

learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

epoch	train_loss	valid_loss	time
0	0.943013	0.951147	00:11
1	0.913711	0.900089	00:11
2	0.851407	0.886212	00:11
3	0.816868	0.878591	00:11
4	0.772557	0.881083	00:11

Fastai lets you create a deep learning version of the model like this with the higher level function calls by passing use_nn=True and lets you easily create more layers e.g. here with two hidden layers, of size 100 and 50, respectively.

learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	1.002377	0.995780	00:13
1	0.879825	0.928848	00:13
2	0.888932	0.899229	00:13
3	0.821391	0.871980	00:13
4	0.796728	0.869211	00:13

10 Conclusion

So we have built a collaberative filtering model from scratch, and saw how it can learn latent factors from the data itself.

Collaberative filtering from scratch

Subscribe