During my undergraduate degree, I built a movie recommendation system using Singular Value Decomposition (SVD). It was the first time I saw the power of machine learning in action, and the results pleasantly surprised me.

I will walk through the process used in the project; the code is available on my GitHub.

Data Collection

The movie data was pulled from the Open Movie Database (OMDb), one of the most widely used movie data APIs, and stored in a CSV file. The dataset consists of 10,000 movies and their corresponding ratings from 1,000 users.

While there are other files in the dataset, we will only use the u.data file, since it contains all the ratings information we need. Let's have a quick look at it.

user_id item_id rating timestamp
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
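The preview above can be reproduced with pandas. This is a minimal sketch; it assumes the file lives at dataset/u.data, the same path used in the loader below.

import pandas as pd

# Peek at the first few rows of the tab-separated ratings file
names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('dataset/u.data', sep='\t', names=names)
print(df.head())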

Data Preprocessing

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import random


# Load data from disk
def ratings_reader():
   names = ['user_id', 'item_id', 'rating', 'timestamp']
   df = pd.read_csv('dataset/u.data', sep='\t', names=names)
   n_users = df.user_id.unique().shape[0]
   n_items = df.item_id.unique().shape[0]

   # Create r_{ui}, our ratings matrix
   # (assumes user and item IDs are contiguous and 1-indexed)
   ratings = np.zeros((n_users, n_items))
   for row in df.itertuples():
       ratings[row.user_id - 1, row.item_id - 1] = row.rating

   return ratings

ratings = ratings_reader()
n_users = ratings.shape[0]
n_items = ratings.shape[1]

Test-Train Split

def test_train_split(ratings):
   test = np.zeros(ratings.shape)
   train = ratings.copy()

   # For each user, hold out 10 of their observed ratings for the test set
   # (assumes every user has rated at least 10 items)
   for user in range(ratings.shape[0]):
      test_ratings = np.random.choice(ratings[user, :].nonzero()[0],
                                      size=10, replace=False)
      train[user, test_ratings] = 0
      test[user, test_ratings] = ratings[user, test_ratings]

   assert(np.all((train * test) == 0))

   return train, test


train, test = test_train_split(ratings)



def get_rmse(actual, pred):
   # Compute RMSE over the observed (nonzero) entries only
   pred = pred[actual.nonzero()].flatten()
   actual = actual[actual.nonzero()].flatten()
   mse = mean_squared_error(pred, actual)
   return mse**0.5

Training the model

# Set hyperparameters
n_iters = 200    # number of SGD passes over the training ratings
gamma = 0.001    # learning rate
lmbda = 0.01     # regularization strength
k = 80           # number of latent features

users, items = train.nonzero()   # indices of observed training ratings

Bu = np.zeros(n_users)            # user biases
Bi = np.zeros(n_items)            # item biases
P = np.random.rand(n_users, k)    # user latent feature matrix
Q = np.random.rand(n_items, k)    # item latent feature matrix
global_bias = np.mean(train[np.where(train != 0)])   # mean of observed ratings
train_rmses = []
test_rmses = []

Matrix Factorization Algorithm for Collaborative Filtering

Before we start, let's try to visualize how the algorithm works. It is an SVD (Singular Value Decomposition)-based matrix factorization. There is a very intuitive YouTube video by Visual Kernel that explains SVD, so feel free to watch it.

The following is a step-by-step breakdown of applying SVD to a small user-item matrix. Consider the ratings below:

User   Indiana Jones   Star Wars   Empire Strikes Back   Incredibles   Casablanca
Bob    4               5           ?                     ?             ?
Ted    ?               ?           ?                     ?             1
Ann    ?               5           5                     5             ?

Step 1: Initialize the User-Item Matrix ( R )

  • Fill missing values with the average rating or zeros for simplicity:
    \(R = \begin{bmatrix} 4 & 5 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 5 & 5 & 5 & 0 \end{bmatrix}\)

Step 2: Apply SVD

  • Decompose \(R\) into three matrices \(U\), \(\Sigma\), and \(V^T\) such that \(R \approx U \Sigma V^T\).

Example decomposition:

\[U = \begin{bmatrix} 0.58 & 0.58 & 0.58 \\ 0.29 & -0.71 & 0.65 \\ 0.76 & -0.41 & -0.50 \end{bmatrix}\] \[\Sigma = \begin{bmatrix} 9 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}\] \[V^T = \begin{bmatrix} 0.58 & 0.58 & 0.58 & 0.29 & 0.29 \\ 0.29 & -0.71 & 0.65 & 0.29 & -0.29 \\ 0.76 & -0.41 & -0.50 & 0.29 & 0.29 \end{bmatrix}\]

Step 3: Reconstruct the Matrix

  • Use the top \(k\) singular values to approximate \(R\): \(R' = U_k \Sigma_k V_k^T\)

How does this work?

  • The matrix \(R\) is decomposed into three matrices: \(U\), \(\Sigma\), and \(V^T\).
  • \(U\) contains the left singular vectors, \(\Sigma\) is a diagonal matrix with singular values, and \(V^T\) contains the right singular vectors.
  • By selecting the top \(k\) singular values, we reduce the dimensionality of the matrices while retaining the most significant features.
  • The product of these reduced matrices \(U_k \Sigma_k V_k^T\) gives us an approximation of the original matrix \(R\), denoted as \(R'\).
  • This approximation helps in reconstructing the matrix with reduced noise and improved generalization for predicting missing values.

Step 4: Predict Missing Ratings

  • Use the reconstructed matrix \(R'\) to fill in missing ratings: \(R' = \begin{bmatrix} 4 & 5 & 3 & 2 & 1 \\ 2 & 1 & 1 & 1 & 1 \\ 3 & 5 & 5 & 5 & 2 \end{bmatrix}\)

The predicted ratings for missing values are derived from the reconstructed matrix \(R'\).
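To make Steps 1 through 4 concrete, here is a minimal NumPy sketch on the toy matrix above. It is illustrative only: the singular vectors and reconstructed values will differ from the rounded example numbers shown earlier.

import numpy as np

# Step 1: toy user-item matrix with missing ratings filled with zeros
R = np.array([[4, 5, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 5, 5, 5, 0]], dtype=float)

# Step 2: decompose R into U, Sigma, V^T
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Step 3: keep the top-k singular values and reconstruct
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Step 4: the reconstructed entries serve as predictions for the missing ratings
print(np.round(R_approx, 2))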

Matrix Factorization is a collaborative filtering algorithm used to predict user-item interactions. The goal is to factorize the user-item interaction matrix into two lower-dimensional matrices, representing users and items, respectively. These matrices are then used to predict missing entries in the original matrix.

Mathematically, we aim to decompose the user-item interaction matrix \(R\) into two matrices \(P\) and \(Q\) such that: \(R \approx P \cdot Q^T\)

where:

  • \(R\) is the \(n \times m\) user-item interaction matrix.
  • \(P\) is the \(n \times k\) user-feature matrix.
  • \(Q\) is the \(m \times k\) item-feature matrix.
  • \(k\) is the number of latent features.

The predicted rating \(\hat{r}_{ui}\) for user \(u\) and item \(i\) is given by:

\[\hat{r}_{ui} = P_u \cdot Q_i^T + B_u + B_i + \mu\]

where:

  • \(P_u\) is the \(u\)-th row of matrix \(P\)
  • \(Q_i\) is the \(i\)-th row of matrix \(Q\)
  • \(B_u\) is the bias term for user \(u\)
  • \(B_i\) is the bias term for item \(i\)
  • \(\mu\) is the global bias term, representing the average rating.

To learn the matrices \(P\) and \(Q\), we minimize the following regularized squared error loss function using Stochastic Gradient Descent (SGD):

\[\min_{P, Q, B_u, B_i} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( \|P_u\|^2 + \|Q_i\|^2 + B_u^2 + B_i^2 \right)\]

where:

  • \(\mathcal{K}\) is the set of user-item pairs for which the ratings are known.
  • \(\lambda\) is the regularization parameter to prevent overfitting.

The update rules for SGD are as follows:

\(P_u \leftarrow P_u + \gamma \left( e_{ui} \cdot Q_i - \lambda \cdot P_u \right)\)
\(Q_i \leftarrow Q_i + \gamma \left( e_{ui} \cdot P_u - \lambda \cdot Q_i \right)\)
\(B_u \leftarrow B_u + \gamma \left( e_{ui} - \lambda \cdot B_u \right)\)
\(B_i \leftarrow B_i + \gamma \left( e_{ui} - \lambda \cdot B_i \right)\)

where:

  • \(e_{ui} = r_{ui} - \hat{r}_{ui}\) is the prediction error.
  • \(\gamma\) is the learning rate.

By iteratively updating the matrices \(P\) and \(Q\), and the bias terms \(B_u\) and \(B_i\), we can minimize the loss function and obtain the optimal factorized matrices for predicting user-item interactions.

Now, let’s implement the algorithm.

def trainer(n_iters, users, items, P, Q, Bu, Bi, global_bias, train, test, gamma, lmbda):
   # Use stochastic gradient descent to minimize the prediction error
   train_rmses = []
   test_rmses = []

   for n in range(n_iters):
      for u, i in zip(users, items):
         prediction = P[u,:].dot(Q[i,:].T) + Bu[u] + Bi[i] + global_bias
         e = train[u,i] - prediction
         Bu[u] += gamma * (e - lmbda * Bu[u])
         Bi[i] += gamma * (e - lmbda * Bi[i])
         P_u = P[u,:].copy()   # keep the old user vector for the item update
         P[u,:] += gamma * (e * Q[i,:] - lmbda * P[u,:])
         Q[i,:] += gamma * (e * P_u - lmbda * Q[i,:])

      # Reconstruct the full prediction matrix for this iteration
      pred = P.dot(Q.T) + Bu[:, np.newaxis] + Bi[np.newaxis, :] + global_bias

      train_rmse = get_rmse(train, pred)
      test_rmse = get_rmse(test, pred)
      train_rmses.append(train_rmse)
      test_rmses.append(test_rmse)
      print(n+1, train_rmse, test_rmse)

   return train_rmses, test_rmses


# Start training the model
train_rmses, test_rmses = trainer(n_iters, users, items, P, Q, Bu, Bi, global_bias, train, test, gamma, lmbda)

Plotting the learning curves

def draw_learning_curve(n_iters, train_rmse, test_rmse):
   # Plot the train/test RMSE curves to see how they converge
   plt.plot(range(n_iters), train_rmse, label='Train', linewidth=5)
   plt.plot(range(n_iters), test_rmse, label='Test', linewidth=5)
   plt.xlabel('Iterations')
   plt.ylabel('RMSE')
   plt.legend(loc='best', fontsize=30)
   plt.show()
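
For example, passing in the RMSE lists recorded during training:

draw_learning_curve(n_iters, train_rmses, test_rmses)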


def cosine_similarity(Q):
   # finds the cosine similarity based on item feature matrix
   # parameter: Q ( item latent feature matrix)
   sim = Q.dot(Q.T)
   norms = np.array([np.sqrt(np.diagonal(sim))])
   return sim / norms / norms.T


def get_top_k(sims, movie_id, k):
   # Return a tuple of the top-k most similar movies for a given movie_id,
   # skipping the movie itself (always the most similar to itself)
   movie_indices = np.argsort(sims[movie_id, :])[::-1]
   top_k = tuple(movie_indices[1:k+1])
   return top_k
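
As a quick sketch of how these two helpers fit together (movie_id 0 and k = 5 are arbitrary example values):

sims = cosine_similarity(Q)
similar_movies = get_top_k(sims, movie_id=0, k=5)
print(similar_movies)   # indices of the 5 movies most similar to movie 0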

Inference

def infer(user_id, item_id, P, Q, Bu, Bi, global_bias):
    """
    Make a prediction for a given user and item using the trained matrices.
    
    Parameters:
    user_id (int): The ID of the user.
    item_id (int): The ID of the item.
    P (numpy array): User latent feature matrix.
    Q (numpy array): Item latent feature matrix.
    Bu (numpy array): User biases.
    Bi (numpy array): Item biases.
    global_bias (float): The global bias.
    
    Returns:
    float: The predicted rating.
    """
    prediction = P[user_id, :].dot(Q[item_id, :].T) + Bu[user_id] + Bi[item_id] + global_bias
    return prediction

# Example usage:
user_id = 0  # example user ID
item_id = 0  # example item ID
predicted_rating = infer(user_id, item_id, P, Q, Bu, Bi, global_bias)
print(f"Predicted rating for user {user_id} and item {item_id}: {predicted_rating}")

The full-stack project with the code is available here.

References:

  • https://www.ethanrosenthal.com/2016/01/09/explicit-matrix-factorization-sgd-als/
  • https://www.youtube.com/watch?v=vSczTbgc8Rc