Collaborative Filtering

The Algorithm

Collaborative filtering is probably the most common approach used in recommender systems. The algorithm does not use content metadata; instead, recommendations are based on user similarity. Suppose the goal is to recommend new videos to a user U. The system searches for other users who watched some of the same videos as U and rated them similarly. Based on those users' ratings, the system predicts how U would rate videos that U has not watched yet but that similar users have. The videos with the highest predicted ratings are recommended.
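The idea described above can be sketched in a few lines of plain Python. This is only an illustration of similarity-based prediction on made-up data, not the implementation shipped in the algorithms library (which relies on matrix factorisation). The user names, video ids and the choice of cosine similarity are all assumptions made for the example.

```python
from math import sqrt

# Toy ratings: user -> {video: rating}. All names and values are made up.
ratings = {
    "U":     {"v1": 5.0, "v2": 1.0},
    "alice": {"v1": 5.0, "v2": 1.0, "v3": 4.0, "v4": 2.0},
    "bob":   {"v1": 1.0, "v2": 5.0, "v3": 2.0, "v4": 5.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the videos both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[v] * b[v] for v in common)
    norm_a = sqrt(sum(a[v] ** 2 for v in common))
    norm_b = sqrt(sum(b[v] ** 2 for v in common))
    return dot / (norm_a * norm_b)

def predict_ratings(target, ratings):
    """Predict ratings for unseen videos as a similarity-weighted average."""
    seen = ratings[target]
    weighted, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(seen, other_ratings)
        if sim <= 0:
            continue
        for video, r in other_ratings.items():
            if video not in seen:
                weighted[video] = weighted.get(video, 0.0) + sim * r
                weights[video] = weights.get(video, 0.0) + sim
    return {v: weighted[v] / weights[v] for v in weighted}

predictions = predict_ratings("U", ratings)
# Videos ordered from highest to lowest predicted rating.
recommended = sorted(predictions, key=predictions.get, reverse=True)
```

Because alice rated v1 and v2 exactly as U did, her ratings dominate the prediction, so v3 ends up ranked above v4.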


Because the algorithm does not use content metadata, recommendations are not based on item similarity and thus tend to be more diverse than those produced by a content-based approach. Additionally, there is no need to worry about the common problem of poor metadata quality.


The collaborative approach suffers from both item and user cold-start: an item cannot be recommended until enough users have watched it, and a user does not get reasonable recommendations until they have watched enough content.


To get the best of both collaborative filtering and the other approach, content-based filtering, we recommend implementing both and combining their results. To learn about the other approach, see this tutorial on content-based filtering.
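One simple way to combine the two result sets is a weighted blend of their scores. The sketch below assumes each approach returns a dictionary of item scores normalised to the range 0 to 1; the function name combine_scores, the weight and the scores are hypothetical and only serve as an illustration.

```python
def combine_scores(cf_scores, cb_scores, cf_weight=0.5):
    """Blend two score dictionaries (item -> score in 0..1) linearly."""
    items = set(cf_scores) | set(cb_scores)
    return {
        item: cf_weight * cf_scores.get(item, 0.0)
        + (1.0 - cf_weight) * cb_scores.get(item, 0.0)
        for item in items
    }

cf_scores = {"v1": 0.9, "v2": 0.4}  # hypothetical collaborative scores
cb_scores = {"v2": 0.8, "v3": 0.7}  # hypothetical content-based scores

blended = combine_scores(cf_scores, cb_scores, cf_weight=0.5)
ranking = sorted(blended, key=blended.get, reverse=True)
```

Note that v3, which has no collaborative score yet (item cold-start), can still appear in the ranking thanks to its content-based score.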

The Structure

The algorithms library contains an implementation of collaborative filtering that is optimised for accuracy. The implementation is based on the Spark mllib framework, which uses matrix factorisation with alternating least squares (ALS).
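To give an intuition for what ALS does, here is a minimal NumPy sketch on a toy rating matrix. It is not Spark's implementation: the function name als, the rank, the regularisation value and the data are assumptions chosen for illustration. The user factors and the item factors are updated in turn, each time solving a small regularised least-squares problem while the other side is held fixed.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.1, iterations=20, seed=0):
    """Factorise R ~= U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)
    for _ in range(iterations):
        # Fix V and solve a small ridge regression for each user.
        for u in range(n_users):
            Vu = V[mask[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + eye, Vu.T @ R[u, mask[u]])
        # Fix U and solve the same kind of problem for each item.
        for i in range(n_items):
            Ui = U[mask[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + eye, Ui.T @ R[mask[:, i], i])
    return U, V

# Toy matrix: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])
mask = R > 0
U, V = als(R, mask)
predicted = U @ V.T  # includes predictions for the unrated cells
rmse = np.sqrt(np.mean((predicted[mask] - R[mask]) ** 2))
```

The product of the two factor matrices fills in the cells that were not rated, and those filled-in values are the predicted ratings.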

The algorithm is divided into two notebooks: a task notebook and a query notebook. The task notebook contains the model computation as a function decorated with @pipe_task. This task is executed regularly to update the model with the newest available user-content interactions (i.e. play events).


The algorithm is implemented in a generic way, so you will only need to make minimal changes to adapt it to your event data source and event format.

The Workflow

1. Specify path to play events data

First, specify data_path, the path to your play events data, in the @pipe_task function.

2. Modify event preprocessing function

generate_model() is called with the specified data_path as an argument. Its first step is to convert the play events into a list of Rating instances from mllib. Rating expects user_id, item_id and rating as constructor parameters. This event preprocessing may therefore include:

  • mapping user ids to integers (note that Rating requires integer ids) and saving this mapping
  • filtering out pause events.
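A preprocessing function along these lines could look like the following sketch. The event format (the keys user_id, item_id, type and rating) and the function name events_to_ratings are assumptions, and item ids are assumed to be integers already. To keep the example runnable without Spark, Rating is replaced by a namedtuple with the same field names as pyspark.mllib.recommendation.Rating.

```python
from collections import namedtuple

# Stand-in with the same field names as pyspark.mllib.recommendation.Rating,
# so the sketch runs without Spark installed.
Rating = namedtuple("Rating", ["user", "product", "rating"])

# Hypothetical event format; adapt the keys to your own play events.
events = [
    {"user_id": "a7f3", "item_id": 101, "type": "play", "rating": 1.0},
    {"user_id": "a7f3", "item_id": 101, "type": "pause", "rating": 0.0},
    {"user_id": "c9d1", "item_id": 102, "type": "play", "rating": 0.5},
    {"user_id": "a7f3", "item_id": 102, "type": "play", "rating": 0.8},
]

def events_to_ratings(events):
    """Drop pause events and map string user ids to consecutive integers."""
    user_index = {}
    ratings = []
    for event in events:
        if event["type"] == "pause":
            continue
        # Assign the next free integer the first time a user id is seen.
        uid = user_index.setdefault(event["user_id"], len(user_index))
        ratings.append(Rating(uid, event["item_id"], event["rating"]))
    return ratings, user_index

ratings, user_index = events_to_ratings(events)
```

The returned user_index is the mapping that needs to be saved, because the query notebook has to translate user ids in the same way.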


The rating value can be broadcaster-specific as well. For example, it could be a number from 0 to 1 indicating the fraction of the video watched (implicit feedback), or a value from 1 to 5 if you use a 5-star explicit rating system.
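Both variants can be expressed as small helper functions. The names implicit_rating and star_rating and their parameters are hypothetical, and the sketch assumes that your events carry the watched time and the video duration in seconds.

```python
def implicit_rating(seconds_watched, duration_seconds):
    """Fraction of the video watched, clamped to the range 0..1."""
    if duration_seconds <= 0:
        return 0.0
    return max(0.0, min(1.0, seconds_watched / duration_seconds))

def star_rating(stars):
    """Validate an explicit rating on a 1-to-5 star scale."""
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    return float(stars)

half_watched = implicit_rating(45, 90)
rewatched = implicit_rating(120, 90)  # more than the full length is capped
four_stars = star_rating(4)
```

Clamping keeps the implicit value within the expected range even when a user rewinds and the watched time exceeds the duration.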


You don't need to change anything else in this notebook: the rest, i.e. the model training and saving part, is generic.

3. Recommendation query

Now go to the notebook that contains @pipe_query. Here you might need to make small changes in get_als_recommendations(), depending on your preprocessing function from step 2. For example, you might need to map a given user id to an integer id using the mapping you created.
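The id translation on the query side might look like the sketch below. It does not reproduce get_als_recommendations(): the function recommend_for_user, the dictionary fake_model and the callback recommend_by_int_id are hypothetical stand-ins for the saved mapping and the trained model.

```python
def recommend_for_user(user_id, user_index, recommend_by_int_id, count=3):
    """Translate an external user id to its integer id, then query the model."""
    int_id = user_index.get(user_id)
    if int_id is None:
        return []  # unknown user (user cold-start): nothing to recommend
    return recommend_by_int_id(int_id, count)

# Hypothetical stand-ins for the saved mapping and the trained model.
user_index = {"a7f3": 0, "c9d1": 1}
fake_model = {0: [102, 105, 101], 1: [101]}

def recommend_by_int_id(int_id, count):
    """Return up to `count` item ids for an integer user id."""
    return fake_model.get(int_id, [])[:count]

known = recommend_for_user("a7f3", user_index, recommend_by_int_id, count=2)
unknown = recommend_for_user("zzzz", user_index, recommend_by_int_id)
```

Returning an empty list for an unknown user makes the cold-start case explicit, so the caller can fall back to another source of recommendations.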


That's it! Now you can train a model by running the @pipe_task function from the first notebook and then request recommendations for a test user by calling the @pipe_query function in the other notebook.

Final notes

If something is still unclear or you have any problem, please don't hesitate to contact us.

There will soon be an example of collaborative filtering that uses the MovieLens dataset.