
Generic Content-Based Filtering

The Algorithm

Content-based filtering is one of the classical approaches to recommendation. It uses content metadata to produce recommendations. Based on a user's watch events, it builds a user representation analogous to that of items (i.e. with the same metadata fields), where the value of each metadata field is derived from the metadata of the items the user watched. This way, a user can be compared to items, and the items most similar to the user can be recommended.

One of the biggest advantages of the content-based approach over the collaborative approach is that it does not rely on other users and their watching behaviour. It therefore deals much better with the item cold-start problem: new items don't need to be watched by anyone before they can be recommended. In addition, recommendations look much less random (though less diverse) than the results of collaborative filtering, which does not use any information from content metadata and is domain independent. For more information on collaborative filtering and its implementation, see this tutorial.

The structure

The algorithm is divided into two notebooks: a task and a query notebook.

The task notebook contains model recomputation and model update logic as a function decorated with @pipe_task. This task is executed regularly to update the model with the most current catalog of items and their metadata.

There is a separate model generated for every metadata field, converting item metadata into "features" (for example a title metadata field would become a vector of word frequencies of the title).
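As an illustration of the title example above, here is a minimal sketch of turning a title into word frequencies. It is not the library's Textual implementation: the function name title_to_word_frequencies and the tokenisation rule are assumptions made for this example only.

```python
import re
from collections import Counter


def title_to_word_frequencies(title):
    """Hypothetical helper: map each word of a title to its number of occurrences."""
    # Lowercase and keep only runs of word characters, so "News:" and "news" match.
    words = re.findall(r"\w+", title.lower())
    return Counter(words)


features = title_to_word_frequencies("The News at Ten: the late news")
print(features)
```

A real feature class would most likely share one vocabulary across all items so that the vectors are comparable. The sketch only shows the per-item counting step.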

No explicit user model is computed and stored on a regular basis. Instead, it is calculated dynamically when a recommendation is requested for a user.

Recommendation requests are processed in the second notebook in the function decorated with @pipe_query. There, a user's history is employed to create a user model, then similar items are found based on the models and the best ones are recommended.


The algorithm is designed so that only minimal adjustments are needed to tailor it to the needs of your organisation. This tutorial covers the steps you need to take as well as a general description of the algorithm.

The workflow

1. Define a metadata scheme

Different broadcasters might have different content metadata fields (also called features) of different importance (also called weights). Because of that, each instance of the algorithm needs to define its own metadata scheme, made up of Field entries.


Field is just a named tuple with four entries. field_name is the name of the feature as it is used in the actual item metadata storage. field_weight is the importance of the feature, from 0 to 1: the greater it is, the more strongly the feature contributes to the derived user taste. field_class is a Python class which defines the behaviour of this feature type; typically you use one of the predefined classes (see next section). field_read is a broadcaster-specific function which converts item metadata in some format into two lists: a list of item ids and a list of the corresponding field values. The input format is arbitrary; the algorithm does not depend on it.

So your first step is to define a scheme which corresponds to the content metadata format of your organisation.
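A scheme along these lines might look like the sketch below. Only the four entry names (field_name, field_weight, field_class, field_read) come from the description above. The Textual and Categorical classes here are empty stand-ins for the library classes, and the reader functions, the item format and the weights are hypothetical.

```python
from collections import namedtuple

# The four entries described in this section.
Field = namedtuple("Field", ["field_name", "field_weight", "field_class", "field_read"])


class Textual:
    """Stand-in for the library's Textual feature class (assumption)."""


class Categorical:
    """Stand-in for the library's Categorical feature class (assumption)."""


def read_title(items):
    """Hypothetical reader: split items into a list of ids and a list of titles."""
    return [item["id"] for item in items], [item["title"] for item in items]


def read_tags(items):
    """Hypothetical reader: split items into a list of ids and a list of tag lists."""
    return [item["id"] for item in items], [item["tags"] for item in items]


# Example weights only; choose values that reflect your own metadata.
SCHEME = [
    Field("title", 0.6, Textual, read_title),
    Field("tags", 0.4, Categorical, read_tags),
]

# Assumed input format: a list of dicts. Your storage format may differ.
sample_items = [
    {"id": "a1", "title": "Evening News", "tags": ["news", "politics"]},
    {"id": "b2", "title": "Cooking Show", "tags": ["food"]},
]
ids, titles = SCHEME[0].field_read(sample_items)
```

Because each reader hides the storage format behind the same two-list return value, the rest of the algorithm can stay unchanged when the metadata source changes.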

Feature classes

To describe metadata fields, you need to specify a feature class for each of them. Different types of metadata require different processing and comparison functions.

Currently "Textual" and "Categorical" feature classes are available. These two types cover the most common metadata fields public broadcasters might have, such as title and description (Textual features) or tags and category (Categorical features).


If a broadcaster has a metadata field which cannot be represented by one of these types, you can define and implement custom types (see the BaseFeature class definition for instructions or the Textual class as an example in pipe-algorithms -> notebooks -> CBFiltering -> Features).


The feature classes are implemented as part of the algorithms library available from your notebook engine. You can find them in the pipe-algorithms -> notebooks -> CBFiltering folder and import them directly to your pipe task and pipe query notebooks.

2. Metadata processing

Now that the metadata scheme is defined, we need to create a task which will run periodically to update models with the newest available content metadata (because new content is being constantly added and old content metadata might be edited).

Broadcasters use different representation and storage for their content metadata. Therefore, the data should be retrieved and then converted into a common format before building models.

Thus, you need to include metadata retrieval and preprocessing logic at the beginning of update_model(): content metadata should be read from your broadcaster-specific storage and converted into the format which the corresponding read_field() expects as input.

3. Model computation


Yes, you've already made all required changes in the task notebook. The text below just describes what will happen in the background.

The next step in update_model() is either model update or model recomputation depending on the amount of time that passed after the last full recomputation.

In case of full model recomputation, recompute_models() is called, where each metadata field is first converted into the format returned by read_field(), and then each feature is computed and the corresponding model saved into Redis. This means that the model is recomputed from scratch - we won't need the old model ever again.


You can change the time period between full model recomputations (2 hours by default), but we recommend recomputing the model regularly to take into account the latest metadata changes on existing items and to ensure full accuracy of the model.
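The choice between a full recomputation and an incremental update can be pictured as a simple elapsed-time check. The sketch below is an assumption about how such a check could look, not the code of update_model(). The function and constant names are hypothetical, and only the 2-hour default comes from this tutorial.

```python
import time

# Default interval between full recomputations, as stated in this tutorial.
FULL_RECOMPUTE_INTERVAL_SECONDS = 2 * 60 * 60


def needs_full_recomputation(last_full_recompute_ts, now=None,
                             interval=FULL_RECOMPUTE_INTERVAL_SECONDS):
    """Hypothetical check: True if the models should be rebuilt from scratch."""
    if last_full_recompute_ts is None:
        # No model has been computed yet, so there is nothing to update.
        return True
    if now is None:
        now = time.time()
    return now - last_full_recompute_ts >= interval
```

Passing `now` explicitly makes the check easy to test. In a scheduled task you would normally leave it out and let the function read the current time.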

In case of an incremental update, the new items which have become available since the last task run are converted into features (by calling compute() on each metadata field) and then appended to the end of each model.


While updating the model makes it possible to include, and therefore recommend, very fresh items, it leads to a loss of model precision. The model should be fully recomputed regularly to prevent errors from accumulating.
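To make the incremental path concrete, here is a rough sketch of appending new items to one per-field model. The in-memory model layout (two aligned lists) and the function name append_new_items are assumptions for illustration. The real models are stored in Redis and their layout is not described here. The `compute` argument stands in for the feature computation of a single field.

```python
def append_new_items(model, new_item_ids, new_values, compute):
    """Hypothetical incremental update of one per-field model.

    model: dict with two position-aligned lists, "item_ids" and "features".
    compute: callable turning one raw field value into a feature.
    """
    known = set(model["item_ids"])
    for item_id, value in zip(new_item_ids, new_values):
        if item_id in known:
            # Existing items are left untouched by an incremental update.
            continue
        model["item_ids"].append(item_id)
        model["features"].append(compute(value))
        known.add(item_id)
    return model


model = {"item_ids": ["a1"], "features": [["evening", "news"]]}
append_new_items(model, ["a1", "c3"], ["Edited Title", "Late Show"],
                 compute=lambda title: title.lower().split())
```

Note that a sketch like this never revisits items already in the model, so edits to their metadata are only picked up by a full recomputation.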

The last step of this pipe_task is saving some statistics such as a timestamp of the last model recomputation and computation duration.

4. Recommendation query

It's time to move to the second part of the algorithm - responding to recommendation requests.


Upon a recommendation request, first, the user model is dynamically computed based on the newest watch events available for this user. Then, similarities between the user and all items are computed per feature, based on the corresponding models. Finally, the total similarity scores are computed based on feature weights, and top recommendations returned.
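The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the notebook's implementation: features are sparse dicts, the user vector is the sum of the vectors of watched items, similarity is cosine similarity, and already watched items are skipped. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_user_vector(history, item_vectors):
    """Sum the feature vectors of all watched items (assumed aggregation)."""
    user = Counter()
    for item_id in history:
        user.update(item_vectors.get(item_id, {}))
    return user


def recommend(history, models, weights, n_recs):
    """models: {field_name: {item_id: vector}}, weights: {field_name: weight}."""
    scores = defaultdict(float)
    for field_name, item_vectors in models.items():
        user_vector = build_user_vector(history, item_vectors)
        for item_id, vector in item_vectors.items():
            if item_id in history:
                continue  # assumption: do not recommend watched items
            # Per-feature similarity, scaled by the feature weight.
            scores[item_id] += weights[field_name] * cosine(user_vector, vector)
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in ranked[:n_recs]]


models = {
    "title": {"a": {"news": 1, "ten": 1}, "b": {"news": 1, "late": 1},
              "c": {"cooking": 1, "show": 1}},
    "tags": {"a": {"politics": 1}, "b": {"politics": 1}, "c": {"food": 1}},
}
weights = {"title": 0.6, "tags": 0.4}
top = recommend({"a"}, models, weights, n_recs=1)
```

In this toy data the user watched item "a", so item "b" scores highest: it shares a title word and a tag with "a", while "c" shares nothing.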

Take a quick look at the arguments of @pipe_query(custom_args=['c', 'n_recs']). Here c stands for the cookie id and n_recs is the number of recommendations to be returned. These parameters should be passed to the request handler.

First, you will need to access the user history for a given user id (which is passed as an input parameter). This part is custom because similarly to the metadata, user history is most likely stored in different ways for each broadcaster. User history is then saved into Redis as a set of item ids in update_user_history().

Nothing else needs to be changed here. The rest of the function is responsible for comparing each feature to the implicit corresponding feature of the user in an efficient way and then combining results into similarity scores using feature weights.


If you are interested in the implementation details of these steps, please take a look at the detailed comments in the code.


This approach with dynamic user model computation on recommendation request allows us to make use of the very latest watch events and therefore to understand the user's taste at this very moment.


You made it! You can now try to train a model from the notebook by running the @pipe_task function and then request recommendations for some test user.

Recommend similar to

A modification of the algorithm is available which recommends videos that both suit the user's taste and are similar to a given item, whose id is passed into @pipe_query as the parameter similar_to. If the parameter is not passed, the normal logic described above is applied. The @pipe_task notebook is not affected by this change in any way.


This is not a simple content-to-content algorithm because it does take into account user history and bases recommendations on that as well as on the item which recommendations should be similar to.


It might be useful to filter out certain items based on some criteria so that they are not recommended. For example, if the relevance period of an item is set to one day but the item was published more than a day ago, it should not appear in recommendations.

For such use cases, we implemented a Filter class (it's also a subclass of BaseFeature). To introduce a filter field, you need to add it to your metadata scheme definition with a zero weight. In the model computation part (i.e. the @pipe_task notebook) nothing else needs to be changed.

Now go to the other notebook, which contains your @pipe_query, and first add the filter names to the list of @pipe_query arguments. Then take a look at Step 4 in the function. Here you will need to call compare_fast() on each filter field, passing the filter type (e.g. Boolean or String) and either a value to match or a tuple of values (e.g. a range of dates).
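The idea of matching either a single value or a range can be illustrated independently of the library. The sketch below does not reproduce compare_fast() or the Filter class, whose signatures are not shown here. The helpers passes_filter and apply_filters and the data layout are hypothetical.

```python
from datetime import date


def passes_filter(value, criterion):
    """Hypothetical check: criterion is a value to match or a (low, high) range."""
    if isinstance(criterion, tuple):
        low, high = criterion
        return low <= value <= high  # inclusive range, e.g. a range of dates
    return value == criterion


def apply_filters(candidates, filter_values, criteria):
    """Keep only candidates that satisfy every filter.

    filter_values: {filter_name: {item_id: value}}
    criteria: {filter_name: criterion}
    """
    return [
        item_id for item_id in candidates
        if all(passes_filter(filter_values[name][item_id], criterion)
               for name, criterion in criteria.items())
    ]


filter_values = {
    "published": {"a": date(2024, 5, 1), "b": date(2024, 5, 9)},
    "is_public": {"a": True, "b": True},
}
criteria = {
    "published": (date(2024, 5, 8), date(2024, 5, 9)),  # relevance window
    "is_public": True,
}
kept = apply_filters(["a", "b"], filter_values, criteria)
```

Here item "a" falls outside the date range and is dropped, while "b" passes both filters.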


Amazing! You now not only have an algorithm customised to your metadata, but have also applied advanced filters on other metadata fields.

Final notes

In case you encounter any problem, please don't hesitate to contact us - we will gladly solve the issue together.

Example notebooks for this tutorial will be available soon.