Technical Introduction to Recommendation Service
As described in the product overview of the Recommendation Service, it provides four separate steps that conveniently implement the data acquisition, processing, distribution, and monitoring necessary to deliver recommendations to each user.
1. Data Acquisition
The library available for data collection on web pages is called
peach-collector.js and is integrated in much the same way as Google Analytics or any other common website analytics tool. Website developers will find it very straightforward to add the necessary code snippets to their pages to get started with data collection. The technical description and getting-started guide are documented in the respective section, peach-collector.js.
In addition to collecting information on web pages, consumption in mobile applications should also be considered. PEACH has developed easy-to-integrate libraries for iOS and Android, available on the PEACH repository at https://git.ebu.io/pipe/pipe-ios for iOS and https://git.ebu.io/pipe/pipe_android for Android.
With peach-collector.js it is straightforward for web developers to integrate data acquisition into existing websites and web players and start collecting events. The events are sent to an acquisition REST API (
pipe-collect) provided by the EBU, which collects events one by one or in batches and sends them through the pipeline to be persisted in storage. (More information on the event data format is available here.)
peach-collector.js is integrated on the website and sends events (XHR/AJAX requests) triggered by the user's consumption:
- Pipe Collect REST endpoints are hosted on a Python Flask web server.
- For scaling and fault-tolerance reasons, the data is then pushed to RabbitMQ
- The data is picked up from the data queues by Flume and persisted onto HDFS
- Data is stored in JSON format on Hadoop Distributed File System (HDFS) data stores
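The collection step above can be illustrated with a minimal client-side sketch. The endpoint URL and payload field names below are illustrative assumptions, not the documented pipe-collect schema:

```python
import json
import time
import uuid

# Hypothetical pipe-collect endpoint; the real URL is deployment-specific.
PIPE_COLLECT_URL = "https://example.org/pipe-collect/event"

def build_event(user_id, content_id, event_type="media_play"):
    """Assemble a single consumption event as a JSON-serialisable dict.

    The field names here are assumptions for illustration, not the
    documented PEACH event format.
    """
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "user_id": user_id,
        "content_id": content_id,
        "timestamp": int(time.time()),
    }

event = build_event("user-42", "video-123")
payload = json.dumps(event)

# Sending is a plain HTTP POST, e.g. with the requests library:
# requests.post(PIPE_COLLECT_URL, data=payload,
#               headers={"Content-Type": "application/json"})
```

Each such payload is what the Flask endpoints receive before pushing it on to RabbitMQ.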
The pipe-collect API also provides a batch ingest that collects several events within one API call. A broadcaster might prefer to collect events in batches, for instance to send events from a mobile app only when it is connected via Wi-Fi.
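Client-side batching can be sketched as a small buffer that flushes on demand, e.g. when Wi-Fi connectivity is detected. The class name and `send_batch` callback are assumptions, standing in for whatever performs the actual POST to the batch endpoint:

```python
class EventBatcher:
    """Buffer events client-side and send them in one batch call.
    Illustrative sketch; not the actual PEACH collector implementation."""

    def __init__(self, send_batch, max_batch_size=20):
        self._send_batch = send_batch   # callable taking a list of events
        self._buffer = []
        self._max = max_batch_size

    def record(self, event):
        """Buffer an event; flush automatically when the buffer is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self):
        """Send all buffered events in one batch, e.g. once a mobile
        device is connected via Wi-Fi."""
        if self._buffer:
            self._send_batch(self._buffer)
            self._buffer = []

sent = []
batcher = EventBatcher(sent.append, max_batch_size=2)
batcher.record({"event_type": "media_play"})
batcher.record({"event_type": "media_pause"})  # buffer full: flushes 2 events
```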
2. Recommendation Algorithms
The second step in the recommendation process is described and integrated in the Data Scientist Platform. The part concerning recommendations is mainly the processing and computation of the algorithms that provide data to the Recommendation API. Most operations and actions are handled within the management dashboard, allowing the broadcaster to control the evaluation and distribution endpoints.
See the Introduction to Data Scientist Platform.
3. Processing & Distribution
The algorithms developed by the data scientists are converted into tasks that run periodically on a Spark cluster and process the collected data in order to provide updated models for the recommendations.
- The data stored on HDFS is read by an Apache Spark Cluster
- A recurring task schedules the algorithms created by the data scientists and processes the data in the Spark cluster
- The pre-computed recommendation models are stored inside Redis to allow faster access by the Recommendation API
- The Recommendation API (a Python Flask web server) computes the Recommendation List for a particular user from the data stored in Redis
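In production this aggregation runs as a Spark job over JSON events read from HDFS, but the core logic of a simple trending model reduces to counting and ranking. The sketch below is a plain-Python stand-in for that job; the event fields and the Redis key are illustrative assumptions:

```python
from collections import Counter
import json

# A few JSON event lines, standing in for files read from HDFS.
raw_events = [
    '{"event_type": "media_play", "content_id": "a"}',
    '{"event_type": "media_play", "content_id": "b"}',
    '{"event_type": "media_play", "content_id": "a"}',
    '{"event_type": "media_stop", "content_id": "c"}',
]

def build_trending_model(lines, top_n=10):
    """Count plays per content id and return a ranked list of ids.
    In the real pipeline this runs on the Spark cluster."""
    plays = Counter(
        json.loads(line)["content_id"]
        for line in lines
        if json.loads(line)["event_type"] == "media_play"
    )
    return [cid for cid, _ in plays.most_common(top_n)]

model = build_trending_model(raw_events)  # content "a" ranks first (2 plays)

# The pre-computed model would then be stored in Redis under an agreed
# key, e.g. redis_client.set("model:trending", json.dumps(model)).
```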
Once the models are computed, they are made available through the Recommendation API. The endpoints provide a way to query and execute the computation necessary to build the list of recommendations given a user id and, optionally, some additional parameters. This part uses the results of the previous task computation stored in Redis (i.e. the model).
How the Recommendation API is queried depends on the implementation. In most cases, the broadcaster's backend queries the Recommendation API and builds a more elaborate list with metadata to be returned to the visitor. In some cases, querying the Recommendation API directly from the visitor's device or browser may also work if little or no additional metadata is required. Additionally, a list of content can be placed on a CDN as a fallback in case the Recommendation API experiences issues and does not respond in time.
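The backend integration pattern can be sketched as follows: fetch content ids from the Recommendation API, enrich them with metadata, and fall back to a static CDN-hosted list if the API fails. All function names, fields, and the catalogue below are assumptions for illustration:

```python
FALLBACK_IDS = ["ed1", "ed2"]   # e.g. an editorial list hosted on a CDN

CATALOG = {                      # stand-in for the broadcaster's metadata store
    "v1": {"title": "Morning News", "duration": 1200},
    "v2": {"title": "Nature Documentary", "duration": 3000},
    "ed1": {"title": "Editorial Pick 1", "duration": 600},
    "ed2": {"title": "Editorial Pick 2", "duration": 900},
}

def recommendations_for(user_id, fetch_ids):
    """fetch_ids: callable that queries the Recommendation API and
    returns bare content ids; it may raise on timeout or error."""
    try:
        ids = fetch_ids(user_id)
    except Exception:
        ids = FALLBACK_IDS       # static CDN fallback when the API misbehaves
    # Enrich the bare content ids with metadata before returning them.
    return [{"id": cid, **CATALOG[cid]} for cid in ids if cid in CATALOG]

result = recommendations_for("user-42", lambda uid: ["v2", "v1"])
```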
- The recommendations are made available through API endpoints on the Recommendation API hosted on a Python Flask web server.
- The Recommendation API usually receives additional query data such as user ids or cookie information.
- Recommendation lists usually need to be extended with metadata, as they return only content ids; thus in most implementations they are queried through the broadcaster's backend, which adds the necessary content metadata. However, the infrastructure is built to support direct queries as well.
- The ecosystem made available to data scientists provides tools to compute and periodically push updated lists directly to CDN infrastructures, to be used as a static fallback option.
4. Evaluation and Monitoring
Measuring and improving algorithms requires an A/B or multivariate testing environment. The Recommendation API and the management interface offer these options.
Example of the options available when setting up an experiment on the management interface:
The evaluation of the recommendations follows the workflow below. Based on the configuration of the experiment, a model is selected for a request based on the user's id, their cookies, or randomly. This provides a consistent research and testing framework for data scientists. Additionally, if the model computation does not return a consistent list, a fallback list can be chosen to overcome cold-start and other common issues of recommendations.
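The model-selection step can be sketched as deterministic bucketing on the user id, with a random choice for anonymous requests. The hashing scheme below is an illustrative assumption, not the documented PEACH implementation:

```python
import hashlib
import random

def assign_variant(user_id, variants):
    """Assign an experiment variant (i.e. a model) to a request.

    Hashing the user id keeps assignment stable across requests, which is
    what makes the experiment consistent; anonymous requests fall back to
    a random choice. Illustrative sketch only.
    """
    if user_id is None:
        return random.choice(variants)
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same variant.
v = assign_variant("user-42", ["model_a", "model_b"])
```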
- Introduction to Recommendation Service
- Introduction to Single Sign-On (in progress)
- Introduction to Data Scientist Platform
Once you have gone over the introductions, below are all the tutorials available for the implementation and integration of the PEACH products:
- Data Acquisition
- Recommendation Algorithms
- Processing & Distribution
- Evaluation & Monitoring
- Single Sign-On & Identity Provider
- Data Scientist Platform
The following recommendation algorithms are available in the Data Pipeline:
- 1 - Generic Content-Based Filtering - proposes similar content based on metadata
- 2 - Collaborative Filtering - attempts to deliver relevant content based on similar usage patterns seen in all users
- 3 - Diversified Algorithm - a variant designed to expose the user to a broader catalogue of content
- 4 - Trending - delivers popular content
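The first of these, content-based filtering, can be pictured as scoring items by the overlap of their metadata. The sketch below uses Jaccard similarity over tag sets; the catalogue and tag names are invented for illustration:

```python
# Minimal sketch of generic content-based filtering: score items by the
# Jaccard similarity of their metadata tags. Not the PEACH implementation.
CATALOGUE = {
    "doc1": {"nature", "wildlife", "documentary"},
    "doc2": {"nature", "ocean", "documentary"},
    "news1": {"politics", "daily", "news"},
}

def jaccard(a, b):
    """Overlap of two tag sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def similar_content(content_id, top_n=5):
    """Rank the other catalogue items by metadata similarity."""
    tags = CATALOGUE[content_id]
    scored = [
        (jaccard(tags, other_tags), other_id)
        for other_id, other_tags in CATALOGUE.items()
        if other_id != content_id
    ]
    scored.sort(reverse=True)
    return [cid for score, cid in scored[:top_n] if score > 0]

# doc2 shares "nature" and "documentary" with doc1; news1 shares nothing
# and is therefore excluded from the result.
```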