Introduction to the Data Scientist Platform

The Data Scientist platform provides an environment to explore collected data, create tasks that periodically process it, easily roll out algorithms to production, and measure their impact.

Process Collected Data

Many recommendation algorithms contain a part that needs to be recomputed as new data (usually user-item interaction events or new items) becomes available. Having implemented such a part, one can create a task using the pipe manager interface that runs periodically to recompute the model. Such periodic, automatic execution can also be useful for computing time-series metrics for the dashboard.

Tasks run on Spark and fetch data from HDFS.

Tasks might take a long time to execute and can benefit from Spark's parallel execution.

Results of the computation (for instance, a matrix representing users) are stored in Redis.
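To make the data flow concrete, here is a minimal sketch of such a task, assuming interaction events are stored as JSON lines on HDFS and the result should land in a Redis hash. The HDFS path, Redis connection details, key names, and function name are illustrative, not actual platform conventions.

```python
from pyspark.sql import SparkSession
import redis

def recompute_item_popularity():
    spark = SparkSession.builder.appName("item-popularity-task").getOrCreate()

    # Fetch user-item interaction events from HDFS via Spark.
    events = spark.read.json("hdfs:///data/events/interactions/*.json")  # assumed path

    # Recompute the "model": here, simply count interactions per item.
    counts = events.groupBy("item_id").count().collect()

    # Store the result of the computation in Redis so that query-time
    # algorithms can read it without touching HDFS.
    r = redis.Redis(host="redis", port=6379)  # assumed connection details
    r.hset("item_popularity", mapping={row["item_id"]: row["count"] for row in counts})

    spark.stop()
```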

Algorithm Computation

Tasks and Algorithms

We call an algorithm some Python code that interacts with the Spark context (and HDFS) and Redis. An algorithm might be used to serve recommendations through an endpoint in the recommendation API; in this case, for performance reasons, it is recommended that the algorithm access only data in Redis (not HDFS). An algorithm can also be used as a task. A task runs repeatedly to compute data. There is no issue accessing HDFS via Spark in a task, since tasks are not expected to complete in a short amount of time.
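As an illustration of the query-time constraint, here is a minimal sketch of an algorithm that serves recommendations from Redis only, assuming a task like the one above has already stored per-item scores in a Redis hash. The key name, connection details, and function name are assumptions.

```python
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)  # assumed connection details

def recommend_popular_items(user_id, n=10):
    # At query time only Redis is touched: no Spark context and no HDFS access,
    # so the endpoint can answer within the recommendation API's latency budget.
    scores = r.hgetall("item_popularity")  # assumed key, written by a periodic task
    ranked = sorted(scores.items(), key=lambda kv: float(kv[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:n]]
```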

Implementing Tasks and Algorithms

Algorithms are implemented in Python.

One of the cool things in the project is that Data Scientists have a full Python environment out of the box, with complete access to the data.

That environment is based on Jupyter, which lets a Data Scientist execute Python via notebooks.

Notebooks are interactive computational environments, accessible from a standard browser, where you can execute Python code.

Schedule Tasks and Deploy Endpoints in Production

Once an algorithm is ready to be used in production, the Data Scientist just has to decorate the corresponding function, in the notebook, with @pipe_query or @pipe_task. Both decorators let the platform admin create an algorithm from the decorated function. "Query" indicates that the function should be used to respond to a recommendation query, whereas "task" indicates that the function is supposed to be used within a task.
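For illustration, here is a minimal sketch of how the decorators might be applied in a notebook. Only the decorator names come from this document; the import path, decorator signatures, and function bodies are assumptions.

```python
from pipe_manager import pipe_query, pipe_task  # hypothetical import path

@pipe_task
def recompute_item_popularity():
    # Runs periodically: recompute the model from HDFS data via Spark
    # and store the result in Redis (see the task sketch above).
    ...

@pipe_query
def recommend_popular_items(user_id, n=10):
    # Serves a recommendation query: reads only precomputed data from Redis.
    ...
```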

After the function is decorated, it becomes available in production. See the tutorial here.