How to define a task
Tasks are grouped by organization, identified by codops (for example, debr or sesr). A dedicated component called the Tasks Scheduler executes tasks one by one for every organization in an indefinite loop.
There is also a dedicated loop for tasks that fetch metadata, shared by all organizations. It has a shorter timeout limit than the other loops.
Some of the features provided by the Tasks Scheduler include:
- extracting code of the algorithm according to the configuration of the task
- distributing dependencies of the task to the cluster
- enforcing a timeout of one hour for every task to prevent one task from blocking the loop (5 minutes for the metadata loop)
- collecting some basic metrics about task execution (duration, number of successes/failures)
- capturing logs from the task
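The timeout behaviour can be illustrated with a small sketch (this is only an illustration of the concept, not the scheduler's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

TASK_TIMEOUT = 60 * 60      # one hour for the regular organization loops
METADATA_TIMEOUT = 5 * 60   # five minutes for the shared metadata loop

def run_with_timeout(task, timeout):
    """Run a single task, reporting success, failure, or timeout."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(task)
        try:
            return ("success", future.result(timeout=timeout))
        except FutureTimeout:
            # The task exceeded its time budget; the loop moves on.
            return ("timeout", None)
        except Exception as exc:
            # The task raised; record the failure for the metrics.
            return ("failure", exc)

print(run_with_timeout(lambda: 21 * 2, timeout=1.0))  # ('success', 42)
```

A result tuple like this is also where the basic metrics (duration, success/failure counts) would naturally be collected.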
Prerequisites
To complete this tutorial, you will need to be familiar with the PEACH Lab environment.
Write and set up a task
Let's create a simple task that will be executed repeatedly to compute some data and store it so it can be accessed quickly from an endpoint.
For this example, we wrote some simple TensorFlow code that trains a model and then saves the computed weights in Redis for later use from the API endpoints.
You are free to distribute your code across many cells (you will need to enable import of the full notebook using full_notebook, see later) or leave everything in one cell (the function will be picked up automatically together with the whole cell). Just don't forget to include your dependencies and supporting code there as well.
The important part is to have an entry function that will be executed; in our case it is called train_model. You don't have to call this function manually, it will be executed automatically in the task loop.
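A minimal sketch of such a notebook cell might look like this. The training and the Redis client are simplified stand-ins (only the train_model entry point name comes from the example above):

```python
# Hypothetical stand-in for a Redis client: a plain dict with set/get.
class FakeRedis:
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

storage = FakeRedis()

def train_model():
    """Entry point executed by the task loop."""
    # Pretend "training": recover the slope of y = 2 * x by averaging ratios.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]
    weight = sum(y / x for x, y in zip(xs, ys)) / len(xs)
    # Persist the computed weight so API endpoints can read it quickly.
    storage.set("model:weight", weight)
    return weight
```

In the real notebook this is where TensorFlow and an actual Redis client would be used; the scheduler only needs the train_model function to exist in the cell.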
Registering a task
Now that we have the code for our task, it's time to register it in the organization's task loop. There are two ways of registering a task:
- via creating a peach.yaml file in the root of the repository with the definitions of all your tasks
- via splitting your task definitions into many .yaml files and placing them in the peach.conf directory located in the root of the repository
Your repository needs to either have internal access or, if private, to have access granted to the pipe-jupyter-user user with Reporter privileges.
In this example, let's define our task inside the peach.conf folder. We create a tf.yaml file inside the folder with the following content:
```yaml
codops: default
tasks:
  tf_task:
    notebook: tensorflow_demo.ipynb
    method: train_model
dependencies:
  - tensorflow==2.0.0-alpha0
```
- The file has to be inside the peach.conf folder at the root level, or within a first-level subfolder of the peach.conf folder
- It has to be a YAML file
- The organization code (codops) has to be defined at the root level of the configuration file
- The configuration file needs to have a tasks key, containing an embedded key-value structure task_name: task_definition. It is allowed to define multiple tasks here. In our case we call our task tf_task and nest all the information about the task inside this key
- notebook is the relative path to the notebook file with your code
- method is the entry point function that will be called on execution. Only the cell with the entry point function will be used (set full_notebook: true here if you want the full notebook to be used)
- dependencies is a global list of dependencies for all the tasks and endpoints (all tasks per organization are executed in the same environment). You may also define a list of dependencies for a single task only (see the example below)
To scope dependencies to a single task, nest them under the task definition:

```yaml
codops: default
tasks:
  tf_task:
    notebook: tensorflow_demo.ipynb
    method: train_model
    dependencies:
      - tensorflow==2.0.0-alpha0
```
If using only one peach.yaml file, just place all the task definitions inside this file.
After saving the file and committing the changes to the git repository (don't forget to label it! Read more about the peach-lab label in the Introduction), the task will be added to the next task loop!
You need to wait for the next task loop execution for your task to be registered.
All the code in the repository should be in the master branch
Important: the task executor will ALWAYS execute the current version defined in the MASTER branch (not the version that was in the notebook at the moment the task was registered). We strongly recommend working in separate branches during development.
Tasks with arguments
Tasks accept optional arguments, for example:
```yaml
dependencies:
  - orjson==3.3.1
  - pyarrow
tasks:
  related_publications_labse:
    executor: python
    notebook: notebooks/zzebu/techebu/related_publication_labse.ipynb
    method: task
    args:
      user: f57246b6-1848-4b1a-a8a2-d8469a050025
      item_id: 35421
      message: "used for testing only"
      my_dict:
        item1: 'message'
        second_dict:
          sub_item1: 8734
          sub_item2:
            - 'test'
            - 23.45
  no_args:
    executor: python
    notebook: notebooks/zzebu/techebu/related_publication_labse.ipynb
    method: task
```
We defined here two tasks with the same code but different arguments, which conveniently allows reusing the same code for different tasks.
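On the notebook side, a plausible shape for such an entry function is sketched below, assuming the scheduler passes args as keyword arguments (the parameter names mirror the config above; the function body is purely illustrative):

```python
def task(user=None, item_id=None, message=None, my_dict=None):
    """Entry point reused by both tasks; the arguments come from the YAML args."""
    if user is None:
        # The no_args task: no arguments were configured.
        return "ran without arguments"
    # my_dict would carry the nested structure from the config.
    return f"ran for user {user}, item {item_id}: {message}"

# Roughly what the scheduler would do for each task definition:
print(task(
    user="f57246b6-1848-4b1a-a8a2-d8469a050025",
    item_id=35421,
    message="used for testing only",
    my_dict={"item1": "message"},
))
print(task())  # the no_args task
```

Giving every parameter a default keeps the same function usable for both the task with args and the one without.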
Metadata loop
By default, tasks are executed in the loop matching their codops. However, for tasks that need to fetch metadata from APIs on a more frequent basis, there is a dedicated metadata loop, shared by all broadcasters. The timeout on this loop is set to 5 minutes.
Running your task in the metadata loop can be achieved by setting the loop field to metadata:
```yaml
codops: default
tasks:
  tf_task:
    loop: metadata
    executor: python
    notebook: articles_fetch.ipynb
    method: fetch_from_api
```
Please use this task loop only for metadata-related tasks, not for any computations.
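A metadata task should stay lightweight, roughly like the sketch below (the URL, response format, and injectable fetcher are hypothetical; only the fetch_from_api entry point name comes from the config above):

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint; replace with the real metadata API.
API_URL = "https://example.org/api/articles"

def fetch_from_api(fetch=None):
    """Entry point for the metadata loop: a quick fetch, no heavy computation."""
    # Allow injecting a fetcher so the function can be tried without network access.
    fetch = fetch or (lambda url: urlopen(url).read())
    articles = json.loads(fetch(API_URL))
    # Keep only the fields the endpoints need.
    return {a["id"]: a["title"] for a in articles}

# Trying it out with a stubbed response instead of a live API:
stub = lambda url: b'[{"id": 1, "title": "Hello"}, {"id": 2, "title": "World"}]'
print(fetch_from_api(fetch=stub))  # {1: 'Hello', 2: 'World'}
```

Anything heavier than this kind of fetch-and-reshape belongs in the regular organization loop, which has the one-hour timeout.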
Displaying task information
A dedicated notebook is provided to display information about tasks. You have to execute all the cells to display the information, which looks as follows:
- You just need to run the cell calling initialize_tasks_info()
- Select an organization from the dropdown list
- You can see the list of all the tasks for the selected organization, their state (PENDING, RUNNING, FINISHED), the status of the last execution, and the start time and duration of the last execution
- You can select a particular task to display detailed information about its previous executions
- The table contains the version (a shortened form of the head commit of the executed notebook), the start time, the status of the particular execution, and the time taken by the task
- By default the logs frame displays all the available logs, but if you want to filter only the logs of a particular execution you can do so by pressing "show below" next to that execution
- A graph shows the duration of each execution of this task and its status (a blue dot if it was successful, a red cross if the task failed)
- Logs frame
Next steps
Check useful tips and more examples in the lab-tutorials repository.