About Memoization (Reuse)
By letting the Platform reuse the outputs of your previous runs, you can significantly reduce the time and cost of your workloads.
If memoization is enabled, tasks use previously calculated results instead of generating new ones. This relies on the existence of intermediate files: results of a previous task can be reused only while that task’s intermediate files are retained.
Memoization takes place at job level. If a job has already been executed and a job with the same app and the same inputs is scheduled, even in a completely different task, the new job is not executed. Instead, it returns the existing outputs of the original execution, provided that memoization is enabled and the outputs are still being retained.
Once memoization is triggered and job outputs are reused in a new context, an appropriate job workspace directory, with all intermediate files, is also created for the new job. The files are aliased rather than physically duplicated, so you are not charged twice for the same intermediate files. This gives you the option to view logs and relevant files from the original job (available via the View task logs option in View stats & logs).
Intermediate files
Intermediate files are files that are created during job execution but are not reported as task outputs. Apps are usually configured to report only the relevant outputs to the user, while the files they create as intermediate products go unreported. These files are nevertheless crucial for later executions of the jobs that consume or produce them, because the memoization mechanism reuses them instead of executing those jobs again.
The Platform saves intermediate files for 24 hours by default, but this option can be changed in Project Settings. The minimum value is 1 hour, maximum is 120 hours (5 days). Once this setting is changed, the new value will be applied to tasks executed after the change.
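The retention limits above translate into a simple check of whether a previous job's results are still reusable. This is a rough sketch only: the function names are hypothetical, and it assumes the retention period is counted from the moment the original job finished, which the text above does not specify.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION_HOURS = 24
MIN_RETENTION_HOURS = 1
MAX_RETENTION_HOURS = 120  # 5 days


def validate_retention(hours):
    """Reject retention values outside the 1 to 120 hour range."""
    if not MIN_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_HOURS} and "
            f"{MAX_RETENTION_HOURS} hours, got {hours}"
        )
    return hours


def reuse_possible(finished_at, now, retention_hours=DEFAULT_RETENTION_HOURS):
    """True while the original job's intermediate files are still retained.

    ASSUMPTION: the retention clock starts when the original job finished.
    """
    validate_retention(retention_hours)
    return now - finished_at <= timedelta(hours=retention_hours)


finished = datetime(2024, 1, 1, 12, 0)
still_ok = reuse_possible(finished, finished + timedelta(hours=23))
expired = reuse_possible(finished, finished + timedelta(hours=25))
```

With the default of 24 hours, a rerun 23 hours later can still reuse results, while one 25 hours later cannot.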
Task Stats
When a job within a task reuses outputs from a previous run of the same job, that job will not be visible on the task stats timeline. The example below shows a workflow that is executed again with the same inputs as a previously completed run of the same workflow. The banner above the timeline indicates that jobs have reused already available outputs instead of being executed to produce them again.
Task Logs
Task Logs also indicate which jobs used precomputed outputs instead of being executed again. Such jobs are labelled with an icon and a tooltip (available on hover), as shown in the image below. Please note that logs are not regenerated either, so the ones you see are copies of the logs generated when the job was first executed with the same inputs.
Important considerations
- Folders on Inputs: Memoization will not work if a task has folders set up as its inputs or outputs. Because folder content is currently not tracked, we cannot guarantee that the inputs are the same.
- Non-Deterministic Tools: Be careful with non-deterministic tools. If you need stochastic results for a non-deterministic tool with the same inputs, you should turn off memoization.
- Tools with dynamic inputs/outputs: If your tool dynamically pulls inputs from or pushes outputs to an external source (i.e. the files are not explicitly set as inputs or outputs in the CWL app), you should turn off memoization.
With CWL 1.x apps, you can now address these cases by leaving Memoization enabled on the task level, but disabling the mechanism only for the tool in question by using the CWL 1.x WorkReuse feature.
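As an illustration of the per-tool approach, the sketch below adds the CWL `WorkReuse` hint with `enableReuse` set to false to a tool description and prints it as JSON, which is a valid serialization for CWL documents. The example tool (a `shuf` wrapper) and the helper name `with_reuse_disabled` are hypothetical, and the helper assumes `hints` is written in map form rather than as a list.

```python
import copy
import json


def with_reuse_disabled(tool):
    """Return a copy of a CWL tool description with the WorkReuse hint turned off.

    ASSUMPTION: `hints`, if present, uses the map form ({"ClassName": {...}}).
    """
    patched = copy.deepcopy(tool)
    hints = patched.setdefault("hints", {})
    hints["WorkReuse"] = {"enableReuse": False}
    return patched


# Hypothetical non-deterministic tool: shuffles the lines of a file.
shuffle_tool = {
    "cwlVersion": "v1.1",
    "class": "CommandLineTool",
    "baseCommand": ["shuf"],
    "inputs": {"lines": {"type": "File", "inputBinding": {"position": 1}}},
    "outputs": {"shuffled": {"type": "stdout"}},
    "stdout": "shuffled.txt",
}

print(json.dumps(with_reuse_disabled(shuffle_tool), indent=2))
```

Only the tool carrying the hint opts out; other tools in the same task can still reuse earlier results.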
Activate Memoization for a project
Only project administrators can activate Memoization within a project. Memoization can be activated while creating a project, or subsequently within project settings following the procedure below:
- Go to your project dashboard.
- Click the Settings tab.
- Under Execution settings switch Memoization to On.
Activate Memoization for a draft task
Please note that settings at task level override project-level settings.
- Create a draft task.
- On the draft task page, switch Memoization to On under the Execution Settings tab.
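The precedence rule above (task-level settings override project-level settings) can be summarized in a few lines. This is a hypothetical helper for illustration only; it assumes that a task with no explicit setting falls back to the project's value.

```python
from typing import Optional


def effective_memoization(project_setting: bool, task_setting: Optional[bool] = None) -> bool:
    """Resolve which memoization setting applies to a task.

    ASSUMPTION: a task without an explicit setting (None) inherits the project's value.
    """
    if task_setting is None:
        return project_setting
    return task_setting


# Project has memoization on, but this task switches it off.
applies = effective_memoization(project_setting=True, task_setting=False)
```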
Memoization control via the API
The use_memoization parameter provides control over enabling or disabling Memoization in projects and tasks, while the intermediate_files parameter specifies the period of availability of intermediate files on BioData Catalyst powered by Seven Bridges and can be used at project level only. Please note that project-level settings can be changed only by project administrators.
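To make the two parameters more concrete, here is a sketch of request bodies that set them. Only the parameter names `use_memoization` and `intermediate_files` come from this page; the surrounding JSON layout (the `settings` and `execution_settings` wrappers and the `retention` and `duration` fields) and the helper names are assumptions and should be checked against the API reference before use. The sketch builds the bodies only and sends no requests.

```python
import json


def project_settings_body(use_memoization, retention_hours=None):
    """Build a project update body.

    ASSUMPTION: both parameters live under a top-level "settings" object and
    intermediate_files takes {"retention": ..., "duration": <hours>}.
    """
    settings = {"use_memoization": use_memoization}
    if retention_hours is not None:
        settings["intermediate_files"] = {
            "retention": "LIMITED",
            "duration": retention_hours,
        }
    return {"settings": settings}


def task_settings_body(use_memoization):
    """Build a task body fragment.

    ASSUMPTION: use_memoization sits under "execution_settings" for tasks.
    intermediate_files is deliberately absent: it is project-level only.
    """
    return {"execution_settings": {"use_memoization": use_memoization}}


print(json.dumps(project_settings_body(True, retention_hours=48), indent=2))
print(json.dumps(task_settings_body(False), indent=2))
```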
Memoization is a part of the following API calls:
Project (both use_memoization and intermediate_files parameters are available):
Task (only use_memoization is available):