Demystifying Nextflow resume
This two-part blog aims to help users understand Nextflow’s powerful caching mechanism. Part one describes how it works, while part two focuses on execution provenance and troubleshooting. You can read part two here.
Task execution caching and checkpointing is an essential feature of any modern workflow manager, and Nextflow provides an automated caching mechanism with every workflow execution. When using the `-resume` flag, successfully completed tasks are skipped and the previously cached results are used in downstream tasks. However, understanding exactly how it works, and debugging when the behavior is not as expected, is a common source of frustration.
The mechanism works by assigning a unique ID to each task. This unique ID is used to create a separate execution directory, called the working directory, where the tasks are executed and the results stored. A task’s unique ID is generated as a 128-bit hash number obtained from a composition of the task’s:
- Input values
- Input files
- Command line string
- Container ID
- Conda environment
- Environment modules
- Any executed scripts in the bin directory
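Concretely, the resulting 128-bit hash determines the task’s work directory path: the first two hex characters name a subdirectory of `work`, and the remaining thirty name the task directory itself (the hash value below is made up for illustration):

```
work/
└── 3b/
    └── 5bcee5d2773d79890e20a0d01ff3cf/
```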
How does resume work?
The `-resume` command line option allows for the continuation of a workflow execution. It can be used in its most basic form with:
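```
nextflow run nextflow-io/hello -resume
```

Here `nextflow-io/hello` is just Nextflow’s demo pipeline; substitute the name of your own pipeline.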
In practice, every execution starts from the beginning. However, when using `-resume`, before launching a task, Nextflow uses the unique ID to check if:
- The working directory exists
- It contains a valid command exit status
- It contains the expected output files
If these conditions are satisfied, the task execution is skipped and the previously computed outputs are used instead. When a task requires recomputation, i.e. when the conditions above are not fulfilled, the downstream tasks are automatically invalidated.
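For example, listing the contents of a task’s work directory shows the metadata files behind these checks (the hashed path is the illustrative one from above, and `output.txt` stands in for whatever outputs your task declares):

```
$ ls -a work/3b/5bcee5d2773d79890e20a0d01ff3cf/
.command.sh  .command.run  .command.out  .command.err  .exitcode  output.txt
```

The `.exitcode` file records the command exit status that Nextflow inspects on resume.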
The working directory
By default, the task work directories are created in the directory from where the pipeline is launched. This is often a scratch storage area that can be cleaned up once the computation is completed. A different location for the execution work directory can be specified using the command line option `-w`. For example:
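```
nextflow run <pipeline name> -w /some/scratch/dir
```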
Note that if you delete or move the pipeline work directory, you will no longer be able to use the resume feature in subsequent runs.
Also note that the pipeline work directory is intended to be used as a temporary scratch area. The final workflow outputs are expected to be stored in a different location specified using the `publishDir` directive.
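As a sketch, a process might publish its final outputs out of the scratch area like this (the process name, tool, and paths are hypothetical):

```
process ALIGN {
    // copy the final outputs from the task work dir to a stable location
    publishDir 'results/alignments', mode: 'copy'

    input:
    path reads

    output:
    path 'aligned.bam'

    script:
    """
    my_aligner --in ${reads} --out aligned.bam
    """
}
```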
How is the hash calculated on input files?
The hash provides a convenient way for Nextflow to determine if a task requires recomputation. For each input file, the hash code is computed with:
- The complete file path
- The file size
- The last modified timestamp
Therefore, even just running `touch` on a file will invalidate the task execution.
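To see this in action, a minimal sketch (the pipeline and file names are made up):

```
nextflow run my-pipeline -resume   # all tasks cached, nothing re-runs
touch data/sample.fastq            # only updates the last modified timestamp
nextflow run my-pipeline -resume   # tasks consuming sample.fastq re-execute
```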
How to ensure resume works as expected?
It is good practice to organize each experiment in its own folder. An experiment’s input parameters can be specified using a Nextflow config file which also makes it simple to track and replicate an experiment over time. Note that you should avoid launching two (or more) Nextflow instances in the same directory concurrently.
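For instance, a per-experiment config might look like this (a minimal sketch; the parameter names are hypothetical):

```
// nextflow.config, kept in the experiment folder
params {
    reads  = 'data/*_{1,2}.fastq.gz'   // hypothetical input parameter
    outdir = 'results'                 // hypothetical output location
}
```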
The `nextflow log` command lists the executions run in the current folder:
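```
$ nextflow log
TIMESTAMP            DURATION  RUN NAME           STATUS  REVISION ID  SESSION ID                            COMMAND
2024-05-01 10:15:03  2m 5s     insane_heisenberg  OK      a1b2c3d      4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run my-pipeline
```

The timestamps, run name, and IDs above are of course illustrative; yours will differ.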
You can use the `-resume` option with a session ID to recover a specific execution. For example:
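```
nextflow run my-pipeline -resume 4dc656d2-c410-44c8-bc32-7dd0ea87bebf
```

The session ID here is the illustrative one from the log output above; use the ID of the run you want to resume.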
Stay tuned for part two where we will discuss resume in more detail with respect to provenance and troubleshooting techniques!