Error recovery and automatic resource management with Nextflow
Recently a new feature has been added to Nextflow that allows failing jobs to be rescheduled, automatically increasing the amount of computational resources requested.
The problem
Nextflow provides a mechanism that allows tasks to be automatically re-executed when a command terminates with an error exit status. This is useful to handle errors caused by temporary or even permanent failures (e.g. network hiccups, broken disks) that may happen in a cloud based environment.
However, in an HPC cluster these events are rare. In this scenario error conditions are more likely to be caused by a spike in resource usage, i.e. a job exceeding the resources it originally requested. This leads the batch scheduler to kill the job, which in turn stops the overall pipeline execution.
In this context automatically re-executing the failed task is useless, because it would simply reproduce the same error condition. A common workaround is to size the resource request to the needs of the most demanding job, even though this results in a suboptimal allocation for the majority of jobs, which are less resource hungry. Moreover, such an upper limit is difficult to predict: in most cases the only way to determine it is through a painful fail-and-retry approach.
Consider, for example, the following Nextflow process:
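A process along these lines requests a fixed amount of memory and disk storage and retries on failure; the process name, input file name, and disk value below are illustrative:

    process align {
        memory 1.GB              // static request: every attempt asks for 1 GB
        disk 2.GB                // illustrative disk storage request
        errorStrategy 'retry'    // re-execute the task on a non-zero exit status

        input:
        file 'seq.fa' from sequences

        script:
        """
        t_coffee -in seq.fa
        """
    }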
The above definition will execute as many jobs as there are fasta files emitted by the sequences channel. Since the retry error strategy is specified, whenever a task returns a non-zero exit status Nextflow reschedules the job execution, requesting the same amount of memory and disk storage. If the error occurs because t_coffee needs more than one GB of memory for a specific alignment, the task will keep failing, stopping the pipeline execution as a consequence.
Increase job resources automatically
A better solution can be implemented with Nextflow, which allows resources to be defined in a dynamic manner. In this way it is possible to increase the memory request each time a failing task execution is rescheduled. For example:
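Here the memory directive takes a closure, re-evaluated at each execution attempt; the rest of the process is as in the previous sketch:

    process align {
        memory { 1.GB * task.attempt }   // 1 GB on the first attempt, 2 GB on the second, and so on
        errorStrategy 'retry'            // reschedule the task on failure

        input:
        file 'seq.fa' from sequences

        script:
        """
        t_coffee -in seq.fa
        """
    }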
In the above example the memory requirement is defined by using a dynamic rule. The task.attempt attribute represents the current task attempt (1 the first time the task is executed, 2 the second, and so on).
The task will then request one GB of memory. In case of an error it will be rescheduled requesting two GB, and so on, until it is executed successfully or the limit of allowed retries is reached, forcing the termination of the pipeline.
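That retry limit is controlled by the maxRetries directive, which by default allows a single retry. For instance, to allow up to three re-executions:

    maxRetries 3    // give up after the third retry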
It is also possible to define the errorStrategy directive in a dynamic manner. This is useful to re-execute failed jobs only when a certain condition is met.
For example, the Univa Grid Engine batch scheduler returns exit status 140 when a job is terminated because it used more resources than it had requested.
By checking this exit status we can reschedule only the jobs that fail by exceeding their resource allocation. This can be done with the following directive declaration:
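Like memory, the errorStrategy directive accepts a closure, evaluated here against the task's exit status:

    errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }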
In this way a failed task is rescheduled only when it returns the 140 exit status; in all other cases the pipeline execution is terminated.
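Putting the pieces together, a process combining a dynamic memory request with a conditional retry policy could be sketched as follows (the maxRetries value is, again, illustrative):

    process align {
        memory { 1.GB * task.attempt }
        errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }
        maxRetries 3

        input:
        file 'seq.fa' from sequences

        script:
        """
        t_coffee -in seq.fa
        """
    }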
Conclusion
Nextflow provides a very flexible mechanism for defining job resource requests and handling error events. It makes it possible to automatically reschedule failing tasks under certain conditions and to define job resource requests dynamically, so that they adapt to each job's actual needs and optimise overall resource utilisation.