Paolo Di Tommaso
Nov 12, 2024

Fusion snapshots: optimal use of spot instances in your data pipelines

Spot instances provide access to cost-efficient cloud compute resources that are essential for deploying economically sustainable data pipelines within any organization. However, spot instances come with a significant caveat: they can be reclaimed (i.e. interrupted) by the cloud provider at any point in time. This unpredictability can pose challenges in the execution of real-life data analysis workflows at scale.

To mitigate these challenges, Nextflow implements an error failover strategy that automatically re-executes tasks interrupted by spot instance reclamation. When a task is interrupted, Nextflow starts a new execution of that task from scratch. In typical execution scenarios, only 5% to 10% of the total tasks are affected by a spot retry, so the overall execution time and cost are generally not significantly impacted.

However, production scenarios can present different challenges. For instance, the limited availability of certain instance types or heightened demand for specific compute resources can increase the frequency of spot instance reclamation. In such cases, the Nextflow retry mechanism kicks in and attempts to execute the interrupted tasks again. Unfortunately, this can create a continuous loop of interruptions and retries that wastes compute resources and ultimately causes the pipeline to fail once the maximum number of retry attempts is exhausted.
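To get a feel for how quickly from-scratch retries can be exhausted, here is a minimal back-of-the-envelope sketch. It assumes spot interruptions arrive as a Poisson process at a constant rate, which is a simplifying assumption of this example and not something Nextflow or any cloud provider specifies. The function names and parameters are hypothetical.

```python
import math


def attempt_success_probability(task_minutes, interruptions_per_hour):
    """Probability that one attempt runs to completion without interruption.

    Assumption: interruptions follow a Poisson process with a constant rate,
    so the chance of seeing none during the task is exp(-rate * duration).
    """
    rate_per_minute = interruptions_per_hour / 60.0
    return math.exp(-rate_per_minute * task_minutes)


def failure_probability(task_minutes, interruptions_per_hour, max_attempts):
    """Probability that every attempt is interrupted when each retry
    restarts the task from scratch (attempts treated as independent)."""
    p_success = attempt_success_probability(task_minutes, interruptions_per_hour)
    return (1.0 - p_success) ** max_attempts


if __name__ == "__main__":
    # A 60-minute task, one interruption per hour on average, 3 attempts:
    # roughly a 25% chance that all three attempts are interrupted.
    print(f"{failure_probability(60, 1, 3):.3f}")
```

Under this model, longer tasks and higher reclamation rates push the failure probability up sharply, because every retry has to survive the full task duration again.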

Introducing Fusion snapshots

Fusion is a file system for Nextflow data pipelines that is designed to optimize the use of cloud resources. The use of Fusion can massively improve the efficiency of data transfer in Nextflow data analysis pipelines, both in terms of compute time and resource cost.

Recently, Fusion was enhanced with a snapshotting capability that enables the optimal use of spot instances. This feature allows the persistence of the state of jobs interrupted by spot reclamation, enabling the recovery of the execution in a new compute instance. This makes it possible to guarantee the progress of the jobs irrespective of one or more spot events, thereby optimizing the use of compute resources under any conditions.

How does it work?

When a job is interrupted by a spot reclamation, the Fusion driver detects the spot event, takes a snapshot of the state of the process tree carrying out the execution of your task, and saves it to S3 object storage. When the task is retried, Fusion identifies the retry attempt and restores the job state by reading the job snapshot from S3. This allows execution to resume seamlessly from the point of interruption.
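The save-on-interruption, restore-on-retry flow described above can be illustrated with a small conceptual sketch. This is not Fusion's implementation: Fusion snapshots the whole process tree transparently, whereas this toy example tracks a single step counter, and a plain dictionary stands in for S3. All names here (`run_attempt`, `SpotInterruption`, `store`) are hypothetical.

```python
class SpotInterruption(Exception):
    """Stands in for a spot reclamation event in this toy model."""


def run_attempt(total_steps, store, key, interrupt_after=None):
    """Run one attempt of a task and return the steps executed in this attempt.

    `store` is a dict standing in for object storage (an assumption of this
    sketch). If a snapshot exists under `key`, work resumes from it instead
    of restarting from zero.
    """
    done = store.get(key, 0)  # restore previous progress, if any
    executed = 0
    while done < total_steps:
        if interrupt_after is not None and executed == interrupt_after:
            store[key] = done  # persist the snapshot before the instance goes away
            raise SpotInterruption(f"interrupted after {done} steps")
        done += 1
        executed += 1
    store.pop(key, None)  # task finished, so the snapshot is no longer needed
    return executed


if __name__ == "__main__":
    store = {}
    try:
        run_attempt(10, store, "task-1", interrupt_after=7)
    except SpotInterruption:
        pass
    # The retry only has to execute the remaining steps.
    print(run_attempt(10, store, "task-1"))
```

The point of the sketch is the contrast with from-scratch retries: after an interruption at step 7 of 10, the retry only executes the 3 remaining steps, so each attempt makes net progress regardless of how often interruptions occur.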

Available now

Fusion snapshots is now available in Private Preview, compatible with Nextflow pipelines running on AWS Batch and using the Fusion file system. In a future release, we are planning to also add support for Google Batch and Azure Batch.

Conclusion

Fusion snapshotting enables the use of spot instances in your Nextflow data pipelines without worrying about the frequency of job interruptions. It ensures that the progress of interrupted tasks is preserved, optimizing the use of spot instances and minimizing the computation wasted due to frequent spot interruptions.

Interested in Fusion Snapshots? Register your interest in the Fusion Snapshots Private Preview