Sateesh Peri
Jan 22, 2024

Running Nextflow pipelines with MemVerge’s MMCloud

This is a guest blog post by Sateesh Peri of MemVerge, showing the flexibility and level of third-party integration possible with Nextflow and Seqera. Here, Sateesh explains how MemVerge integrated MMCloud into the Seqera Platform.

At the Nextflow Summit in Barcelona, MemVerge described an exciting new integration between Nextflow and MemVerge’s Memory Machine™ Cloud (MMCloud). In this article, we discuss the integration and its benefits, explain how users can take advantage of it, and outline current efforts to strengthen interoperability between MMCloud, Nextflow, and Seqera.

Running pipelines in the cloud

A key advantage of Nextflow is that it provides a rich set of executors that researchers can use to easily launch pipelines in any cloud or on-premises HPC cluster. By running workloads in the cloud, organizations can access the latest instance types, avoid the need to install and manage local compute environments, and pay only for the resources they need. Users can also leverage low-cost Spot or preemptible instances, which offer discounts of 50% to 90% compared to typical on-demand resources.

While the cloud offers clear advantages for scientific workflows, there are limitations with existing approaches:

When Spot instances are reclaimed by the provider, task execution typically begins again from scratch, wasting time and resources.
Cloud resources are often over- or under-provisioned, leading to tasks that need to restart due to insufficient resources, or over-spending on cloud resources.

MMCloud was designed to help address these and other challenges with running workloads in the cloud.

About MMCloud

Memory Machine™ Cloud (MMCloud) is a container orchestration platform for scientific computing developed by MemVerge. The platform is ideal for data-intensive pipelines and interactive computing applications in the cloud.

Rather than offering its own cloud infrastructure, MMCloud provides a set of services that can be layered on popular cloud platforms including AWS, Google Cloud Platform, Alibaba Cloud, and Baidu Cloud. MMCloud comprises several innovative services addressing common challenges with compute environments for data-intensive pipelines:

Workload Management: Simplifies job submission and management across cloud platforms, with capabilities for pausing and resuming tasks.
Resource and Cost Monitoring: Offers real-time insight into cloud resource usage and spending, helping to manage costs effectively.
Spot Instance Handling: Automatically handles Spot interruptions by saving and migrating tasks, reducing the need to restart from the beginning.
Dynamic Resource Allocation: Continuously adjusts cloud resources based on task needs, ensuring efficiency and cost savings.

MMCloud and Seqera Platform

Users of open-source Nextflow can easily get started with MMCloud by creating an account and following the directions in the Memory Machine Cloud Quick Guides, depending on their preferred file system.

Seqera Platform users can also run pipelines on MMCloud by using Seqera’s AWS Batch integration to manage the Nextflow head job and provide MMCloud configuration settings within the Seqera interface.

In the Seqera Platform, create or select an existing AWS Batch compute environment designed for on-demand instances.

In the Launchpad, add a new pipeline or select an existing one.

Under Advanced Options, enter the following configuration to use MMCloud. Make sure to add your AWS and MMCloud credentials (the OpCenter IP address, username and password can be accessed from your MMCloud account).

plugins {
  id 'nf-float'
}
process {
  executor = 'float'
  errorStrategy = 'retry'
}
wave {
  enabled = true
}
fusion {
  enabled = true
  exportStorageCredentials = true
}
float {
  address = '<user-opcenter-ip-address>'
  username = '<username>'
  password = '<password>'
}
aws {
  accessKey = '<access-key>'
  secretKey = '<access-secret>'
}

Add the following command to the Pre-run script section to disable the AWS secrets provider. This is necessary to prevent the default AWS Batch execution:

unset AWS_BATCH_JOB_ID

Finally, launch the workflow! This will use AWS Batch to provision the head node and subsequently use MMCloud to execute all jobs.

The nf-float plugin

As illustrated below, Nextflow users can connect to MMCloud on their preferred cloud via nf-float, a plug-in that provides seamless access to MMCloud resources.

Nextflow will automatically download and activate the plugin when it sees the following construct in the nextflow.config file:

plugins {
  id 'nf-float'
}

The MMCloud nf-float executor works with multiple cloud storage options, including NFS, S3FS, JuiceFS, and Seqera’s Fusion file system.¹ An S3 bucket can optionally be specified as the working directory. In this case, MMCloud will automatically use S3FS to facilitate access to the S3 bucket unless users specifically request the Fusion file system by enabling it in the config file.

Details about the MMCloud environment are specified in a series of float directives in the nextflow.config file as explained in the plug-in documentation. Instructions for using Fusion with MMCloud are available here.
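As a rough sketch of the S3 working directory case described above, the fragment below combines the standard Nextflow `workDir` setting with the plugin and executor settings shown earlier. The bucket name is a hypothetical placeholder, and the `float` connection settings (address, username, password) are omitted here and would still be needed. The `wave` and `fusion` blocks are only required when Fusion is wanted in place of the default S3FS.

```groovy
// Sketch only: the bucket name below is a hypothetical placeholder.
plugins {
  id 'nf-float'
}

// Standard Nextflow setting: keep task work directories in an S3 bucket.
workDir = 's3://my-example-bucket/work'

process {
  executor = 'float'
}

// Optional: request the Fusion file system instead of the default S3FS.
// Fusion relies on Wave, so both are enabled together, as in the
// Seqera Platform configuration shown earlier in this post.
wave {
  enabled = true
}
fusion {
  enabled = true
  exportStorageCredentials = true
}
```

Leaving out the `wave` and `fusion` blocks keeps the default S3FS behavior, which, per the footnote, is one of the file systems that MMCloud’s checkpoint/resume functionality currently supports.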

Quantifying the benefits of superior resource management

While results will vary depending on the pipeline and storage used (as well as runtime considerations such as Spot availability and cost), savings can be dramatic. MemVerge recently published internal benchmarks for three widely used nf-core pipelines showing savings per run of up to 60% in the case of nf-core/rnaseq. These cost savings result from MMCloud’s ability to dynamically right-size cloud instances using WaveRider and to avoid restarting tasks after Spot instance preemption using SpotSurfer. Additionally, MMCloud deploys a single task per VM, along with a unique instance selection strategy that benefits both performance and cost.

See the Memory Machine Cloud blog for the complete set of benchmark results. In addition to making Spot instances safe for non-fault-tolerant apps and providing continuous instance size optimization at runtime, MMCloud also brings other benefits. These include real-time visibility into application resource use via WaveWatcher and the ability to run on clouds not directly supported by Nextflow, including Alibaba Cloud and Baidu.

Future directions

The Seqera and MemVerge teams are continuing to strengthen the integration, making it easier to access MMCloud from Nextflow and the Seqera Platform. Current areas of focus include:

A tighter integration between Seqera Platform and MMCloud that avoids the need to run the Nextflow head job in AWS Batch
Improved MMCloud support for Seqera’s Fusion file system
Better cross-cloud compatibility

In addition, MemVerge just released an improved incremental snapshot capability called AppCapsule++ to address the specific case where snapshots of tasks with large memory footprints cannot be completed before Spot instances are preempted.

Learn more: AppCapsule++, the 2nd generation of MMCloud snapshot technology to support bigger workloads and reduce overhead

AppCapsule++ functionality should result in even better performance and cost-efficiency when using Spot or preemptible instances in the cloud.

To learn more about using the Seqera Platform with MMCloud, see the latest MemVerge blog post, Run Nextflow Workflows on MMCloud via Seqera Platform.

¹ Support for the Fusion file system is presently limited: MMCloud’s checkpoint/resume functionality works only with S3FS and JuiceFS. MemVerge and Seqera aim to provide more complete support for the Fusion file system in the future.