Announcing Nextflow Support for Google Cloud Batch
Easy access to scalable cloud resources is a game-changer for organizations engaged in life sciences research. By running portable and reproducible pipelines in the cloud, researchers can collaborate more effectively, avoid cost and complexity, and deliver high-quality results faster. This translates into higher productivity, an accelerated pace of discovery, and a faster time to market for critical new diagnostics and therapeutics.
With the launch of Google Genomics in 2014 (now Google Cloud Life Sciences), Google Cloud became the first major cloud provider to offer a comprehensive life sciences solution. Today, Google Cloud provides a complete suite of open and interoperable cloud services for life sciences, healthcare and academic research. Solutions range from leading genomic analysis tools to public datasets to integrations with data analysis tools such as Cloud Spanner and BigQuery.
On July 13th, Google announced the beta availability of Google Cloud Batch, a new fully managed batch service that will soon become generally available. Seqera is pleased to offer support for Google’s new Batch service out of the gate.
Google Cloud Batch is a comprehensive cloud service suitable for multiple use cases, including HPC, AI/ML, and data processing. While it is similar to the Google Cloud Life Sciences API used by many Nextflow users today, Google Cloud Batch offers a broader set of capabilities. As with Google Cloud Life Sciences, Google Cloud Batch automatically provisions resources, manages capacity and allows batch workloads to run at scale. It offers several advantages, however, including:
- The ability to re-use VMs across job steps to reduce overhead and boost performance
- Granular control over task execution, compute, and storage resources
- Infrastructure, application, and task-level logging
- Improved task parallelization including support for multi-node MPI jobs, array jobs, and subtasks
- Spot VM support, offering up to 91% savings vs. regular compute instances
- Streamlined data handling
- Other features including secret management, configurable priorities and task retry counts, a Batch UI, and more
With Google Cloud Batch, GCS-FUSE support is built in. This means that Nextflow process steps can directly mount Cloud Storage buckets without the need to shuffle data back and forth between VM instances and Google Cloud Storage. The integration also enables data residing on Google Filestore (a managed NFS service) to be directly accessible to Nextflow pipelines running on Google Cloud Batch. Spot VMs also provide a better user experience than preemptible VMs, with support for runtimes beyond 24 hours and additional savings.
Seqera has been working closely with Google to deliver a high-quality Nextflow Google Cloud Batch executor that takes full advantage of innovations in Google Cloud Batch. The integration will be freely available under an Apache 2.0 license as part of the Nextflow distribution.
Nextflow lead engineer and Seqera co-founder Paolo Di Tommaso observed that “Google Cloud Batch provides an elegant but powerful API with a straightforward execution model. The integration makes data handling a breeze. These and other technical advantages baked into Google Cloud Batch will directly benefit pipeline efficiency, throughput, and reliability.”
According to the Seqera CTO, scalability is an area where customers should see gains. “A benefit of working with Google through the pre-release is that we’ve had ample time to focus on testing. Validating the integration at scale with dozens of production pipelines has helped us quickly get out of the gate with a high-quality integration.”
Google Cloud Batch offers an easy migration path for customers running traditional HPC schedulers or the Google Cloud Life Sciences beta API. Customers can continue to use existing integration methods and migrate at their own pace. Given the portable nature of containerized Nextflow pipelines, migrating to Google Cloud Batch is often as simple as tweaking a few settings in the Nextflow configuration file and validating pipeline execution. Customers can continue using existing datasets in Google Cloud Storage, containers, and pipeline repositories.
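To give a sense of what "tweaking a few settings" might look like, here is a minimal sketch of a Nextflow configuration file that targets Google Cloud Batch. The project ID, region, and bucket name are placeholders chosen for illustration, not values from this announcement, and the Spot VM setting is optional. Check the Nextflow documentation for the options supported by your Nextflow version.

```groovy
// nextflow.config: an illustrative sketch, not a definitive setup.

// A pipeline previously using the Cloud Life Sciences API would typically have:
// process.executor = 'google-lifesciences'

// Select the Google Cloud Batch executor instead.
process.executor = 'google-batch'

// Work directory on Google Cloud Storage (placeholder bucket name).
workDir = 'gs://my-bucket/work'

// Google Cloud project and region (placeholder values).
google.project  = 'my-project-id'
google.location = 'us-central1'

// Optional: request Spot VMs for lower-cost compute.
google.batch.spot = true
```

With a file like this in the launch directory, the pipeline is started with the usual `nextflow run` command. Existing containers and data in Google Cloud Storage are referenced exactly as before.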
Seqera will also offer Google Cloud Batch support in Nextflow Tower, an intuitive command post that enables collaborative analysis at scale. The combination of Tower and Google Cloud Batch will make it easier still for bioinformaticians and analysts to securely launch, manage, and monitor scalable Nextflow data analysis pipelines in Google Cloud.
Nextflow users interested in leveraging Google Cloud Batch can visit https://cloud.google.com/batch or book a demo below to learn more.