Robert Lalonde
Jul 28, 2023

Preparing for a multi-cloud future

It is no secret that cloud computing is on the rise. According to an April 2023 report by Research and Markets, the global market for drug discovery informatics was estimated at US$2.3 billion in 2022 and is projected to grow to US$4.8 billion, a compound annual growth rate (CAGR) of 9.8%. Cloud computing is expected to play a significant role in facilitating this growth.

This shift to the cloud is supported by our own research as well. Our 2023 State of the Workflow Community Survey showed the use of public cloud is up 20% since last year, with 43% of survey respondents using cloud today and 70% signaling plans to migrate their workloads. This trend is even more pronounced among commercial organizations, with fully 80% planning to run workloads in the cloud.

Among disciplines that embrace scientific computing, life sciences firms are moving to the cloud faster than their peers. Reasons include the bursty nature of workloads, pipelines of cloud-friendly open-source tools, easily anonymized datasets reducing security concerns, and collaborative public-private partnerships.1

So, what’s the issue?

Despite organizations’ best efforts to standardize on a single cloud vendor to simplify management, a multi-cloud future seems inevitable for most. According to Flexera’s 2023 State of the Cloud report reflecting a broad swath of industries, 87% of organizations report operating in a multi-cloud environment.2 Furthermore, 80% of enterprise users cite multi-cloud management as a top challenge.

There are several factors driving organizations toward hybrid and multi-cloud deployments:

  • Unplanned, unstructured growth – organizations tend toward heterogeneity for reasons that include business units making independent purchasing decisions, mergers & acquisitions, pilots that grow into full-fledged implementations, and collaborative research initiatives with third parties that bring their own computing environments.
  • Data sovereignty and data gravity considerations – the location of datasets frequently influences where organizations choose to run their workloads.
  • Commercial “as-a-service” offerings with affinities to particular cloud providers may require that customers bring on additional cloud vendor relationships.

Besides these push factors, there are solid reasons for embracing a multi-cloud strategy despite the added complexity:

  • Strengthening their position when it comes time to renegotiate enterprise agreements, by designing for portability and retaining the ability to shift easily to different providers.
  • Preserving strategic flexibility to embark on new research initiatives or collaborations that may require access to data and tools on other cloud platforms.
  • Ensuring cost transparency and the ability to shift workloads to lower-cost cloud platforms or repatriate workloads if necessary to avoid overspending.

This last point is a big one. According to a 2022 HashiCorp Forrester report, 94% of surveyed organizations report overspending in the cloud. This overspending occurs for multiple reasons, including underused or over-provisioned resources and a lack of skills to utilize cloud infrastructure fully.

While cloud computing is convenient, it can also be expensive on a sustained basis. In an attempt to take advantage of discounts or plan capacity more effectively, organizations can find themselves embroiled in multiyear cloud contracts, arrangements eerily reminiscent of the outsourcing contracts of old. If you chart a course to the cloud, it is only prudent to keep workloads portable so that you can bring them back in-house or move them to another cloud should the need arise.

Embracing multi-cloud

Fortunately, Seqera can provide researchers with a variety of solutions to stay flexible and avoid becoming locked into a single cloud platform. Nextflow was designed from the ground up to enable developers to build workflows that are portable across compute environments, from laptops to HPC clusters to scalable cloud services. Several Nextflow features help enable this portability:

  • Pluggable executors. Nextflow provides an abstraction layer between pipeline processing and underlying compute environments. Pipeline authors can consolidate cloud and cluster-specific settings using config profiles, making compute environments selectable at runtime, and enabling pipelines to run anywhere without modification.

    See Nextflow on BIG IRON: Twelve tips for improving the effectiveness of pipelines on HPC clusters

  • Your choice of storage. Much as executors abstract away differences between compute environments, Nextflow abstracts away differences between storage systems. Developers can focus on high-level data flows and let Nextflow manage the mechanics of data movement across various storage architectures, whether it is a local hard disk, a parallel file system, or one of several different cloud object stores or file systems.

    See Selecting the right storage architecture for your Nextflow pipelines

  • Support for multiple container formats, registries, and SCMs. Portability is further enhanced by Nextflow’s support for diverse container formats and registries. Nextflow supports virtually all major container formats as well as public and private registries. It is also tightly integrated with modern Git-based source code managers (SCMs) such as GitHub, Bitbucket, GitLab, and similar cloud services. Users can draw pipeline code directly from source code repositories and run it on any cloud or on-premises compute environment.
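
The idea behind config profiles can be illustrated with a small model of runtime profile selection. The Python sketch below is not Nextflow code and does not describe how Nextflow is implemented; it only mimics the pattern of keeping shared settings in one place and overlaying environment-specific settings chosen at launch time. The profile names, queue names, and paths are hypothetical. The executor names `local`, `slurm`, and `awsbatch` are real Nextflow executor identifiers.

```python
from copy import deepcopy

# Settings shared by every compute environment (illustrative values).
BASE_SETTINGS = {
    "process": {"cpus": 2, "memory": "4 GB"},
    "work_dir": "work",
}

# One overlay per environment, playing the role of a config profile.
# Profile names, queues, and paths are hypothetical.
PROFILES = {
    "laptop": {"process": {"executor": "local"}},
    "cluster": {
        "process": {"executor": "slurm", "queue": "long"},
        "work_dir": "/scratch/work",
    },
    "cloud": {
        "process": {"executor": "awsbatch", "queue": "example-queue"},
        "work_dir": "s3://example-bucket/work",
    },
}


def merge(base, overlay):
    """Return a new dict with overlay applied on top of base, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(profile_name):
    """Build the effective settings for the profile selected at launch time."""
    if profile_name not in PROFILES:
        raise ValueError(f"unknown profile: {profile_name!r}")
    return merge(BASE_SETTINGS, PROFILES[profile_name])


if __name__ == "__main__":
    # The pipeline definition never changes; only the selected profile does.
    for name in PROFILES:
        settings = resolve(name)
        print(name, settings["process"]["executor"], settings["work_dir"])
```

In Nextflow itself, the equivalent selection is made with the `-profile` option of `nextflow run`, so the same pipeline code can target a laptop, a cluster, or a cloud service.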

For organizations doing collaborative research that need to share pipelines, data, compute environments, and analysis results, Tower extends Nextflow’s multi-cloud capabilities even further.

For readers unfamiliar with it, Tower is an intuitive, centralized command post that enables collaborative data analysis at scale. With Tower, users can easily launch, manage, and monitor scalable Nextflow data analysis pipelines and compute environments on-premises or across the cloud providers of their choice.

Built with open science in mind

Central to Tower’s design is an orientation toward open science and collaborative research. Like Nextflow, Tower explicitly avoids locking users into particular cloud platforms, storage technologies, or cloud management tools. Tower itself can run anywhere. Organizations can deploy it on-premises, on their choice of cloud, or they can use Tower Cloud, a managed service offering provided by Seqera.

Tower is specifically designed to work with a customer’s existing code repositories and data and brings additional capabilities to help simplify hybrid and multi-cloud deployments:

  • Sharable compute environments. Users collaborating in workspaces can follow a guided process to easily add multiple compute environments, from AWS Batch to Google Cloud Batch to on-premises Slurm clusters. Users simply plug in their cloud credentials, and Tower enables authorized users to run portable pipelines across their choice of cloud or on-prem compute environment.

  • Datasets. Unlike competing analysis tools that require data and pipeline code to be ingested and, in some cases, compiled before execution, Tower data can reside wherever you choose – in S3 buckets, Azure Blobs, or private data repositories. Tower provides a metadata-based approach to organizing versioned datasets so that users can easily reference and share data no matter where it resides physically.

  • Flexible provisioning. Tower does not prescribe particular provisioning methods for deploying compute and storage resources. For example, when deploying an AWS Batch environment, users or cloud administrators can have Tower automatically manage resource provisioning (using Tower Forge). For complete flexibility, they can also connect Tower to compute environments already deployed using tools such as AWS ParallelCluster, Azure Resource Manager templates, or Terraform.

  • A common user experience across clouds. Tower presents an intuitive, consistent, guided interface regardless of the underlying cloud platform or on-premises cluster, and regardless of where pipelines come from or where they execute. Scientific users, analysts, and clinicians don’t need to know where pipelines execute or where results are physically stored.

    See Best Practices for Deploying Pipelines with Nextflow Tower

  • Pipeline secrets. In addition to securely storing cloud credentials, Tower supports the concept of secrets – secure storage for keys and tokens used by workflow tasks to interact with external systems such as databases, API endpoints, and various cloud services. Secrets make it easy to further parameterize pipelines run under the control of Tower to ensure their portability.

  • Significant cost savings. In addition to helping ensure portability, resource optimization features in Tower have been shown to reduce cloud resource requirements and associated spending by up to 85%.

    See optimizing resource usage with Nextflow Tower
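
To illustrate the pipeline secrets item above, here is a minimal Python sketch of a task script that reads a token at run time instead of embedding it in pipeline code. It assumes the secret is exposed to the task as an environment variable. The variable name `EXAMPLE_API_TOKEN` and the helpers `require_secret` and `auth_header` are hypothetical and are not part of Tower or Nextflow.

```python
import os


def require_secret(name, environ=None):
    """Return the secret stored under `name`, or fail with a clear message.

    Assumption: the workflow system injects secrets into the task's
    environment, so the value is never hardcoded in the pipeline itself.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise RuntimeError(f"secret {name!r} is not set for this task")
    return value


def auth_header(token):
    """Build an HTTP Authorization header from the token."""
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # EXAMPLE_API_TOKEN is a hypothetical secret name.
    try:
        headers = auth_header(require_secret("EXAMPLE_API_TOKEN"))
        # Report only the header names so the token never reaches the logs.
        print("token loaded; header keys:", sorted(headers))
    except RuntimeError as err:
        print("cannot continue:", err)
```

Because the script depends only on the variable being present, the same task can run unchanged in any compute environment that supplies the secret.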

Towards a future-proof analysis environment

While most organizations will be tempted to consolidate on a single cloud provider, Nextflow and Tower make it easy to run analysis pipelines anywhere. Organizations can enjoy increased flexibility, avoid the risk of proprietary lock-in, and easily shift workloads as business requirements change.

Should it be necessary to repatriate workloads, change cloud providers, or enter into new collaborations, Nextflow and Tower can help navigate these transitions with ease.

To learn more about Nextflow Tower, download the whitepaper Enabling collaborative data analysis at scale with Nextflow Tower.

Notes

1 The Hackett Group – Business Impact of Cloud Adoption in the Life Sciences Industry

2 Note: Flexera’s definition of multi-cloud extends to hybrid cloud and includes on-premises private clouds.