Jon Manning

Nov 27, 2024

Ensuring consistent behavior across Conda and Docker environments with Wave

As bioinformaticians, reproducibility becomes almost an obsession. We need to ensure that the analysis we perform today, with the same input data, will yield the same results tomorrow. Even more critically, these results should remain consistent across different machines or infrastructures. Over the years, we've made significant strides toward achieving this goal, and at Seqera, we’re at the forefront of these efforts. With a recent feature in Wave, bioinformaticians and data scientists alike are about to experience an important step forward. But before we dive into that, let’s take a step back and explore some context.

The evolution of bioinformatics environments

In the early days, bioinformatics analyses were typically run on single machines, with software either installed system-wide by administrators or individually by users in their personal environments. This led to problems when someone upgraded R or Python, and suddenly, any results reliant on the older version were no longer reproducible.

As the field evolved, virtual environments became essential. They allowed analyses to be carried out in isolated environments with well-defined software requirements, making it possible to recreate results precisely at later dates. Conda made this even simpler, managing dependencies across disparate software ecosystems. More recently, container platforms like Docker have further streamlined reproducibility by packaging entire environments—including software, dependencies, and the operating system—into portable containers. This ensures analyses run consistently across different systems and infrastructures. Docker and similar platforms have become key tools for sharing work that can be reliably reproduced, regardless of the computational setup.

Tools like Nextflow took this a step further, allowing us to define independent software environments for every process in a workflow—sometimes hundreds of steps. This solved many dependency conflicts and made reproducibility more achievable. The nf-core community, for example, has built an expanding library of nearly 1,500 individually reusable workflows and modules. These modules come with predefined software for Conda, Docker and Singularity. However, one challenge we've encountered is ensuring replicability when the same software is deployed via different methods, particularly between Conda and containers. This issue largely stems from differences in what are known as “transitive dependencies.”

Transitive dependencies and the Conda-Docker gap

To illustrate, let’s consider a real-world example of a Conda environment configuration file used by an nf-core module:

channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::bioconductor-deseq2=1.34.0

This configuration specifies that the process requires version 1.34.0 of the DESeq2 R package. It’s hosted in the Bioconda repository, and thanks to the magic of the BioContainers infrastructure, there’s also a Docker container available at quay.io. This means we can run the same process with identical versions of the DESeq2 library, both inside and outside a container runtime.

This setup seems ideal and, initially, it likely works as expected. However, the Conda package, bioconda::bioconductor-deseq2=1.34.0, has its own set of dependencies within the Bioconda repository and other Conda channels. When running a Nextflow workflow with Conda, the environment is resolved based on the package ecosystem at the moment the workflow is run, potentially pulling newer versions of dependencies, including crucial components like R itself. Over time, this can cause freshly built Conda environments to diverge from the Docker container that was frozen in time, undermining reproducibility and potentially leading to unexpected failures in tests, such as nf-test snapshots. Investigating these changes can require some serious detective work inside the container.

The case for environment locking

At this point, you might wonder: “Why not just record the version of every package used?” This is indeed where we’re heading—what we refer to as “locking” the environment. Rather than replacing flexible environment files, which are valuable for new projects needing the latest dependencies, we aim to provide lockfiles alongside these configurations. These lockfiles will match the versions used in Docker containers, ensuring exact (or nearly-exact) reproducibility of behaviors across different systems.

Enter Wave

This is where Wave comes in. An open-source tool from Seqera, Wave streamlines the process of generating containers for workflows, offering several advantages over BioContainers, especially when building containers for multiple Conda packages or when using Singularity.

With the Wave command-line interface (CLI) or the Seqera Containers web interface, you can request a container (Docker or Singularity) for a specified system architecture (arm64, amd64) based on a Conda environment:

wave --conda-package "fastp" --platform linux/arm64

Wave builds the environment and provides the container.

Wave and lockfiles: replicable behavior inside and outside containers

Until now, that was the end of the process: Wave built your container. But a year later, a new Conda environment built with the same specification would likely differ from this specific container's environment. A recent innovation in Wave now addresses this: along with building containers, Wave also preserves the Conda environment by running the following command internally:

micromamba env export --explicit > env.lock

This generates a lockfile for every container. A lockfile for the DESeq2 example above might look like this:

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2#d7c89558ba9fa0495403155b64376d81
https://conda.anaconda.org/conda-forge/linux-64/libgomp-14.1.0-h77fa898_1.conda#23c255b008c4f2ae008f81edcabaca89
https://conda.anaconda.org/conda-forge/linux-64/_openmp_mutex-4.5-2_gnu.tar.bz2#73aaf86a425cc6e73fcf236a5a46396d
https://conda.anaconda.org/conda-forge/noarch/_r-mutex-1.0.1-anacondar_1.tar.bz2#19f9db5f4f1b7f5ef5f6d67207f25f38
https://conda.anaconda.org/conda-forge/noarch/_sysroot_linux-64_curr_repodata_hack-3-h69a702a_16.conda#1c005af0c6ff22814b7c52ee448d4bea
https://conda.anaconda.org/conda-forge/linux-64/ld_impl_linux-64-2.40-hf3520f5_7.conda#b80f2f396ca2c28b8c14c437a4ed1e74
https://conda.anaconda.org/conda-forge/noarch/kernel-headers_linux-64-3.10.0-h4a8ded7_16.conda#ff7f38675b226cfb855aebfc32a13e31
https://conda.anaconda.org/conda-forge/noarch/tzdata-2024a-h8827d51_1.conda#8bfdead4e0fff0383ae4c9c50d0531bd
https://conda.anaconda.org/conda-forge/noarch/sysroot_linux-64-2.17-h4a8ded7_16.conda#223fe8a3ff6d5e78484a9d58eb34d055
https://conda.anaconda.org/conda-forge/linux-64/binutils_impl_linux-64-2.40-ha1999f0_7.conda#3f840c7ed70a96b5ebde8044b2f36f32
https://conda.anaconda.org/conda-forge/linux-64/libgcc-14.1.0-h77fa898_1.conda#002ef4463dd1e2b44a94a4ace468f5d2
https://conda.anaconda.org/conda-forge/linux-64/libgcc-ng-14.1.0-h69a702a_1.conda#1efc0ad219877a73ef977af7dbb51f17
https://conda.anaconda.org/conda-forge/linux-64/libzlib-1.3.1-h4ab18f5_1.conda#57d7dc60e9325e3de37ff8dffd18e814
https://conda.anaconda.org/conda-forge/linux-64/tk-8.6.13-noxft_h4845f30_101.conda#d453b98d9c83e71da0741bb0ff4d76bc
https://conda.anaconda.org/conda-forge/linux-64/bwidget-1.9.14-ha770c72_1.tar.bz2#5746d6202ba2abad4a4707f2a2462795
https://conda.anaconda.org/conda-forge/linux-64/bzip2-1.0.8-h4bc722e_7.conda#62ee74e96c5ebb0af99386de58cf9553
...truncated...
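The `# platform:` header in a lockfile like this one records which platform the package files were built for, so it can be worth checking before trying to build an environment from it. The following is a minimal sketch, not part of Wave: the function names are made up for illustration, and the mapping from `uname` values to Conda platform names covers only a few common cases.

```shell
# Print the platform recorded in an explicit lockfile's "# platform:" header.
lockfile_platform() {
  sed -n 's/^# platform: *//p' "$1" | head -n 1
}

# Map `uname -s` and `uname -m` values to the matching Conda platform name.
# Only a few common combinations are covered (an assumption of this sketch).
conda_subdir() {
  case "$1/$2" in
    Linux/x86_64)  echo linux-64 ;;
    Linux/aarch64) echo linux-aarch64 ;;
    Darwin/x86_64) echo osx-64 ;;
    Darwin/arm64)  echo osx-arm64 ;;
    *)             echo unknown ;;
  esac
}

# Succeed only when the lockfile was built for the machine we are on.
check_lockfile_platform() {
  want=$(lockfile_platform "$1")
  have=$(conda_subdir "$(uname -s)" "$(uname -m)")
  if [ "$want" = "$have" ]; then
    echo "ok: lockfile and host are both $have"
  else
    echo "mismatch: lockfile is for $want, host is $have" >&2
    return 1
  fi
}
```

For example, `check_lockfile_platform env.lock && conda create --name myenv --file env.lock` would only attempt the build when the platforms agree (`myenv` is a placeholder name).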

With this feature, Wave fixes the versions of ALL the Conda packages in the environment. But it doesn’t just pin dependencies by version number as before: it records an explicit link to each Conda package file. Because nothing is left for the solver to decide, the slow process of resolving dependencies when building an environment is bypassed entirely. Using this lockfile in workflows, you not only get more reproducible results with Conda but faster ones too. Each URL also ends with the package’s MD5 checksum (the text after the #), so each download can be verified as the exact file that was used when the container was built.
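To make the structure of these URLs concrete, here is a small sketch that splits each lockfile entry into package name, version, build string and MD5 checksum. It assumes the usual Conda file naming of name-version-build with a `.conda` or `.tar.bz2` extension; `parse_lockfile` is a name invented for this example, not a Wave or Conda command.

```shell
# Print "name<TAB>version<TAB>build<TAB>md5" for each package URL in an
# explicit lockfile. Assumes name-version-build file naming.
parse_lockfile() {
  # Comment lines and the @EXPLICIT marker are skipped; only URLs are kept.
  grep -E '^https?://' "$1" | while IFS= read -r url; do
    case "$url" in
      *'#'*) md5=${url##*\#} ;;  # text after the '#' is the MD5 checksum
      *)     md5='' ;;           # tolerate entries without a checksum
    esac
    file=${url%%\#*}             # drop the checksum
    file=${file##*/}             # drop everything up to the last '/'
    file=${file%.conda}          # drop whichever archive extension is present
    file=${file%.tar.bz2}
    build=${file##*-}            # last '-' separated field: build string
    rest=${file%-*}
    version=${rest##*-}          # second-to-last field: version
    name=${rest%-*}              # remainder (may itself contain '-'): name
    printf '%s\t%s\t%s\t%s\n' "$name" "$version" "$build" "$md5"
  done
}
```

Run as `parse_lockfile env.lock`, the first entry of the lockfile shown above would be reported as `_libgcc_mutex`, `0.1`, `conda_forge` and `d7c89558ba9fa0495403155b64376d81`, separated by tabs.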

Hands-on with Wave and Conda lockfiles

So how do you use Wave to generate a paired container definition and Conda lockfile? You have two choices here:

Seqera Containers: A web interface that provides easy access to Wave's functionality.
Wave CLI: For those comfortable with the Linux command line.

Seqera Containers

Seqera Containers is a web resource that gives you easy access to Wave functionality via a graphical interface. As an example, we can select the R-based Bioconda package ‘r-shinyngs’ and hit ‘Get Container’.

That gives us a pop-up with the link to the container, which we can make a note of. If we then click ‘View build details’, and then the ‘Sharable build page URL’, we reach a page where the lockfile is available in a text box with a download button.

The Wave CLI

If you’re used to working from the Linux command line, the Wave CLI allows you to get both the container and Conda lockfile for a given package specification in just a couple of commands.

First, run Wave with your Conda package specification (see the documentation linked above for more information):

wave --conda-package "r-shinyngs" --platform linux/amd64 --await -o yaml

Be sure to specify the correct platform, and the yaml output option. This will give you a result like:

buildId: bd-e427cad04544d552_1
cached: true
containerImage: 195996028523.dkr.ecr.eu-west-1.amazonaws.com/wave/build:r-shinyngs--e427cad04544d552
containerToken: acfaa5232eae
duration: PT5M23.102806637S
expiration: '2024-10-17T03:53:03.357219154Z'
freeze: false
mirror: false
requestId: acfaa5232eae
scanId: sc-b7ddb9308b3f8f7e_1
status: DONE
succeeded: true
targetImage: wave.seqera.io/wt/acfaa5232eae/wave/build:r-shinyngs--e427cad04544d552

The ‘targetImage’ field in the output gives you an image you can use straight away. The URI for the Conda lockfile can be generated from the buildId:

build_id=bd-e427cad04544d552_1
lockfile_uri=https://wave.seqera.io/v1alpha1/builds/${build_id}/condalock
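Rather than copying the build ID by hand, the same steps can be scripted. The sketch below assumes the Wave output has been saved to a file and contains flat, single-line `key: value` fields like those shown above; `yaml_field`, `condalock_uri` and the file name `wave.yml` are hypothetical names used only for illustration, and a proper YAML parser would be more robust.

```shell
# Pull a top-level "key: value" field out of Wave's YAML output.
# Only handles the flat, single-line fields shown in the example output.
yaml_field() {
  awk -v key="$2:" '$1 == key { print $2; exit }' "$1"
}

# Build the lockfile URI for a build ID, following the pattern given above.
condalock_uri() {
  printf 'https://wave.seqera.io/v1alpha1/builds/%s/condalock\n' "$1"
}

# Example usage (commented out because it needs the Wave CLI and network):
# wave --conda-package "r-shinyngs" --platform linux/amd64 --await -o yaml > wave.yml
# build_id=$(yaml_field wave.yml buildId)
# image=$(yaml_field wave.yml targetImage)
# lockfile_uri=$(condalock_uri "$build_id")
```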

Testing the container and Conda lockfiles

To show the comparability of the Docker and Conda approaches, we can use both software definitions to derive the version of a transitive dependency, something whose version we didn’t explicitly fix.

To do this using the container we obtained via Wave:

# Pull the image
docker pull wave.seqera.io/wt/acfaa5232eae/wave/build:r-shinyngs--e427cad04544d552
# Get the version of a transitive dependency
docker run --rm wave.seqera.io/wt/acfaa5232eae/wave/build:r-shinyngs--e427cad04544d552 R -q -e "packageVersion('SummarizedExperiment')"

This gives us output like:

> packageVersion('SummarizedExperiment')
[1] ‘1.32.0’

To do the same in a Conda environment built from the lockfile, we must download the lockfile and use it to build a Conda environment, before running the command:

curl $lockfile_uri > r-shinyngs.lockfile.yml
conda create --name r-shinyngs --file r-shinyngs.lockfile.yml
source activate r-shinyngs
R -q -e "packageVersion('SummarizedExperiment')"

The environment creation is very fast, because we’re bypassing dependency resolution. The output of the R command is:

R -q -e "packageVersion('SummarizedExperiment')"
> packageVersion('SummarizedExperiment')
[1] ‘1.32.0’

This matches what we found in the container.
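If you want to make this comparison part of an automated check rather than reading the two outputs by eye, something like the following sketch could work. It assumes R prints the version in the form shown above, and simply takes the first dotted number it finds in each output; `r_version` and `same_version` are names invented for this example.

```shell
# Extract the first dotted version number from R's printed output,
# so a line such as "[1] ‘1.32.0’" yields "1.32.0".
r_version() {
  grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Compare two captured outputs and report whether the versions agree.
same_version() {
  a=$(printf '%s\n' "$1" | r_version)
  b=$(printf '%s\n' "$2" | r_version)
  if [ -n "$a" ] && [ "$a" = "$b" ]; then
    echo "match: $a"
  else
    echo "differ: container=${a:-?} conda=${b:-?}" >&2
    return 1
  fi
}

# Example usage (commented out because it needs Docker and the Conda env):
# in_container=$(docker run --rm "$image" R -q -e "packageVersion('SummarizedExperiment')")
# in_conda=$(R -q -e "packageVersion('SummarizedExperiment')")
# same_version "$in_container" "$in_conda"
```

Here `$image` stands for the container URI obtained from Wave, and the second command is meant to be run with the lockfile-built Conda environment active.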

Transparency and portability

Docker has long been used to “freeze” software environments, but it’s not always transparent—especially when it comes to querying the versions of transitive dependencies. Containers work well as delivery mechanisms, but by preserving Conda environments separately, we get the best of both worlds: the convenience of containers for provisioning software (particularly in cloud environments) and the transparency and reproducibility of Conda lockfiles. These lockfiles provide an explicit record of all dependency versions, ensuring a near-identical environment both inside and outside container runtimes. If we ever lose sight of the container for whatever reason, we can also rebuild something almost identical based on that lockfile.
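One practical consequence of having every dependency written down is that two environments can be compared as plain text. As a rough sketch (the function names are invented here, and only package file names are compared, ignoring channels and checksums), the differences between two explicit lockfiles can be listed like this:

```shell
# List the package files named in an explicit lockfile, one per line, sorted.
lock_packages() {
  grep -E '^https?://' "$1" | sed -e 's/#.*$//' -e 's|.*/||' | LC_ALL=C sort
}

# Show packages that differ between two lockfiles: entries found only in the
# first file are unindented, entries found only in the second are tab-indented.
diff_lockfiles() {
  left=$(mktemp) && right=$(mktemp) || return 1
  lock_packages "$1" > "$left"
  lock_packages "$2" > "$right"
  LC_ALL=C comm -3 "$left" "$right"
  rm -f "$left" "$right"
}
```

Calling `diff_lockfiles old.lock new.lock` (both file names are placeholders) prints nothing when the two environments contain the same package files.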

With Wave, reproducibility becomes faster, easier, and more consistent across platforms. We're excited to continue pushing the boundaries of bioinformatics workflows, helping you achieve the same results, every time, everywhere.
