Running CAW with Singularity and Nextflow
This is a guest post authored by Maxime Garcia from the Science for Life Laboratory in Sweden. Max describes how they deploy complex cancer data analysis pipelines using Nextflow and Singularity. We are very happy to share their experience across the Nextflow community.
The CAW pipeline
Cancer Analysis Workflow (CAW for short) is a Nextflow-based analysis pipeline developed for the analysis of tumour/normal pairs. It is developed in collaboration with two infrastructures within Science for Life Laboratory: the National Genomics Infrastructure (NGI), at the Stockholm Genomics Applications Development Facility to be precise, and the National Bioinformatics Infrastructure Sweden (NBIS).
CAW is based on GATK Best Practices for the preprocessing of FastQ files, then uses various variant calling tools to look for somatic SNVs and small indels (MuTect1, MuTect2, Strelka, FreeBayes, GATK HaplotypeCaller), for structural variants (Manta) and for CNVs (ASCAT). Annotation tools (snpEff, VEP) are also used, and finally MultiQC for handling reports.
We are currently working on a manuscript, but you're welcome to look at (or even contribute to) our GitHub repository or talk with us on our Gitter channel.
Singularity and UPPMAX
Singularity is a tool that packages software dependencies into a contained environment, much like Docker. It's designed to run on HPC environments where Docker is often a problem due to its requirement for administrative privileges.
We're based in Sweden, and the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) provides computational infrastructure for all Swedish researchers. Since we're analyzing sensitive data, we are using the secure clusters (with two-factor authentication) set up by UPPMAX: SNIC-SENS.
In my case, since we're still developing the pipeline, I am mainly using the research cluster Bianca, so I can only transfer files and data into one specific directory using SFTP.
UPPMAX provides computing resources for Swedish researchers across all scientific domains, so getting software updates can occasionally take some time. Typically, Environment Modules are used, which allow several versions of different tools to be available side by side - this is good for reproducibility and quite easy to use. However, the approach is not portable across different clusters outside of UPPMAX.
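For reference, this is roughly what selecting a specific tool version with Environment Modules looks like (the module names and version below are illustrative and vary by cluster):

```bash
# Select a specific tool version via Environment Modules
# (module names and versions are illustrative and vary by cluster).
module load bioinfo-tools
module load GATK/3.7
```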
Why use containers?
The idea of using containers, for improved portability and reproducibility and more up-to-date tools, came naturally to us, as containers are easily managed within Nextflow. We cannot use Docker on our secure cluster, so we wanted to run CAW with Singularity images instead.
How was the switch made?
We were already using Docker containers for our continuous integration testing with Travis, and since we use many tools, I took the approach of making (almost) one container for each process. Because this process is quite slow, repetitive and I~~'m lazy~~ like to automate everything, I made a simple Nextflow script to build and push all Docker containers. Basically it's just `docker build` and `docker push` for all containers, with some configuration possibilities.
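To give an idea of what that script automates, here is a minimal sketch of the equivalent shell commands for a single container; the repository, container name and tag below are placeholders, not the actual values used in CAW:

```bash
# Build one tool image from its Dockerfile and push it to Docker Hub.
# The Nextflow helper script simply repeats this for every container.
REPOSITORY=my-dockerhub-user   # Docker Hub account (placeholder)
CONTAINER=multiqc              # one container per tool/process (placeholder)
TAG=1.2.3                      # pipeline release used as the image tag (placeholder)

docker build -t "${REPOSITORY}/${CONTAINER}:${TAG}" "containers/${CONTAINER}/"
docker push "${REPOSITORY}/${CONTAINER}:${TAG}"
```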
Since Singularity can directly pull images from Docker Hub, I made the build script pull all containers from Docker Hub, giving us local Singularity image files.
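As an illustration, converting one of those Docker Hub images into a local Singularity image file can be done with a command along these lines (names and tag are placeholders; older Singularity releases used a `--name` option instead of taking the output file name directly):

```bash
# Pull the Docker image from Docker Hub and convert it into a local
# Singularity image file that can later be copied to the secure cluster.
singularity pull multiqc-1.2.3.img docker://my-dockerhub-user/multiqc:1.2.3
```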
After this, it's just a matter of moving all containers to the secure cluster we're using, and of using the right configuration file in the profile. I'll spare you the details of the SFTP transfer. The key piece is the Nextflow configuration file that points every process to its local Singularity image: `singularity-path.config`.
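The file itself is not reproduced here, so the following is only a minimal sketch of what such a configuration could look like; the process names, image paths, bind point and release tag are illustrative assumptions:

```
// Sketch of a profile that runs every process in a local Singularity
// image, so that nothing needs to be pulled at runtime on the secure
// cluster. Process names, paths, bind point and tag are placeholders.

singularity {
  enabled    = true
  runOptions = '--bind /scratch'   // bind the cluster's scratch space (site-specific)
}

params {
  containerPath = 'containers'     // directory the .img files were transferred to
  tag           = '1.2.3'          // pipeline release used to tag the images
}

process {
  withName: 'RunMultiQC' {
    container = "${params.containerPath}/multiqc-${params.tag}.img"
  }
  withName: 'RunStrelka' {
    container = "${params.containerPath}/strelka-${params.tag}.img"
  }
  // ...one such entry for each process/tool in the pipeline
}
```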
This approach ran (almost) perfectly on the first try, except for one process failing due to a typo in a container name...
Conclusion
This switch was completed a couple of months ago and has been a great success. We are now using Singularity containers in almost all of our Nextflow pipelines developed at NGI. Even though we do enjoy the improved control, we must not forget that:
With great power comes great responsibility!
Credits
Thanks to Rickard Hammarén and Phil Ewels for comments and suggestions for improving the post.