How Nextflow is helping win the global battle against future outbreaks and pandemics
Using genetic sequencing to identify pathogens and respond to outbreaks is hardly new. Public Health Authorities (PHAs) routinely use genetic surveillance to monitor seasonal influenza and food-borne bacteria that can lead to illness. Genetic surveillance has also played a key role in helping contain viral outbreaks in recent years, including Monkeypox, Ebola, Yellow Fever, and Zika. Presently, health authorities are leaning on surveillance to combat what is described as a “tridemic” – the simultaneous surge of flu, respiratory syncytial virus (RSV), and new COVID-19 variants.
For public health institutions, the emergence of SARS-CoV-2 in late 2019 was a game-changer. While the virus was identified and sequenced in record time, it proved almost impossible to contain. It quickly became clear that the surveillance systems we relied on were not up to the task of tracking a fast-moving virus that spread asymptomatically. COVID-19 highlighted our systemic vulnerabilities to a global pandemic and their costs in terms of both lives and economic impact.
COVID-19 was arguably the first global pandemic to emerge in the modern bioinformatics era. To meet the threat, governments, health authorities, and scientists quickly mobilized, ratcheting up research and development efforts in a variety of areas, including:
- Sequencing the initial SARS-CoV-2 genome;
- Developing vaccines and therapeutics;
- Evaluating the efficacy of therapeutics and vaccines in the face of emerging variants;
- Conducting genomic surveillance, variant analysis and waste water detection.
What these activities have in common is that they rely on fast, cost-effective sequencing and scalable bioinformatics pipelines for data analysis. In combating the spread of viruses, genetic surveillance is critical. Without the ability to understand viral transmission patterns and evolution, health authorities would be in the dark. However, despite our advances in genetic sequencing in the past decade, surveillance efforts remained a challenge – particularly in the early days of the pandemic.
Most developed countries have genetic surveillance capabilities, but as cases proliferated, it became clear that existing approaches could not scale. Until the recent pandemic, surveillance typically relied on relatively small numbers of regional or national labs to conduct sequencing.
Tracking viral transmission and monitoring mutations within populations (viral phylodynamics) requires that a high percentage of positive tests be sequenced and analyzed. With some jurisdictions reporting tens of thousands of positive tests daily, facilities were quickly overwhelmed. Monitoring outbreaks and tracking rapidly mutating viruses and variants demands new levels of speed and scale. It quickly became clear that an “all of the above” effort would be required to scale surveillance capacity.
To meet this challenge, governments needed to mobilize not just PHAs, but regional labs, public and private hospitals, private labs, and universities. However, scaling analysis capacity across all these different institutions came with its own challenges. Labs often had very different levels of resources and capabilities, including:
- Different software tools and practices for extracting, analyzing, and classifying samples;
- Diverse sequencing platforms;
- Varying levels of bioinformatics and IT expertise, making it challenging for some labs to implement and manage analysis pipelines themselves;
- Different IT infrastructure.
For example, while some labs operated on-premises compute clusters or even single servers suitable for genomic analysis, other organizations relied on cloud providers or had no local computing capacity at all.
Data harmonization was another challenge. Labs needed to gather and present genomic analysis and metadata in a standard way. Central health authorities needed automated methods to collect, cleanse, and aggregate data for downstream reporting, analysis, and sharing with international bodies. Given the urgency, there was simply no time to train staff on complex new analysis tools.
Fortunately, techniques pioneered in modern software development, such as collaborative source code management systems (SCMs), containers, and CI/CD pipelines, played a major role in helping address these challenges. By sharing open-source pipelines in public repositories, and encapsulating applications in containers, pipelines built to sequence and analyze variants could be distributed quickly and efficiently. Participating labs with minimal in-house IT expertise could quickly obtain and run curated analysis pipelines. Pipelines and software could essentially be treated as “black boxes.” These modern approaches enabled people with minimal knowledge of pipeline mechanics and IT infrastructure to immediately become productive.
While several workflow orchestration tools are used for analyzing sequences, given its modern design, Nextflow was particularly well-suited to this challenge. Nextflow brought several essential features that made it easier to scale analysis capacity rapidly. Among these features were:
- Pipeline portability – compute environments are abstracted from pipeline code, so workflows can run unmodified across different IT environments, from local systems to clusters to clouds;
- Container support – support for all major container standards and registries simplified pipeline portability, deployment, and execution;
- SCM integrations – enabled even analysts with minimal experience to view, pull, and manage pipelines in shared repositories such as GitHub, GitLab, and Bitbucket.
Features such as workflow introspection, which allows pipelines to adjust their behavior at runtime, helped make pipelines more adaptable, reliable, and predictable. These capabilities meant pipelines could be easily adapted to diverse computing environments, including public clouds and infrastructure in private labs.
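In practice, this portability comes from Nextflow configuration profiles, which keep infrastructure settings out of the pipeline code itself. The sketch below shows what such a `nextflow.config` might look like; the profile names, queue names, and S3 bucket are illustrative assumptions, not details from the deployments described above.

```groovy
// nextflow.config -- illustrative sketch; profile names, queue names,
// and the S3 work directory below are hypothetical examples.
profiles {
    // Run on a local workstation using Docker containers
    standard {
        process.executor = 'local'
        docker.enabled   = true
    }
    // Run on an on-premises SLURM cluster using Singularity,
    // common where Docker is not permitted on shared nodes
    cluster {
        process.executor    = 'slurm'
        process.queue       = 'general'        // hypothetical queue name
        singularity.enabled = true
    }
    // Run on AWS Batch for labs with no local compute capacity
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'genomics-queue'    // hypothetical Batch queue
        workDir          = 's3://my-bucket/work'  // hypothetical bucket
        docker.enabled   = true
    }
}
```

The same pipeline script can then be launched unmodified with `-profile standard`, `-profile cluster`, or `-profile cloud` – the code never needs to know where it is running.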
Curated, shared pipelines also made it easier for organizations building pipelines to publish and share their work and jumpstart analysis efforts. The nf-core project is a community effort to collect analysis pipelines for a variety of datasets and analysis types.
The ncov2019-artic-nf pipeline is a good example of how these orchestration platforms are enabling pathogen surveillance efforts. With development led by Matt Bull of the Pathogen Genomics Unit at Public Health Wales, the pipeline was used by the COVID-19 Genomics UK Consortium (COG-UK) at local sequencing sites. The pipeline automates the analysis of SARS-CoV-2 sequences from both Illumina and Nanopore platforms, generating consensus genomes, variant calls, quality control data, and various metadata. It then produces output files in a standard format that can be easily shared, collected, and aggregated.
Pipelines like Bactopia bring similar capabilities to tracking food-borne illnesses by making it easier to analyze and classify bacterial genomes. The viralrecon pipeline is another widely used tool for genomic surveillance of COVID-19, Monkeypox, and other viruses. viralrecon is an nf-core project that builds on the work done on the ARTIC pipeline described above.
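The "black box" experience described earlier can be sketched concretely: fetching and launching a shared nf-core pipeline directly from GitHub takes only a couple of commands. The sample name and FASTQ paths below are hypothetical placeholders; `--platform`, `--protocol`, `--genome`, and `--outdir` are standard viralrecon parameters.

```shell
# Build a viralrecon samplesheet (sample,fastq_1,fastq_2 for paired-end
# Illumina data); the sample name and file paths are hypothetical.
cat > samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2
SAMPLE_01,SAMPLE_01_R1.fastq.gz,SAMPLE_01_R2.fastq.gz
EOF

# Fetch and run the pipeline straight from its GitHub repository.
# Nextflow resolves "nf-core/viralrecon" to github.com/nf-core/viralrecon,
# and -profile docker pulls the required containers automatically.
# Guarded so the sketch is harmless where Nextflow is not installed.
if command -v nextflow >/dev/null 2>&1; then
    nextflow run nf-core/viralrecon \
        -profile docker \
        --input samplesheet.csv \
        --platform illumina \
        --protocol amplicon \
        --genome 'MN908947.3' \
        --outdir results
fi
```

No local copy of the pipeline, its tools, or its reference data needs to be installed by hand, which is precisely what allowed labs with minimal IT expertise to participate.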
A separate pipeline called Elan (developed by Sam Nicholls of the University of Birmingham) runs centrally, processing data from various COG-UK labs. Elan is the daily heartbeat of the COG-UK bioinformatics pipeline and is responsible for sanity-checking uploaded data, performing quality control, and updating Majora (the COG-UK database) with processed file metadata. Elan is also responsible for publishing the daily COG-UK dataset for downstream analysis. One of the most important downstream analyses performed is tree building, which is done by a different pipeline called Phylopipe. The Phylopipe pipeline builds phylogenetic trees from the COG-UK dataset. Phylopipe was written by Rachel Colquhoun at the University of Edinburgh.
While we can’t predict the next pandemic, we can be all but certain that new threats will emerge. A silver lining of the present tridemic is that we have proven the utility of applying modern DevOps techniques to genomic analysis. Modern, cloud-agnostic tools like Nextflow have shown high value in helping organizations rapidly and efficiently scale their analysis capacity to meet new challenges. The crucible of the pandemic has also resulted in new software tools that improve our ability to collaborate on genomic data analysis and share research more efficiently than ever before.
Hopefully, the lessons learned from scaling surveillance and research to meet the COVID-19 crisis will serve us well in day-to-day disease and outbreak surveillance and when the next pandemic inevitably occurs.
Please note: An earlier version of this article was contributed by Seqera Labs and published in BioIT World.