Phil Ewels joins Seqera
Introduction
I’m really excited to announce that after eight wonderful years at SciLifeLab, I will soon be starting a new position at Seqera. I’ll be dedicating my time to open-source projects, helping to further develop the nf-core community and tooling and continuing support for MultiQC. It’s a pretty unique job description and an amazing opportunity. I can’t wait to get started!
Other members of the Nextflow community have recently written personal reflections on their involvement with Nextflow and nf-core, namely Cedric Notredame, Paolo Di Tommaso and Harshil Patel. I’ve enjoyed reading these and thought that now was as good a time as any to write down my back-story and perspective on how I got involved.
Getting started with pipelines
To understand where my interest in high-throughput bioinformatics pipelines started, we need to go back to my time at the Babraham Institute in Cambridge. I did my PhD and a postdoc there, the latter together with Simon Andrews and the Babraham Bioinformatics Group. The team there are bioinformatics open-source heavyweights, with a string of household names for anyone working in the field (FastQC, FastQ Screen, TrimGalore!, Bismark, HiCUP, SeqMonk and many more). I largely owe my transition from benchwork to bioinformatics to them.
As we worked with ever-increasing volumes of sequencing data, we realised that we needed some kind of pipeline tool. We spent a while looking at what was available but couldn’t find anything that matched how we worked. Reluctantly, we decided to roll our own solution and Cluster Flow was born. The initial commit for Cluster Flow was on the 20th Sep 2013, only a few months after Nextflow’s (22nd Mar 2013).
Cluster Flow is written in Perl and is pretty simplistic in comparison to modern day workflow tools. However, it worked very well and became the workhorse of Babraham Bioinformatics, slowly spreading to a few other institutes. It had a couple of nice features that set it apart from the competition at the time: it shipped with a number of pre-built pipelines for common genomics data types and it had a flexible reference genome manager built in. Cluster Flow was finally published in F1000 in 2017. I wouldn’t recommend it to anyone now as there are better options, but it was a lot of fun to work on and has had a far-reaching influence on how I approach bioinformatics tools.
Nextflow at SciLifeLab
In 2014, I moved to Sweden and joined the National Genomics Infrastructure (NGI) at SciLifeLab in Stockholm. The NGI processes tens of thousands of samples per year, so automation is at the heart of everything, including bioinformatics. Sequencing data was processed with a number of different tools at the time, including bcbio, custom scripts and, with my arrival, Cluster Flow.
Whilst the automated runs generally worked well, they were difficult to update and each bioinformatician needed experience with several different systems. Worse, the analysis was often ignored or repeated by the research groups that we worked with, as it was difficult to replicate and extend. We needed a new solution which was better standardised across analysis types and that, crucially, our users would be able to run themselves. We also needed a tool that could scale to large sample numbers, be very stable and reliably resume long-running workflows. Finally, I was (and am) convinced that the future lay in cloud computing, so I was very interested in writing pipelines that were agnostic to the underlying computational hardware. After a lot of discussion and a little testing, we settled on Nextflow. The incredible responsiveness and support of Paolo was instrumental in this decision (typical wait times for a reply on Gitter in those days could be measured in seconds).
The success of Nextflow at the NGI was immediate. I started with a pipeline for RNA-seq data called SciLifeLab/NGI-RNAseq with help from Rickard Hammarén. Pipelines for methylation data, smRNA-seq data and ChIP-seq data followed (the latter by Chuan Wang). As the pipelines started to proliferate, we tried to standardise our coding style with some internal guidelines. Chunks of code that we wrote back then still persist in nf-core pipelines, such as the use of AWS-iGenomes, automatic emails on completion and more.
What really took me by surprise was the success of our pipelines beyond the NGI. We started getting support requests not only from Swedish research groups (which we had hoped for) but also from groups and centres outside of Sweden. It started to feel like we were on to something big.
The start of nf-core
I like to trace the beginning of nf-core back to the Nextflow meeting in Barcelona in 2017. These meetings are fantastic, brilliant fun and have certainly helped to build the strength of the Nextflow community. I gave a talk about our pipelines at the NGI and had some great discussions, several of which continued after the conference via email.
Transport yourself back in time at https://www.youtube.com/watch?v=IdLlhcux7jY
Two contacts were particularly important: Alex Peltzer (QBiC Germany) and Andreas Wilm (A*STAR GIS Singapore). I was lucky to be invited by both to give talks at their institutions and it was in those discussions that we decided to form nf-core. Both visits involved lots of excited conversations about pipeline collaboration and drawing on whiteboards. Between them, the nucleus of the nf-core guidelines and structure was created: I put together the first sketch of the nf-core logo on the flight home from Singapore. Subsequent invites to places like The Francis Crick in London and the CZ Biohub in San Francisco helped to get more key people involved such as Harshil Patel and Olga Botvinnik.
Andreas Wilm and Swaine Chen treat me to some fantastic food in Singapore
We kick-started nf-core by moving over the NGI pipelines and stripping the SciLifeLab branding, to be replaced with the generic nf-core logo that could easily be used by anyone. Also instrumental was the cookiecutter template, which scaffolded boilerplate code for new pipelines. Once Johannes Alneberg and Sven Fillinger figured out a method for automated synchronisation to keep pipelines up to date with that template at an early nf-core hackathon in Stockholm, we were set. Paolo boosted adoption a little sooner than I was expecting by tweeting about the nascent nf-core community and our numbers soon started growing. The rest, as they say, is history - numbers have not stopped or slowed since.
Taking in the view at one of the early nf-core hackathons, Stockholm 2018.
Major nf-core milestones
To say that the growth and success of nf-core has surprised me would be something of an understatement. I couldn’t believe it when we hit 100 followers on Twitter. Then, it was 1000. Then, 2000. Soon, we will hit 3000. We’ve had a paper published in Nature Biotechnology, and our pipelines are used by institutes all over the world and have been adopted by major consortia such as FAANG and EASI-Genomics.
As we’ve grown, we’ve managed to attract financial support despite our unconventional structure. Support from AWS allows us to test pipelines on full-size datasets in the cloud and host the results. More recently, the Chan Zuckerberg Initiative has funded us together with Nextflow in three separate calls: EOSS (Essential Open Source Software for Science) rounds two and four as well as the Diversity & Inclusion cycle. These awards provide fantastic recognition of our work as well as funding events and personnel.
I’m also extremely proud of our slightly less obvious achievements. We have defined new standards for Nextflow code such as the parameter schema, and written coding guidelines and best practices for people to replicate in Nextflow pipelines and beyond. We’ve created tools for developers to aid the construction of pipelines. Most of all, we have built a thriving community that has provided training and support to hundreds of scientists. All of this has been done in an open collaborative framework, based on people volunteering their time and expertise. In many ways, I believe that nf-core represents scientific collaboration at its best.
Looking to the future
As we move into 2022, the future seems bright for nf-core. The CZI grants along with support from the SciLifeLab NGI, SciLifeLab Data Centre, QBiC, Seqera and others are providing dedicated personnel on a scale that the project has not seen before. We have big plans across the board: tools, code, website, training, hackathons and entirely new initiatives such as a diversity-focussed paid mentorship program. We are pushing geographical boundaries with a concerted effort to shed our European bias and reach out to new regions. Our community growth shows no sign of slowing and as we come out of the other side of the pandemic I am starting to see glimmers of hope that we may one day be able to meet in person again.
Group selfie at the last in-person nf-core event, London 2020
Whilst Nextflow and nf-core are often uttered in the same sentence, we will continue to develop the two as separate, symbiotic entities. I think that our slightly different, yet complementary aims benefit everyone; Nextflow provides power and flexibility on the one side and nf-core gives out-of-the-box user friendliness / standardisation on the other.
Although I haven’t talked about it in this blog post, one of my other pet projects, MultiQC, has enjoyed similar stratospheric levels of adoption, and I’m delighted that the folks at Seqera have agreed that I should continue to support it. I hope that this will lead to a new era of slightly faster turnaround times on issues and pull requests, as well as finally getting time to spend on some exciting new features (perhaps the focus of a future blog post).
On a personal level, I feel incredibly lucky to be moving to a position where I can spend all of my time working with such fantastic open-source projects. Such jobs are notoriously rare and difficult to fund and I am incredibly grateful to those who have helped to make it happen. Great things lie ahead, and I can’t wait to see where this journey takes us all.