How Nextflow builds key software development skills for bioinformaticians
This post has been written by our valued community members.
I first used Nextflow at 19 years old, a year out of school, during the first year of my apprenticeship. Over the next 5 years, building Nextflow pipelines gave me a diverse set of essential skills that enabled me to explore a wide range of career opportunities. Whether it was learning about containerization and image security or the differences between HPC and cloud architecture, all of it stemmed from wanting to develop Nextflow pipelines as a young apprentice. Looking back now as a Bioinformatics Engineer, I want to reflect on the most valuable skills Nextflow taught me and the career paths they can open up for any young, unsure bioinformaticians out there.
Containerization
Containerization is the practice of encapsulating the specific software a task needs in isolated environments (containers) that can be used and shared across different platforms. Nextflow encourages developers to use containers in their pipelines. Containers can be confusing to newcomers, as each container technology (Docker, Singularity, Podman, etc.) has its own syntax for running containers. Nextflow handles this syntax for the developer so that they can focus on the pipeline’s functionality.
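To make the syntax differences concrete, here is a minimal Python sketch of how the same command might be launched under each engine. It is only an illustration, not how Nextflow builds its commands internally. The function name `container_command`, the `/data` work directory and the choice of image are assumptions made for the example.

```python
import shlex
from typing import List


def container_command(engine: str, image: str, command: str,
                      workdir: str = "/data") -> List[str]:
    """Build the argument list for running `command` inside `image`.

    Hypothetical helper for illustration only.
    """
    args = shlex.split(command)
    if engine in ("docker", "podman"):
        # Docker and Podman share a near-identical CLI: bind-mount the work
        # directory and make it the working directory inside the container.
        return [engine, "run", "--rm",
                "-v", f"{workdir}:{workdir}", "-w", workdir,
                image, *args]
    if engine == "singularity":
        # Singularity uses `exec` and can run Docker images via a docker:// URI.
        return ["singularity", "exec", "--bind", workdir,
                f"docker://{image}", *args]
    raise ValueError(f"unsupported engine: {engine}")


if __name__ == "__main__":
    for engine in ("docker", "podman", "singularity"):
        print(" ".join(container_command(engine, "python:3.11-slim",
                                         "python --version")))
```

In a pipeline, the developer would normally just name the image and enable an engine in the configuration, leaving this kind of command construction to Nextflow.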
When I first needed to implement the episcanpy clustering method in a new module, I had to build my own Docker container image. This process involved finding a base image with the appropriate Python version, installing episcanpy’s Linux dependencies and then installing episcanpy and its Python dependencies via pip. All of this was captured within a Dockerfile and helped me learn about Dockerfiles, container image inheritance and the difference between a container image and a container. Once the image was built, I had to ensure it had no critical common vulnerabilities and exposures (CVEs), so I scanned it using Trivy. This taught me about the potential security risks of using container images and how to fix any CVEs found within them. Building container images for specific applications is a fundamental component of any software engineering job, while identifying and fixing CVEs within existing applications is a common job requirement for a platform engineer or cyber security engineer.
I wouldn’t have developed these skills if I had not been developing a Nextflow pipeline. This was a great learning experience, and I encourage everyone to experience it so that they get a better understanding of containers. However, the fact that Nextflow handles a lot of these issues so that I do not have to go through this every time is amazing and a massive time saver.
HPC infrastructure
One of Nextflow’s biggest appeals is how configurable it is across a range of different HPC environments. Each workplace will probably have a slightly different setup, which will require customizing Nextflow configurations to match your infrastructure.
Early on in my career, I ran pipelines on a cluster using IBM LSF as the job scheduler, with time- and resource-restricted queues. I then moved to an institute that used a Univa Grid Engine job scheduler without queues but with walltime limits. To ensure nf-core pipelines worked on these different clusters, I had to understand the underlying structure of each HPC, such as how the job scheduler had been configured and where best to place high-I/O work directories.
The main problem with HPCs is that every job scheduler has its own syntax that you have to learn in order to run a script on that scheduler. This means that the same script will need new syntax each time you change HPC infrastructure. Nextflow handles all of the HPC-specific syntax so that the developer can focus on the scheduler options and job configurations the pipeline needs, instead of on syntax and workload manager script building. This gave me a much larger knowledge base for infrastructure and scheduler configuration, opening opportunities to transition into infrastructure architect or engineer roles.
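As a rough illustration of that syntax problem, the Python sketch below turns one resource request into a submission command for two schedulers. It is not Nextflow’s implementation. The helper name `submit_command` is hypothetical, and the Grid Engine parallel environment name `smp` is an assumption, since that name is defined per site.

```python
from typing import List, Optional


def submit_command(scheduler: str, script: str, cpus: int,
                   walltime_minutes: int,
                   queue: Optional[str] = None) -> List[str]:
    """Translate one resource request into scheduler-specific arguments.

    Hypothetical helper for illustration only.
    """
    if scheduler == "lsf":
        # LSF: -n requests job slots, -W sets the run limit in minutes.
        cmd = ["bsub", "-n", str(cpus), "-W", str(walltime_minutes)]
        if queue:
            cmd += ["-q", queue]
        return cmd + [script]
    if scheduler == "sge":
        # Grid Engine: -pe selects a parallel environment ("smp" is assumed
        # here) and h_rt sets the walltime limit as HH:MM:SS.
        hours, minutes = divmod(walltime_minutes, 60)
        cmd = ["qsub", "-pe", "smp", str(cpus),
               "-l", f"h_rt={hours:02d}:{minutes:02d}:00"]
        if queue:
            cmd += ["-q", queue]
        return cmd + [script]
    raise ValueError(f"unsupported scheduler: {scheduler}")


if __name__ == "__main__":
    print(" ".join(submit_command("lsf", "job.sh", 4, 90, queue="normal")))
    print(" ".join(submit_command("sge", "job.sh", 4, 90)))
```

The request itself (4 CPUs, 90 minutes) stays the same while the command changes. In a Nextflow pipeline that request is written once and the translation is handled for you.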
Cloud configuration
Cloud computing can be very complicated, and there are many acronyms that can overwhelm someone new to the space. Nextflow pipelines can be run with all major cloud providers: AWS, Azure, and Google Cloud.
In one of my roles, I was tasked with setting up cloud infrastructure for our Nextflow pipelines. This led me to learn about AWS Batch (a batch computing manager), which in turn taught me about the variety of EC2 instances (compute nodes with different specs) I could run jobs on. Wanting to customize the environments of the EC2 instances taught me about AMIs (Amazon Machine Images) and how to create a golden AMI (a pre-configured AMI used as a reference template for other AMIs). These custom instances need to be accessible to AWS Batch, so compute environments and job queues have to be created. Before I knew it, I had a whole cloud infrastructure set up, all spawned from the need to run a Nextflow pipeline. These cloud skills could allow me to become a Cloud Engineer, a Platform Engineer, or a Solutions Architect.
The software development lifecycle
When developing Nextflow pipelines, I often have to consider the design of the pipeline prior to development. I consider which processes can be run in parallel, the dependencies within the pipeline, any bottlenecks, and how the data is handled throughout the pipeline.
Planning and designing software prior to development can be at times hard to conceptualize, but using Nextflow can make your life a lot easier as it promotes reusing functionality, isolating different functionality and defining inputs prior to processing.
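One way to reason about parallelism and dependencies at the design stage is to write the pipeline down as a dependency graph. The sketch below uses Python’s standard `graphlib` module on a made-up pipeline. The step names are hypothetical and the stage-by-stage grouping is a simplification, since Nextflow itself starts each task as soon as its inputs are available.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each step maps to the set of steps it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "fastqc": {"fetch_reads"},
    "trim": {"fetch_reads"},
    "align": {"trim"},
    "multiqc": {"fastqc", "align"},
}


def parallel_stages(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into stages whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(parallel_stages(DEPENDENCIES), start=1):
        print(number, stage)
```

Steps that land in the same stage, such as `fastqc` and `trim` here, are candidates for running in parallel, and a long chain of single-step stages hints at a possible bottleneck.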
I did not realize this at first, but all of those concepts are key components of software engineering. Nextflow’s modular nature encourages you to think like a software developer or engineer and to follow the software development lifecycle, whether you realize it or not. This is highlighted by nf-test, the Nextflow-specific testing framework. It can test an entire pipeline, specific modules, and even particular Groovy functions, and so emulates the different types of testing in the software engineering world.
A pipeline test emulates an end-to-end test, whereby the entire pipeline is run to check its overall behavior. Tests of specific modules or Groovy functions can be seen as unit tests, whereby each unit of logic is tested in isolation to ensure it behaves as expected.
Running these pipeline, module and function tests whenever you change a pipeline, to ensure the changes do not alter its logic, is closer to what software engineers call regression testing, while integration testing checks that separately tested units still work when combined. Nextflow and nf-test encourage developers to use proper testing practices, which are fundamental skills any developer needs.
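The difference between a unit test and an end-to-end test can be sketched in a few lines of plain Python. This is not nf-test syntax, and the functions `gc_content`, `filter_by_gc` and `pipeline` are hypothetical stand-ins for a module and a pipeline.

```python
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_by_gc(seqs: List[str], min_gc: float) -> List[str]:
    """Keep only the sequences whose GC content is at least `min_gc`."""
    return [s for s in seqs if gc_content(s) >= min_gc]


def pipeline(seqs: List[str], min_gc: float = 0.5) -> Dict[str, int]:
    """Toy 'pipeline': filter the sequences, then summarize the outcome."""
    kept = filter_by_gc(seqs, min_gc)
    return {"kept": len(kept), "dropped": len(seqs) - len(kept)}


def test_gc_content_unit() -> None:
    # Unit test: one piece of logic checked in isolation.
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATGC") == 0.5
    assert gc_content("") == 0.0


def test_pipeline_end_to_end() -> None:
    # End-to-end test: the whole flow checked from input to final output.
    assert pipeline(["GGCC", "ATAT", "ATGC"]) == {"kept": 2, "dropped": 1}


if __name__ == "__main__":
    test_gc_content_unit()
    test_pipeline_end_to_end()
    print("all tests passed")
```

If the unit test fails, the fault is in `gc_content` itself. If only the end-to-end test fails, the fault is more likely in how the pieces are wired together.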
Navigating your career with Nextflow
I know that identifying what you want to do for a career at a young age can be daunting, but this is where Nextflow is your friend! Nextflow is a versatile language with several key concepts. As you learn about each of its components, I highly recommend delving a bit deeper into the area behind it. Using the helpful framework Nextflow provides as a basis for further learning can really help you discover new areas that you may not have known about previously.
A final note
An unsung benefit of Nextflow with regard to career development is the connections you can make. The community is full of people at a variety of stages in very different careers and so the community can be a real asset. Everyone is super friendly and you can ask anyone how they got to where they are. These insights could really help you understand what you need to learn to get to where you need to be.
Upskilling members is another core focus of the community, with lots of free training, hackathons, and interactive talk sessions available. The community even has a set of key principles to follow when developing Nextflow pipelines, which instill proper software development practices in any developer.
Finally, if you ever get stuck on a problem, community members are always there to help and educate each other. I think Nextflow facilitates growth both through developing skills and through making connections, and I highly encourage any young informaticians to get involved!
This post was contributed by a Nextflow Ambassador. Ambassadors are passionate individuals who support the Nextflow community.