In this episode of the Nextflow Podcast, we dive into highlights from the recent Nextflow Summit held in Barcelona in 2024. The summit featured an incredible range of talks, demonstrations, and discussions showcasing the latest advancements in workflows, data analysis, and community collaboration.
With so much exciting content to explore, we’ve picked out some of our favourite moments to share with you. From impactful new features to inspiring keynote addresses and humorous presentations, this episode offers a curated selection of clips that capture the energy and innovation from the event.
Join us as we relive the summit’s best bits and discuss their significance for the Nextflow community.
Key links
Here are some of the links that were mentioned during the podcast:
- Nextflow Summit 2024 Barcelona Agenda and YouTube Playlist
- Blog post about Wave mirroring and security scans
- Conda lock files blog post
- Fusion Snapshots blog post and private preview application form
- Electricity Emissions Dynamic Map
- Nextflow Ambassadors, with Application form
Podcast overview
Here’s an overview of the clips we selected, and what they were about.
Each clip title here links to the specific part of that talk, on YouTube.
The clips are also embedded within the podcast itself.
- Wave mirroring
- Wave Mirroring allows users to copy containers across registries, optimising workflow efficiency by caching them near the computation and adhering to regulatory needs.
- Throwing cat food
- Mathys Grapotte used a humorous live demonstration with their cat to illustrate neural network training and data processing in machine learning.
- nf-core/crisprseq
- The crisprseq pipeline was highlighted with a light-hearted focus on naming oversight, later celebrated with memes and merchandise.
- nf-test
- The nf-test adoption for nf-core modules demonstrated significant advances in testing efficiency with smart testing and community-driven plugins.
- Workflow output definition
- A new model for defining workflow outputs simplifies modularisation, replacing the old publishDir directive for better usability and composability.
- OSSF launch
- The Open Science Software Foundation was announced to ensure long-term sustainability for open-source projects like nf-core.
- FAIR code with nf-core
- FAIR principles were shown to be directly supported by nf-core tools, enhancing findability, accessibility, interoperability, and reusability of code.
- Teaching with Nextflow
- A structured curriculum for teaching bioinformatics integrates Nextflow and nf-core pipelines, building on foundational skills like Bash and containerisation.
- Conda lock files
- Nextflow’s integration of Conda lock files ensures reproducibility and consistent environments by freezing dependencies precisely.
- MultiQC in scripts
- MultiQC’s integration into Python scripts now enables seamless data parsing and custom report generation without relying on command-line execution.
- Small Nextflow Big Nextflow
- Nextflow’s flexibility caters to both simple ad-hoc pipelines and large-scale production workflows, scaling to diverse user needs.
- Genomics England
- Genomics England showcased Nextflow’s role in scaling genome analysis for newborn screening, reducing diagnostic timelines dramatically.
- Alphafold interactive reports
- Interactive AlphaFold protein structure reports visualise predictions dynamically, exemplifying cutting-edge bioinformatics tools.
- Data Studios
- Data Studios eliminates installation overheads by providing pre-configured cloud environments, simplifying data analysis workflows.
- Covid and Ebola
- Nextflow enabled rapid pathogen analysis during the COVID-19 pandemic, showcasing scalability and readiness for future outbreaks.
- Seqera AI
- Live AI-based testing in Nextflow showcased the automation of debugging and script validation, making workflows accessible to novices.
- Nextflow VSCode DAGs
- A new VSCode extension feature generates DAG previews directly in the editor, improving workflow navigation and debugging.
- Fusion snapshots
- Fusion Snapshotting enhances spot instance usage by saving computation states, minimising costs, and avoiding redundant task retries.
- Sustainable computing
- A keynote highlighted the significant impact of location-based compute carbon intensity and advocated for energy-efficient workflows.
- nf-core/tools
- The updated nf-core tools interface streamlines pipeline creation, providing an intuitive onboarding experience for new users.
- Industry best practices with nf-core
- A talk emphasised how industry can leverage nf-core tools to adopt open-source best practices without compromising proprietary data.
- Federated data queries
- Federated query systems like AWS Athena enable cross-pipeline data exploration directly from output files, simplifying data analysis.
- Nextflow Ambassadors
- The Nextflow Ambassadors programme supports local and global community engagement, spreading Nextflow knowledge worldwide.
Many thanks to all speakers and attendees. We hope you enjoyed the event as much as we did!
Full transcript
Podcast introduction
Phil Ewels: Hello and welcome to the Nextflow Podcast by Seqera. You’re joining us for Episode 48 going out on November the 26th, 2024. Today I’ve invited basically everyone I could find at Seqera to join me on the podcast.
And we’re going to go through and talk about all of our favorite bits from the Nextflow Summit, Barcelona, 2024, which was a couple of weeks ago. There was so much good content in those talks, all of them on YouTube, but it takes you, well, two and a half days to go through all those talks. So we thought we’d pick out our favorite bits for you.
We’re recording in two separate sessions. So I’ve got European folks on call with me right now, and then later in this podcast, I’m going to pull some people in from the Americas time zone.
Introductions: Group 1
Phil Ewels: So thank you very much for joining me, everyone. Should we do a quick round of introductions?
Maxime Garcia: Uh, yes, hello everyone, I’m Maxime Garcia. I’ve been a long time contributor to nf-core. I love Nextflow, and that’s how I ended up working at Seqera at some point. I developed pipelines. I help with the community, and, yeah, I am living the dream, definitely.
Adam Talbot: Hi, I’m Adam. I’m a bioinformatics engineer here at Seqera in the scientific development team, spending a lot of time doing cloud infrastructure stuff and deployments of Nextflow pipelines, a huge range of different environments.
Jon Manning: Hi, I’m Jon Manning. I’m also in the scientific development team. I also do lots of customer workflow development tasks and some infrastructure things as well.
Chris Hakkaart: Hi everyone. So my name’s Chris and I am a technical writer at Seqera. I’m a part of the education team and I primarily focus on the Nextflow documentation.
Phil Ewels: A well known face to this podcast audience, I’m sure. Great. Okay. The way this will work is we’re going to go around and talk about a particular snippet of a specific talk which resonated with us. I’m going to splice a little video clip into the podcast so everyone listening can hear which bit we’re talking about.
Clip 1: Wave mirroring
Phil Ewels: So Adam, I think you’re up first. What have you got for us?
Adam Talbot: I, I struggled to pick out all of the good moments, but I wanted to choose one of my favorite new features, which is Wave Mirror, which I really like.
Phil Ewels: Let’s watch it, shall we?
Paolo Di Tommaso: There is a new mirroring capability. That means that Wave can be used to copy a container from one registry to another. This can be useful when you deploy the pipeline across different cloud regions or across different clouds. You keep the container where the pipeline is running, so you improve the efficiency because you cache the container in the same region where the computation happens.
But also for regulatory reasons, it may be required that you keep the container there.
Adam Talbot: I just love this feature because it’s kind of subtle and understated, but has such a big impact. Securing your docker containers is a recurring problem that people have. Relying on these public external services, you’ve got to trust that someone else is doing their job right.
It’s just been such a thorn in the side of trying to get these reproducible pipelines. And just having a quick way of copying the containers into your own box. It’s just going to make things so much cleaner and simpler for anyone who’s working in regulated industry or just want some good security or clinical lockdown environment. So I like this feature a lot. I think it punches way above the announcement in the little talk.
Phil Ewels: We’ve got container.registry, right? Which is what we used until now in nf-core. So you can change the base registry for all the images used within an nf-core pipeline.
How is this different to that?
Adam Talbot: Currently, what you’ve got is you point your container.registry at the base URL. So you may say Docker Hub or Quay.io or something. And then you go and grab all the containers from something like BioContainers or your own container registry. And then you run them and you trust that quay.io is going, but with Wave and Wave Mirroring, you can essentially grab those containers, slide them into your own container registry and have a complete duplicate there.
You no longer have to rely on configuration items being set to exactly the right destination. You can just sort of grab them, redirect to your own registry, cut out the middle man of quay.io and then just run from your own private ECR. So you can put everything in the right region, which then reduces costs and reduces greenhouse gas emissions. It’s just a good thing all around.
Maxime Garcia: I think that was a great pick, because yes, that was not one of the highlights that was like announced, but definitely it’s a great thing as well, so I don’t know, I really like that, that was a great choice.
Adam Talbot: I think there are quite a few announcements that kind of slipped under the radar. There are so many that we only mentioned the big ones and then we didn’t notice all the small ones are going to make such a big contribution to the day to day work of the average bioinformatician.
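To make the idea from this clip concrete, here is a minimal Python sketch of what mirroring means for an image reference: the repository path and tag stay the same and only the registry host changes. It is an illustration only, not how Wave is implemented. The function name mirror_reference, the example ECR hostname, and the rule used to spot a registry host are all assumptions made for this sketch.

```python
def mirror_reference(image: str, target_registry: str) -> str:
    """Return the reference an image would have under another registry.

    Hypothetical helper for illustration; Wave's own naming of mirrored
    repositories is not described in the episode.
    """
    first, _, rest = image.partition("/")
    # Assumption: the first path component is a registry host only if it
    # looks like one (contains a dot or a port, or is "localhost").
    looks_like_host = "." in first or ":" in first or first == "localhost"
    if rest and looks_like_host:
        path = rest   # drop the original registry, e.g. "quay.io"
    else:
        path = image  # shorthand reference with no explicit registry
    return f"{target_registry.rstrip('/')}/{path}"


# Example: a public image redirected to a private registry (made-up hostname).
private = "123456789012.dkr.ecr.eu-west-1.amazonaws.com"
mirrored = mirror_reference("quay.io/biocontainers/samtools:1.17", private)
```

The point of the sketch is only the mapping: once the copy exists in the private registry, the pipeline can pull from a location in the same region as the compute.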
Clip 2: Throwing cat food
Phil Ewels: Alright, Maxime, what have you got for us?
Maxime Garcia: Yes, so Mathys, who’s been working on the nf-core pipeline for machine learning, is explaining how deep learning works and how an LLM and a neural network work, by showing how he’s trained his cat to go and get some food after he’s thrown some cat food at it, or not. Real throw or fake throw. And I think that was brilliant, so let’s just watch this tiny bit.
Mathys Grapotte: At this stage you start to throw another cat snack. So I’m going to demonstrate this live. I have a cat snack in my hand here. I’m going to throw it. Okay, you hear that the sound of the snack, it goes after the hand movement. The cat doesn’t move straight away, it waits for the sound cue. So, let’s formalize the problem a little bit. Uh, so we have a cat here, at the top you see the cat symbol.
Maxime Garcia: So yeah, it was super fun, and I think Mathys is a great speaker. Just explaining by using his cat as an example, was great
Phil Ewels: I must admit, I was sat on the front row, and when I saw him come up onto stage, I was like, Is he holding something? What’s he, what’s he got there? He’s wearing, a cat t shirt, and then he goes into algebra in his slides, but instead of algebraic symbols, he’s got little emoji of cats and stuff like that. It’s really good.
Jon Manning: It’s wonderful when people do science communication like that, they take something very removed from what they’re trying to explain, and it really facilitates understanding. I think that’s an excellent piece of science communication going on there.
Adam Talbot: Real empathy with the audience, right?
Phil Ewels: The questions in Slack, if I remember, were mostly about whether a cat could come back at next year’s event.
Clip 3: crisprseq
Phil Ewels: Chris, I believe you’ve got one for us next.
Chris Hakkaart: Yeah, so one of the real highlights for me was Laurence. She was talking about the pipeline crisprseq.
For context, she’d come to me the day before in a bit of a panic, saying: you won’t believe it, but I haven’t actually given the name of the pipeline anywhere on any of my slides. And she was wondering if she could update the slides, and we were going to make that work. And then she just thought, no, actually, I’ll just make it work. This is kind of fun. Let’s just do it.
So halfway through the talk, she says, look, I don’t have this anywhere on any of my slides, but the pipeline is called crisprseq. And then she proceeds to repeat that several times. And it was just the most organic and fun way of telling us what the pipeline was, how it works, some of the fun parts, but also just really hammering it home. And if you monitor the nf-core Slack as well, and the new nf-core shop, which was also announced, I think there’s actually a t shirt now that does mention that it’s CRISPR Seq. CRISPR Seq, CRISPR Seq, CRISPR Seq. And it’s just a lot of fun. It was a really fun moment. Definitely one of the lighter moments, but something I think is quite memorable.
Phil Ewels: You know you’ve done well if you inspire a meme.
Chris Hakkaart: Yeah. And a t shirt, and a t shirt.
Laurence Kuhlburger: So this is what our pipeline does. Um, I have to admit I had a minor breakdown in my hotel room yesterday because I was practicing this talk and I noticed that in none of my slides did I write the name of the pipeline. Um, so, okay, cool. Now that I got a giggle out of you, uh, the pipeline is called crisprseq.
it’s called crisprseq. It won’t appear in the next slides. It’s on nf-core and it’s called crisprseq. Um So, like, CRISPR and Seq, like, sequencing, like, Seqera. Um, so look it up. If you’re interested, come talk to me also. Um, I didn’t put any barcode. It’s, like, you won’t find it. It’s called crisprseq.
Phil Ewels: So, Chris, do you know anything about this pipeline?
Chris Hakkaart: I can tell you its name. It’s definitely a highlight in that we can poke a little bit of fun at ourselves and everyone takes that in their stride. I think it embodies nf-core in a way as well, in that we do have the meme channel, and people are ready for a good laugh. So for me, that was a real highlight.
Phil Ewels: Think it took real guts for Laurence to do that as well, to just roll with it and make it into a feature, not a bug.
Clip 4: nf-test
Phil Ewels: Adam, I’ve got you down next.
Adam Talbot: Yeah. nf-test. I am not ashamed of my love of nf-test and all of nf-test is good, but I wanted to pick out a moment where he just highlighted the adoption of nf-test by the nf-core modules community. nf-test was never built to deal with a monorepo like the nf-core modules repo is, this colossal thing that can do thousands of modules. But they’ve adapted and they’ve worked with it and given it loads of features, and it’s just a really good positive feedback loop.
Lukas Forer: We analyzed 500 commits of the nf-core modules repository. Over the time, a lot of test cases were created, and on each new commit, all the test cases have to be executed, so the number of executed test cases grows with time.
We saw that a lot of commits changed only one or two modules.
And when we apply smart testing on this repository, we can see that nf-test automatically detects the changes on each commit and runs only the tests that are affected by these changes.
Adam Talbot: So yeah, I just think that’s, that’s, that’s cool. That’s a, that’s a huge repository, like with so many tests and it’s so hard to keep on top of everything. Between automation and people working hard and a bunch of really cool features coming together, we’re able to test thousands of things, using only a sliver of the resources we might need if we were, testing everything at once.
So it’s cool. And I just love the power of community, in one little snippet there.
Phil Ewels: I think this one sort of ties into a wider theme of the conference, where we had Loic, who’s one of my picks for later, talking about sustainability as the keynote speaker. I feel like it popped its head up a bunch of times in different and slightly unexpected ways. People are already taking that seriously.
Jon Manning: Yeah, I think we’re all in the kind of science field to try and make things better. The point that Loic makes is that sometimes, by trying to make things better, we make things worse. So it’s good to try and avoid that.
Maxime Garcia: Yes, I think I really like, the plugins in nf-test.
Adam Talbot: The plugins are great as well. Anyone can write an extension now to nf-test. We’ve already taken them up and started to use them pretty widely.
Jon Manning: Well, Maxime’s done some great work on generating file complements, for example, to use in your tests, which is cool.
Maxime Garcia: Being able to just check, the content of a BAM file. So you can separate what is in the header or what are the actual reads. FASTA file, VCF file, and so on. So you can check the content of the file in a smarter way than just checking a particular line. I think this is great.
Adam Talbot: If the problem is shared, someone has a problem and then they create a solution and share it, and so everyone can benefit. Just great.
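The smart testing idea from Lukas’s clip can be sketched in a few lines of Python: given the files changed in a commit and a map of which files each test depends on, run only the tests that are affected. This is a simplified illustration, not nf-test’s actual implementation. The function name affected_tests, the dependency map, and the file paths are hypothetical.

```python
from pathlib import PurePosixPath


def affected_tests(changed_files, test_dependencies):
    """Return the names of tests that depend on at least one changed file.

    test_dependencies maps a test name to the files it depends on.
    Both arguments are assumptions for this sketch; a real tool would
    derive them from the repository and the version control history.
    """
    changed = {PurePosixPath(f) for f in changed_files}
    selected = [
        test
        for test, deps in test_dependencies.items()
        if any(PurePosixPath(d) in changed for d in deps)
    ]
    return sorted(selected)


# Hypothetical monorepo with two modules, each with its own test.
dependencies = {
    "samtools/sort": [
        "modules/nf-core/samtools/sort/main.nf",
        "modules/nf-core/samtools/sort/tests/main.nf.test",
    ],
    "fastqc": [
        "modules/nf-core/fastqc/main.nf",
        "modules/nf-core/fastqc/tests/main.nf.test",
    ],
}

# A commit that touches only the fastqc module selects only its test.
to_run = affected_tests(["modules/nf-core/fastqc/main.nf"], dependencies)
```

With most commits touching one or two modules, as described in the clip, a selection step like this is what keeps the number of executed tests from growing with the size of the repository.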
Clip 5: Workflow output definition
Phil Ewels: All right, Jon, you’ve got one in the list next.
Jon Manning: So this one’s about outputs from workflows. Traditionally, Nextflow’s used a system, which I found quite counterintuitive when I started, which is this publishing logic whereby you mark a file from a process as being important. There hasn’t been a proper way of modeling direct outputs from workflow up till now. And there’s this change in Nextflow that allows us to do something more like what I expected to find as a newcomer.
Paolo Di Tommaso: We are redesigning how you define the output structure in your pipeline. This new model is to move away from this publishDir directive to something that is more structured, but at the same time also more declarative, and to enable the modularisation of the pipeline.
Essentially the publish section is moved out from the task and into the workflow level instead. This will allow us to have better control, and to have better models that are more composable, because this information is not bound to the module itself. So we can reuse them better.
Jon Manning: It’s just a kind of a cool way to bring things more forward to what I think a user would expect when they encounter a workflow system inputs and outputs, and what happens in between should be the kind of the core of a workflow system really.
Adam Talbot: Paolo drops the sentence at the end, it makes it more reusable, and I think that’s really key because reproducibility and portability start to be a little bit easier because we’ve got this clear output flow. And we can start doing things like wiring workflows together. It’s a really nice feature that’s coming on quite fast.
Phil Ewels: And importantly, also, this is going to be the way that you have to write your Nextflow pipeline. So whether you like it or not, it doesn’t really matter.
Uh, you’ve got about another six months or maybe a year to get rid of all those publishDir statements.
Jon Manning: Yeah. It, it never made a great sense that you marked individual processes as output points. It was a strange concept to begin with, so I’m very glad it’s gone.
Phil Ewels: I think Paolo’s talked before about saying this is a relic of DSL1. But now we’ve got this modular workflow system where you have emits and sub workflows. And I think this is going to make a lot more sense to people.
Adam Talbot: There was a theme at this Summit, of developer experience, like developer experience is getting better and better. And this is just one small component that will make developers lives easier.
Clip 6: OSSF launch
Phil Ewels: All right, then, Chris, I think you’re up next.
Chris Hakkaart: So the next thing I wanted to talk about was the closing of the Summit. It was the announcement of the Open Science Software Foundation. I think the better thing to do here is we’ll watch the clip first and then we can maybe talk about it a little bit afterwards.
Evan Floden: I think in Phil’s talk, you talk a lot about all of the stuff that happens behind the scenes that you don’t really see. There is a lot of work, whether it’s just stupid stuff like, who’s gonna make sure that the domains get renewed, pay for the servers, make a lot of the projects kind of like, uh, happen, um, in that way.
So, we’ve been thinking how can we ensure that the future of those resources are built in a sustainable way. And today I’m very proud to announce that we are launching the Open Science Software Foundation.
This is a foundation which is a public good. Whose mission it is to empower scientists through open source.
Chris Hakkaart: Yeah, so I think that was a nice way to finish the Summit. Introducing everyone to this idea that we want things like nf-core to last, we want open science to last, we want projects like these to be established and to be sustainable. For me, that was a real highlight. But I know Phil, you’ve done a lot of the thinking behind this. Did you want to jump in on that as well?
Phil Ewels: Yeah, it was, quite a strange sensation going on stage for that, because we’ve been talking about this behind the scenes for such a long time. This has been years in the making.
And really, when we started nf-core, it was the whole point, right? To reduce the bus factor, to make these things collaborative. That’s the whole thesis of nf-core, that no one owns the pipelines. It’s all about sustainability, collaboration. And this is a real milestone for that I think.
On a personal note, that was the very last moment of the Summit week. And I was so tired, and I watched back the recording afterwards. I was almost delirious on stage. I don’t think I did my best performance, but, uh, I appreciate you picked a snippet of Evan talking, Chris. Good.
Adam Talbot: It’s nice to see just future of nf-core being a bit more secured.
Phil Ewels: And I think really the critical point is independence as well, because we’ve got backing already. We’re supported by Seqera and we’ve got other people who help out at SciLifeLab and others. But now this kind of hopefully forms the foundations for safeguarding all of that in a neutral manner.
Jon Manning: People sometimes have a misunderstanding that Seqera is more involved in nf-core than it actually is. I think it’s really nice to underline the fact that nf-core is its own community and it sets its own standards.
And in fact, we don’t always agree. There are times where we have disagreements, and there’s nothing that Seqera can do about that when it happens. The community does what it needs to do. I think that’s a really nice thing to underline.
Clip 7: FAIR code with nf-core
Phil Ewels: All right, Adam, what have you got for us?
Adam Talbot: So Ken gives a great talk, rather controversially titled how not to contribute to nf-core. Lots of lovely, great moments in it. But I think my favorite one was a moment where he got me to rethink something that I hadn’t really thought about, which is always great in a talk, when you reconceptualize something you thought you knew. So yeah, should we watch?
Ken Brewer: nf-core has built the tooling to make code FAIR in a way that I don’t think any other organization has done. And each of these findable, accessible, interoperable, and reusable principles have CLI commands that work for nf-core tools.
And if you set up your own modules library, you can simply add --git-remote to begin using this same kind of tooling for your personal code bases that you need in your organization.
Adam Talbot: Cool. Yeah. So I really like this. I worked a lot on the nf-core modules repository. I’ve worked with people setting up their own private versions and I’ve also worked with FAIR stuff a lot. But I’ve never really sat down and put the two together and realized that actually what nf-core modules was doing was implementing FAIR, for code.
It just helped me rethink everything I knew about nf-core modules and, how it works and what its purpose is. And I think he’s absolutely right. It really does nail down all the major issues you need to address to do FAIR principles with your software development.
Jon Manning: I was going to say it’s definitely an underappreciated feature of nf-core that you can use it that way. It’s a really good use case because it shows the power of taking nf-core components and using them in non nf-core workflows. You don’t actually have to use nf-core workflows to benefit from the power of those components.
Adam Talbot: Before I joined Seqera, we were using nf-core all the time, but just never directly on a pipeline, always with modules, always with the configs and stuff like that. Dark Nextflow.
Phil Ewels: I like that.
Clip 8: Teaching with Nextflow
Phil Ewels: Maxime, you’re up next. What have you got for us?
Maxime Garcia: Yes, Francesco explaining how we use Nextflow to teach bioinformatics at the master’s degree level in Italy. And I think it’s a great one, so let’s watch it.
Francesco Lescai: So we really need to start from the basics here. We start by teaching them the basics of bash, approaching the terminal, then we teach them R, and then we explain them about the containers, and we teach them how to use docker and singularity.
Once we’ve done that. Then we start discussing the infrastructure. We explain how to use a local HPC, and then we build on that knowledge to move on to the cloud. Once they have the foundation, then we start introducing Nextflow, and then they can use nf-core pipelines for specific biological applications.
So we basically cover three big categories. The coding, the infrastructure, and then the orchestration of the analysis.
Maxime Garcia: So for me, this one was great because we do a lot of training as well. We always struggle a bit to explain to people where Nextflow fits. What Francesco has been explaining is that first they teach the students a bit of bash, a bit of R, a bit of containerization, a bit of infrastructure with working on HPC and then working on the cloud.
And that’s where we put Nextflow and then later on nf-core. Because yes, you try to explain Nextflow to people that don’t know how to use bash, and you expect them to write a pipeline, it’s a bit hard to understand.
Jon Manning: I have a complaint about this one because it makes me feel old. I was probably 10 years past my PhD or something before I discovered workflow orchestration, and now they’re getting it in their undergrad, as it should be, and I think that’s a wonderful thing.
Adam Talbot: Just showing you the layers that you have to build up. Francesco’s obviously thought about it a lot.
Chris Hakkaart: It’s a massive achievement to piece all of this together, the coding, the infrastructure, and the orchestration, into a university course across two semesters.
Clip 9: Conda lock files
Phil Ewels: Okay, next up, Jon.
Jon Manning: Yeah, so this is probably the thing that makes me happiest about recent Nextflow developments. We often build containers in bioinformatics using Conda. But we just let those things build and then send them off into the wild and the container acts as a freezing mechanism, and that works. It works brilliantly. It’s what we do.
But the problem is that if you then try and rebuild that container in a future time, or if you try and use the bare Conda environment at a later time, you get a different set of software. And that’s because of transitive dependencies: the dependencies of our dependencies that can change under our feet when we build Conda environments.
And what Paolo is talking about in this clip is a feature of Seqera Containers or Wave underneath, to provide what we call Conda lock files as part of the container definition.
What that allows us to do is to rebuild the contents of that container at a later date, and to build a Conda environment that mirrors the state of that container as well, so that we can have kind of a uniformity of reproducibility. So that you can get the same environment again and again and again. And that’s pretty cool.
Paolo Di Tommaso: And the conda file that was mentioned before. This is nice because it allows us to recreate this container content, because here there is the precise resolution, each single package that is needed to recreate this environment, if you need to recreate it without using the container, in another environment, whatever.
Jon Manning: So yeah, this makes me happy. The conda lock files are actually the URIs to the conda packages. So you bypass dependency resolution with conda. It’s quite a computationally expensive process to do that solving, and it takes a while. So if we can bypass that, it also makes conda environments a lot faster.
So I’m excited about this small thing in a number of different ways. And I think it’s a very small thing that could actually produce really significant gains in bioinformatics.
Maxime Garcia: Think you’ve been complaining about that for a while. I know how much it makes you happy. I didn’t realize it was such a problem until I stumbled onto it when I was trying to update some modules. The version was not the same in between Docker and Conda. When I discovered that, I was all on board with you and defending the conda lock file, and I’m so happy as well that it came.
Phil Ewels: Those URLs to the package. They also include the MD5 sums of the downloads, so you have verification of the downloads at runtime with Conda as well. So it’s not only pointing to the full dependency stack, but it’s also verifying every single package is exactly the same down to the byte as the original time it was generated.
Jon Manning: And hopefully this will be part of nf-core soon as well. So we’ll have them on nf-core modules, these lock files.
If you want to learn more, there’s a blog out this week on exactly this subject.
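As a rough illustration of the two points above (a lock file lists direct package URLs, and each URL carries an MD5 sum used to verify the download), here is a small Python sketch. It assumes the explicit Conda format, where an @EXPLICIT marker is followed by one URL#md5 entry per line. The function names and the example URL are hypothetical, and the real verification is done by Conda itself, not by code like this.

```python
import hashlib


def parse_explicit_lock(text):
    """Return (url, md5) pairs from the text of an explicit Conda lock file.

    Comment lines and the @EXPLICIT marker are skipped. An entry without
    a '#md5' suffix gets None as its checksum.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        url, _, md5 = line.partition("#")
        entries.append((url, md5 or None))
    return entries


def verify_download(data: bytes, expected_md5: str) -> bool:
    """Check that downloaded bytes match the checksum from the lock file."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()


# Build a tiny lock file around a fake package payload (made-up URL).
payload = b"example package bytes"
digest = hashlib.md5(payload).hexdigest()
lock_text = (
    "# platform: linux-64\n"
    "@EXPLICIT\n"
    f"https://example.org/conda/linux-64/tool-1.0-0.conda#{digest}\n"
)
entries = parse_explicit_lock(lock_text)
```

Because every entry is already a fully resolved URL, nothing has to be solved when the environment is recreated, which is where the speed-up mentioned in the discussion comes from.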
Clip 10: MultiQC in scripts
Phil Ewels: All right, Adam, you’re up.
Phil Ewels: So this is the absolute simplest script that you can write with MultiQC. We are just pulling in MultiQC at the top, and we’re calling two MultiQC functions saying parse logs, and we’re giving a file directory and then we’re saying, great, write a report. And now instead of doing MultiQC dot, instead of running a MultiQC command line, I can just run this script.
I say Python script file name, and this just generates exactly the same report you would normally see.
Adam Talbot: Yeah, I like this feature purely because I was going to use it immediately. I was like, right, there’s four use cases here. I can immediately grab this and go.
The ability to use MultiQC in your Python is going to simplify a lot of stuff. The amount of times I’ve had to run a tool, grab the logs, put it into MultiQC, get them out of MultiQC and then pass them in a second Python script.
So this is going to be just really useful.
Phil Ewels: I ended up basing most of my talk around that feature because although it’s a fairly small change in MultiQC and it won’t impact the majority of users, it’s very invisible. Unless you know it’s there, no one’s going to use it, but it’s really powerful.
With just a Python script, no one’s got any excuses anymore. It’s super easy.
Adam Talbot: And there’s so many tools in bioinformatics that produce strange log outputs, and MultiQC is this huge repository of the ability to parse them and standardize them. So now suddenly you can grab whatever tool you want and just put it into a nice pandas data frame or something and just go.
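For readers who did not see the slide, the script described in the clip looks roughly like the sketch below: import MultiQC, parse a directory of logs, then write a report. It assumes a MultiQC release recent enough to expose the parse_logs and write_report functions in Python. The wrapper function build_report and the directory name in the comment are placeholders added here, not part of the talk.

```python
def build_report(log_dir):
    """Parse tool logs in log_dir and write a MultiQC report.

    The import is kept inside the function so that this file can be
    loaded even where MultiQC is not installed.
    """
    import multiqc  # requires MultiQC to be installed

    # Find and parse any supported log files under the given directory.
    multiqc.parse_logs(log_dir)
    # Write the report, as the multiqc command line would.
    multiqc.write_report()


# Placeholder usage: build_report("./data")
```

Saved to a file and run with Python, a script like this takes the place of running the MultiQC command line on the same directory, which is the point made in the clip.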
Clip 11: Small Nextflow Big Nextflow
Phil Ewels: Maxime, you’re up next.
Maxime Garcia: Yes, this talk was nice because we have different views on how to develop a pipeline. And Rob explains that in his talk.
Rob Syme: So your workflow exists on a continuum somewhere between what I’m calling small Nextflow and big Nextflow. At the big Nextflow end, we’ve got workflows like the big flagship nf-core pipelines, some of the in house production pipelines that I know many people in the room are contributing towards and running at a massive scale.
Okay. Nextflow is extraordinarily flexible, with many optional features. And Big Nextflow has basically everything switched on. You’re using all the features.
At the other end, we have what I like to call, small Nextflow, and to be honest, I love small Nextflow. I really love small Nextflow. There are many people, myself included, who start small, who use Nextflow to manage individual projects, with no intention, really, of releasing them into nf-core, or to building them into something that’s going to do a hundred thousand genomes.
It might be as simple as a single main.nf. But don’t be fooled by this modest little start. Like, even here, right at the very end of the spectrum, Nextflow provides massive benefits.
Maxime Garcia: So yes, I really like how Rob explained how we see Nextflow development as a continuum between small Nextflow and big Nextflow. I can see that he sees himself on the small Nextflow side. I see myself on the big Nextflow side, because when I start talking about building a pipeline I’m already thinking like nf-core. Okay, so I want modules, I want containers, I want CI/CD, I want nf-test. I want all of the things.
Rob likes to start small and it’s not a bad way to start either.
Jon Manning: Yeah, I think this is a good reminder for us that there’s this whole universe of Nextflow outside of nf-core. There are people who don’t care about the CI/CD, they just want to get something done quickly to get their own little bit of work done. And so a good reminder of that.
Adam Talbot: We often casually throw out the term “Nextflow scales really well”. It means very different things to different people. Sometimes that means your samples are really big, sometimes it means lots of samples, sometimes it means a simple pipeline, sometimes it means a really big, complicated pipeline.
And I think one of those dimensions is the analysis itself: there is a little analysis that someone wants to do versus a production-level pipeline. And there’s everything in between. And Nextflow scales really well between those two things, as well as across thousands of samples or big samples.
So yeah, Rob highlights that really well here in this talk.
Phil Ewels: Okay, I think that’s everything for our highlights. It’s really fun going back through these videos. I can enjoy them a second time.
Thank you for suggesting your favorite snippets.
Introduction: Team 2
Phil Ewels: Hi, and welcome back. This is recording session two for the different time zones. So I’ve managed to grab a different group of Seqerans together to talk about all of our favorite bits.
Let’s just do a quick round of introductions; say who you are.
Rob Newman: Hey, I’m Rob Newman. I’m a product manager at Seqera. Everything related to the platform, specifically my focus recently has been on data studios.
Rob Syme: I am Rob Syme, scientific support leader at Seqera. I spent a lot of time helping out people writing and executing Nextflow pipelines.
Sasha Dagayev: Hey guys this is Sasha. I’m a product manager at Seqera. I mostly focus on growth and AI.
Ken Brewer: Hi, my name’s Ken. I’m a developer advocate at Seqera, and I work with our community team, both with outreach and with academic institutions.
Florian Wünnemann: I’m Florian. I’m a bioinformatics engineer at Seqera and I’ve recently been working on benchmarking and working with our customers to enable them to run Nextflow at scale.
Clip 12: Genomics England
Phil Ewels: Great stuff. So we’ve got some highlights picked out here. Ken, you’re up first.
Ken Brewer: Yeah, so this was a fantastic talk from Genomics England about how they are helping people with rare diseases understand what is causing the symptoms as early as possible.
Edwin Clark: So what this kind of brings to the fore is the concept of the diagnostic odyssey. A baby’s born, they may develop symptoms. At that point, they’re taken to the doctor, they go to various specialists, they might undergo various tests, and this whole process can take months, possibly even years, and they may get a diagnosis at the end of that. But unfortunately, over this whole time, that child is living with whatever condition they’ve developed.
What the Newborns Generation Study aims to do is to skip out those middle steps and move to a much more proactive and preventative form of medicine. With the aim of sequencing another 100,000 newborns here shortly after birth, looking for a list of around 200 conditions where we have quite good confidence that there’s a good treatment pathway available, and that if we intervened, we could vastly improve the life of that newborn baby.
My team are tasked with the challenge of building a system to automatically and continuously process 100,000 whole newborn genomes over a period of about three years.
Ken Brewer: Yeah, so the part of Edwin’s talk that I enjoyed the most is where he talks about the impact, where they want to shorten the diagnostic odyssey, the seven to eight year average timeline that it takes for someone with a rare disease to get a diagnosis that explains their symptoms. The company that I was working with before I joined Seqera was GeneDx, which is also working on a very similar study in the U.S. called the Guardian Study.
What Edwin’s team is working on is: how do you build a system that can automatically and continuously process 100,000 whole genomes? He talks about what a modern Nextflow with Seqera Platform based system is able to deliver for them in terms of scalability, cost, monitoring, logging, and all of that.
Phil Ewels: Yeah, that was a really impactful start to his talk, where he’s got some case studies of real life children. Makes everything else feel a bit more real.
Genomics England, it was not long ago that they ran the first 100,000 Genomes project in the UK. And that was a huge effort that took years. And now we’re launching into these projects where we’re doing the same number of genomes. And it’s not such a big deal. And the timelines are much shorter.
And it’s the pace at which things are accelerating, it’s phenomenal to see.
Rob Syme: It’s the first time I’ve heard this term, diagnostic odyssey, which I think is wonderfully evocative, and a perfect term. I really love this focus on the patient experience.
Sasha Dagayev: There’s a lot of information that we can provide people that’s much more definitive than anything else that they’ve been exposed to before. So, very exciting.
Florian Wünnemann: It’s such a relatable thing to sequence someone’s genome and then being able to tell them, this is what you have.
I’ve previously worked in a children’s hospital where we worked with children with congenital heart diseases. And sometimes it’s a single variant that’s causing their disease. And before genome sequencing or exome sequencing, it was really hard to identify those genetic variants. But now that we can do these screenings, we can really offer these people hope. You really have something that you can go to the patient with and say, this is why you’re sick.
Phil Ewels: And with this case study in particular, you don’t even have to wait until the child’s sick. You can catch those diseases before the sometimes irreversible impact starts to take hold, which is huge. You can really change lives and save lives in a way that’s not possible any other way.
All right. That started us off on a bit of a serious note, but let’s see if we can turn things around.
Clip 13: Alphafold interactive reports
Phil Ewels: Florian, you’re up next, what have you got for us?
Florian Wünnemann: Yeah, so I chose a talk from Ziad from the Australian BioCommons project. He is showcasing some of their interactive reports for visualizing protein structure predictions from AlphaFold. I’ve chosen this particular snippet because I think it’s quite pretty and it showcases these protein folding results, what the proteins look like. But in general, the whole talk was quite interesting. It’s just pretty to look at, right? These types of methods won the Nobel Prize this year. So I think it’s also quite timely.
Ziad Al Bkhetan: Just directly from the platforms, AlphaFold will produce five different models. The users will have a visualization of these models in the three dimensional structure. They can interact with them. They can see some general information about the samples, can zoom in, zoom out, rotate. Even navigate between different ones.
You can see the pLDDT confidence as well, reflected on the structure, and you could even change the representation of the models, and I think four of them are supported.
Florian Wünnemann: What I really think the whole talk from Ziad highlights is Australian BioCommons really taking responsibility for driving reproducible and scalable science in Australia. Doing that using Seqera Platform and also directly contributing back to the community.
So Ziad and his team, they were pushing this code, the nf-core/proteinfold pipeline. They worked during the hackathon on this. They are giving the tools that they are developing back to the larger community and not just keeping them for themselves. And I think that really just represents the best of the Nextflow community to me.
Sasha Dagayev: Yeah, I think, whenever I speak to folks that are starting out, I always joke that in bioinformatics, there’s this graveyard of people that wanted to be bioinformaticians, but could never get the damn thing to run or even turn on. So anytime I see stuff like this, which is more focused on, hey, how do we widen the net.
Phil Ewels: And that was one of the bits I really liked in his talk, where he showed that the nf-core/proteinfold pipeline could run over, was it three different models, I think, and then render those and also show the differences all overlaid on top of each other. Just with almost no setup and no real prior experience, you can run all these state of the art models and compare them all in a dynamic visual report. I mean, that’s pretty awesome.
Sasha Dagayev: I think it pushes other developers that are putting out models: the most impact that you can have is by contributing to this community, because it’s well supported.
There’s all these other tools that will make it easier for people to run.
Rob Syme: My favorite thing about the talk was this looked amazing and it required no input from Seqera or Nextflow. It speaks to the Nextflow and Seqera ethos of giving scientists and researchers the tool they need to build other tools.
So yes, Seqera Platform could absolutely build an alignment viewer into the platform, but it wouldn’t be as good as the tool that the researchers themselves have developed. It wouldn’t be as good as the one that Ziad developed because research moves so incredibly quickly.
So we’re used to seeing MultiQC reports in the reports feature, which is great, but it can be so much more. Nextflow gives you the option to generate arbitrary reports. HTML is so flexible and it’s actually not a huge lift to start building custom things that integrate with Seqera Platform and can be interactive in the way that we’ve just demonstrated here.
So I love this. Seqera builds tools that allow people to build tools and don’t dictate the sort of analysis work that you’re doing.
Phil Ewels: And no lock in, right? Because Ziad also said that this is just from a pipeline. If you run Nextflow on the command line, no Seqera Platform, nothing, you get the same reports. It integrates really well with Seqera Platform, but there’s absolutely no requirement to use that.
Clip 14: Data Studios
Phil Ewels: Alright. Sasha, you’re up next.
Sasha Dagayev: Yeah. So for me, I think that Evan’s keynote with the custom environments was very cool. So yeah, I picked out this little snippet here.
Evan Floden: Importantly, as I mentioned, the data is all mounted there. So you can see they’ve got iGenomes mounted. I can perform a task as I would. So I’ve got samtools and I’m just going to do a view on chromosome 21 of data, which is in a public bucket, but I can treat it like it’s just local to me here.
And you can see that it’s just basically treating it like that. So that’s fantastic.
That’s all really enabled by a lot of the Fusion technology that we’ve been building in. But again, it makes it seamless in this way to work across cloud and across different regions as well.
Sasha Dagayev: Yeah, so this was my favorite part. I think that installing stuff is my least favorite part of bioinformatics. So really anything that gets rid of that for me personally is a huge win. So just having it go from something like containers straight into something that you can run on is amazing. Bioinformaticians in the future, they’ll just expect it to be how things work.
Rob Newman: I think one of the big things for me that was really nice is, Data Studios uses the Fusion file system under the hood. And this was a great example of just how fast it is. Just the ability to mount those buckets in S3 into the environment and then run a command like samtools and just have it just go boom, like straight through.
I think that was a really nice demonstration of the efficacy and speed of Fusion. It was kind of nice to see all those things together.
Phil Ewels: Sasha, like you were saying, software with the containers, I think Fusion is the same thing for data where you don’t have to think about it anymore. You don’t have to know the difference between POSIX file systems and S3 buckets. You just have some data, you just chuck it into a terminal, you just try and run it and it works.
Lots of complicated software making the end result very, very simple.
Rob Syme: It feeds into this story, scientists shouldn’t have to think about infrastructure. More time thinking about the science. Which is like an unambiguous win.
Clip 15: Covid and Ebola
Phil Ewels: Alrighty. Florian, back to you.
Florian Wünnemann: Yeah, I chose this clip from Sam Wilkinson’s talk about CLIMB-TRE. He presented how they utilized Nextflow and how they had to set up database systems to look at the COVID outbreak in the UK.
The snippet itself is more of a funny clip, where he’s showcasing one of the reports. Because there’s an Ebola patient that is reported in the UK, and he makes very clear that this is sample data and there is no Ebola case identified in the UK. But it got a bit of a chuckle in the audience, just because of how he presented it.
Sam Wilkinson: Uh, again, synthetic. No, we haven’t spotted someone in the UK with Ebola, despite what that says.
Florian Wünnemann: So obviously that was just a soundbite. But what I really enjoyed about this talk, besides the humor that Sam put into it, is also how they named their software, which is all based on a popular Nintendo video game, Zelda. And he’s jokingly saying in the talk that he still expects Nintendo to show up and be like, you can’t use that logo.
But all jokes aside, scientifically, this was a great talk as well, because again, he talked about the importance of something like the COVID pandemic and how tools like Nextflow and the systems that they build around it and with it really facilitated them to do a type of monitoring and analysis of a pathogen that I think 10 years ago would have been almost impossible at the scale that they were doing it.
And also how they learned the lessons from doing this. If there’s ever a new outbreak of another pathogen in the future, they have these systems set up, and because of the way that Nextflow works, a lot of these things will still work in very much the same vein in two or three years as they do right now. So this reliability of Nextflow as a language, as a software, was something that he highlighted a lot, which I thought was just really, really cool.
Sasha Dagayev: I think it’s also important that it’s open. So if there’s a government body that doesn’t have the infrastructure that University of Birmingham has, this is available for everyone. The CDC can run this and it will work. This isn’t just reproducibility for citations.
Clip 16: Seqera AI
Phil Ewels: All right, I’m going to jump into the next one. Sasha, I’m going to put you on the spot, because I’ve picked out your talk. Let’s give it a listen.
Sasha Dagayev: Within each individual process, you guys might notice a script testing module that allows you to test it in an AI sandbox. And we have this button here called Start Test. Should we push the button, Evan? You guys want to push the button? Should we push the button? Yeah! Alright, let’s push the button.
So when we push the button, what happens is that the AI spins up a micro VM that has Nextflow and Wave already installed on it.
Then it goes through nf-core test datasets to find the right sample for the process and then it writes it to a file.
And what it’s doing right now is that it is actually executing this code and seeing if it will actually run. So if we scroll down further, awesome. Amazing. It ran.
Phil Ewels: I think this is probably one of the most memorable bits of the whole week for me. There’s two things that I love about this clip. Firstly, made me think how much you must have practiced this with Evan.
Sasha Dagayev: We spent a lot of time together that week.
Phil Ewels: It was a nice dynamic. It was tongue in cheek, but I felt like it brought the whole audience along with you.
So that was a lot of fun, the push-the-button thing.
And then the other bit, which is you had this buildup where you’re walking through all the steps and explaining what was happening. And you’re saying it does this and it does this and it finds test data for you automatically. And I could feel everyone in the audience going, where’s this going? And then you scroll down and you see that output from Nextflow, which is so familiar to everyone in the audience, where you can see Nextflow was actually run, and I could just hear audible gasps all around me and I was like, yes, that was amazing.
Sasha Dagayev: I think for us, with that demo, we were talking back and forth whether to do it live or not. And that’s one of the things I like about Seqera, there’s a bit of a builder mentality and part of building is demoing the real thing, you know?
Just looking back at this, I can hear the nervousness in my own voice where I’m like, okay, this is an AI system and obviously, I had high confidence that it was going to succeed in this specific instance. But yeah, it did end up working pretty much how we had expected it to do.
Phil Ewels: And, if I’d been able to pick out a second clip from the same talk, I think a little bit after that, you scroll down and you’re like, Oh no, it’s broken. Everyone in the audience is like, Oh no, this demo is about to go horribly wrong.
But then again, you had us all wrapped around your little finger. Cause it was on purpose.
Sasha Dagayev: Yeah. That one I was actually much more sure on. That one is a very consistent error that does happen for featureCounts. But it’s one of those things where you forget that when people come into our world, how many rakes you have to step over. So the purpose of a lot of these systems is to step on the rakes for you, especially if you’re learning. I think the reason why we started off with chat was because it’s a good onboarding experience, especially folks that are novices, that’s the place that they’ll start. And the way that a lot of the basic software in our ecosystem works is that the errors are so verbose and so scary. So that’s a big focus of ours is to make it more approachable, less scary. And so that way you can just jump in and get going.
Rob Newman: I couldn’t believe you were nervous. Like you were cool as a cucumber, man. I was really impressed. It was really a lesson in delivering a good demo. So kudos, Sasha.
Florian Wünnemann: My favorite moment was when you were writing, my boss is watching, please do a good job.
Sasha Dagayev: Yeah, you know, I’ve been playing around with these models for a long time and one of the few consistent prompt hacks that still works to this day is threatening the model where you’re like, Hey, the application will break if this isn’t standard JSON. And so that part actually improves the performance on stage.
Phil Ewels: The AI models were rooting for you as well. Not just the audience.
Good stuff.
Clip 17: Nextflow VSCode DAGs
Phil Ewels: Alright, Ken, what’s next?
Ken Brewer: Next one is part of the talk from Paolo and Ben about some new functionality built into the VS Code extension update.
Ben Sherman: I left a little gift for you, which is that if you do manage to fix all those errors, you’ll see this little preview DAG button pop up in front of your workflow. So, should I click the button? Alright. You click on that, and it’ll pull up a little DAG preview to the side.
So, this is very similar to the DAG you can get from running Nextflow, but you don’t have to run the pipeline to get it. And you notice I did this for just a sub workflow. So you could just look at a sub workflow instead of having to look at the whole pipeline.
And it’s also much, much more condensed. You don’t see all of the channel and operator boilerplate. So this is a way to get a very condensed description, a visual of what’s going on.
Ken Brewer: Yes, so this is another one of those great “should I click the button” moments. This time featuring a new way to interact with Nextflow through the VS Code extension. It highlights the real power of the work that Ben has been doing putting together a proper language server that can understand how a Nextflow workflow is composed and then use it to pull up documentation from within the workflow about how a particular process works or generate a DAG of a workflow before you run it. That’s something that users have been asking for from Nextflow for some time, and it’s great to be able to deliver that in a really developer friendly format.
Phil Ewels: It’s really good when you’re exploring big workflows written by other people. So if you dive into one of the nf-core pipelines and you’ve got 50,000 lines of code, you can start digging into the sub workflows and viewing the DAG for specific sections of the pipeline. And it’s much quicker than reading the code. You can visualize it and get a feel for the shape of it quite quickly.
I don’t think this one was scripted at all because none of us knew what you were going to do, Sasha, so that was just Ben.
Rob Syme: Such impressive work. It’s one of those problems I assumed it would be impossible. One of the greatest strengths and also one of the common criticisms levelled at Nextflow is its flexibility. And I always thought that a lack of language server was the price we had to pay to get all of this wonderful flexibility and developer dynamism. And it turns out if you just throw enough smart people at it, well, one smart person in this case, you can actually get both. It’s just a wonderful win for the community.
Florian Wünnemann: One of the really cool things that was just released, like I think today, is that now when you preview the DAG, you can actually click on modules or subworkflows and it’ll take you to the code. So if you’re exploring a workflow, and you want to see the code for a specific process, you can just click on it and it’ll open that process code within your IDE. Which is just amazing, and makes it even more straightforward to explore code using the DAG rather than having to go and find that process afterwards. You can just click on it.
Phil Ewels: That’s seriously cool.
Clip 18: Fusion snapshots
Phil Ewels: Okay, Rob Syme, you’re up next. What have you got for us?
Rob Syme: So I love Fusion. Fusion is one of those technologies where I’m in awe of the engineering talent at Seqera for developing it. Seriously impressive work.
This is a commonly requested feature, because as I was talking about earlier, like a lot of the work we do at Seqera is ensuring that people can think less about infrastructure.
But one of the things that has been stuck for a while is thinking about instance types. You have this option when you’re running on the cloud to use the on-demand or spot instance class. You have to think, “is this a process that’s going to run for a long time? And if it is, I should probably not have it on spot instances”, and then you have to ensure that those jobs get ferried to the right queue.
You can absolutely do that in Nextflow with some dynamic configuration and Seqera Platform supports it. But it’s one extra thing to think about.
Fusion Snapshotting gives us the option to just run the pipeline on the cheapest instances you can, and you don’t have to think about that anymore. We’ll have a look at the clip, and maybe we can talk a little bit more about how it works.
Paolo Di Tommaso: Spot instances are the provisioning model that more or less all cloud providers have to provide cheap computation, but with the caveat that they can interrupt the virtual machine at any point in time because they reallocate it to another job, another customer, whatever.
It’s the one that we want to use because it allows us to have much more efficient computation from cost point of view. But it works well if you have small jobs in your pipeline. The reality is that genomic pipelines have very large jobs. It takes hours.
So, you run your pipeline, use spot instances, they get interrupted, and so what is Nextflow doing? Restarting the task.
With a few interruptions and restarts, you get the pipeline completed. When there is low availability of a specific instance type, you just burn a lot of money for nothing.
What could be done instead is that we take the state of the computation, we save it into a file, and then, when the job restarts, we continue. That is exactly what we are introducing today with Fusion Snapshotting.
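The cost argument Paolo makes can be illustrated with a toy model. The Python sketch below is not how Fusion is implemented; it only compares the compute hours billed when every interruption throws away progress with the hours billed when progress is saved and resumed. The function names and the example numbers are hypothetical, and the model ignores the time needed to write and restore a snapshot.

```python
def hours_with_restart(task_hours, uptimes):
    """Billed hours when each interruption forces the task to start over.

    `uptimes` lists how long each successive spot instance survives.
    """
    billed = 0.0
    for uptime in uptimes:
        if uptime >= task_hours:
            return billed + task_hours  # this attempt runs to completion
        billed += uptime                # progress lost, but still paid for
    raise ValueError("task did not finish within the given uptimes")


def hours_with_snapshot(task_hours, uptimes):
    """Billed hours when state is saved on interruption and resumed later."""
    billed = 0.0
    remaining = task_hours
    for uptime in uptimes:
        if uptime >= remaining:
            return billed + remaining   # only the leftover work is needed
        billed += uptime
        remaining -= uptime             # progress is kept across interruptions
    raise ValueError("task did not finish within the given uptimes")


# Hypothetical example: a 6-hour task and instances lasting 4, 5, 3 and 8 hours.
uptimes = [4, 5, 3, 8]
print(hours_with_restart(6, uptimes))   # 18.0
print(hours_with_snapshot(6, uptimes))  # 6.0
```

In this made-up scenario the restart strategy pays for three failed attempts before one finally completes, while the resume strategy pays only for the six hours of real work.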
Rob Syme: Cool. I think Fusion Snapshotting is the first of the Fusion features that extends beyond simply provisioning data efficiently. I’m excited both for the feature in and of itself, but also what it means for the future of Fusion.
Rob Newman: It essentially will allow you to do this in Data Studios, right? And if that spot instance gets reclaimed in Data Studios, then you’ll just get a snapshot created, and you can just start up the Data Studio again, and it’ll pick a different spot instance. It has wide implications across the whole platform, which I think is just fantastic, and it’s going to be such a cost saver for our customers.
Sasha Dagayev: Yeah, I think the biggest thing, it’s one of those like basic building blocks, I’m excited to have another building block to mess around with. It opens up so many interesting product experiences and yeah, it’s just very exciting.
Phil Ewels: I saw a really nice figure where it showed the availability of spot instances on the spot market. Because when I started running Nextflow on the cloud quite a few years ago, we always ran on spot. It was a no brainer because it was much cheaper and almost never had jobs being canceled, or it was very rare.
But the spot instances are much, much more competitive now than they used to be. So now you really are having to think twice or build in dynamic retries, because getting instances canceled and having to retry jobs is becoming almost a certainty for bigger pipelines, especially if there are long-running jobs.
And that starts to take a dent out of how cost effective running on spot market is. This is so cool because it takes us almost back to the old days where there’s no compromise. You can just run on spot without any fear of racking up unexpected costs by three retries. It’s really exciting.
Rob Syme: Burn money or not burn money?
Phil Ewels: Exactly, it’s a no brainer.
Florian Wünnemann: I think the opinion of the audience is very clear because I think this was maybe the loudest cheering and excitement of people in the room from all the talks. So I think people were genuinely really excited about this.
Phil Ewels: Yeah, and we’ve got a private preview open for this at the moment. So if anyone listening wants to give it a try, I’ll put the link in the show notes and you can jump over, sign up and try this stuff out early.
Clip 19: Sustainable computing
Phil Ewels: Alright, I’ve got my next one here. For this one I’ve picked Loïc’s talk. He was the keynote speaker that we invited to Barcelona this year. He wrote this piece of software called Green Algorithms together with Mike Inouye, and he’s become a real figurehead in the community for sustainable computing.
His keynote was all about that. It’s a great talk. Even if you’re not using Nextflow, even if you’re not working in life sciences, necessarily, if you’re just using computing, it’s a great talk and it tackled some really interesting dynamics.
The particular clip I’ve put in here, he demonstrates very visually using a map about the variances in carbon intensity of electricity markets in different geographical regions. Anyway, let’s give it a listen.
Loïc Lannelongue: And so we’ve got this electricity map, which is really interesting. They show it in real time, so the map changes all the time. And it’s basically the carbon intensity per country. And that’s the biggest difference between locations.
Let’s look at a few examples. So, focusing on Europe, let’s say you decide to move your compute from Sweden to Poland. You multiply your carbon footprint by 46. Same hardware, same software, same everything else, just because the electricity is more renewable in Sweden and more powered by gas and coal in Poland.
Okay. But that’s a bit extreme moving your compute between countries like that. Let’s focus on the West Coast in the US.
So we’re in Washington state and we just moved between two different counties. Suddenly we multiply the carbon footprint by nine because again, the energy providers are not the same. And therefore the share of renewables is not the same.
So that’s a little bit scary, but it shows that there is great opportunity because we are quite lucky in terms of research that you don’t have to be next to your servers. So that means we do have a lot of opportunities to leverage renewable energy.
Phil Ewels: I thought it was a great visualization and a good reminder that we do have a choice in this. If you are running on the cloud, this is such an easy thing for you to change. You get that dropdown list of which Amazon regions you want to pick, and you can just move your compute around the world. It has relatively little effect on you, but it has quite a big effect on the energy you’re using for that compute.
Ken Brewer: I really liked the part of his message that it isn’t that you should avoid doing the high impact science that we’re working on. There’s so many really important things that are happening in these HPC centers. It’s just being judicious and weighing those costs and benefits and being more optimal about how you request your resources.
And Nextflow allows you to do that so elegantly. It’s really great to have the summit for a software that makes green computing so much more accessible, having someone who is talking about the impact of that as well.
Florian Wünnemann: I think there’s a Nextflow plugin, it’s called CO2 footprint. After you run the pipeline, it calculates energy costs. It gives you an HTML report, like what is the usage? What are the CO2 emissions you’re going to have running it here versus there?
Phil Ewels: Yeah, that’s from Sabrina Krakau’s group at QBiC in Tübingen and it is a collaboration with Loïc. So that works by taking hard coded metrics from Loïc’s research. So he’s got basically on GitHub some big spreadsheets about if you’re using this AWS instance type, it has this kind of CPU which has this kind of power usage. And then a bunch of mathematical expressions, which calculate CO2 emissions from that.
My favorite aim for that plugin is just to make it visible. Just when you run your pipeline and say, Hey, just so you know, you use this much and then try and nudge.
You had this many retries and those tasks cost this many grams of CO2. Maybe think about optimizing your config a little bit for next time. Or say: you ran on this AWS region. If you’d chosen this different AWS region, your impact would have been this many fold less.
So not changing behavior necessarily, but just informing, making it visible and then trying to give nudges in the right direction.
I can tell my dishwasher to try and pick an ecologically friendly time to run at night by looking at the spot electricity market in Sweden. So I don’t think it’s beyond the realms of possibility that we could build out functionality like that in the future. I’d love to see that.
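As a rough illustration of the kind of calculation described above (power usage per task, multiplied by the carbon intensity of the local grid), here is a minimal Python sketch. It is not the plugin’s actual code: the `Task` fields, the constants, and the grid intensities are all hypothetical placeholders chosen for easy arithmetic, and the real metrics come from the published reference tables mentioned in the conversation.

```python
from dataclasses import dataclass

# Illustrative placeholder constants, NOT values taken from the plugin.
MEMORY_WATTS_PER_GB = 0.4  # assumed power draw per GB of memory
PUE = 1.5                  # assumed data-centre power usage effectiveness


@dataclass
class Task:
    """Resource usage of one pipeline task (hypothetical structure)."""
    runtime_hours: float
    cpu_cores: int
    watts_per_core: float    # would be looked up per CPU model in practice
    cpu_utilisation: float   # fraction between 0 and 1
    memory_gb: float


def energy_kwh(task: Task) -> float:
    """Energy drawn by the task, including data-centre overhead."""
    cpu_watts = task.cpu_cores * task.watts_per_core * task.cpu_utilisation
    memory_watts = task.memory_gb * MEMORY_WATTS_PER_GB
    return task.runtime_hours * (cpu_watts + memory_watts) * PUE / 1000.0


def emissions_grams(task: Task, grid_g_per_kwh: float) -> float:
    """CO2-equivalent emissions for a grid with the given carbon intensity."""
    return energy_kwh(task) * grid_g_per_kwh


def fold_difference(task: Task, here_g_per_kwh: float, there_g_per_kwh: float) -> float:
    """How many times higher emissions are 'here' compared with 'there'."""
    return emissions_grams(task, here_g_per_kwh) / emissions_grams(task, there_g_per_kwh)


if __name__ == "__main__":
    task = Task(runtime_hours=2.0, cpu_cores=4, watts_per_core=10.0,
                cpu_utilisation=1.0, memory_gb=10.0)
    # Two made-up grid intensities (grams CO2e per kWh) for two regions.
    print(emissions_grams(task, 450.0), emissions_grams(task, 50.0))
    print(fold_difference(task, 450.0, 50.0))
```

The point of the sketch is that the task’s energy use is the same wherever it runs; only the grid term changes, which is why moving region can change the footprint by a large factor without touching the pipeline.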
Clip 20: nf-core/tools
Sasha Dagayev: So I really liked the talks that are very focused on saving people when they’re aspiring bioinformaticians. And a lot of the stuff with nf-core tools is one of those. It’s such a cleaner onboarding experience into our world. Because the learning curve is, it’s not a curve, it’s a cliff, right? And tools like this are exactly what we need to push everybody forward. So I have this clip with Júlia showing how easy it is to create a new pipeline using nf-core tools.
Júlia Mir-Pedrol: We also introduced the text user interface. So now when you run nf-core pipelines create, you will see this interface, which guides you through the process of pipeline creation. You will see information, documentation, um, of every step. You will see these kinds of forms to provide all the required information for creating a pipeline.
You will also see help messages describing each of the features, helping you decide if you need this feature, you can safely skip it. And finally, it also helps you create a GitHub repository to hold your pipeline, which makes things a little bit more automatic and automated, and reduces the steps that you had to do manually after creating a pipeline.
Sasha Dagayev: So this is so much better. When I started in bioinformatics, they gave me a tome that was like this big. It was insane. The first like three chapters were about some NCBI module that no longer existed. And I was like, I can’t believe this.
And this is so much better. This is such a better onboarding experience for new users. It’s an interface that if you’re not a command line hero yet, you can get by.
But number two, the best part about it is, it makes building things the right way the only way you know. So for new users, you really should be pushing them towards tools like this. So that they build the muscle to build things in a way that people who have stepped on all the rakes know that, like, Hey, these are a lot of the things that you pretty much should be doing every single time that you’re starting this process.
Ken Brewer: Yeah and I think the modularity in terms of turning features on and off in the pipeline template, is just such a valuable unlock in terms of being able to start with a very minimal pipeline template to begin with. Because a lot of times like you don’t need to be going for a full production grade well documented nf-core pipeline destined for publication in a journal. So being able to choose only the features that you need for your individual small project is something that I’ve felt very passionate about for some time.
Phil Ewels: Yeah, a lot of credit goes to you actually, Ken, and also Adam, the big advocates for having a more minimal template. Maxime was saying how different people have different styles, but it’s this big and small Nextflow thing again. Maxime likes to go in with all the bells and whistles, even if he’s just doing one or two processes, others such as yourself like to just use what you need. And if you only need a few files, then that’s great. And now we have that flexibility in the tooling for nf-core template.
Florian Wünnemann: Yeah. And it’s not just like big and small Nextflow production versus my own little pipeline. It’s also different fields from genomics, right? Like I had to write like image analysis pipelines in the past. And when you don’t use genomics, it’s really like a bit of work to start with the template before version three. And pick out all these things. And you’re doing this and maybe you’re doing this for multiple pipelines. So either you keep the work you’ve done once, so that you can just copy this, but that’s also not the way you want to do it. And so every time you want to start a bigger project, you’re like, Oh God, I have spent like half a day, just like, ripping out the roots of genomic science out there.
So this template update is really also opening up these templates to fields other than genomics and transcriptomics, I think, which is great.
Clip 21: Industry best practices with nf-core
Phil Ewels: All right. Rob, you’re up next.
Rob Newman: Yeah, I think this lends itself really nicely to follow on from this last one, actually. I really liked Ken’s talk because he was saying, Hey, if you are in industry, you don’t have to reinvent the wheel. There are still so many things out there to help you. There were many things that I didn’t actually know. So for me, it was a really good reminder about some of the functionality that nf-core offers.
And also just these really concrete and tangible steps. Maybe you can’t go all the way and completely open source everything, but there are avenues and pathways by which you can leverage this great body of work that’s already out there.
One of the things that Ken said right at the end was, proprietary data does not equal proprietary analysis. And I love that.
Ken Brewer: So I’m gonna talk about a couple of different options that are ways to do this from setting configuration for existing pipelines and files, setting up version controlled institutional configurations, best practices for setting up private forks of nf-core pipelines. Writing custom Nextflow plugins that allow you to extend the capabilities of Nextflow. How to build custom pipelines from scratch, also using nf-core tooling. And finally, how to set up a modules library.
Phil Ewels: You’ve obviously done well, Ken, with this talk because you’ve been picked twice on a single podcast here.
Ken Brewer: Thank you. I mean, I’ve told several people, who’ve asked me about, Oh, tell me about the work trip that you had to Barcelona. It was an incredible experience. And I had the chance to give a talk that I’ve been wanting to give for a couple of years now.
Because I think these messages, there’s part of it that’s specific to nf-core, and the ecosystem of nf-core and the tools that are available to really make it easier for big organizations with lots of proprietary code and data to work with best practice tooling.
But then there’s also the general principles of the benefits of working with open source. And that’s something that the tech industry has really embraced and knows how to do well, and bringing more of that open mentality and open source thinking to the biotech world is something that like I’m incredibly passionate about.
Rob Newman: Yeah, really, really resonated with me anyway. And I think also from what I’m hearing many other people as well.
Sasha Dagayev: Yeah, I think changing the mentality, cause it’s a culture shift, right? The culture shift becomes easier if you make it easy to do the right thing.
Clip 22: Federated data queries
Phil Ewels: Right, then, nearly at the end of our list. Rob Syme, tell us a little bit about the talk you’ve picked here.
Rob Syme: So I’ve picked a talk from Kevin and the Quilt team and I’m such a big fan of Kevin and everything that the Quilt team is doing. And I think like it follows on from what you were saying there, Phil, about this idea that this work exists on a spectrum, you don’t have to do full production scale work all the time from the get go.
I think it’s about giving people an on-ramp towards production. And this particular feature that Kevin’s going to talk about here, I think, is a great example of that.
It’s very easy for a computer science person to come in and say, the correct way to do this is to write your pipeline to store stuff in Solid database that stands up and has this perpetual storage. And you have to manage this infrastructure. It’s worth it because you get the faster write and read times.
It’s a non starter for a lot of people, because we want to optimize for time to results and velocity of the developer experience. And the feature allows you to do queries for a tiny performance tradeoff, which is, I think, in my mind, absolutely worth it for people on that spectrum.
Kevin Moore: You don’t actually need one database to do this kind of querying. Like our name, Quilt, would suggest, we’re here to help talk about ways you can stitch together data sets from many different sources. And in this case the tool that we’re talking about is a federated system like AWS Athena that can also do query on read.
And those two capabilities, federating (merging queries from multiple databases or data sources) and query on read (which means you can query directly from the output files of your pipeline without actually extracting, transforming, or loading them into a data warehouse), will let you have, not necessarily the fastest query time, but the fastest time to your answer.
So in the way that Brendan Bouffler would always say, and because it takes a lot of time to grow a scientist, saving that scientist time, getting that answer faster is really what we’re all here to do.
Rob Syme: Like, you can save your scientists and researchers time. That is an easy win.
Phil Ewels: Yeah. You’ve got the embedded cost of doing the setup and the management work. Absolute speed of a specific query is far from the full picture. Okay.
Sasha Dagayev: Even from the dollars and cents perspective, this stuff is really fun. Let’s have more of the fun stuff.
Rob Syme: I think Kevin understated the power of this feature. I don’t think it’s quite clear. It only occurred to me after the fact. This allows you to have your Nextflow pipelines dump data, like your counts in CSV files, in S3, and then query across all of your runs with a single SQL statement. It’s a big deal, and it requires no changes to the pipeline. Your pipeline doesn’t have to know about this or care about this. They’re still CSV files. So if you have downstream analyses that don’t use Athena, they absolutely still work. So you have all this beautiful backward compatibility, but to be able to query across runs, with this pan-project view, is significant.
Rob Newman: Athena is definitely an understated technology. It’s a really good Amazon service.
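To make the "single SQL statement across all of your runs" idea concrete, here is a small local sketch in Python. It is only a stand-in under stated assumptions: it does not use Athena or S3, it reads per-run CSV files at query time into a throwaway in-memory SQLite table, and the file name `counts.csv`, the columns `gene` and `n_reads`, and the helper names are all hypothetical. The pipeline outputs themselves stay as plain CSV files.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def make_example_runs(base: Path) -> list[Path]:
    """Write two tiny, made-up run directories, each with a counts.csv file."""
    example = {
        "run1": [("geneA", 10), ("geneB", 5)],
        "run2": [("geneA", 7), ("geneC", 1)],
    }
    run_dirs = []
    for run_name, rows in example.items():
        run_dir = base / run_name
        run_dir.mkdir(parents=True)
        with open(run_dir / "counts.csv", "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["gene", "n_reads"])
            writer.writerows(rows)
        run_dirs.append(run_dir)
    return run_dirs


def load_runs(run_dirs: list[Path]) -> sqlite3.Connection:
    """Read every run's CSV at query time into one temporary in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counts (run_id TEXT, gene TEXT, n_reads INTEGER)")
    for run_dir in run_dirs:
        with open(run_dir / "counts.csv", newline="") as handle:
            rows = [(run_dir.name, row["gene"], int(row["n_reads"]))
                    for row in csv.DictReader(handle)]
        conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", rows)
    return conn


def reads_per_gene(conn: sqlite3.Connection) -> list[tuple[str, int, int]]:
    """One SQL statement spanning all runs: total reads and number of runs per gene."""
    query = """
        SELECT gene, SUM(n_reads), COUNT(DISTINCT run_id)
        FROM counts
        GROUP BY gene
        ORDER BY gene
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        connection = load_runs(make_example_runs(Path(tmp)))
        for row in reads_per_gene(connection):
            print(row)
```

Nothing about the files changes to make them queryable, which is the backward-compatibility point made above: any downstream tool that reads the CSVs directly keeps working.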
Clip 23: Nextflow Ambassadors
Phil Ewels: Okay, Ken, you’re going to take us home. What have you got for us?
Ken Brewer: Yeah, so the last clip on deck is from Geraldine talking about some of the efforts from the Seqera Community team, of which I am a member. She highlights the work in the Ambassadors program, which is an initiative led by my colleague Marcel.
Geraldine Van der Auwera: Uh, the Nextflow Ambassadors are volunteers, Nextflow enthusiasts in the community, who volunteer their time and effort. Just spreading the word, being a, a point of contact to the local community, to their domain community as well.
They write blog posts, they organize events, they do a lot of really cool activities that supports the community in a way that really amplifies anything that we could possibly do. So supporting them is really big for us.
They’re distributed very widely geographically, almost pole to pole. We’ve got ambassadors on every continent except Antarctica. We’re working on that. So this is a program that we’ll definitely continue to grow and nurture.
Ken Brewer: Yeah, so the part that I loved from this is just seeing the map with all of the Nextflow ambassadors across the whole globe on all six continents.
And just knowing a little bit about some of the activities that they’re working on, whether it’s running trainings, attending scientific conferences, representing Nextflow and helping teach Nextflow. It’s such an exciting impact.
And especially that it’s not just happening in the U.S. and Europe, but also in Africa, Southeast Asia, and a lot of places with not as great access to some of this tooling. And so that’s one of the really cool impacts that the community team has been working on.
Florian Wünnemann: Yeah, absolutely. The ambassador program is awesome. I mean, I’m an ambassador myself, here in Quebec, and we’re trying to establish groups that help each other. It’s also a great networking opportunity to bring together these people.
Especially in places where Nextflow isn’t as big of a presence yet. I think having people drive this who are enthusiastic, who love using Nextflow, who want to share why it’s great and why it’s such a fun and awesome tool to use in your research, is the best way to do it. It brings a personal connection that you can’t just get by saying here’s a tool, use it. Because there are people who are passionate and who are driving it.
Sasha Dagayev: Yeah. And it serves as validation, right? They’re real people, which is a sort of validation that I think is special for the ambassador community.
Ken Brewer: And one more note about the Ambassador program is that applications are currently open for our next round of Ambassadors, and they’ll be open for the next month. So if you are someone who’s super passionate about Nextflow, whether you work in industry or academia, and are excited about sharing that with other folks, please go to our website, nextflow.io, and the Ambassadors section, and there’s a link there. Especially if you are working in Antarctica, it would be really great to get all seven continents and not just the six.
Conclusion
Phil Ewels: All right, I think that’s a wrap. Thank you very much everyone for your time picking out all these highlights and chatting about them.
It was a packed agenda at the Nextflow Summit and there were just so many good talks. But I’ll leave that as homework for the listener then, to go and dig out the full playlist. You can go to summit.nextflow.io and you’ll find the full agenda, details of all the speakers, all the abstracts for each talk and the YouTube video. Or you can go directly to the Nextflow channel on YouTube and you’ll find a playlist and you can see all the talks there in the list and play through. You can cancel that Netflix subscription, if you like, save a bit of money.
I hope to see many of you in the next summit, which will be in Boston in May, 2025 and failing that, back in Barcelona in October, 2025, which I’m sure will come around faster than any of us like to admit.
Thanks everyone. And, we’ll see you all soon.