
Fusion Snapshots

In Episode 54 of the Nextflow podcast, Phil Ewels sits down with Lorenzo Fontana, an engineer at Seqera with deep expertise in Linux kernel internals, eBPF, and systems programming. Lorenzo is the co-author of the O’Reilly book “Linux Observability with BPF” and a key developer behind Fusion and Fusion Snapshots. In this episode, they explore Lorenzo’s fascinating journey from Linux security tools to bioinformatics infrastructure, and take a technical deep dive into how Fusion Snapshots actually work under the hood.

Summary

Lorenzo’s Journey to Seqera

Lorenzo Fontana brings a unique background to Seqera, having spent years working on Linux observability and security. As a subject matter expert in eBPF (extended Berkeley Packet Filter), he authored the O’Reilly book “Linux Observability with BPF” and was a maintainer of Falco, a CNCF graduated project for runtime security. His journey into eBPF began when a colleague at InfluxDB asked a deceptively simple question: “When I write to disk, how big is the actual physical write?” This curiosity led him to create kubectl-trace, a tool for running BPF programs in Kubernetes clusters.

What drew Lorenzo to Seqera was the mission. Having lost his father to cancer, he found deep meaning in contributing to infrastructure that accelerates drug discovery and research. His expertise in making things faster, more efficient, and more secure now serves a purpose that resonates personally.

Understanding Fusion

Fusion is a FUSE (Filesystem in Userspace) implementation that exposes cloud object storage (S3, GCS, Azure) as a POSIX filesystem. But it’s far more than a simple mount point. What sets Fusion apart from alternatives like AWS Mountpoint is its deep integration with Nextflow and purpose-built optimization for bioinformatics workloads.
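
To make the POSIX view concrete, here is a rough illustration of what working inside a Fusion-enabled task looks like. The /fusion/s3/... path convention follows the Fusion documentation; the bucket and file names are placeholders, not real examples from the episode.

    # Inside a Fusion-enabled task, an S3 bucket appears as an ordinary directory
    # (bucket and file names are placeholders):
    ls /fusion/s3/my-bucket/data/

    # Standard POSIX tools operate directly on object-storage paths,
    # with no aws-cli calls or explicit staging step:
    wc -l /fusion/s3/my-bucket/data/sample.fastq
    gzip -c /fusion/s3/my-bucket/data/sample.fastq \
        > /fusion/s3/my-bucket/results/sample.fastq.gz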

The Fusion team, including Alberto Miranda and Jordi Deu-Pons, has developed intimate knowledge of how bioinformatics tools actually behave. When pigz exhibits unusual behavior or nf-core/crisprseq has specific file access patterns, Fusion is tuned to handle these cases optimally. Generic solutions optimize for general use cases; Fusion optimizes specifically for pipeline execution, GPU workloads, and multi-task scenarios.

A key architectural decision was shipping Fusion within containers via Wave, binding the filesystem’s lifecycle to pipeline execution. This enables optimizations that would be impossible with a traditional machine-level mount.

The Magic of Fusion Snapshots

Fusion Snapshots solve a critical problem in cloud computing: spot instance interruptions. When running long tasks on spot instances (which can be up to 90% cheaper than on-demand), AWS can reclaim your instance with just two minutes’ notice. Without snapshots, an eight-hour task would restart from scratch—potentially costing more than on-demand pricing due to repeated failures.

Fusion Snapshots leverage CRIU (Checkpoint/Restore In Userspace) to freeze a running task’s complete state—memory, file descriptors, network connections, kernel state—and persist it to storage. When the task is retried on a new instance, it resumes exactly where it left off.
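
To give a feel for the checkpoint/restore idea, here is what plain CRIU looks like at the command line. This is generic CRIU usage, not the exact invocation Fusion Snapshots performs internally; the sleep process is a stand-in for a long-running task.

    # Checkpoint a running process tree and restore it later, possibly on
    # another machine (generic CRIU; requires root and the criu package).
    sleep 600 &                       # stand-in for a long-running task
    PID=$!

    mkdir -p /tmp/ckpt
    sudo criu dump -t "$PID" --images-dir /tmp/ckpt --shell-job
    # (by default CRIU does not leave the original process running after the dump)

    # Copy /tmp/ckpt to the new machine (or keep it in object storage), then:
    sudo criu restore --images-dir /tmp/ckpt --shell-job --restore-detached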

The implementation is elegant in its integration with existing Nextflow semantics. When you set maxSpotAttempts, Nextflow already retries tasks in the same work directory. Fusion Snapshots simply intercepts this retry, checks for existing state, and restores it rather than starting fresh.
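
In Nextflow configuration terms, the pieces compose roughly as sketched below. The wave.enabled, fusion.enabled, and aws.batch.maxSpotAttempts options are standard Nextflow settings; the snapshot option name and its exact requirements should be checked against the current Fusion documentation, and the work directory is a placeholder.

    # Sketch only: enable Fusion Snapshots alongside spot retries.
    cat > snapshots.config <<'EOF'
    wave.enabled     = true              // Wave injects Fusion as a container layer
    fusion.enabled   = true              // object storage exposed as POSIX paths
    fusion.snapshots = true              // resume interrupted tasks (check docs for exact option)
    aws.batch.maxSpotAttempts = 5        // Nextflow retries the task in the same work dir
    workDir = 's3://my-bucket/work'      // placeholder bucket
    EOF

    nextflow run <your-pipeline> -c snapshots.config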

Incremental Dumps and Timing Challenges

For huge tasks using massive amounts of memory, even two minutes isn’t enough to dump everything to storage. Fusion Snapshots addresses this with incremental dumps, periodically capturing memory state throughout execution so that when interruption occurs, only the difference needs to be written.
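
The incremental mechanism can be pictured with CRIU’s own pre-dump facility, which tracks dirty memory pages between passes so each pass only writes what changed. This is the generic CRIU building block; Fusion Snapshots layers its own scheduling and upload logic on top, and the PID and directories below are placeholders.

    PID=12345                         # placeholder: PID of the running task

    # Periodic pre-dumps write memory images and enable dirty-page tracking,
    # so each pass only captures what changed since the previous one.
    sudo criu pre-dump -t "$PID" --images-dir /tmp/ckpt/1 --track-mem
    sudo criu pre-dump -t "$PID" --images-dir /tmp/ckpt/2 \
         --prev-images-dir ../1 --track-mem

    # On interruption, the final dump only has to write the remaining delta.
    sudo criu dump -t "$PID" --images-dir /tmp/ckpt/final \
         --prev-images-dir ../2 --track-mem --shell-job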

The system intelligently detects when it’s appropriate to take snapshots, avoiding disruption during critical operations. A built-in visualization tool (fusion-snapshot-utils) generates HTML graphs showing snapshot timing and sizes, invaluable for troubleshooting.

Process timing synchronization relies on the kernel’s cgroup freezer, which completely halts all processes in a container. This ensures perfect consistency between memory state and filesystem state: for example, no duplicated or missing data when BWA is aligning millions of reads per second.
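
The freezer itself is a small kernel facility. A minimal cgroup v2 sketch is shown below purely to illustrate the mechanism; CRIU drives this for you, the cgroup name is illustrative, and root plus a cgroup v2 mount are assumed.

    CG=/sys/fs/cgroup/demo                  # illustrative cgroup name
    sudo mkdir -p "$CG"

    sleep 600 &                             # stand-in for a running task
    echo $! | sudo tee "$CG/cgroup.procs"   # move it into the cgroup

    echo 1 | sudo tee "$CG/cgroup.freeze"   # freeze every process in the group
    # ... memory and filesystem state can now be captured consistently ...
    echo 0 | sudo tee "$CG/cgroup.freeze"   # thaw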

The Full Stack Advantage

Lorenzo highlights how Seqera’s ownership of the complete stack, from Platform UI to Wave to Fusion to Nextflow, enables innovations like Fusion Snapshots. Each layer is “machined” to fit perfectly with the others, similar to how Apple designs hardware and software together. A simple checkbox in the Platform UI triggers sophisticated kernel-level optimizations, all invisible to the user.

Contributing Back

Seqera actively contributes to the CRIU project, submitting patches for issues discovered while running bioinformatics workloads. Lorenzo created criu-static, a project providing statically linked CRIU binaries that work reliably across the diverse container environments bioinformaticians use. He also built staticreg, a web interface for browsing OCI container registries.

Looking Forward

Lorenzo shares his perspective on AI’s impact on software development, noting that the act of writing code is becoming a hobby rather than the primary activity. The important skill is thinking—providing the right context and understanding the problem. He encourages the community to try Fusion Snapshots and share feedback, even bug reports: “I look at the tickets just for the Fusion Snapshot bugs, and when they are not there, I’m a bit sad.”

Full transcript

Welcome and intro

Phil Ewels: Hello and welcome to the Nextflow Podcast. This is episode 54, going out in February 2026. Today we’ve got a special episode together with Lorenzo Fontana. He’s an engineer at Seqera with quite a special background of deep expertise in how low-level operations within Linux happen, and he’s one of the key people behind some of our core software at Seqera, namely Fusion and also the new Fusion Snapshots. Lorenzo works in a small group at Seqera with Paolo, who work on all the kind of crazy cutting-edge developments that we like to push out at Seqera. And he’s got a really interesting background and insights into the kinds of problems that happen with bioinformatics software and how to solve them.

In this episode, we’re gonna do a deep dive on how he got to that position, and we’re also gonna talk about Fusion and Fusion Snapshots and how Fusion Snapshots actually work. With no further ado, I’d like to welcome my guest, Lorenzo. Thanks for joining me on the podcast.

Lorenzo Fontana: Thanks for having me. I am excited to be with you.

Phil Ewels: So when did you join Seqera, and tell us a little bit about what you do.

Lorenzo Fontana: I keep on saying I recently joined, but it’s been a while. I joined more than a year and a half ago, and Seqera was very different. I was not coming from a biotech background.

Background in eBPF

Lorenzo Fontana: I’ve been working in Linux observability and security for a long time. I am the author of a book with O’Reilly called Linux Observability with BPF.

Phil told me that I shouldn’t take the acronyms for granted, because not everybody has a little book with all the acronyms to hand.

BPF stands for Berkeley Packet Filter. I’m a subject matter expert on that. It’s a technology in the Linux kernel that allows you to interact with the kernel, get information and do certain activities from a kernel point of view, but without writing a kernel module. Usually we write programs for our computers, and in Linux we write them in user space: we just compile and run them.

It’s very nice to be able to write them for the kernel itself, so that they run at that level. But it’s also very hard to do, because you are gonna break the machine if you do it wrong.

Phil Ewels: Very low level

Lorenzo Fontana: With BPF, you can do it without, exactly, without breaking it, within a virtual machine.

Phil Ewels: Where does the name come from? Berkeley Packet Filter.

Lorenzo Fontana: Berkeley Packet Filter, because initially this was developed for a very different reason. It was a generic Unix feature, developed for Unix, not for Linux, that would allow you to filter packets.

Think about tcpdump. tcpdump is a utility, back at the origin, that allows you to see the packets going through a network interface. The ability to make firewalls, filter packets and see what is going on in the network was one of the first use cases, one of the first needs that people had, because they wanted to be able to block packets and see what was going on.

So there was a syntax for users that was like: if the source IP is this and the source port is this, or the destination port is 80, block or allow, do this. And there was this language that was essentially an instruction set, very simple, accumulating on a variable and then taking a simple decision.

Now eBPF, which is extended BPF, is all in the Linux kernel, and it is a lot more complicated. The instruction set is bigger, more complete, more similar to a high-level instruction set, but still not as powerful, because it has to run in a controlled environment, which is the eBPF virtual machine that runs in the kernel.

So you can put your programs in a virtual machine inside the kernel. I hope I answered your question.

Phil Ewels: It’s basically a filtering level that filters all the data that’s going in and out of the kernel. So you can modify how base-level processes happen on the OS. Is that right?

Lorenzo Fontana: Initially it was called Berkeley Packet Filter because that was the capability. Now you can do things like, for example, attach to a function in the kernel so that every time it gets called, it gives you the variables.

I’ve met situations where there was something not allowing you to do something: oh, you don’t have permissions. You search it, and then you have to write this AppArmor rule, this Linux rule, just to bypass it, or disable it, you know.

All of this is there for a reason, because processes shouldn’t go wild. Now it’s even more important with AI. And there is the ability with BPF LSM to actually change how a process behaves and what it can do: block it, block certain executions. Nowadays you can pretty much do anything that you could do with a kernel module. It’s becoming more like writing a kernel module than what it was before.

Packet and filter are now just in the name. It’s because it originated from that idea and from that concept.

Phil Ewels: nice.

So you don’t have to do “sudo go make me a sandwich”, from xkcd.

Lorenzo Fontana: Yeah, no, it’s more like, I dunno, an AI agent: please disable this because it’s getting in the way of doing my research.

And I’ve always been trying, in my career, to make sure that users are not able to disable those things, because they are important to have, and users can be secure and do their work without having security in their way.

Becoming an expert in eBPF

Phil Ewels: How did you get to be an expert in this? It’s obviously pretty niche, pretty low level. What took you to this point in your life where you ended up being an expert on eBPF?

Lorenzo Fontana: That’s a question that I love, because I remember the epiphany for it. I had a colleague called Ed at InfluxData. He was, he is, a very talented engineer, able to do reasoning that is still unknown to me. And he was writing the storage engine for InfluxDB 2.0, and writing a storage engine for a database is one of the most rewarding but difficult tasks that you can do.

When you write a storage engine for a database, you have to take into account the fact that people are gonna run the database and use it against an actual storage medium, a disk.

I was an SRE then, a Site Reliability Engineer, for the book of acronyms. And this guy asked me: can you tell me, when the storage engine writes to disk, how big is the actual write to disk? Because when you write with a Linux syscall, you issue a write syscall and you are saying, I’m writing this chunk of data, I’m writing these 10 bytes. But how are those 10 bytes flushed from the VFS, new acronym, Virtual File System, which is the abstraction layer in the Linux kernel given to file systems like ext4 or JFS, how is that then translated into an actual physical write to the medium? This is important because, for example, disks have a fixed capacity and their throughput is limited.

With all this in mind, he just wanted to know: if I issue this write, what is happening? And I started scratching my head, because we were porting all the infrastructure to Kubernetes, to new mediums, because things were changing.

And there were no instruments. What do you do, write a kernel module to find this out? I started researching, and I had already used eBPF for other things even before. It was like, why don’t I give this a try? I started being loud about it and made a project for this specific reason. It’s called kubectl-trace, which then got donated to IO Visor, which is a group in the Linux Foundation.

If you go on GitHub, iovisor/kubectl-trace: kubectl-trace allows you to run BPF programs in your Kubernetes cluster. In particular, back then I used bpftrace, which is a way to write BPF programs with a DSL, so you can just write a small program without having to compile it, without having to create a loader; there are a lot of things that you have to do to make a BPF program work. So: just load this program on all your nodes in the cluster, and the program was, tell me this.

So I got excited. Probably for my colleague it was just a small request, who cares, just tell me. But for me it was an addendum to my own set of things. I like to call myself a deep generalist, because I tend to do many things. People know this at Seqera: I can write a macOS app or do something else in the meanwhile, then go back to a specific feature in the kernel. I tend to do this. And this was something I could dig deep into, so I started doing it and started being vocal about it.

O’Reilly Book

Lorenzo Fontana: At some point I had the opportunity to write a book with O’Reilly, and that changed things a lot, because back then there was not a lot written on eBPF. So I studied a lot: low-level networking and how things work in the kernel. And eBPF gave me the opportunity to study those low-level things, for a reason: eBPF touches every area of the kernel, because it can interact with all the other subsystems. In fact, it’s a subsystem itself, but it interacts with all of them. Whereas if you enter, I dunno, the subsystem for networking, you don’t touch the other things. So it’s perfect for someone like me who defines themselves in that way.

And so I got excited and started writing the book. The book had me sit down for hundreds of hours and study, because there was no AI. You could not just prompt and say, I want you to help me research, and let’s do this and let’s do that. I just had to go into the code, look and understand. And probably some things that I wrote or said, or am even saying right now, are still wrong, because you learn every day. But at least, when you do things, when you get the ball rolling, you start working on things.

Falco and linux kernel bugs

Lorenzo Fontana: I started working on Falco, which is quite popular. It’s now a graduated project in the CNCF, the Cloud Native Computing Foundation, which is a branch of the Linux Foundation that is doing cloud-native tools. And being a maintainer for a tool like that allows you to really be with the community and understand the users.

And I really liked it, because it was doing security for a community that is vibrating, that is creating solutions and giving you new problems. I remember one day I was at Lake Tahoe, at the house of the founder of Sysdig, which was the company that was employing me and that created Falco. And I was on an issue that said: Falco makes my kernels crash. I made the mistake of going on Twitter, because I just have no filter sometimes. I had understood why, so I created a reproducer, and I tweeted: oh, I am able to crash the Linux kernel with eBPF. And I started receiving DMs, direct messages, on Twitter from new accounts saying: what did you do? What strange things, like little threat actors. Two days later, one of the kernel maintainers sent me an email: can you please, the next time, just tell us? Oops.

And then we patched it and everything went the right way. I didn’t disclose it irresponsibly, I just said it; I didn’t say how, at least. I was pretty aware that you have to do it properly, but sometimes the excitement is just too much to stay silent, and I didn’t think about it. I was just excited that literally, boop, and the computer starts shutting down every single CPU until the last one, and then boom. It’s like: oh my God, so nice. It was the first time I was able to do something like this, and then I started doing it more and more, but on different software setups.

But it was an epiphany as well.

Move to Seqera

Phil Ewels: What, what took you from, these positions of working on eBPF and kind of low level kernel stuff? What, brought you to Seqera?

Lorenzo Fontana: Bernat, our hiring manager, is very good at his job. He reached out, and I immediately understood. I had spent a lot of time, and I’m happy that I did, creating value in security, creating security software, making applications run faster, et cetera, et cetera.

But the mission at Seqera of making drug discovery faster, making research faster, has a deeper meaning in a way. For me it’s a little bit personal: my dad died of cancer. So I had a little bit of a feeling of unfairness around that; it’s unfair that people can access medications in different ways, that not everyone has the same thing.

However, there is not much I can do in terms of actually making it happen for everyone. But the kind of customers that Seqera has, and the kind of situations that are around it, and the kind of people in the Nextflow community that are there, they can. They totally can.

So I said, why don’t I try to put my experience in making things faster, smoother, more secure, or even create new experiences for myself, into serving a purpose that is a little bit different than before. So I was like, let’s try this.

Phil Ewels: I’m glad you did!

Lorenzo Fontana: And I try, like everybody, different things. I don’t consider myself a static person. So I just try and see.

Phil Ewels: And you joined Seqera in the Fusion team working with Jordi, and I think I had Jordi on the podcast a while ago talking about Fusion; regular listeners might recognize that name. And then we’ve had a bit of a shuffle at Seqera, so Paolo was the CTO and now he’s chief architect. And you are part of a special group, nicknamed The Magicians Guild by Rob Newman, which I quite like.

Lorenzo Fontana: Yeah, we are a bunch of people that in general are very pragmatic. We like to get things done. We like to not float around stuff, we just like to do things. And Paolo is the most pragmatic of us. I look at him and I want to be like that, when I remove from my brain all the blockers that stop it from being more pragmatic and accomplishing more. Not because you destroy yourself or actually do more, but because you remove the frictions that your own brain has against using your time in the most productive way, I think.

So yeah, we are in this setup now, committed to doing experiments. In my opinion the world changed so much in the past few weeks, and Seqera planned for this already over the past few months, for different reasons, because we already saw it coming. And now we gotta do it.

Intro to Fusion

Phil Ewels: So give us an introduction to Fusion first off. That’s central to a lot of the work that we do at Seqera.

Lorenzo Fontana: We left that untouched for a while; wasn’t this podcast supposed to be about Fusion, not just about me?

So, Fusion is an interesting name, because there are so many things named Fusion. The first that comes to mind is the Dragon Ball reference. But Fusion is essentially a file system. In the Linux kernel there is a feature called FUSE, Filesystem in Userspace, because usually file systems are not in user space, they are in kernel space; they are written as kernel modules. Very consistent with my BPF story: I want to do things without breaking things.

And it’s not just a user-space file system that exposes the files that you have in different cloud providers: S3, GCS, Azure, whatever. We have a lot of integrations. It is also tailor-made to be aware of a Nextflow pipeline, of what happens, of how to optimize throughput for what the pipeline is doing and what Fusion thinks the pipeline will do.

Personally, I tend to use it for all my interactions with buckets, because it’s just annoying to go through the other tools.

And also because sometimes you see blog posts related to bioinformatics: oh, I need to delete 20 terabytes of data from S3; oh, do this script, do that; oh, if you do it from the UI it’s bad. Fusion does it the best way. Fusion has all the techniques to do things in the most efficient way, the least expensive way, all this. And for me it’s a no-brainer to just use it.

Phil Ewels: The way I think of Fusion is as being exactly that abstraction layer. You don’t have to think about the fact that you’re using cloud storage, or even multiple different clouds in one place. Fusion just exposes all of it in a way that I’m familiar with, which is really nice.

Lorenzo Fontana: Yeah, exposing object storage as a POSIX file system is, like, the theme of this century. Object stores are optimized to put your data on multiple machines, be redundant, be fast, be cheap, and not much more.

And a POSIX file system is just convenient to use, because you write programs, and you cannot write the integration into every program, you cannot update every program, or wire the credentials into every program.

Fusion is shipped via Wave. So you essentially augment whatever container you have in your pipeline with Fusion. It’s automatic.

I initially didn’t get it, because I’m more used to mounting the file system on the machine and not in the container. But I think it’s a great idea, because it binds the life cycle of that abstraction to the pipeline execution, which makes it possible to do optimizations that are taken to an extreme.

Even though sometimes you don’t see small, incremental optimizations as a value add, I think that in Fusion those are what make Fusion possible.

Those, and the fact that the team is very knowledgeable. I want to shout out Jordi, but also, importantly, Alberto, Alberto Miranda. Jordi was already on the podcast, but Alberto too. They are extremely knowledgeable on POSIX file systems, on what happens and why.

Not everyone is like that. One of the interview questions that I’ve heard people ask is: do you know what happens when you open a browser tab? All the network connections, all the small bits, the DNS query, who is replying first, who gives you the IP, what is the TCP handshake, all this stuff. It’s a nice question because you see how deep a person goes into all the levels. I think that if there is a depth you can go down to with POSIX and file systems, Alberto knows the deepest point. It’s, I dunno the name in English, but it’s like the point in the ocean which is the deepest.

Phil Ewels: Mariana Trench.

Lorenzo Fontana: Yeah, the Mariana Trench, Fossa delle Marianne in Italian. So it’s literally that.

Comparing Fusion to other solutions

Phil Ewels: One of the things I was gonna ask you: you talked about the FUSE file system, and we said how Fusion can make object storage appear as a POSIX file system. But Fusion isn’t the only tool that can do this. People might be familiar with Mountpoint from AWS, and I’m sure there are others. What’s the difference between Fusion and these other tools? Is it the thing about mounting Fusion within the container? Is it the Nextflow optimization? Is it all of this and more?

Lorenzo Fontana: I believe that, first of all, the easy one is that it’s integrated in the ecosystem. It’s a no-brainer to use: you just enable it and it works, which already for me is a winning point.

The second one is that the people making it are taking into deep consideration, more than anything else, more than even POSIX compliance, more than even building a good generic file system, the actual use cases for Nextflow, and not only Nextflow but the tools that people usually run in Nextflow. Other file systems are not that reactive to, I dunno, pigz having this behavior, or crisprseq not working, or things breaking in this specific way, because they are generalists; they are generic tools that you can use.

And usually purpose-built tools are a lot more performant. The downside of purpose-built tools is that they’re purpose-built. It’s a limitation when you want to use them for things that are not their domain, but we are focusing on our specific domain. Otherwise we would use something else.

We are not focused on making this perfect for backing up your Mac. We are focusing on speed of execution of a pipeline, on making sure that it doesn’t get in your way when you don’t have enough CPU or enough memory, when you run multiple tasks on the same computer, when you run with a GPU.

Fusion Snapshots

Lorenzo Fontana: And we are stressing this with Fusion so much that it ended up becoming the center of another feature that we developed, which is called Fusion Snapshots.

Fusion Snapshots, if you’re not familiar with it, allows you to take a task that is running on a machine and move it to a completely different machine in a transparent way.

Moving things requires a file system. With a normal S3 integration, whatever it is, like another mount point or another approach, how do you do it? Download, upload?

And one of the challenges of Fusion Snapshots is that we want to do this in the time window before your instance gets reclaimed by AWS. The way it works: say you are on the spot market, you did all your settings, maybe you want to spend a lot less than the on-demand price, and all of a sudden the machine gets reclaimed. It goes in a loop: gets reclaimed, gets reclaimed, gets reclaimed, and you end up spending more.

Phil Ewels: We talked about this a little bit in the last episode, and I was saying that when I started working with cloud computing, the spot market on AWS was very relaxed; there was plenty of capacity. So you just always put stuff on spot and didn’t really think about it, and stuff was just way cheaper.

But the capacity has been trimmed a lot more in recent years. And now if you run a big pipeline with spot on AWS, it’s pretty likely that you’re gonna hit some spot reclamations.

Without any snapshot stuff, if you’re using any of these other systems, like a regular S3 mount point or whatever, the task gets killed and Nextflow will retry it, but it just starts again from scratch.

And if you are running an eight-hour process, you lose a lot of time and a lot of money.

Lorenzo Fontana: And this stuff is only possible with,

Phil Ewels: With snapshots.

How Fusion Snapshots moves tasks

Phil Ewels: And so tell us a little bit more about, when you say moving a task from one instance to another. Like how do you, what do you mean there? What does that look like?

Lorenzo Fontana: It’s very nice, because the underlying stuff wasn’t built for this. Paolo had the intuition, years ago, of putting everything you have in a task in a container. And containers are the perfect abstraction, because you essentially have a bunch of processes that are bound by the boundaries of the container, and they’re together. They are only dependent on each other most of the time. So they don’t really interact with anything but the processes in the container, the files in the container; maybe they make some connection to the outside to download or upload something. But in a research pipeline, if you are researching something, you’re usually transforming data or acting on data. It’s usually self-contained.

In functional programming we would call it pure. It’s pure.

So something is pure and contained, because of the intuition of Wave, and the file system that you have is already available on other machines, because with Fusion, if you have a file, it’s available: you just stop Fusion, restart it on another machine, go to the same folder, it’s there.

So the intuition was: what if, when the task goes away, we just reuse the work dir on another EC2 instance?

And what happens is that if you just reuse it, then you have all the files, but you don’t have the process state. You don’t have the memory state, you don’t have the file descriptors that are open, you don’t have the network connections that are made. You don’t have the kernel state, let’s call it like this, of everything in this container.

There is a technology built around the Linux kernel called CRIU, Checkpoint/Restore In Userspace, for the acronym book. And CRIU does essentially this.

But CRIU is not batteries included. So again, as with other things, we made our own version around it that is perfect for Nextflow pipelines, for tasks running in Nextflow pipelines.

Hopefully we also have future plans for it, and we are gonna execute them and experiment with them. I’m a very big proponent of: let’s try experiments, let’s do these things. And I think that it’s very aligned with the view of the world that exists now, and also with the audience that Nextflow has, with the kind of users that Nextflow has, because what are people doing in Nextflow? Experiments, in the end. We don’t know the answer. And as those experiments evolve, Nextflow and Seqera Platform and the tools that the company Seqera provides need to change to go towards new ways. Now with AI, now with whatever will come in the future.

Fusion Snapshots is an innovation, in my opinion. We will see how much more of an innovation it becomes. I have a lot of ideas; let’s see if we can make them happen.

Phil Ewels: I really like the acronym CRIU, because it reminds me of cryogenically freezing the container. It’s like flash freezing with liquid nitrogen or something, as you might do in the lab, and then resuscitating it in another place or another time.

Lorenzo Fontana: It’s a very good observation. I think that it’s also true, but I don’t remember.

Contributions back to CRIU

Lorenzo Fontana: What I can tell you is that I think the people working on it are very sharp on the details, and they’re very compatible with how I think someone should do things. So I was excited to work with them.

And when we had things, we proposed patches to CRIU from Seqera. We found bugs, we fixed them, we have been fixing them. And they have been extremely reactive and proactive.

For example, we sent a patch for an issue that we had. I think there was some problem with snapshotting some parts of an rnaseq pipeline with Fusion Snapshots, and it was coming in as a ticket every day. And the only solution was to fix it in CRIU.

As everyone does, I tried to Claude Code it, but it didn’t happen, because it’s almost impossible to provide the right context to the AI for a bug like that. Not because it’s not capable, quite the opposite, but it was impossible to prompt it correctly.

I ended up writing a patch that solved it but was not getting to the root cause. They merged it, it was fixed. And usually in projects like that, they’re all humans, we move on. Instead they started rewriting it, made a better version and solved the problem in a better way, in more areas. I was quite excited and happy to see that they take these things so seriously. It’s a very compatible project, I would say, with the way we envision making great software.

Phil Ewels: Yeah, I was gonna say, I mean we’re, both obviously enthusiasts for open source software with providing open source software to a community is how I usually see it happen. But it, is nice to hear that we’re also contributing back to other projects as well.

Lorenzo Fontana: I agree, more than just agreement. I always try to do that. Sometimes it’s harder than other times, because we also all have to live a life.

criu-static

Phil Ewels: Yeah, and you have several open source projects that you’ve released from Seqera, right? Things you’ve built at Seqera which we’ve released as OSS.

Lorenzo Fontana: Yeah. The one that I’m most proud of, maybe because it went unnoticed, is criu-static, which comes from the pain point of being able to ship CRIU in a reliable way.

Let me put it this way: I like all of computing. And I like a lot the model of macOS, and also of Windows in a way, where everything gets shipped with your software. You just package it and it works.

You don’t need to install the library or have a shared one. The whole concept of shared libraries for me is a bad idea now, in a way. It solves a problem: when there’s a vulnerability, on a certain operating system like Debian it’s incredibly good, because they have a very good security team that is always on top of the software that you ship; maybe yours is not.

I like that: the baseline operating system has this, and the package manager has this.

But from our point of view of shipping software that works for our users, it was impossible, in a way, to ship CRIU with all the system libraries perfectly linked in a pipeline, with all the crazy things that our users do.

One thing that I learned is that bioinformaticians are the power users of the power users of UNIX tools. I’ve been programming in C for years, a lot of years at least, and I thought that I wrote interesting C code.

Then you look at the tools that others write. You look at the rnaseq pipeline, you look at the tools that are in there, and it’s like, what is this? How did someone have the idea of doing this sequence of instructions and writing this code? Why?

And you learn that it is because they are amongst the most persistent people. They have research to do, research that has to be completed, and agency and intelligence, all in the same person. And then they push it open source, and your mind is blown.

In this sense, I want all this craziness, which can have different requirements, to always work. I want that.

I don’t want that when you run Fusion Snapshots and CRIU, it doesn’t work because you used something wild as a base container. And I also don’t want to tell you: oh, you have to install this, because otherwise Fusion Snapshots doesn’t work.

So we made criu-static, sent it to the mailing list (they still exist!), and it got some excitement, so I’m very proud of it.

staticreg

Lorenzo Fontana: There was also staticreg. I like the word static for some reason, and not on purpose.

When you are in a situation where you have to navigate all the craziness that comes out of running so many pipelines, and you build many different containers for them, et cetera, you want to be able to inspect them a little bit. Wave does that, but there was no way to actually navigate it via a UI. So we made this open source project that allows you to visualize a container registry, backed by Wave in this case, but any OCI container registry, and see what is inside.

I believe that open source is very powerful, and I’m happy that there is alignment on this in this company. I’m somehow also happy that Fusion Snapshots is not open source, because I put really a lot of hacks into it. I’m joking.

Fusion Snapshots broken down

Lorenzo Fontana: But Fusion Snapshots is extremely sophisticated if you use it standalone. I don’t think any person listening to this is using Fusion Snapshots standalone; they’re just using it in pipelines. If you enable Fusion Snapshots in a pipeline, go for it. Go on Platform.

Phil Ewels: It’s a tick box, right?

Lorenzo Fontana: Enable Fusion Snapshots, then get the Wave container that comes with it and docker run it, on your laptop, right now if you’re listening. And then you will have the Fusion Snapshots binary.

Run a program with it, Fusion Snapshots and top. And then send a SIGHUP signal to it with kill: kill -HUP and the process id.

You will see that the process completely freezes and creates files. Now download those files onto your computer. No, forget it: onto another computer. And now use Fusion Snapshots against that folder, by setting the work folder path as an environment variable. And now you have Fusion Snapshots.

Magically the program appears exactly as it was before. Open vi and write hello, then move the files. Start an instance on AWS, put the files there and do this. You will see your program. It’s mind blowing.

To make this happen there was a lot of suffering and pain, but it’s mind blowing.

The thing that I brought into it was my knowledge of namespaces, of Linux, of containers, of CRIU, of everything.

The idea was originally from Jordi. He was extremely supportive in decluttering my thoughts so that it came out as the most lean, the most lightweight version. So I’m very proud of it.

Fusion Snapshot spot reclamation

Phil Ewels: And practically speaking, that walkthrough is really interesting to show, because that’s basically what’s happening under the hood, right? You tick the box in Seqera when you create your compute environment to enable Fusion, to enable Fusion Snapshots, and then the task is running on, it could be AWS; it works on other clouds as well, right? Or it will do soon?

Lorenzo Fontana: Yeah, we’re working, yeah, on GCP.

Phil Ewels: GCP.

Lorenzo Fontana: planning on Azure.

Phil Ewels: Azure. And so the task is running on the cloud, the spot instance reclamation process happens, AWS sends a notification that this instance is gonna be closed, and then Fusion Snapshots does its magic and freezes everything down. And then how does the system know when to start it off again? Is it just that the scheduler makes a new instance available again and it picks it back up?

Lorenzo Fontana: This was already there, if you think about it. What happens when you set maxSpotAttempts is that Nextflow will create a new attempt of the same task in the same workdir.

What Fusion Snapshots does is that it essentially becomes the command run of that task: it executes the same command, but before executing it, it intercepts, checks if there was already an execution, and just resumes it.
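
For readers following along, the control flow Lorenzo describes can be sketched roughly like this. This is not Seqera’s implementation, just a conceptual shell sketch: the snapshot directory name is hypothetical, and .command.sh stands for the usual Nextflow task script in the work directory.

    SNAP_DIR=".snapshot"        # hypothetical snapshot location inside the task work dir

    if [ -d "$SNAP_DIR" ]; then
        # Retried attempt in the same work dir: restore the frozen task state
        # instead of starting the command from scratch.
        sudo criu restore --images-dir "$SNAP_DIR" --shell-job --restore-detached
    else
        # First attempt: run the task command as usual.
        bash .command.sh
    fi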

Yeah, I was mind blown that it looked like what we built over the past few years was designed with exactly that in mind.

In fact, the first version of Fusion Snapshots did not require, and I think it’s still like that, any change to any other component. We just enrich the container: when you enable Fusion, we add Fusion to your container by injecting the layer via Wave.

Wave has this ability to build your Dockerfiles, but it also allows you, for example, when you go and pick conda packages, to create the layers for you. It packages them so that when your computer or your Docker daemon downloads the definition, the manifest, it’ll see: oh, these are the layers. Which is essentially what container runtimes do: they build a set of layers so that you can download them and then pack them together. Wave is even better because it allows you to set as a layer whatever is on a bucket, on a URL, et cetera.

So the task downloads those layers we enriched with Fusion and Fusion Snapshots, naturally, and it just works.

And then when it’s done, in the end, we have cleanup functions to remove everything.

Incremental dumps

Lorenzo Fontana: But recently we released the incremental dumps, because the problem was that two minutes are not enough. Remember, I dunno what kind of research your audience is doing, but we have certain users that say: I want to be able to run a task that uses 200 gigs of memory.

The kind of users that use Seqera Platform and Nextflow are, again, crazy people who like to do crazy things, because they have to solve crazy problems that nobody else has.

Using 200 gigs of memory, you can have the best machine you want, you can have the best network that you want, but moving this mastodontic thing to another machine is already a challenge. Moving it in under two minutes is a bigger challenge. So we are leveraging a feature in CRIU that allows you to take dumps that are incremental in time.

We created our own, even more sophisticated system on top of it that is able to detect when it’s the right moment. Usually a timer is enough: let’s say do it every five seconds, every ten seconds, every ten minutes; yes, you can do it, and we do it with a timer now. But it also has to detect whether, in the context of the Nextflow pipeline and the task, it is still the right moment to do it.

What is happening? Can it be done? Because one thing is doing it at the end, when the machine is going away and you have no alternative. Another thing is doing it when there was no urgent need, when for the user this is still a very good moment to not have any disruption in the pipeline. And we do this because we want to be able to capture the memory state and write it to the file system long before it starts becoming a problem, so that we already have a view.

In fact, again, coming back to doing things on your terminal if you can: if you still have that container that we mentioned before, run it with one of your task executions and you will find that there is also another binary, the Fusion Snapshots utilities, that contains a command line tool giving you an HTML view, an actual graphical representation of your snapshots. So you see a graph with when each was taken, how many incremental dumps were taken, how much time they took. It’s extremely useful for troubleshooting why it’s maybe not finishing in time.

And this is still something that we are, not working on, but keeping our ears open about, because again, you all out there, it’s not like you’re all running websites, or running inference for AI models. You are doing everything.

Fusion snapshot process timing

Phil Ewels: One of the things that really blows my mind with Fusion Snapshots, that you managed to make this work, is how important it is to get the timing just right. Because you need to be able to take that snapshot and then not let the process run anymore, so that when it restarts, the files on disk are perfectly in sync with the running process.

Because you might have a read aligner like BWA or something, and that’s aligning and writing probably millions of reads to the disk per second. How do you ensure that the timing is just right, so that you don’t duplicate any of those alignments, you don’t have the same thing run twice, and you don’t miss any?

Lorenzo Fontana: The process, I will say, is very simple: the process gets frozen. There is a feature in the kernel called the cgroup freezer. The process gets completely frozen; it doesn’t do anything anymore, and it’s still there. This is done by CRIU.

It’s still a challenge for us for snapshots, because CRIU starts from generic assumptions, while we, I will say it again, run very intense workloads that can do whatever they want.

Even when you look at the documentation for CRIU and tools like that, the kind of user they have in mind is someone running an application that they know, taking snapshots when they want. More recently it’s been used a lot for AI inference, to get the state of AI inference ready faster, which is a completely different use from ours.

They use it so that you find the state ready and don’t waste 20 minutes waiting for it again and redoing the same things over. All these features that they have, we are adapting them to work perfectly for this.

And also, this would not have been possible without Fusion, because, luckily, we already had, again, other things that seem to have happened for a reason: we already built a file system and spent years and years optimizing it, making it perfect, and all of a sudden it’s perfect for this use case. Without it, it would have been a lot more difficult to do the timing correctly, because if in two minutes you have to dump everything to disk and then upload it, it’s not enough time. Fusion is optimized in the sense that while you start writing to disk, it gets uploaded directly. And we are even thinking about completely bypassing the disk and having a pipe that goes directly from memory to S3; this is one of the next things.

Seqera full stack

Phil Ewels: What I really like about the direction this is going in is the fact that with the tooling we’re building at Seqera, we own the full stack. You see that in these kinds of conversations, where you have Fusion Snapshots, which is built on Fusion, which is built on Wave, which is built on Nextflow.

And you go all the way from Seqera Platform, which is where the person actually interacts with it and ticks a little box on a web page, which is the simplest thing in the world. And yet that has this effect all the way down, right down to almost the kernel level, to the absolute process stuff.

And I think that’s so cool: that we can do that, and that we have the breadth of technical expertise and the breadth of product to implement these things.

Lorenzo Fontana: Yeah, Seqera did a great job at getting together people from completely different backgrounds and helping them work together.

Phil Ewels: You mentioned Apple Macs a few times, and it reminds me a little bit of how they approach things: the MacBook body is machined out of aluminium to be exactly the right shape for the components. It feels like the same thing; we’re machining our own computing components to fit around the shape of Nextflow and to make it work perfectly.

It’s hopefully clear to everyone listening that this is a pretty amazing thing that you’ve built. I’m in awe of it. I don’t understand it, and I don’t think I ever will, but I’m very happy that you’ve built it and that it’s going to make running on cloud faster and cheaper, probably significantly so, assuming that the spot market continues to go in this direction of being very frequently interrupted.

Call to action

Lorenzo Fontana: The thing that we missed is that we didn’t ask everyone to give it a try, because the way things generally work and evolve is when people actually go towards them, test them, criticize them, and tell you what they see of value and what they don’t like.

Also, Fusion Snapshots doesn’t collect enough telemetry on how you’re using it. So ideally, I’d love to see discussions, conversations around it in the Nextflow Slack, and if you want to include me, tag me @Lorenzo.

Because it will give me a way to know. We have pretty positive feedback from the customers that we have spoken to, but I want to hear from the bioinformatician in person. I want to come to the next Nextflow Summit and work with people who tell me: oh Lorenzo, I tried it and I didn’t like it because it did this thing, or I liked it because it did this other thing.

pigz

Lorenzo Fontana: But one of the things that makes me cry is the usage of pigz. pigz is my enemy, just because it does so many things at the instruction level. There was one problem, let me tell you this, where pigz optimizes for a certain instruction set on x86. Then the workload gets moved to a better machine and it gets moved and everything is fine, it works.

But if AWS Batch then decides to move the work again, to a worse machine, the snapshot process works and everything is fine, but pigz doesn’t want to start, because in the meanwhile it decided that it has to use an instruction set that isn’t there. Which is not pigz’s fault: pigz didn’t know you were moving it around different machines. But I literally had blood on my head because I kept banging it to solve this.

And usually all those specific, arcane problems that don’t let me sleep are all from pigz. Not pigz’s fault, because it’s me that’s doing something different. But it’s always pigz: if someone opens a ticket and says, oh, this app doesn’t work, then unless it’s the usual case of not having enough time, so an optimization issue, it’s pigz.

Phil Ewels: That’s because pigz is optimized at such a low level for the different systems it could run on.

Lorenzo Fontana: Yeah, totally. It’s just trying to pick the best instructions to reach the result. From an engineering point of view I think: oh, it’s just well engineered and uses the compiler in the right way. If you think about it, it makes sense: you want your files to be compressed, and compressed in the best way possible on this machine.

Phil Ewels: Yeah,

Lorenzo Fontana: You don’t want a generic solution; otherwise you would use a generic compression utility, one that is less optimized. Compression is a nightmare. tar, zip, they never gave me a problem. pigz, though.

How to try Fusion Snapshots

Phil Ewels: So if someone wants to try it out, I guess we’ve just launched Seqera Compute with a hundred free credits, so the easiest way is to go and sign up, get that and spin up an environment on the cloud via Seqera Compute, and in a couple of clicks, tick Fusion Snapshots and try running your pipelines. It’s as simple as that.

Lorenzo Fontana: Yeah, though ideally there should be fewer buttons to click, always. I like this approach; I got it from our colleague Michael Tansini. He keeps saying: less buttons, less buttons. I pretty much like that, less buttons. But now there is a button for it. So if you click it and you tell us what you think, it’ll be amazing.

Phil Ewels: Very cool.

AI usage in the future

Phil Ewels: Right, there’s one more topic I wanted to touch on, because I think it’s interesting. I left it to the end because it’s nice to think a little bit about conclusions: this is where we are now, this is the state of the art, but as you said, things are moving so fast. With one eye to the future, knowing also that you and I are part of a secret group within Seqera called Claude Power Users, which I don’t think many people know exists, you’re obviously a power user of AI, of code generation tools, and very deep into that kind of world. Where do you see things going, broadly speaking? How do you see things changing in the future, and where does Fusion, where does Nextflow, where does Seqera fit into this picture?

Lorenzo Fontana: Today the thing that is most important is to give the right context. It always was, but now it’s the most important thing, even more than before, because I see it’s going in a direction where we are not going to look at what the AI produced.

So the context needs to be carefully crafted. It can be hand crafted, but also the tools that you use have to provide the right context. So providing the right context to AI tools, and giving the ability to iterate quickly on experiments, will be paramount; in my opinion that is one focus that Seqera will have.

And coming back to cloud and to usage, I think that it’s finally the moment to create new things without being blocked by ourselves.

So now we are relieved of that, and we can give all our energies to testing ideas and doing them properly. Now is a moment when you can go quick, and correct, and right. Maybe technologically we are not quite there yet, there is still something missing, but we need to be prepared for when it will be.

I don’t think that moment is in two years or in six months; I think it’s likely this quarter. I don’t like to make predictions, but I think the sooner your mindset changes, the sooner you get the personality for it.

I always make this example, I make it to everybody I know. Nobody here on this podcast knows me, but if you go and search my name on YouTube, you’re gonna see I was like 130 kilos. I was different. And I always told myself: oh no, I have to do a diet. That’s the wrong thing to tell yourself. Being on a diet is not a goal. The goal is to be healthy. The goal is to start doing it every day, to have the personality of someone who is healthy, so that being unhealthy becomes against your personality.

I think that technology is the same thing. If you set a habit for yourself to use AI in a certain way, it’s not: oh, this AI gave me a pipeline that sequences this DNA, and it doesn’t work. You have to put your brain, your knowledge, into it.

But I think that how you get to that result is gonna be a lot different soon. I think that what happened to software engineers will spread like oil to all the other fields where thinking is involved, because thinking is becoming the most important thing, not typing.

Phil Ewels: Typing. I like your diet analogy. It made me want to quote The Matrix, the film: you’ve got the kid with the spoon, and it’s not the spoon that bends, it’s yourself. I like that.

Lorenzo Fontana: Yeah.

Phil Ewels: Yeah. The diet is not the objective; it’s losing weight, or being healthy, in the long run. And with AI, the objective is not to write one piece of code or one pipeline. It’s to change how you work.

Lorenzo Fontana: And I think that with AI we are gonna be able to define a new identity. So maybe you feel that some of your identity is going away, because you cannot define yourself the same way.

I still define myself as a programmer, as a coder, as a developer. But the act of writing code is now like a hobby; it’s not the thing that I do most of my time in the day anymore.

And I realize that it has freed a lot of time for thinking about actual problems, actual things, and multiplied my impact, because I don’t have to willingly spend effort detailing every single piece of the machine, which was also exhausting. I don’t know how many nights I didn’t sleep, just for no reason.

You go to sleep at 6:00 AM, you wake up at 7:00 AM to bring the kids to school. And what did you do? Configured that framework for that project that I have in mind. The next day you’re exhausted. The project dies.

Wrap up

Phil Ewels: That was a more philosophical note to end on than we usually get to on the podcast, but I enjoyed it. It’s nice to frame how we think about these things. And it’s exciting times to live in.

Lorenzo, thank you for joining me today. Thank you for,

Lorenzo Fontana: Thank you for being awesome, Phil.

Phil Ewels: It’s always a pleasure to chat. I never quite know what we’re gonna end up talking about, but that’s what I like about chatting to you. So it’s always an exciting experience.

Lorenzo Fontana: Same.

Phil Ewels: Everyone who’s listened to this podcast, thanks very much. Give Fusion Snapshots a try and tell us what you think. Even if you’re submitting a bug report, you’ll still make Lorenzo happy.

Lorenzo Fontana: I look at the tickets just for the Fusion Snapshot bugs, and when they are not there, I’m a bit sad.

Phil Ewels: Exactly.

Lorenzo Fontana: It’s like the heartbeat, no?

Phil Ewels: Exactly. Yeah.

Phil Ewels: Alright, thanks very much, Lorenzo. Thanks to everyone who’s listening, and see you all soon.

Lorenzo Fontana: I thank you.