
Pipeline chaining, meta pipelines and automation (Part 2)

In Episode 57 of the Nextflow podcast, Phil Ewels, Ben Sherman, and Edmund Miller continue their exploration of pipeline chaining and meta pipelines.

This is Part 2 of a two-part series. In Part 1 (Episode 56), we looked at existing solutions for automating Nextflow, chaining pipelines, and building meta pipelines. This episode focuses on the Nextflow language itself: recent changes and upcoming improvements that should make pipeline composition much easier.

Key Topics

Workflow Inputs:

  • New params block in Nextflow 25.10 for precise parameter type definitions
  • Streamlining sample sheet parsing with record types
  • Moving params to the entry workflow level for better modularity

Workflow Outputs:

  • The new publish section replacing publishDir
  • Generating JSON output manifests with full metadata
  • Using output channels to create structured, API-like pipeline interfaces

Pipeline Composition:

  • How typed inputs/outputs enable pipeline chaining without glue code
  • The role of the nextflow_schema.json in external tooling
  • Future possibilities: piping Nextflow commands on the command line

Nextflow Lineage:

  • Using lineage IDs to reference files across pipeline runs
  • Moving toward data-centric rather than pipeline-centric workflows


Summary

Defining Pipeline Interfaces

Ben explains that if you want to compose pipelines, you need clear inputs and outputs. Params already work reasonably well for inputs: a list of typed parameters that map nicely to JSON payloads and API calls.

Nextflow 25.10 made params more precise. Instead of just declaring parameter names and maybe a default value, you can now specify types directly in the script.
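As a rough sketch, a typed params block might look something like this (the parameter names and defaults here are invented for illustration; check the Nextflow documentation for the exact syntax):

```nextflow
// Sketch of a typed params block (Nextflow 25.10 strict syntax).
params {
    // Path to the input sample sheet
    input: Path

    // Directory for published outputs
    outdir: String = 'results'

    // Whether to skip the QC steps
    skip_qc: Boolean = false
}
```

Declaring types in the script gives Nextflow enough information to validate inputs up front, rather than failing mid-run when a string arrives where a number was expected.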

Streamlining Sample Sheet Parsing

Ben describes a future where sample sheet handling gets much simpler. Right now you have to provide a sample sheet file path, parse the CSV in your pipeline code, split by row, and convert to tuples or maps. With record types, you’d just define a param as a list of records with column names and types. Nextflow would automatically parse any incoming CSV, JSON, or YAML based on that type definition.

This means pipelines wouldn’t need to care about the file format. CSV, JSON, database query result, whatever. As long as Nextflow understands the data source and has the record type definition, it handles the parsing for you.
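Record types have not shipped yet (they are targeted for 26.04), so any syntax is speculative, but the idea Ben describes might look roughly like this:

```nextflow
// Hypothetical sketch only: record-type syntax is still in development,
// so the keywords and type names below are illustrative, not final.
record Sample {
    id: String       // sample identifier
    fastq_1: Path    // first read file
    fastq_2: Path    // second read file
}

params {
    // Given this declaration, Nextflow could parse an incoming CSV,
    // JSON, or YAML source into a list of Sample records automatically.
    input: List<Sample>
}

workflow {
    // The parsed records would arrive ready to use as a channel
    channel.fromList(params.input).view()
}
```

The key point is that the parsing logic moves out of the pipeline code and into the type definition.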

Params at the Top Level

We’ve also become more strict about where params can be used. In DSL1, params were global variables accessible everywhere. With modularization in DSL2, this pattern became problematic: a sub-workflow referencing params.foo just assumes that param exists in any pipeline that imports it.

The recommended approach now is to use params only in the entry workflow, passing them explicitly as inputs to sub-workflows. Record types will make this more practical by letting sub-workflows accept a single record containing all their configuration, rather than dozens of individual parameters.
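A hypothetical sketch of that pattern (record syntax is not yet released, so this is illustrative; the workflow and field names are invented):

```nextflow
// Hypothetical: a sub-workflow takes its configuration as one record
// instead of referencing params.* internally.
record RnaseqParams {
    aligner: String = 'star'
    skip_qc: Boolean = false
}

workflow RNASEQ {
    take:
    samples   // channel of input samples
    opts      // an RnaseqParams record, passed in explicitly

    main:
    // ... workflow logic reads opts.aligner, opts.skip_qc ...
}

// The entry workflow is the only place that touches params directly
workflow {
    RNASEQ(channel.fromList(params.input), params.rnaseq)
}
```

With defaults defined on the record, a pipeline importing RNASEQ only needs to override the handful of settings it cares about.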

Workflow Outputs Revolution

The current publishDir approach has issues: publish directives are scattered throughout the codebase, there’s no overall view of what a pipeline actually produces, and no way to publish metadata alongside files. You end up with a directory tree of files and no structured index.

The new workflow outputs feature (out of preview in 25.10) changes this. Instead of publishing from processes, you propagate output channels back to your entry workflow. A new publish section in the entry workflow, paired with an output block, lets you:

  1. Define how files route to specific paths
  2. Generate index files (sample sheets) that serialize channel contents as CSV, JSON, or YAML

The end result: “Nextflow pipeline as an API call”. JSON params in, JSON manifest out, with all the metadata preserved.
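A minimal sketch of the new syntax (process names and paths here are invented; see the Nextflow docs for the full feature):

```nextflow
// Workflow outputs, stable in Nextflow 25.10.
workflow {
    main:
    // ALIGN is a placeholder process emitting (meta, bam) tuples
    samples_ch = ALIGN(channel.fromPath(params.input))

    publish:
    samples = samples_ch     // publish a channel, not individual files
}

output {
    samples {
        path 'alignments'        // route files under <outdir>/alignments
        index {
            path 'samples.json'  // serialize the channel as an index file
        }
    }
}
```

The index file is what turns a directory tree into a structured, machine-readable manifest.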

Enabling Pipeline Chaining

With structured JSON outputs, chaining pipelines gets much easier. Instead of guessing file paths or manually constructing sample sheets, you can pass the output JSON directly to the next pipeline. The metadata is preserved, so it’s easy to pick out exactly what you need.

Phil points out that this creates a nice symmetry: if pipelines accept JSON record inputs and produce JSON outputs, and if record types ignore unneeded fields, pipelines that use similar naming could interface directly with minimal glue code.
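To make that concrete, a hypothetical output manifest from Pipeline A might look like this (all field names and paths invented for the example). If Pipeline B declared a record input with matching field names, and record types ignored the extra fields, it could consume this manifest directly:

```json
{
  "samples": [
    { "id": "sample_1", "bam": "s3://results/alignments/sample_1.bam" },
    { "id": "sample_2", "bam": "s3://results/alignments/sample_2.bam" }
  ]
}
```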

The Schema Connection

The nextflow_schema.json serves as a way for external tools (like Seqera Platform) to reason about pipelines without running them. With the new params and output blocks, much of this schema can be generated automatically from the code.

For pipeline chaining in Platform, schemas could enable early validation when connecting pipelines. Platform could compare the output schema of Pipeline A with the input schema of Pipeline B, identifying mismatches and suggesting fixes before anything runs.

Piping Pipelines

Phil asks: with JSON in and JSON out, could you just pipe Nextflow commands on the command line? nextflow run A | nextflow run B?

Ben confirms he’s prototyped this. It requires Nextflow to print outputs as JSON and accept params via stdin. For two-step chains it works; three or more steps need some stdout/stdin buffering work. Maybe not revolutionary, but it would be satisfying to literally pipe pipelines together.
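In shell terms, the prototype amounts to something like the following. This is not released functionality: it assumes Nextflow could print its output manifest as JSON on stdout and accept params JSON on stdin (only `-params-file` is an existing flag; the pipeline names are placeholders):

```shell
# Hypothetical two-step chain: pipeline A's JSON manifest becomes
# pipeline B's params, with no intermediate sample sheet files.
nextflow run alignment-pipeline -params-file params.json \
  | nextflow run variant-calling-pipeline
```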

Lineage Integration

Nextflow’s lineage system records every task and file produced, each with a unique ID (lid://hash). You could have sample sheets with lineage paths instead of file paths, and Nextflow would resolve them automatically.
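As a sketch, a sample sheet could then reference lineage IDs rather than physical file paths (the hashes below are invented, and lineage is still a preview feature, so the exact URI shape may differ):

```csv
id,fastq_1,fastq_2
sample_1,lid://0d1d22c1/fastq_1,lid://0d1d22c1/fastq_2
sample_2,lid://8a4f9e02/fastq_1,lid://8a4f9e02/fastq_2
```

Nextflow would resolve each `lid://` reference to the actual file produced by an earlier run, wherever it lives.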

This lets you shift from “pipeline-centric” to “data-centric” thinking. Instead of manually running pipelines in batches, you could have data catalogs where pipelines automatically run when new data arrives, with lineage tracking everything that happened along the way.

Looking Forward

Ben is excited about how these features are converging. For very large pipelines, he recommends waiting for record types in 26.04 before migrating to workflow outputs, since they significantly reduce verbosity. But the pieces are coming together for a future where pipeline composition is natural and clean.

Edmund sees this enabling truly bespoke workflows: instead of monolithic pipelines, users could pull individual pieces and compose them for their specific needs, which was always part of the nf-core vision.

Full transcript

Welcome

Phil Ewels: Hello and welcome to the Nextflow podcast. You are listening to episode 57, going out in March 2026, and this is part two of a two-part podcast episode where we’re talking all about pipeline chaining and meta pipelines.

In Episode 56, we talked about existing solutions, workarounds that people have built, and systems that people have integrated Nextflow into, to automate pipeline launches, to chain pipelines, and to build meta pipelines.

In this second part, we’re gonna talk more about the Nextflow language itself. Changes we’ve made recently and how they affect pipeline chaining and meta pipelines, and we’re gonna look forward to the future about the things we can expect to come together and hopefully unlock some of these things which have been difficult. We also talk a little bit about what we’ve attempted in the past and why it’s been difficult and why it hasn’t worked.

Joining me today, we’ve got Edmund Miller and Ben Sherman, two regulars on the Nextflow podcast, and together we’ll really dive into the guts of Nextflow pipelines, the inputs, the outputs, how we define these, how we import Nextflow syntax, and so on.

So without further ado, let’s dig in.

Nextflow language

Phil Ewels: So we’ve said that it’s basically a bit awkward at the moment to do meta workflows, to do pipeline chaining.

Both because of the complexity of bringing in everything that a workflow needs when it’s imported, you know, you’ve got the config and everything, and also matching up inputs and outputs. I gave that example in Node-RED where I’m just guessing output file paths based on what I know to be the standard thing that’s saved. But if you have dynamic file paths or anything, it breaks down straight away.

So what, what are we doing about it? How do we make this situation better?

Ben Sherman: So we’ve showcased some of the, you know, pragmatic ways that we’ve tried to make these things work, and, and where they fall short.

Inputs and outputs

Ben Sherman: In my view, one of the main things we need to do within the Nextflow language itself is to provide a way to define the inputs and outputs of a workflow in a very clear way. This is a fundamental thing you need if you’re going to compose things: you need to have a concept of, I have a block and it has these inputs and these outputs, and I can hook them up to each other, and I, or the system I’m using, can validate that when I hook two things up together, they’re compatible.

And if they’re not compatible, that I have an easy way to put in an adapter, right. And make, make, make it just work.

And so with Nextflow, we already have the concept of params. The params are very straightforward, right? It’s just a list of params. Every param has like a type and a value, and you just pass those in. You can even supply a params file that’s like a JSON or a YAML, right? And that maps very nicely to like making an API call with a JSON payload, right? So the inputs side is pretty well set up.

There’s a little bit of extra work, this is part of what we’ve done in Nextflow 25.10 around being able to define those params more precisely. So instead of just declaring them like the name and maybe giving a default value, actually being able to declare the type of each param, in the script.

That just sort of gives you a more precise specification of, of what the expected inputs are. Like being able to tell whether an input is a number or a string or a boolean.

Another thing that, I would like to do is even provide some convenience around providing sample sheet params. Because right now, what you have to do is you provide the sample sheet as a file path, and then in your pipeline code you have to parse the CSV file, split it up by row, and then convert it into, you know, tuples or maps, you know, records, whatever, and then into a channel.

Streamlining sample sheet parsing

Ben Sherman: I think we could actually streamline a lot of that, by basically you just defining a param that say is like a list of records. And you have this type definition of I have a list of records and I have some special record type that allows me to specify like, okay, these are the expected column names and their types. And then when Nextflow gets a CSV file coming in, instead of you having to do all that CSV parsing in your pipeline code, Nextflow just looks at your record type, looks at the CSV file coming in and just loads in everything automatically. Because it has all the information that it needs.

And then where you go from there is maybe your pipeline doesn’t actually care if it’s a CSV file or a JSON file or a YAML file or some other thing like a database query. You’re just able to provide some data source that Nextflow understands.

And then you have your record type definition and your pipeline code. And then whatever data source comes in, Nextflow just parses it, whatever it is. And then it just goes straight into a channel. And then in your workflow logic, you just have that channel of records that you need.

Does that make sense?

Phil Ewels: This is, this is you getting back to the, everything’s the data frame. You want, you wanna get rid of CSV files. That’s what, that’s what I’m hearing.

Ben Sherman: I don’t even necessarily want to get rid of them. I just wanna make it easier for. I think the pipeline code should be mainly concerned with the things that are important to the pipeline, right? Whether the input is A CSV or a JSON or whatever just isn’t really important. As long as it has a structure, and it’s something Nextflow knows how to parse, then the user shouldn’t really have to worry about that.

I think that just it becomes increasingly important as the data landscape continues to grow and evolve. And there’s like a million different databases out there and a million different ways to like ingest data and transfer data.

It’s just nice if your Nextflow pipeline, if it’s easy to sort of plug it into whatever data source you have, I think that will be useful.

Phil Ewels: Definitely see that in many of the automation cases that we’ve been talking about, where it’s like, okay, I need a step here to take this database query, write it to a file, and then pass that file to Nextflow.

That intermediate hop every time. So if there was a way to just pass it directly yeah, that would be, that’d be really nice.

Or even have it in the Nextflow config somehow. Like, you know, either give me a file path or give me a SQL query.

Edmund: Yeah, I feel like we have that already, it’s almost the lineage piece. Back in the day it used to just be a *.fastq.gz, like you just pull all those in, but you had no idea what people were actually running if they just gave you that command.

But now with Lineage, you can know what actually went into that pipeline and then you can go back to just saying like, okay, take all my FastQs and like run this.

Ben Sherman: Yeah, it opens up an alternative way of doing automations, where instead of thinking about discrete sample sheet files, you’re maybe just operating on queries.

Maybe you have your Seqera Platform and you have, you know, a project or something and you have all these files coming in all the time.

Maybe you don’t need to worry about creating a specific sample sheet for a given batch. You just say like, okay, it’s Tuesday morning. Just grab all the samples that are ready to go, right, and throw ‘em into a sample sheet on the fly. Or, or JSON file or whatever, and let’s launch the the run.

That could be an alternative way to do things.

Edmund: Sample sheets on the fly. I like that.

Placement of params

Phil Ewels: I was gonna say, just before we go onto outputs, another thing about the params which we’ve done recently: you mentioned the typing and everything, but also we’ve become more strict about where you can use params, right? I think that’s kind of important.

Ben Sherman: Yes, that’s a good point because part of the thing you were talking about with like installing an nf-core pipeline is that right now a lot of sub-workflows and modules, it’s common for the sub components of a pipeline to just reference params directly.

And this sort of comes from the DSL1 paradigm where everything was in one script. Params was just a global variable. And so it made sense. You could just use Params everywhere.

And then in DSL2, when we start modularizing things, we still kept that capability of being able to use the Params everywhere. But from a higher level, it, it doesn’t make as much sense anymore.

Because, suppose I have a sub-workflow defined in the nf-core modules repo, and this thing exists as its own thing, and then it’s referencing something like params.foo. It’s like, okay, well, what is this params.foo? Where is it coming from?

Well, the assumption is that when you install it into your pipeline, you will have a params.foo defined at the top level, and it will mean the same thing that the workflow is expecting, right?

And so this is a bit of a dubious assumption. One of the things that we’ve tried to do with this new params block, and really just with the strict syntax in general (we’re not enforcing this yet, we may enforce it at some point in the future) is getting people to pull the params out of the internals of your workflow. Only use params in the entry workflow, at the top level. Basically, think of params as like the take section for your entry workflow.

You know, when you have inputs to a sub-workflow, you can’t reference those inputs in a module that it calls. You have to pass it explicitly as an input.

And so you wanna think about that the same way with the params. You have your params at the top level, you reference them in your entry workflow, your entry workflow calls, you know, whatever sub-workflows or processes. And it passes those params in explicitly as inputs.

The reason we haven’t been pushing this right now is because it’s, it’s actually really annoying to do that unless you have records. Because really what, what happens is that you’ll have a sub-workflow like rnaseq that will have a ton of params that it’s using. They’re really like rnaseq params.

And so I think one of the things that’s missing, that hopefully we’ll have in the next release, is the ability for the sub-workflow to say, okay, one of my inputs in the take section is just a record. And maybe that record, I’ll just call it params, right? The input itself, I’ll just call it params. That’s fine.

And it has a whole bunch of, flags and other values that are specific to the RNA-Seq workflow. And then when I call that workflow, I’ll just pass in the params as that.

And as long as all of the params in that record type were also defined as top level params, then it all sort of passes through validated.

But the point is that now that RNA-Seq workflow is truly modular, at least with the params, it’s not referencing some params that it just hopes are gonna be defined elsewhere in somebody else’s pipeline code. It’s actually defined in a way that’s internal to that workflow itself.

So you would have the workflow, rnaseq, all the workflow logic, and then below that you would have say a record type called rnaseq_params. And in that you would define all these params that you’re using and their expected types. And so that way everything is sort of explicit and self-contained.

Phil Ewels: And an important distinction is that it’s not just defining them, it’s describing them.

You know what kind of variable types they have to be, and they have to match up with the pipeline that’s using them.

And you can set defaults, I guess, as well in the record. So then, you know, if you are pulling in a pipeline, you can let that pipeline use most of its default values and just set overrides for the things you need to.

Ben Sherman: Absolutely.

Sharing workflows

Edmund: I think where you can see this kind of play out in practice is, again, the Sarek versus RNA-Seq piece.

Sarek pulls in a lot of params randomly throughout the workflow in modules, kind of just all over the place. And when you’re trying to work on that code base and you don’t know where that’s been done, and you’re trying to debug it, you’re kind of like, where did this come from?

And whereas RNA-Seq, Harshil took a very painstaking approach, params are very top level here and this is where they’re gonna be and everyone’s gonna know where they are.

But as Ben said, it’s kind of a real pain to like, not just get like, ah, but I could just call params from over here.

Phil Ewels: Paolo did actually try this approach several years back, right? We have the addParams declaration, which basically no one used.

Ben Sherman: Right. There was an idea to do what I’m talking about now, passing things in, but doing it through the include instead of through the workflow call. So you would include a module and then you would inject the params there, and yeah, it had some weirdness to it. I think ultimately it will make more sense having that interface at the actual function call level. So hopefully that will work out.

But yes, it is very convenient. The current approach, I imagine people are gonna keep doing it right up until we actually start, you know, throwing an error for it.

Which is why I want to make sure that it’s as smooth as possible to actually move to that approach before we start hard enforcing anything, drawing any hard lines.

Phil Ewels: One thing I really do like about having the params at the top level as well is that you basically then have a description of the command line interface for your pipeline in one place.

Ben Sherman: Yes.

Phil Ewels: And coming from a background of building Python tools, where you have a cli.py and you list all your flags, it feels quite familiar, but arguably with a nicer syntax.

Ben Sherman: Yeah, and I think the nf-schema plugin currently does a lot of niceties around that, right? Where it, it will do certain kinds of validation and it can also generate help text. Those are all things that we hope to pull into Nextflow natively, eventually.

Outputs

Ben Sherman: Okay, so that’s, that’s one side of it is the inputs. And then the other side is the outputs, which is a much larger can of worms ‘cause it requires a lot more changes.

The current setup is the publishDir directive, which anybody who writes Nextflow will be familiar with. It’s very convenient, right? You have all your processes and each process is generating some output files. And right in that process you say, okay, I’m gonna mark these task-level output files as “workflow-level outputs” and copy them from the work directory into some sort of top-level output directory.

And so there were a couple of issues here. Again, this also sort of goes back to the DSL1 days where everything is in one file, right? And also in DSL1, everything was much more process centric. So maybe it made sense to just have all the publishing done at the process level.

As things modularize, it becomes difficult because if you were defining those publishDir statements in the process directive, you know, in your pipeline code, then all your publishDirs are scattered throughout your code. You don’t really have one holistic view of what this pipeline is producing.

A common approach has been to instead move all those directives into a config file. Okay, so that’s nice. Now you have one top-level definition. But it’s quite long. There’s a lot of duplicated code just because of how publishDir works.

And it’s still a bit hard to read because you’re still having to target like, okay, I want this process. You know, foo sub-workflow and then bar and whatever. You have to like, target specific processes with the, withName selector. You have to target specific files or use glob patterns.

There’s no real way to save metadata. So like, if you have stuff like a meta map going through the process, you can’t like publish the meta map. You can’t just like publish metadata. You can write it to a file and then you can publish that file. And that’s essentially what people end up doing.

The end result is still that there is really no coherent definition of the pipeline outputs. The output is essentially a directory tree of files.

If you’re lucky, you might get a sample sheet as well, that maybe serves as sort of an index, if the pipeline went to the effort of generating that for you.

But at the end of the day, usually you’re having to scan this directory tree, look at the file names. Try to infer the meaning of things by looking at file names and that sort of thing.

Whereas again, what would the equivalent be in an API call? Well, you call an API, you provide your JSON payload, okay, that’s the params. And then the API will respond with a JSON payload of its own, right? That should be the outputs. There’s nothing nearly like that in Nextflow, ‘cause you’re getting a directory tree.

And so our answer to that was this workflow outputs, which by now I think, most people will be familiar with just ‘cause we keep talking about it. But it’s been kind of confusing because we’ve had it in preview for like a, a year and a half. It finally came out of preview in 25.10, back in October, after many different revisions.

It took a long time, but those 18 months were actually very crucial for us to figure out what we really needed to build. I’m really glad where it ended up.

And basically what happens now is that instead of publishing files directly from the process, you’re already emitting those files on output channels, right? And so the idea is just keep propagating those output channels back up to your entry workflow.

And, in the entry workflow, now there’s a new section. So you know how a workflow can have, like, take, main, emit? Well, the entry workflow can have main, and it can have publish. And in the publish section you can “publish” channels, which sort of serves as the replacement for using publishDir.

And then alongside your entry workflow, you’ll have an output block where you’ll define a set of output declarations. For example, you might have an output called samples, an output called MultiQC report, things of that sort.

And within each one of those output declarations, you can define mainly two things. One, you can define how to basically route different files to different file paths. So basically to define the directory tree that you want to have, just like you had before with publishDir.

Second, you can optionally define an index file, which is just another name for a sample sheet, basically. It allows you to take the channel exactly as it is and just serialize it as a CSV file or JSON file or a YAML file. Basically, whatever structure you had in that channel, say it was like a tuple with a meta map and the different files, it’ll just convert that out to JSON and then write out all of your samples into one big JSON file.

And so then the sum total JSON response would be a JSON object where you have a key for each of those output declarations. And then the value would be the channel contents.

There’s a lot of syntax niceties, there’s a lot less duplicated code. There’s finally an official, concept of an output directory, right? Everything is going into one output directory, which allows us to do some nice things.

But I think the most important thing is this idea of actually having a JSON-able workflow output.

We sort of have this in the lineage, so if you use the the data lineage system, you’ll see this. I’m hoping we can maybe surface this in a more explicit way.

Workflow outputs to chain pipelines

Ben Sherman: But now you could imagine you could actually have a pipeline where you call it, with a JSON that has all your params, and then it runs, it saves a bunch of files to, you know, say S3 or wherever. And then in the end, it spits out this giant JSON file that contains basically all of the metadata and all the file paths that, that were going through your pipeline that you chose to publish at the end.

And that JSON file sort of serves as a structured index over all the files in your output directory.

And so then you can just take that JSON and pass it along to the next block in your pipeline chain.

Phil, you were talking about with your Node-RED example having to figure out, all the different files that you wanna grab and all that. Well now you could just consume this JSON. And all the metadata is still there, so it makes it a lot easier to pick out exactly what you need, and you can just manipulate that however you need to, to run the next pipeline.

And so this is sort of the dream. It’s basically Nextflow pipeline as an API call, you know, JSON in, JSON out all the information is there.

Pulling it out of the module so that it’s all the top level, hopefully makes it much easier to use a Nextflow pipeline as itself a module in a larger system.

Phil Ewels: The way I think of it is a way of being much more explicit about the contract that you’re writing with the pipeline about the inputs and the outputs.

At the moment if you ask me what files does rnaseq create, I would have to spend quite a significant amount of time digging through the code trying to figure out the answer, ‘cause it’s just everywhere. And honestly, the easiest thing is probably just to run the pipeline, right, and then write ‘tree’. So actually just being explicit about it is huge.

And the other thing you were talking about at the end, passing a JSON, that feels to me like it ties in very nicely to the thing we were talking about earlier, about having record types as inputs and not really caring what file format they are.

Especially if that record type will just ignore any fields which are not needed, then as long as pipelines are relatively consistent with calling different entities the same thing. Like if a BAM file is always called bam and fastq_1 is always called fastq_1, it feels like maybe we could do away with some of the glue code. And multiple different pipelines would be able to interface with one another just by having a shared vocabulary of entity types.

Ben Sherman: I mean, this is a pattern I’ve noticed in some nf-core pipelines, and I’m sure many Nextflow pipelines do this: oftentimes you will see an upstream pipeline trying to anticipate the sample sheet that other downstream pipelines are expecting.

Which is actually the exact reverse of how you would want to do it. There’s a principle in systems engineering around this where, if you have a network of components that you’re composing, each component should be very flexible in what it can take in. But it should be very strict in what it produces because then its output becomes very predictable. And so those downstream components can say, okay, I know exactly what I’m getting when I call this pipeline.

I think you’ll, you’ll still end up having some glue code, hopefully not as much. But I’ve gone back and forth with some of our like hardcore pipeline devs to try and figure out, okay, what are the most complicated things you can, you can throw at me?

And there are definitely some cases where it’s not just a matter of using the same column name across the board, right? Like we could come up with conventions around that.

But there are sometimes cases where for example, maybe Pipeline A outputs a sample sheet, and then you have Pipeline B, which is expecting a sample sheet, which is a superset of the pipeline A output, but it’s also expecting some other columns.

And so who is responsible for inserting those extra columns, right? Pipeline A doesn’t really need to do it. ‘cause it, it doesn’t care about those extra columns. Pipeline B doesn’t really wanna do it ‘cause it just wants the final thing. It doesn’t wanna worry about what Pipeline A gave it versus what was missing.

And so it seems like there is a niche to be filled there for some kind of glue logic that can just do something as simple as take Pipeline A and augment it with a few extra columns, or join it with this extra, auxiliary data that I had somewhere. Anybody who’s done data engineering in other domains, this is a common thing that just has to be done sometimes.

And it was actually Rob Syme who pointed this out. We were thinking about all these different ways, like, oh, we could come up with a little JSON or YAML based DSL to just do this really quickly. And then Rob was just like, you know, we have a great workflow language for merging things like this. It’s called Nextflow.

And so I actually ran with that idea for a bit, and actually it’s not terrible. You could just write little glue pipelines in Nextflow. You can actually make them very concise, depending on how strict you want to be. And a lot of the channel operators, like doing joins and combines and all that stuff, are exactly the kinds of operations you need to do with this glue logic.

And so it could be that instead of thinking about it as Pipeline A, Pipeline B with a little bit of glue logic, maybe it’s actually Pipeline A, Pipeline C, and then Pipeline B is just this tiny little pipeline that just does a little bit of channel manipulation, a little bit of sample sheet manipulation.

So that’s, that’s one approach that we’re thinking. But you know, there, I’m sure there are others.

Edmund: Small NF, that’s what it is.

Ben Sherman: Small Nextflow.

Publishing channels

Phil Ewels: Rob was on the podcast, what, 3 episodes ago doing a recap of 2025, and we were talking a little bit about this workflow output syntax. It’s something that’s very close to his heart.

One of the things that’s always bugged him is that we have this rich metadata at pipeline runtime. We have all this information about what happened to files, where they came from, all the associated metadata, and then we just discard all of that and save a BAM file.

But part of what you’re describing is that you can save channels, basically. We’re publishing channels now, not publishing files. We can serialize that to JSON and so on. Does that impact the meta pipelines? Can we keep data structures in their native channel format and not have to go through a serialization step?

Ben Sherman: There might be a way to do that. I think that gets into how you want to structure things. Maybe you can think of it as vertical versus horizontal scaling. That can be one way to do it, where you define your meta pipeline in some Nextflow-native way, and at runtime it just combines all of it into one giant workflow DAG, right?

And in that case, yes, you could skip all the sample sheet serialization. You just pass the output channel directly in as the input to another workflow. So that’s fine, basically, as long as you’re willing to run everything on one node and you know ahead of time all of the steps you actually want to run.
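As a sketch of that one-giant-DAG approach, assuming both pipelines expose importable entry workflows with named emits (the workflow names, paths, and the `samples` emit are all made up for illustration):

```nextflow
// Hypothetical meta pipeline wiring two pipelines together directly.
// PIPELINE_A, PIPELINE_B, their file paths, and the 'samples' emit
// name are assumptions, not real pipelines.

include { PIPELINE_A } from './pipeline-a/main.nf'
include { PIPELINE_B } from './pipeline-b/main.nf'

workflow {
    PIPELINE_A(params.input)
    // No sample sheet is written to disk:
    // B consumes A's output channel directly.
    PIPELINE_B(PIPELINE_A.out.samples)
}
```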

If you get into a situation where maybe the pipelines themselves are user defined, so maybe you don’t actually know ahead of time the exact DAG that you wanna run.

Or maybe you do know it, but for whatever reason you want to split it across multiple jobs, maybe so that you can scale things differently. In that case, it may be worthwhile doing the serialization.

Most of these things have a nice analogy in the tech world, because we’re basically redoing everything that the tech world did like 10 years ago. I guess you could think of it as microservices versus monoliths. People like to dunk on microservices ‘cause it’s like, oh, well, you’re just inserting network transactions in the middle of everything, making everything really slow.

But there was a real reason for doing microservices. I think it was mainly around being able to scale things, right? You have services A, B, and C, and you have a pipeline. Sometimes it’s useful to be able to run those services on different nodes so that you can say, okay, stage B needs to scale up like a hundred X right now, but stages A and C are doing fine, so I’m just gonna scale up B.

And that’s where having that serialization layer can be very useful. Now, I don’t know if that ends up being very important in a Nextflow pipeline. It probably depends on how dynamic that pipeline needs to be. But that could be where it’s useful: okay, I’m just gonna go ahead and write out a sample sheet and pass it off to another job, so that I don’t need to worry about how that job is running or scaling. Right? It can scale up however it needs to.

Phil Ewels: And I guess also, if we switch gears from pipeline chaining to meta pipelines: are we importing a whole pipeline, or are we just importing the workflow? Because if we’re just importing the workflow, then we have no concept of the output block from that pipeline. So then we’re just dealing with the emitted channels.

Ben Sherman: I think that remains to be seen. Certainly the efforts that I’ve described here with the workflow inputs and outputs, it goes a long way towards decoupling the pipeline code from the config that exists around it, like we were talking about earlier.

You know, a lot of the config that remains is stuff that you will already have in the meta pipeline anyway, so maybe you don’t need to worry about copying it over.

But there are still a few loose ends remaining around, say, the ext.args and ext.prefix config, that kind of stuff.

You would need a way to either carry that over or provide an alternative: basically find a way to move that into the pipeline code itself as well. There are some ideas brewing around that.

So yeah, it remains to be seen. Both paths are feasible. I think if you could just import the core workflow, and only have to worry about the pipeline code when you’re importing a pipeline like that, that would be ideal.

So we’ll probably try to make that work. But as a backup, we can always try the dirtier approach where we just bring it all in somehow. We’ll have to see.

Phil Ewels: I feel quite excited that this is now actually a tractable problem. It’s something we’re actively working on. It feels like it might be a solved problem at some point.

For many years I’d just kind of put it on the pile of, like, that’s impossible. So it’s quite an exciting time.

Ben Sherman: I will say that for both of these, both the params and the outputs: for very large pipelines it’s probably worth waiting for record types to come along in 26.04 before embarking on these huge migrations.

Just because, especially with the outputs, having to emit all of these channels all the way back up to the entry workflow can be extremely verbose with the current tuples. Whereas with records it gets a bit nicer.

I know our colleague Jon recently tried to migrate RNA-Seq to Workflow Outputs, and I mean, it is just nasty, how many lines that produces.

I think he ended up just going ahead and merging it and just taking the pain, enduring the pain for a few more months.

But the records are a crucial piece of this, just because of the way they allow you to define structured data and pass things along, without having to worry about what tuple element number 0, 1, 2, 3, 4 is, all that stuff.
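To illustrate the difference, compare addressing tuple elements by position with the kind of record type described here. Record types are slated for 26.04, so this syntax is speculative and may not match the final design:

```nextflow
// Today: positional tuples, where you have to remember what each slot is.
ch_aligned.map { meta, fastq_1, fastq_2, bam -> tuple(meta, bam) }

// With records (hypothetical 26.04-style syntax): fields have names,
// so intermediate workflows can pass the value along untouched.
record Sample {
    Map meta
    Path fastq_1
    Path fastq_2
    Path bam
}

ch_aligned.map { Sample s -> s.bam }   // no more counting tuple elements
```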

Schema

Phil Ewels: But something we haven’t touched on is the nextflow_schema.json, which is a concept we’ve had around for a while on the nf-core side. It’s useful not as something dynamic, but as a static description of a workflow interface.

How does that tie into these plans?

Ben Sherman: I mainly see the schema as a way for external tools to reason about Nextflow pipelines. The main one we’re interested in, of course, is Seqera Platform, like being able to load up a pipeline, and the Platform being able to reason about, okay, what are the expected inputs and outputs, without actually having to run it or to like parse the code itself. You don’t want to have a little mini Nextflow compiler, like sitting in the Platform.

And right now, the one that, you know, nf-core developed is basically exactly what we need. We may add a few things to it, but at this point we’re mainly interested in bringing that in as a sort of native Nextflow experience.

In fact, the schema itself is considered part of the Nextflow org. But especially now with this params block and the output block, we should be able to generate a lot of this stuff automatically, instead of having to write it by hand.

Of course, nf-core also has this schema builder; there’s a CLI and a web interface. I think those will continue to be useful, because there’s a lot of information in the schema that is not in the params and output blocks in the code.

So it will likely look something like: you write your Nextflow pipeline, you define all your params and outputs in your code first, and then maybe Nextflow generates that schema JSON, like a skeleton of it.

And then from there you can go in and add extra stuff: things like extra validation rules, or if you wanna define the icon. Nextflow doesn’t really care about that, but you could go in and add it manually, or through the CLI, or through a website.

And then you bake that into your pipeline repo, maybe every pipeline release that you have will have your main code and also have the schema. And systems like Platform can look at that schema and get a preview of what this pipeline is gonna do.
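For reference, a heavily abbreviated sketch of what such a generated-then-hand-edited schema fragment might look like, in the nf-core style. The pipeline name and field values here are illustrative, not from a real schema:

```json
{
  "$schema": "https://json-schema.org/draft-07/schema",
  "title": "my-pipeline parameter schema",
  "properties": {
    "input": {
      "type": "string",
      "format": "file-path",
      "pattern": "^\\S+\\.csv$",
      "description": "Path to the input sample sheet",
      "fa_icon": "fas fa-file-csv"
    },
    "outdir": {
      "type": "string",
      "description": "Directory for published results"
    }
  }
}
```

Fields like the type, format, and description could plausibly be generated from a typed params block; things like `fa_icon` and the regex validation are the hand-added extras Ben describes.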

This is nice for things like rendering a preview, a description of what this pipeline does without having to look at the code, or documentation generation.

Schema + pipeline chaining

Ben Sherman: But it also can be very useful for pipeline chaining. If we imagine a way to build up pipeline chains in the platform itself, instead of writing a meta pipeline, maybe there’s some kind of user interface.

We already have this idea of a pipeline in Platform, where you can prepare a pipeline: throw in the repository URL, the configuration, all that stuff, and have this executable that’s just baked in, ready to go. And then whenever you actually wanna run it, you just click launch, and all you really throw in is the input data. Everything else is already configured.

Well, you can imagine a similar thing for a pipeline chain, where you say, okay, I want this pipeline to actually be a chain of A, B, C. And you have those pipeline schemas telling you the inputs and outputs.

Platform could then do a lot of validation right up front. I don’t have to wait to run the pipeline to see if I made a mistake when I hooked up all the inputs and outputs. I can say, okay, Pipeline A outputs this sample sheet; I wanna take that and plug it into the input sample sheet of Pipeline B.

Platform can look at the schemas of both. Currently the Nextflow schema just has the params, right? But in this future scenario, the schema JSON would also have some description of the outputs. So we have the outputs and the inputs, and Platform could say, okay, output of A, input of B, got the sample sheet, let’s see if they match up. Okay, they mostly match up, but Pipeline A’s output has a column called sample, and Pipeline B is expecting a column called id. So I need you to tell me how to link up which column goes to which.

And then maybe Platform has some limited capability of implementing glue logic. Or maybe it just tells you, hey, these sample sheets don’t match up, you need to add a third step to fix it. Whatever, right?

This is sort of a shortcut, a convenience, for the simplest use cases, right?

Now, if you have something where it’s like, I wanna launch Pipeline A, and then for every row in its output sample sheet I wanna launch a Pipeline B run, that’s probably going beyond a simple user interface. You probably wanna just write a meta pipeline for something that complicated, right?

But for these simpler cases where you could build a pipeline chain in Platform, the schema becomes very useful for giving you early validation.

And maybe there’s other ideas I haven’t thought of, and maybe you have other ideas, Phil, I don’t know. But that’s the main one on my mind.

Piping pipelines

Phil Ewels: The only other one that comes to mind is one that came up and I can’t remember who originally suggested this. So credit to who that was, it wasn’t me.

If pipelines are now emitting JSON outputs, saying these are all the files I created, and they can now take a JSON input and parse it into a record.

You’ve got that nice one-to-one symmetry. It’s a nice interoperable language. I mean, could we just pipe Nextflow commands on the command line? Just say, like, nextflow run A | nextflow run B, and you just take the input off standard in?

Edmund: That’s the dream

Ben Sherman: I wrote a PR and I tried to do this. And yeah, you’re right, you don’t really need the schemas for this. You can kind of just make it work with what we already have.

Basically I had to add two things. One, “nextflow run A” needs to be able to print out its outputs as JSON, right? It currently doesn’t do that.

And then it also needs to be able to take in params, say, the params file, but through standard in instead of a command line argument. If you add those two basic things, it actually does work.

Actually, the problem that I ran into was if you have three steps or more: A piped into B, piped into C. Then with that “nextflow run B”, there’s something that I’m missing where you have to get the standard in and standard out working correctly so that they don’t hang.

Because C needs to wait for the output of B before it starts, right? And I’m sure there’s a standard way of doing this, because every Unix command-line tool clearly has to be able to handle that case.

So once I figure out how to do that, then I think we could have just, just straight Unix pipes for Nextflow.

I don’t know how much value that will add. I’m sure it will make people feel very nice that they can literally pipe their Nextflow pipelines into each other.

Edmund: It’s cool. Yeah, that’s the ultimate pipe syntax, just being able to pipe them together. Well, you’d need to be able to stream each individual result or output, and have a streaming input for Nextflow for each of those.

Ben Sherman: Well, that’s the thing. The pipe is normally designed for that: you’re taking in some text and you’re streaming the text out line by line.

The way I’m using it, it’s not really doing it that way. Nextflow starts up, it reads in all the JSON immediately and parses it, and then it doesn’t stream anything until the very end, when it just prints out the entire JSON.

Unless you’re imagining something where like every task that completes, you know, it could stream out the task outputs.

But I think the way that the params block and output block are set up, the Nextflow pipeline’s gonna do that for you. It’s gonna stream everything into the workflow outputs.

But that’s maybe getting a little crazy, you know?

Edmund: We could probably just do that with a println, you know, and just print a JSON and pipe it in. You could use some jq in the middle of each of ‘em, and then every fetchngs process that finishes and has downloaded its file, you could then kick off RNA-Seq and start that up.

But you’d end up starting different pipelines. What I want is like a watch path, but for watching JSON, or watching a stream on standard in.

Ben Sherman: I feel like, again, that’s maybe the point where you just, write a meta pipeline.

Because then you get the channels, right? And with the channels, you get that behavior. fetchngs has some output channel for the FASTQ samples, and every sample that comes through that channel will immediately be passed into the downstream workflow.

Phil Ewels: You just step away from the keyboard Edmund. I know what you’re gonna be doing this weekend.

Edmund: Yeah, I just might go crazy on that. That’ll be, that’ll be a fun one.

Phil Ewels: This is cool stuff though.

Nextflow Lineage

Phil Ewels: One thing I was gonna ask you earlier on with the workflow outputs, Ben, is how this ties into lineage. Because lineage can have identifiers for file outputs, right? I remember talking about this in the blog post when we launched it. You can specify a file input with the lineage URI prefix. Can we do any kind of clever stuff with this now?

Ben Sherman: Well, yeah, lineage is this totally other thing that’s looming in the background, where we all know that it’s very powerful and valuable, and we’re trying to figure out exactly how to layer it in with everything.

Nextflow records a lineage record for every task that gets run and every file that gets produced in the course of your workflows. And each of those records has an ID. It’s kind of like an S3 path, except it’s lid:// and then some kind of hash, like the task hash. And you can refer to files in that way.

So you can imagine a sample sheet where, instead of normal file paths, the entries are lineage paths. The pipeline can then load those files up by looking at the lineage record, and the lineage record points to where the actual file is.
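So instead of filesystem or object-store paths, a sample sheet might look something like this. The column names and the hash placeholders are purely illustrative:

```csv
sample,bam
patient_1,lid://<hash-of-producing-task>/patient_1.bam
patient_2,lid://<hash-of-producing-task>/patient_2.bam
```

The pipeline never needs to know where those BAMs physically live; resolving the lid:// identifier through the lineage store is what finds the real file.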

This goes back to something we were talking about earlier: maybe if you have the pipeline chaining and the lineage and all these pieces put together, maybe in Platform you’re able to transcend this paradigm of passing sample sheets around and running things in batches, and instead do more of a data engineering, ETL kind of workflow, where pipelines just get run regularly on the data that’s there.

Really thinking about your whole system as being more of a data-centric system rather than a pipeline-centric system. The main thing is not running pipelines to generate new data. It’s: I have these data catalogs or whatever, and pipelines are the links between my data catalogs. When new data comes into data catalog A, and data catalog B is downstream of that, it just automatically gets updated.

And the fact that it runs a pipeline under the hood, or a bunch of pipelines under the hood, doesn’t really matter, if those pipelines are stable enough that you don’t really have to think about them at all. They just run. They just happen.

And lineage becomes an important part of that, because you’re not running these pipelines manually anymore. You just have this system that’s automatically running them, updating things as they need to be updated.

It becomes increasingly important to say... well, maybe I’m making Edmund cringe, maybe I’m butchering all this terminology, I’m just throwing around words like data catalog. But say you’re looking at some table that is downstream of three or four analyses that you didn’t run. You set up an automation, Platform ran it for you. Now you’ve got this data catalog F or whatever, and you’re looking at some rows in it. It becomes very important to look at those rows and say, okay, what are all the steps this file had to go through to get to where it is now?

Because I sure don’t remember, ‘cause I didn’t run those pipelines. They just ran, you know, over the weekend. And the lineage keeps track of all that stuff as it happens. I think that’s really where it comes in. It’s this crucial component on our journey from being a pipeline-run-centric system to a data-centric system.

Phil Ewels: Exactly. The pipelines are just a means to an end.

Ben Sherman: Yeah.

Edmund: I think it’s just gonna become even more important with, like, AI agents running this. You said it just kicked off automatically over the weekend; or it’s like, oh, my agent went and proactively ran this pipeline for me, and you need to know the lineage: where did that come from? Why’d we kick that off? Why did this happen?

And not just for agents debugging it, but so humans can reason about these systems overall.

But it just kinda goes back to the arc of bioinformaticians: going from running each individual process by hand, to processes kicking off automatically that I set up, to processes kicking off because my agent set up an automation for it.

Phil Ewels: Brave new world.

Conclusion and wrap-up

Phil Ewels: Alright, any closing thoughts? I know this has been a long podcast, but it’s been a really interesting one.

Ben Sherman: Well, I guess I’m excited. You know, I feel like we’re on the verge of something spectacular here.

We’ve, we’ve got all these Lego pieces lying around and we’re trying to figure out how to put ‘em together.

So I’m just eager to finally get it working, especially that last bit. You know, if I could stop worrying about pipelines so much and just think more about my data, I think that’s gonna be a much nicer world.

Edmund: Yeah, I’m definitely excited for that as well. I think the concept behind all of this is that everyone has their own unique needs, and we’ve been trying to create these generalized, one-size-fits-all tools like RNA-Seq. And then usually what people do is they take RNA-Seq and they fork it and rip everything out.

So I’m really excited to see where we’re going with that, not only in automation, but like taking individual pieces of pipelines and kind of breaking them down for scraps and then pulling them out and putting them back together in bespoke workflows.

Phil Ewels: Yeah, which was always part of the plan for nf-core from the very start: that pipelines would be a foundation to build on, not just entities in their own right.

Cool. Well, this was a fascinating and wide-ranging discussion, chaps, so thank you very much for joining me. And good stamina, keeping with me to the end.

I’m sure we’ll be touching on these topics again. But it’s been really nice to speak about them in this framing; it gives good context as to why we’re working on these things, why we’re doing inputs and outputs.

Coming at it from this angle shows some of the things which are problems, and how these tweaks and improvements can address them.

Cool. Right, thanks very much Ben and Edmund for joining me and thank you everyone for listening. I hope that it’s been interesting and enjoyable and, we’ll catch you again in a future podcast.

Thanks very much.