Pipeline Storage: Stop Comparing Price per TB

Thanks to cloud computing, we have more options for storing genomics data than ever, but those options bring real challenges. What used to be a simple fixed cost for a shared server is now a bewildering menu where everything is a choice. Spend more upfront and you trade away flexibility; reduce the upfront cost and you have to hope dynamic pricing doesn’t catch you out. The price per TB matters less than the shape of the bill: where you pay fixed costs versus surcharges, and whether you invest in hardware or spend more engineering time. The result is a set of systems that can’t be compared on a single number, each with its own strengths and weaknesses.

Most teams pick a single architecture and stick with it. Bioinformatics data lasts a long time, and once a storage layer is established, teams rarely sit down to evaluate whether switching is worth it. The evaluation is difficult and time-consuming, and it is hard to reach a firm conclusion. Yet storage costs can quickly dwarf the cost of running the pipelines themselves, so optimizing them should be one of your highest priorities.

Three Options to Store Pipeline Data

Typically, the three options are on-prem shared storage, cloud shared filesystems and object storage. Each makes different trade-offs around cost, performance, and operational overhead.

→On-prem local storage - physical disks in a server in your building or a rented rack. Served over NFS, GPFS, Lustre, BeeGFS, or whichever HPC stack your institute uses.
→Cloud shared filesystems - the cloud equivalent of shared storage (FSx for Lustre, GCP Filestore, Azure NetApp Files). Same POSIX interface, with somebody else racking the disks.
→Object storage - AWS S3, Google Cloud Storage, Azure Blob. Storage that’s distributed so far and wide it no longer looks like a file system. No POSIX, no mount points, but essentially infinite, highly available storage accessed via web-style APIs.

The three categories push the operational overhead onto different people and cost you in different ways.

Fixed Costs: The Entrance Fee

A local cluster is your traditional shared storage layer: racks of physical storage attached to a server and exposed to every node over a shared protocol. It has a high upfront cost to buy and install the hardware, but once provisioned it’s reusable for the lifetime of the hardware. Even so, it can be a deceptively expensive option. Every month the disks cost you power, cooling and datacenter space, plus the salary of whoever replaces failed drives. All of these costs exist even if you don’t use a single byte.

Object storage is the cloud-native way of storing data. Cloud providers shard data across thousands of drives at massive scale, amortizing the cost of storage across customers to the point that the upfront cost is effectively zero. That same scale keeps data replicated across availability zones for durability, and makes object storage one of the most cost-effective options available. The major cloud providers all charge in the region of $0.02/GB-month (S3 Standard lists at $0.023), so 12 TB is roughly $280/month. More importantly, if you delete the data after a pipeline finishes, the bill goes with it. Conversely, if you suddenly need more storage, there’s no capital expenditure to increase capacity: you just push your data to the cloud.

A persistent cloud filesystem tries to split the difference: take an existing storage solution like Lustre, turn it into a managed product like AWS FSx for Lustre, and let the cloud provider handle maintenance. The challenge is that you still carry fixed costs on top of cloud fees: you need to size in advance for peak concurrent usage, you can’t easily shrink capacity, and the meter runs 24/7. In effect, you pay a premium for someone else to manage the hardware risk so you can focus on your work. AWS FSx for Lustre SSD-persistent costs around $0.145/GB-month at the entry throughput tier, so a 12 TB filesystem clocks in at around $1,740/month before any pipeline runs, over six times the cost of S3 object storage. EFS Standard at $0.30/GB-month is more than twice that again.
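The arithmetic behind these at-rest figures is simple enough to sketch. The snippet below is an illustration only: the FSx and EFS rates are the ones quoted in the text, the S3 rate is an assumed S3 Standard list price of $0.023/GB-month (the figure the ~$280 in the comparison table corresponds to), and the function names are hypothetical. Real prices vary by region and tier.

```python
# Per-GB-month prices. FSx and EFS rates are the ones quoted in the article;
# the S3 rate is an ASSUMED S3 Standard list price. Check current regional pricing.
RATES_PER_GB_MONTH = {
    "S3 Standard": 0.023,
    "FSx for Lustre SSD-persistent": 0.145,
    "EFS Standard": 0.30,
}

GB_PER_TB = 1000  # decimal terabytes, matching the round figures in the text


def monthly_storage_cost(capacity_tb: float, rate_per_gb_month: float) -> float:
    """Cost of keeping capacity_tb stored (or provisioned) for one month."""
    return capacity_tb * GB_PER_TB * rate_per_gb_month


def compare(capacity_tb: float) -> dict[str, int]:
    """Monthly at-rest cost for each option, rounded to whole dollars."""
    return {
        name: round(monthly_storage_cost(capacity_tb, rate))
        for name, rate in RATES_PER_GB_MONTH.items()
    }


if __name__ == "__main__":
    for name, cost in compare(12).items():
        print(f"{name:32s} ${cost:,}/month")
```

Note that this only covers data at rest. For object storage the capacity is whatever you actually store, while for a persistent filesystem it is whatever you provisioned, used or not.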

Transfer Costs: The Usage Fee

With local disks, data transfer is effectively free because bytes move over your own network, which you have already paid for. For cloud filesystems, traffic is cheap because the storage sits close to the compute. You do need to make sure your filesystem and your compute instances don’t drift into different availability zones (AZs): cross-AZ data transfer costs around $0.01/GB on AWS and Azure and can add up quickly for a busy pipeline. In general, though, transfer charges for networked filesystems are small enough to ignore.

Object storage is where transfer bills become unpredictable. It is cheap to upload and store data, but you are charged every time you read, move or copy it. Those charges rise dramatically when data crosses cloud regions, with cross-region replication at $0.02/GB and egress to the wider internet at a hefty ~$0.09/GB. This incentivizes you to upload data to your cloud provider and leave it there, only touching it from within the cloud platform. For analyzing genomics data, this is sometimes fine: do all the analysis on a Batch service in the same region and costs can be kept minimal. Sharing work with colleagues or collaborators across the globe, however, can quickly escalate costs. Worse still, transfer costs only appear in arrears; by the time you’ve seen the bill, the data was moved long ago!
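Because transfer charges only show up after the fact, it can help to estimate them before moving data. The sketch below uses the approximate per-GB rates quoted above; the `Transfer` class, the route names and the monthly volumes are all hypothetical, and same-zone traffic is treated as free as a simplifying assumption. Per-request charges are not modelled.

```python
from dataclasses import dataclass

# Approximate per-GB transfer rates quoted in the article. Treating same-zone
# traffic as free is a simplifying ASSUMPTION.
RATE_PER_GB = {
    "same_zone": 0.00,
    "cross_az": 0.01,
    "cross_region": 0.02,
    "internet_egress": 0.09,
}


@dataclass
class Transfer:
    description: str
    gigabytes: float
    route: str  # one of the RATE_PER_GB keys


def transfer_cost(transfers: list[Transfer]) -> float:
    """Total transfer charge in dollars for a list of data movements."""
    return sum(t.gigabytes * RATE_PER_GB[t.route] for t in transfers)


# Hypothetical month: the volumes are made up for illustration.
month = [
    Transfer("pipeline reads inside one zone", 50_000, "same_zone"),
    Transfer("compute drifted to another AZ", 10_000, "cross_az"),
    Transfer("replicate results to a second region", 2_000, "cross_region"),
    Transfer("collaborator downloads a cohort", 2_000, "internet_egress"),
]

if __name__ == "__main__":
    print(f"${transfer_cost(month):,.2f}")
```

In this made-up month the 50 TB read in place costs nothing, while the single 2 TB download by a collaborator is the largest line item.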

Management Costs: The Overhead Fee

Local storage has the highest hidden cost here. Drives, controllers and fans fail; capacity fills up. You need sysadmins who know how to deal with degraded RAID arrays and angry phone calls from users. Capacity planning is a real job but is rarely accounted for, and it takes time away from staff who could be doing something more productive.

Persistent cloud filesystems trade hardware problems for configuration problems. Someone has to write the Terraform, size volumes, monitor throughput limits, rotate snapshots, and work out why your Lustre filesystem has degraded throughput. You pay a cloud vendor so you can ignore the hardware, but you still pay to manage the data layer.

Object storage management is largely outsourced to the provider. Lifecycle policies, intelligent tiering, retention rules, and replication all need to be enabled, but they come as pre-packaged features. You can still set fire to your bill, but there is much less routine work. Access control and auditing, for example, use the same primitives as the rest of your cloud deployment: IAM and bucket policies aren’t trivial, but they’re a known quantity and you don’t have to roll your own.
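As an example of how little work a pre-packaged feature can be, here is a minimal sketch of an S3 lifecycle rule that expires intermediate pipeline files automatically. It is S3-specific and rests on assumptions: the `work/` prefix, the 30-day window and both helper function names are hypothetical, and applying the rule needs the `boto3` package and AWS credentials.

```python
def work_dir_lifecycle(prefix: str = "work/", expire_days: int = 30) -> dict:
    """Build an S3 lifecycle configuration that expires objects under prefix."""
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/')}-after-{expire_days}d",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expire_days},
                # Also clean up parts of multipart uploads that never completed.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration with boto3 (requires AWS credentials)."""
    import boto3  # imported lazily so building the dict has no dependencies

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    import json

    print(json.dumps(work_dir_lifecycle(), indent=2))
```

Once a rule like this is in place, the provider deletes expired objects on its own schedule, so there is no cleanup job for anyone to maintain.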

Development Costs: The Hidden Fee

The whole bioinformatics tooling stack is built on assumptions about file paths and directories, glob patterns and symlinks. Local disks and managed cloud filesystems are easy to use because they behave like a regular filesystem, and everything in bioinformatics expects this. Want to read a file? Easy, use a 40-year-old protocol. Unfortunately, in an age where data exceeds the capacity of a single machine, that model doesn’t map onto object storage APIs. If you’re using object storage, you need to bridge the gap by translating each file read into something these older tools understand. There are three primary strategies: rewrite tools to speak object storage natively, wrap every task in localize/upload boilerplate, or bolt a generic FUSE driver on top of the object storage. None of these is great, and all are workarounds for the same underlying mismatch: object storage and POSIX speak different languages.

What’s worse is they all require effort. If you can assume a single type of storage, you can write your pipelines to match that storage. The moment you don’t know how the files will arrive, you have to code that into your pipeline. Do you write wrapper logic to detect a remote file and download it first? Do you put the effort on the infrastructure engineer to make remote files feel local? After making the decisions, you have to invest development time in building this logic instead of pipeline code and often, this detracts from the project goals.
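To make the wrapper option concrete, here is a minimal sketch of the "detect a remote file and download it first" logic. It is a generic illustration, not how any particular workflow engine does it: `localize`, `REMOTE_SCHEMES` and the caller-supplied `fetch` function are hypothetical names, and `fetch` stands in for whichever cloud client you would really use.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse

REMOTE_SCHEMES = {"s3", "gs", "az"}  # assumed URI schemes for object storage


def localize(uri: str, task_dir: Path, fetch: Callable[[str, Path], None]) -> Path:
    """Make uri available as a local path inside task_dir.

    Local files are symlinked, which avoids a copy. Remote objects are
    downloaded with fetch, a caller-supplied wrapper around a cloud client.
    """
    parsed = urlparse(uri)
    target = task_dir / Path(parsed.path).name
    if parsed.scheme in REMOTE_SCHEMES:
        fetch(uri, target)  # the task waits here until the download finishes
    else:
        os.symlink(Path(parsed.path).resolve(), target)
    return target


if __name__ == "__main__":
    def fake_fetch(uri: str, dest: Path) -> None:
        dest.write_bytes(b"")  # stand-in for a real download

    with tempfile.TemporaryDirectory() as tmp:
        staged = localize("s3://example-bucket/run1/sample.bam", Path(tmp), fake_fetch)
        print(staged.name)
```

Even this small version shows where the effort goes: every task needs scratch space for its inputs, a matching upload step for its outputs, and credentials for the fetch function, none of which is pipeline science.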

Nextflow was one of the first bioinformatics tools to understand this dichotomy and tried to abstract the underlying storage away. Developers denote inputs and outputs as files and Nextflow handles them for you. Working on a local filesystem? Nextflow uses a symlink to save the copy operation. Using object storage with distributed computing? Nextflow stages the file using native command line tooling. By hiding the underlying protocol, Nextflow takes some of the development burden off the bioinformatics engineer and instead puts it on configuration. But the solution isn’t perfect: data staging adds delays at the start and end of every task and requires cloud access permissions to be pre-configured; local storage must be large enough to hold all inputs and outputs for every running task. Overall, Nextflow trusts your infrastructure has been set up correctly, a tricky thing to guarantee.

Who Pays the Time Tax?

Accessible filesystems keep developers happy at the cost of administrator time. Object storage keeps administrators happy at the cost of developer time. The persistent filesystem option can be particularly punishing because of how you size it. The storage has to cover the worst case: every concurrent pipeline running at peak, plus headroom. A "12 TB" filesystem is rarely 12 TB because the team uses 12 TB; it's 12 TB because that's the smallest number that doesn't break under load. Most of the time it sits half-empty but fully billed.
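The sizing logic can be written down in a few lines. This is a sketch under stated assumptions: the four concurrent runs, 2.5 TB per run, 20% headroom and 6 TB average usage are invented numbers chosen to land on the 12 TB example, the function names are hypothetical, and the rate is the FSx entry-tier price quoted earlier.

```python
FSX_RATE_PER_GB_MONTH = 0.145  # entry-tier rate quoted earlier in the article
GB_PER_TB = 1000  # decimal terabytes, matching the round figures in the text


def provisioned_tb(peak_runs: int, tb_per_run: float, headroom: float = 0.2) -> float:
    """Capacity needed for the worst case: every run at peak, plus headroom."""
    return peak_runs * tb_per_run * (1 + headroom)


def idle_spend(provisioned: float, average_used_tb: float) -> float:
    """Monthly dollars spent on capacity that is provisioned but unused on average."""
    unused_tb = max(provisioned - average_used_tb, 0.0)
    return unused_tb * GB_PER_TB * FSX_RATE_PER_GB_MONTH


if __name__ == "__main__":
    size = provisioned_tb(peak_runs=4, tb_per_run=2.5)  # hypothetical workload
    print(f"provision {size:.1f} TB")
    print(f"idle spend ${idle_spend(size, average_used_tb=6):,.0f}/month")
```

With these made-up inputs, half of the monthly filesystem bill pays for capacity that is empty on an average day.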

	Local disks	Object storage	Persistent cloud filesystems
Fixed cost	High, predictable	Low, scales to zero	High, scales poorly to idle
12 TB / month	Capex + power + cooling	~$280 (S3 Standard)	~$1,740 (FSx SSD)
Transfer	Free in-network	Cheap same-region, $0.09/GB to internet	Cheap, watch cross-AZ
Management	Sysadmin time	Mostly the provider	DevOps time
POSIX semantics	Native	None	Native
Existing tools	Just work	Need rewrites or wrappers	Just work

Fusion: Object Storage Prices, Filesystem Semantics

Fusion addresses the object storage/filesystem tradeoff by implementing a thin, POSIX-compliant client. This client runs inside your pipeline containers (installed via Wave) and mounts S3, GCS, or Azure Blob storage as a standard filesystem. Existing tools read and write files normally while Fusion streams data directly to object storage, using intelligent caching for performance. The advantage of this approach is decoupling compute from persistent storage sizing. Since Fusion uses your existing object storage, there is no need to provision a shared filesystem, size for peak concurrency, or pay for idle capacity.

For bioinformatics developers, compatibility is straightforward: existing Nextflow modules require only a configuration flag (fusion.enabled = true) to work, and Wave will add the Fusion binary to the existing containers. Tools like samtools continue to access files via a POSIX path, which Fusion resolves to the corresponding S3 object. Operationally, this combines developer-friendly POSIX semantics with the favorable economics of object storage: when pipelines are idle, costs reflect minimal data-at-rest pricing.

While several FUSE (Filesystem in Userspace) implementations exist, most are optimized for general use cases like web serving of small, discrete files. Fusion is specifically engineered for high-performance bioinformatics workloads, supporting the disk access patterns common in these tools, such as extremely fast random reads from large files and optimized large sequential writes. It’s true that in certain scenarios a dedicated Lustre filesystem may outperform Fusion, but by using fast, local NVMe drives and intelligent caching we can mitigate these performance drawbacks while keeping the low cost overhead of a FUSE mount.

What It Looks Like on a Real Bill

Say we take the same 12 TB working set mentioned above, with a busy bioinformatics group running a few concurrent pipelines a day. The monthly costs are already drastically different:

Approach	Monthly cost
FSx for Lustre SSD-persistent	~$1,740, plus cross-AZ traffic
EFS Standard	~$3,600
S3 Standard + Fusion	~$280 storage + Fusion throughput licence

But the bigger savings are operational. Those engineering hours often cost more than the storage itself, even if they never show up as a line item. By removing the maintenance burden, Fusion gives teams back hundreds of hours that would otherwise go on wrestling with infrastructure. If you’ve ever had to fix a pipeline failing because of disk issues, Fusion is for you.

Our aim at Seqera has always been to help you spend more time doing science. We've spent a long time watching capable teams burn engineering time on storage infrastructure that should be invisible. Fusion is our answer to that: object storage prices, filesystem semantics, no software rewrites required. If you want to see what that looks like with your actual workload, get in touch.

What is Fusion?

Fusion is Seqera's cloud-native file system designed and optimized for Nextflow pipelines. Whether you're running high-throughput genomics pipelines or scaling across multiple clouds, Fusion lets you analyze data in place without sacrificing speed, reliability, or efficiency.
