Data Studios – Interactive analysis in Seqera Platform
Nextflow is the de facto standard for reproducible workflows in the cloud, but the scientific data lifecycle is much broader than just pipelines — including iterative development, tertiary analysis, and data modeling. With the Seqera Platform, we aim to enable rapid iteration and collaboration across the entire scientific lifecycle, saving you time whether you’re experimenting, conducting research, preparing for your next clinical trial, or producing a new therapeutic.
In October 2023, Seqera CEO and co-founder Evan Floden unveiled the private preview of Data Studios, which streamlines the creation of collaborative notebook environments using cloud-native components coupled with your data and hosted in your own secure environment. Today we’re excited to announce that Data Studios is available to all Seqera Cloud users in Public Preview!
Combining Workflows and Data Analysis
Nextflow and Seqera Platform are enormously effective at launching, managing, and collaborating on scientific data analysis pipelines. However, a pipeline run is often not where the analysis ends, and for every user who needs to run and manage pipelines, many others, including analysts and data scientists, need interactive environments such as Jupyter Notebooks or RStudio. These are used for exploratory data analysis, modeling, and building visualizations and dashboards for analyzing and sharing scientific results.
For scientific users, deploying and configuring secure, performant interactive notebook environments to work with data in context has traditionally been surprisingly hard. As a concrete example, consider a scenario where a pipeline is running on AWS, and a data scientist wants to analyze results stored in Amazon S3 using a familiar Jupyter Notebook. Configuration doesn’t happen by itself: the notebook must be hosted, made network accessible, restricted to specific groups of users, and pre-configured with packages commonly used in bioinformatics, such as Biopython, NumPy, scikit-learn, and Matplotlib.
Before data can even be read using pandas, s3fs must be installed, which in turn depends on other prerequisite packages. Additionally, Notebook users must know the paths to the S3 buckets where the data files reside, including the Nextflow pipeline work directory, and have appropriate access.
Multiply this complexity across multiple tools, cloud providers, file stores, languages, and libraries, and you get the picture: configuring these environments is tedious, time-consuming, error-prone, and often beyond the privilege level or expertise of analysts.
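To make the S3 scenario above concrete, here is a minimal Python sketch of the manual steps a notebook user faces: checking that pandas and s3fs are installed, building the path to a results file, and reading it. The helper names (`missing_packages`, `s3_uri`, `read_results_table`) and the bucket and file names in the usage note are hypothetical, chosen only for illustration; the sketch also assumes AWS credentials are already available in the notebook environment.

```python
import importlib.util


def missing_packages(required=("pandas", "s3fs")):
    """Return the names in `required` that cannot be imported here."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def s3_uri(bucket, *key_parts):
    """Join a bucket name and key segments into an s3:// URI."""
    parts = [part.strip("/") for part in key_parts]
    key = "/".join(part for part in parts if part)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def read_results_table(uri, sep="\t"):
    """Read a delimited results file from S3 into a pandas DataFrame."""
    missing = missing_packages()
    if missing:
        raise RuntimeError(f"install before reading {uri}: {', '.join(missing)}")
    # Imported lazily so the helpers above still work without pandas.
    import pandas as pd
    # pandas hands s3:// URLs to s3fs, which relies on the AWS credentials
    # available in the environment (an assumption of this sketch).
    return pd.read_csv(uri, sep=sep)
```

A user would then call something like `read_results_table(s3_uri("my-bucket", "results", "summary.tsv"))`, which still requires knowing the bucket layout and having access to it.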
Data Studios – Simplifying Analysis Environment Management
Data Studios enable you to easily create, manage, and share notebook environments in Seqera Platform using point-and-click actions — connecting your data to on-demand batch computing resources — similar to how you currently manage Nextflow pipelines.
Like pipelines, Data Studios enable simple deployment and scaling using customizable, ephemeral compute environments and containers. You add new interactive environments based on predefined templates, as shown below, defining your own metadata, vCPUs and memory, and deploying them with any (public or private) data mounted on a variety of compute environments already configured in Seqera Platform.
The initial release of Data Studios ships with pre-built container templates for Jupyter and RStudio, and environments can be shared with individuals and teams in Seqera Platform using Role Based Access Control (RBAC).
The productivity impacts for data scientists and analysts are profound: you can launch your preferred interactive environment with a single click, pre-configured with the necessary libraries and notebook markdown files, and have immediate access to pipeline output data for real-time analysis in context. Furthermore, you can collaborate with colleagues by securely sharing Data Studios, along with the code and visualizations within. Some use cases already developed include:
- Processing single-cell RNAseq data using nf-core/scrnaseq and performing downstream, interactive analysis using the popular Scanpy (Python) or Seurat (R) packages.
- Running a differential gene expression analysis using nf-core/differentialabundance and launching an R Shiny app to explore the results in an RStudio notebook.
- Extending pipeline functionality by experimenting with Nextflow and Bash in VS Code directly using output data from your pipeline run in Seqera Platform.
Snapshots and Session Persistence
Data Studios can be started and stopped at will, preserving state at every step, including all code, output, and metadata. This keeps costs to a minimum compared to managing independent, dedicated analysis VMs, while also providing fault tolerance, improved reproducibility, and portability of analyses.
State is preserved via timestamped snapshots of the Data Studio environment. Individual snapshots can optionally be renamed for improved discoverability, and used as the base template for a new Data Studio, preserving the complete analysis history and allowing experimentation without impacting the original analysis environment.
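The snapshot behavior described above can be pictured with a small Python model. This is only an illustrative sketch, not how Seqera Platform implements snapshots: the `Snapshot` and `Studio` classes, their fields, and the file names are all hypothetical, and a list of file names stands in for the saved code, output, and metadata.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of a studio's saved state (hypothetical model)."""
    created_at: datetime
    files: Tuple[str, ...]        # stand-in for code, output, and metadata
    label: Optional[str] = None   # optional custom name

    @property
    def display_name(self) -> str:
        # Fall back to the timestamp (assumed UTC) when no name was given.
        return self.label or self.created_at.strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class Studio:
    name: str
    files: List[str] = field(default_factory=list)
    snapshots: List[Snapshot] = field(default_factory=list)

    def stop(self, now: Optional[datetime] = None) -> Snapshot:
        """Stopping the session records a timestamped snapshot."""
        snap = Snapshot(now or datetime.now(timezone.utc), tuple(self.files))
        self.snapshots.append(snap)
        return snap

    def rename_snapshot(self, index: int, label: str) -> Snapshot:
        """Give a snapshot a custom name; its timestamp and files are kept."""
        renamed = replace(self.snapshots[index], label=label)
        self.snapshots[index] = renamed
        return renamed

    @classmethod
    def from_snapshot(cls, name: str, snap: Snapshot) -> "Studio":
        """Start a new studio from a snapshot, copying its files."""
        return cls(name=name, files=list(snap.files))


base = Studio("scrnaseq-analysis", files=["qc.ipynb"])
checkpoint = base.stop()
fork = Studio.from_snapshot("clustering-experiment", checkpoint)
fork.files.append("clustering.ipynb")  # the original studio is unaffected
```

Because the snapshot is immutable and the new studio works on a copy, experimenting in the fork leaves both the snapshot and the original environment unchanged, which is the property the paragraph above relies on.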
Not just for analysis
Beyond analysts and data scientists, Data Studios are a powerful tool for bioinformaticians developing workflows. In this initial release, we offer a Data Studios template for Microsoft’s VS Code Server — a web-based version of the popular VS Code IDE commonly used by Nextflow pipeline developers.
Today, developers typically build and test Nextflow modules and pipelines locally. With Data Studios, they can instead build, test, and troubleshoot pipelines in production environments using cloud executors and real data.
Software issues commonly appear when running in specific environments or with particular datasets. Faced with a problem, developers can simply enter their familiar IDE in Data Studios and begin troubleshooting the issue live and in context using real pipeline data.
Looking forward
Just as Data Explorer boosts productivity for researchers and analysts, Data Studios does the same for data scientists. Data Explorer enables researchers to easily access and manage data residing in cloud storage buckets from within Seqera Platform, without switching to external environments like the Amazon S3 console. Similarly, Data Studios enables users to easily launch interactive open science tools to analyze data in context, no matter where the pipelines run or the output data resides, and to share those analyses with colleagues in real time so that teams can pivot experimental approaches or methodologies. By combining Data Explorer, Pipelines, and Data Studios, Seqera Platform helps guide teams through the scientific data lifecycle, enabling:
- Simple linking and exploration of data as it’s generated via Data Explorer.
- Ability to easily develop, deploy, and scale Pipelines.
- Seamless transition from Pipeline output to interactive analysis with Data Studios.
While work continues, Data Studios represents a significant step forward. In the coming months, we'll continue developing additional features including support for custom templates, a cost estimator, resource labels, and improved integration across the Seqera Platform.
Much as the nf-core community builds and curates production-quality pipelines and modules, we envision a similar catalog of Data Studio templates in the future comprising additional interactive analysis tools, such as Integrative Genomics Viewer (IGV) and web-based IDEs such as xpra.
Learning more
You can view running Data Studios today in the Seqera Platform Community Showcase workspace. To enable Data Studios for your own organization, reach out to your Seqera Account Manager or start a free trial today.