Automating resource creation with Seqerakit

In modern genomics labs worldwide, bioinformaticians face a data deluge. By the start of 2020, approximately 38 million genomes had been analyzed using various techniques, ranging from genotyping to whole genome sequencing. This number is projected to grow to 52 million by 2025. This rapid rate of data accumulation is challenging bioinformaticians, who must process and analyze ever-increasing volumes of information. As data grows, so does the need for running complex analysis pipelines – often hundreds or thousands of times for large-scale studies.

The challenge is clear: how can bioinformaticians keep pace with this data explosion without sacrificing accuracy or efficiency? The answer lies in automation. Just as robotics revolutionized wet lab processes, the dry lab needs similar innovations in workflow automation.

Enter Seqerakit — a tool built to streamline and automate the creation and management of Seqera Platform resources and pipeline execution. By addressing the growing need for automation, Seqerakit helps scientists reclaim their time from repetitive tasks and focus on extracting meaningful insights from their genomic data.

In this blog post, we'll walk you through using Seqerakit to automate the scientific data lifecycle — from managing your team’s access and setting up your input data and computational infrastructure to launching your pipeline — all with a single configuration file and command.

💡Sign up: If you do not yet have a Seqera Platform account, click here to sign up.

What is Seqerakit?

Imagine having a tool that could conjure up your entire bioinformatics infrastructure with one command. For Seqera users, Seqerakit is that tool. At its core, Seqerakit is a Python-based utility that transforms simple YAML configuration files into Platform CLI commands, automating the creation and management of your computational and data resources.

But Seqerakit is more than just a wrapper — it's a game-changer for bioinformaticians looking to streamline their workflows. Here's why:

Infrastructure as code: Define, manage, and version your entire infrastructure right from the command line.
Simplicity in configuration: Work with intuitive YAML files, making configuration as easy as describing your ideal setup in plain language.
End-to-end automation: Manage everything from your organization setup to launching your pipelines, all with a single command and configuration file.

Seqerakit takes the pain and repetition out of setup and management, letting you get to your analysis faster.

Step 1: Install Seqerakit

Ready to get started? Follow these steps to set up Seqerakit:

💡Prerequisites: Make sure the following are installed: Python (3.8 or later), PyYAML, and the Platform CLI.

Install Seqerakit using pip:
pip install seqerakit
Set up your Seqera access token as an environment variable:
export TOWER_ACCESS_TOKEN=<Your Seqera access token>
Verify your installation:
seqerakit --info
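
If you would rather script these checks, the following is a minimal sketch. It assumes the Platform CLI is installed under the executable name `tw` and that Seqerakit is available on your PATH as `seqerakit`. The function name `missing_prerequisites` is illustrative and not part of Seqerakit.

```python
import os
import shutil


def missing_prerequisites(tools=("tw", "seqerakit"), env_vars=("TOWER_ACCESS_TOKEN",)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"executable not found on PATH: {tool}")
    for var in env_vars:
        # Treat an unset or empty variable as missing
        if not os.environ.get(var):
            problems.append(f"environment variable not set: {var}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The script prints nothing when both executables are found and the access token is set. It only checks that they are present, not that the token is valid.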

And that's it! You're now ready to start automating your bioinformatics workflows with Seqerakit.

💡Hint: For alternative installation methods or advanced configuration options, see Seqerakit installation for more details on customizing your setup.

Step 2: Populate a YAML configuration file

Seqerakit uses YAML configuration files to define the parameters for creating Seqera resources. While you can create individual resources with separate commands, the real power of Seqerakit lies in defining multiple resources in a single YAML file.

Here's a snippet of what a Seqerakit configuration might look like for setting up and launching the nf-core RNASeq pipeline as part of a cancer genomics project:

organizations:
  - name: 'CancerGenomics'
    full-name: 'Cancer Genomics Research Institute'
    description: 'Advancing cancer research through innovative analysis'

workspaces:
  - name: 'bulk-rna-sequencing'
    full-name: 'RNASeq Analysis Workspace'
    organization: 'CancerGenomics'
    description: 'Workspace for RNA sequencing workgroup'

credentials:
  - type: 'aws'
    name: 'aws-credentials'
    workspace: 'CancerGenomics/bulk-rna-sequencing'
    access-key: '$AWS_ACCESS_KEY_ID'
    secret-key: '$AWS_SECRET_ACCESS_KEY'
    assume-role-arn: '$AWS_ASSUME_ROLE_ARN'
    overwrite: True

compute-envs:
  - name: 'aws-fusion'
    workspace: 'CancerGenomics/bulk-rna-sequencing'
    credentials: 'aws-credentials'
    type: aws-batch
    config-mode: forge
    region: 'us-west-2'
    work-dir: 's3://cg-data'
    fusion-v2: True
    wave: True
    max-cpus: 1000
    instance-types: 'c6i,r6i,m6i'

pipelines:
  - name: 'rnaseq-pipeline'
    workspace: 'CancerGenomics/bulk-rna-sequencing'
    compute-env: 'aws-fusion'
    url: 'https://github.com/nf-core/rnaseq'
    revision: '3.18.0'
    work-dir: 's3://cg-data/work'
    params:
      outdir: 's3://cg-data/results'
      aligner: 'star_salmon'
      pseudo_aligner: 'salmon'

launch:
  - pipeline: 'rnaseq-pipeline'
    params:
      input: 's3://cg-data/samples.csv'

This configuration sets up an organization, workspace, credentials, compute environment, and the nf-core RNASeq pipeline as part of a cancer genomics project. When you run Seqerakit with this file, it will create all these resources and then launch the pipeline with the specified parameters.
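
Because the blocks refer to each other by name (a pipeline names its compute environment, a compute environment names its credentials, and so on), a typo in one name can break the chain. The sketch below is a hypothetical helper, not part of Seqerakit, that checks those cross-references in an already-parsed configuration. It assumes each block is a list of mappings as in the example above, and that workspaces are referenced as `<organization>/<workspace>`.

```python
def find_unresolved_references(config):
    """Return messages for names that are referenced but not defined in `config`.

    `config` is the dictionary obtained by parsing a Seqerakit YAML file.
    """
    organizations = {org["name"] for org in config.get("organizations", [])}
    # Other blocks refer to a workspace as '<organization>/<workspace>'
    workspaces = {
        f"{ws['organization']}/{ws['name']}" for ws in config.get("workspaces", [])
    }
    credentials = {cred["name"] for cred in config.get("credentials", [])}
    compute_envs = {env["name"] for env in config.get("compute-envs", [])}

    messages = []

    def check(block, item, key, known):
        # Record a message when `key` is present but points at an undefined name
        value = item.get(key)
        if value is not None and value not in known:
            messages.append(f"{block} '{item.get('name')}': unknown {key} '{value}'")

    for ws in config.get("workspaces", []):
        check("workspaces", ws, "organization", organizations)
    for block in ("credentials", "compute-envs", "pipelines"):
        for item in config.get(block, []):
            check(block, item, "workspace", workspaces)
    for env in config.get("compute-envs", []):
        check("compute-envs", env, "credentials", credentials)
    for pipeline in config.get("pipelines", []):
        check("pipelines", pipeline, "compute-env", compute_envs)
    return messages
```

To use it, parse your file with PyYAML's `yaml.safe_load` and pass the resulting dictionary to the function. Treat the messages as warnings rather than errors: a name that is missing from the file may refer to a resource that already exists on the Platform.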

💡Hint: See Templates for downloadable YAML configuration files and guidance to create individual or multiple Seqera resources with Seqerakit.

Step 3: Run Seqerakit

Before creating resources, you may want to preview the CLI commands Seqerakit will generate. The --dryrun flag is especially useful for debugging and verifying resource dependencies before making changes:

seqerakit file.yml --dryrun

The output of this command will show:

The exact Platform CLI commands that will be executed
The order of resource creation, helping to catch dependency issues
Potential errors or misconfigurations before making real changes

If your --dryrun output looks as expected and contains no errors, your YAML configuration file is ready, and you're one command away from setting up your resources and launching your pipeline:

seqerakit file.yml

Seqerakit will process your configuration file, creating all specified resources and launching the defined pipeline. You're now ready to dive into your analysis!
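
If you run these two commands often, you can chain them so that resources are only created after the preview succeeds. The following is a minimal sketch using the two invocations shown above. It assumes `seqerakit` is on your PATH and that a failed dry run exits with a non-zero status. The helper names `build_command` and `preview_then_apply` are illustrative.

```python
import subprocess


def build_command(config_path, dryrun=False):
    """Build the seqerakit command line for a configuration file."""
    command = ["seqerakit", config_path]
    if dryrun:
        command.append("--dryrun")
    return command


def preview_then_apply(config_path, run=subprocess.run):
    """Run the dry run first; create resources only if it exits cleanly.

    `run` is injectable so the logic can be exercised without calling seqerakit.
    Returns the exit status of the last command that was executed.
    """
    preview = run(build_command(config_path, dryrun=True))
    if preview.returncode != 0:
        # Stop here: nothing has been created yet
        return preview.returncode
    return run(build_command(config_path)).returncode
```

Calling `preview_then_apply("file.yml")` prints the dry-run output and then goes straight on to create the resources. An exit status of zero does not guarantee that the previewed commands are what you intended, so read the dry-run output at least once for every new configuration file.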

Optional step: Delete resources

When setting up temporary environments for testing, you may want to clean up your Seqera resources afterwards. This saves costs, keeps your workspaces organized, and ensures you’re not maintaining unused infrastructure. With Seqerakit, deleting resources is as simple as creating them. Using the same configuration file you used to set up your environment, you can recursively delete the resources it defines with the following command:

seqerakit file.yml --delete

This command removes all resources defined in the YAML file, in the correct order to avoid dependency conflicts. Always check the contents of your configuration file to ensure you’re only deleting intended resources.
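
One way to do that review is to list every named resource in the file before deleting anything. The sketch below is a hypothetical helper, not part of Seqerakit. It assumes the configuration has already been parsed into a dictionary whose blocks are lists of mappings, as in the example earlier in this post. It does not reproduce the order in which Seqerakit deletes resources.

```python
def list_defined_resources(config):
    """Return (block, name) pairs for every named entry in a parsed config."""
    resources = []
    for block, items in config.items():
        # An empty block parses as None, so fall back to an empty list
        for item in items or []:
            name = item.get("name")
            if name is not None:
                resources.append((block, name))
    return resources


if __name__ == "__main__":
    example = {
        "organizations": [{"name": "CancerGenomics"}],
        "workspaces": [{"name": "bulk-rna-sequencing"}],
    }
    for block, name in list_defined_resources(example):
        print(f"{block}: {name}")
```

Entries without a `name` key, such as the `launch` entry in the example configuration, are skipped.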

Automation to streamline your analysis workflows

The genomics data explosion demands efficient, automated workflows. Seqerakit answers this call by simplifying Seqera Platform resource management and pipeline execution. With infrastructure-as-code and end-to-end automation, it frees bioinformaticians to focus on data analysis rather than setup and configuration. Whether you're launching a new project, performing pipeline or compute benchmarking, or optimizing existing workflows, Seqerakit can significantly reduce repetitive tasks.

💡Ready to revolutionize your genomic analysis? See the Seqerakit documentation to get started today.