Ben Sherman, Jan 20, 2023

Nextflow and K8s Rebooted: Running Nextflow on Amazon EKS

In the world of container orchestration, Kubernetes has clear momentum. While Kubernetes has long been a hot topic with developers, it is now moving into the IT mainstream as enterprises accelerate their adoption of cloud-native technologies.

Unlike cloud batch services, Kubernetes was not built with scientific workflows in mind, and its complexity has been a barrier to adoption. However, this is beginning to change as Kubernetes distributions mature and organizations gain experience with the platform. Today there are solid reasons to consider Kubernetes as a compute environment for Nextflow:

  • Corporate adoption – As Kubernetes grows in popularity, it’s becoming easier for corporate IT departments to carve out a new tenant on a Kubernetes cluster rather than deploy a traditional HPC cluster.
  • Ubiquity – Cloud batch services are convenient, but they are cloud-specific, and not every cloud provider has one. By contrast, you can spin up a Kubernetes cluster on just about every private and public cloud.

Despite its advantages, Kubernetes can be daunting for first-time users. If you are fortunate enough to have an experienced Kubernetes administrator, they can help you prepare an environment that suits your needs. Otherwise, you may need to deploy a Kubernetes cluster on your own. While there are ways to deploy simple environments locally (e.g. minikube), these environments aren’t practical for production-scale workloads.

In this article, we explain how to configure an Amazon EKS cluster to support Nextflow pipelines.

About Kubernetes

Kubernetes (abbreviated as “K8s”) is an open-source container orchestration platform for automating software deployment, scaling, and management. Kubernetes was initially developed by Google, but the project is now maintained by the Cloud Native Computing Foundation (CNCF).

Today, there are dozens of Kubernetes distributions, and most cloud providers offer Kubernetes-as-a-service. Among the most popular Kubernetes cloud offerings are Amazon EKS, Google GKE, and Azure AKS.

Kubernetes is easily the leading environment for containerized application deployment. A recent survey conducted by the CNCF and released in February 2022 found that 96% of organizations are either using or evaluating Kubernetes – up from 83% in 2020. As enterprise adoption grows, Kubernetes is expected to also become an important platform for scientific workflows.

Nextflow and K8s: A high-level overview

For a long time, the only way to run Nextflow on Kubernetes was to provision a PersistentVolumeClaim (PVC) and launch Nextflow in a “submitter pod” within a cluster, using the PVC to store the work directory shared by Nextflow and its tasks. This setup is challenging because it requires users to know enough about Kubernetes to (1) provision their own storage and submitter pod, and (2) manually transfer input and output data to and from the PVC.

However, we announced a new service in October 2022 which unlocks some interesting possibilities for Nextflow and K8s. Wave is a dynamic container provisioning service – it can, for example, build a Docker image on-the-fly from a Conda recipe file, or augment an existing container with additional functionality.

What does Wave have to do with Kubernetes? Well, one of the main use cases for Wave is to augment task containers on-the-fly with the Fusion file system client, which allows the use of S3 buckets as a local file system in the container. This way, we can store the pipeline work directory in S3, which means (1) we don’t need to provision our own PVC, (2) we don’t have to run Nextflow within the cluster, and (3) we can have Nextflow stage input data from and publish output data to S3, rather than manually transfer files to and from a PVC.

Wave and Fusion greatly simplify the use of Nextflow with Kubernetes, and they require only a few lines in the Nextflow configuration. We will use this approach in the following step-by-step guide.

A high-level overview of the integration is illustrated below:

(Figure: high-level overview of the Nextflow and K8s integration)

Users launch their pipeline from the command line via nextflow run, and Nextflow uses the user’s Kubernetes config file (normally located at ~/.kube/config) to access the cluster. Nextflow will run the pipeline as normal, and it will submit tasks to the cluster as Pods. When a Pod pulls its container image, it will receive an additional layer from Wave which contains the Fusion client. The task will access its work directory like a regular POSIX directory, but the Fusion client will perform S3 transfers under the hood.

With Wave and Fusion, Nextflow can utilize the power and flexibility of Kubernetes while providing a much more convenient interface (similar to AWS Batch) to the user.

Nextflow and EKS: Step-by-step guide

This guide explains how to run Nextflow pipelines on Amazon EKS. The preferred way to deploy an Amazon EKS cluster is with eksctl, an open-source tool jointly developed by AWS and Weaveworks.

The guide is organized as follows:

  1. Prepare a deployment environment and install Amazon EKS

  2. Prepare the EKS cluster for Nextflow

    a. Connect to the cluster

    b. Create a namespace

    c. Create a service account

    d. Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster

    e. Create an S3 bucket

    f. Configure your Kubernetes service account to assume an IAM role

  3. Run a Nextflow pipeline on your EKS cluster

    a. Configure Nextflow for Kubernetes

    b. Run a pipeline

Prepare a deployment environment and install Amazon EKS

We will use command-line tools to accomplish most of the steps in this guide, which means we will need to install these tools in a separate environment, which we will call the “deployment environment”. It can be your local machine or an EC2 instance, depending on where you want to launch your Nextflow pipelines.

In our example, we deployed an EC2 instance to act as our deployment environment. For those new to AWS, the process for launching an EC2 instance is described in the AWS documentation.

You will need to run through the installation process for each of the tools used in this guide, at a minimum the AWS CLI, eksctl, kubectl, and Nextflow. Installation steps are specific to your deployment environment (in this guide, we are using Ubuntu Linux).

The process of deploying an EKS cluster using eksctl is explained in the AWS documentation in Getting started with Amazon EKS – eksctl. You will need to create an IAM user or role with sufficient permissions so that eksctl can deploy the cluster using CloudFormation. See the AWS documentation Enable IAM user and role access to your cluster for instructions on how to do this.
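For reference, a cluster similar to the one used in this guide can be created with a single eksctl command. This is only a sketch: the cluster name test-cluster matches the commands used later in this guide, but the node group name, instance type, and node count are illustrative placeholders that you should adjust for your workloads.

```shell
# Create an EKS cluster with a small managed node group.
# Name, region, node group name, instance type, and node count
# are placeholders -- adjust them for your environment.
eksctl create cluster \
  --name test-cluster \
  --region us-east-1 \
  --nodegroup-name nextflow-nodes \
  --node-type m5.xlarge \
  --nodes 2
```

Cluster creation typically takes 15–20 minutes, as eksctl provisions the control plane and node group via CloudFormation.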

In this guide, we assume that the tools above have been installed and that you have a functional Amazon EKS cluster.

Prepare the EKS cluster for Nextflow

kubectl is a command-line tool that allows you to interact with Kubernetes clusters. Whereas eksctl is used specifically to manage EKS clusters and underlying AWS resources, kubectl works with any Kubernetes cluster and is used to manage Kubernetes resources (e.g. Pods) rather than cloud-specific resources (e.g. EC2 instances).

Connect to the cluster

When eksctl installs your cluster, it automatically creates a ~/.kube/config file containing the cluster’s configuration details. Run the following command to verify that the ~/.kube/config file has been properly configured to access the EKS cluster:

$ kubectl cluster-info
Kubernetes control plane is running at https://54C5BB900A2C2F64DC8C4F428C37CDD8.gr7.us-east-1.eks.amazonaws.com
CoreDNS is running at https://54C5BB900A2C2F64DC8C4F428C37CDD8.gr7.us-east-1.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

You can also run 'kubectl cluster-info dump' to further debug and diagnose cluster problems.

List the available cluster nodes:

$ kubectl get nodes
NAME                            STATUS   ROLES    AGE   VERSION
ip-192-168-0-140.ec2.internal   Ready    <none>   24d   v1.22.9-eks-810597c
ip-192-168-41-0.ec2.internal    Ready    <none>   24d   v1.22.9-eks-810597c

If you made it this far, congratulations! You now have a functional Amazon EKS cluster and are ready to configure it to support Nextflow pipelines.

Create a namespace

In Kubernetes, a namespace is a mechanism for grouping related resources. For example, every user and/or application in a K8s cluster might have their own namespace. Every K8s cluster starts out with a default namespace, but we will create a custom nextflow namespace to keep things organized. This practice will serve you well as your K8s cluster takes on additional workloads and users over time.

Create a namespace:

kubectl create namespace nextflow

Configure kubectl to use the nextflow namespace by default:

kubectl config set-context --current --namespace nextflow

You can also append -n <namespace> when you run kubectl to target a particular namespace.
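For example, to list Pods in the nextflow namespace explicitly, without changing the default context:

```shell
# Target the nextflow namespace for a single command
kubectl get pods -n nextflow
```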

Create a service account

Before creating a service account, it is helpful to explain what they are. When a user accesses a Kubernetes cluster (for example, with kubectl), they are authenticated by the Kubernetes API server as a particular user account. In our examples, we've been using the IAM identity we created when deploying the cluster.

A process running in a Pod can also contact the API server. When it does, it is authenticated as a particular service account. Every namespace has a default service account, but this account has only minimal permissions.

To enable Nextflow to work with Kubernetes, we will create a nextflow-sa service account and empower it with the permissions that Nextflow needs in order to do things like create Pods and mount Volumes.

We will define the service account and associated permissions in a YAML manifest file as shown below. Depending on how you use Nextflow, you may need to adjust the rules defined in nextflow-role, but these permissions should be sufficient for most use cases.

cat nextflow-sa.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: nextflow
  name: nextflow-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: nextflow
  name: nextflow-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/status", "pods/log", "pods/exec"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: nextflow
  name: nextflow-rolebind
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nextflow-role
subjects:
  - kind: ServiceAccount
    name: nextflow-sa

This manifest will perform the following steps:

  1. Create a new ServiceAccount called nextflow-sa
  2. Create a Role called nextflow-role
  3. Create a RoleBinding called nextflow-rolebind that associates nextflow-role with the nextflow-sa ServiceAccount

You can apply this manifest to the cluster as follows:

$ kubectl create -f nextflow-sa.yaml
serviceaccount/nextflow-sa created
role.rbac.authorization.k8s.io/nextflow-role created
rolebinding.rbac.authorization.k8s.io/nextflow-rolebind created

List the service accounts to verify that the nextflow-sa service account was created:

$ kubectl get sa
NAME          SECRETS   AGE
default       1         16d
nextflow-sa   1         1m21s
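You can also verify that the RBAC rules behave as expected by impersonating the service account with kubectl auth can-i. This is a quick sanity check; the commands below assume the nextflow namespace and nextflow-sa account created above:

```shell
# Should print "yes" -- the role grants permission to create Pods
kubectl auth can-i create pods \
  --as=system:serviceaccount:nextflow:nextflow-sa -n nextflow

# Should print "no" -- the role does not grant access to Secrets
kubectl auth can-i get secrets \
  --as=system:serviceaccount:nextflow:nextflow-sa -n nextflow
```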

Enable IAM Roles for Service Accounts (IRSA) on the EKS cluster

A best practice with Amazon EKS is to use IAM Roles for Service Accounts (IRSA) as explained in the Amazon EKS documentation.

Amazon EKS clusters running Kubernetes version 1.14 or later have an OpenID Connect (OIDC) issuer URL associated with them. You can use the following AWS CLI command to retrieve it:

$ aws eks describe-cluster --name test-cluster --query "cluster.identity.oidc.issuer" --output text
https://oidc.eks.us-east-1.amazonaws.com/id/C72DD67CF92970329EED6154A08B01B2

To use IAM roles for service accounts in your cluster, you will need to associate an IAM OIDC identity provider with the cluster using eksctl:

eksctl utils associate-iam-oidc-provider --cluster test-cluster --approve
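After the command completes, you can confirm that the provider was created; the returned ARN should contain the OIDC issuer ID retrieved in the previous step:

```shell
# List the IAM OIDC identity providers in the account
aws iam list-open-id-connect-providers
```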

Create an S3 bucket

Normally, at this point we would create a persistent volume claim backed by some kind of storage such as EFS. Thankfully, because we are using Wave to store everything in S3, we can skip this part, and instead we only need an S3 bucket.

If you don’t already have a bucket, it’s very easy to create one:

aws s3 mb s3://mybucket

Bucket names in S3 are globally unique, so you will need to substitute s3://mybucket with your unique bucket name.

Configure your Kubernetes service account to assume an IAM role

In the following steps we will create an IAM role called nextflow-role and associate it with the nextflow-sa service account created in a previous step. For Nextflow to run on Kubernetes, we need to create a security policy that allows Pods running Nextflow to manipulate files in the S3 bucket we created in the previous step. We then attach this policy to our IAM role. If you run into difficulties, you can review the Amazon EKS documentation for additional information. An example of an appropriate Kubernetes configuration is also provided in the Wave Showcase on GitHub.

First, create a nextflow-policy.json file containing the security policy as shown below. You will need to modify the file to reflect your chosen S3 bucket name.

cat >nextflow-policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::mybucket"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectTagging",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::mybucket/*"
            ],
            "Effect": "Allow"
        }
    ]
}
EOF

Next, create the AWS IAM policy using the command below:

aws iam create-policy --policy-name nextflow-policy --policy-document file://nextflow-policy.json

Before running the steps below, it is helpful to set up a few environment variables for convenience. Make sure that you enter the correct name of the cluster and the region where the cluster is installed in the commands below:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
export OIDC_PROVIDER=$(aws eks describe-cluster --name test-cluster --region us-east-1 --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")
export NAMESPACE=nextflow
export SERVICE_ACCOUNT=nextflow-sa

Next, create a trust policy file for the IAM role as shown:

cat > trust-relationship.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::$AWS_ACCOUNT_ID:oidc-provider/$OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "$OIDC_PROVIDER:aud": "sts.amazonaws.com",
          "$OIDC_PROVIDER:sub": "system:serviceaccount:$NAMESPACE:$SERVICE_ACCOUNT"
        }
      }
    }
  ]
}
EOF

Now that we have the trust relationship defined, create an IAM role as shown:

aws iam create-role --role-name nextflow-role --assume-role-policy-document file://trust-relationship.json --description "nextflow-role-description"

Attach the IAM policy to the role:

aws iam attach-role-policy --role-name nextflow-role --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/nextflow-policy

You can verify that your role is created using the following command:

aws iam get-role --role-name nextflow-role

Finally, annotate your service account with the Amazon Resource Name of the IAM role that you want the nextflow-sa service account to assume as shown:

kubectl annotate serviceaccount -n $NAMESPACE $SERVICE_ACCOUNT eks.amazonaws.com/role-arn=arn:aws:iam::$AWS_ACCOUNT_ID:role/nextflow-role

You can verify that the annotation was added to the service account using the following command:

kubectl describe sa nextflow-sa
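Alternatively, you can print just the role annotation itself (note the escaped dots in the jsonpath key):

```shell
# Print the IAM role ARN annotated on the service account
kubectl get sa nextflow-sa -n nextflow \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```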

Run a Nextflow pipeline on your EKS cluster

Now that we have an EKS cluster with an empowered service account, it’s time to run a Nextflow pipeline.

Configure Nextflow for Kubernetes

Create a nextflow.config file in your working directory as shown below. Make sure to replace the bucket URL in workDir with the actual name of your bucket.

wave {
  enabled = true
}

fusion {
  enabled = true
}

process {
  executor = 'k8s'
}

workDir = 's3://mybucket/work'

k8s {
  namespace = 'nextflow'
  serviceAccount = 'nextflow-sa'
}

Run a pipeline

Finally, test the entire integration by running a Nextflow pipeline:

$ nextflow run rnaseq-nf
N E X T F L O W  ~  version 22.10.4
Launching `https://github.com/nextflow-io/rnaseq-nf` [agitated_mccarthy] DSL2 - revision: ed179ef74d [master]
 R N A S E Q - N F   P I P E L I N E
 ===================================
 transcriptome: /home/ubuntu/.nextflow/assets/nextflow-io/rnaseq-nf/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa
 reads        : /home/ubuntu/.nextflow/assets/nextflow-io/rnaseq-nf/data/ggal/ggal_gut_{1,2}.fq
 outdir       : results

executor >  k8s (fusion enabled) (4)
[07/a1c00a] process > RNASEQ:INDEX (ggal_1_48850000_49020000) [100%] 1 of 1 ✔
[e0/ba18a4] process > RNASEQ:FASTQC (FASTQC on ggal_gut)      [100%] 1 of 1 ✔
[e4/af3de6] process > RNASEQ:QUANT (ggal_gut)                 [100%] 1 of 1 ✔
[da/1189ca] process > MULTIQC                                 [100%] 1 of 1 ✔

Done! Open the following report in your browser --> results/multiqc_report.html
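Because the work directory lives in S3, you can inspect the task directories directly with the AWS CLI after the run (substitute your own bucket name):

```shell
# List the first few objects written to the pipeline work directory
aws s3 ls s3://mybucket/work/ --recursive | head
```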

Addendum: What about kuberun?

Experienced Nextflow users may point out that the nextflow kuberun command is another option that makes it easier to use Nextflow with Kubernetes. While kuberun does allow you to launch a pipeline from outside the cluster by automatically creating a submitter Pod for you, it suffers from a number of limitations:

  1. You can’t run local pipeline scripts.
  2. You can use local config files, but they have limited support — some config features like implicit variables (e.g. baseDir) and custom functions simply don’t work and would be difficult to support.
  3. You still have to provision your own PVC and manually transfer your data to and from the PVC. Creating and managing this PVC is arguably the most difficult aspect of using Nextflow with Kubernetes.

For these reasons, kuberun has never been a viable solution for running production pipelines on Kubernetes. Instead, we believe the challenges of Nextflow+K8s are more effectively solved by external solutions: a workflow platform (like Tower) to manage the submitter pod, and Wave (which can be used with or without Tower) to provide remote storage.

Conclusion

In this guide, we have taken you step-by-step through the process of deploying an Amazon EKS environment, and using it as a Nextflow compute environment. While deploying and managing Kubernetes clusters can be a bit complicated, for organizations already using Kubernetes, leveraging the same infrastructure for scientific workflows makes perfect sense. New innovations such as Wave and Fusion make running Nextflow on Kubernetes easier than ever before.

We will continue to enhance our support for Kubernetes in Nextflow. To learn more, and keep abreast of the latest developments, check out the Nextflow Kubernetes documentation and the Nextflow blog.