Genomics workflows are high-performance computing workloads. Life-science research teams make use of various genomics workflows. With every invocation, they specify custom sets of data and processing steps, and translate them into commands. Furthermore, team members must monitor progress and troubleshoot errors, which can be cumbersome, non-differentiated, administrative work.
In Part 3 of this series, we describe the architecture of a workflow manager that simplifies the administration of bioinformatics data pipelines. The workflow manager dynamically generates the launch commands based on user input and keeps track of the workflow status. This workflow manager can be adapted to many scientific workloads, effectively becoming a bring-your-own-workflow-manager for each project.
Use case
In Part 1, we demonstrated how life-science research teams can use Amazon Web Services to remove the heavy lifting of conducting genomic studies, and our design pattern was built on AWS Step Functions with AWS Batch. We mentioned that we have worked with life-science research teams to put failed job logs onto Amazon DynamoDB. Some teams prefer to use command-line interface tools, such as the AWS Command Line Interface; other interfaces, such as PyBDA with Apache Spark, or CWL experimental grammar together with the Amazon Simple Storage Service (Amazon S3) API, are also used when access to the AWS Management Console is prohibited. In our use case, scientists used the console to easily update table items, plus initiate retry via DynamoDB streams.
In this blog post, we extend this idea to a new frontend layer in our design pattern. This layer automates command generation and monitors the invocations of a variety of workflows, becoming a workflow manager. Life-science research teams use multiple workflows for different datasets and use cases, each with different syntax and commands. The workflow manager we create removes the administrative burden of formulating workflow-specific commands and tracking their launches.
Solution overview
We allow scientists to upload their requested workflow configuration as objects in Amazon S3. We use S3 Event Notifications on PUT requests to invoke an AWS Lambda function. The function parses the uploaded S3 object and registers the new launch request as a DynamoDB item using the PutItem operation. Each item corresponds with a distinct launch request, saved as key-value pairs. A minimal sketch of this registration step follows the list below. Item values store the:
- S3 data path containing genomic datasets
- Workflow endpoint
- Preferred compute service (optional)
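Here is a minimal sketch of that registration function, assuming a JSON configuration sheet and a hypothetical table and attribute layout (all names below are illustrative, not prescriptive):

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("workflow-launch-requests")  # hypothetical table name


def handler(event, context):
    """Triggered by S3 Event Notifications when a configuration object is PUT."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Parse the uploaded configuration sheet (assumed to be JSON here).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        config = json.loads(body)

        # Register the launch request as a distinct key-value item.
        table.put_item(
            Item={
                "launch_id": str(uuid.uuid4()),                  # partition key
                "workflow": config.get("workflow", "snakemake"), # workflow endpoint/type
                "data_s3_path": config["data_s3_path"],          # genomic datasets
                "compute_service": config.get("compute_service", "aws-batch"),  # optional
                "config_s3_path": f"s3://{bucket}/{key}",
                "status": "REGISTERED",
            }
        )
```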
Another Lambda function monitors for change data captures in the DynamoDB stream (Figure 1). With each PutItem operation, the Lambda function prepares a workflow invocation, which includes translating the user input into the syntax and launch commands of the respective workflow.
In the case of Snakemake (discussed in Part 2), the function creates a Snakefile that declares processing steps and commands. The function spins up an AWS Fargate task that builds the computational tasks, distributes them with AWS Batch, and monitors for completion. An AWS Step Functions state machine orchestrates job processing, for example, initiated by Tibanna.
Amazon CloudWatch provides a consolidated overview of performance metrics, like time elapsed, failed jobs, and error types. We store log data, including status updates and errors, in Amazon CloudWatch Logs. A third Lambda function parses these logs and updates the status of each workflow launch request in the corresponding DynamoDB item (Figure 1).
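The following is a condensed sketch of how such a stream-triggered function could hand the generated launch command to Fargate; the cluster name, task definition, container name, and subnet are placeholders, the command shown assumes Snakemake, and the attribute names carry over from the earlier illustrative item:

```python
import boto3

ecs = boto3.client("ecs")


def handler(event, context):
    """Triggered by the DynamoDB stream; reacts to newly registered launch requests."""
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        item = record["dynamodb"]["NewImage"]
        launch_id = item["launch_id"]["S"]
        config_path = item["config_s3_path"]["S"]

        # Translate the user input into the workflow's launch command
        # (here: Snakemake reading a Snakefile staged alongside the config).
        command = ["snakemake", "--snakefile", "/work/Snakefile", "--cores", "4"]

        ecs.run_task(
            cluster="workflow-manager-cluster",     # placeholder cluster name
            launchType="FARGATE",
            taskDefinition="snakemake-runner",      # placeholder task definition
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
                    "assignPublicIp": "DISABLED",
                }
            },
            overrides={
                "containerOverrides": [
                    {
                        "name": "snakemake",        # placeholder container name
                        "command": command,
                        "environment": [
                            {"name": "LAUNCH_ID", "value": launch_id},
                            {"name": "CONFIG_S3_PATH", "value": config_path},
                        ],
                    }
                ]
            },
        )
```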
Implementation considerations
In this section, we describe some of our past implementation considerations.
Register new workflow requests
DynamoDB items are key-value pairs. We use launch IDs as the key, and the value consists of the workflow type, compute engine, S3 data path, the S3 object path to the user-defined configuration file, and workflow status. Our Lambda function parses the configuration file and generates all commands plus ancillary artifacts, such as Snakefiles.
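As an illustration, here is a simplified sketch of rendering a single-step Snakefile from the parsed configuration and staging it in S3; the template, configuration keys, and bucket name are assumptions, and real pipelines would declare many more rules:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical single-rule Snakefile template; {input}, {output}, {command}
# are filled from the configuration, while {{input}}/{{output}} survive as
# Snakemake's own wildcards in the rendered file.
SNAKEFILE_TEMPLATE = """\
rule all:
    input: "{output}"

rule process:
    input: "{input}"
    output: "{output}"
    shell: "{command} {{input}} > {{output}}"
"""


def render_snakefile(config: dict) -> str:
    """Render a single-step Snakefile from the user-defined configuration sheet."""
    return SNAKEFILE_TEMPLATE.format(
        input=config["input_path"],
        output=config["output_path"],
        command=config["command"],
    )


def stage_snakefile(config: dict, launch_id: str) -> str:
    """Upload the generated Snakefile next to the launch request and return its S3 path."""
    key = f"launches/{launch_id}/Snakefile"
    s3.put_object(
        Bucket="workflow-manager-artifacts",  # placeholder bucket
        Key=key,
        Body=render_snakefile(config).encode("utf-8"),
    )
    return f"s3://workflow-manager-artifacts/{key}"
```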
Launch workflows
Launch requests are picked up by a Lambda function from the DynamoDB stream. The function has the following required parameters:
- Launch ID: unique identifier of each workflow launch request
- Configuration file: the Amazon S3 path to the configuration sheet with launch details (in s3://bucket/object format)
- Compute service (optional): our workflow manager allows selection of a specific service on which to run computational tasks, such as Amazon Elastic Compute Cloud (Amazon EC2) or AWS ParallelCluster with Slurm Workload Manager. The default is the pre-defined compute engine.
These parameters assume that the configuration sheet has already been uploaded to an accessible location in an S3 bucket. The function then issues a new Snakemake Fargate launch task. If either of the parameters is not provided or access fails, the workflow manager returns MissingRequiredParametersError.
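A brief sketch of how that check might look inside the stream-triggered function; the exception class is our own definition and the field names carry over from the earlier illustrative item:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


class MissingRequiredParametersError(Exception):
    """Raised when a launch request lacks its launch ID or configuration file, or access fails."""


def validate_launch_request(item: dict) -> None:
    """Check the required parameters before issuing a new Snakemake Fargate launch task."""
    missing = [field for field in ("launch_id", "config_s3_path") if not item.get(field)]
    if missing:
        raise MissingRequiredParametersError(
            f"Launch request is missing required parameters: {', '.join(missing)}"
        )

    # Confirm the configuration sheet is reachable at its s3://bucket/object location.
    bucket, key = item["config_s3_path"].removeprefix("s3://").split("/", 1)
    try:
        s3.head_object(Bucket=bucket, Key=key)
    except ClientError as err:
        raise MissingRequiredParametersError(
            f"Configuration sheet is not accessible: {item['config_s3_path']}"
        ) from err
```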
Log workflow launches
Logs are written to CloudWatch Logs automatically. We write the location of the CloudWatch log group and log stream into the DynamoDB table. To send logs to Amazon CloudWatch, specify the awslogs driver in the Fargate task definition settings in your provisioning template.
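For reference, a sketch of the relevant container-definition fragment, expressed as the Python structure that boto3's register_task_definition accepts; the log group, region, and image are placeholders:

```python
# Fragment of a Fargate container definition that routes container
# output to CloudWatch Logs via the awslogs driver.
log_configuration = {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/ecs/snakemake-runner",  # placeholder log group
        "awslogs-region": "us-east-1",             # placeholder region
        "awslogs-stream-prefix": "workflow",
    },
}

container_definition = {
    "name": "snakemake",  # placeholder container name
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/snakemake:latest",  # placeholder image
    "essential": True,
    "logConfiguration": log_configuration,
}
```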
Our Lambda function writes Fargate task launch logs from CloudWatch Logs to our DynamoDB table. For example, OutOfMemoryError can occur if the process uses more memory than the container is allocated.
AWS Batch job state logs are written to the following log group in CloudWatch Logs: /aws/batch/job. Our Lambda function writes status updates to the DynamoDB table. AWS Batch jobs may encounter errors, such as being stuck in RUNNABLE state.
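One possible shape for this log-parsing function, assuming the log groups are wired to it through CloudWatch Logs subscription filters and that the workflow containers emit structured JSON log lines carrying the launch ID (both are assumptions of this sketch, not requirements of the pattern):

```python
import base64
import gzip
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("workflow-launch-requests")  # hypothetical table name

ERROR_MARKERS = ("OutOfMemoryError",)  # illustrative error patterns to look for


def handler(event, context):
    """Invoked by a CloudWatch Logs subscription filter; mirrors status and errors into DynamoDB."""
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    for log_event in payload["logEvents"]:
        try:
            # Assumes the workflow container emits structured JSON log lines
            # that include the launch ID.
            entry = json.loads(log_event["message"])
        except json.JSONDecodeError:
            continue

        message = entry.get("message", "")
        error = next((marker for marker in ERROR_MARKERS if marker in message), None)

        table.update_item(
            Key={"launch_id": entry["launch_id"]},
            UpdateExpression="SET #s = :status, last_log = :msg",
            ExpressionAttributeNames={"#s": "status"},  # "status" is a DynamoDB reserved word
            ExpressionAttributeValues={
                ":status": "FAILED" if error else entry.get("status", "RUNNING"),
                ":msg": message,
            },
        )
```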
Manage state transitions
We manage the status of each task in DynamoDB. Each time a Fargate task changes state, it is picked up by a CloudWatch rule that references the Fargate compute cluster. This CloudWatch rule invokes a notifier Lambda function that updates the workflow status in DynamoDB.
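A sketch of such a notifier function, assuming the rule forwards ECS "Task State Change" events for our cluster and that the launch ID was attached to the task as an environment-variable override (as in the earlier launch sketch):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("workflow-launch-requests")  # hypothetical table name


def handler(event, context):
    """Invoked by a CloudWatch rule on ECS 'Task State Change' for our Fargate cluster."""
    detail = event["detail"]
    last_status = detail["lastStatus"]  # e.g. PROVISIONING, RUNNING, STOPPED

    # Recover the launch ID from the environment override set at run_task time
    # (an assumption carried over from the launch sketch above).
    env = detail["overrides"]["containerOverrides"][0].get("environment", [])
    launch_id = next(
        (pair["value"] for pair in env if pair["name"] == "LAUNCH_ID"), None
    )
    if launch_id is None:
        return

    table.update_item(
        Key={"launch_id": launch_id},
        UpdateExpression="SET #s = :status, stopped_reason = :reason",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={
            ":status": last_status,
            ":reason": detail.get("stoppedReason", ""),
        },
    )
```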
Conclusion
In this blog post, we demonstrated how life-science research teams can simplify genomic analysis across an array of workflows. These workflows usually have their own command syntax and workflow management system, such as Snakemake. The presented workflow manager removes the administrative burden of preparing and formulating workflow launches, increasing reliability.
The pattern is broadly reusable with any scientific workflow and related high-performance computing systems. The workflow manager provides persistence to enable historical analysis and comparison, which allows us to automatically benchmark workflow launches for cost and performance.
Stay tuned for Part 4 of this series, in which we explore how to enable our workflows to process archival data stored in Amazon Simple Storage Service Glacier storage classes.
Related information