Genomics workflows analyze data at petabyte scale. After processing is complete, data is often archived in cold storage classes. In some cases, like studies on the association of DNA variants against larger datasets, archived data is needed for further processing. This means manually initiating the restoration of each archived object and monitoring the progress. Scientists require a reliable process for on-demand archival data restoration so their workflows don't fail.
In Part 4 of this series, we look into genomics workloads that process data archived with Amazon Simple Storage Service (Amazon S3). We design a reliable data restoration process that informs the workflow when data is available so it can proceed. We build on top of the design patterns laid out in Parts 1-3 of this series, and we use event-driven and serverless principles to provide the most cost-effective solution.
Use case
Our use case focuses on data in Amazon Simple Storage Service Glacier (Amazon S3 Glacier) storage classes. The S3 Glacier Instant Retrieval storage class provides the lowest-cost storage for long-lived data that is rarely accessed but requires retrieval in milliseconds.
The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes provide further cost savings, with retrieval times ranging from minutes to hours. We focus on the latter in order to provide the most cost-effective solution.
You must first restore the objects before accessing them. Our genomics workflow will pause until the data restore completes. The requirements for this workflow are:
- Reliable launch of the restore so our workflow doesn't fail (due to S3 Glacier service quotas, or because not all objects were restored)
- Event-driven design to mirror the event-driven nature of genomics workflows and perform the retrieval upon request
- Cost-effective and easy to manage by using serverless services
- Upfront detection of archived data when formulating the genomics workflow task, avoiding idle computational tasks that incur cost
- Scalable and elastic to meet the restore needs of large, archived datasets
Solution overview
Genomics workflows take multiple input parameters to prepare the initiation, such as launch ID, data path, workflow endpoint, and workflow steps. We store this data, along with workflow configurations, in an S3 bucket. An AWS Fargate task reads from the S3 bucket and prepares the workflow. It detects if the input parameters include S3 Glacier URLs.
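As an illustration, the launch task can detect archived objects with a head_object call. The following is a minimal sketch assuming boto3; the function name and the set of storage classes checked are our own assumptions (S3 Glacier Instant Retrieval objects are readable without a restore, so only Flexible Retrieval and Deep Archive are flagged):

```python
import boto3

s3 = boto3.client("s3")

# Storage classes whose objects must be restored before they are readable.
# GLACIER_IR (Instant Retrieval) is excluded: it serves reads in milliseconds.
ARCHIVED_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}

def needs_restore(bucket: str, key: str) -> bool:
    """Return True if the object is archived and requires a restore first."""
    head = s3.head_object(Bucket=bucket, Key=key)
    # head_object omits StorageClass for S3 Standard objects.
    return head.get("StorageClass") in ARCHIVED_CLASSES
```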
We use Amazon Simple Queue Service (Amazon SQS) to decouple S3 Glacier index creation from object restore actions (Figure 1). This increases the reliability of our process.
An AWS Lambda function creates the index of all objects in the specified S3 bucket URLs and submits them as an SQS message.
Another Lambda function polls the SQS queue and submits the request(s) to restore the S3 Glacier objects to the S3 Standard storage class.
The function writes the job ID of each S3 Glacier restore request to Amazon DynamoDB. After the restore is complete, Lambda sets the status of the workflow to READY. Only then can any computing jobs begin, such as with AWS Batch.
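To make the flow concrete, the DynamoDB item for one workflow launch could be shaped as follows. This is only a sketch: the attribute names, and the simplification of tracking pending object keys rather than S3 Glacier job IDs, are assumptions of ours, not details given in the post.

```python
# Illustrative DynamoDB item (low-level attribute-value format) for one
# workflow launch. "status" transitions WAITING -> READY once the set of
# pending objects is empty.
item = {
    "launch_id": {"S": "run-0001"},  # partition key, set at launch time
    "status": {"S": "WAITING"},
    "pending_objects": {"SS": ["sample1.bam", "sample2.bam"]},
}
```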
Implementation considerations
We consider the use case of Snakemake with Tibanna, which we detailed in Part 2 of this series. This allows us to dive deeper into launch details.
Snakemake is an open-source utility for whole-genome-sequence mapping in directed acyclic graph format. Snakemake uses Snakefiles to declare workflow steps and commands. Tibanna is open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).
We recommend using Amazon Genomics CLI if Tibanna is not needed for your use case, or Amazon Omics if your workflow definitions comply with the supported WDL and Nextflow specifications.
Formulate the restore request
The Snakemake Fargate launch container detects if the S3 objects under the requested S3 bucket URLs are stored in S3 Glacier. The Fargate launch container generates and puts a JSON binary base call (BCL) configuration file into an S3 bucket and exits successfully. This file includes the launch ID of the workflow, corresponding with the DynamoDB item key, plus the S3 URLs to restore.
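A sketch of that step, assuming boto3 inside the launch container; the configuration field names and the object key layout are illustrative, not taken from the post:

```python
import json
import boto3

s3 = boto3.client("s3")

def write_restore_config(config_bucket: str, launch_id: str, glacier_urls: list) -> None:
    """Upload the JSON configuration file that triggers the restore pipeline."""
    config = {
        "launch_id": launch_id,        # matches the DynamoDB item key
        "restore_urls": glacier_urls,  # s3:// URLs of archived objects
    }
    s3.put_object(
        Bucket=config_bucket,
        Key=f"restore-configs/{launch_id}.json",  # illustrative key layout
        Body=json.dumps(config).encode("utf-8"),
    )
```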
Query the S3 URLs
Once the JSON BCL configuration file lands in this S3 bucket, the S3 Event Notification PutObject event invokes a Lambda function. This function parses the configuration file and recursively queries for all S3 object URLs to restore.
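The recursive query can be expressed with the list_objects_v2 paginator, which reports each object's storage class. A minimal sketch under the same assumptions as above:

```python
import boto3

s3 = boto3.client("s3")
ARCHIVED_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}

def list_archived_urls(bucket: str, prefix: str) -> list:
    """Collect s3:// URLs of all archived objects under a prefix."""
    urls = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj.get("StorageClass") in ARCHIVED_CLASSES:
                urls.append(f"s3://{bucket}/{obj['Key']}")
    return urls
```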
Initiate the restore
The main Lambda function then submits messages to the SQS queue that contain the full list of S3 URLs that need to be restored. SQS messages also include the launch ID of the workflow. This ensures we can bind specific restoration jobs to specific workflow launches. If all S3 Glacier objects belong to the Flexible Retrieval storage class, the Lambda function puts the URLs in a single SQS message, enabling restoration with the Bulk Glacier job tier. The Lambda function also sets the status of the workflow to WAITING in the corresponding DynamoDB item. The WAITING state notifies the end user that the job is waiting on the data-restoration process and will proceed once the data restoration is complete.
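A sketch of this producer side, assuming boto3; the queue URL, table name, and attribute names are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
ddb = boto3.client("dynamodb")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/restore-queue"  # placeholder
TABLE_NAME = "workflow-launches"  # placeholder

def enqueue_restore(launch_id: str, urls: list) -> None:
    # A single message carrying all URLs lets the consumer submit one
    # batch of Bulk-tier restore requests for the launch.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"launch_id": launch_id, "restore_urls": urls}),
    )
    # Flag the launch as waiting on data restoration.
    ddb.update_item(
        TableName=TABLE_NAME,
        Key={"launch_id": {"S": launch_id}},
        UpdateExpression="SET #s = :w",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":w": {"S": "WAITING"}},
    )
```

Note that a single SQS message is capped at 256 KB, so a very large URL list would need to be chunked across messages or offloaded to S3.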
A secondary Lambda function polls for new messages landing in the SQS queue. This Lambda function submits the restoration request(s), for example as a free-of-charge Bulk retrieval, using the RestoreObject API. The function subsequently writes the S3 Glacier job ID of each request to our DynamoDB table. This allows the main Lambda function to check if all job IDs associated with a workflow launch ID are complete.
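A sketch of the consumer, assuming an SQS event source mapping. RestoreObject and its Bulk tier are real API surface; the table name and attribute names are illustrative. As a simplification we track pending object keys, whereas the post records S3 Glacier job IDs:

```python
import json
import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")
TABLE_NAME = "workflow-launches"  # placeholder

def handler(event, context):
    for record in event["Records"]:  # one record per SQS message
        msg = json.loads(record["body"])
        for url in msg["restore_urls"]:
            bucket, key = url.removeprefix("s3://").split("/", 1)
            s3.restore_object(
                Bucket=bucket,
                Key=key,
                RestoreRequest={
                    "Days": 7,  # lifetime of the temporary restored copy
                    "GlacierJobParameters": {"Tier": "Bulk"},  # lowest-cost tier
                },
            )
            # Record the pending object so completion events can be matched
            # back to this launch.
            ddb.update_item(
                TableName=TABLE_NAME,
                Key={"launch_id": {"S": msg["launch_id"]}},
                UpdateExpression="ADD pending_objects :k",
                ExpressionAttributeValues={":k": {"SS": [key]}},
            )
```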
Update status
The status of our workflow launch remains WAITING as long as the Glacier object restore is incomplete. The AWS CloudTrail logs of completed S3 Glacier job IDs invoke our main Lambda function (via an Amazon EventBridge rule) to update the status of the restoration job in our DynamoDB table. With each invocation, the function checks if all job IDs associated with a workflow launch ID are complete.
After all objects have been restored, the function updates the workflow launch with status READY. This launches the workflow with the same launch ID as before the restore.
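A sketch of that completion handler, continuing the pending-objects simplification from above. The event shape depends on how the EventBridge rule is defined, so the parsing below is an assumption, and lookup_launch_id is a hypothetical helper:

```python
import boto3

ddb = boto3.client("dynamodb")
TABLE_NAME = "workflow-launches"  # placeholder

def lookup_launch_id(key: str) -> str:
    """Hypothetical helper: map a restored object key back to its launch,
    e.g. via a secondary index. Not specified by the post."""
    raise NotImplementedError

def handler(event, context):
    # Invoked by an EventBridge rule matching completed restores; this
    # field path assumes a CloudTrail-based event.
    key = event["detail"]["requestParameters"]["key"]
    launch_id = lookup_launch_id(key)

    # Remove the restored object from the launch's pending set.
    resp = ddb.update_item(
        TableName=TABLE_NAME,
        Key={"launch_id": {"S": launch_id}},
        UpdateExpression="DELETE pending_objects :k",
        ExpressionAttributeValues={":k": {"SS": [key]}},
        ReturnValues="ALL_NEW",
    )
    # DynamoDB drops a string set attribute when its last element is
    # removed, so its absence means every restore has completed.
    if "pending_objects" not in resp["Attributes"]:
        ddb.update_item(
            TableName=TABLE_NAME,
            Key={"launch_id": {"S": launch_id}},
            UpdateExpression="SET #s = :r",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":r": {"S": "READY"}},
        )
```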
Conclusion
In this blog post, we demonstrated how life-science research teams can make use of their archival data for genomic studies. We designed an event-driven S3 Glacier restore process that retrieves data upon request. We discussed how to reliably launch the restore so our workflow doesn't fail. We also determined upfront whether an S3 Glacier restore is needed and used the WAITING state to prevent our workflow from failing.
With this solution, life-science research teams can save money using Amazon S3 Glacier without disrupting their day-to-day work or manually administering S3 Glacier object restores.