VCF annotation via asynchronous request-response pattern #108
Open
ehclark wants to merge 19 commits into main from feature/async-vcf
8cba86a Add optional celery and aiofiles dependencies for async support (ehclark)
325200d Add implementation for async VCF annotation (ehclark)
e9cd1e6 Move logging config to main anyvar module so it is activated for both… (ehclark)
c0e38e9 Documentation and logging updates (ehclark)
a105404 Changing project file optional deps to "test" to match Makefile (ehclark)
23989ef Move has_queueing_enabled() function to top level (ehclark)
2d56bb2 Fixing small errors (ehclark)
01a30ac Add unit tests for async VCF processing (ehclark)
2f9d0cf Don't allow duplicate run ids to be active concurrently (ehclark)
d2232f5 Add documentation for async capabilities (ehclark)
db4b326 Documentation updates (ehclark)
19a8428 Documentation updates (ehclark)
a59430c Change time estimate to 333/sec for async (ehclark)
c33b3ae Documentation format updates (ehclark)
237b50b Add link to async README (ehclark)
e96a186 Update to worker shutdown and cleanup handling (ehclark)
9dfd81d Improve logging (ehclark)
60f975e Add pool type warning to README (ehclark)
4101c1f Fix caution block formatting (ehclark)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
# AnyVar Asynchronous VCF Annotation | ||
AnyVar can use an | ||
[asynchronous request-response pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply) | ||
when annotating VCF files. This can improve reliability when serving remote clients by | ||
eliminating long-lived connections and allow AnyVar to scale horizontally instead of vertically | ||
to serve a larger request volume. AnyVar utilizes the [Celery](https://docs.celeryq.dev/) | ||
distributed task queue to manage the asynchronous tasks. | ||
|
||
## How It Works | ||
AnyVar can be run as a FastAPI app that provides a REST API. The REST API is run using | ||
uvicorn or gunicorn, e.g.: | ||
```shell | ||
% uvicorn anyvar.restapi.main:app | ||
``` | ||
|
||
AnyVar can also be run as a Celery worker app that processes tasks submitted through the REST API, e.g.: | ||
```shell | ||
% celery -A anyvar.queueing.celery_worker:celery_app worker | ||
``` | ||
|
||
When VCF files are submitted to the `/vcf` endpoint with the `run_async=True` query parameter, | ||
the REST API submits a task to the Celery worker via a queue and immediately returns a `202 Accepted` | ||
response with a `Location` header indicating where the client should poll for status and results. | ||
Once the VCF is annotated and the result is ready, the polling request will return the annotated | ||
VCF file. For example: | ||
``` | ||
> PUT /vcf?run_async=True HTTP/1.1 | ||
> Content-type: multipart/form-data... | ||
|
||
< HTTP/1.1 202 Accepted | ||
< Location: /vcf/a1ac7850-0df7-4db6-82ab-b19bce93faf3 | ||
< Retry-After: 120 | ||
|
||
> GET /vcf/a1ac7850-0df7-4db6-82ab-b19bce93faf3 HTTP/1.1 | ||
|
||
< HTTP/1.1 202 Accepted | ||
|
||
> GET /vcf/a1ac7850-0df7-4db6-82ab-b19bce93faf3 HTTP/1.1 | ||
|
||
< HTTP/1.1 200 OK | ||
< | ||
< ##fileformat=VCFv4.2... | ||
``` | ||
|
||
The client can provide a `run_id=...` query parameter with the initial PUT request. If one is not | ||
provided, a random UUID will be generated (as illustrated above). | ||
|
||
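The exchange above can be driven from a small client loop. The following is a minimal sketch and not part of AnyVar: the names `http_get` and `poll_for_result`, the default delay, and the attempt limit are illustrative, and it assumes `Retry-After` carries a number of seconds and that any status other than `200` or `202` means the run failed.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Tuple

# (status code, lower-cased headers, body)
Response = Tuple[int, Dict[str, str], bytes]


def http_get(url: str) -> Response:
    """Fetch a URL with the standard library, returning error statuses instead of raising."""
    try:
        with urllib.request.urlopen(url) as resp:
            headers = {k.lower(): v for k, v in resp.headers.items()}
            return resp.status, headers, resp.read()
    except urllib.error.HTTPError as err:
        headers = {k.lower(): v for k, v in err.headers.items()}
        return err.code, headers, err.read()


def poll_for_result(
    get: Callable[[str], Response],
    location: str,
    default_delay: float = 5.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Poll `location` until the server stops answering 202 Accepted."""
    for _ in range(max_attempts):
        status, headers, body = get(location)
        if status == 200:
            return body  # the annotated VCF is ready
        if status != 202:
            raise RuntimeError(f"run failed with HTTP status {status}")
        # Use Retry-After (assumed to be in seconds) when present, else the default delay
        sleep(float(headers.get("retry-after", default_delay)))
    raise TimeoutError(f"no result after {max_attempts} polls of {location}")
```

A caller would pass `http_get` together with the full URL built from the `Location` header of the initial `202 Accepted` response. Injecting `get` and `sleep` keeps the loop testable without a running server.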
## Setting Up Asynchronous VCF Processing | ||
Enabling asynchronous VCF processing requires some additional setup. | ||
|
||
### Install the Necessary Dependencies | ||
Asynchronous VCF processing requires the installation of additional, optional dependencies: | ||
```shell | ||
% pip install .[queueing] | ||
``` | ||
This will install the `celery[redis]` module and its dependencies. To connect Celery to a different | ||
message broker or backend, install the appropriate extras with Celery. | ||
|
||
### Start an Instance of Redis | ||
Celery relies on a message broker and result backend to manage the task queue and store results. | ||
The simplest option is to use a single instance of [Redis](https://redis.io) for both purposes. This | ||
documentation and the default settings will both assume this configuration. For other message broker | ||
and result backend options, refer to the Celery documentation. | ||
|
||
If a Docker engine is available, start a local instance of Redis: | ||
```shell | ||
% docker run -d -p 6379:6379 redis:alpine | ||
``` | ||
Alternatively, follow the [instructions](https://redis.io/docs/latest/get-started/) to run Redis locally. | ||
|
||
### Create a Scratch Directory for File Storage | ||
AnyVar does not store the actual VCF files in Redis for asynchronous processing, only paths to the files. | ||
This allows very large VCF files to be asynchronously processed. All REST API and worker instances of AnyVar | ||
require access to the same shared file system. | ||
|
||
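The idea of passing a path rather than file contents can be sketched as follows. This is an illustration only and not AnyVar's implementation: `stage_upload`, `load_staged`, and the `{run_id}_input.vcf` naming scheme are hypothetical.

```python
from pathlib import Path


def stage_upload(work_dir: str, run_id: str, data: bytes) -> str:
    """REST API side: write the uploaded VCF to the shared directory and return its path."""
    target = Path(work_dir) / f"{run_id}_input.vcf"  # hypothetical naming scheme
    target.write_bytes(data)
    return str(target)  # only this short string would travel through the queue


def load_staged(path: str) -> bytes:
    """Worker side: read the VCF back from the shared file system."""
    return Path(path).read_bytes()
```

Because the worker opens the path it receives, the directory must resolve to the same storage on every REST API and worker host.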
### Start the REST API | ||
Start the REST API with environment variables to set shared resource locations: | ||
```shell | ||
% CELERY_BROKER_URL="redis://localhost:6379/0" \ | ||
CELERY_BACKEND_URL="redis://localhost:6379/0" \ | ||
ANYVAR_VCF_ASYNC_WORK_DIR="/path/to/shared/file/system" \ | ||
uvicorn anyvar.restapi.main:app | ||
``` | ||
|
||
### Start a Celery Worker | ||
Start a Celery worker with environment variables to set shared resource locations: | ||
```shell | ||
% CELERY_BROKER_URL="redis://localhost:6379/0" \ | ||
CELERY_BACKEND_URL="redis://localhost:6379/0" \ | ||
ANYVAR_VCF_ASYNC_WORK_DIR="/path/to/shared/file/system" \ | ||
celery -A anyvar.queueing.celery_worker:celery_app worker | ||
``` | ||
To run multiple worker processes within a single Celery worker, use the `--concurrency` option. | ||
|
||
> [!CAUTION] | ||
> Celery supports different pool types (prefork, threads, etc.). | ||
> AnyVar ONLY supports the `prefork` and `solo` pool types. | ||
|
||
|
||
### Submit an Async VCF Request | ||
Now that the REST API and Celery worker are running, submit an async VCF request with cURL: | ||
```shell | ||
% curl -v -X PUT -F "vcf=@test.vcf" 'http://localhost:8000/vcf?run_async=True&run_id=12345' | ||
``` | ||
And then check its status: | ||
```shell | ||
% curl -v 'http://localhost:8000/vcf/12345' | ||
``` | ||
|
||
## Additional Environment Variables | ||
In addition to the environment variables mentioned previously, the following environment variables | ||
are directly supported and applied by AnyVar during startup. It is advisable to understand the underlying | ||
Celery configuration options in more detail before making any changes. The Celery configuration parameter | ||
name corresponding to each environment variable can be derived by removing the leading `CELERY_` and | ||
lowercasing the remainder, e.g.: `CELERY_TASK_DEFAULT_QUEUE` -> `task_default_queue`. | ||
| Variable | Description | Default | | ||
| -------- | ------- | ------- | | ||
| CELERY_TASK_DEFAULT_QUEUE | The name of the queue for tasks | anyvar_q | | ||
| CELERY_EVENT_QUEUE_PREFIX | The prefix for event receiver queue names | anyvar_ev | | ||
| CELERY_TIMEZONE | The timezone that Celery operates in | UTC | | ||
| CELERY_RESULT_EXPIRES | Number of seconds after submission before a result expires from the backend | 7200 | | ||
| CELERY_TASK_ACKS_LATE | Whether workers acknowledge tasks before (`false`) or after (`true`) they are run | true | | ||
| CELERY_TASK_REJECT_ON_WORKER_LOST | Whether to reject (`true`) or fail (`false`) a task when a worker dies mid-task | false | | ||
| CELERY_WORKER_PREFETCH_MULTIPLIER | How many tasks a worker should fetch from the queue at a time | 1 | | ||
| CELERY_TASK_TIME_LIMIT | Maximum time a task may run before it is terminated | 3900 | | ||
| CELERY_SOFT_TIME_LIMIT | Amount of time a task can run before an exception is triggered, allowing for cleanup | 3600 | | ||
| CELERY_WORKER_SEND_TASK_EVENTS | Change to `true` to cause Celery workers to emit task events for monitoring purposes | false | | ||
| ANYVAR_VCF_ASYNC_FAILURE_STATUS_CODE | What HTTP status code to return for failed asynchronous tasks | 500 | |
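
The naming rule described before the table (strip the leading `CELERY_`, then lowercase the rest) can be written as a short helper. This is a sketch of the rule as stated, not the code AnyVar uses; the function name `celery_setting_name` is hypothetical.

```python
def celery_setting_name(env_var: str) -> str:
    """Derive a Celery configuration key from a CELERY_* environment variable name."""
    prefix = "CELERY_"
    if not env_var.startswith(prefix):
        raise ValueError(f"{env_var!r} does not start with {prefix!r}")
    # Drop the prefix and lowercase what is left
    return env_var[len(prefix):].lower()


for name in ("CELERY_TASK_DEFAULT_QUEUE", "CELERY_RESULT_EXPIRES"):
    print(name, "->", celery_setting_name(name))
```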
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
"""Provides asynchronous tasks via Celery integration""" |
out of curiosity, any reason for `pyaml` and not `pyyaml`?