These are supplementary instructions for setting up AWS Batch in order to use our workflow. They assume that you understand the basics of AWS, and particularly AWS Batch. As a warning, this is not for the faint of heart, but we provide some guidance.
In summary, you must:
- do the set-up (this only needs to be done once, unless your data set sizes vary considerably or you change region)
- decide which AWS region you want to run in
- create an AWS S3 bucket in that region as a work directory (I'll refer to this as `eg-aws-bucket` in these instructions; replace it with your own bucket name)
- if your input and output are going to AWS S3 they should be in the same region -- the input could be on your local disk or S3, and the output could go to your local disk or S3. It would probably make sense for the input to be on S3 since you might want to run several times, but that is completely up to you and YMMV. Do remember to set the security on the buckets so as to protect any sensitive data
- set up an AWS Batch Compute Environment with appropriate permissions
- set up an AWS Batch queue that connects to the batch environment
- set up a Nextflow config file that puts all this together
- run the workflow
NB: AWS Batch requires all resources to be set up in the same AWS region. A major cause of misconfiguration is failing to observe this (easy to do), so pay attention to which region you are in. In particular, the S3 buckets and AWS Batch need to be in the same region.
Log into your AWS console.
Choose the AWS region based on what is convenient for you and has the best pricing; it probably doesn't make a big difference. Our University's ethics committee prefers us to run in `af-south-1` since data there are subject to South African privacy laws, but your requirements may well be different. Remember that any buckets you use must be in the same region.
You need to have a bucket that can be used for temporary storage. It could be an existing bucket. Make sure you are the only person who can read and write to it.
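If you need to create a new bucket, something like the following should work from the AWS CLI (the bucket name and region here are placeholders -- use your own; the second command makes the block on public access explicit):

aws s3api create-bucket --bucket eg-aws-bucket --region af-south-1 \
    --create-bucket-configuration LocationConstraint=af-south-1
aws s3api put-public-access-block --bucket eg-aws-bucket \
    --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true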
Make sure you have a suitable VPC (Virtual Private Cloud) for Batch to run in. If you haven't done this before, consider following our instructions in Section 5 (Addendum below) before setting up the Batch Compute environment.
This defines what resources you need for your workflow. There are a number of steps, but mainly we can use the default values. However, you do need to be careful to set things up so that you have enough resources.
By default, the AWS instances that run Batch jobs have 30GB of disk. We think that you need a volume at least 4x bigger than the input data size to run safely; for example, a 25GB input data set calls for a volume of at least 100GB. If your need is less than 30GB, there's no problem and you can skip the rest of 2.1. If not, there's an extra configuration step to set up an environment with disks of the correct size.
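If your input is already on S3, one quick way to check its total size (and so decide whether the default 30GB is enough) is shown below; the bucket name is just the example used later in these instructions:

aws s3 ls s3://za-batch-data --recursive --summarize --human-readable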
The easiest way of getting larger disks is to set up a launch template. You can do this using the console, but in my experience it is simpler to use the command line tools.
Install the boto3 library using pip or yum or the like (e.g., `yum install python3-boto`).
There is a file called `launch-template.json` in this directory. Download it to your machine. Change the value of the `LaunchTemplateName` field to something meaningful and unique for you, and the `VolumeSize` field to the size you want (in units of GB).
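We have not reproduced the file here, but it follows the `aws ec2 create-launch-template --cli-input-json` schema, so it will look roughly like the sketch below (the name, size and device name shown are only illustrative -- edit your downloaded copy rather than retyping this):

{
    "LaunchTemplateName": "my-100GB-template",
    "LaunchTemplateData": {
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/xvda",
                "Ebs": { "VolumeSize": 100, "VolumeType": "gp2", "DeleteOnTermination": true }
            }
        ]
    }
}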
Then create the launch template using the following command (mutatis mutandis -- change the region, and the file name to match your copy; in this example it has been renamed h3a-100GB-template.json). Note that launch templates are account- and region-specific, so a template defined in `af-south-1` will not show up in `us-east-1`.
aws ec2 create-launch-template --region af-south-1 --cli-input-json file://h3a-100GB-template.json
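To confirm that the template now exists in the region you expect, you can list the templates there (again, adjust the region):

aws ec2 describe-launch-templates --region af-south-1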
The bigger your data files, the more RAM you need. This is not an issue you have to worry about at configuration time: when you run the workflow you may have to set the `plink_mem_req` and `other_mem_req` parameters, as discussed elsewhere.
Read
- https://www.nextflow.io/docs/latest/executor.html#aws-batch
- https://www.nextflow.io/docs/latest/awscloud.html#awscloud-batch (up to point 5 -- the material from "Configuration" onwards is meant for pipeline developers rather than users).
You should be able to accept the default settings unless you need a launch template. If you prefer the command line, a rough CLI equivalent of these console steps is sketched after the list below.
Choose
- Compute environment configuration
- Managed environment type
- give a meaningful name
- enable environment
- under "Service role" pick AWSBatchServiceRole
- under additional settings for "Compute environment configuration" (these sometimes only appear once you click in the next section -- the UI can be confusing here, so be patient) pick:
- AWSBatchServiceRole
- ecsInstanceRole
- Choose a keypair for the region (this is only needed if you intend to ssh into the instances that spin up and so would not normally be done)
- Instance Configuration
- Spot pricing (choose the percentage your pocket can afford)
- Optimal for allowed instance types
- `SPOT_CAPACITY_OPTIMIZED` for allocation strategy
- You don't need a "Spot fleet role" if you have chosen `SPOT_CAPACITY_OPTIMIZED`
- Under Additional settings (if you have ever defined a launch template for this region)
- pick none, or a template you have defined if you need one. Note that for each template you need to define a new environment (see section 2.1 above). If you haven't defined a template for this region there will be nothing to choose and you won't be able to select an option; that's OK.
- You don't need to pick an AMI and should do so only if you really know what you are doing.
- Under networking add a VPC -- note that under additional settings are the definitions of the security groups which define access
- Add tags if you want to; they may be helpful for tracking resources
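For reference, here is the CLI sketch mentioned above. It is only an approximation of the console steps: the environment name, the account ID in the role ARNs, the subnet IDs, the security group ID and the launch template name are all placeholders you must replace, and the launchTemplate line should be dropped if you are not using one.

aws batch create-compute-environment --region af-south-1 \
    --compute-environment-name h3a-batch-env \
    --type MANAGED --state ENABLED \
    --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole \
    --compute-resources '{
        "type": "SPOT",
        "bidPercentage": 100,
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,
        "maxvCpus": 64,
        "instanceTypes": ["optimal"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "securityGroupIds": ["sg-cccc3333"],
        "launchTemplate": {"launchTemplateName": "h3a-100GB-template"}
    }'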
The AWS Batch instances will need to access S3 and so you need to give them this permission.
- From the AWS Console, choose "Services" and then "IAM"
- Choose "Roles"
- Choose "ecsInstanceRole"
- Choose "Attach Policies"
- In the filter bar type in `AmazonS3FullAccess` (NB: no spaces) and select it
The ecsInstanceRole should now have two policies attached: AmazonEC2ContainerServiceforEC2Role and AmazonS3FullAccess
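The same attachment can be done from the command line (a sketch, assuming the credentials you use with the CLI have IAM permissions):

aws iam attach-role-policy --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess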
In the Amazon Console, go back to AWS Batch from the list of services
Create a job queue. Unless you need to do something fancy just pick the default options.
- for convenience call the queue the same as the environment
- attach the environment you created to the queue
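As with the compute environment, the queue can also be created from the CLI; `h3a-batch-queue` and `h3a-batch-env` below are placeholder names for the queue and the environment you created:

aws batch create-job-queue --region af-south-1 \
    --job-queue-name h3a-batch-queue \
    --state ENABLED --priority 1 \
    --compute-environment-order order=1,computeEnvironment=h3a-batch-env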
You use the queue name in your Nextflow config file.
It will look something like this. I'll call this `aws.config` for the example but you can name it whatever you want. Please note that the `accessKey` and `secretKey` must be valid for the account and region in which you created the environment and queue. Also, if you are using an IAM user, that user must have permission to run Batch jobs.
params {
    // memory requests passed to the workflow processes
    plink_mem_req = "8GB"
    other_mem_req = "8GB"
}

profiles {
    awsbatch {
        // credentials and region must match where the queue and environment were created
        aws.region = "af-south-1"
        aws.accessKey = 'YourAccessKeyForTheRegion'
        aws.secretKey = 'AssociatedSecretKey'
        aws.uploadStorageClass = 'ONEZONE_IA'
        process.queue = 'QueueYouCreated'
        process.executor = "awsbatch"
    }
}
Then run the workflow with something like this. Note that in this example I am using input that's already in S3, except for one file which is local (this is to show that the data can be in different places -- it would probably make sense for the phenotype file to be stored in S3, but perhaps you are trying to be extra careful).
nextflow run h3agwas/qc/main.nf \
-profile awsbatch -c aws.config \
-work-dir 's3://za-batch-work' \
--input_dir 's3://za-batch-data' --input_pat sim1_s \
--output_dir 's3://za-batch-data/qc' --output sim1_s_qc \
--data 's3://za-batch-data/out_qt.pheno' \
--case_control data/see.fam --case_control_col=pheno
Note that the work bucket you give will start to fill up (as will any output buckets). If you do a lot of analysis, the work bucket can quickly reach hundreds of GB. It may contain sensitive data, and AWS will also charge you for the storage, so remember to regularly delete objects from your work bucket.
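For example, once you are sure the intermediate files from a run are no longer needed, something like the following removes everything in the work bucket used above (double-check the bucket name first -- this is irreversible):

aws s3 rm s3://za-batch-work --recursive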
We have found that one of the common reasons for Batch not working is that its networking is not set up correctly. For those who are not very familiar with AWS, these instructions may be useful. If you find that your workflow just hangs, one possible reason is that you have not set up networking properly. These steps are best done before setting up the compute environment.
Your batch jobs will run in a VPC (Virtual Private Cloud). This VPC needs to communicate with you and with S3. There are many ways of doing this and, although it is not in scope of our instructions, you may find the following useful. As a warning, our instructions use public IP addresses -- this should be secure, but you may want to consider using private IP addresses only; that is really out of scope of our documentation.
- In the Console Services choose "VPC"
- Choose Create VPC
- Select VPC + more
- Name your VPC meaningfully -- you will need this name
- Use an IPv4 CIDR block: I've used 10.0.0.0/16 and 172.31.0.0/16, but any sensible choice of private IP range should work (a VPC block must be between /16 and /28)
- Select the number of availability zones AZ: default of 2 is probably good
- Choose public subnets, using the same number as the AZs you chose. Do not create private subnets
- Don't add a NAT
- Create
- The creation of the VPC will also create an internet gateway -- its ID will start with "igw-" and it will have the name you gave your VPC as part of its name.
- Once the VPC has been created, select it and look for the main routing table -- its ID will start with "rtb-". Select that
- Click on "routes" and "Edit Routes"
- Choose "Add Routes"
- Add the entry 0.0.0.0/0 and then under Target choose "Internet gateway" and select the gateway that was created
- Save
- While still on the "VPC Dashboard", click on Subnets in the panel on the left. For each of the subnets that belong to the VPC:
- Select the subnet, then from the Actions menu select Edit subnet settings
- Tick Enable auto-assign public IPv4 address
- Save
When you create your Batch compute environment, use this VPC.
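If you prefer the command line, the last two pieces (the default route to the internet gateway, and auto-assigning public IPv4 addresses on each subnet) can also be done with the AWS CLI. The route table, gateway and subnet IDs below are placeholders for the ones created above:

aws ec2 create-route --region af-south-1 --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 0.0.0.0/0 --gateway-id igw-0123456789abcdef0
aws ec2 modify-subnet-attribute --region af-south-1 \
    --subnet-id subnet-0123456789abcdef0 --map-public-ip-on-launch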