AWS EC2 Batch

Warning

This page may be out of date. For the latest version of the HPS book, please visit https://seisscoped.org/HPS-book

Here’s a short tutorial on using AWS Batch with Fargate and containers to run a job that reads from and writes to Amazon S3.

The steps below require the AWS CLI to be set up; the jq tool is optional but used to parse command output.
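
For example, you can check that both tools are available and that your CLI credentials are configured (aws sts get-caller-identity simply prints the account and identity the CLI is using):

aws --version
aws sts get-caller-identity
jq --version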

1. Create role

AWS Batch requires an IAM role for running the jobs. This can be created from the IAM web console.

Create a role using the following options:

  • Trusted Entity Type: AWS Service

  • Use Case: Elastic Container Service

    • Elastic Container Service Task

On the next page, search for and add:

  • AmazonECSTaskExecutionRolePolicy

  • AmazonS3FullAccess

Once the role is created, one more permission is needed:

  • Go to: Permissions tab –> Add Permissions –> Create inline policy

  • Search for “batch”

  • Click on Batch

  • Select Read / Describe Jobs

  • Click Next

  • Add a policy name, e.g. “Describe_Batch_Jobs”

  • Click Create Policy
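
If you prefer the command line over the web console, the same role can be set up with the AWS CLI. The sketch below uses batch-job-role as an arbitrary example role name; adjust it to whatever you choose:

# Trust policy that lets ECS tasks assume the role
cat > trust_policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role --role-name batch-job-role --assume-role-policy-document file://trust_policy.json

# Attach the two managed policies
aws iam attach-role-policy --role-name batch-job-role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
aws iam attach-role-policy --role-name batch-job-role --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Inline policy equivalent to "Describe_Batch_Jobs"
aws iam put-role-policy --role-name batch-job-role --policy-name Describe_Batch_Jobs --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"batch:DescribeJobs","Resource":"*"}]}'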

Finally, go to the S3 bucket where you’ll be writing the results of the jobs. Open the Permissions tab and add a statement to the bucket policy granting full access to the role you just created:

		{
			"Sid": "Statement3",
			"Principal": {
			    "AWS": "arn:...your job role ARN."
			},
			"Effect": "Allow",
			"Action": "s3:*",
			"Resource": "arn:...your bucket ARN."
		}

Note that the job role ARN will be in the format of arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<JOB_ROLE_NAME>. The bucket ARN will be in the format of arn:aws:s3:::<YOUR_S3_BUCKET>.
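
If you'd rather update the bucket policy from the CLI, the statement goes inside a standard policy document. A sketch, using the placeholder ARNs above and an illustrative file name of bucket_policy.json (note that put-bucket-policy replaces the entire bucket policy, so include any existing statements as well):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement3",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<JOB_ROLE_NAME>"
            },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::<YOUR_S3_BUCKET>"
        }
    ]
}

aws s3api put-bucket-policy --bucket <YOUR_S3_BUCKET> --policy file://bucket_policy.json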

2. Create a Compute Environment

You’ll need two pieces of information to create the compute environment: the list of subnets in your VPC and the default security group ID. You can retrieve them with the following commands:

aws ec2 describe-subnets  | jq ".Subnets[] | .SubnetId"
aws ec2 describe-security-groups --filters "Name=group-name,Values=default" | jq ".SecurityGroups[0].GroupId"

We define the compute environment in a YAML file (compute_environment.yaml):

computeEnvironmentName: ''  # [REQUIRED] The name for your compute environment.
type: MANAGED
state: ENABLED
computeResources: # Details about the compute resources managed by the compute environment.
  type: FARGATE
  maxvCpus: 256 # [REQUIRED] The maximum number of Amazon EC2 vCPUs that a compute environment can reach.
  subnets: # [REQUIRED] The VPC subnets where the compute resources are launched.
  - ''
  securityGroupIds: # [REQUIRED] The Amazon EC2 security groups that are associated with instances launched in the compute environment.
  - ''

Use these values to fill in the missing fields in compute_environment.yaml and then run:

aws batch create-compute-environment --no-cli-pager --cli-input-yaml file://compute_environment.yaml

Make a note of the compute environment ARN to use in the next step.
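
If you need to look up the ARN again later, one way to list it (assuming jq) is shown below. The compute environment should reach the VALID status before it is attached to a job queue.

aws batch describe-compute-environments | jq ".computeEnvironments[] | {computeEnvironmentName, computeEnvironmentArn, status}"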

3. Create a Job queue

Add the compute environment ARN and a queue name to job_queue.yaml, as in the following file:

jobQueueName: ''  # [REQUIRED] The name of the job queue.
state: ENABLED
priority: 0
computeEnvironmentOrder: # [REQUIRED] The set of compute environments mapped to a job queue and their order relative to each other.
- order: 0  # [REQUIRED] The order of the compute environment.
  computeEnvironment: '' # [REQUIRED] The Amazon Resource Name (ARN) of the compute environment.

and then run:

aws batch create-job-queue --no-cli-pager --cli-input-yaml file://job_queue.yaml
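
You can confirm that the queue was created and reached the VALID status with, for example:

aws batch describe-job-queues | jq ".jobQueues[] | {jobQueueName, state, status}"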

4. Create a Job Definition

Update the jobRoleArn and executionRoleArn fields in the job_definition.yaml file with the ARN of the role created in the first step:

jobDefinitionName: ''  # [REQUIRED] The name of the job definition to register.
type: container
platformCapabilities:
- FARGATE
containerProperties:
  image: 'ghcr.io/noisepy/noisepy'
  command:
  - '--help'
  jobRoleArn: ''
  executionRoleArn: ''
  resourceRequirements: # The type and amount of resources to assign to a container.
  - value: '16'
    type: VCPU
  - value: '32768'
    type: MEMORY
  networkConfiguration: # The network configuration for jobs that are running on Fargate resources.
    assignPublicIp: ENABLED  # Indicates whether the job has a public IP address. Valid values are: ENABLED, DISABLED.
  ephemeralStorage: # The amount of ephemeral storage to allocate for the task.
    sizeInGiB: 21  # [REQUIRED] The total amount, in GiB, of ephemeral storage to set for the task.
retryStrategy: # The retry strategy to use for failed jobs that are submitted with this job definition.
  attempts: 1  # The number of times to move a job to the RUNNABLE status.
propagateTags: true # Specifies whether to propagate the tags from the job or job definition to the corresponding Amazon ECS task.
timeout: # The timeout configuration for jobs that are submitted with this job definition, after which Batch terminates your jobs if they have not finished.
  attemptDurationSeconds: 36000  # The job timeout time (in seconds) that's measured from the job attempt's startedAt timestamp.

Add a name for the jobDefinition. Finally, run:

aws batch register-job-definition --no-cli-pager --cli-input-yaml file://job_definition.yaml
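
To verify the registration and see the revision number that was assigned, something along these lines works:

aws batch describe-job-definitions --status ACTIVE | jq ".jobDefinitions[] | {jobDefinitionName, revision, status}"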

5. Submit a job

We will use NoisePy as the example. Update job_cc.yaml with the names of the jobQueue and jobDefinition created in the previous steps:

jobName: 'noisepy-cross-correlate'
jobQueue: ''
jobDefinition: '' # [REQUIRED] The job definition used by this job.
# Uncomment to run a job across multiple nodes. The days in the time range will be split across the nodes.
# arrayProperties:
#   size: 16  # number of nodes
containerOverrides: # An object with various properties that override the defaults for the job definition that specify the name of a container in the specified job definition and the overrides it should receive.
  resourceRequirements:
  - value: '90112' # CC requires more memory
    type: MEMORY
  command: # The command to send to the container that overrides the default command from the Docker image or the job definition.
  - cross_correlate
  - --raw_data_path=s3://scedc-pds/continuous_waveforms/
  - --xml_path=s3://scedc-pds/FDSNstationXML/CI/
  - --ccf_path=s3://<YOUR_S3_BUCKET>/<CC_PATH>
  - --config=s3://<YOUR_S3_BUCKET>/<CONFIG_PATH>/config.yaml
timeout:
  attemptDurationSeconds: 36000  # 10 hrs

Then update the S3 bucket paths to the locations you want to use for the output and your config.yaml file.

aws batch submit-job --no-cli-pager --cli-input-yaml file://job_cc.yaml --job-name "<your job name>"
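
submit-job prints a jobId that you can use to track progress. For example (here <JOB_ID> stands for the ID returned by the previous command, and <YOUR_JOB_QUEUE> for the queue created in step 3):

# Poll the status of a single job
aws batch describe-jobs --jobs <JOB_ID> | jq ".jobs[] | {status, statusReason}"

# List all jobs on the queue in a given state
aws batch list-jobs --job-queue <YOUR_JOB_QUEUE> --job-status RUNNING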

6. Submit a Stacking job

Update job_stack.yaml with the names of your jobQueue and jobDefinition created in the previous steps.

jobName: 'noisepy-stack'
jobQueue: ''
jobDefinition: '' # [REQUIRED] The job definition used by this job.
# Uncomment to run a job across multiple nodes. The station pairs to be stacked will be split across the nodes.
# arrayProperties:
#   size: 16  # number of nodes
containerOverrides: # An object with various properties that override the defaults for the job definition that specify the name of a container in the specified job definition and the overrides it should receive.
  resourceRequirements:
  - value: '32768'
    type: MEMORY
  command: # The command to send to the container that overrides the default command from the Docker image or the job definition.
  - stack
  - --ccf_path=s3://<YOUR_S3_BUCKET>/<CC_PATH>
  - --stack_path=s3://<YOUR_S3_BUCKET>/<STACK_PATH>
timeout:
  attemptDurationSeconds: 7200  # 2 hrs

Then update the S3 bucket paths to the locations you want to use for your input CCFs (e.g. the output of the previous CC run), and the stack output. By default, NoisePy will look for a config file in the --ccf_path location to use the same configuration for stacking that was used for cross-correlation.

aws batch submit-job --no-cli-pager --cli-input-yaml file://job_stack.yaml --job-name "<your job name>"
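
Once the stacking job finishes, a quick sanity check is to list the output location (using the same bucket and path placeholders as above):

aws s3 ls s3://<YOUR_S3_BUCKET>/<STACK_PATH>/ --recursive --human-readable --summarize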