NoisePy tutorial: AWS Batch#

Here’s a tutorial on using AWS Batch with Fargate and containers to perform a job that involves writing to and reading from AWS S3.

1. Checklist and prerequisites#

1.1 Tools#

You are not required to run this on an AWS EC2 instance, but two tools are required for this tutorial: the AWS Command Line Interface (CLI) and jq. Note that the code cells below assume an x86_64 Amazon Linux or CentOS machine and that you have the appropriate permissions. If you are not running on EC2, you can find installation instructions for other operating systems below.

# Install AWS CLI (Command line interface)
# This tool may already be installed if you are on an EC2 instance running Amazon Linux

! curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
! unzip awscliv2.zip
! sudo ./aws/install
# You may check that the CLI installed correctly with the following command,
# which lists the files in the SCEDC public bucket.

! aws s3 ls s3://scedc-pds
# Install jq

! sudo yum install -y jq
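
If you are not on an Amazon Linux/CentOS machine, jq is available from most package managers. For example (Homebrew on macOS and apt on Debian/Ubuntu, respectively; use whichever applies to your system):

# macOS (Homebrew)
! brew install jq
# Debian/Ubuntu
! sudo apt-get install -y jq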

1.2 AWS Account#

The account ID is a 12-digit number that uniquely identifies your account. You can find it on your AWS web console, or from the command line as shown below.
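
If the CLI is already configured with your credentials, the account ID can also be retrieved from the command line (a quick check; the jq filter is just one way of extracting the field):

# Print the 12-digit account ID for the current credentials
! aws sts get-caller-identity | jq -r ".Account"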

⚠️ Save the workshop <ACCOUNT_ID> here: REPLACE_ME

1.3 Role#

An AWS role is a virtual identity with specific permissions; its identifier (called an ARN) has the format arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>. AWS Batch requires a role to be created for running the jobs. This can be done from the IAM panel on the AWS web console. Depending on the type of service used, separate roles may be created. A specific role is required for the Batch Service:

  • Trusted Entity Type: AWS Service

  • Use Case: Elastic Container Service

    • Elastic Container Service Task

  • Permission Policies, search and add:

    • AmazonECSTaskExecutionRolePolicy

    • AmazonS3FullAccess

Once the role is created, one more permission is needed:

  • Go to: Permissions tab -> Add Permissions -> Create inline policy

  • Search for “batch”

  • Click on Batch

  • Select Read / Describe Jobs

  • Click Next

  • Add a policy name, e.g. “Describe_Batch_Jobs”

  • Click Create Policy

⚠️ Workshop participants please use arn:aws:iam::<ACCOUNT_ID>:role/NoisePyBatchRole
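
Once created, you can confirm that the role exists and retrieve its ARN from the CLI (a quick sanity check; replace NoisePyBatchRole with your own role name if it differs):

# Print the ARN of the Batch role
! aws iam get-role --role-name NoisePyBatchRole | jq -r ".Role.Arn"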

1.4 Simple Storage Service (S3)#

An S3 bucket is required to store the cross-correlation functions and stacks. Users are required to create a new bucket with the permissions listed below.

NoisePy uses S3 cloud storage for the cross-correlations and stacked data. For this step, it is important that your role and the bucket have the appropriate permissions so that users can read from and write to the bucket.

The following JSON statement is called a policy. It explicitly defines which operations are allowed or denied for which user/role. The bucket policy below specifies that:

  • all operations ("s3:*") are allowed for your account with the attached role ("arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>") on any file in the bucket ("arn:aws:s3:::<S3_BUCKET>/*").

  • anyone is allowed to read the data within the bucket ("s3:GetObject","s3:GetObjectVersion")

  • anyone is allowed to list the files within the bucket ("s3:ListBucket")

{
    "Version": "2012-10-17",
    "Id": "Policy1674832359797",
    "Statement": [
        {
            "Sid": "Stmt1674832357905",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>"
            },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::<S3_BUCKET>/*"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::<S3_BUCKET>/*"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<S3_BUCKET>"
        }
    ]
}
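
If you prefer the command line, a minimal sketch of creating the bucket and attaching the policy is shown below, assuming you saved the policy above (with <ACCOUNT_ID>, <ROLE>, and <S3_BUCKET> replaced) to a local file named bucket_policy.json:

# Create the bucket (skip if it already exists)
! aws s3 mb s3://<S3_BUCKET>
# Attach the bucket policy saved in bucket_policy.json
! aws s3api put-bucket-policy --bucket <S3_BUCKET> --policy file://bucket_policy.json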

⚠️ Save your <S3_BUCKET> name here: REPLACE_ME

2. Setup Batch Jobs#

2.1 Compute Environment#

You’ll need two pieces of information to create the compute environment: the network subnet and the security group. You can use the following commands to retrieve them.

! aws ec2 describe-subnets  | jq ".Subnets[] | .SubnetId"
! aws ec2 describe-security-groups --filters "Name=group-name,Values=default" | jq ".SecurityGroups[0].GroupId"

Use these values to update the missing fields subnets and securityGroupIds in compute_environment.yaml and run the code afterwards. If you have multiple subnets, choose any one of them.

For HPS-book readers, the file is also available here on GitHub.

! aws batch create-compute-environment --no-cli-pager --cli-input-yaml file://compute_environment.yaml
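
Creating the environment can take a minute or two. You can check that it reaches the VALID status with a command like the following (the jq filter is just one way of trimming the output):

# List compute environments with their current state and status
! aws batch describe-compute-environments | jq ".computeEnvironments[] | {name: .computeEnvironmentName, state: .state, status: .status}"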

2.2 Create a Job Queue#

Add the computeEnvironment and the jobQueueName in job_queue.yaml and then run the following command.

For HPS-book readers, the file is also available here on GitHub.

! aws batch create-job-queue --no-cli-pager --cli-input-yaml file://job_queue.yaml  
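
As with the compute environment, you can confirm that the queue was created and is in the VALID state (jq is optional here):

# List job queues with their current state and status
! aws batch describe-job-queues | jq ".jobQueues[] | {name: .jobQueueName, state: .state, status: .status}"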

2.3 Create a Job Definition#

Update the jobRoleArn and executionRoleArn fields in the job_definition.yaml file with the ARN of the role created in the first step (they should be the same in this case). Add a name for the jobDefinition and run the code below. Again, the job role ARN is in the format arn:aws:iam::<ACCOUNT_ID>:role/NoisePyBatchRole.

For HPS-book readers, the file is also available here.

! aws batch register-job-definition --no-cli-pager --cli-input-yaml file://job_definition.yaml
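
You can list the active job definitions to confirm the registration (a quick check; a new revision is appended each time you re-register the same name):

# List active job definitions and their revisions
! aws batch describe-job-definitions --status ACTIVE | jq ".jobDefinitions[] | {name: .jobDefinitionName, revision: .revision}"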

3. Submit the Job#

3.1 Cross-correlation Configuration#

Update config.yaml with your NoisePy configuration. Then copy the file to S3 so that the Batch job can access it after launching. Replace <S3_BUCKET> with the bucket created above, and choose an intermediate <PATH> to separate your runs from others.

For HPS-book readers, the file is also available here.

! aws s3 cp ./config.yaml s3://<S3_BUCKET>/<PATH>/config.yaml
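
You can verify the upload before submitting any jobs (substituting your own bucket and path):

# List the contents of the run prefix to confirm config.yaml was uploaded
! aws s3 ls s3://<S3_BUCKET>/<PATH>/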

3.2 Run Cross-correlation#

Update job_cc.yaml with the names of your jobQueue and jobDefinition created in the last steps. Also give your job a name in jobName. Then update the S3 bucket paths to the locations you want to use for the output and your config.yaml file.

For HPS-book readers, the file is also available here.

! aws batch submit-job --no-cli-pager --cli-input-yaml file://job_cc.yaml
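
The command returns a jobId, which you can use to monitor progress until the job reaches SUCCEEDED (or FAILED). The <JOB_QUEUE> and <JOB_ID> below are placeholders for your queue name and the ID returned by submit-job:

# List jobs on the queue (replace <JOB_QUEUE> with your queue name)
! aws batch list-jobs --job-queue <JOB_QUEUE>
# Check the status of a specific job (replace <JOB_ID> with the ID returned above)
! aws batch describe-jobs --jobs <JOB_ID> | jq ".jobs[0].status"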

3.3 Run Stacking#

Update job_stack.yaml with the names of your jobQueue and jobDefinition created in the last steps. Also give your job a name in jobName. Then update the S3 bucket paths to the locations you want to use for your input CCFs (e.g. the output of the previous CC run), and the stack output. By default, NoisePy will look for a config file in the --ccf_path location to use the same configuration for stacking that was used for cross-correlation.

For HPS-book readers, the file is also available here.

! aws batch submit-job --no-cli-pager --cli-input-yaml file://job_stack.yaml
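
Once the stacking job has succeeded, the stacked data should appear under the output location you configured in job_stack.yaml (shown here with a hypothetical <STACK_PATH> placeholder standing in for that location):

# Check the stacking job status (replace <JOB_ID> with the ID returned by submit-job)
! aws batch describe-jobs --jobs <JOB_ID> | jq ".jobs[0].status"
# List the stack output (replace <STACK_PATH> with your configured output prefix)
! aws s3 ls s3://<S3_BUCKET>/<STACK_PATH>/ --recursive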

4. Visualization#

You can use plot_stacks.ipynb to visualize the cross-correlations after all jobs have completed.