Note: We strongly recommend creating a separate conda environment to run this tutorial. You can do so using:

conda create --name scoped python=3.12

Earthquake detection on AWS Batch with SeisBench and PyOcto#

In this tutorial, we use SeisBench and PyOcto to generate a deep learning earthquake catalog. We read the input data from the S3 repository of the NCEDC and write all picks and associated events to a MongoDB database. We use AWS Batch with Fargate to parallelise the computations.

Note: This tutorial focuses on the cloud integration of the tools described here. For a deeper dive into SeisBench and PyOcto, have a look at the tutorials on their GitHub pages.

This tutorial is based on the NoisePy on AWS Batch tutorial.

0. Background on earthquake catalog generation#

Before getting into the actual tutorial, let’s take a few lines to describe the workflow and the tools we are using. Earthquake catalog generation is typically a two-step process. First, a phase picker identifies a set of (potential) phase arrival times in a set of continuous waveforms. Second, these phases are passed to a phase associator that groups the phase arrivals into events by identifying which picks fit a consistent origin. This step also helps identify false picks, as these will usually not correspond to consistent onsets.

For phase detection and picking, we use the models integrated in SeisBench. SeisBench is a toolbox for machine learning in seismology, offering a wide selection of data sets, models, and training pipelines. In particular, it contains a collection of pretrained phase picking models, i.e., ready-to-use versions of, e.g., EQTransformer or PhaseNet trained on different datasets. In this tutorial, we will be using PhaseNet trained on the INSTANCE dataset, a large, well-curated dataset from Italy.
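
To make the picking step a little less abstract, here is a minimal sketch of how a pretrained SeisBench model picks phases on a short waveform snippet. This is for illustration only and is not part of the tutorial workflow; it assumes seisbench and obspy are installed locally, and the station code is just an arbitrary example.

# Minimal SeisBench picking sketch (illustration only, not part of the AWS workflow)
import seisbench.models as sbm
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# PhaseNet with weights trained on the INSTANCE dataset, as used in this tutorial
model = sbm.PhaseNet.from_pretrained("instance")

# Fetch one hour of data for an example NCEDC station (pick any NC station with data)
client = Client("NCEDC")
t0 = UTCDateTime("2019-06-21T00:00:00")
stream = client.get_waveforms("NC", "KCT", "*", "HH?", t0, t0 + 3600)

# Recent SeisBench versions return a ClassifyOutput object with a .picks attribute
picks = model.classify(stream).picks
for pick in picks:
    print(pick)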

For phase association, we will be using PyOcto. PyOcto is a high-throughput seismic phase associator. It was built specifically to deal with the high number of phase picks coming from modern deep learning pickers in dense seismic sequences. PyOcto internally uses an iterative 4D search scheme in space-time.
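
For a flavour of the PyOcto side, the following rough sketch is based on the examples in the PyOcto documentation; the velocity model values and the DataFrame column names are illustrative assumptions, so check the current PyOcto API before relying on it.

# Rough PyOcto association sketch (illustration only; verify against the PyOcto docs)
import pandas as pd
import pyocto

# Simple homogeneous velocity model; values are placeholders, not tuned for California
velocity_model = pyocto.VelocityModel0D(
    p_velocity=6.0,
    s_velocity=3.4,
    tolerance=2.0,
)

# Associator covering roughly the study area used later in this tutorial
associator = pyocto.OctoAssociator.from_area(
    lat=(39, 41),
    lon=(-125, -123),
    zlim=(0, 50),
    time_before=300,
    velocity_model=velocity_model,
)

# In practice, picks come from the deep learning picker and stations from the metadata;
# the column names here are assumptions based on the PyOcto documentation
picks = pd.DataFrame(columns=["station", "time", "probability", "phase"])
stations = pd.DataFrame(columns=["id", "latitude", "longitude", "elevation"])

events, assignments = associator.associate(picks, stations)
print(events)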

In this tutorial, we’ll treat the phase picker and the phase associator mostly as black boxes. We will interact with them through a prebuilt Docker container. We use this abstraction to focus on the AWS Batch aspects of the workflow, rather than the seismological aspects. If you’re interested in the inner workings of the code, the Dockerfile and all code used in this example are available in the Cloudwork repository.

1. Checklist and prerequisites#

1.1 Tools#

This tutorial can be executed both locally and on an AWS EC2 instance. Note that in both cases the actual computation happens in the AWS cloud within AWS Batch. The machine you’re working on, whether your local machine or an EC2 instance, is only used to submit and monitor the jobs.

If you’re running locally, you’ll need to install the AWS Command Line Interface (CLI). Note that the code cell below only works for x86_64 and requires appropriate permissions. If you’re not running on EC2 and use a different operating system or architecture, refer to the official AWS CLI installation instructions. Please note that the AWS CLI version in the Ubuntu package repository tends to be outdated and is not recommended.

# Install AWS CLI (Command line interface)
# This tool may already be installed if you are on a EC2 instance running Amazon Linux

! curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
! unzip awscliv2.zip
! sudo ./aws/install
# You can verify that the CLI is installed correctly with the following command,
# which lists the files in the NCEDC public bucket.

! aws s3 ls s3://ncedc-pds

1.2 Scripts, configuration files, Python dependencies#

The scripts and configuration files required for this tutorial are available on GitHub. You can download them with the following commands:

! wget https://github.com/SeisSCOPED/cloudwork/archive/refs/heads/main.zip -O cloudwork-main.zip
! unzip cloudwork-main.zip
cd cloudwork-main/sb_catalog

Once you’ve downloaded the scripts, you’ll need to install the dependencies.

!pip install -r requirements.txt

If you have a quick look at the installed software, you’ll notice that neither SeisBench nor PyOcto are among the requirements. That’s because these tools will not run on your local machine, but only in a prebuilt Docker container on AWS Batch.

1.3 AWS Account#

The account ID is a 12-digit number that uniquely identifies your account. You can find it on your AWS web console.
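
If you prefer the command line over the web console, you can also look up the account ID via the STS API. The snippet below assumes boto3 is installed and your AWS credentials are already configured (e.g., through aws configure or an instance role).

# Optional: retrieve the account ID programmatically via STS
import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
print(account_id)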

⚠️ Save the workshop <ACCOUNT_ID> here: REPLACE_ME

1.4 Role#

An AWS role is a virtual identity with specific permissions; its identifier (called an ARN) has the format arn:aws:iam::<ACCOUNT_ID>:role/<ROLE>. AWS Batch requires a role to be created for running the jobs. This can be done from the IAM panel on the AWS web console. Depending on the type of service to use, separate roles may be created. A specific role is required for the Batch service:

  • Trusted Entity Type: AWS Service

  • Use Case: Elastic Container Service

    • Elastic Container Service Task

  • Permission Policies, search and add:

    • AmazonECSTaskExecutionRolePolicy

Once the role is created, one more permission is needed:

  • Go to: Permissions tab –> Add Permissions –> Create inline policy

  • Search for “batch”

  • Click on Batch

  • Select Read / Describe Jobs

  • Click Next

  • Add a policy name, e.g. “Describe_Batch_Jobs”

  • Click Create Policy

⚠️ Workshop participants please use arn:aws:iam::<ACCOUNT_ID>:role/SeisBenchBatchRole
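
As an optional sanity check, you can confirm from Python that the role exists and inspect its trust policy. The role name below is the workshop role mentioned above; adjust it if you created your own.

# Optional check that the Batch/ECS task role exists (adjust the name if needed)
import boto3

iam = boto3.client("iam")
role = iam.get_role(RoleName="SeisBenchBatchRole")["Role"]
print(role["Arn"])
print(role["AssumeRolePolicyDocument"])  # should list ecs-tasks.amazonaws.com as the trusted service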

1.5 MongoDB Atlas#

In this tutorial, we’ll be using MongoDB Atlas as our database to store picks, events, and associations. To set up your database, go to https://cloud.mongodb.com/. Once you’ve created an account, you’ll have to create a cluster. Choose the free M0 tier on AWS. Make sure you select the same AWS region your computations will be running in.

To make your database accessible from AWS, go to Security -> Network Access and add “0.0.0.0/0” as an allowed IP. Warning: This makes your database publicly reachable, even though it will still require a login! It’s generally not considered good practice for a production system, but it keeps the setup simple for this tutorial.

Now you’ll have to generate a user. Go to Security -> Database Access. Create a new user with password authentication and select the role “Write and read any databases”.

To retrieve your connection URI, go to your database, click Connect, and select Drivers. There should be a URI with this format: mongodb+srv://<username>:<password>@???.???.mongodb.net/. Insert your username and password into the address listed in the interface and save it below.

⚠️ Save <DB_URI> here: REPLACE_ME
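
Before moving on, you may want to verify the connection string from Python. The snippet below assumes pymongo is available (install it with pip install "pymongo[srv]" if it is not already pulled in by the requirements).

# Optional: verify that the MongoDB Atlas cluster is reachable with your URI
from pymongo import MongoClient

db_uri = "mongodb+srv://<username>:<password>@???.???.mongodb.net/"  # your <DB_URI> from above
client = MongoClient(db_uri)
print(client.admin.command("ping"))  # should print {'ok': 1.0}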

Congratulations, you’ve finished the setup!

2. Setup Batch Jobs#

Hint: Throughout this chapter, you’ll be prompted to update values in configuration files and scripts. Make sure to apply the necessary modifications before running the commands.

2.1 Compute Environment#

You’ll need two pieces of information to create the compute environment: the network subnet and the security group. You can use the following commands to retrieve them.

! aws ec2 describe-subnets  | jq ".Subnets[] | .SubnetId"
! aws ec2 describe-security-groups --filters "Name=group-name,Values=default" | jq ".SecurityGroups[0].GroupId"

Use these values to update the missing fields subnets and securityGroupIds in compute_environment.yaml and run the code afterwards. If you have multiple subnets, choose any one of them.

! aws batch create-compute-environment --no-cli-pager --cli-input-yaml file://configs/compute_environment.yaml
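
If you want to double-check the result before moving on, the following snippet queries the state of the new compute environment; replace the placeholder with the computeEnvironmentName you set in compute_environment.yaml. It assumes boto3 is installed with configured credentials.

# Optional: confirm the compute environment is ENABLED and VALID
import boto3

batch = boto3.client("batch")
response = batch.describe_compute_environments(computeEnvironments=["<COMPUTE_ENV_NAME>"])
for ce in response["computeEnvironments"]:
    print(ce["computeEnvironmentName"], ce["state"], ce["status"])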

2.2 Create a Job Queue#

Add the computeEnvironment and the jobQueueName in job_queue.yaml and then run the following command.

! aws batch create-job-queue --no-cli-pager --cli-input-yaml file://configs/job_queue.yaml

2.3 Create the Job Definitions#

For this tutorial, we will use two job definitions: one for picking, one for association. Update the jobRoleArn and executionRoleArn fields in the two files job_definition_picking.yaml and job_definition_association.yaml with the ARN of the role created in the first step (they should be the same in this case). Add a name for the jobDefinition in each file and run the code below. Again, the job role ARN is in the format arn:aws:iam::<ACCOUNT_ID>:role/SeisBenchBatchRole.

! aws batch register-job-definition --no-cli-pager --cli-input-yaml file://configs/job_definition_picking.yaml
! aws batch register-job-definition --no-cli-pager --cli-input-yaml file://configs/job_definition_association.yaml
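
Similarly, you can verify that the job queue and both job definitions are registered. The names below are placeholders for the values in your YAML files.

# Optional: confirm the job queue and the job definitions were created
import boto3

batch = boto3.client("batch")

queues = batch.describe_job_queues(jobQueues=["<JOB_QUEUE_NAME>"])
for queue in queues["jobQueues"]:
    print(queue["jobQueueName"], queue["state"], queue["status"])

definitions = batch.describe_job_definitions(status="ACTIVE")
for job_definition in definitions["jobDefinitions"]:
    print(job_definition["jobDefinitionName"], "revision", job_definition["revision"])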

3. Building the catalog#

After everything has been set up, we can now start building our catalog. MongoDB databases are internally split into collections, which are further divided into separate tables. For our experiment, we’ll put everything into one collection. By default, let’s call the collection “tutorial”.

3.1 Populating the station database#

While everything is set up now, we’re missing a tiny piece of information: the available stations! In principle, we could parse all inventory files available on the NCEDC S3 bucket, but as this would take some time, we instead provide a precompiled file that just needs to be pushed into your MongoDB database.

! python -m src.station_helper ncedc_stations.csv --db_uri <DB_URI>

3.2 Submitting the picking and association jobs#

In this tutorial, we will use a Python script to submit the relevant picking and association jobs. Before submitting the jobs, you’ll need to provide the script with the names of the job queue and the two job definitions. Add them at the beginning of the parameters.py file.

We now submit the picking and association jobs for 10 days in a 2 by 2 degree region in Northern California.

! python -m src.submit 2019.172 2019.182 39,41,-125,-123 <DB_URI>

Now that the jobs are running, let’s take a moment to describe what the Python script did. The two job definitions are actually parameterized jobs, i.e., they can be further configured when launched by passing in parameters. For example, the picking jobs will get information on the stations and time range they are supposed to process. The submission script performs a few simple steps:

  • Identify the stations that are within the area of interest

  • Group the stations and days into reasonably sized chunks

  • Submit one picking job per chunk

  • Submit one association job per time range. This job depends on all picking jobs for this time range. To ensure it doesn’t start running too early, we use the dependency feature of AWS Batch: the submission script simply tells AWS Batch to start the association job only once all required picking jobs have finished. A simplified sketch of this mechanism is shown below.
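
The snippet below is a simplified, hypothetical sketch of how such a dependent submission looks with boto3; the job names, job definitions, and parameter keys are placeholders and do not match the ones used by the actual submission script.

# Simplified sketch of dependent job submission (placeholder names and parameters)
import boto3

batch = boto3.client("batch")

# Submit the picking jobs for one time range, one job per chunk of stations
picking_job_ids = []
for i, chunk in enumerate(["chunk0", "chunk1"]):
    response = batch.submit_job(
        jobName=f"picking-{i}",
        jobQueue="<JOB_QUEUE_NAME>",
        jobDefinition="<PICKING_JOB_DEFINITION>",
        parameters={"stations": chunk, "start": "2019.172", "end": "2019.182"},
    )
    picking_job_ids.append(response["jobId"])

# The association job is submitted immediately but only starts once all
# picking jobs above have finished, thanks to the dependsOn field
batch.submit_job(
    jobName="association",
    jobQueue="<JOB_QUEUE_NAME>",
    jobDefinition="<ASSOCIATION_JOB_DEFINITION>",
    parameters={"start": "2019.172", "end": "2019.182"},
    dependsOn=[{"jobId": job_id} for job_id in picking_job_ids],
)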

3.3 Monitoring job and result#

Let’s have a look at the progress of our catalog generation:

  • Go to the AWS web console and navigate to AWS Batch. You should see a list of jobs queued, currently running, and (hopefully soon) successfully finished. Make sure to click on your job queue to only see your own jobs. Alternatively, you can query the job states programmatically; see the sketch after this list.

  • Go to the MongoDB Atlas web interface. After a while, a table with picks should appear and start being populated. And once your first associations are done, you’ll see the list of events and the associated picks.
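
If you prefer to check progress from Python rather than the console, a simple option is to count the jobs per status in your queue. The queue name is a placeholder, and only the first page of results is counted, which is sufficient for this tutorial.

# Optional: count jobs per status in the queue instead of using the web console
import boto3

batch = boto3.client("batch")
statuses = ["SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING", "SUCCEEDED", "FAILED"]
for status in statuses:
    jobs = batch.list_jobs(jobQueue="<JOB_QUEUE_NAME>", jobStatus=status)["jobSummaryList"]
    print(f"{status:>9}: {len(jobs)} jobs")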

3.4. Visualization#

Once your jobs are finished, we can visualise the resulting catalog with the script below. After running the script, the figure is available as events.png. Note that to avoid additional dependencies, we resort to a very simplistic plot using a local coordinate projection.

! python -m src.plot <DB_URI>

4. Exercise#

Now that you’ve learned how to create a machine learning catalog using SeisBench and PyOcto on AWS Batch, it’s time for a small exercise. The goal of the exercise is to create a catalog using a different picker, e.g., EQTransformer, trained on a different dataset, e.g., ethz.

Hints:

  • To avoid reusing the existing picks from your first run, you should use a different collection (within the same database). Just use the --collection argument for the submit script.

  • You’ll need to set new parameters in the picking job definition. The parameters to pass to the docker container/pick script are called --model and --weight. Make sure to update the job definition on AWS using the AWS console.

  • If you want to be extra flexible, why not include the command line arguments as parameters of the job definition, e.g., Ref::model. With a little modification to the submit script, you could then even pass in the model and weights when submitting the jobs. A sketch of this idea is shown below.
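
To illustrate the last hint, here is a hypothetical sketch: if the container command in the job definition contains placeholders such as Ref::model and Ref::weight, AWS Batch substitutes them with the parameters supplied at submission time. The names below are illustrative, not the ones used by the tutorial’s job definitions.

# Hypothetical sketch: parameters passed at submission replace Ref:: placeholders in the
# job definition's container command, e.g. ["--model", "Ref::model", "--weight", "Ref::weight"]
import boto3

batch = boto3.client("batch")
batch.submit_job(
    jobName="picking-eqtransformer",
    jobQueue="<JOB_QUEUE_NAME>",
    jobDefinition="<PICKING_JOB_DEFINITION>",
    parameters={"model": "EQTransformer", "weight": "ethz"},
)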

Closing remarks#

You’ve now learned the basics of how to use batch processing to create a deep-learning-based earthquake catalog in the cloud within a few minutes. Clearly, this tutorial is rather simplified, so here are further points you might want to consider when building a catalog:

  • The picking and association tools have lots of tuning parameters and options, like thresholds or quality control criteria. Here we just hard-coded a set suitable for this tutorial. But why not add a few more parameters to the job definition?

  • We used a precompiled Docker container. While this is convenient, it also limits your flexibility to configure the behaviour of the tools. To exchange the docker container, you need to push your own container to a registry and edit the job definitions.

  • Practical earthquake detection workflows are often much more complex than the simplified version we provided. You might want to use a tool to estimate better locations, determine magnitudes, or otherwise characterise your seismicity in more detail. All of these steps can be executed on the cloud and within AWS Batch. However, the simplistic model of managing jobs and their dependencies that we used here will soon reach its limits. To solve this issue, there are workflow management systems like dask or Nextflow that help you manage such workloads and can operate on top of AWS Batch.