SpecUFEx Tutorial: Amatrice, Italy October 2016#

Written by Theresa Sawi and Nate Groebner

This example walks through fitting a SpecUFEx\(^0\) model to seismograms of approximately 1,000 earthquakes from the first 6 days of the October 2016 Amatrice (Italy) earthquake sequence. This is a subset of the events detected via Quakeflow\(^1\) (PhaseNet\(^2\) + Gamma\(^3\) + HypoInverse\(^4\) + HypoDD\(^5\)) by Kaiwen Wang and Felix Waldhauser. We cluster the features extracted by SpecUFEx with K-means clustering to identify patterns of earthquakes.

We’ve color coded the markdown cells in this notebook to indicate where you can edit the code cells:#

Example. Feel free to the cell below this box#


Installation instructions#

Create New Instance#

  1. Log into AWS

  2. Create new instance with all default settings except: A. Instance type: t2: large B. Network Settings > Select “default” security group

  3. Launch instance: A. Go to instance console, find the instance summary and click the “Connect” button on the upper left. B. Navigate to “EC2 Instance Connect”, use all default settings, and click connect. C. A Linux terminal for an empty machine should open automatically

Set up environment#

  1. In your terminal, type these commands to set up the instance environment:

    sudo yum install -y git docker
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    chmod +x Miniconda3-latest-Linux-x86_64.sh
    ./Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
    ./miniconda/bin/conda init bash
    bash
    
  2. To open the notebook, copy and paste the public-facing DNS address (“Public IPv4 DNS”) from the Instance Console into your browser (e.g., ec2-18-222-100-129.us-east-2.compute.amazonaws.com). Enter “scoped” as the token password

#Start docker#

sudo service docker start
sudo systemctl enable docker
sudo usermod -aG docker ec2-user
sudo chmod 666 /var/run/docker.sock

#Pull and run tutorial Docker image#

sudo docker pull ghcr.io/seisscoped/specufex:tutorial
sudo docker run -p 80:8888 --rm -it ghcr.io/seisscoped/specufex:tutorial \
    nohup jupyter lab --no-browser --ip=0.0.0.0 --allow-root --IdentityProvider.token=scoped &
http://ec2-18-222-100-129.us-east-2.compute.amazonaws.com
##PASSWORD TOKEN:
scoped
  1. Open the “amatrice_tutorial.ipynb” notebook :)


Tutorial Steps#

  1. Read in waveforms from hdf5 file. These are 20-second records from a short-period seismograph filtered between 1-40 Hz, windowed 3 seconds before and 17 seconds after the P-wave picks on the vertical component from station IV.T1412, which is located roughly in the middle of the earthquake sequence.

  2. Convert waveforms to spectrograms (filtered and median normalized).

  3. Run SpecUFEx on spectrograms

  4. K-means clustering on SpecUFEx fingerprints (evaluated using Silhouette scores)

  5. View spatial and magnitude trends of the clusters


SpecUFEx Publications#

See recent publications featuring SpecUFEx:

  1. Sawi, T., Waldhauser, F., Holtzman, B., Groebner N. (2023) Detecting repeating earthquakes on the San Andreas Fault with unsupervised machine-learning of spectrograms. The Seismic Record; 3 (4): 376–384. Link

  2. Sawi, T., Holtzman, B., Walter, F., & Paisley, J. (2022) An unsupervised machine-learning approach to understanding seismicity at an alpine glacier. Journal of Geophysical Research: Earth Surface, 127, e2022JF006909. Link


References#

  1. Holtzman, B. K., Paté, A., Paisley, J., Waldhauser, F., & Repetto, D. (2018). Machine learning reveals cyclic changes in seismic source spectra in Geysers geothermal field. Science Advances, 4, 5. https://doi.org/10.1126/sciadv.aao2929

  2. Zhu, W., Hou, A. B., Yang, R., Datta, A., Mostafa Mousavi, S., Ellsworth, W. L., & Beroza, G. C. (2023). QuakeFlow: a scalable machine-learning-based earthquake monitoring workflow with cloud computing. Geophysical Journal International, 232(1), 684–693. https://doi.org/10.1093/gji/ggac355

  3. Zhu, W., & Beroza, G. C. (2018). PhaseNet: A Deep-Neural-Network-Based Seismic Arrival Time Picking Method. Geophysical Journal International, 216(1), 261–273. https://doi.org/10.1093/gji/ggy423

  4. Zhu, W., McBrearty, I. W., Mousavi, S. M., Ellsworth, W. L., & Beroza, G. C. (2022). Earthquake phase association using a Bayesian Gaussian Mixture Model. Journal of Geophysical Research: Solid Earth, 127, e2021JB023249. https://doi.org/10.1029/2021JB023249

  5. Klein, F. W. (2002). User’s guide to HYPOINVERSE-2000, a Fortran program to solve for earthquake locations and magnitudes. In Open-File Report. https://doi.org/10.3133/ofr02171

  6. Waldhauser, F., HypoDD: A computer program to compute double-difference earthquake locations, USGS Open File Rep., 01-113, 2001. pdf

Imports#

import time
from datetime import datetime
import h5py
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import scipy.signal as sp
from sklearn.cluster import KMeans
from tqdm import trange
from specufex import BayesianNonparametricNMF, BayesianHMM
import tqdm
from sklearn.metrics import silhouette_samples
from matplotlib.gridspec import GridSpec
from matplotlib.colors import ListedColormap
import matplotlib.gridspec as gridspec

np.random.seed(42)

Define helper functions#

  1. calcSilhScore: Calculates the optimal number of clusters and silhouette scores for K-means clustering based on user-defined range of clusters

  2. getSgram Get spectrogram from H5 file based on event ID

def calcSilhScore(X, range_n_clusters=np.arange(2, 11)):
    """
    Calculates the optimal number of clusters and silhouette scores for K-means clustering.

    Parameters:
    -----------
    X : array-like
        The input dataset to be clustered.
    range_n_clusters : list or range
        A range of integers specifying the number of clusters to evaluate.

    Returns:
    --------

    ## Return avg silh scores, avg SSEs, and Kopt for 2:Kmax clusters
    ## Returns altered cat00 dataframe with cluster labels and SS scores,
    ## Returns NEW catall dataframe with highest SS scores

    Kopt : int
        The optimal number of clusters.
    maxSilScore : float
        The maximum silhouette score achieved.
    avgSils : list
        A list of average silhouette scores for each number of clusters.
    sse : list
        A list of sum of squared errors for each number of clusters.
    cluster_labels_best : array-like
        cluster labels for the best clustering result.
    ss_best : array-like
        Silhouette scores for each sample in the best clustering result.
    euc_dist_best : array-like
        Euclidean distances to the centroids for each sample in the best clustering result.
    """

    maxSilScore = 0

    sse = []
    avgSils = []
    centers = []

    for n_clusters in range_n_clusters:
        print(f"kmeans on {n_clusters} clusters...")

        kmeans = KMeans(
            n_clusters=n_clusters,
            max_iter=500,
            init="k-means++",  # how to choose init. centroid
            n_init=10,  # number of Kmeans runs
            random_state=0,
        )  # set rand state

        # get cluster labels
        cluster_labels_0 = kmeans.fit_predict(X)

        # increment labels by one to match John's old kmeans code
        cluster_labels = [int(ccl) + 1 for ccl in cluster_labels_0]

        # get euclid dist to centroid for each point
        sqr_dist = kmeans.transform(X) ** 2  # transform X to cluster-distance space.
        sum_sqr_dist = sqr_dist.sum(axis=1)
        euc_dist = np.sqrt(sum_sqr_dist)

        # save centroids
        centers.append(kmeans.cluster_centers_)

        # kmeans loss function
        sse.append(kmeans.inertia_)

        # Compute the silhouette scores for each sample
        sample_silhouette_values = silhouette_samples(X, cluster_labels)

        #         %  Silhouette avg
        avgSil = np.mean(sample_silhouette_values)

        # avgSil = np.median(sample_silhouette_values)

        avgSils.append(avgSil)
        if avgSil > maxSilScore:
            Kopt = n_clusters
            maxSilScore = avgSil
            cluster_labels_best = cluster_labels
            euc_dist_best = euc_dist
            ss_best = sample_silhouette_values

    print(f"Best cluster: {Kopt}")

    return Kopt, maxSilScore, avgSils, sse, cluster_labels_best, ss_best, euc_dist_best


def getSgram(evID, SpecUFEx_H5_path):
    """
    Get spectrogram from H5 file based on event ID
    """
    with h5py.File(SpecUFEx_H5_path, "r") as fileLoad:
        sgram = fileLoad["spectrograms"].get(str(evID))[:]
        fSTFT = fileLoad["spectrograms/fSTFT"][()]
        tSTFT = fileLoad["spectrograms/tSTFT"][()]

    return tSTFT, fSTFT, sgram

Definitions and parameters#

Here, we define station location, file paths, as well as spectrogram-generation, SpecUFEx, and clustering parameters.

Guidelines for spectrogram parameters:#

Bandpass filter for spectrogram (optional)#

fMin = 5 # Hz

fMax = 20 # Hz

Spectrogram parameters#

sgramMode = ‘magnitude’ # returns absolute magnitude of the STFT

sgramScaling = ‘spectrum’ # calculate power spectrum where Sxx has units of V**2, if x is measured in V and fs is measured in Hz

Frequency/time resolution#

nperseg = 64 # Length of each segment in samples

noverlap = nperseg / 4 #Number of samples to overlap between segments

nfft = 512 # Length of the FFT used in samples, if a zero padded FFT is desired. If None, the FFT length is nperseg. Defaults to None.