Workload monitoring with ML Diagnostics
Workload Monitoring is a built-in capability within the ML Diagnostics platform for detecting and diagnosing issues affecting AI/ML workloads running on TPUs on Google Cloud. For every GKE job, ML Diagnostics Workload Monitoring automatically creates an ML run and monitors TPU duty cycle across the workload to detect events that affect it.
ML Diagnostics Workload Monitoring has the following key features:
- Workload-to-node mapping: Automatically identify the Google Cloud nodes your workload is using. This is available by default for all TPU workloads on GKE.
- Status tracking: Monitor workload statuses for multiple states, including performance degradation, hang, and termination.
- Metric-based problem detection: Detect workload problems (performance degradation, hangs, terminations) by monitoring the TPU duty cycle. By default, a 15% drop in duty cycle triggers analysis.
- Automated diagnosis: If a problem is detected, automated analyzers run to find potential infrastructure or software causes. Analyzers include: ICI Link, Thermal, TPU throttling, and HBM capacity.
- Issue localization: Pinpoint the specific nodes affected by detected issues.
- Suggested actions: Recommended steps for recovery or further investigation, such as restarting a job, removing a potentially faulty node, or collecting XProf profiles.
- Console alerts: Show alerts and analysis results in the ML Diagnostics section of the Google Cloud console.
- Event timeline: Display a timeline of workload events alongside the TPU duty cycle metric to help correlate events with metric time series.
- Analysis details: Detailed reports for analyzers run for each event.
For more information on ML Diagnostics, see the ML Diagnostics platform overview.
Prerequisites
To use ML Diagnostics Workload Monitoring, enable the Cluster Director API and add the required IAM permissions. For more information, see ML Diagnostics prerequisites.
Since Workload Monitoring is in preview, contact your account team to enable ML Diagnostics workload monitoring in your TPU project.
Get started
When you submit your TPU job on GKE, the ML Diagnostics system automatically begins tracking your workload without requiring any manual code instrumentation. Diagnostic information is displayed in the ML Diagnostics Google Cloud console and can be retrieved using the ML Diagnostics CLI or API. Diagnostic information is listed for each Machine Learning Run (ML Run).
Workload Monitoring and analyzers
ML Diagnostics continuously monitors your workload performance by tracking the TPU duty cycle and how it relates to the underlying infrastructure. The default monitoring threshold is a 15% drop in TPU duty cycle. Workload states are tracked automatically and assigned one of the following values:
- running: The workload is actively executing. This state is not directly displayed in the Google Cloud console.
- termination: The workload terminated and is no longer running. This can be due to a successful completion or a failed job.
- hang: Detected by a prolonged period of minimal to no TPU activity without workload completion.
- performance degradation: Triggered by a significant drop in performance metrics like TPU duty cycle.
For problems detected in the workload, ML Diagnostics triggers a set of analyzers to help diagnose potential problems with the workload at the infrastructure and software layers. These analyzers identify infrastructure or software problems, pinpoint which nodes are affected, and suggest troubleshooting steps.
ICI Link Analyzer
ICI Link Analyzer detects potential networking issues on the Inter-Chip Interconnect (ICI) links that connect TPUs within a slice. The analyzer locates the specific nodes running the affected workload.
Thermal Analyzer
Thermal Analyzer monitors and detects TPU nodes that are overheating. Excessive temperatures can negatively affect workload performance. The analyzer identifies the specific nodes experiencing thermal problems, which can be caused by overheating in the tensor cores or High Bandwidth Memory (HBM).
TPU Throttling Analyzer
TPU Throttling Analyzer detects when TPU chip throttling occurs. This can be due to power, thermal, or other hardware constraints. The analyzer pinpoints the specific nodes where throttling is occurring.
HBM Capacity Analyzer
HBM Capacity Analyzer monitors the High Bandwidth Memory (HBM) utilization of TPU nodes. This analyzer detects TPU nodes where HBM utilization is either approaching the limit (around 90%) or causing out-of-memory (OOM) errors. Potential troubleshooting actions include collecting XProf profiling traces for detailed HBM memory usage analysis. You should also consider modifying workload configurations to optimize HBM usage, such as reducing the batch size or adjusting model parameters.
View Workload Monitoring in Google Cloud console
You can view your runs monitored by ML Diagnostics Workload Monitoring in the Google Cloud console, both under Cluster Director and under GKE.
To view all your machine learning runs in Cluster Director:
- In the Google Cloud console, go to the Cluster Director page.
- Click the Run Diagnostics tab.
Go to Cluster Director Run Diagnostics
To view all your machine learning runs in Google Kubernetes Engine:
- In the Google Cloud console, go to the Kubernetes page.
- In the navigation menu, click AI/ML.
- Click the Run Diagnostics tab.
Go to GKE AI/ML Run Diagnostics
The Cluster Director and GKE consoles display the following information:
- Run Summaries: The table contains the name of each run created by Workload Monitoring for a GKE job. Runs are automatically named using the job-name-timestamp format.
- Run Details (Monitoring Overview): Select an ML run to see detailed information within the "Monitoring Overview" tab. This tab contains:
- TPU duty cycle: A graph of the TPU duty cycle for your ML run.
- Events timeline: A timeline of all events that have affected the ML run, including performance degradations, hangs, and terminations.
- Events Table: This table summarizes events that affected the ML run.
For each event, it includes:
- Event name and event type.
- Start time and end time of the event.
- Links to details from analyzers.
- Event Details: Select View details for an event to see all analyzer results for that event.
Retrieve ML run names
To interact with Workload Monitoring data using Google Cloud CLI commands or
direct API calls, you need the MachineLearningRun name assigned to your
GKE job. Workload Monitoring automatically generates this name in
the job-name-timestamp format. You can use the Cluster Director
(hypercomputecluster.googleapis.com) API to list all MachineLearningRun
resources within your project and location.
To list the MachineLearningRun resources, send a GET request to the
/v1alpha/{parent=projects/*/locations/*}/machineLearningRuns endpoint. The
following curl command demonstrates how to do this:
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://hypercomputecluster.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/machineLearningRuns"
The API responds with a JSON object containing a machineLearningRuns array.
Each element in this array represents an ML run and includes a name field.
This field holds the full resource name, in the format:
projects/PROJECT_ID/locations/LOCATION/machineLearningRuns/job-name-timestamp.
You can find the ML run you need by matching the job name and timestamp in this
name field.
The following is an example of the JSON response structure:
{
"machineLearningRuns": [
{
"name": "projects/user-project-id/locations/us-central1/machineLearningRuns/my-tpu-job-20260327T073000",
// ... other MachineLearningRun fields
},
{
"name": "projects/user-project-id/locations/us-central1/machineLearningRuns/another-tpu-job-20260327T080000",
// ... other MachineLearningRun fields
}
// ... other runs
]
}
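Once you have the response, you can match runs client-side. The following Python sketch is a hypothetical helper, not part of any Google SDK; it parses a response body with the structure shown above and returns the full resource names whose run ID matches a given job name (run IDs use the job-name-timestamp format, so a prefix match on the last path segment works):

```python
import json

def find_runs_by_job(response_body: str, job_name: str) -> list[str]:
    """Return full resource names of ML runs whose run ID starts with job_name.

    Run IDs follow the job-name-timestamp format, so a prefix match on the
    last path segment recovers all runs for a given GKE job.
    """
    runs = json.loads(response_body).get("machineLearningRuns", [])
    matches = []
    for run in runs:
        run_id = run["name"].rsplit("/", 1)[-1]  # e.g. my-tpu-job-20260327T073000
        if run_id.startswith(job_name + "-"):
            matches.append(run["name"])
    return matches

# Example, using the response structure shown above (illustrative names):
body = json.dumps({"machineLearningRuns": [
    {"name": "projects/p/locations/us-central1/machineLearningRuns/my-tpu-job-20260327T073000"},
    {"name": "projects/p/locations/us-central1/machineLearningRuns/another-tpu-job-20260327T080000"},
]})
print(find_runs_by_job(body, "my-tpu-job"))
# ['projects/p/locations/us-central1/machineLearningRuns/my-tpu-job-20260327T073000']
```

The prefix match includes the trailing hyphen so that a job named my-tpu-job does not also match my-tpu-job-v2.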
Access Workload Monitoring information through Google Cloud CLI
For every GKE job, Workload Monitoring automatically creates a
Machine Learning Run within ML Diagnostics with an ML run name using the
job-name-timestamp format. You can use ML Diagnostics gcloud CLI
commands such as Describe and List to view all ML runs created in your
project, and also details of those runs. You cannot use Create, Delete, or
Update methods on ML Runs created by Workload Monitoring. For more
information, see Get started with the ML Diagnostics
CLI.
Access Workload Monitoring information through the API
You can access the events detected by Workload Monitoring through the "GET"
method for the MonitoredEvent API.
List all monitored events
To list all monitored events (MonitoredEvent) for a given ML run, use the
"GET" method with the
/v1alpha/{parent=projects/*/locations/*/machineLearningRuns/*}/monitoredEvents
endpoint.
The following is an example request:
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://hypercomputecluster.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/machineLearningRuns/ML_RUN_ID/monitoredEvents"
The output is a ListMonitoredEventsResponse object, which is a JSON object
that contains a list of MonitoredEvent summaries. The summaries provide an
overview of event timestamps and details.
The following is an example of a ListMonitoredEventsResponse object:
{
"monitoredEvents": [
{
"name": "projects/user-project-id/locations/us-central1/machineLearningRuns/my-tpu-job-20260327T073000/monitoredEvents/event-id-456",
"type": "PERFORMANCE_DEGRADATION",
"display_name": "TPU Duty Cycle Drop",
"start_time": "2026-03-27T07:45:10Z"
},
{
"name": "projects/user-project-id/locations/us-central1/machineLearningRuns/my-tpu-job-20260327T073000/monitoredEvents/event-id-123",
"type": "PERFORMANCE_DEGRADATION",
"display_name": "TPU Duty Cycle Drop",
"start_time": "2026-03-27T07:35:00Z"
}
// ... other events
]
}
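As a client-side illustration (not an API feature), the sketch below parses a ListMonitoredEventsResponse like the one above and returns the names of events of a given type in chronological order. The field names are taken from the sample response; other fields may be present:

```python
import json

def events_of_type(response_body: str, event_type: str) -> list[str]:
    """Return names of monitored events of the given type, oldest first.

    The summaries carry RFC 3339 start_time strings in UTC, which sort
    chronologically as plain strings.
    """
    events = json.loads(response_body).get("monitoredEvents", [])
    selected = [e for e in events if e.get("type") == event_type]
    selected.sort(key=lambda e: e["start_time"])
    return [e["name"] for e in selected]

# Example with the two events from the sample response (names shortened):
body = json.dumps({"monitoredEvents": [
    {"name": "runs/r/monitoredEvents/event-id-456",
     "type": "PERFORMANCE_DEGRADATION", "start_time": "2026-03-27T07:45:10Z"},
    {"name": "runs/r/monitoredEvents/event-id-123",
     "type": "PERFORMANCE_DEGRADATION", "start_time": "2026-03-27T07:35:00Z"},
]})
print(events_of_type(body, "PERFORMANCE_DEGRADATION"))
# event-id-123 (07:35) sorts before event-id-456 (07:45)
```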
Get monitored events
To retrieve details of a specific monitored event (MonitoredEvent), send a
"GET" request to the
/v1alpha/{parent=projects/*/locations/*/machineLearningRuns/*}/monitoredEvents/*
endpoint.
The following is an example request:
curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://hypercomputecluster.googleapis.com/v1alpha/projects/PROJECT_ID/locations/LOCATION/machineLearningRuns/ML_RUN_ID/monitoredEvents/EVENT_ID"
The output is a MonitoredEvent object, which is a JSON object that contains
information about the specified event. The analyzer_reports array within this
object contains the detailed findings, status, and suggestions from each
analyzer (ICI Link, Thermal, TPU Throttling, HBM Capacity) that was run in
response to the detected event.
The following is an example of a MonitoredEvent object:
{
"name": "projects/PROJECT_ID/locations/LOCATION/machineLearningRuns/ML_RUN_ID/monitoredEvents/EVENT_ID",
"type": "PERFORMANCE_DEGRADATION",
"display_name": "TPU Duty Cycle Drop",
"start_time": "2026-03-23T12:00:00Z",
// ... other MonitoredEvent specific fields like end_time, etc.
"analyzer_reports": [
{
"analyzer": "TPU Duty Cycle Analyzer",
"detection_state": "DETECTED",
"details": "TPU duty cycle dropped by 20% below average.",
"recommended_actions": [
{
"description": "Investigate node health and running processes.",
"documentation_url": "https://docs.cloud.google.com/tpu/docs/ml-diagnostics/workload-monitoring"
}
]
},
{
"analyzer": "ICI Link analyzer",
// ... results from ICI Link analyzer
}
// ... other analyzer reports
]
}
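To act on an event programmatically, you can walk the analyzer_reports array and collect suggestions from analyzers that detected a problem. The following Python sketch is a hypothetical helper based on the MonitoredEvent structure shown above (field names assumed from the sample):

```python
import json

def detected_actions(event_body: str) -> list[tuple[str, str]]:
    """Return (analyzer, action description) pairs for analyzers whose
    detection_state is DETECTED, per the MonitoredEvent structure above."""
    event = json.loads(event_body)
    actions = []
    for report in event.get("analyzer_reports", []):
        if report.get("detection_state") != "DETECTED":
            continue
        for action in report.get("recommended_actions", []):
            actions.append((report["analyzer"], action["description"]))
    return actions

# Example, mirroring the sample MonitoredEvent (second report illustrative):
body = json.dumps({
    "analyzer_reports": [
        {"analyzer": "TPU Duty Cycle Analyzer",
         "detection_state": "DETECTED",
         "recommended_actions": [
             {"description": "Investigate node health and running processes."}]},
        {"analyzer": "ICI Link analyzer",
         "detection_state": "NOT_DETECTED"},
    ]
})
print(detected_actions(body))
```

Skipping reports without a DETECTED state keeps the output focused on analyzers that actually found a contributing cause.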