Monitor Cloud TPU health

Cloud TPU health monitoring provides real-time information on the health status of TPU VMs and slices. Using either the Cloud Monitoring console or the Google Cloud CLI, you can identify hardware failures as they occur and reallocate resources, which helps you avoid job failures and minimize performance degradation.

TPU health monitoring assigns either a HEALTHY or an UNHEALTHY status to each TPU instance in use. TPUs that function as expected are assigned a HEALTHY status. TPUs that are no longer functional, or that perform at a severely reduced level, are assigned an UNHEALTHY status.

Unhealthy TPU detection

When TPU health monitoring assigns an UNHEALTHY status to a TPU, it provides identifying information about the failing TPU through the compute.googleapis.com/instance/tpu/infra_health metric.

TPU health monitoring provides the following metric labels to identify the specific unhealthy TPU:

Label name          Type    Description
health_status       STRING  The overall health state of the TPU instance.
unhealthy_category  STRING  The health status cause. This label is populated only when the value of the metric is UNHEALTHY.
machine_type        STRING  The Compute Engine machine type of the instance.
reservation_id      STRING  The ID of the physical machine reservation.
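
As an illustration of how these labels might be consumed, the following Python sketch turns a label mapping into a typed record. It assumes the labels arrive as a plain dictionary of strings, as they would in a Prometheus-style query result. The `TpuHealthLabels` class and the `parse_health_labels` helper are hypothetical names introduced for this example and are not part of any Google Cloud library.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TpuHealthLabels:
    """Typed view of the metric labels listed in the table above."""

    health_status: str
    unhealthy_category: Optional[str]
    machine_type: Optional[str]
    reservation_id: Optional[str]


def parse_health_labels(labels: Mapping[str, str]) -> TpuHealthLabels:
    """Build a record from a label mapping (hypothetical helper).

    Assumes `labels` is a dict of label name to string value.
    """
    status = labels["health_status"]
    # unhealthy_category is only populated for UNHEALTHY instances,
    # so ignore it (and treat empty strings as missing) otherwise.
    category = labels.get("unhealthy_category") or None
    return TpuHealthLabels(
        health_status=status,
        unhealthy_category=category if status == "UNHEALTHY" else None,
        machine_type=labels.get("machine_type") or None,
        reservation_id=labels.get("reservation_id") or None,
    )
```

Treating an empty `unhealthy_category` as missing keeps downstream code from having to distinguish between an absent label and a blank one.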

TPU health monitoring also provides the following resource labels:

Label name     Type    Description
project        STRING  The project number.
service        STRING  The API service (compute.googleapis.com).
resource_type  STRING  The VM instance.
location       STRING  The zone of the instance.
resource_id    STRING  The Compute Engine instance ID.

TPU health monitoring provides the following additional metric labels for All Capacity mode reservations through TPU Cluster Director:

Label name   Type    Description
machine_id   STRING  The ID of the physical machine hosting the VM.
block_id     STRING  The ID of the block within the cluster hosting the VM.
cluster_id   STRING  The ID of the cluster hosting the VM.
subblock_id  STRING  The ID of the sub-block hosting the VM.
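
The topology labels make it possible to see whether unhealthy instances cluster in one part of a reservation. The following Python sketch counts UNHEALTHY instances per (cluster, block, sub-block). It assumes each time series is available as a dictionary of label names to string values. The function name `count_unhealthy_by_subblock` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple

TopologyKey = Tuple[str, str, str]


def count_unhealthy_by_subblock(
    series: Iterable[Mapping[str, str]],
) -> "Counter[TopologyKey]":
    """Count UNHEALTHY instances per (cluster_id, block_id, subblock_id).

    Assumes each item in `series` is a dict of label name to value.
    Missing topology labels are recorded as empty strings.
    """
    counts: "Counter[TopologyKey]" = Counter()
    for labels in series:
        if labels.get("health_status") != "UNHEALTHY":
            continue  # Only unhealthy instances are of interest here.
        key = (
            labels.get("cluster_id", ""),
            labels.get("block_id", ""),
            labels.get("subblock_id", ""),
        )
        counts[key] += 1
    return counts
```

Calling `counts.most_common(1)` on the result returns the sub-block with the most unhealthy instances, which can be a starting point when deciding where to reallocate work.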

Monitoring with the console

The Monitoring Dashboard in the Google Cloud console provides real-time visualizations of machine health status, historical trends, and total counts.

To view a prebuilt dashboard in Cloud Monitoring:

  1. In the Google Cloud console, go to the Cloud Monitoring page.
  2. In the navigation pane, click Dashboards.
  3. In the Filter search field, enter "TPU Bad Node".

The dashboard displays the following metrics by default:

  • Overall TPU instance health: The percentage of TPU instances that are categorized as HEALTHY.
  • Number of instances: The total number of TPU instances in use.
  • Number of healthy TPU instances: The total number of TPU instances categorized as HEALTHY.
  • Number of unhealthy TPU instances: The total number of TPU instances categorized as UNHEALTHY.
  • VM infra health distribution: The number of HEALTHY instances over time.
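
To make the relationship between these dashboard values concrete, the following Python sketch derives them from a list of per-instance statuses. It assumes the overall health percentage is the healthy count divided by the total instance count, which matches the description above but is not confirmed as the dashboard's exact formula. The function name `summarize_health` is hypothetical.

```python
from typing import Dict, Iterable, Optional, Union

Number = Union[int, float]


def summarize_health(statuses: Iterable[str]) -> Dict[str, Optional[Number]]:
    """Summarize per-instance statuses into dashboard-style totals.

    Assumes each status is the string "HEALTHY" or "UNHEALTHY".
    """
    status_list = list(statuses)
    total = len(status_list)
    healthy = status_list.count("HEALTHY")
    unhealthy = status_list.count("UNHEALTHY")
    # Avoid dividing by zero when no instances are in use.
    healthy_percent = 100.0 * healthy / total if total else None
    return {
        "instances": total,
        "healthy": healthy,
        "unhealthy": unhealthy,
        "healthy_percent": healthy_percent,
    }
```

For example, three HEALTHY instances and one UNHEALTHY instance give a `healthy_percent` of 75.0.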

To add a custom query, click Add Query and write a PromQL query. For example, the following query returns the number of TPU instances that currently have an UNHEALTHY status:

count({__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance", health_status="UNHEALTHY"})
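
Variations of this query can be generated programmatically. The following Python sketch rebuilds the query above and optionally groups the count by one or more labels, such as unhealthy_category, using the standard PromQL `count by (...)` aggregation. The `build_health_query` helper is a hypothetical name, and grouping by label is an illustration rather than a query taken from the dashboard.

```python
from typing import Sequence

METRIC_NAME = "compute.googleapis.com/instance/tpu/infra_health"
VALID_STATUSES = ("HEALTHY", "UNHEALTHY")


def build_health_query(
    health_status: str = "UNHEALTHY",
    group_by: Sequence[str] = (),
) -> str:
    """Return a PromQL query that counts TPU instances by health status.

    `group_by` is an optional list of label names to aggregate over.
    """
    if health_status not in VALID_STATUSES:
        raise ValueError(f"health_status must be one of {VALID_STATUSES}")
    selector = (
        f'{{__name__="{METRIC_NAME}", '
        f'monitored_resource="gce_instance", '
        f'health_status="{health_status}"}}'
    )
    if group_by:
        # PromQL aggregation grouped by the given labels.
        return f"count by ({', '.join(group_by)}) ({selector})"
    return f"count({selector})"
```

Calling `build_health_query()` with no arguments reproduces the query shown above, and `build_health_query(group_by=["unhealthy_category"])` returns one count per failure category.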

For more information on monitoring with the Google Cloud console, see Monitor Cloud TPU VMs.

Monitoring with the CLI

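You can also retrieve real-time TPU health data from the command line. The following example uses the Google Cloud CLI to obtain an access token and curl to send a PromQL query that returns the number of UNHEALTHY TPU instances: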

# Replace with your Google Cloud project ID.
export PROJECT_ID="your-project-id"

# Access token from Application Default Credentials.
export TOKEN=$(gcloud auth application-default print-access-token)

# PromQL query that counts the TPU instances reporting UNHEALTHY.
export PROMQL_QUERY='count({__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance", health_status="UNHEALTHY"})'

export LOCATION="global"

# -G sends the URL-encoded query as GET parameters.
curl -G \
  --header "Authorization: Bearer ${TOKEN}" \
  --header "Content-Type: application/json" \
  --header "X-Goog-User-Project: ${PROJECT_ID}" \
  --data-urlencode "query=${PROMQL_QUERY}" \
  "https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/${LOCATION}/prometheus/api/v1/query"

The response is similar to the following:

{"status":"success","data":{"resultType":"vector","result":[65]}}
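
The same request can be scripted. The following Python sketch builds the request URL used by the curl command above and extracts the count from the response body, using only the standard library. It assumes the access token is supplied by the caller, for example from the gcloud command shown earlier. It accepts both the abbreviated result shown above and the standard Prometheus HTTP API shape, in which each result entry carries a `value` pair of timestamp and string sample. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(project_id: str, promql: str, location: str = "global") -> str:
    """Return the Prometheus query URL for a project, with the query encoded."""
    base = (
        "https://monitoring.googleapis.com/v1/projects/"
        f"{project_id}/location/{location}/prometheus/api/v1/query"
    )
    return base + "?" + urllib.parse.urlencode({"query": promql})


def extract_count(response_body: str) -> int:
    """Extract the instance count from a query response body."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        return 0  # count() over no matching series yields an empty vector.
    first = result[0]
    if isinstance(first, dict):
        # Standard Prometheus shape: {"metric": {...}, "value": [ts, "65"]}.
        return int(float(first["value"][1]))
    # Abbreviated shape, as in the sample response above: a bare number.
    return int(first)


def fetch_unhealthy_count(project_id: str, token: str, promql: str) -> int:
    """Send the query and return the count (requires network access)."""
    request = urllib.request.Request(
        build_query_url(project_id, promql),
        headers={
            "Authorization": f"Bearer {token}",
            "X-Goog-User-Project": project_id,
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_count(response.read().decode("utf-8"))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without credentials.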

For more information on the gcloud CLI, see gcloud CLI overview.