Monitor Cloud TPU health
Cloud TPU health monitoring provides real-time information on the health status of TPU VMs and slices. Using either the Cloud Monitoring console or Google Cloud CLI, you can identify hardware failures as they occur and reallocate resources, avoiding job failures and minimizing performance degradation.
TPU health monitoring assigns either a HEALTHY or UNHEALTHY status to each
TPU instance in use. TPUs functioning as expected are assigned a HEALTHY
status, and those that are no longer functional or performing at a severely
reduced level are assigned an UNHEALTHY status.
Unhealthy TPU detection
When TPU health monitoring assigns an UNHEALTHY status to a TPU, it provides
identifying information about the failing TPU through the
compute.googleapis.com/instance/tpu/infra_health metric.
TPU health monitoring provides the following metric labels to identify the specific unhealthy TPU:
| Label name | Type | Description |
|---|---|---|
| health_status | STRING | The overall health state of the TPU instance. |
| unhealthy_category | STRING | The health status cause. This label is populated only when the value of the metric is UNHEALTHY. |
| machine_type | STRING | The Compute Engine machine type of the instance. |
| reservation_id | STRING | The ID of the physical machine reservation. |
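The labels above can serve as grouping keys in a PromQL query, for example to see which causes account for the unhealthy instances. The following is a minimal sketch, assuming that each TPU instance reports one time series for this metric and that the metric labels are exposed to PromQL under the names listed in the table. The grouping is illustrative and is not a query taken from the prebuilt dashboard.

```shell
#!/usr/bin/env bash
# Sketch: count unhealthy TPU instances per unhealthy_category.
# Assumption: the metric labels in the table above are usable as PromQL
# label names exactly as written.
export PROMQL_QUERY='count by (unhealthy_category) ({__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance", health_status="UNHEALTHY"})'

# Print the query so it can be pasted into a custom dashboard query
# or sent to the Cloud Monitoring Prometheus API.
echo "${PROMQL_QUERY}"
```

Swapping `unhealthy_category` for `machine_type` or `reservation_id` would group the same count by machine type or reservation instead.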
TPU health monitoring also provides the following resource labels:
| Label name | Type | Description |
|---|---|---|
| project | STRING | The project number. |
| service | STRING | The API service (compute.googleapis.com). |
| resource_type | STRING | The VM instance. |
| location | STRING | The zone of the instance. |
| resource_id | STRING | The Compute Engine instance ID. |
TPU health monitoring provides the following additional metric labels for All Capacity mode reservations through TPU Cluster Director:
| Label name | Type | Description |
|---|---|---|
| machine_id | STRING | The ID of the physical machine hosting the VM. |
| block_id | STRING | The ID of the block within the cluster hosting the VM. |
| cluster_id | STRING | The ID of the cluster hosting the VM. |
| subblock_id | STRING | The ID of the sub-block hosting the VM. |
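For All Capacity mode reservations, the topology labels can help locate where failures cluster physically. The sketch below builds a PromQL query that counts unhealthy instances per block and sub-block within one cluster. The function name `unhealthy_by_block_query` and the cluster ID `my-cluster` are hypothetical, and the sketch assumes these labels are populated for your reservation.

```shell
#!/usr/bin/env bash
# Sketch: build a PromQL query that counts unhealthy TPU instances
# per (block_id, subblock_id) inside a single cluster.
METRIC='compute.googleapis.com/instance/tpu/infra_health'

# Hypothetical helper. $1 is the cluster ID to filter on.
unhealthy_by_block_query() {
  local cluster_id="$1"
  printf 'count by (block_id, subblock_id) ({__name__="%s", monitored_resource="gce_instance", health_status="UNHEALTHY", cluster_id="%s"})\n' \
    "${METRIC}" "${cluster_id}"
}

# "my-cluster" is a placeholder, not a real cluster ID.
unhealthy_by_block_query "my-cluster"
```

Adding `machine_id` to the `by (...)` list would narrow the result down to individual physical machines.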
Monitoring with the console
The Monitoring Dashboard in the Google Cloud console provides real-time visualizations of machine health status, historical trends, and total counts.
To view a prebuilt dashboard in Cloud Monitoring:
- In the Google Cloud console, go to the Cloud Monitoring page.
- In the navigation pane, click Dashboards.
- In the Filter search field, enter "TPU Bad Node".
The dashboard displays the following metrics by default:
- Overall TPU instance health: The percentage of TPU instances that are categorized as HEALTHY.
- Number of instances: The total number of TPU instances in use.
- Number of healthy TPU instances: The total number of TPU instances categorized as HEALTHY.
- Number of unhealthy TPU instances: The total number of TPU instances categorized as UNHEALTHY.
- VM infra health distribution: The number of HEALTHY instances over time.
To add a custom query, click Add Query and write a PromQL query. For example, the following query counts all unhealthy TPU instances:
count({__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance", health_status="UNHEALTHY"})
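A custom query can also approximate the dashboard's overall health percentage. The following is a minimal sketch, assuming that each TPU instance reports exactly one time series for this metric, so that the ratio of the two counts equals the share of healthy instances. It is not the query the prebuilt dashboard uses.

```shell
#!/usr/bin/env bash
# Sketch: PromQL for the percentage of TPU instances that are HEALTHY.
# SELECTOR holds the label matchers shared by both counts.
SELECTOR='__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance"'

# Healthy instances divided by all instances, scaled to a percentage.
# Note: if no instance is HEALTHY, the numerator is empty and the query
# returns no data rather than 0.
export PROMQL_QUERY="100 * count({${SELECTOR}, health_status=\"HEALTHY\"}) / count({${SELECTOR}})"

echo "${PROMQL_QUERY}"
```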
For more information on monitoring with the Google Cloud console, see Monitor Cloud TPU VMs.
Monitoring with the CLI
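You can also retrieve real-time TPU health data from the command line. The
following example request uses an access token from the gcloud CLI to query the
Cloud Monitoring API for the number of UNHEALTHY TPU instances: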
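# Placeholder value: replace with your own project ID.
export PROJECT_ID="your-project-id"
# Access token from Application Default Credentials.
export TOKEN=$(gcloud auth application-default print-access-token)
# PromQL query that counts TPU instances reporting an UNHEALTHY status.
export PROMQL_QUERY='count({__name__="compute.googleapis.com/instance/tpu/infra_health", monitored_resource="gce_instance", health_status="UNHEALTHY"})'
export LOCATION="global"
# -G sends the URL-encoded query as a GET request parameter.
curl -G \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
--header "X-Goog-User-Project: ${PROJECT_ID}" \
--data-urlencode "query=${PROMQL_QUERY}" \
"https://monitoring.googleapis.com/v1/projects/${PROJECT_ID}/location/${LOCATION}/prometheus/api/v1/query"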
You should expect to see a response similar to the following:
{"status":"success","data":{"resultType":"vector","result":[65]}}
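Before acting on the count, a script can confirm that the query itself succeeded. The following is a minimal sketch that checks the `status` field of the response with plain string matching. The function name `query_succeeded` is hypothetical, and the sample `RESPONSE` is copied from the output above. A JSON parser such as jq would be more robust if it is available.

```shell
#!/usr/bin/env bash
# Sketch: validate a query response before using its result.
# RESPONSE is the sample shown above; in practice it would hold the
# body returned by curl.
RESPONSE='{"status":"success","data":{"resultType":"vector","result":[65]}}'

# Hypothetical helper: succeeds only when the response contains
# "status":"success". Whitespace is stripped first so that
# pretty-printed JSON also matches. This is string matching,
# not real JSON parsing.
query_succeeded() {
  local compact
  compact="$(printf '%s' "$1" | tr -d '[:space:]')"
  case "${compact}" in
    *'"status":"success"'*) return 0 ;;
    *) return 1 ;;
  esac
}

if query_succeeded "${RESPONSE}"; then
  echo "query succeeded"
else
  echo "query failed" >&2
fi
```

A job launcher could run this check first and treat any other status as "health unknown" rather than as zero unhealthy instances.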
For more information on the gcloud CLI, see gcloud CLI overview.