Monitor TPUs
This guide explains how to use Cloud Monitoring to monitor your TPU VMs. Cloud Monitoring automatically collects metrics and logs from your TPU and its host VM. You can use this data to monitor the health of your TPU and its Compute Engine host VM.
Metrics let you track a numerical quantity over time, for example, CPU utilization, network usage, or TensorCore idle duration. Logs capture events at a specific point in time. Log entries are written by your own code, Google Cloud services, third-party applications, and the Google Cloud infrastructure. You can generate metrics from the data in a log entry by creating a log-based metric, and you can set alert policies based on metric values or log entries.
To monitor TPUs, you can also use Capacity Planner (Preview). With Capacity Planner, you can view TPU usage and forecast data for your project, folder, or organization. This data updates every 24 hours, and you can use it to analyze usage trends and plan for future capacity needs. For more information, see Capacity Planner overview.
Access TPU metrics
Compute Engine generates two types of TPU metrics: TPU runtime metrics and TPU VM infrastructure metrics. You can get the metrics in two ways:
- TPU Monitoring Library: Get TPU runtime metrics from the LibTPU SDK using the TPU Monitoring Library. This enables your applications to get real-time telemetry from inside the guest environment. For more information, see TPU Monitoring Library.
- AI Telemetry Collector: Get runtime metrics and VM infrastructure metrics through the AI Telemetry Collector. The AI Telemetry Collector runs inside the TPU VM and lets you access metrics through Cloud Monitoring or through your own Prometheus-based monitoring pipeline. For more information, see AI Telemetry Collector.
TPU metrics
Google Cloud metrics for Cloud TPU are automatically generated by Compute Engine VMs and the Cloud TPU runtime. The metrics in the following table are generated by Compute Engine VMs.
The "metric type" strings in this table must be prefixed with
compute.googleapis.com/. That prefix has been omitted from the entries in the
table. When querying a label, use the metric.labels prefix; for example,
metric.labels.LABEL="VALUE".
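As a minimal sketch of the naming rule above, the following Python helper builds a Cloud Monitoring filter string from a metric type written without its prefix, plus optional labels. The helper name `build_tpu_filter` and the example label value are hypothetical; only the `compute.googleapis.com/` prefix and the `metric.labels.LABEL="VALUE"` form come from this guide.

```python
# Prefix that this guide says is omitted from the metric types in the table.
PREFIX = "compute.googleapis.com/"


def build_tpu_filter(short_metric_type, labels=None):
    """Return a filter string for a metric type listed without its prefix.

    `labels` maps label names to the values to match (hypothetical values
    in the example below).
    """
    clauses = [f'metric.type="{PREFIX}{short_metric_type}"']
    # Sort the labels so that the resulting string is deterministic.
    for key, value in sorted((labels or {}).items()):
        clauses.append(f'metric.labels.{key}="{value}"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(
        build_tpu_filter(
            "instance/tpu/accelerator/duty_cycle",
            {"accelerator_id": "0"},  # hypothetical label value
        )
    )
```

A string built this way is intended for use as the `filter` parameter when listing time series in Cloud Monitoring.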
Each entry in the following table lists these fields, in order:
- Metric type, launch stage, (resource hierarchy levels), and display name
- Kind, type, unit, and monitored resources
- Description and labels
instance/tpu/accelerator/duty_cycle
BETA
(project)
Accelerator Duty Cycle |
|
GAUGE, DOUBLE, %
gce_instance |
Percentage of time over the sample period during which the accelerator was actively processing. Values are in the range of [0,100].
accelerator_id:
Device Id of Accelerator.
|
instance/tpu/accelerator/memory_bandwidth_utilization
BETA
(project)
Accelerator Memory Bandwidth Utilization |
|
GAUGE, DOUBLE, %
gce_instance |
Current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period by the maximum supported bandwidth over the same sample period.
accelerator_id:
Device Id of Accelerator.
|
instance/tpu/accelerator/memory_total
BETA
(project)
Accelerator Memory Total |
|
GAUGE, INT64, By
gce_instance |
Total accelerator memory currently allocated in bytes.
accelerator_id:
Device Id of Accelerator.
|
instance/tpu/accelerator/memory_used
BETA
(project)
Accelerator Memory Used |
|
GAUGE, INT64, By
gce_instance |
Total accelerator memory currently used in bytes.
accelerator_id:
Device Id of Accelerator.
|
instance/tpu/accelerator/tensorcore_utilization
BETA
(project)
Accelerator TensorCore Utilization |
|
GAUGE, DOUBLE, %
gce_instance |
Current percentage of the TensorCore that is utilized. Computed by dividing the TensorCore operations that were performed over a sample period by the supported number of TensorCore operations over the same sample period.
accelerator_id:
Device Id of Accelerator.
|
instance/tpu/active_chips
BETA
(project)
Active TPU Chips Count |
|
GAUGE, INT64, 1
gce_instance |
The current count of chips that are actively being utilized (that is, not idle).
accelerator_type:
Accelerator type and generation.
reservation_id:
The ID of the physical machine reservation.
provisioning_model:
The associated provisioning model.
protection_tier:
The associated protection model.
block_id:
The ID of the block within the cluster hosting the VM.
subblock_id:
The ID of the sub-block hosting the VM.
is_exr:
(BOOL)
Indicates if the chip is part of an extended reservation.
|
instance/tpu/chip_state
BETA
(project)
TPU Chip State Count |
|
GAUGE, INT64, 1
gce_instance |
The count of TPU chips in various states like Healthy, Unhealthy and Unknown.
state:
The state of the chip.
accelerator_type:
Accelerator type and generation.
block_id:
The ID of the block within the cluster hosting the VM.
subblock_id:
The ID of the sub-block hosting the VM.
reservation_id:
The ID of the physical machine reservation.
is_exr:
(BOOL)
Indicates if the chip is part of an extended reservation.
|
instance/tpu/infra_health
BETA
(project)
TPU Instance Health |
|
GAUGE, INT64, 1
gce_instance |
Indicates the overall health status of a TPU instance. The metric labels help identify the specific health status and reasons for issues on degraded or unhealthy TPU instances, primarily focusing on TPU hardware and system health. Health status changes may take several minutes to be reflected in this metric. Sampled every 60 seconds. After sampling, data is not visible for up to 420 seconds.
health_status:
The overall health state of the TPU instance. Possible values: HEALTHY (operating as expected), UNHEALTHY (critical issue detected), DEGRADED (performance impacting issue), UNKNOWN (status cannot be determined).
unhealthy_category:
Explanation for the unhealthy VM status. This label is populated only when the value of the metric is Unhealthy.
machine_type:
The machine type of the instance (e.g., ct6e-standard-4t-tpu).
machine_id:
The ID of the physical machine hosting the VM.
block_id:
The ID of the block within the cluster hosting the VM.
cluster_id:
The ID of the cluster hosting the VM.
reservation_id:
The ID of the physical machine reservation.
subblock_id:
The ID of the sub-block hosting the VM.
|
instance/tpu/runtime/uptime
BETA
(project)
Runtime Uptime |
|
GAUGE, INT64, s
gce_instance |
Uptime of the ML Runtime since the initialization of the runtime library (libtpu.so) by the ML job. During this period the runtime library blocks the TPU devices for use by the ML job.
ml_framework_name:
Name of the ML framework.
ml_framework_version:
Version of the ML framework.
|
instance/tpu/scheduled_chips
BETA
(project)
Scheduled TPU Chips Count |
|
GAUGE, INT64, 1
gce_instance |
The current count of chips that are allocated to a VM which is HEALTHY and is NOT DISABLED for maintenance.
accelerator_type:
Accelerator type and generation.
reservation_id:
The ID of the physical machine reservation.
provisioning_model:
The associated provisioning model.
protection_tier:
The associated protection model.
block_id:
The ID of the block within the cluster hosting the VM.
subblock_id:
The ID of the sub-block hosting the VM.
is_exr:
(BOOL)
Indicates if the chip is part of an extended reservation.
|
instance/tpu/utilized_chips
BETA
(project)
Utilized TPU Chips |
|
GAUGE, DOUBLE, 1
gce_instance |
The current aggregate utilized capacity expressed as an effective number of active chips. It is equivalent to the sum of the fractional utilization (0.0 to 1.0) of all active chips.
accelerator_type:
Accelerator type and generation.
reservation_id:
The ID of the physical machine reservation.
provisioning_model:
The associated provisioning model.
protection_tier:
The associated protection model.
block_id:
The ID of the block within the cluster hosting the VM.
subblock_id:
The ID of the sub-block hosting the VM.
is_exr:
(BOOL)
Indicates if the chip is part of an extended reservation.
|
quota/tpus_per_tpu_family/exceeded
ALPHA
(project)
TPU count per TPU family. quota exceeded error |
|
DELTA, INT64, 1
compute.googleapis.com/Location |
Number of attempts to exceed the limit on quota metric compute.googleapis.com/tpus_per_tpu_family. After sampling, data is not visible for up to 150 seconds.
limit_name:
The limit name.
tpu_family:
TPU family custom dimension.
|
quota/tpus_per_tpu_family/limit
ALPHA
(project)
TPU count per TPU family. quota limit |
|
GAUGE, INT64, 1
compute.googleapis.com/Location |
Current limit on quota metric compute.googleapis.com/tpus_per_tpu_family. Sampled every 60 seconds. After sampling, data is not visible for up to 150 seconds.
limit_name:
The limit name.
tpu_family:
TPU family custom dimension.
|
quota/tpus_per_tpu_family/usage
ALPHA
(project)
TPU count per TPU family. quota usage |
|
GAUGE, INT64, 1
compute.googleapis.com/Location |
Current usage on quota metric compute.googleapis.com/tpus_per_tpu_family. After sampling, data is not visible for up to 150 seconds.
limit_name:
The limit name.
tpu_family:
TPU family custom dimension.
|
tpu/multislice/accelerator/device_to_host_transfer_latencies
BETA
(project)
Device to Host Transfer Latencies |
|
CUMULATIVE, DISTRIBUTION, us
gce_instance |
Cumulative distribution of device to host transfer latency for each chunk of data. A latency starts when the request for data to be transferred to the host is issued and ends when an acknowledgement is received that the transfer of data has completed.
buffer_size:
Buffer size.
|
tpu/multislice/accelerator/host_to_device_transfer_latencies
BETA
(project)
Host to Device Transfer Latencies |
|
CUMULATIVE, DISTRIBUTION, us
gce_instance |
Cumulative distribution of host to device transfer latency for each chunk of data of multislice traffic. A latency starts when the request for data to be transferred to the device is issued and ends when an acknowledgement is received that the transfer of data has completed.
buffer_size:
Buffer size.
|
tpu/multislice/network/collective_end_to_end_latencies
BETA
(project)
Collective End-to-End Latencies |
|
CUMULATIVE, DISTRIBUTION, us
gce_instance |
Cumulative distribution of end to end collective latency for multislice traffic. A latency starts when the request for the collective is issued and ends when an acknowledgement is received that the transfer of data has completed.
input_size:
Input size of the collective operation.
collective_type:
Type of the collective operation.
|
tpu/multislice/network/dcn_transfer_latencies
BETA
(project)
DCN Transfer Latencies |
|
CUMULATIVE, DISTRIBUTION, us
gce_instance |
Cumulative distribution of network-transfer latencies for multislice traffic. A latency starts when the request for data to be transferred over the DCN is issued and ends when an acknowledgement is received that the transfer of data has completed.
buffer_size:
Buffer size.
type:
Type.
|
tpu/multislice/network/grpc_client_call_latencies
BETA
(project)
gRPC Client Call Latencies |
|
CUMULATIVE, DISTRIBUTION, us
gce_instance |
Cumulative distribution of the time that the gRPC library takes to complete an RPC, from the caller's perspective.
buffer_size:
Buffer size.
|
tpu/multislice/network/grpc_server_call_latencies
BETA
(project)
gRPC Server Call Latencies |
|
CUMULATIVE, DISTRIBUTION, us
gce_instance |
Cumulative distribution of the time that the gRPC server takes to complete an RPC, from the transport perspective.
buffer_size:
Buffer size.
|
tpu/multislice/network/grpc_tcp_delivery_rates
BETA
(project)
gRPC TCP Delivery Rates |
|
CUMULATIVE, DISTRIBUTION, Mb/s
gce_instance |
Cumulative distribution of the TCP connections data transfer rates. Each sample is the latest mean data transfer rate for a given TCP connection over the last TCP ACK interval. Samples of data transfer rates are pulled from the Linux TCP Kernel every 20s, so it can be expected that every TCP connection creates approximately 3 samples per 60s interval. |
tpu/multislice/network/grpc_tcp_min_round_trip_times
BETA
(project)
gRPC TCP Min Round Trip Times |
|
CUMULATIVE, DISTRIBUTION, us
gce_instance |
Cumulative distribution of minimum network-transfer latencies per TCP connection. |
tpu/multislice/network/grpc_tcp_packets_retransmitted_count
BETA
(project)
gRPC TCP Packets Retransmitted Count |
|
CUMULATIVE, INT64, 1
gce_instance |
Total count of packets retransmitted. |
tpu/multislice/network/grpc_tcp_packets_sent_count
BETA
(project)
gRPC TCP Packets Sent Count |
|
CUMULATIVE, INT64, 1
gce_instance |
Total count of TCP packets sent. |
tpu/slice/capacity/available_chips
BETA
(project)
Available TPU Chips Count |
|
GAUGE, INT64, 1
compute.googleapis.com/AcceleratorSlice |
The current count of TPU chips of Extended Reservation that are actively available and ready to use. Sampled every 60 seconds. After sampling, data is not visible for up to 360 seconds.
accelerator_type:
Accelerator type and generation.
reservation_id:
The ID of the physical machine reservation.
block_id:
The block ID associated with the slice.
subblock_id:
The subblock ID associated with the slice.
provisioning_model:
The associated provisioning model.
protection_tier:
The associated protection model.
|
tpu/slice/capacity/committed_chips
BETA
(project)
Purchased TPU Chips Count |
|
GAUGE, INT64, 1
compute.googleapis.com/AcceleratorSlice |
The current count of purchased TPU chips of Extended Reservation. Sampled every 60 seconds. After sampling, data is not visible for up to 360 seconds.
accelerator_type:
Accelerator type and generation.
reservation_id:
The ID of the physical machine reservation.
block_id:
The block ID associated with the slice.
subblock_id:
The subblock ID associated with the slice.
provisioning_model:
The associated provisioning model.
protection_tier:
The associated protection model.
|
For a complete list of metrics generated by Compute Engine, see Compute Engine metrics.
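Some of the descriptions in the table imply simple arithmetic. The sketch below works through two cases: `instance/tpu/utilized_chips` as the sum of per-chip fractional utilization, and the share of accelerator memory in use derived from `memory_used` and `memory_total`. The function names are hypothetical, and the sample values are made up for illustration rather than taken from a real TPU.

```python
def utilized_chips(fractional_utilizations):
    """Sum per-chip fractional utilization (each 0.0 to 1.0).

    The result is the effective number of fully used chips, as described
    for instance/tpu/utilized_chips.
    """
    for value in fractional_utilizations:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"fractional utilization out of range: {value}")
    return sum(fractional_utilizations)


def memory_used_percent(memory_used_bytes, memory_total_bytes):
    """Return the percentage of accelerator memory in use.

    Both arguments are byte counts, matching the unit (By) of the
    memory_used and memory_total metrics.
    """
    if memory_total_bytes <= 0:
        raise ValueError("memory_total_bytes must be positive")
    return 100.0 * memory_used_bytes / memory_total_bytes


if __name__ == "__main__":
    # Four active chips with made-up utilization values.
    print(utilized_chips([0.5, 0.25, 1.0, 0.25]))
    # Made-up example: 8 GiB used out of 32 GiB.
    print(memory_used_percent(8 * 2**30, 32 * 2**30))
```

Comparing the effective chip count with `instance/tpu/active_chips` for the same labels gives a rough sense of how busy the active chips are.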
AI Telemetry Collector
The AI Telemetry Collector collects and publishes TPU metrics under the
compute.googleapis.com namespace for TPUs created using the Compute Engine API.
These metrics are built-in system metrics, which provide visibility into health
and performance.
The AI Telemetry Collector architecture is designed as a lightweight, specialized OpenTelemetry (OTEL) Collector. It uses two primary receivers to capture data:
- TPU Runtime Receiver: Scrapes runtime and workload metrics (such as duty cycle and memory usage) directly from the TPU runtime when a machine learning workload is active.
- TPU Host Receiver: Captures hardware utilization metrics, such as TensorCore Utilization and Memory Bandwidth Utilization, directly from the device regardless of whether a workload is running.
The AI Telemetry Collector then uses processors to automatically apply necessary
resource tags (such as project_id, instance_id, and zone) and securely
exports the telemetry directly to Cloud Monitoring.
The AI Telemetry Collector comes pre-installed in Google's TPU-optimized Ubuntu LTS images, and runs automatically when the VM boots. To use this setup, specify the official Google accelerator image project and family when creating a TPU VM instance or instance template. Once the VM starts, the AI Telemetry Collector automatically sends metrics to Cloud Monitoring dashboards.
If you're building custom operating system images, you can use the AI Telemetry
Collector after installing and running the ai-telemetry-collector Docker
image. For more information, see Use a custom OS
image.
Configuration
The AI Telemetry Collector automatically sends metrics to Cloud Monitoring dashboards, and does not require any additional configuration steps. However, you can configure the Snap package or Docker image to add external export destinations, alter metric collection intervals, and include debugging options.
You can either replace the default configuration with a new config file, or append an additional configuration file to the existing default configuration. When adding configurations, keys that don't already exist are added and keys that already exist are overwritten. However, arrays and lists are not additive, so new lists must include both existing and new values.
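The merge behavior described above can be illustrated with a short Python sketch. It is not the collector's actual implementation: the function `merge_config` is hypothetical, the `DEFAULT` dictionary is a made-up stand-in for the default configuration, and the recursive merging of nested mappings is an assumption. It shows why a new list must repeat the existing values that you want to keep.

```python
import copy


def merge_config(base, extra):
    """Merge `extra` into a copy of `base`.

    New keys are added, existing keys are overwritten, nested mappings
    are merged recursively (an assumption), and lists are replaced
    wholesale rather than appended to.
    """
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-in for the default configuration.
DEFAULT = {
    "exporters": {"googlecloud": {}},
    "service": {"pipelines": {"metrics": {"exporters": ["googlecloud"]}}},
}

# Additional config that lists only the new exporter.
EXTRA_INCOMPLETE = {
    "exporters": {"prometheus": {"endpoint": "0.0.0.0:8889"}},
    "service": {"pipelines": {"metrics": {"exporters": ["prometheus"]}}},
}

# Additional config that repeats the existing exporter in the list.
EXTRA_COMPLETE = {
    "exporters": {"prometheus": {"endpoint": "0.0.0.0:8889"}},
    "service": {
        "pipelines": {"metrics": {"exporters": ["prometheus", "googlecloud"]}}
    },
}

if __name__ == "__main__":
    for extra in (EXTRA_INCOMPLETE, EXTRA_COMPLETE):
        merged = merge_config(DEFAULT, extra)
        print(merged["service"]["pipelines"]["metrics"]["exporters"])
```

In this sketch, the first merge drops `googlecloud` from the pipeline's exporter list because the list is replaced, while the `exporters` mapping still gains the new `prometheus` key alongside the existing one.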
The following YAML file configures the AI Telemetry Collector to send metrics to Prometheus, an open-source systems monitoring and alerting toolkit. It also enables the debugging option, which prints metrics to the console.
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    metrics:
      exporters:
        - prometheus # For more: https://prometheus.io/docs/introduction/overview/
        - googlecloud # If you do not include this, you'll lose Google Cloud Monitoring
        - debug # print metrics within the console
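Once a Prometheus exporter is listening on the configured endpoint, the metrics can be read in the Prometheus text exposition format. The following Python sketch fetches and parses that text using only the standard library. It assumes the exporter serves metrics at the `/metrics` path on port 8889; that path is an assumption, not something this guide states. The function names and the metric name in the sample text are hypothetical, and the parser handles only simple sample lines.

```python
import re
import urllib.request

# Matches lines such as: name{label="value"} 12.5 [optional timestamp]
_SAMPLE_RE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url="http://localhost:8889/metrics", timeout=5.0):
    """Download the exposition text. The /metrics path is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_prometheus_text(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical sample text; real metric names might differ.
SAMPLE_TEXT = """\
# HELP example_tpu_duty_cycle Example gauge.
# TYPE example_tpu_duty_cycle gauge
example_tpu_duty_cycle{accelerator_id="0"} 87.5
example_tpu_duty_cycle{accelerator_id="1"} 12
"""

if __name__ == "__main__":
    for name, labels, value in parse_prometheus_text(SAMPLE_TEXT):
        print(name, labels, value)
```

On a TPU VM with the exporter enabled, passing the result of `fetch_metrics()` to `parse_prometheus_text` would replace the sample text.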
Default OS
If you are using Google's TPU-optimized Ubuntu LTS images, run the following Snap command to add the new config file to the existing configuration:
sudo snap set \
  ai-telemetry-collector \
  extra-flags="--config /home/username/additional-config.yaml"
If you want to overwrite and replace the existing configuration, use the
config-path flag instead of extra-flags:
sudo snap set \
  ai-telemetry-collector \
  config-path="/home/username/new-config.yaml"
The snap set command should trigger an automatic restart of the AI Telemetry
Collector. To verify that the collector has restarted and successfully applied
your configurations, use the following command to view the logs:
sudo snap logs -f ai-telemetry-collector
Custom OS
If you are using a custom OS, run the following Docker command to add the new config file to the existing configuration:
# First apply the default configs via `--config=/etc/ai-telemetry-collector/config.yaml`
# Then apply your additional config by volume mount.
docker run --privileged --net=host \
  -v <path>/additional-config.yaml:/etc/ai-telemetry-collector/additional-config.yaml \
  ai-telemetry-collector:latest \
  --config=/etc/ai-telemetry-collector/config.yaml \
  --config=/etc/ai-telemetry-collector/additional-config.yaml
If you want to overwrite and replace the existing configuration, use the following Docker command:
# Mount a volume (your config file) to `/etc/ai-telemetry-collector/config.yaml`
# The binary automatically picks up this file.
docker run --privileged --net=host \
  -v <path>/my-config.yaml:/etc/ai-telemetry-collector/config.yaml \
  ai-telemetry-collector:latest
Audit logs
Google Cloud services generate audit logs that record administrative and access activities within your Google Cloud resources. For more information, see Compute Engine audit logging.