Monitor TPU VMs

This guide explains how to use Cloud Monitoring to monitor your TPU VMs. Cloud Monitoring automatically collects metrics and logs from your TPU and its host VM. You can use this data to monitor the health of your TPUs and Compute Engine VMs.

Metrics let you track a numerical quantity over time, for example, CPU utilization, network usage, or TensorCore idle duration. Logs capture events at a specific point in time. Log entries are written by your own code, Google Cloud services, third-party applications, and the Google Cloud infrastructure. You can generate metrics from the data in a log entry by creating a log-based metric, and you can set alert policies based on metric values or log entries.

To monitor TPUs, you can also use Capacity Planner (Preview). With Capacity Planner, you can view TPU usage and forecast data for your project, folder, or organization. This data updates every 24 hours, and you can use it to analyze usage trends and plan for future capacity needs. For more information, see Capacity Planner overview.

Access TPU metrics

Compute Engine generates two types of TPU metrics: TPU runtime metrics and TPU VM infrastructure metrics. You can get the metrics in two ways:

TPU Monitoring Library: Get TPU runtime metrics from the LibTPU SDK using the TPU Monitoring Library. This enables your applications to get real-time telemetry from inside the guest environment. For more information, see TPU Monitoring Library.
AI Telemetry Collector: Get runtime metrics and VM infrastructure metrics through the AI Telemetry Collector. The AI Telemetry Collector runs inside the TPU VM and lets you access metrics through Cloud Monitoring or through your own Prometheus-based monitoring pipeline. For more information, see AI Telemetry Collector.

TPU metrics

Google Cloud metrics for Cloud TPU are automatically generated by Compute Engine VMs and the Cloud TPU runtime. The metrics in the following table are generated by Compute Engine VMs.

The "metric type" strings in this table must be prefixed with compute.googleapis.com/. That prefix has been omitted from the entries in the table. When querying a label, use the metric.labels prefix; for example, metric.labels.LABEL="VALUE".
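The prefix and label rules above can be sketched as a small helper that assembles a Cloud Monitoring filter string. This is a minimal illustration, not an official client: the function name `build_metric_filter` and the `accelerator_id` value `"0"` are assumptions chosen for the example.

```python
METRIC_PREFIX = "compute.googleapis.com/"


def build_metric_filter(metric_type, labels=None):
    """Return a Cloud Monitoring filter string for one metric type.

    metric_type may be given with or without the compute.googleapis.com/
    prefix; labels maps label names to the values to match.
    """
    if not metric_type.startswith(METRIC_PREFIX):
        metric_type = METRIC_PREFIX + metric_type
    clauses = [f'metric.type="{metric_type}"']
    # Sort the labels so the resulting filter string is deterministic.
    for key, value in sorted((labels or {}).items()):
        clauses.append(f'metric.labels.{key}="{value}"')
    return " AND ".join(clauses)


# Hypothetical example: duty cycle for the accelerator with ID "0".
print(build_metric_filter(
    "instance/tpu/accelerator/duty_cycle",
    {"accelerator_id": "0"},
))
```

The resulting string can be supplied as the filter when listing time series through the Cloud Monitoring API.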

Each table entry spans two lines, in this layout:
Metric type, launch stage, (resource hierarchy levels), display name
Kind, type, unit, monitored resource, description, labels
`instance/tpu/accelerator/duty_cycle` ^BETA *(project)* Accelerator Duty Cycle
`GAUGE`, `DOUBLE`, `%` gce_instance	Percentage of time over the sample period during which the accelerator was actively processing. Values are in the range of [0,100]. `accelerator_id`: Device Id of Accelerator.
`instance/tpu/accelerator/memory_bandwidth_utilization` ^BETA *(project)* Accelerator Memory Bandwidth Utilization
`GAUGE`, `DOUBLE`, `%` gce_instance	Current percentage of the accelerator memory bandwidth that is being used. Computed by dividing the memory bandwidth used over a sample period by the maximum supported bandwidth over the same sample period. `accelerator_id`: Device Id of Accelerator.
`instance/tpu/accelerator/memory_total` ^BETA *(project)* Accelerator Memory Total
`GAUGE`, `INT64`, `By` gce_instance	Total accelerator memory currently allocated in bytes. `accelerator_id`: Device Id of Accelerator.
`instance/tpu/accelerator/memory_used` ^BETA *(project)* Accelerator Memory Used
`GAUGE`, `INT64`, `By` gce_instance	Total accelerator memory currently used in bytes. `accelerator_id`: Device Id of Accelerator.
`instance/tpu/accelerator/tensorcore_utilization` ^BETA *(project)* Accelerator TensorCore Utilization
`GAUGE`, `DOUBLE`, `%` gce_instance	Current percentage of the TensorCore that is utilized. Computed by dividing the TensorCore operations that were performed over a sample period by the supported number of TensorCore operations over the same sample period. `accelerator_id`: Device Id of Accelerator.
`instance/tpu/active_chips` ^BETA *(project)* Active TPU Chips Count
`GAUGE`, `INT64`, `1` gce_instance	The current count of chips that are actively being utilized (that is, not idle). `accelerator_type`: Accelerator type and generation. `reservation_id`: The ID of the physical machine reservation. `provisioning_model`: The associated provisioning model. `protection_tier`: The associated protection model. `block_id`: The ID of the block within the cluster hosting the VM. `subblock_id`: The ID of the sub-block hosting the VM. `is_exr`: (BOOL) Indicates if the chip is part of an extended reservation.
`instance/tpu/chip_state` ^BETA *(project)* TPU Chip State Count
`GAUGE`, `INT64`, `1` gce_instance	The count of TPU chips in various states like Healthy, Unhealthy and Unknown. `state`: The state of the chip. `accelerator_type`: Accelerator type and generation. `block_id`: The ID of the block within the cluster hosting the VM. `subblock_id`: The ID of the sub-block hosting the VM. `reservation_id`: The ID of the physical machine reservation. `is_exr`: (BOOL) Indicates if the chip is part of an extended reservation.
`instance/tpu/infra_health` ^BETA *(project)* TPU Instance Health
`GAUGE`, `INT64`, `1` gce_instance	Indicates the overall health status of a TPU instance. The metric labels help identify the specific health status and reasons for issues on degraded or unhealthy TPU instances, primarily focusing on TPU hardware and system health. Health status changes may take several minutes to be reflected in this metric. Sampled every 60 seconds. After sampling, data is not visible for up to 420 seconds. `health_status`: The overall health state of the TPU instance. Possible values: HEALTHY (operating as expected), UNHEALTHY (critical issue detected), DEGRADED (performance impacting issue), UNKNOWN (status cannot be determined). `unhealthy_category`: Explanation for the unhealthy VM status. This label is populated only when the value of the metric is Unhealthy. `machine_type`: The machine type of the instance (e.g., ct6e-standard-4t-tpu). `machine_id`: The ID of the physical machine hosting the VM. `block_id`: The ID of the block within the cluster hosting the VM. `cluster_id`: The ID of the cluster hosting the VM. `reservation_id`: The ID of the physical machine reservation. `subblock_id`: The ID of the sub-block hosting the VM.
`instance/tpu/runtime/uptime` ^BETA *(project)* Runtime Uptime
`GAUGE`, `INT64`, `s` gce_instance	Uptime of the ML Runtime since the initialization of the runtime library (libtpu.so) by the ML job. During this period the runtime library blocks the TPU devices for use by the ML job. `ml_framework_name`: Name of the ML framework. `ml_framework_version`: Version of the ML framework.
`instance/tpu/scheduled_chips` ^BETA *(project)* Scheduled TPU Chips Count
`GAUGE`, `INT64`, `1` gce_instance	The current count of chips that are allocated to a VM which is HEALTHY and is NOT DISABLED for maintenance. `accelerator_type`: Accelerator type and generation. `reservation_id`: The ID of the physical machine reservation. `provisioning_model`: The associated provisioning model. `protection_tier`: The associated protection model. `block_id`: The ID of the block within the cluster hosting the VM. `subblock_id`: The ID of the sub-block hosting the VM. `is_exr`: (BOOL) Indicates if the chip is part of an extended reservation.
`instance/tpu/utilized_chips` ^BETA *(project)* Utilized TPU Chips
`GAUGE`, `DOUBLE`, `1` gce_instance	The current aggregate utilized capacity expressed as an effective number of active chips. It is equivalent to the sum of the fractional utilization (0.0 to 1.0) of all active chips. `accelerator_type`: Accelerator type and generation. `reservation_id`: The ID of the physical machine reservation. `provisioning_model`: The associated provisioning model. `protection_tier`: The associated protection model. `block_id`: The ID of the block within the cluster hosting the VM. `subblock_id`: The ID of the sub-block hosting the VM. `is_exr`: (BOOL) Indicates if the chip is part of an extended reservation.
`quota/tpus_per_tpu_family/exceeded` ^ALPHA *(project)* TPU count per TPU family. quota exceeded error
`DELTA`, `INT64`, `1` compute.googleapis.com/Location	Number of attempts to exceed the limit on quota metric compute.googleapis.com/tpus_per_tpu_family. After sampling, data is not visible for up to 150 seconds. `limit_name`: The limit name. `tpu_family`: TPU family custom dimension.
`quota/tpus_per_tpu_family/limit` ^ALPHA *(project)* TPU count per TPU family. quota limit
`GAUGE`, `INT64`, `1` compute.googleapis.com/Location	Current limit on quota metric compute.googleapis.com/tpus_per_tpu_family. Sampled every 60 seconds. After sampling, data is not visible for up to 150 seconds. `limit_name`: The limit name. `tpu_family`: TPU family custom dimension.
`quota/tpus_per_tpu_family/usage` ^ALPHA *(project)* TPU count per TPU family. quota usage
`GAUGE`, `INT64`, `1` compute.googleapis.com/Location	Current usage on quota metric compute.googleapis.com/tpus_per_tpu_family. After sampling, data is not visible for up to 150 seconds. `limit_name`: The limit name. `tpu_family`: TPU family custom dimension.
`tpu/multislice/accelerator/device_to_host_transfer_latencies` ^BETA *(project)* Device to Host Transfer Latencies
`CUMULATIVE`, `DISTRIBUTION`, `us` gce_instance	Cumulative distribution of device to host transfer latency for each chunk of data. A latency starts when the request for data to be transferred to the host is issued and ends when an acknowledgement is received that the transfer of data has completed. `buffer_size`: Buffer size.
`tpu/multislice/accelerator/host_to_device_transfer_latencies` ^BETA *(project)* Host to Device Transfer Latencies
`CUMULATIVE`, `DISTRIBUTION`, `us` gce_instance	Cumulative distribution of host to device transfer latency for each chunk of data of multislice traffic. A latency starts when the request for data to be transferred to the device is issued and ends when an acknowledgement is received that the transfer of data has completed. `buffer_size`: Buffer size.
`tpu/multislice/network/collective_end_to_end_latencies` ^BETA *(project)* Collective End-to-End Latencies
`CUMULATIVE`, `DISTRIBUTION`, `us` gce_instance	Cumulative distribution of end to end collective latency for multislice traffic. A latency starts when the request for the collective is issued and ends when an acknowledgement is received that the transfer of data has completed. `input_size`: Input size of the collective operation. `collective_type`: Type of the collective operation.
`tpu/multislice/network/dcn_transfer_latencies` ^BETA *(project)* DCN Transfer Latencies
`CUMULATIVE`, `DISTRIBUTION`, `us` gce_instance	Cumulative distribution of network-transfer latencies for multislice traffic. A latency starts when the request for data to be transferred over the DCN is issued and ends when an acknowledgement is received that the transfer of data has completed. `buffer_size`: Buffer size. `type`: Type.
`tpu/multislice/network/grpc_client_call_latencies` ^BETA *(project)* gRPC Client Call Latencies
`CUMULATIVE`, `DISTRIBUTION`, `us` gce_instance	Cumulative distribution of the time the gRPC library takes to complete an RPC, measured from the caller's perspective. `buffer_size`: Buffer size.
`tpu/multislice/network/grpc_server_call_latencies` ^BETA *(project)* gRPC Server Call Latencies
`CUMULATIVE`, `DISTRIBUTION`, `us` gce_instance	Cumulative distribution of the time the gRPC server takes to complete an RPC, measured from the transport perspective. `buffer_size`: Buffer size.
`tpu/multislice/network/grpc_tcp_delivery_rates` ^BETA *(project)* gRPC TCP Delivery Rates
`CUMULATIVE`, `DISTRIBUTION`, `Mb/s` gce_instance	Cumulative distribution of the data transfer rates of TCP connections. Each sample is the latest mean data transfer rate for a given TCP connection over the last TCP ACK interval. Samples of data transfer rates are pulled from the Linux TCP Kernel every 20s, so each TCP connection is expected to create approximately 3 samples per 60s interval.
`tpu/multislice/network/grpc_tcp_min_round_trip_times` ^BETA *(project)* gRPC TCP Min Round Trip Times
`CUMULATIVE`, `DISTRIBUTION`, `us` gce_instance	Cumulative distribution of minimum network-transfer latencies per TCP connection.
`tpu/multislice/network/grpc_tcp_packets_retransmitted_count` ^BETA *(project)* gRPC TCP Packets Retransmitted Count
`CUMULATIVE`, `INT64`, `1` gce_instance	Total count of packets retransmitted.
`tpu/multislice/network/grpc_tcp_packets_sent_count` ^BETA *(project)* gRPC TCP Packets Sent Count
`CUMULATIVE`, `INT64`, `1` gce_instance	Total count of TCP packets sent.
`tpu/slice/capacity/available_chips` ^BETA *(project)* Available TPU Chips Count
`GAUGE`, `INT64`, `1` compute.googleapis.com/AcceleratorSlice	The current count of TPU chips of Extended Reservation that are actively available and ready to use. Sampled every 60 seconds. After sampling, data is not visible for up to 360 seconds. `accelerator_type`: Accelerator type and generation. `reservation_id`: The ID of the physical machine reservation. `block_id`: The block ID associated with the slice. `subblock_id`: The subblock ID associated with the slice. `provisioning_model`: The associated provisioning model. `protection_tier`: The associated protection model.
`tpu/slice/capacity/committed_chips` ^BETA *(project)* Purchased TPU Chips Count
`GAUGE`, `INT64`, `1` compute.googleapis.com/AcceleratorSlice	The current count of purchased TPU chips of Extended Reservation. Sampled every 60 seconds. After sampling, data is not visible for up to 360 seconds. `accelerator_type`: Accelerator type and generation. `reservation_id`: The ID of the physical machine reservation. `block_id`: The block ID associated with the slice. `subblock_id`: The subblock ID associated with the slice. `provisioning_model`: The associated provisioning model. `protection_tier`: The associated protection model.

For a complete list of metrics generated by Compute Engine, see Compute Engine metrics.
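As a worked illustration of the `instance/tpu/utilized_chips` description in the table, the following sketch sums per-chip fractional utilization into an effective chip count. The function name and the sample values are hypothetical; the metric itself is computed by Google Cloud, not by your code.

```python
def utilized_chips(per_chip_utilization):
    """Sum fractional utilization (0.0 to 1.0) over all active chips."""
    for value in per_chip_utilization:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"utilization out of range: {value}")
    return sum(per_chip_utilization)


# Hypothetical sample: four active chips at 100%, 50%, 25%, and 25%.
effective = utilized_chips([1.0, 0.5, 0.25, 0.25])
print(effective)  # 2.0
```

In this sample, four active chips contribute the same utilized capacity as two fully busy chips.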

AI Telemetry Collector

The AI Telemetry Collector collects and publishes TPU metrics under the compute.googleapis.com namespace for TPUs created using the Compute Engine API. These metrics are built-in system metrics, which provide visibility into health and performance.

The AI Telemetry Collector architecture is designed as a lightweight, specialized OpenTelemetry (OTEL) Collector. It uses two primary receivers to capture data:

TPU Runtime Receiver: Scrapes runtime and workload metrics (such as duty cycle and memory usage) directly from the TPU runtime when a machine learning workload is active.
TPU Host Receiver: Captures hardware utilization metrics, such as TensorCore Utilization and Memory Bandwidth Utilization, directly from the device regardless of whether a workload is running.

The AI Telemetry Collector then uses processors to automatically apply necessary resource tags (such as project_id, instance_id, and zone) and securely exports the telemetry directly to Cloud Monitoring.
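The receiver, processor, and exporter flow described above can be pictured with a small conceptual model. This is only a sketch of the data flow, not the collector's implementation: every class, function, metric value, and resource value below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    name: str
    value: float
    labels: dict = field(default_factory=dict)
    resource: dict = field(default_factory=dict)


def runtime_receiver(workload_active):
    """Stand-in for the TPU Runtime Receiver: emits only while a workload runs."""
    if not workload_active:
        return []
    return [MetricPoint("duty_cycle", 87.5, {"accelerator_id": "0"})]


def host_receiver():
    """Stand-in for the TPU Host Receiver: emits with or without a workload."""
    return [MetricPoint("tensorcore_utilization", 42.0, {"accelerator_id": "0"})]


def add_resource_tags(points, resource):
    """Processor step: attach the same resource tags to every point."""
    for point in points:
        point.resource = dict(resource)
    return points


def collect(workload_active, resource):
    """Gather points from both receivers, then run the processor step."""
    points = runtime_receiver(workload_active) + host_receiver()
    return add_resource_tags(points, resource)


# Hypothetical resource tags for one TPU VM.
resource = {"project_id": "my-project", "instance_id": "1234", "zone": "us-central1-a"}
for point in collect(workload_active=True, resource=resource):
    print(point.name, point.value, point.resource["zone"])
```

With `workload_active=False`, only the host receiver's point appears, which mirrors the distinction between the two receivers.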

The AI Telemetry Collector comes pre-installed in Google's TPU-optimized Ubuntu LTS images, and runs automatically when the VM boots. To use this setup, specify the official Google accelerator image project and family when creating a TPU VM instance or instance template. Once the VM starts, the AI Telemetry Collector automatically sends metrics to Cloud Monitoring dashboards.

If you're building custom operating system images, you can use the AI Telemetry Collector after installing and running the ai-telemetry-collector Docker image. For more information, see Use a custom OS image.

Configuration

The AI Telemetry Collector automatically sends metrics to Cloud Monitoring dashboards, and does not require any additional configuration steps. However, you can configure the Snap package or Docker image to add external export destinations, alter metric collection intervals, and include debugging options.

You can either replace the default configuration with a new config file, or append an additional configuration file to the existing default configuration. When adding configurations, keys that don't already exist are added and keys that already exist are overwritten. However, arrays and lists are not additive, so new lists must include both existing and new values.
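The merge rules in the preceding paragraph can be illustrated with a short sketch. It models the behavior as described (new keys added, existing keys overwritten, lists replaced rather than concatenated) and is not the collector's actual merge code. The contents of `default_config` are an assumption made for the example.

```python
import copy


def merge_config(base, override):
    """Merge override into base: mappings merge recursively, everything
    else (including lists) is replaced by the override value."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Assumed shape of a default configuration (hypothetical).
default_config = {
    "exporters": {"googlecloud": {}},
    "service": {"pipelines": {"metrics": {"exporters": ["googlecloud"]}}},
}

# Additional configuration that lists only the new exporter.
additional_config = {
    "exporters": {"prometheus": {"endpoint": "0.0.0.0:8889"}},
    "service": {"pipelines": {"metrics": {"exporters": ["prometheus"]}}},
}

merged = merge_config(default_config, additional_config)
print(sorted(merged["exporters"]))
print(merged["service"]["pipelines"]["metrics"]["exporters"])
```

In this sketch the `exporters` mapping ends up with both entries, but the pipeline's exporter list contains only `prometheus`. To keep an existing list entry, repeat it in the new list.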

The following YAML file configures the AI Telemetry Collector to send metrics to Prometheus, an open-source systems monitoring and alerting toolkit. It also enables the debugging option, which prints metrics to the console.

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    metrics:
      exporters:
      - prometheus  # For more: https://prometheus.io/docs/introduction/overview/
      - googlecloud  # If you do not include this, you'll lose Google Cloud Monitoring
      - debug  # Print metrics to the console
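Once the Prometheus exporter is enabled, you could check the endpoint from the VM with a short script such as the following sketch. It assumes the exporter serves the Prometheus text format at the `/metrics` path on port 8889. The metric names in the sample text are hypothetical, and the parser handles only simple sample lines.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse sample lines of the Prometheus text format into
    (series, value) pairs, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the last closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, rest = line.split(None, 1)
        # The first field after the series is the value; a timestamp may follow.
        samples.append((series, float(rest.split()[0])))
    return samples


def scrape(url="http://localhost:8889/metrics", timeout=5.0):
    """Fetch and parse metrics from the exporter (assumed URL)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))


# Hypothetical exposition text; real metric names may differ.
sample_text = """\
# TYPE tpu_duty_cycle gauge
tpu_duty_cycle{accelerator_id="0"} 87.5
tpu_uptime_seconds 120
"""
print(parse_prometheus_text(sample_text))
# On a VM with a running collector: samples = scrape()
```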

Default OS

If you are using Google's TPU-optimized Ubuntu LTS images, run the following Snap command to add the new config file to the existing configuration:

sudo snap set \
  ai-telemetry-collector \
  extra-flags="--config /home/username/additional-config.yaml"

If you want to overwrite and replace the existing configuration, use the config-path flag instead of extra-flags:

sudo snap set \
  ai-telemetry-collector \
  config-path="/home/username/new-config.yaml"

The snap set command should trigger an automatic restart of the AI Telemetry Collector. To verify that the collector has restarted and successfully applied your configurations, use the following command to view the logs:

sudo snap logs -f ai-telemetry-collector

Custom OS

If you are using a custom OS, run the following Docker command to add the new config file to the existing configuration:

# First apply the default configs via `--config=/etc/ai-telemetry-collector/config.yaml`
# Then apply your additional config by volume mount.

docker run --privileged --net=host \
  -v <path>/additional-config.yaml:/etc/ai-telemetry-collector/additional-config.yaml \
  ai-telemetry-collector:latest \
  --config=/etc/ai-telemetry-collector/config.yaml \
  --config=/etc/ai-telemetry-collector/additional-config.yaml

If you want to overwrite and replace the existing configuration, use the following Docker command:

# Mount a volume (your config file) to `/etc/ai-telemetry-collector/config.yaml`
# The binary automatically picks up this file.

docker run --privileged --net=host \
  -v <path>/my-config.yaml:/etc/ai-telemetry-collector/config.yaml \
  ai-telemetry-collector:latest

Audit logs

Google Cloud services generate audit logs that record administrative and access activities within your Google Cloud resources. For more information, see Compute Engine audit logging.