Dataflow GPU metrics

This page covers the GPU metrics that Dataflow supports. You can use these metrics to monitor the health and utilization of the GPUs on your job. Most metrics are available on every Dataflow job that uses GPUs, but some require additional configuration on many GPU models.

Prerequisites

GPU metrics are only collected by Dataflow jobs that have explicitly requested GPUs. For more information, see GPU support.

Overview

While Dataflow reports many GPU metrics, the main metrics are total and used memory, which are equivalent to RAM metrics, and Streaming Multiprocessor (SM) Activity and SM Occupancy, which are the closest equivalents to Dataflow CPU metrics. More metrics are covered under common metrics and GPM metrics.

The total and used memory of every GPU device on the job are reported by default. In the Dataflow monitoring interface, these appear under "Basic GPU utilization". Don't confuse them with "Memory Access Percentage", which appears in the same group but reports the percentage of time that GPU device memory was being accessed, not how much memory is in use.

SM Activity and SM Occupancy are GPM metrics. These metrics aren't supported on P4 and P100 devices, and are supported by default on H100 devices and later. For all other devices, such as T4 and L4 devices, additional setup is necessary. For steps to enable them, see GPM collection. If collected on the job, these metrics are under "GPU GPM utilization" in the Dataflow monitoring interface.

Dataflow GPU metric basics

All GPU metrics are sent by Dataflow workers to Cloud Monitoring. Per-device metrics can be found under dataflow.googleapis.com/worker/accelerator/gpu. All of these metrics are grouped into general categories, such as utilization or temperature, and they all have the following labels:

device_uuid: Uniquely identifies the GPU device regardless of worker or pipeline.
device_number: The number assigned to the device on the worker in the range of [0, N), where N is the number of GPU devices on the worker.
device_model: The model of the GPU, such as "Tesla T4".

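Both device_uuid and device_model are independent of the worker and are always the same for the same physical device. The device_number is local to a worker: it reflects how that worker enumerates the device, so the same number can refer to different physical devices on different workers.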
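The difference between the labels can be illustrated with a minimal sketch. The worker names, UUIDs, and record layout below are hypothetical and only stand in for label values that a job might report.

```python
from collections import defaultdict

# Hypothetical label values: (worker name, device_number, device_uuid, device_model).
samples = [
    ("worker-a", 0, "GPU-aaaa", "Tesla T4"),
    ("worker-a", 1, "GPU-bbbb", "Tesla T4"),
    ("worker-b", 0, "GPU-cccc", "Tesla T4"),
]

# Collect the physical devices (UUIDs) seen behind each device_number.
uuids_by_number = defaultdict(set)
for _worker, number, uuid, _model in samples:
    uuids_by_number[number].add(uuid)

# A device_number that maps to more than one UUID is ambiguous across workers.
ambiguous_numbers = sorted(
    number for number, uuids in uuids_by_number.items() if len(uuids) > 1
)

# device_uuid alone is enough to count physical devices across the whole job.
unique_devices = {uuid for _worker, _number, uuid, _model in samples}

print(ambiguous_numbers)    # device numbers shared by several workers
print(len(unique_devices))  # number of physical devices
```

In this sketch, device_number 0 appears on both workers, so grouping by device_number alone would merge two physical devices. Grouping by device_uuid, or by worker name plus device_number, keeps them apart.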

Common GPU metrics for Dataflow

Common metrics are reported by every Dataflow job with GPUs. In Monitoring, they all use the following format:

dataflow.googleapis.com/worker/accelerator/gpu/CATEGORY/NAME
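As a rough illustration of this format, the following sketch assembles a metric path from a category and a name and wraps it in a metric.type filter expression. The helper names are hypothetical and are not part of any Dataflow or Monitoring library.

```python
GPU_METRIC_PREFIX = "dataflow.googleapis.com/worker/accelerator/gpu"


def gpu_metric_type(category: str, name: str) -> str:
    """Return PREFIX/CATEGORY/NAME for a per-device GPU metric."""
    for part in (category, name):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"{GPU_METRIC_PREFIX}/{category}/{name}"


def metric_type_filter(category: str, name: str) -> str:
    """Return a filter expression that selects one metric type."""
    return f'metric.type = "{gpu_metric_type(category, name)}"'


# The same name can appear under several categories, so the category
# is what tells memory usage apart from power usage.
memory_usage = gpu_metric_type("memory", "device_usage")
power_usage = gpu_metric_type("power", "device_usage")
print(memory_usage)
print(power_usage)
print(metric_type_filter("memory", "device_usage"))
```

Keeping the category and name separate avoids mixing up metrics such as memory/device_usage and power/device_usage, which share a name.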

The following table shows each metric and its category, name, unit, and purpose.

Metric	Category	Name	Unit	Description
Kernel Running Percentage	utilization	device_kernel_runtime	Percent	The percentage of time that at least one kernel was running on the GPU. This only shows that the GPU was being used, not if its processing resources are being used efficiently.
Memory Access Percentage	utilization	device_memory_access	Percent	The percentage of time that device memory was being read or written. This only shows that memory was being accessed, not the percentage of memory in use.
Memory Limit	memory	device_limit	MiB	The amount of memory available on the GPU.
Memory Usage	memory	device_usage	MiB	The amount of memory in use by the GPU. This includes both the memory used by the Dataflow job and the memory reserved for firmware, so some usage is expected even if no memory was transferred to the GPU yet.
Power Limit	power	device_limit	Watts	The maximum amount of power that the device is set to use. Dataflow does not change this from the default.
Power Usage	power	device_usage	Watts	The amount of power in use by the device.
Current Temperature	temperature	device_current	Celsius	The current temperature of the GPU.
Max Operating Temperature	temperature	device_max_op	Celsius	The temperature that the GPU should stay under. If the current temperature exceeds this, then the GPU drivers will attempt to cool the GPU until it is under this temperature. Dataflow does not control this.
Slowdown Temperature	temperature	device_slowdown	Celsius	The temperature at which the GPU will start throttling. If the current temperature exceeds this, then you should expect to see a performance degradation until it cools off. Dataflow does not control this.
Shutdown Temperature	temperature	device_shutdown	Celsius	The temperature at which the GPU will shut off. If the current temperature exceeds this, then the device will become unavailable. Dataflow does not control this temperature, nor does it make an active attempt to recover GPUs that have shut down due to excessively high temperature.
Current SM Clock	clock	device_sm_current	MHz	The current speed of the SM clock. If the temperature exceeds the slowdown threshold, then this may go down as part of cooling-related throttling.
Max SM Clock	clock	device_sm_max	MHz	The max speed of the SM clock.
Current Memory Clock	clock	device_memory_current	MHz	The current speed of the memory clock. If the temperature exceeds the slowdown threshold, then this may go down as part of cooling-related throttling.
Max Memory Clock	clock	device_memory_max	MHz	The max speed of the memory clock.
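The relationship between the four temperature metrics can be sketched as a small classifier. The state names are hypothetical labels chosen for this example, and the sketch assumes that the thresholds are ordered as max operating <= slowdown <= shutdown, which the table implies but does not state.

```python
def temperature_state(
    current: float, max_op: float, slowdown: float, shutdown: float
) -> str:
    """Classify a device_current reading against the three thresholds.

    All values are in degrees Celsius. Assumes max_op <= slowdown <= shutdown.
    """
    if not max_op <= slowdown <= shutdown:
        raise ValueError("expected max_op <= slowdown <= shutdown")
    if current > shutdown:
        return "shutdown"    # the device becomes unavailable
    if current > slowdown:
        return "throttling"  # expect degraded performance and lower clocks
    if current > max_op:
        return "cooling"     # the drivers attempt to cool the GPU
    return "normal"


# Hypothetical threshold values, for illustration only.
print(temperature_state(current=60.0, max_op=85.0, slowdown=93.0, shutdown=96.0))
print(temperature_state(current=94.0, max_op=85.0, slowdown=93.0, shutdown=96.0))
```

Because Dataflow doesn't control these thresholds, read them from the reported metrics rather than hard-coding them per GPU model.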

GPM metrics for Dataflow

Dataflow offers some support for GPM metrics. The level of support depends on the GPU model and on the accelerator configuration. Most Dataflow jobs with GPUs need additional configuration before they collect these metrics.

GPM metrics follow the same basics as common metrics.

Supported metrics

Similar to the common metrics, GPM metric paths have the following format:

dataflow.googleapis.com/worker/accelerator/gpu/CATEGORY/NAME

Some of these metrics are under the same category as some common metrics.

Metric	Category	Name	Unit	Description
SM Activity	utilization	device_sm_activity	Percent	The percentage of time a warp was active on the SM averaged across all SMs on the device. This is similar to the Kernel Running Percentage, but it offers a more granular picture that better shows if the GPU's resources are being used efficiently. NVIDIA defines effective use as 80% or more, with 50% or less being ineffective use.
SM Occupancy	utilization	device_sm_occupancy	Percent	The percentage of active warps on the device relative to the maximum. Memory-limited jobs should have higher occupancy than compute-limited jobs, and the Memory Access Percentage metric can provide insight on this. More details can be found in NVIDIA documentation on achieved occupancy.
Tensor Pipe Activity	utilization	device_tensor_pipe_activity	Percent	The percentage of time that the Tensor Core pipe was in use. Higher values indicate more usage of the GPU's Tensor Cores, which are important to matrix operations.
FP64 Pipe Activity	utilization	device_fp64_pipe_activity	Percent	The percentage of time that the FP64 Core pipe was in use. Higher values indicate more usage of the GPU's FP64 Cores, which handle scalar operations of 64-bit floating point values.
FP32 Pipe Activity	utilization	device_fp32_pipe_activity	Percent	The percentage of time that the FP32 Core pipe was in use. Higher values indicate more usage of the GPU's FP32 Cores, which handle scalar operations of 32-bit floating point values.
FP16 Pipe Activity	utilization	device_fp16_pipe_activity	Percent	The percentage of time that the FP16 pipe was in use. Unlike FP64 and FP32, which are associated with 64-bit and 32-bit CUDA cores respectively, FP16 is associated with taking advantage of Tensor cores' half-precision capabilities.
PCIe Read	pcie	device_read	MiB/s	The rate of data read by the GPU from the host VM over PCIe.
PCIe Transfer	pcie	device_transfer	MiB/s	The rate of data transfer from the GPU to the host VM over PCIe.
NVLink Read	nvlink	device_read	MiB/s	The rate of data read by the GPU over NVLink. Since NVLink only covers GPU-to-GPU communication, this is irrelevant if each worker only has a single GPU.
NVLink Transfer	nvlink	device_transfer	MiB/s	The rate of data transfer from the GPU over NVLink. Since NVLink only covers GPU-to-GPU communication, this is irrelevant if each worker only has a single GPU.
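The SM Activity thresholds from the table can be turned into a simple check. This is only a sketch: the function name and the "in between" label are hypothetical, and the table doesn't say how values between 50% and 80% should be interpreted.

```python
def classify_sm_activity(percent: float) -> str:
    """Map a device_sm_activity reading (0-100) to a coarse rating."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"percentage out of range: {percent}")
    if percent >= 80.0:
        return "effective"    # 80% or more
    if percent <= 50.0:
        return "ineffective"  # 50% or less
    return "in between"       # not classified by the thresholds above


for reading in (92.5, 65.0, 30.0):
    print(reading, classify_sm_activity(reading))
```

A check like this is more useful over a sustained window than on a single sample, because short idle periods between kernels are normal.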

Collecting GPM metrics

Any Dataflow job with GPUs that use the Hopper architecture or later (such as H100 and H100 Mega) collects GPM metrics by default, so no additional configuration is needed. However, GPUs that use the Pascal architecture or earlier (such as P4 and P100) don't support these metrics.

For all other models, collecting these metrics requires adding install-gke-dcgm-exporter to the worker accelerator configuration. For example:

--experiment="worker_accelerator=type:TYPE;count:COUNT;install-nvidia-driver;install-gke-dcgm-exporter"

This flag installs a GKE-managed equivalent to the NVIDIA DCGM-exporter. The following types support this option:

nvidia-l4
nvidia-tesla-a100
nvidia-a100-80gb
nvidia-tesla-t4
nvidia-tesla-v100

If another type is provided, the Dataflow service returns an error on job creation. This check helps you avoid running the container on jobs where it doesn't help with metric collection.
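As a sketch of the configuration described above, the following helper builds the worker accelerator option and rejects GPU types outside the supported list before the job is submitted. The function name is hypothetical, and the local check only mirrors the list on this page; the Dataflow service remains the authority on which types are accepted.

```python
# GPU types listed on this page as supporting install-gke-dcgm-exporter.
DCGM_EXPORTER_TYPES = frozenset({
    "nvidia-l4",
    "nvidia-tesla-a100",
    "nvidia-a100-80gb",
    "nvidia-tesla-t4",
    "nvidia-tesla-v100",
})


def worker_accelerator_option(gpu_type: str, count: int, enable_gpm: bool = False) -> str:
    """Build the --experiment=worker_accelerator=... option as one argument."""
    if count < 1:
        raise ValueError("count must be at least 1")
    parts = [f"type:{gpu_type}", f"count:{count}", "install-nvidia-driver"]
    if enable_gpm:
        if gpu_type not in DCGM_EXPORTER_TYPES:
            raise ValueError(
                f"{gpu_type} is not listed as supporting install-gke-dcgm-exporter"
            )
        parts.append("install-gke-dcgm-exporter")
    return "--experiment=worker_accelerator=" + ";".join(parts)


option = worker_accelerator_option("nvidia-tesla-t4", 1, enable_gpm=True)
print(option)
```

The returned string is meant to be passed as a single argument. When you type it in a shell, quote the value as in the example above, because the semicolons would otherwise be interpreted as command separators.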

Legacy metrics

In Monitoring, you might see two metrics named dataflow.googleapis.com/job/gpu_utilization and dataflow.googleapis.com/job/gpu_memory_utilization. These metrics are similar to Kernel Running Percentage and Memory Access Percentage respectively, but workers report them by averaging across GPUs on the worker. We recommend using the per-device equivalents, especially if workers are configured to have more than one GPU.
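The following sketch shows why a per-worker average can be misleading on workers with more than one GPU. The readings are hypothetical, and the averaging only approximates how the legacy metrics are described above.

```python
from statistics import mean

# Hypothetical Kernel Running Percentage readings for two GPUs on one worker,
# keyed by device_number.
per_device_kernel_runtime = {0: 96.0, 1: 4.0}

# A worker-level average, in the spirit of the legacy metrics.
worker_average = mean(list(per_device_kernel_runtime.values()))

# The per-device view shows that one GPU is nearly idle.
idle_devices = [
    number
    for number, percent in per_device_kernel_runtime.items()
    if percent < 10.0
]

print(worker_average)
print(idle_devices)
```

Here the average suggests moderate use, while the per-device values show one busy GPU and one that is nearly idle.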

Dataflow UI

If the Dataflow job has GPUs attached to its workers, then metrics should appear in the "Job Metrics" tab on the job page, under the "Dataflow ML" category. This category doesn't appear on jobs without GPUs, and takes a few seconds to load, because it first verifies that the metrics are relevant to the job.

The following subcategories appear under "Dataflow ML":

Subcategory	Metrics	Conditions
Basic GPU utilization	Kernel Running Percentage, Memory Access Percentage, Total/Used Memory	All GPU jobs
GPU performance	Power Draw/Limit, Temperature Reading/Limits	All GPU jobs
GPU GPM utilization	SM Activity, SM Occupancy, CUDA/Tensor pipe activity	GPM metrics enabled
GPU GPM I/O	PCIe Read/Transfer, NVLink Read/Transfer	GPM metrics enabled
Legacy GPU utilization	Legacy Metrics	All GPU jobs

When viewing non-legacy metrics, you can filter the charts to a specific worker name and GPU device number. The worker name matches the name of the VM as shown in Compute Engine. The GPU device number is the device_number value from the metric labels. You can use this filtering to check metrics on a specific GPU device, such as seeing how close its power usage is to its limit:

Figure: Example of filtering GPU metrics by device.
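The same comparison can be sketched outside the UI. The example below filters hypothetical power samples by worker name and device number, then reports usage as a fraction of the limit. The record type and values are assumptions made for illustration, not an export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerSample:
    worker: str          # worker (VM) name
    device_number: int   # device_number label
    usage_watts: float   # power/device_usage
    limit_watts: float   # power/device_limit


# Hypothetical readings for three devices across two workers.
samples = [
    PowerSample("worker-a", 0, 35.0, 70.0),
    PowerSample("worker-a", 1, 63.0, 70.0),
    PowerSample("worker-b", 0, 14.0, 70.0),
]


def power_fraction(samples, worker: str, device_number: int) -> float:
    """Return usage / limit for the device selected by worker and number."""
    for sample in samples:
        if sample.worker == worker and sample.device_number == device_number:
            return sample.usage_watts / sample.limit_watts
    raise LookupError(f"no sample for {worker} device {device_number}")


print(power_fraction(samples, "worker-a", 0))
print(power_fraction(samples, "worker-a", 1))
```

Both the worker name and the device number are needed to select a device, because device numbers repeat across workers.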