Monitor Compute Engine instances and Slurm clusters

This document explains how to use Cloud Monitoring dashboards to monitor A4X Max, A4X, A4, A3 Ultra, and A3 Mega instances that you created by using reserved capacity. Using these dashboards helps you identify and troubleshoot performance bottlenecks in your standalone Compute Engine instances or Slurm clusters, minimizing downtime in your workloads.

By creating custom dashboards or using prebuilt Monitoring dashboards, you can monitor the following:

Compute instance health
GPU performance
Network transmission efficiency
Network efficiency among blocks and sub-blocks
Machine learning (ML) workload efficiency
Straggler detection
Unresponsive workload detection

To monitor Cluster Director clusters, see Monitor cluster performance with prebuilt dashboards.

Before you begin

Before monitoring your workload, if you haven't already done so, complete the following steps:

Deploy a workload that you can monitor. To learn which workloads are supported, see the limitations in this document. To learn how to deploy a workload, see Deployment options overview.
Learn about the Google Cloud services for monitoring workloads:
- The metrics in this document use Monitoring dashboards. Learn about Monitoring dashboards, Monitoring retention periods, and Monitoring pricing.
- Straggler detection also provides log entries in Cloud Logging. Learn about Logging interfaces, Logging retention periods, and Logging pricing.

When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

Limitations

The metrics in this document are only supported for workloads that run on compute instances that meet all of the following criteria:

The compute instances must be created as either standalone Compute Engine instances or as part of a Slurm cluster.
The compute instances must have been created by using reserved capacity.
The compute instances must use the A4X Max, A4X, A4, A3 Ultra, or A3 Mega machine series.

To monitor ML workload metrics, you must set up monitoring for your workload.

Straggler detection limitations

Straggler detection metrics have the following additional limitations:

For supported machine series other than A3 Mega, straggler detection only supports compute instances that enable the Collective Communication Analyzer (CoMMA) library to export NCCL telemetry to Google Cloud services. For more information, see CoMMA overview.
Straggler detection typically takes up to 10 minutes to report a straggler.
Unlike the other metrics in this document, you can't filter straggler detection metrics for your projects by cluster, block, sub-block, or compute instance. However, you can filter queries for straggler detection logs by the ID of one or more compute instances that are suspected stragglers.

Unresponsive workload detection limitations

Unresponsive workload detection metrics only support compute instances that use the Collective Communication Analyzer (CoMMA) library to export NCCL telemetry to Google Cloud services. For more information, see CoMMA overview.

Required roles

To get the permissions that you need to monitor metrics for AI Hypercomputer workloads, ask your administrator to grant you the following IAM roles:

To view metrics in Cloud Monitoring: Monitoring Editor (roles/monitoring.editor) on the project
To view straggler detection logs in Logging: Logs Viewer (roles/logging.viewer) on the project

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to monitor metrics for AI Hypercomputer workloads. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to monitor metrics for AI Hypercomputer workloads:

To view dashboards: monitoring.dashboards.get on the project
To create dashboards: monitoring.dashboards.create on the project
To view log entries: logging.logEntries.list on the project

You might also be able to get these permissions with custom roles or other predefined roles.

Available metrics

Depending on your use case, the following metrics are available for monitoring your compute instances and Slurm clusters:

To monitor the health, performance, and network performance of the GPUs attached to your compute instances, see Infrastructure metrics.
To monitor the efficiency of the GPUs in your ML workloads, see ML workload metrics.
To monitor suspected straggler compute instances in ML workloads with slow performance, see Straggler detection metrics.

To learn how to view these metrics, see View metrics in this document.

Infrastructure metrics

To monitor the health, performance, and network performance of the GPUs attached to your compute instances, you can use the following metrics:

GPU health metrics
GPU performance metrics
GPU network performance metrics
GPU fatal errors metrics

For an overview of available metrics in Compute Engine, see Google Cloud metrics.

GPU health metrics

To monitor the health of your GPUs, use the following metrics:

Name	Metric type	Supported machine series	Description
Machine Status	`machine/machine_status`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	Whether the machine that the compute instance uses is healthy, or the machine is unhealthy and requires repair.
NVSwitch Status	`instance/gpu/nvswitch_status`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	Whether an NVLink Switch on an NVIDIA GPU attached to a compute instance is encountering issues.
VM Infra Health	`instance/gpu/infra_health`	A4X, A4, A3 Ultra, or A3 Mega	The health of the cluster, block, sub-block, and host on which your compute instances are running. If this metric shows that a compute instance's infrastructure is unhealthy, then the metric also describes the issue.
VM Failure Prediction Score	`instance/gpu/failure_prediction_score`	A4X, A4, A3 Ultra, or A3 Mega	The likelihood that the host on which the compute instance runs degrades in the next five hours. The value can be between `0.0` and `1.0`. The closer the value remains to `1.0` for a consistent period of time, the more likely the compute instance will degrade. In that case, we recommend that you move the job to a different compute instance and, if you encounter issues with the compute instance, then report its host as faulty.
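
The guidance for VM Failure Prediction Score depends on the value staying close to `1.0` over a consistent period, not on a single spike. The following is a minimal sketch of that check, assuming you have already exported the metric samples as a list of floats ordered from oldest to newest. The function name, the `0.9` threshold, and the six-sample window are illustrative assumptions, not values from this document.

```python
from typing import Sequence


def stays_near_one(
    scores: Sequence[float],
    threshold: float = 0.9,    # assumption: what counts as "close to 1.0"
    min_consecutive: int = 6,  # assumption: what counts as "a consistent period"
) -> bool:
    """Return True if the newest `min_consecutive` samples are all >= threshold."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(scores) < min_consecutive:
        # Not enough history to call the score consistently high.
        return False
    recent = scores[-min_consecutive:]
    return all(score >= threshold for score in recent)
```

For example, `stays_near_one([0.2, 0.95, 0.96, 0.97], min_consecutive=3)` returns `True`, while a series whose recent samples dip below the threshold returns `False`.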

GPU performance metrics

To monitor the performance of your GPUs, use the following metrics:

Name	Metric type	Supported machine series	Description
Accumulated Context Utilization	`instance/gpu/accumulated_context_utilization_seconds`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	The total time, in seconds, that the GPU is busy processing a workload.
GPU Power Consumption	`instance/gpu/power_consumption`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	The power in watts (W) and in decimal values that is consumed on individual GPUs on the host. For compute instances with multiple GPUs attached, the metric provides the power consumption separately for each GPU on the host.
SM Utilization	`instance/gpu/sm_utilization`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	A non-zero value indicates that the streaming multiprocessors (SMs) on your GPUs are actively being used.
GPU Temperature	`instance/gpu/temperature`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	The temperature in Celsius (℃) and in decimal values of individual GPUs on the host. For compute instances with multiple GPUs attached, the metric provides the temperature separately for each GPU on the host.
GPU Thermal Margin	`instance/gpu/tlimit`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	The thermal headroom in Celsius (℃) and in decimal values that individual GPUs have before they need to slow down due to high temperature. For compute instances with multiple GPUs attached, the metric provides the thermal headroom separately for each GPU on the host.

GPU network performance metrics

To monitor the network performance of your GPUs, use the following metrics:

Name	Metric type	Supported machine series	Description
Link Carrier Changes	`instance/gpu/link_carrier_changes`	A4X, A4, A3 Ultra, or A3 Mega	The number of times that the network link carrier changes per minute.
Network RTT	`instance/gpu/network_rtt`	A4X, A4, A3 Ultra, or A3 Mega	The round-trip time, measured in microseconds, for network data to travel between a source and destination.
Network Traffic at Inter-Block	`instance/gpu/network/inter_block_tx`	A4X, A4, A3 Ultra, or A3 Mega	The number of bytes of network traffic among blocks.
Network Traffic at Inter-Sub-block	`instance/gpu/network/inter_subblock_tx`	A4X, A4, A3 Ultra, or A3 Mega	The number of bytes of network traffic among sub-blocks.
Network Traffic at Intra-Sub-block	`instance/gpu/network/intra_subblock_tx`	A4X, A4, A3 Ultra, or A3 Mega	The number of bytes of network traffic within a single sub-block.
NVLink Active Speed	`instance/gpu/nvlink_active_speed`	A4X Max, A4X, A4, A3 Ultra, or A3 Mega	The current access link port speed, in GBps.
Throughput Rx Bytes	`instance/gpu/throughput_rx_bytes`	A4X, A4, A3 Ultra, or A3 Mega	The number of bytes received from network traffic.
Throughput Tx Bytes	`instance/gpu/throughput_tx_bytes`	A4X, A4, A3 Ultra, or A3 Mega	The number of bytes transmitted as network traffic.
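
One way to read the three block-level traffic metrics together is as shares of the total transmitted bytes. The following sketch assumes that the intra-sub-block, inter-sub-block, and inter-block byte counts cover the same time window and don't overlap. This document doesn't state that, so treat the function and its name as hypothetical.

```python
from typing import Dict


def traffic_shares(
    intra_subblock_bytes: float,
    inter_subblock_bytes: float,
    inter_block_bytes: float,
) -> Dict[str, float]:
    """Return each traffic category as a fraction of the combined byte count.

    Assumption: the three inputs are disjoint categories for the same window.
    """
    counts = {
        "intra_subblock": intra_subblock_bytes,
        "inter_subblock": inter_subblock_bytes,
        "inter_block": inter_block_bytes,
    }
    total = sum(counts.values())
    if total == 0:
        # No traffic in the window; report zero shares instead of dividing by zero.
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}
```

A workload whose traffic mostly stays within a sub-block shows a high `intra_subblock` share. A growing `inter_block` share might be worth comparing against the Network RTT metric.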

GPU fatal errors metrics

To monitor the errors that your GPUs encounter and that might force your compute instances to stop, or negatively impact their performance, use the following metrics:

Name	Metric type	Supported machine series	Description
NVLink runtime error	`instance/gpu/nvlink_runtime_error`	A4X Max or A4X	Whether an NVLink runtime error occurred.
Uncorrectable DRAM ECC errors	`instance/gpu/dram_uncorrectable_ecc_error_count`	A4X Max or A4X	The number of uncorrectable error-correcting code (ECC) errors in GPU dynamic random access memory (DRAM).
Uncorrectable DRAM row remapping count	`instance/gpu/dram_uncorrectable_row_remapping_count`	A4X Max or A4X	The number of row remappings from uncorrectable errors in GPU DRAMs.
Uncorrectable DRAM row remapping failed	`instance/gpu/dram_row_remapping_failed`	A4X Max or A4X	Whether a row remapping in GPU DRAMs has failed due to one of the following issues: A remapping attempt on a memory bank failed because the memory bank already has eight uncorrectable error rows remapped. A remapping attempt on a row failed because the row was already remapped. A remapping attempt failed because 512 total remappings have occurred.
Uncorrectable PCIe errors	`instance/gpu/pcie_fatal_error_count`	A4X Max or A4X	The number of uncorrectable peripheral component interconnect express (PCIe) errors.
Uncorrectable cache ECC errors	`instance/gpu/cache_uncorrectable_ecc_error_count`	A4X Max or A4X	The number of uncorrectable ECC errors in cache memory.

ML workload metrics

To monitor the productivity—specifically, the goodput—of your ML workloads, use the following metrics:

Name	Metric type	Supported machine series	Description
Productive time	`workload/goodput_time`	A4X, A4, A3 Ultra, or A3 Mega	The time, in seconds, that the workload spends on goodput activities. These activities are core, useful tasks, such as a forward or backward pass during model training.
Non-productive time	`workload/badput_time`	A4X, A4, A3 Ultra, or A3 Mega	The time, in seconds, that the workload spends on badput activities. These activities are overhead tasks, such as loading or preprocessing data for training.
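
The two metrics above can be combined into a single productivity figure. The following sketch computes the fraction of measured time spent on goodput, assuming that productive and non-productive time together account for all measured workload time. That ratio and the function name are illustrative assumptions and aren't defined by the metrics themselves.

```python
from typing import Optional


def goodput_ratio(goodput_seconds: float, badput_seconds: float) -> Optional[float]:
    """Return goodput time as a fraction of goodput plus badput time.

    Returns None when no time has been recorded, so callers can tell
    "no data" apart from "0% productive".
    """
    total = goodput_seconds + badput_seconds
    if total <= 0:
        return None
    return goodput_seconds / total
```

For example, 90 seconds of productive time and 10 seconds of non-productive time give a ratio of `0.9`.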

Straggler detection metrics

Straggler detection metrics help you notice and pinpoint suspected stragglers. Stragglers are single-point, non-crashing failures that eventually slow down the entire workload.

To monitor straggler detection for your VMs, use the following metric:

Name	Metric type	Supported machine series	Description
Suspected Stragglers	`instance/gpu/straggler_status`	A4X, A4, A3 Ultra, or A3 Mega	Whether a VM is suspected as a straggler that is affecting the performance of the workload. We recommend that you act on suspected stragglers only when other metrics indicate that the workload is experiencing issues.

You can also view straggler detection metrics in the log entries for an A4X, A4, A3 Ultra, or A3 Mega instance. For example, you can use the following queries:

Logs with suspected stragglers for specific VMs. Use this query to check if there are any suspected stragglers for a specific workload in your project.

    logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic" AND jsonPayload.suspectedStragglersDetection.numNodes > 0 AND jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID"

Replace INSTANCE_ID with the ID of a VM. For each additional VM that you want to specify, add the following condition to the query:

    OR jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID"

All logs from straggler detection for your project. Use this query to verify if the straggler detection service is running when no suspected stragglers are detected. (Due to the limitations, you can't filter the logs without suspected stragglers by specific VMs.)


    logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic"
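
If you need to generate these queries for many VMs, you can assemble the filter string programmatically. The following sketch reproduces the two query forms shown above. The function name is hypothetical, and the instance IDs that you pass in replace the `INSTANCE_ID` placeholder.

```python
from typing import Sequence

# Clause shared by both queries in this section.
LOG_NAME_CLAUSE = 'logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic"'
INSTANCE_FIELD = "jsonPayload.suspectedStragglersDetection.nodes.instanceId"


def straggler_log_filter(instance_ids: Sequence[str] = ()) -> str:
    """Build a straggler detection log query.

    With no instance IDs, return the query for all straggler detection logs.
    Otherwise, return the query for suspected stragglers on those VMs, with
    one OR condition appended for each additional VM.
    """
    if not instance_ids:
        return LOG_NAME_CLAUSE
    id_clauses = " OR ".join(f'{INSTANCE_FIELD}="{vm_id}"' for vm_id in instance_ids)
    return (
        f"{LOG_NAME_CLAUSE} AND "
        "jsonPayload.suspectedStragglersDetection.numNodes > 0 AND "
        f"{id_clauses}"
    )
```

Calling `straggler_log_filter()` returns the project-wide query, and `straggler_log_filter(["ID_1", "ID_2"])` returns the per-VM query with one additional `OR` condition.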

Straggler detection metrics are particularly helpful for large-scale ML workloads for the following reasons:

Large-scale ML workloads are very susceptible to stragglers. Large-scale ML workloads use synchronous and massively distributed computing. (In other words, they have many, highly interdependent components that run simultaneously.) This architecture makes large-scale ML workloads very susceptible to single-point failures like stragglers.
Noticing and pinpointing stragglers in large-scale ML workloads is very difficult. For reference, consider that there are two types of single-point failures:
- Stopping failures: Failures that cause the entire system to halt, such as host errors and maintenance events. They are relatively straightforward to detect and resolve.
- Slow failures: Failures that cause severe performance degradation without crashes. They are very difficult to pinpoint and debug.
Due to their slow-failure nature, stragglers are inherently difficult to notice and pinpoint, especially in large-scale synchronous workloads.

Unresponsive workload detection metrics

Unresponsive workload detection metrics help you do the following:

Notice when an entire workload has stalled (sometimes referred to as a NCCL hang)
Understand why the workload has stalled, such as whether it was caused by a process crash or a stalled network

To detect and diagnose unresponsive workloads for your compute instances, use the following metrics:

Name	Metric type	Supported machine series	Description
Unresponsive workload events detected by using NCCL telemetry	`instance/gpu/nccl_hang`	A4X Max, A4X, A4, or A3 Ultra	The number of detected unresponsive workload events, as a time series.

Enable unresponsive workload detection

To enable unresponsive workload detection, you must enable CoMMA with heartbeat telemetry, a periodic ping signal that indicates that a workload is running. For recent versions of CoMMA, heartbeat telemetry is enabled by default. However, if you're using the version of CoMMA from version 1.1.1 of the NCCL/gIB bundle, then you must manually enable heartbeat telemetry. To verify which version of the NCCL/gIB bundle you're using, see Check NCCL and gIB version.

To manually enable heartbeat telemetry for CoMMA, specify the following environment variables in your training environment:

NCCL_PROFILER_HEARTBEAT=true

NCCL_PROFILER_HEARTBEAT_UPLOAD_INTERVAL=10s

Use NCCL_PROFILER_HEARTBEAT to turn heartbeat telemetry on or off, and NCCL_PROFILER_HEARTBEAT_UPLOAD_INTERVAL to specify the frequency of the heartbeat telemetry. For more information, see CoMMA environment variables.

Turn off unresponsive workload detection

To turn off unresponsive workload detection, turn off heartbeat telemetry in CoMMA by specifying the following environment variable in your training environment:

NCCL_PROFILER_HEARTBEAT=false
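
If you launch training from a Python wrapper, you can set these variables on the child process environment instead of exporting them in a shell. The following sketch builds such an environment. The helper name and the `train.py` launch command in the comment are hypothetical. Only the two variable names and their example values come from this document.

```python
import os
from typing import Dict, Mapping, Optional


def heartbeat_env(
    enabled: bool = True,
    upload_interval: str = "10s",
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with CoMMA heartbeat settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_PROFILER_HEARTBEAT"] = "true" if enabled else "false"
    if enabled:
        # The upload interval only matters while heartbeat telemetry is on.
        env["NCCL_PROFILER_HEARTBEAT_UPLOAD_INTERVAL"] = upload_interval
    return env


# Hypothetical usage with a training script of your own:
#   import subprocess
#   subprocess.run(["python", "train.py"], env=heartbeat_env(), check=True)
```

Passing `enabled=False` produces the environment described for turning off unresponsive workload detection.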

Understand why workloads are unresponsive

To understand why a workload is unresponsive, check the value of the label hang_reason by completing the following steps:

In the Google Cloud console, go to the Metrics explorer page:
Go to Metrics explorer

If you use the search bar to find this page, then select the result whose subheading is Monitoring.

Search for the following metric:

compute.googleapis.com/instance/gpu/nccl_hang

Use the Aggregation feature, and select the following labels:
- instance_id
- hang_reason

The following table lists possible values for the label, what those values mean about your workloads, and recommended next steps.

Label value	Description	Recommended next steps
`MissingHeartbeatIssue`	Heartbeat telemetry has stopped for one or more ranks, which typically indicates a fatal process or node crash.	Verify if the instance is still reachable. Verify if workload processes have crashed. Check for out-of-memory (OOM) events in system logs, such as `dmesg`. Look for hardware failures or NVIDIA XID errors.
`StalledRankIssue`	Heartbeat telemetry is still being received, but the ranks aren't progressing on NCCL operations.	Investigate potential deadlocks in app-level operations. Check if the application process is stuck in an operation that prevents it from communicating with others, such as computation or checkpointing.
`MissingCommunicatorIssue`	All of the ranks that belong to a NCCL communicator have stopped making progress.	Your workload might have been interrupted, or its NCCL communicators might have closed abruptly. If you're expecting a workload to run uninterrupted on this VM instance, then check if the workload has been abnormally interrupted or shut down.
`NoHangIssue`	The default value. No issues have been detected.	No action is required.
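
After you aggregate the metric by `instance_id` and `hang_reason`, you might want to group the affected instances by reason before you act on them. The following sketch assumes that you have exported the label pairs as `(instance_id, hang_reason)` tuples. The function name and the condensed next-step strings are illustrative summaries of the table above, not additional guidance.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Next steps condensed from the table above, keyed by hang_reason value.
NEXT_STEPS: Dict[str, str] = {
    "MissingHeartbeatIssue": "Check instance reachability, crashed processes, "
    "OOM events, and hardware or XID errors.",
    "StalledRankIssue": "Investigate application-level deadlocks or operations "
    "that block communication.",
    "MissingCommunicatorIssue": "Check whether the workload was interrupted or "
    "shut down abnormally.",
    "NoHangIssue": "No action is required.",
}


def group_by_hang_reason(samples: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group instance IDs by hang_reason, skipping the default NoHangIssue value."""
    grouped = defaultdict(set)
    for instance_id, hang_reason in samples:
        if hang_reason != "NoHangIssue":
            grouped[hang_reason].add(instance_id)
    # Sort the IDs so that the output is stable across runs.
    return {reason: sorted(ids) for reason, ids in grouped.items()}
```

You can then look up `NEXT_STEPS[reason]` for each key in the result to decide what to check first.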

View metrics

To view metrics for your compute instances and Slurm clusters, use Monitoring dashboards as follows:

To view infrastructure metrics and straggler detection metrics, you can do the following:
- For a quick overview of your infrastructure health and performance, or to customize an existing dashboard, use prebuilt dashboards.
- For specific monitoring needs, create custom dashboards.
To view ML workload metrics, see the documentation for how to set up monitoring for your workload.
To view logs from straggler detection, view straggler detection logs.

If you encounter issues when you use a dashboard, then see Troubleshoot slow performance.

Use prebuilt dashboards

You can use Monitoring dashboards that are prebuilt for AI Hypercomputer to view metrics for your compute instances and Slurm clusters. You can also create a copy of a prebuilt dashboard and modify it to fit your needs.

To use a prebuilt dashboard for AI Hypercomputer, do the following:

In the Google Cloud console, go to the Dashboards page:
Go to Dashboards

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the Name column, click the name of one of the following dashboards based on which metrics you want to view:
- To monitor compute instance health, GPU performance, and straggler detection, use the Cluster Director Health Monitoring dashboard.
  
  For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Health Monitoring playbook dashboard.
- To monitor network transmission efficiency, use the Cluster Director Transmission Efficiency dashboard.
- To monitor network efficiency among blocks and sub-blocks, use the Cluster Director Block Network dashboard.
  
  For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Block Network playbook dashboard.
The details page of your chosen dashboard opens. You can use the time-range selector in the toolbar to change the time range of the data.
Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.

Create custom dashboards

To create a custom Monitoring dashboard, do the following:

Choose the metrics to monitor. If you haven't already, then see Available metrics in this document.
Create and manage custom dashboards.

View straggler detection logs

To view straggler detection logs by using the Logs Explorer, complete the following steps:

In the Google Cloud console, go to the Logs Explorer page:
Go to Logs Explorer

If you use the search bar to find this page, then select the result whose subheading is Logging.

The page queries all logs in your project by default. Click Stop query.
Use the time-range selector in the toolbar to select the time range that you want to analyze.
In the Query pane, enter a query for straggler detection logs.
Click Run Query.

The following is an example of a straggler detection log entry.

  {
    ...
    "jsonPayload": {
      ...
      "@type": "type.googleapis.com/ml.aitelemetry.performancedebugging.output.NetworkStragglersOutput",
      "suspectedStragglersDetection": {
        "numNodes": 4,
        "nodes": [
          {
            "latencyMs": 9,
            "instanceId": "INSTANCE_ID_1"
          },
          {
            "latencyMs": 9,
            "instanceId": "INSTANCE_ID_2"
          },
          {
            "instanceId": "INSTANCE_ID_3",
            "latencyMs": 4
          },
          {
            "instanceId": "INSTANCE_ID_4",
            "latencyMs": 0
          }
        ],
        "message": "Suspected stragglers detected."
      }
    },
    "resource": {
      "type": "project",
      "labels": {
        "project_id": "PROJECT_NUMBER"
      }
    },
    ...
    "severity": "INFO",
    "logName": "projects/PROJECT_ID/logs/compute.googleapis.com%2Fworkload_diagnostic",
    ...
  }

The log entry includes the following fields:

numNodes: The number of suspected straggler compute instances that are detected in the project. In the example, four suspected straggler compute instances have been detected.
instanceId: The ID of a compute instance that was detected as a suspected straggler.
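
If you export log entries as JSON, you can extract the suspected stragglers with a few lines of code. The following sketch assumes that an entry has already been parsed into a dictionary with the same shape as the example above. The function name is hypothetical, and the sketch returns the `latencyMs` values as they appear without interpreting them.

```python
from typing import Any, Dict, List, Optional, Tuple


def suspected_stragglers(entry: Dict[str, Any]) -> List[Tuple[str, Optional[int]]]:
    """Return (instanceId, latencyMs) pairs from a straggler detection log entry.

    Returns an empty list when the entry has no suspectedStragglersDetection
    payload or no nodes.
    """
    detection = entry.get("jsonPayload", {}).get("suspectedStragglersDetection", {})
    return [
        (node["instanceId"], node.get("latencyMs"))
        for node in detection.get("nodes", [])
    ]


# A trimmed entry with the same structure as the example in this section.
example_entry = {
    "jsonPayload": {
        "suspectedStragglersDetection": {
            "numNodes": 2,
            "nodes": [
                {"latencyMs": 9, "instanceId": "INSTANCE_ID_1"},
                {"instanceId": "INSTANCE_ID_3", "latencyMs": 4},
            ],
            "message": "Suspected stragglers detected.",
        }
    },
    "severity": "INFO",
}

stragglers = suspected_stragglers(example_entry)
```

The pairs keep the order in which the nodes appear in the log entry.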