This document explains how to monitor cluster resource usage in Cluster Director by using pre-configured Cloud Monitoring dashboards.
Cluster Director provides built-in Monitoring dashboards that let you view pre-configured telemetry data and track the performance of the resources that your cluster uses. Use these dashboards to observe the health and resource usage of your cluster.
To learn more about Monitoring dashboards, see Dashboards overview.
Before you begin
When you access and use the Google Cloud console, you don't need to authenticate. You can automatically use Google Cloud services and APIs.
Required roles
To get the permissions that you need to view clusters and their Monitoring dashboards, ask your administrator to grant you the following IAM roles on the project:
- View the details of a cluster: Hypercompute Cluster Viewer (roles/hypercomputecluster.viewer)
- View Monitoring dashboards: Monitoring Viewer (roles/monitoring.viewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
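If you want to confirm the role grants before opening the console, the following minimal sketch checks a project IAM policy for the two roles listed above. It assumes that you have already exported the policy as JSON (for example, with `gcloud projects get-iam-policy PROJECT_ID --format=json`) and loaded it into a dictionary. The function name `missing_roles`, the example policy, and the example user are hypothetical and used for illustration only.

```python
from __future__ import annotations

# Roles named in this document. The policy layout below follows the JSON
# printed by `gcloud projects get-iam-policy PROJECT_ID --format=json`.
REQUIRED_ROLES = {
    "roles/hypercomputecluster.viewer",  # view the details of a cluster
    "roles/monitoring.viewer",           # view Monitoring dashboards
}


def missing_roles(policy: dict, member: str) -> set[str]:
    """Return the required roles that `member` is not directly bound to."""
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return REQUIRED_ROLES - granted


# Hypothetical policy and user, for illustration only.
example_policy = {
    "bindings": [
        {"role": "roles/monitoring.viewer", "members": ["user:alex@example.com"]},
        {"role": "roles/owner", "members": ["user:admin@example.com"]},
    ]
}

print(missing_roles(example_policy, "user:alex@example.com"))
# {'roles/hypercomputecluster.viewer'}
```

This check looks only at direct bindings on the project. It doesn't account for roles inherited from a folder or organization, roles granted through group membership, or custom roles that carry the same permissions, so an empty result is reassuring but a non-empty result isn't conclusive.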
View Monitoring dashboards
To view the prebuilt Monitoring dashboards in a cluster, complete the following steps:
In the Google Cloud console, go to the Cluster Director page.
In the navigation menu, click Clusters. The Clusters page appears.
In the Clusters table, in the Name column, click the name of the cluster whose details you want to view. The cluster details page appears, with the Details tab selected.
Click the Observability tab. A pane that shows the available prebuilt Monitoring dashboards for your cluster appears.
Available Monitoring dashboards
The prebuilt Monitoring dashboards that you can view in a cluster show real-time and historical data for the following resource categories:
CPU: you can monitor vCPU usage across all nodes in your cluster. This information can help you improve the performance of your workloads or save costs by deleting unused resources.
Memory: you can track memory usage to help ensure that your nodes have sufficient memory for your jobs and prevent out-of-memory errors.
Network: you can observe network traffic patterns, including bandwidth and throughput. This information is useful for diagnosing and troubleshooting communication bottlenecks in distributed workloads.
GPU: for clusters with GPU nodes, you can monitor GPU usage and GPU memory usage. This information helps you verify that your GPU resources are used efficiently.
Health: you can view high-level health metrics for the components in your cluster, giving you a quick overview of the operational status of your nodes.
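As an illustration of how the CPU data can point to unused resources, the following minimal sketch flags nodes whose average utilization stays below a threshold. It assumes that you have copied per-node utilization samples (as fractions between 0 and 1) out of a dashboard by some means. The node names, the sample values, the 10% threshold, and the function name `underused_nodes` are all hypothetical; Cluster Director doesn't prescribe a threshold.

```python
from statistics import mean


def underused_nodes(samples, threshold=0.10):
    """Return the sorted names of nodes whose mean utilization is below `threshold`.

    `samples` maps a node name to a list of utilization fractions (0.0 to 1.0).
    Nodes with no samples are skipped rather than reported as underused.
    """
    return sorted(
        node
        for node, values in samples.items()
        if values and mean(values) < threshold
    )


# Hypothetical CPU utilization samples, for illustration only.
cpu_samples = {
    "node-0": [0.82, 0.91, 0.88],
    "node-1": [0.03, 0.05, 0.04],
    "node-2": [0.45, 0.10, 0.02],
}

print(underused_nodes(cpu_samples))
# ['node-1']
```

A mean over a short window can hide bursty workloads, as node-2 in the example suggests, so it's worth checking the historical view in the dashboard before you delete a node.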