This document explains how to view the physical layout and real-time status of your cluster nodes in Cluster Director.
You can use the Google Cloud console to view the topology of the nodes in your cluster, including their health, usage, and maintenance schedules. This information can help you manage high-performance workloads, troubleshoot networking issues, and keep your cluster configuration optimal.
Before you begin
Before you view the topology of the nodes in your cluster, consider the following:
You can only view topology information for virtual machine (VM) instances that use an A4X, A4, A3 Ultra, or A3 Mega machine type and that were created by using a reservation.
When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
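The machine type and reservation requirement described earlier in this section can be sketched as a simple check. This is a minimal illustration only: the VmInfo record, its field names, and the has_topology_info function are hypothetical and aren't part of any Cluster Director API. Only the four machine type families come from this document.

```python
from dataclasses import dataclass

# Machine type families that this document lists as supporting topology views.
SUPPORTED_FAMILIES = {"A4X", "A4", "A3 Ultra", "A3 Mega"}


@dataclass
class VmInfo:
    """Hypothetical record for illustration; not a Cluster Director API type."""
    name: str
    machine_family: str
    created_from_reservation: bool


def has_topology_info(vm: VmInfo) -> bool:
    """Return True if the VM meets both conditions stated in this document."""
    return vm.machine_family in SUPPORTED_FAMILIES and vm.created_from_reservation


# Example inventory (made-up names and values).
vms = [
    VmInfo("vm-1", "A4", True),
    VmInfo("vm-2", "A3 Mega", False),  # supported family, but no reservation
    VmInfo("vm-3", "N2", True),        # reservation, but unsupported family
]
eligible = [vm.name for vm in vms if has_topology_info(vm)]
print(eligible)
```

In this sketch, a VM must satisfy both conditions, so only vm-1 is reported as eligible.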
Required roles
To get the permissions that you need to view clusters, ask your administrator to grant you the Hypercompute Cluster Viewer (roles/hypercomputecluster.viewer) IAM role on the project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to view clusters. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to view clusters:
To view the details of a cluster: hypercomputecluster.clusters.describe
You might also be able to get these permissions with custom roles or other predefined roles.
View a cluster topology
To view a cluster topology by using the Google Cloud console, complete the following steps:
In the Google Cloud console, go to the Cluster Director page.
In the navigation menu, click Clusters. The Clusters page appears.
In the Clusters table, in the Name column, click the name of the cluster whose details you want to view. A page that gives the details of the cluster appears, and the Details tab is selected.
Click the Topology tab. Then, based on the topology information that you want to view, click one of the following tabs:
To view cluster health information, click the Health tab.
To view cluster maintenance information, click the Maintenance tab.
To view cluster CPU or GPU usage information, click the CPU utilization tab or the GPU utilization tab.
For more information, see Understand cluster topology tabs in this document.
Understand cluster topology tabs
When you view a cluster topology, you can view the following tabs to monitor the status and performance of your cluster:
Cluster health information
When you view the Health tab on the page that gives the details of a cluster, you can see the health status for each compute node in your cluster. Each node is color-coded to indicate one of three operational states:
Healthy: the node is fully operational.
Suspected bad: the system has detected potential issues with your node, but the node remains operational.
Unhealthy: the node has failed health checks and has stopped.
If the system flags a node as Suspected bad or Unhealthy, then you can report it to start a repair operation. This action can help you to minimize disruptions to your running workloads. For instructions on reporting faulty hosts, see Report and repair unhealthy cluster nodes.
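If you track node health outside the console, the three states and the reporting rule described in this section can be modeled as follows. This is a sketch under stated assumptions: the NodeHealth enum, the nodes_to_report function, and the node names are hypothetical, and the code doesn't call any Cluster Director API. Only the three state names and the rule that Suspected bad and Unhealthy nodes can be reported come from this document.

```python
from enum import Enum


class NodeHealth(Enum):
    """The three operational states shown on the Health tab."""
    HEALTHY = "Healthy"
    SUSPECTED_BAD = "Suspected bad"
    UNHEALTHY = "Unhealthy"


# States for which this document says you can report a node for repair.
REPORTABLE = {NodeHealth.SUSPECTED_BAD, NodeHealth.UNHEALTHY}


def nodes_to_report(health_by_node: dict[str, NodeHealth]) -> list[str]:
    """Return the names of nodes that qualify for a repair report, sorted by name."""
    return sorted(
        name for name, state in health_by_node.items() if state in REPORTABLE
    )


# Example input (made-up node names); in practice you would copy the states
# that you see on the Health tab.
observed = {
    "node-a": NodeHealth.HEALTHY,
    "node-b": NodeHealth.SUSPECTED_BAD,
    "node-c": NodeHealth.UNHEALTHY,
}
print(nodes_to_report(observed))
```

A Suspected bad node is still operational, so reporting it is a judgment call; the sketch lists it only because this document says that it can be reported.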
Cluster maintenance information
When you view the Maintenance tab on the page that gives the details of a cluster, you can view information about any upcoming maintenance events for the nodes in your cluster. This information lets you anticipate and plan for necessary downtime.
If a maintenance event is scheduled for the nodes in your cluster, then you can manually start maintenance, rather than waiting for the scheduled time. For instructions on manually starting maintenance, see Manually start a cluster maintenance.
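To plan around the events shown on the Maintenance tab, you might list the nodes whose maintenance starts before a deadline that matters to your workload. The following sketch assumes that you have copied the scheduled start times into a dictionary yourself. The function name, node names, and timestamps are hypothetical, and the code doesn't read anything from Cluster Director.

```python
from datetime import datetime, timezone


def nodes_with_maintenance_before(
    schedule: dict[str, datetime], deadline: datetime
) -> list[str]:
    """Return nodes whose scheduled maintenance starts at or before the deadline."""
    return sorted(node for node, start in schedule.items() if start <= deadline)


# Example schedule (made-up values, all in UTC).
schedule = {
    "node-a": datetime(2025, 1, 10, 2, 0, tzinfo=timezone.utc),
    "node-b": datetime(2025, 1, 20, 2, 0, tzinfo=timezone.utc),
}
# Hypothetical end time of a long-running training job.
job_end = datetime(2025, 1, 15, 0, 0, tzinfo=timezone.utc)

print(nodes_with_maintenance_before(schedule, job_end))
```

Nodes in the result are candidates for starting maintenance manually before the job begins, rather than during it.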
Cluster CPU and GPU usage information
When you view the CPU utilization and GPU utilization tabs on the page that gives the details of a cluster, you can see a visual representation of resource usage across all nodes in your cluster in real time. These views help you understand how your workloads are distributed and whether your resources are being used efficiently. Use this information to identify performance bottlenecks and opportunities for optimization.
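One way to reason about the distribution that these views show is to group nodes by utilization. The following sketch is illustrative only: the summarize_utilization function, the 0.2 and 0.9 thresholds, and the sample values are assumptions and don't come from Cluster Director.

```python
from statistics import mean


def summarize_utilization(
    util_by_node: dict[str, float], low: float = 0.2, high: float = 0.9
) -> dict:
    """Summarize per-node utilization, given as fractions between 0.0 and 1.0.

    The low and high thresholds are arbitrary assumptions for this example.
    """
    return {
        "mean": mean(util_by_node.values()),
        # Nodes below the low threshold might indicate unused capacity.
        "underused": sorted(n for n, u in util_by_node.items() if u < low),
        # Nodes above the high threshold might indicate a bottleneck.
        "saturated": sorted(n for n, u in util_by_node.items() if u > high),
    }


# Example readings (made-up values), such as those you might note from the
# GPU utilization tab.
gpu_util = {"node-a": 0.95, "node-b": 0.10, "node-c": 0.60}
summary = summarize_utilization(gpu_util)
print(summary["underused"], summary["saturated"])
```

A wide gap between underused and saturated nodes suggests that workloads are unevenly distributed across the cluster.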