Cluster management capabilities

The A4X Max, A4X, A4, and A3 Ultra machine series are designed to enable you to run large-scale artificial intelligence (AI) and machine learning (ML) clusters and provide the following cluster management capabilities:

AI infrastructure resources colocation

When you use A4X Max, A4X, A4, or A3 Ultra, you can request host machines that Compute Engine provisions as close together as possible. These machines offer the following features:

This resource arrangement minimizes network hops and optimizes for lowest network latency. To learn more about how to obtain capacity to deploy densely allocated blocks of accelerator-optimized machines, see Capacity overview.

Cluster topology-aware placement

After you create compute instances by using A4X Max, A4X, A4, or A3 Ultra machine types, you can get topology information at the node and cluster levels. This information helps you do the following:

  • Adjust your application or workload design to further minimize network latency.

  • Understand and troubleshoot network latency and performance issues for instances that communicate frequently with each other. These issues can occur if the instances are unexpectedly located far apart.

For more information, see View compute instances topology.
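
As a rough illustration of how topology information might be used, the following Python sketch groups instances by the block they report and flags pairs of frequently communicating instances that land in different blocks. The record layout (`name`, `block`, `subblock`) and the function names are hypothetical. They are not the Compute Engine response format, so adapt the field names to whatever the topology view returns.

```python
from collections import defaultdict


def group_by_block(instances):
    """Map each block ID to the sorted names of instances placed in it."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst["block"]].append(inst["name"])
    return {block: sorted(names) for block, names in groups.items()}


def cross_block_pairs(instances, peers):
    """Return the peer pairs whose members sit in different blocks.

    `peers` is an iterable of (name_a, name_b) tuples for instances that
    communicate frequently with each other.
    """
    block_of = {inst["name"]: inst["block"] for inst in instances}
    return [(a, b) for a, b in peers if block_of[a] != block_of[b]]


# Hypothetical topology records, for illustration only.
instances = [
    {"name": "vm-0", "block": "block-a", "subblock": "sb-1"},
    {"name": "vm-1", "block": "block-a", "subblock": "sb-1"},
    {"name": "vm-2", "block": "block-b", "subblock": "sb-7"},
]
peers = [("vm-0", "vm-1"), ("vm-1", "vm-2")]

print(group_by_block(instances))
print(cross_block_pairs(instances, peers))
```

Pairs returned by `cross_block_pairs` are candidates for closer inspection when latency between those instances is higher than expected.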

Cluster operational mode

When you reserve capacity to create compute instances or clusters by using A4X Max, A4X, A4, or A3 Ultra machine types, the machine type that you reserve determines the cluster operational mode for the instances. This mode specifies how your instances behave after host errors or faulty host reports. The following operational modes are available:

  • Managed mode: Compute Engine automatically replaces any faulty machines, but holds back part of your reserved capacity to help ensure that your instances have the necessary resources to restart.

  • All capacity mode: you have access to your full reserved capacity, but you are responsible for managing failures and planned maintenance.

For more information, see Reservation operational mode.

Cluster maintenance scheduling and controls

You can control maintenance of A4X Max, A4X, A4, or A3 Ultra machines by using topology-aware scheduling in a block of resources. This capability helps synchronize upgrades so that your workloads are more resilient to host events and experience fewer disruptions, which helps improve the goodput of your workload.
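
To make the goodput claim concrete, the following Python sketch compares synchronized and unsynchronized maintenance under simplifying assumptions that are not from the source: the job is tightly coupled, so it stalls whenever any instance is under maintenance; each instance needs one maintenance event of equal length; and unsynchronized events never overlap. Goodput is taken here as the fraction of wall-clock time spent on useful work. The function names and all numbers are illustrative.

```python
def goodput(total_hours, stalled_hours):
    """Fraction of wall-clock time during which the job makes progress."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    stalled = min(stalled_hours, total_hours)
    return (total_hours - stalled) / total_hours


def job_stall_hours(num_instances, hours_per_event, synchronized):
    """Hours the whole job is stalled by one round of maintenance.

    Synchronized: all instances are maintained in the same window.
    Unsynchronized: windows are assumed not to overlap (worst case).
    """
    if synchronized:
        return hours_per_event
    return num_instances * hours_per_event


# Illustrative numbers only: 16 instances, 2-hour events, 30-day period.
period = 30 * 24
sync = goodput(period, job_stall_hours(16, 2, synchronized=True))
unsync = goodput(period, job_stall_hours(16, 2, synchronized=False))
print(f"synchronized:   {sync:.3f}")
print(f"unsynchronized: {unsync:.3f}")
```

Under these assumptions the synchronized pattern stalls the job once per round, while the unsynchronized pattern stalls it once per instance, so the gap widens as the cluster grows.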

To facilitate full control of maintenance events, you can use the following features:

Maintenance scheduling type

When you reserve capacity to create compute instances or clusters of A4X Max, A4X, A4, or A3 Ultra machines, you can define how Compute Engine maintains the infrastructure that your instances run on. Based on the machine type that you want to use for your instances, you can choose between grouped maintenance, which is synchronized across instances, and independent maintenance, where instances follow different maintenance schedules.

For more information, see Maintenance scheduling types.

Manage host events

After you create A4X Max, A4X, A4, or A3 Ultra instances and start your workload, you can set up alerts and receive notifications when maintenance for your instances or reserved blocks is scheduled, starts, or is completed. You can also view and, if needed, manually start maintenance on an instance or reserved block before its scheduled time. These options help you proactively control and minimize downtime for your workloads.


Cluster monitoring and diagnostic tooling

For monitoring and troubleshooting, A4X, A4, and A3 Ultra machines include the following services:
