Vertex AI training clusters support a variety of machine types to accommodate different workloads. You can choose from the following options when configuring your cluster's node pools:
- a4-highgpu-8g
- a4x-highgpu-4g
- a3-ultragpu-8g
- a3-megagpu-8g
- N2 CPU machine family
A4X machine type
Vertex AI training clusters support the A4X accelerator-optimized machine type (a4x-highgpu-4g), an exascale platform based on the NVIDIA GB200 NVL72 rack-scale architecture.
Architectural comparison
The following table outlines the fundamental hardware differences between the A4X family and other accelerator-optimized families.
| Feature | A4X (a4x-highgpu-4g) | A3 / A4H |
|---|---|---|
| CPU Architecture | ARM | X86 |
| GPU Count | 4 GPUs per node | 8 GPUs per node |
| Reservation Type | All capacity mode | Managed Mode |
| Placement Policy | Strict (Compact) | Flexible |
A4X specific guidelines
- Your A4X node pool VM count must be a multiple of 18 (for example, 18, 36, 54). This is required because A4X capacity is provisioned in fixed, non-shareable 18-node blocks called NVLink domains. These domains are bound by a strict Compact Placement Policy, and any partially allocated blocks cannot be used by other clusters.
- Due to the ARM-based architecture of A4X nodes, you must make two key changes to your training workloads:
- Use ARM-Compatible images: All training jobs must use a container image that is built for the ARM architecture.
- Adapt for 4 GPUs: Your distributed training logic must be updated to correctly recognize and use the 4 GPUs available on each A4X node.
- Host fault reporting and downtime: When you report a host as faulty, be aware of the following recovery process:
  - No standby capacity: The system doesn't use a standby spare pool for an instant node replacement.
  - Repair-based recovery: The node remains unavailable until the underlying physical host is repaired.
  - Extended downtime: This repair process typically takes 3 to 14 days.
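The sizing and topology rules above can be checked up front. The following is an illustrative sketch (not part of any Vertex AI SDK): a hypothetical helper that validates an A4X node pool size against the 18-node NVLink-domain rule and reports the resulting GPU count at 4 GPUs per node.

```python
def validate_a4x_node_count(vm_count: int) -> int:
    """Check that an A4X node pool size fits whole 18-node NVLink domains.

    Returns the number of NVLink domains the pool occupies; raises
    ValueError if the count would leave a partially allocated block.
    """
    DOMAIN_SIZE = 18   # A4X capacity is provisioned in fixed 18-node blocks
    GPUS_PER_NODE = 4  # a4x-highgpu-4g exposes 4 GPUs per node

    if vm_count <= 0 or vm_count % DOMAIN_SIZE != 0:
        raise ValueError(
            f"A4X VM count must be a positive multiple of {DOMAIN_SIZE}, "
            f"got {vm_count}"
        )
    domains = vm_count // DOMAIN_SIZE
    total_gpus = vm_count * GPUS_PER_NODE
    print(f"{vm_count} nodes -> {domains} NVLink domain(s), {total_gpus} GPUs")
    return domains
```

For example, a 36-node pool occupies exactly 2 NVLink domains and exposes 144 GPUs, while a 20-node request would be rejected because it strands part of a domain.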
Capacity provisioning
Choosing the right provisioning model is critical for balancing cost, speed, and resource availability. See the following provisioning options:
- RESERVATION: Allocates nodes from a specific Compute Engine reservation that you've created in advance. This model ensures capacity and is the recommended choice for high-demand resources.
- FLEX_START: Uses the Dynamic Workload Scheduler to queue your job. The job begins automatically as soon as the requested compute resources become available, offering a flexible start time without requiring a reservation.
- SPOT: Provisions the node pool using Spot VMs. This is the most cost-effective option, but it should only be used for workloads that are fault-tolerant and can handle interruptions, as the VMs may be preempted at any time.
- ON_DEMAND: The default option for CPU-only node pools, best suited for machine types that are not scarce. It provides standard VM instances with predictable, pay-as-you-go pricing.
Use the following guidance to make your selection:
- For high-demand GPU resources (like A3 and A4): The RESERVATION model is strongly recommended. It ensures you have dedicated access to the capacity you need for critical training jobs.
- For bursty or flexible workloads: Consider FLEX_START or SPOT. FLEX_START queues your job until resources are available, while SPOT offers significant cost savings for fault-tolerant jobs that can handle preemption.
- For abundant machine types: The ON_DEMAND model is the preferred choice. Use it for machine types that are not scarce and where immediate availability isn't a concern.
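The guidance above amounts to a short decision rule. The following helper is purely illustrative (the strings match the option names in this document, not any specific API enum) and encodes that rule:

```python
def pick_provisioning_model(high_demand_gpu: bool,
                            fault_tolerant: bool,
                            flexible_start: bool) -> str:
    """Illustrative decision helper mirroring the guidance above."""
    if high_demand_gpu:
        return "RESERVATION"   # dedicated capacity for scarce A3/A4 GPUs
    if fault_tolerant:
        return "SPOT"          # cheapest, but VMs may be preempted
    if flexible_start:
        return "FLEX_START"    # queue the job until resources free up
    return "ON_DEMAND"         # default for abundant, CPU-only pools
```

For instance, a critical A3 training job maps to RESERVATION, while a fault-tolerant batch job maps to SPOT regardless of when it can start.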
Using a shared reservation (optional)
If you'd like to use a shared reservation rather than a local reservation, there are additional steps to take before you can create a cluster.
Before using a shared reservation with Vertex AI training clusters, verify that the shared reservation works by manually creating a VM that uses it. If the VM creation succeeds, move on to the next step.
In the cluster creation configuration, specify the reservation name in the following format:
projects/RESERVATION_HOST_PROJECT_ID/zones/RESERVATION_ZONE/reservations/RESERVATION_NAME.
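Assembling that fully qualified name is simple string formatting. The sketch below is illustrative only (the helper and the sample values are hypothetical, not from any SDK); substitute your real host project ID, zone, and reservation name:

```python
def shared_reservation_path(host_project_id: str, zone: str,
                            reservation_name: str) -> str:
    """Build the fully qualified shared-reservation name in the format
    expected by the cluster creation configuration (shown above)."""
    return (f"projects/{host_project_id}/zones/{zone}"
            f"/reservations/{reservation_name}")

# Hypothetical values for illustration:
print(shared_reservation_path("my-host-project", "us-central1-a", "my-reservation"))
```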
What's next
After selecting the compute and provisioning options for your training cluster, you are ready to create the cluster and run a workload on it.
- Create a Compute Engine reservation: The RESERVATION model is used for allocating high-demand resources like GPUs. Learn how to create a new reservation in Compute Engine to get dedicated access to your required resources.
- Create your training cluster: Apply the configurations you've learned about by following the step-by-step guide to create your first persistent training cluster using the Vertex AI API or gcloud.
- Submit a training job to your cluster: Once your cluster is active, the next step is to run a workload. Submit a CustomJob that targets your persistent cluster for execution.
- Adapt your code for distributed training: To take full advantage of a multi-node cluster, adapt your training code for a distributed environment.