This document shows you how to create an AI-optimized Google Kubernetes Engine (GKE) cluster that uses A4X Max Compute Engine instances to support your AI and ML workloads.
The A4X Max and A4X series let you run large-scale AI/ML clusters by using the NVIDIA Multi-Node NVLink (MNNVL) system, a rack-scale solution that enables higher GPU power and performance. These machines offer features such as targeted workload placement, topology-aware scheduling, and advanced cluster maintenance controls. For more information, see Cluster management capabilities. With A4X Max, GKE also provides an automated networking setup that simplifies cluster configuration.
AI and ML workloads, such as distributed training, require powerful acceleration to optimize performance by reducing job completion times. GKE provides a single platform surface to run a diverse set of workloads for your organizations, reducing the operational burden of managing multiple platforms. You can run workloads such as high-performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. For workloads that require high performance, high throughput, and low latency, GPUDirect RDMA reduces the network hops that are required to transfer payloads to and from GPUs. This approach more efficiently uses the network bandwidth that's available. For more information, see GPU networking stacks.
In this document, you learn how to create a GKE cluster with the Google Cloud CLI for maximum flexibility in configuring your cluster based on the needs of your workload. To use the gcloud CLI to create clusters with other machine types, see the following:
- A4X: Create a custom AI-optimized GKE cluster which uses A4X.
- A4 or A3 Ultra: To create a cluster which uses A4 or A3 Ultra, see Create a custom AI-optimized GKE cluster which uses A4 or A3 Ultra. You can use these machine series for running workloads with or without GPUDirect RDMA.
Alternatively, you can choose to use Cluster Toolkit to quickly deploy your cluster with default settings that reflect best practices for many use cases. For more information, see Create an AI-optimized GKE cluster with default configuration.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Obtain capacity
You can obtain capacity for A4X Max compute instances by creating a future reservation. For more information about future reservations, see the Future reservations in AI Hypercomputer column in the table for Choose a consumption option.
To obtain capacity with a future reservation, see the Future reservations in AI Hypercomputer row in the table for How to obtain capacity.
Requirements
The following requirements apply to an AI-optimized GKE cluster with A4X Max compute instances:
For A4X Max, you must use one of the following versions:
- For 1.35 or later, use GKE version 1.35.0-gke.2745000 or later.
- For 1.34, use GKE version 1.34.3-gke.1318000 or later.
These versions help to ensure that A4X Max uses the following:
- R580.95.05, the minimum GPU driver version for A4X Max, which is enabled by default.
- Coherent Driver-based Memory Management (CDMM), which is enabled by default. NVIDIA recommends that Kubernetes clusters enable this mode to resolve memory over-reporting. CDMM allows GPU memory to be managed through the driver instead of the operating system (OS). This approach helps you to avoid OS onlining of GPU memory, and exposes the GPU memory as a Non-Uniform Memory Access (NUMA) node to the OS. Multi-instance GPUs aren't supported when CDMM is enabled. For more information about CDMM, see Hardware and Software Support.
- GPUDirect RDMA and MNNVL, which are recommended to enable A4X Max node pools to use the networking capabilities of A4X Max.
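You can check a cluster's version against these minimums mechanically. The following shell sketch is illustrative only: the function name is an assumption, and it assumes any non-1.34 version is compared against the 1.35 minimum. It relies on sort -V for version ordering:

```shell
# Sketch only: check a GKE version string against the A4X Max minimums
# listed above. meets_a4x_max_minimum is a hypothetical helper, not a
# gcloud or GKE command.
meets_a4x_max_minimum() {
  local version=$1 minimum
  case "$version" in
    1.34.*) minimum="1.34.3-gke.1318000" ;;
    *)      minimum="1.35.0-gke.2745000" ;;
  esac
  # sort -V orders version strings; the minimum must sort first (or be equal).
  [ "$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n1)" = "$minimum" ]
}

meets_a4x_max_minimum "1.34.3-gke.1318000" && echo "1.34.3-gke.1318000: ok"
meets_a4x_max_minimum "1.34.2-gke.100" || echo "1.34.2-gke.100: too old"
```

In practice you would pass the output of a command such as gcloud container clusters describe to this check instead of a literal string.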
The GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
Your GKE workload must use all available GPUs and your Pod must use all available secondary NICs on a single GKE node. Multiple Pods cannot share RDMA on a single GKE node.
You must use the reservation-bound provisioning model to create clusters with A4X Max. Other provisioning models are not supported.
These instructions use DRANET to configure an AI-optimized GKE cluster with A4X Max. Multi-networking isn't supported for the a4x-maxgpu-4g-metal machine type.
Considerations for creating a cluster
When you create a cluster, consider the following information:
- Choose a cluster location:
- Verify that you use a location which has availability for the machine type that you choose. For more information, see Accelerator availability.
- When you create node pools in a regional cluster (recommended for production workloads), you can use the --node-locations flag to specify the zones for your GKE nodes.
- Choose a driver version:
- The driver version can be one of the following values:
  - default: install the default driver version for your GKE node version. For more information about the requirements for default driver versions, see the Requirements section.
  - latest: install the latest available driver version for your GKE version. This option is available only for nodes that use Container-Optimized OS.
  - disabled: skip automatic driver installation. You must manually install a driver after you create the node pool.
- For more information about the default and latest GPU driver versions for GKE node versions, see the table in the section Manually install NVIDIA GPU drivers.
- Choose a reservation affinity:
- You can find information about your reservation, such as the name of your reservation or the name of a specific block in your reservation. To find these values, see View future reservation requests.
- The --reservation-affinity flag can take the values of specific or any. However, for high-performance distributed AI workloads, we recommend that you use a specific reservation. When you use a specific reservation, including shared reservations, specify the value of the --reservation flag in the following format:

      projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME

  Replace the following values:

  - PROJECT_ID: your Google Cloud project ID.
  - RESERVATION_NAME: the name of your reservation.
  - BLOCK_NAME: the name of a specific block within the reservation.
- We also recommend that you use a sub-block targeted reservation so that compute instances are placed on a single sub-block within BLOCK_NAME. Add the following to the end of the path:

      /reservationSubBlocks/SUB_BLOCK_NAME

  Replace SUB_BLOCK_NAME with the name of the sub-block.
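As a sanity check on the path format, the following shell sketch assembles the --reservation value from its parts. The helper name and the sample values are hypothetical:

```shell
# Hypothetical helper that builds the --reservation flag value.
# The fourth argument (sub-block) is optional.
build_reservation_path() {
  local project=$1 reservation=$2 block=$3 sub_block=${4:-}
  local path="projects/${project}/reservations/${reservation}/reservationBlocks/${block}"
  if [ -n "$sub_block" ]; then
    path="${path}/reservationSubBlocks/${sub_block}"
  fi
  echo "$path"
}

# Placeholder names, not real resources:
build_reservation_path my-project my-reservation block-1 sub-block-a
# → projects/my-project/reservations/my-reservation/reservationBlocks/block-1/reservationSubBlocks/sub-block-a
```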
Create an AI-optimized GKE cluster which uses A4X Max and GPUDirect RDMA
For distributed AI workloads, multiple GPU nodes are often linked together to work as a single computer. A4X Max is an exascale platform based on NVIDIA GB300 NVL72 rack-scale architecture. A4X Max compute instances use a multi-layered, hierarchical networking architecture with a rail-aligned design to optimize performance for various communication types. This machine type enables scaling and collaboration across multiple GPUs by delivering a high-performance cloud experience for AI workloads. For more information about the network architecture for A4X Max, including the network bandwidth and NIC arrangement, see A4X Max machine type (bare metal).
To create a GKE Standard cluster with A4X Max that uses GPUDirect RDMA and MNNVL, complete the steps that are described in the following sections:
- Create the GKE cluster
- Create a workload policy
- Create a node pool with A4X Max
- Configure the MRDMA NICs with
asapd-lite - Install the NVIDIA Compute Domain CRD and DRA driver
- Configure your workload manifest for RDMA and IMEX domain
These instructions use accelerator network profiles to automatically configure VPC networks and subnets for your A4X Max nodes. Alternatively, you can explicitly specify your VPC network and subnets.
Create the GKE cluster
Create a GKE Standard cluster:
gcloud container clusters create CLUSTER_NAME \
    --enable-dataplane-v2 \
    --enable-ip-alias \
    --location=COMPUTE_REGION \
    --cluster-version=CLUSTER_VERSION \
    --no-enable-shielded-nodes [\
    --services-ipv4-cidr=SERVICE_CIDR \
    --cluster-ipv4-cidr=POD_CIDR \
    --addons=GcpFilestoreCsiDriver=ENABLED]

Replace the following:

- CLUSTER_NAME: the name of your cluster.
- CLUSTER_VERSION: the version of your new cluster. For more information about which version of GKE supports your configuration, see the Requirements in this document.
- COMPUTE_REGION: the name of the compute region.

Optionally, you can explicitly provide the secondary CIDR ranges for services and Pods. If you use these optional flags, then replace the following variables:

- SERVICE_CIDR: the secondary CIDR range for services.
- POD_CIDR: the secondary CIDR range for Pods.

When you use these flags, you must verify that the CIDR ranges don't overlap with subnet ranges for additional node networks. For example, consider SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19. For more information, see Adding Pod IPv4 address ranges.
To run the kubectl commands in the next sections, connect to your cluster:

gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION

Replace the following:

- CLUSTER_NAME: the name of your cluster.
- COMPUTE_REGION: the name of the compute region.
For more information, see Install kubectl and configure cluster access.
Create a workload policy
A workload policy is required to create a partition. For more information, see Workload policy for MIGs.
Create a HIGH_THROUGHPUT workload policy with the accelerator_topology field
set to 1x72.
gcloud beta compute resource-policies create workload-policy WORKLOAD_POLICY_NAME \
--type HIGH_THROUGHPUT \
--accelerator-topology 1x72 \
--project PROJECT \
--region COMPUTE_REGION
Replace the following:
- WORKLOAD_POLICY_NAME: the name of your workload policy.
- PROJECT: the name of your project.
- COMPUTE_REGION: the name of the compute region.
Create a node pool with A4X Max
Create the following configuration file to pre-allocate hugepages with the node pool:
cat > node_custom.yaml <<EOF
linuxConfig:
  hugepageConfig:
    hugepage_size2m: 4096
EOF
export NODE_CUSTOM=node_custom.yaml

Create an A4X Max node pool:

gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=COMPUTE_REGION \
    --node-locations=COMPUTE_ZONE \
    --num-nodes=NODE_COUNT \
    --placement-policy=WORKLOAD_POLICY_NAME \
    --machine-type=a4x-maxgpu-4g-metal \
    --accelerator=type=nvidia-gb300,count=4,gpu-driver-version=latest \
    --system-config-from-file=${NODE_CUSTOM} \
    --accelerator-network-profile=auto \
    --node-labels=cloud.google.com/gke-networking-dra-driver=true,cloud.google.com/gke-dpv2-unified-cni=cni-migration \
    --reservation-affinity=specific \
    --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME/reservationSubBlocks/SUB_BLOCK_NAME

Replace the following:

- NODE_POOL_NAME: the name of the node pool.
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_REGION: the compute region of the cluster.
- COMPUTE_ZONE: the zone of your node pool.
- NODE_COUNT: the number of nodes for the node pool, which must be 18 nodes or less. We recommend using 18 nodes to obtain the GPU topology of 1x72 in one sub-block by using an NVLink domain.
- WORKLOAD_POLICY_NAME: the name of the workload policy that you created previously.
- RESERVATION_NAME: the name of your reservation. To find this value, see View future reservation requests.
- BLOCK_NAME: the name of a specific block within the reservation. To find this value, see View future reservation requests.
- SUB_BLOCK_NAME: the name of a specific sub-block within the reservation block.
This command automatically creates a network that connects all the A4X Max nodes within a single zone by using the auto accelerator network profile. When you create a node pool with the --accelerator-network-profile=auto flag, GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes. To schedule workloads on these nodes, you must include this label in your workload's nodeSelector field.
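For example, a workload that must run on these nodes can declare the label as a node selector. This fragment is a sketch; the label key and value mirror the default that GKE applies with --accelerator-network-profile=auto:

```yaml
# Pod spec fragment (sketch): pin the workload to nodes that GKE labeled
# when the node pool was created with --accelerator-network-profile=auto.
spec:
  nodeSelector:
    gke.networks.io/accelerator-network-profile: auto
```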
Configure the MRDMA NICs with asapd-lite
The asapd-lite DaemonSet configures the MRDMA NICs. An unhealthy asapd-lite
DaemonSet might indicate no RDMA connectivity.
Install the DaemonSet:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/asapd-lite-installer/asapd-lite-installer-a4x-max-bm-cos.yaml

Validate the replicas in the asapd-lite DaemonSet:

kubectl get daemonset -n kube-system asapd-lite

The output is similar to the following:

NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
asapd-lite   18        18        18      18           18          <none>          5m

The number of READY replicas should match the number of nodes that were created and are healthy in the node pool.
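To automate this health check, you could compare the DESIRED and READY columns of the DaemonSet status. The following sketch hard-codes a sample status line for illustration; in practice the line would come from kubectl get daemonset with the --no-headers flag:

```shell
# Illustrative check: compare the DESIRED and READY columns of the
# `kubectl get daemonset` output. The status line is hard-coded here.
status_line="asapd-lite   18   18   18   18   18   <none>   5m"
read -r _name desired _current ready _rest <<< "$status_line"
if [ "$desired" -eq "$ready" ]; then
  echo "asapd-lite healthy: $ready/$desired ready"
else
  echo "asapd-lite degraded: $ready/$desired ready"
fi
```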
Install the NVIDIA Compute Domain CRD and DRA driver
The following steps install the NVIDIA Compute Domain CRD and DRA driver to enable the use of MNNVL. For more information, see NVIDIA DRA Driver for GPUs.
Verify that you have Helm installed in your development environment. Helm comes pre-installed on Cloud Shell.
Although there is no specific Helm version requirement, you can use the following command to verify that you have Helm installed.
helm version

If the output is similar to Command helm not found, then you can install the Helm CLI:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
    && chmod 700 get_helm.sh \
    && ./get_helm.sh

Add the NVIDIA Helm repository:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

Create a ResourceQuota object for the DRA driver:

export POD_QUOTA=POD_QUOTA
kubectl create ns nvidia-dra-driver-gpu
kubectl apply -n nvidia-dra-driver-gpu -f - << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nvidia-dra-driver-gpu-quota
spec:
  hard:
    pods: ${POD_QUOTA}
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
EOF

Replace POD_QUOTA with a number that is at least 2 times the number of A4X Max nodes in the cluster, plus 1. For example, if you have 18 A4X Max nodes in your cluster, you must set the variable to at least 37.

Install the ComputeDomain CRD and DRA driver:
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --set controller.args.v=4 --set kubeletPlugin.args.v=4 \
    --version="25.8.0" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    -f <(cat <<EOF
nvidiaDriverRoot: /home/kubernetes/bin/nvidia
resources:
  gpus:
    enabled: false
controller:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu"
            operator: "DoesNotExist"
kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-gb300
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  - key: kubernetes.io/arch
    operator: Equal
    value: arm64
    effect: NoSchedule
EOF
)
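The POD_QUOTA sizing rule from the previous step (at least 2 times the node count, plus 1) can be computed rather than hand-calculated. This sketch uses an assumed node count of 18:

```shell
# Sketch: derive the minimum ResourceQuota pod count from the number of
# A4X Max nodes, following the "2 * nodes + 1" guidance above.
NODE_COUNT=18                       # assumed cluster size for illustration
POD_QUOTA=$(( 2 * NODE_COUNT + 1 ))
echo "POD_QUOTA=$POD_QUOTA"         # prints POD_QUOTA=37
```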
Configure your workload manifest for RDMA and IMEX domain
Add a node affinity rule to schedule the workload on Arm nodes:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64

Add the following volume to the Pod specification:
spec:
  volumes:
  - name: library-dir-host
    hostPath:
      path: /home/kubernetes/bin/nvidia

Add the following volume mounts, environment variable, and resource to the container that requests GPUs. Your workload container must request all four GPUs:
containers:
- name: my-container
  volumeMounts:
  - name: library-dir-host
    mountPath: /usr/local/nvidia
  env:
  - name: LD_LIBRARY_PATH
    value: /usr/local/nvidia/lib64
  resources:
    limits:
      nvidia.com/gpu: 4

Create the ComputeDomain resource for the workload:

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: a4x-max-compute-domain
spec:
  numNodes: NUM_NODES
  channel:
    resourceClaimTemplate:
      name: a4x-max-compute-domain-channel

Replace NUM_NODES with the number of nodes that the workload requires.

Create a ResourceClaimTemplate to allocate network resources by using DRANET and request RDMA devices for your Pod:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-mrdma
spec:
  spec:
    devices:
      requests:
      - name: req-mrdma
        exactly:
          deviceClassName: mrdma.google.com
          allocationMode: ExactCount
          count: 8

Specify the ResourceClaimTemplate that the Pod uses:
spec:
  ...
  volumes:
  ...
  containers:
  - name: my-container
    ...
    resources:
      limits:
        nvidia.com/gpu: 4
      claims:
      - name: compute-domain-channel
      - name: rdma
  ...
  resourceClaims:
  - name: compute-domain-channel
    resourceClaimTemplateName: a4x-max-compute-domain-channel
  - name: rdma
    resourceClaimTemplateName: all-mrdma

Ensure that the userspace libraries and the libnccl packages are installed in the user container image:
apt update -y
apt install -y curl
export DOCA_URL="https://linux.mellanox.com/public/repo/doca/3.1.0/ubuntu22.04/arm64-sbsa/"
BASE_URL=$([ "${DOCA_PREPUBLISH:-false}" = "true" ] && echo https://doca-repo-prod.nvidia.com/public/repo/doca || echo https://linux.mellanox.com/public/repo/doca)
DOCA_SUFFIX=${DOCA_URL#*public/repo/doca/}; DOCA_URL="$BASE_URL/$DOCA_SUFFIX"
curl $BASE_URL/GPG-KEY-Mellanox.pub | gpg --dearmor > /etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub
echo "deb [signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub] $DOCA_URL ./" > /etc/apt/sources.list.d/doca.list
apt update
apt -y install doca-ofed-userspace
# The installed libnccl2 version is 2.27.7; upgrade to the recommended 2.28.9:
apt install --only-upgrade --allow-change-held-packages -y libnccl2 libnccl-dev
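The URL manipulation in the snippet above can be opaque. This standalone sketch shows the same parameter expansion in isolation, using the values from the install snippet:

```shell
# Sketch of the parameter expansion used in the install snippet above:
# ${DOCA_URL#*public/repo/doca/} strips the shortest prefix matching
# "*public/repo/doca/", leaving the version/distro suffix to re-attach
# to whichever base URL was selected.
DOCA_URL="https://linux.mellanox.com/public/repo/doca/3.1.0/ubuntu22.04/arm64-sbsa/"
BASE_URL="https://linux.mellanox.com/public/repo/doca"
DOCA_SUFFIX=${DOCA_URL#*public/repo/doca/}
echo "$DOCA_SUFFIX"            # prints 3.1.0/ubuntu22.04/arm64-sbsa/
echo "$BASE_URL/$DOCA_SUFFIX"  # the reassembled repository URL
```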
A completed Pod specification looks like the following:
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
name: a4x-max-compute-domain
spec:
numNodes: NUM_NODES
channel:
resourceClaimTemplate:
name: a4x-max-compute-domain-channel
---
apiVersion: v1
kind: Pod
metadata:
name: my-pod
labels:
k8s-app: my-pod
spec:
...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/arch
operator: In
values:
- arm64
volumes:
- name: library-dir-host
hostPath:
path: /home/kubernetes/bin/nvidia
hostNetwork: true
containers:
- name: my-container
volumeMounts:
- name: library-dir-host
mountPath: /usr/local/nvidia
env:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
resources:
limits:
nvidia.com/gpu: 4
claims:
- name: compute-domain-channel
- name: rdma
...
resourceClaims:
- name: compute-domain-channel
resourceClaimTemplateName: a4x-max-compute-domain-channel
- name: rdma
resourceClaimTemplateName: all-mrdma
Test network performance
We recommend that you validate the functionality of provisioned clusters. To do so, use NCCL/gIB tests, which are NVIDIA Collective Communications Library (NCCL) tests that are optimized for the Google environment.
For more information, see Run NCCL on custom GKE clusters that use A4X Max.
What's next
- To learn about scheduling workloads on your GKE clusters by using TAS and Kueue, see Schedule GKE workloads with Topology Aware Scheduling.
- To learn about managing common events relevant to GKE
clusters and AI workloads, see Manage AI-optimized GKE
clusters.