Create a custom AI-optimized GKE cluster that uses A4X Max

This document shows you how to create an AI-optimized Google Kubernetes Engine (GKE) cluster that uses A4X Max Compute Engine instances to support your AI and ML workloads.

The A4X Max and A4X series let you run large-scale AI/ML clusters by using the NVIDIA Multi-Node NVLink (MNNVL) system, a rack-scale solution that enables higher GPU power and performance. These machines offer features such as targeted workload placement, topology-aware scheduling, and advanced cluster maintenance controls. For more information, see Cluster management capabilities. With A4X Max, GKE also provides an automated networking setup that simplifies cluster configuration.

AI and ML workloads, such as distributed training, require powerful acceleration to optimize performance by reducing job completion times. GKE provides a single platform surface to run a diverse set of workloads for your organization, reducing the operational burden of managing multiple platforms. You can run workloads such as high-performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. For workloads that require high performance, high throughput, and low latency, GPUDirect RDMA reduces the network hops that are required to transfer payloads to and from GPUs. This approach more efficiently uses the network bandwidth that's available. For more information, see GPU networking stacks.

In this document, you learn how to create a GKE cluster with the Google Cloud CLI for maximum flexibility in configuring your cluster based on the needs of your workload. To use the gcloud CLI to create clusters with other machine types, see the following:

Alternatively, you can choose to use Cluster Toolkit to quickly deploy your cluster with default settings that reflect best practices for many use cases. For more information, see Create an AI-optimized GKE cluster with default configuration.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.

Obtain capacity

You can obtain capacity for A4X Max compute instances by creating a future reservation. For more information about future reservations, see the Future reservations in AI Hypercomputer column in the table for Choose a consumption option.

To obtain capacity with a future reservation, see the Future reservations in AI Hypercomputer row in the table for How to obtain capacity.

Requirements

The following requirements apply to an AI-optimized GKE cluster with A4X Max compute instances:

  • For A4X Max, you must use one of the following versions:

    • For 1.35 or later, use GKE version 1.35.0-gke.2745000 or later.
    • For 1.34, use GKE version 1.34.3-gke.1318000 or later.

    These versions help to ensure that A4X Max uses the following:

    • R580.95.05, the minimum GPU driver version for A4X Max, which is installed by default.
    • Coherent Driver-based Memory Management (CDMM), which is enabled by default. NVIDIA recommends that Kubernetes clusters enable this mode to resolve memory over-reporting. CDMM allows GPU memory to be managed through the driver instead of the operating system (OS). This approach helps you to avoid OS onlining of GPU memory, and exposes the GPU memory as a Non-Uniform Memory Access (NUMA) node to the OS. Multi-instance GPUs aren't supported when CDMM is enabled. For more information about CDMM, see Hardware and Software Support.
    • GPUDirect RDMA and MNNVL, which are recommended to enable A4X Max node pools to use the networking capabilities of A4X Max.
  • The GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.

  • Your GKE workload must use all available GPUs and your Pod must use all available secondary NICs on a single GKE node. Multiple Pods cannot share RDMA on a single GKE node.

  • You must use the reservation-bound provisioning model to create clusters with A4X Max. Other provisioning models are not supported.

  • These instructions use DRANET to configure an AI-optimized GKE cluster with A4X Max. Multi-networking isn't supported for the a4x-maxgpu-4g-metal machine type.

Considerations for creating a cluster

When you create a cluster, consider the following information:

  • Choose a cluster location:
    • Verify that you use a location that has availability for the machine type that you choose. For more information, see Accelerator availability.
    • When you create node pools in a regional cluster, which is recommended for production workloads, you can use the --node-locations flag to specify the zones for your GKE nodes.
  • Choose a driver version:
    • The driver version can be one of the following values:
      • default: install the default driver version for your GKE node version. For more information about the requirements for default driver versions, see the Requirements section.
      • latest: install the latest available driver version for your GKE version. This option is available only for nodes that use Container-Optimized OS.
      • disabled: skip automatic driver installation. You must manually install a driver after you create the node pool.
    • For more information about the default and latest GPU driver versions for GKE node versions, see the table in the section Manually install NVIDIA GPU drivers.
  • Choose a reservation affinity:

    • You can find information about your reservation, such as the name of your reservation or the name of a specific block in your reservation. To find these values, see View future reservation requests.
    • The --reservation-affinity flag can take the values of specific or any. However, for high-performance distributed AI workloads, we recommend that you use a specific reservation.
    • When you use a specific reservation, including shared reservations, specify the value of the --reservation flag in the following format:

      projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME
      

      Replace the following values:

      • PROJECT_ID: your Google Cloud project ID.
      • RESERVATION_NAME: the name of your reservation.
      • BLOCK_NAME: the name of a specific block within the reservation.

      We also recommend that you use a sub-block targeted reservation so that compute instances are placed on a single sub-block within the BLOCK_NAME. Add the following to the end of the path:

      /reservationSubBlocks/SUB_BLOCK_NAME
      

      Replace SUB_BLOCK_NAME with the name of the sub-block.
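The fully qualified reservation path is easy to get wrong by hand. As a sketch, the following Python snippet assembles the value for the --reservation flag from its parts; the project, reservation, block, and sub-block names used in the example call are hypothetical placeholders.

```python
def reservation_path(project_id, reservation, block, sub_block=None):
    """Build the --reservation flag value, optionally sub-block targeted."""
    path = (f"projects/{project_id}/reservations/{reservation}"
            f"/reservationBlocks/{block}")
    if sub_block:
        # Sub-block targeting places instances on a single sub-block.
        path += f"/reservationSubBlocks/{sub_block}"
    return path

# Hypothetical names, for illustration only.
print(reservation_path("my-project", "my-reservation", "my-block", "my-sub-block"))
```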

Create an AI-optimized GKE cluster that uses A4X Max and GPUDirect RDMA

For distributed AI workloads, multiple GPU nodes are often linked together to work as a single computer. A4X Max is an exascale platform based on NVIDIA GB300 NVL72 rack-scale architecture. A4X Max compute instances use a multi-layered, hierarchical networking architecture with a rail-aligned design to optimize performance for various communication types. This machine type enables scaling and collaboration across multiple GPUs by delivering a high-performance cloud experience for AI workloads. For more information about the network architecture for A4X Max, including the network bandwidth and NIC arrangement, see A4X Max machine type (bare metal).

To create a GKE Standard cluster with A4X Max that uses GPUDirect RDMA and MNNVL, complete the steps that are described in the following sections:

  1. Create the GKE cluster
  2. Create a workload policy
  3. Create a node pool with A4X Max
  4. Configure the MRDMA NICs with asapd-lite
  5. Install the NVIDIA Compute Domain CRD and DRA driver
  6. Configure your workload manifest for RDMA and IMEX domain

These instructions use accelerator network profiles to automatically configure VPC networks and subnets for your A4X Max nodes. Alternatively, you can explicitly specify your VPC network and subnets.

Create the GKE cluster

  1. Create a GKE Standard cluster:

    gcloud container clusters create CLUSTER_NAME \
      --enable-dataplane-v2 \
      --enable-ip-alias \
      --location=COMPUTE_REGION \
      --cluster-version=CLUSTER_VERSION \
      --no-enable-shielded-nodes [\
      --services-ipv4-cidr=SERVICE_CIDR \
      --cluster-ipv4-cidr=POD_CIDR \
      --addons=GcpFilestoreCsiDriver=ENABLED]
    

    Replace the following:

    • CLUSTER_NAME: the name of your cluster.
    • CLUSTER_VERSION: the version of your new cluster. For more information about which version of GKE supports your configuration, see the Requirements in this document.
    • COMPUTE_REGION: the name of the compute region.
    • Optionally, you can explicitly provide the secondary CIDR ranges for services and Pods. If you use these optional flags, then replace the following variables:

      • SERVICE_CIDR: the secondary CIDR range for services.
      • POD_CIDR: the secondary CIDR range for Pods.

      When you use these flags, you must verify that the CIDR ranges don't overlap with subnet ranges for additional node networks. For example, consider SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19. For more information, see Adding Pod IPv4 address ranges.

  2. To run the kubectl commands in the next sections, connect to your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION
    

    Replace the following:

    • CLUSTER_NAME: the name of your cluster.
    • COMPUTE_REGION: the name of the compute region.

    For more information, see Install kubectl and configure cluster access.
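If you set the optional --services-ipv4-cidr and --cluster-ipv4-cidr flags in step 1, the ranges must not overlap with each other or with subnet ranges for additional node networks. As a sketch, you can verify this with Python's standard ipaddress module; the node subnet below is a hypothetical example, not a value from this document.

```python
import ipaddress

# Example values from step 1, plus a hypothetical node-network subnet.
service_cidr = ipaddress.ip_network("10.65.0.0/19")
pod_cidr = ipaddress.ip_network("10.64.0.0/19")
node_subnet = ipaddress.ip_network("192.168.0.0/24")  # hypothetical

ranges = [service_cidr, pod_cidr, node_subnet]
for i, a in enumerate(ranges):
    for b in ranges[i + 1:]:
        # overlaps() is True if the two networks share any addresses.
        assert not a.overlaps(b), f"{a} overlaps {b}"
print("no overlapping ranges")
```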

Create a workload policy

A workload policy is required to create a partition. For more information, see Workload policy for MIGs.

Create a HIGH_THROUGHPUT workload policy with the --accelerator-topology flag set to 1x72.

gcloud beta compute resource-policies create workload-policy WORKLOAD_POLICY_NAME \
    --type HIGH_THROUGHPUT \
    --accelerator-topology 1x72 \
    --project PROJECT \
    --region COMPUTE_REGION

Replace the following:

  • WORKLOAD_POLICY_NAME: the name of your workload policy.
  • PROJECT: the name of your project.
  • COMPUTE_REGION: the name of the compute region.

Create a node pool with A4X Max

  1. Create the following configuration file to pre-allocate hugepages with the node pool:

    cat > node_custom.yaml <<EOF
    linuxConfig:
      hugepageConfig:
        hugepage_size2m: 4096
    EOF
    
    export NODE_CUSTOM=node_custom.yaml
    
  2. Create an A4X Max node pool:

    gcloud container node-pools create NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=COMPUTE_REGION \
        --node-locations=COMPUTE_ZONE \
        --num-nodes=NODE_COUNT \
        --placement-policy=WORKLOAD_POLICY_NAME \
        --machine-type=a4x-maxgpu-4g-metal \
        --accelerator=type=nvidia-gb300,count=4,gpu-driver-version=latest \
        --system-config-from-file=${NODE_CUSTOM} \
        --accelerator-network-profile=auto \
        --node-labels=cloud.google.com/gke-networking-dra-driver=true,cloud.google.com/gke-dpv2-unified-cni=cni-migration \
        --reservation-affinity=specific \
        --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME/reservationSubBlocks/SUB_BLOCK_NAME
    

    Replace the following:

    • NODE_POOL_NAME: the name of the node pool.
    • CLUSTER_NAME: the name of your cluster.
    • COMPUTE_REGION: the compute region of the cluster.
    • COMPUTE_ZONE: the zone of your node pool.
    • NODE_COUNT: the number of nodes for the node pool, which must be 18 nodes or fewer. We recommend using 18 nodes to obtain the 1x72 GPU topology in one sub-block by using an NVLink domain.
    • WORKLOAD_POLICY_NAME: the name of the workload policy you created previously.
    • RESERVATION_NAME: the name of your reservation. To find this value, see View future reservation requests.
    • BLOCK_NAME: the name of a specific block within the reservation. To find this value, see View future reservation requests.

    This command automatically creates a network that connects all the A4X Max nodes within a single zone by using the auto accelerator network profile. When you create a node pool with the --accelerator-network-profile=auto flag, GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes. To schedule workloads on these nodes, you must include this label in your workload's nodeSelector field.
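The 18-node recommendation follows from the 1x72 accelerator topology: one NVLink domain contains 72 GPUs, and each a4x-maxgpu-4g-metal node has 4 GPUs. A quick arithmetic check:

```python
GPUS_PER_NODE = 4          # a4x-maxgpu-4g-metal has 4 GPUs per node
NVLINK_DOMAIN_GPUS = 72    # from the 1x72 accelerator topology

# A full sub-block NVLink domain spans 72 / 4 = 18 nodes, which is
# the recommended (and maximum) node pool size for this topology.
max_nodes = NVLINK_DOMAIN_GPUS // GPUS_PER_NODE
print(max_nodes)  # 18
```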

Configure the MRDMA NICs with asapd-lite

The asapd-lite DaemonSet configures the MRDMA NICs. If the asapd-lite DaemonSet is unhealthy, your nodes might not have RDMA connectivity.

  1. Install the DaemonSet:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/asapd-lite-installer/asapd-lite-installer-a4x-max-bm-cos.yaml
    
  2. Validate the replicas in the asapd-lite DaemonSet:

    kubectl get daemonset -n kube-system asapd-lite
    

    The output is similar to the following:

    NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    asapd-lite   18        18        18      18           18          <none>          5m
    

    The number of READY replicas should match the number of nodes that were created and are healthy in the node pool.
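As a sketch, you can also check readiness programmatically by parsing the kubectl output and comparing the READY and DESIRED columns; the sample string below mirrors the output shown above.

```python
# Sample output from `kubectl get daemonset -n kube-system asapd-lite`.
output = """\
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
asapd-lite   18        18        18      18           18          <none>          5m"""

# Columns: NAME, DESIRED, CURRENT, READY, ...
fields = output.splitlines()[1].split()
name, desired, ready = fields[0], int(fields[1]), int(fields[3])
assert ready == desired, f"{name}: only {ready}/{desired} replicas ready"
print(f"{name}: {ready}/{desired} replicas ready")
```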

Install the NVIDIA Compute Domain CRD and DRA driver

The following steps install the NVIDIA Compute Domain CRD and DRA driver to enable the use of MNNVL. For more information, see NVIDIA DRA Driver for GPUs.

  1. Verify that you have Helm installed in your development environment. Helm comes pre-installed on Cloud Shell.

    Although there is no specific Helm version requirement, you can use the following command to verify that you have Helm installed.

    helm version
    

    If the output is similar to Command helm not found, then you can install the Helm CLI:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
    
  2. Add the NVIDIA Helm repository:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
      && helm repo update
    
  3. Create a ResourceQuota object for the DRA Driver:

    export POD_QUOTA=POD_QUOTA
    
    kubectl create ns nvidia-dra-driver-gpu
    
    kubectl apply -n nvidia-dra-driver-gpu -f - << EOF
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: nvidia-dra-driver-gpu-quota
    spec:
      hard:
        pods: ${POD_QUOTA}
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
            - system-node-critical
            - system-cluster-critical
    EOF
    

    Replace POD_QUOTA with a number at least 2 times the number of A4X Max nodes in the cluster plus 1. For example, you must set the variable to at least 37 if you have 18 A4X Max nodes in your cluster.

  4. Install the ComputeDomain CRD and DRA driver:

    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
        --set controller.args.v=4 --set kubeletPlugin.args.v=4 \
        --version="25.8.0" \
        --create-namespace \
        --namespace nvidia-dra-driver-gpu \
        -f <(cat <<EOF
    nvidiaDriverRoot: /home/kubernetes/bin/nvidia
    resources:
      gpus:
        enabled: false
    
    controller:
      affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: "nvidia.com/gpu"
                  operator: "DoesNotExist"
    
    kubeletPlugin:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: In
                    values:
                      - nvidia-gb300
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
    
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
        - key: kubernetes.io/arch
          operator: Equal
          value: arm64
          effect: NoSchedule
    EOF
    )
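The ResourceQuota sizing rule from step 3 (POD_QUOTA must be at least twice the number of A4X Max nodes, plus 1) can be sketched as a small helper; the function name is illustrative.

```python
def min_pod_quota(node_count):
    # Per the rule in step 3: at least (2 * nodes) + 1 Pods.
    return 2 * node_count + 1

print(min_pod_quota(18))  # 37, matching the example in step 3
```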
    

Configure your workload manifest for RDMA and IMEX domain

  1. Add a node affinity rule to schedule the workload on Arm nodes:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
    
  2. Add the following volume to the Pod specification:

    spec:
      volumes:
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
    
  3. Add the following volume mounts, environment variable, and resource to the container that requests GPUs. Your workload container must request all four GPUs:

    containers:
      - name: my-container
        volumeMounts:
          - name: library-dir-host
            mountPath: /usr/local/nvidia
    
        env:
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
        resources:
          limits:
            nvidia.com/gpu: 4
    
  4. Create the ComputeDomain resource for the workload:

    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: a4x-max-compute-domain
    spec:
      numNodes: NUM_NODES
      channel:
        resourceClaimTemplate:
          name: a4x-max-compute-domain-channel
    

    Replace NUM_NODES with the number of nodes the workload requires.

  5. Create a ResourceClaimTemplate to allocate network resources by using DRANET and request RDMA devices for your Pod:

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: all-mrdma
    spec:
      spec:
        devices:
          requests:
          - name: req-mrdma
            exactly:
              deviceClassName: mrdma.google.com
              allocationMode: ExactCount
              count: 8
    
  6. Specify the ResourceClaimTemplate that the Pod uses:

    spec:
      ...
      volumes:
        ...
      containers:
        - name: my-container
          ...
          resources:
            limits:
              nvidia.com/gpu: 4
            claims:
              - name: compute-domain-channel
              - name: rdma
          ...
      resourceClaims:
        - name: compute-domain-channel
          resourceClaimTemplateName: a4x-max-compute-domain-channel
        - name: rdma
          resourceClaimTemplateName: all-mrdma
    
  7. Ensure that the userspace libraries and the libnccl packages are installed in the user container image:

    apt update -y
    apt install -y curl
    export DOCA_URL="https://linux.mellanox.com/public/repo/doca/3.1.0/ubuntu22.04/arm64-sbsa/"
    BASE_URL=$([ "${DOCA_PREPUBLISH:-false}" = "true" ] && echo https://doca-repo-prod.nvidia.com/public/repo/doca || echo https://linux.mellanox.com/public/repo/doca)
    DOCA_SUFFIX=${DOCA_URL#*public/repo/doca/}; DOCA_URL="$BASE_URL/$DOCA_SUFFIX"
    curl $BASE_URL/GPG-KEY-Mellanox.pub | gpg --dearmor > /etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub
    echo "deb [signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub] $DOCA_URL ./" > /etc/apt/sources.list.d/doca.list
    apt update
    apt -y install doca-ofed-userspace
    # The installed libnccl2 version is 2.27.7. Upgrade to the recommended version 2.28.9:
    apt install --only-upgrade --allow-change-held-packages -y libnccl2 libnccl-dev
    

A completed Pod specification looks like the following:

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: a4x-max-compute-domain
spec:
  numNodes: NUM_NODES
  channel:
    resourceClaimTemplate:
      name: a4x-max-compute-domain-channel
---
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    k8s-app: my-pod
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
  volumes:
    - name: library-dir-host
      hostPath:
        path: /home/kubernetes/bin/nvidia
  hostNetwork: true
  containers:
    - name: my-container
      volumeMounts:
        - name: library-dir-host
          mountPath: /usr/local/nvidia
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
      resources:
        limits:
          nvidia.com/gpu: 4
        claims:
          - name: compute-domain-channel
          - name: rdma
        ...
  resourceClaims:
    - name: compute-domain-channel
      resourceClaimTemplateName: a4x-max-compute-domain-channel
    - name: rdma
      resourceClaimTemplateName: all-mrdma

Test network performance

We recommend that you validate the functionality of provisioned clusters. To do so, use NCCL/gIB tests, which are NVIDIA Collective Communications Library (NCCL) tests that are optimized for the Google environment.

For more information, see Run NCCL on custom GKE clusters that use A4X Max.

What's next