Create a custom AI-optimized GKE cluster that uses A4X Max

This document shows you how to create an AI-optimized Google Kubernetes Engine (GKE) cluster that uses A4X Max Compute Engine instances to support your AI and ML workloads.

The A4X Max and A4X series let you run large-scale AI/ML clusters by using the NVIDIA Multi-Node NVLink (MNNVL) system, a rack-scale solution that enables higher GPU power and performance. These machines offer features such as targeted workload placement, topology-aware scheduling, and advanced cluster maintenance controls. For more information, see Cluster management capabilities. With A4X Max, GKE also provides an automated networking setup that simplifies cluster configuration.

AI and ML workloads, such as distributed training, require powerful acceleration to optimize performance by reducing job completion times. GKE provides a single platform surface to run a diverse set of workloads for your organization, reducing the operational burden of managing multiple platforms. You can run workloads such as high-performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. For workloads that require high performance, high throughput, and low latency, GPUDirect RDMA reduces the network hops that are required to transfer payloads to and from GPUs. This approach more efficiently uses the network bandwidth that's available. For more information, see GPU networking stacks.

In this document, you learn how to create a GKE cluster with the Google Cloud CLI for maximum flexibility in configuring your cluster based on the needs of your workload. To use the gcloud CLI to create clusters with other machine types, see the following:

Alternatively, you can choose to use Cluster Toolkit to quickly deploy your cluster with default settings that reflect best practices for many use cases. For more information, see Create an AI-optimized GKE cluster with default configuration.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.

Obtain capacity

You can obtain capacity for A4X Max compute instances by creating a future reservation. For more information about future reservations, see the Future reservations in AI Hypercomputer column in the table for Choose a consumption option.

To obtain capacity with a future reservation, see the Future reservations in AI Hypercomputer row in the table for How to obtain capacity.

Requirements

The following requirements apply to an AI-optimized GKE cluster with A4X Max compute instances:

  • For A4X Max, you must use one of the following versions:

    • For 1.35 or later, use GKE version 1.35.0-gke.2745000 or later.
    • For 1.34, use GKE version 1.34.3-gke.1318000 or later.

    These versions help to ensure that A4X Max uses the following:

    • R580.95.05, the minimum GPU driver version for A4X Max, which is installed by default.
    • Coherent Driver-based Memory Management (CDMM), which is enabled by default. NVIDIA recommends that Kubernetes clusters enable this mode to resolve memory over-reporting. CDMM allows GPU memory to be managed through the driver instead of the operating system (OS). This approach helps you to avoid OS onlining of GPU memory, and exposes the GPU memory as a Non-Uniform Memory Access (NUMA) node to the OS. Multi-instance GPUs aren't supported when CDMM is enabled. For more information about CDMM, see Hardware and Software Support.
    • GPUDirect RDMA and MNNVL, which are recommended to enable A4X Max node pools to use the networking capabilities of A4X Max.
  • The GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.

  • Your GKE workload must use all available GPUs and your Pod must use all available secondary NICs on a single GKE node. Multiple Pods cannot share RDMA on a single GKE node.

  • You must use the reservation-bound provisioning model to create clusters with A4X Max. Other provisioning models are not supported.

  • These instructions use DRANET to configure an AI-optimized GKE cluster with A4X Max. Multi-networking isn't supported for the a4x-maxgpu-4g-metal machine type.

Considerations for creating a cluster

When you create a cluster, consider the following information:

  • Choose a cluster location:
    • Verify that you use a location that has availability for the machine type that you choose. For more information, see Accelerator availability.
    • When you create node pools in a regional cluster, which is recommended for production workloads, you can use the --node-locations flag to specify the zones for your GKE nodes.
  • Choose a driver version:
    • The driver version can be one of the following values:
      • default: install the default driver version for your GKE node version. For more information about the requirements for default driver versions, see the Requirements section.
      • latest: install the latest available driver version for your GKE version. This option is available only for nodes that use Container-Optimized OS.
      • disabled: skip automatic driver installation. You must manually install a driver after you create the node pool.
    • For more information about the default and latest GPU driver versions for GKE node versions, see the table in the section Manually install NVIDIA GPU drivers.
  • Choose a reservation affinity:

    • You can find information about your reservation, such as the name of your reservation or the name of a specific block in your reservation. To find these values, see View future reservation requests.
    • The --reservation-affinity flag can take the values of specific or any. However, for high-performance distributed AI workloads, we recommend that you use a specific reservation.
    • When you use a specific reservation, including shared reservations, specify the value of the --reservation flag in the following format:

      projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME
      

      Replace the following values:

      • PROJECT_ID: your Google Cloud project ID.
      • RESERVATION_NAME: the name of your reservation.
      • BLOCK_NAME: the name of a specific block within the reservation.

      We also recommend that you use a sub-block targeted reservation so that compute instances are placed on a single sub-block within the BLOCK_NAME. Add the following to the end of the path:

      /reservationSubBlocks/SUB_BLOCK_NAME
      

      Replace SUB_BLOCK_NAME with the name of the sub-block.
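The fully qualified reservation path is easy to get wrong by hand. As a sketch, the following Python snippet assembles the value for the --reservation flag from its parts; the project, reservation, block, and sub-block names used in the example call are hypothetical placeholders.

```python
def reservation_path(project_id, reservation, block, sub_block=None):
    """Build the --reservation flag value, optionally sub-block targeted."""
    path = (f"projects/{project_id}/reservations/{reservation}"
            f"/reservationBlocks/{block}")
    if sub_block:
        # Sub-block targeting places instances on a single sub-block.
        path += f"/reservationSubBlocks/{sub_block}"
    return path

# Hypothetical names, for illustration only.
print(reservation_path("my-project", "my-reservation", "my-block", "my-sub-block"))
```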

Create an AI-optimized GKE cluster that uses A4X Max and GPUDirect RDMA

For distributed AI workloads, multiple GPU nodes are often linked together to work as a single computer. A4X Max is an exascale platform based on NVIDIA GB300 NVL72 rack-scale architecture. A4X Max compute instances use a multi-layered, hierarchical networking architecture with a rail-aligned design to optimize performance for various communication types. This machine type enables scaling and collaboration across multiple GPUs by delivering a high-performance cloud experience for AI workloads. For more information about the network architecture for A4X Max, including the network bandwidth and NIC arrangement, see A4X Max machine type (bare metal).

To create a GKE Standard cluster with A4X Max that uses GPUDirect RDMA and MNNVL, complete the steps that are described in the following sections:

  1. Create the GKE cluster
  2. Create a workload policy
  3. Create a node pool with A4X Max
  4. Configure the MRDMA NICs with asapd-lite
  5. Install the NVIDIA Compute Domain CRD and DRA driver
  6. Configure your workload manifest for RDMA and IMEX domain

These instructions use accelerator network profiles to automatically configure VPC networks and subnets for your A4X Max nodes. Alternatively, you can explicitly specify your VPC network and subnets.

Create the GKE cluster

  1. Create a GKE Standard cluster:

    gcloud container clusters create CLUSTER_NAME \
      --enable-dataplane-v2 \
      --enable-ip-alias \
      --location=COMPUTE_REGION \
      --cluster-version=CLUSTER_VERSION \
      --no-enable-shielded-nodes [\
      --services-ipv4-cidr=SERVICE_CIDR \
      --cluster-ipv4-cidr=POD_CIDR \
      --addons=GcpFilestoreCsiDriver=ENABLED]
    

    Replace the following:

    • CLUSTER_NAME: the name of your cluster.
    • CLUSTER_VERSION: the version of your new cluster. For more information about which version of GKE supports your configuration, see the Requirements in this document.
    • COMPUTE_REGION: the name of the compute region.
    • Optionally, you can explicitly provide the secondary CIDR ranges for services and Pods. If you use these optional flags, then replace the following variables:

      • SERVICE_CIDR: the secondary CIDR range for services.
      • POD_CIDR: the secondary CIDR range for Pods.

      When you use these flags, you must verify that the CIDR ranges don't overlap with subnet ranges for additional node networks. For example, consider SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19. For more information, see Adding Pod IPv4 address ranges.

  2. To run the kubectl commands in the next sections, connect to your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME --location=COMPUTE_REGION
    

    Replace the following:

    • CLUSTER_NAME: the name of your cluster.
    • COMPUTE_REGION: the name of the compute region.

    For more information, see Install kubectl and configure cluster access.
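If you set the optional --services-ipv4-cidr and --cluster-ipv4-cidr flags in step 1, the ranges must not overlap with each other or with subnet ranges for additional node networks. As a sketch, you can verify this with Python's standard ipaddress module; the node subnet below is a hypothetical example, not a value from this document.

```python
import ipaddress

# Example values from step 1, plus a hypothetical node-network subnet.
service_cidr = ipaddress.ip_network("10.65.0.0/19")
pod_cidr = ipaddress.ip_network("10.64.0.0/19")
node_subnet = ipaddress.ip_network("192.168.0.0/24")  # hypothetical

ranges = [service_cidr, pod_cidr, node_subnet]
for i, a in enumerate(ranges):
    for b in ranges[i + 1:]:
        # overlaps() is True if the two networks share any addresses.
        assert not a.overlaps(b), f"{a} overlaps {b}"
print("no overlapping ranges")
```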

Create a workload policy

A workload policy is required to create a partition. For more information, see Workload policy for MIGs.

Create a HIGH_THROUGHPUT workload policy with the --accelerator-topology flag set to 1x72.

gcloud beta compute resource-policies create workload-policy WORKLOAD_POLICY_NAME \
    --type HIGH_THROUGHPUT \
    --accelerator-topology 1x72 \
    --project PROJECT \
    --region COMPUTE_REGION

Replace the following:

  • WORKLOAD_POLICY_NAME: the name of your workload policy.
  • PROJECT: the name of your project.
  • COMPUTE_REGION: the name of the compute region.

Create a node pool with A4X Max

  1. Create the following configuration file to pre-allocate hugepages with the node pool:

    cat > node_custom.yaml <<EOF
    linuxConfig:
      hugepageConfig:
        hugepage_size2m: 4096
    EOF
    
    export NODE_CUSTOM=node_custom.yaml
    
  2. Create an A4X Max node pool:

    gcloud container node-pools create NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=COMPUTE_REGION \
        --node-locations=COMPUTE_ZONE \
        --num-nodes=NODE_COUNT \
        --placement-policy=WORKLOAD_POLICY_NAME \
        --machine-type=a4x-maxgpu-4g-metal \
        --accelerator=type=nvidia-gb300,count=4,gpu-driver-version=latest \
        --system-config-from-file=${NODE_CUSTOM} \
        --accelerator-network-profile=auto \
        --node-labels=cloud.google.com/gke-networking-dra-driver=true,cloud.google.com/gke-dpv2-unified-cni=cni-migration \
        --reservation-affinity=specific \
        --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME/reservationSubBlocks/SUB_BLOCK_NAME
    

    Replace the following:

    • NODE_POOL_NAME: the name of the node pool.
    • CLUSTER_NAME: the name of your cluster.
    • COMPUTE_REGION: the compute region of the cluster.
    • COMPUTE_ZONE: the zone of your node pool.
    • NODE_COUNT: the number of nodes for the node pool, which must be 18 nodes or fewer. We recommend using 18 nodes to obtain the 1x72 GPU topology in one sub-block by using an NVLink domain.
    • WORKLOAD_POLICY_NAME: the name of the workload policy you created previously.
    • RESERVATION_NAME: the name of your reservation. To find this value, see View future reservation requests.
    • BLOCK_NAME: the name of a specific block within the reservation. To find this value, see View future reservation requests.

    This command automatically creates a network that connects all the A4X Max nodes within a single zone by using the auto accelerator network profile. When you create a node pool with the --accelerator-network-profile=auto flag, GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes. To schedule workloads on these nodes, you must include this label in your workload's nodeSelector field.
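The 18-node recommendation follows from the 1x72 accelerator topology: one NVLink domain contains 72 GPUs, and each a4x-maxgpu-4g-metal node has 4 GPUs. A quick arithmetic check:

```python
GPUS_PER_NODE = 4          # a4x-maxgpu-4g-metal has 4 GPUs per node
NVLINK_DOMAIN_GPUS = 72    # from the 1x72 accelerator topology

# A full sub-block NVLink domain spans 72 / 4 = 18 nodes, which is
# the recommended (and maximum) node pool size for this topology.
max_nodes = NVLINK_DOMAIN_GPUS // GPUS_PER_NODE
print(max_nodes)  # 18
```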

Configure the MRDMA NICs with asapd-lite

The asapd-lite DaemonSet configures the MRDMA NICs. If the asapd-lite DaemonSet is unhealthy, your nodes might not have RDMA connectivity.

  1. Install the DaemonSet:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/asapd-lite-installer/asapd-lite-installer-a4x-max-bm-cos.yaml
    
  2. Validate the replicas in the asapd-lite DaemonSet:

    kubectl get daemonset -n kube-system asapd-lite
    

    The output is similar to the following:

    NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    asapd-lite   18        18        18      18           18          <none>          5m
    

    The number of READY replicas should match the number of nodes that were created and are healthy in the node pool.
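As a sketch, you can also check readiness programmatically by parsing the kubectl output and comparing the READY and DESIRED columns; the sample string below mirrors the output shown above.

```python
# Sample output from `kubectl get daemonset -n kube-system asapd-lite`.
output = """\
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
asapd-lite   18        18        18      18           18          <none>          5m"""

# Columns: NAME, DESIRED, CURRENT, READY, ...
fields = output.splitlines()[1].split()
name, desired, ready = fields[0], int(fields[1]), int(fields[3])
assert ready == desired, f"{name}: only {ready}/{desired} replicas ready"
print(f"{name}: {ready}/{desired} replicas ready")
```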

Install the NVIDIA Compute Domain CRD and DRA driver

The following steps install the NVIDIA Compute Domain CRD and DRA driver to enable the use of MNNVL. For more information, see NVIDIA DRA Driver for GPUs.

  1. Verify that you have Helm installed in your development environment. Helm comes pre-installed on Cloud Shell.

    Although there is no specific Helm version requirement, you can use the following command to verify that you have Helm installed.

    helm version
    

    If the output is similar to Command helm not found, then you can install the Helm CLI:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
      && chmod 700 get_helm.sh \
      && ./get_helm.sh
    
  2. Add the NVIDIA Helm repository:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
      && helm repo update
    
  3. Create a ResourceQuota object for the DRA Driver:

    export POD_QUOTA=POD_QUOTA
    
    kubectl create ns nvidia-dra-driver-gpu
    
    kubectl apply -n nvidia-dra-driver-gpu -f - << EOF
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: nvidia-dra-driver-gpu-quota
    spec:
      hard:
        pods: ${POD_QUOTA}
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
            - system-node-critical
            - system-cluster-critical
    EOF
    

    Replace POD_QUOTA with a number at least 2 times the number of A4X Max nodes in the cluster plus 1. For example, you must set the variable to at least 37 if you have 18 A4X Max nodes in your cluster.

  4. Install the ComputeDomain CRD and DRA driver:

    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
        --set controller.args.v=4 --set kubeletPlugin.args.v=4 \
        --version="25.8.0" \
        --create-namespace \
        --namespace nvidia-dra-driver-gpu \
        -f <(cat <<EOF
    nvidiaDriverRoot: /home/kubernetes/bin/nvidia
    resources:
      gpus:
        enabled: false
    
    controller:
      affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: "nvidia.com/gpu"
                  operator: "DoesNotExist"
    
    kubeletPlugin:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: In
                    values:
                      - nvidia-gb300
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
    
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
        - key: kubernetes.io/arch
          operator: Equal
          value: arm64
          effect: NoSchedule
    EOF
    )
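The ResourceQuota sizing rule from step 3 (POD_QUOTA must be at least twice the number of A4X Max nodes, plus 1) can be sketched as a small helper; the function name is illustrative.

```python
def min_pod_quota(node_count):
    # Per the rule in step 3: at least (2 * nodes) + 1 Pods.
    return 2 * node_count + 1

print(min_pod_quota(18))  # 37, matching the example in step 3
```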
    

Configure your workload manifest for RDMA and IMEX domain

  1. Add a node affinity rule to schedule the workload on Arm nodes:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
    
  2. Add the following volume to the Pod specification:

    spec:
      volumes:
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
    
  3. Add the following volume mounts, environment variable, and resource to the container that requests GPUs. Your workload container must request all four GPUs:

    containers:
      - name: my-container
        volumeMounts:
          - name: library-dir-host
            mountPath: /usr/local/nvidia
    
        env:
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
        resources:
          limits:
            nvidia.com/gpu: 4
    
  4. Create the ComputeDomain resource for the workload:

    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: a4x-max-compute-domain
    spec:
      numNodes: NUM_NODES
      channel:
        resourceClaimTemplate:
          name: a4x-max-compute-domain-channel
    

    Replace NUM_NODES with the number of nodes the workload requires.

  5. Create a ResourceClaimTemplate to allocate network resources by using DRANET and request RDMA devices for your Pod:

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: all-mrdma
    spec:
      spec:
        devices:
          requests:
          - name: req-mrdma
            exactly:
              deviceClassName: mrdma.google.com
              allocationMode: ExactCount
              count: 8
    
  6. Specify the ResourceClaimTemplate that the Pod uses:

    spec:
      ...
      volumes:
        ...
      containers:
        - name: my-container
          ...
          resources:
            limits:
              nvidia.com/gpu: 4
            claims:
              - name: compute-domain-channel
              - name: rdma
          ...
      resourceClaims:
        - name: compute-domain-channel
          resourceClaimTemplateName: a4x-max-compute-domain-channel
        - name: rdma
          resourceClaimTemplateName: all-mrdma
    
  7. Ensure that the userspace libraries and the libnccl packages are installed in the user container image:

    apt update -y
    apt install -y curl
    export DOCA_URL="https://linux.mellanox.com/public/repo/doca/3.1.0/ubuntu22.04/arm64-sbsa/"
    BASE_URL=$([ "${DOCA_PREPUBLISH:-false}" = "true" ] && echo https://doca-repo-prod.nvidia.com/public/repo/doca || echo https://linux.mellanox.com/public/repo/doca)
    DOCA_SUFFIX=${DOCA_URL#*public/repo/doca/}; DOCA_URL="$BASE_URL/$DOCA_SUFFIX"
    curl $BASE_URL/GPG-KEY-Mellanox.pub | gpg --dearmor > /etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub
    echo "deb [signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub] $DOCA_URL ./" > /etc/apt/sources.list.d/doca.list
    apt update
    apt -y install doca-ofed-userspace
    # The installed libnccl2 version is 2.27.7. Upgrade to the recommended version 2.28.9:
    apt install --only-upgrade --allow-change-held-packages -y libnccl2 libnccl-dev
    

A completed Pod specification looks like the following:

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: a4x-max-compute-domain
spec:
  numNodes: NUM_NODES
  channel:
    resourceClaimTemplate:
      name: a4x-max-compute-domain-channel
---
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    k8s-app: my-pod
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
  volumes:
    - name: library-dir-host
      hostPath:
        path: /home/kubernetes/bin/nvidia
  hostNetwork: true
  containers:
    - name: my-container
      volumeMounts:
        - name: library-dir-host
          mountPath: /usr/local/nvidia
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
      resources:
        limits:
          nvidia.com/gpu: 4
        claims:
          - name: compute-domain-channel
          - name: rdma
        ...
  resourceClaims:
    - name: compute-domain-channel
      resourceClaimTemplateName: a4x-max-compute-domain-channel
    - name: rdma
      resourceClaimTemplateName: all-mrdma

Test network performance

We recommend that you validate the functionality of provisioned clusters. To do so, use NCCL/gIB tests, which are NVIDIA Collective Communications Library (NCCL) tests that are optimized for the Google environment.

For more information, see Run NCCL on custom GKE clusters that use A4X Max.

What's next