Prepare GKE infrastructure for DRA workloads

This document explains how to manually set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). The setup steps include creating node pools that use GPUs and installing DRA drivers.

This document is intended for platform administrators who want to create infrastructure with specialized hardware devices that application operators can claim in workloads.

Limitations

The following limitations apply:

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.

Create a GKE node pool with GPUs

This section shows you how to create a GPU node pool and install the corresponding DRA drivers. The steps in this section apply only to node pools that you manually create. To create a GPU node pool that supports DRA, you must do the following:

  • Disable automatic GPU driver installation: specify the gpu-driver-version=disabled option in the --accelerator flag.
  • Disable the GPU device plugin: add the gke-no-default-nvidia-gpu-device-plugin=true node label to the node pool.
  • Allow the DRA driver DaemonSet to schedule on the nodes: add the nvidia.com/gpu.present=true node label to the node pool.
  • Configure autoscaling: to use the cluster autoscaler in your node pool, add the cloud.google.com/gke-nvidia-gpu-dra-driver=true node label to the node pool. The cluster autoscaler uses this node label to identify nodes that run the DRA driver for GPUs.

To create and configure GPU node pools, follow these steps:

  1. Create a GPU node pool. The following example commands create node pools with different configurations:

    • Create a node pool with a g2-standard-24 instance that has two L4 GPUs:

      gcloud container node-pools create NODEPOOL_NAME \
          --cluster=CLUSTER_NAME \
          --location=CONTROL_PLANE_LOCATION \
          --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
          --machine-type="g2-standard-24" \
          --accelerator="type=nvidia-l4,count=2,gpu-driver-version=disabled" \
          --num-nodes="1" \
          --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true
      

      Replace the following:

      • NODEPOOL_NAME: a name for your node pool.
      • CLUSTER_NAME: the name of your cluster.
      • CONTROL_PLANE_LOCATION: the region or zone of the cluster control plane, such as us-central1 or us-central1-a.
      • NODE_LOCATION1,NODE_LOCATION2,...: a comma-separated list of zones, in the same region as the control plane, to create nodes in. Choose zones that have GPU availability.
    • Create an autoscaled node pool with a2-ultragpu-1g instances that have one NVIDIA A100 (80 GB) GPU in each instance:

      gcloud container node-pools create NODEPOOL_NAME \
          --cluster=CLUSTER_NAME \
          --location=CONTROL_PLANE_LOCATION \
          --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
          --enable-autoscaling \
          --max-nodes=5 \
          --machine-type="a2-ultragpu-1g" \
          --accelerator="type=nvidia-a100-80gb,count=1,gpu-driver-version=disabled" \
          --num-nodes="1" \
          --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true,cloud.google.com/gke-nvidia-gpu-dra-driver=true
      
  2. Manually install NVIDIA GPU drivers.

  3. Install DRA drivers.
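Because automatic driver installation was disabled when you created the node pool, step 2 requires installing the NVIDIA GPU drivers yourself. On Container-Optimized OS nodes, a common approach is to deploy the driver installer DaemonSet that Google Cloud publishes; the following command is a sketch, and you should confirm that the manifest matches your node image and GKE version before applying it:

```shell
# Deploy the NVIDIA driver installer DaemonSet for Container-Optimized OS nodes.
# Verify that this manifest matches your node image before applying.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```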

Install DRA drivers

  1. Add the Helm repository that contains the NVIDIA DRA driver, then update your local repository index:

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update
    
  2. Install the NVIDIA DRA GPU driver with version 25.8.0 or later:

    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
        --version="25.8.0" --create-namespace --namespace=nvidia-dra-driver-gpu \
        --set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
        --set gpuResourcesEnabledOverride=true \
        --set resources.computeDomains.enabled=false \
        --set kubeletPlugin.priorityClassName="" \
        --set 'kubeletPlugin.tolerations[0].key=nvidia.com/gpu' \
        --set 'kubeletPlugin.tolerations[0].operator=Exists' \
        --set 'kubeletPlugin.tolerations[0].effect=NoSchedule'
    

    For Ubuntu nodes, specify the "/opt/nvidia" directory path in the --set nvidiaDriverRoot flag.
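Before you verify the infrastructure, you can optionally confirm that the Helm release deployed and created its DaemonSet. This quick check is not part of the official steps:

```shell
# Show the Helm release in the driver namespace.
helm list --namespace=nvidia-dra-driver-gpu

# Show the DaemonSets that the chart created.
kubectl get daemonsets --namespace=nvidia-dra-driver-gpu
```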

Verify that your infrastructure is ready for DRA

  1. Verify that your DRA driver Pods are running:

    kubectl get pods -n nvidia-dra-driver-gpu
    

    The output is similar to the following:

    NAME                                         READY   STATUS    RESTARTS   AGE
    nvidia-dra-driver-gpu-kubelet-plugin-52cdm   1/1     Running   0          46s
    
  2. Confirm that the ResourceSlice lists the hardware devices that you added:

    kubectl get resourceslices -o yaml
    

    The output is similar to the following:

    apiVersion: v1
    items:
    - apiVersion: resource.k8s.io/v1
      kind: ResourceSlice
      metadata:
      # Multiple lines are omitted here.
      spec:
        devices:
        - attributes:
            architecture:
              string: Ada Lovelace
            brand:
              string: Nvidia
            cudaComputeCapability:
              version: 8.9.0
            cudaDriverVersion:
              version: 13.0.0
            driverVersion:
              version: 580.65.6
            index:
              int: 0
            minor:
              int: 0
            pcieBusID:
              string: "0000:00:03.0"
            productName:
              string: NVIDIA L4
            resource.kubernetes.io/pcieRoot:
              string: pci0000:00
            type:
              string: gpu
            uuid:
              string: GPU-ccc19e5e-e3cd-f911-65c8-89bcef084e3f
          capacity:
            memory:
              value: 23034Mi
          name: gpu-0
        - attributes:
            architecture:
              string: Ada Lovelace
            brand:
              string: Nvidia
            cudaComputeCapability:
              version: 8.9.0
            cudaDriverVersion:
              version: 13.0.0
            driverVersion:
              version: 580.65.6
            index:
              int: 1
            minor:
              int: 1
            pcieBusID:
              string: "0000:00:04.0"
            productName:
              string: NVIDIA L4
            resource.kubernetes.io/pcieRoot:
              string: pci0000:00
            type:
              string: gpu
            uuid:
              string: GPU-f783198d-42f9-7cef-9ea1-bb10578df978
          capacity:
            memory:
              value: 23034Mi
          name: gpu-1
        driver: gpu.nvidia.com
        nodeName: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
        pool:
          generation: 1
          name: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
          resourceSliceCount: 1
    kind: List
    metadata:
      resourceVersion: ""
    
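As an optional shortcut to reviewing the full YAML output, a JSONPath query can print one line per ResourceSlice with the node name and its advertised device names:

```shell
# Print each ResourceSlice's node name followed by its device names,
# for example: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm: gpu-0 gpu-1
kubectl get resourceslices \
    -o jsonpath='{range .items[*]}{.spec.nodeName}: {.spec.devices[*].name}{"\n"}{end}'
```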

What's next