This document explains how to manually set up your Google Kubernetes Engine (GKE) infrastructure to support dynamic resource allocation (DRA). The setup steps include creating node pools that use GPUs and installing DRA drivers.
This document is intended for platform administrators who want to create infrastructure with specialized hardware devices that application operators can claim in workloads.
Limitations
The following limitations apply:
- Limitations of DRA in GKE
- Device-specific limitations, which apply regardless of whether you use DRA (for example, the limitations of GPU workloads on Standard clusters)
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the `gcloud components update` command. Earlier gcloud CLI versions might not support running the commands in this document.
- Have a GKE Standard cluster that runs version 1.35 or later. You can also create a regional cluster.
- Install Helm. If you use Cloud Shell, Helm is already installed.
Create a GKE node pool with GPUs
This section shows you how to create a GPU node pool and install the corresponding DRA drivers. The steps in this section apply only to node pools that you manually create. To create a GPU node pool that supports DRA, you must do the following:
- Disable automatic GPU driver installation: specify the `gpu-driver-version=disabled` option in the `--accelerator` flag.
- Disable the GPU device plugin: add the `gke-no-default-nvidia-gpu-device-plugin=true` node label to the node pool.
- Run the DRA driver DaemonSet: add the `nvidia.com/gpu.present=true` node label to the node pool.
- Configure autoscaling: to use the cluster autoscaler in your node pool, add the `cloud.google.com/gke-nvidia-gpu-dra-driver=true` node label to the node pool. The cluster autoscaler uses this node label to identify nodes that run the DRA driver for GPUs.
To create and configure GPU node pools, follow these steps:
Create a GPU node pool. The following example commands create node pools with different configurations:
Create a node pool with a `g2-standard-24` instance that has two L4 GPUs:

```
gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
    --machine-type="g2-standard-24" \
    --accelerator="type=nvidia-l4,count=2,gpu-driver-version=disabled" \
    --num-nodes="1" \
    --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true
```

Replace the following:

- `NODEPOOL_NAME`: a name for your node pool.
- `CLUSTER_NAME`: the name of your cluster.
- `CONTROL_PLANE_LOCATION`: the region or zone of the cluster control plane, such as `us-central1` or `us-central1-a`.
- `NODE_LOCATION1,NODE_LOCATION2,...`: a comma-separated list of zones, in the same region as the control plane, to create nodes in. Choose zones that have GPU availability.
Create an autoscaled node pool with `a2-ultragpu-1g` instances that have one NVIDIA A100 (80 GB) GPU in each instance:

```
gcloud container node-pools create NODEPOOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --node-locations=NODE_LOCATION1,NODE_LOCATION2,... \
    --enable-autoscaling \
    --max-nodes=5 \
    --machine-type="a2-ultragpu-1g" \
    --accelerator="type=nvidia-a100-80gb,count=1,gpu-driver-version=disabled" \
    --num-nodes="1" \
    --node-labels=gke-no-default-nvidia-gpu-device-plugin=true,nvidia.com/gpu.present=true,cloud.google.com/gke-nvidia-gpu-dra-driver=true
```
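For reference, the values that you pass in `--node-labels` become labels in each node's metadata. The following illustrative fragment (not a complete Node manifest) shows how the three DRA-related labels appear on a node from the autoscaled pool:

```yaml
# Illustrative fragment only: the DRA-related labels from the
# --node-labels flag as they appear on a node object.
apiVersion: v1
kind: Node
metadata:
  labels:
    gke-no-default-nvidia-gpu-device-plugin: "true"
    nvidia.com/gpu.present: "true"
    cloud.google.com/gke-nvidia-gpu-dra-driver: "true"
```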
Install DRA drivers
Pull and update the Helm chart repository that contains the NVIDIA DRA driver:

```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
```

Install the NVIDIA DRA GPU driver with version 25.8.0 or later:

```
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.8.0" --create-namespace --namespace=nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot="/home/kubernetes/bin/nvidia/" \
    --set gpuResourcesEnabledOverride=true \
    --set resources.computeDomains.enabled=false \
    --set kubeletPlugin.priorityClassName="" \
    --set 'kubeletPlugin.tolerations[0].key=nvidia.com/gpu' \
    --set 'kubeletPlugin.tolerations[0].operator=Exists' \
    --set 'kubeletPlugin.tolerations[0].effect=NoSchedule'
```

For Ubuntu nodes, specify the `"/opt/nvidia"` directory path in the `--set nvidiaDriverRoot` flag.
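If you prefer a values file over repeated `--set` flags, the same configuration can be expressed as the following sketch, assembled from the flags above, and passed to `helm install` with `-f values.yaml`. Verify the keys against the chart's own values before use:

```yaml
# values.yaml -- equivalent to the --set flags in the helm install command above.
nvidiaDriverRoot: /home/kubernetes/bin/nvidia/  # use /opt/nvidia on Ubuntu nodes
gpuResourcesEnabledOverride: true
resources:
  computeDomains:
    enabled: false
kubeletPlugin:
  priorityClassName: ""
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```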
Verify that your infrastructure is ready for DRA
Verify that your DRA driver Pods are running:
```
kubectl get pods -n nvidia-dra-driver-gpu
```

The output is similar to the following:

```
NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-gpu-kubelet-plugin-52cdm   1/1     Running   0          46s
```

Confirm that the ResourceSlice lists the hardware devices that you added:
```
kubectl get resourceslices -o yaml
```

The output is similar to the following:

```yaml
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1
  kind: ResourceSlice
  metadata:
    # Multiple lines are omitted here.
  spec:
    devices:
    - attributes:
        architecture:
          string: Ada Lovelace
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.9.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.65.6
        index:
          int: 0
        minor:
          int: 0
        pcieBusID:
          string: "0000:00:03.0"
        productName:
          string: NVIDIA L4
        resource.kubernetes.io/pcieRoot:
          string: pci0000:00
        type:
          string: gpu
        uuid:
          string: GPU-ccc19e5e-e3cd-f911-65c8-89bcef084e3f
      capacity:
        memory:
          value: 23034Mi
      name: gpu-0
    - attributes:
        architecture:
          string: Ada Lovelace
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 8.9.0
        cudaDriverVersion:
          version: 13.0.0
        driverVersion:
          version: 580.65.6
        index:
          int: 1
        minor:
          int: 1
        pcieBusID:
          string: "0000:00:04.0"
        productName:
          string: NVIDIA L4
        resource.kubernetes.io/pcieRoot:
          string: pci0000:00
        type:
          string: gpu
        uuid:
          string: GPU-f783198d-42f9-7cef-9ea1-bb10578df978
      capacity:
        memory:
          value: 23034Mi
      name: gpu-1
    driver: gpu.nvidia.com
    nodeName: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
    pool:
      generation: 1
      name: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
      resourceSliceCount: 1
kind: List
metadata:
  resourceVersion: ""
```
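As a quick sanity check on a saved copy of this output, you can count the advertised devices: each GPU appears as a `name: gpu-N` entry, and the count should match the number of GPUs you attached to the node pool. The following self-contained sketch uses an inline two-GPU sample standing in for real `kubectl get resourceslices -o yaml` output:

```shell
# Stand-in for real output; in a cluster, create the file with:
#   kubectl get resourceslices -o yaml > slices.yaml
cat > slices.yaml <<'EOF'
spec:
  devices:
  - name: gpu-0
  - name: gpu-1
  nodeName: gke-cluster-1-dra-gpu-pool-b56c4961-7vnm
EOF

# Each advertised device has a "name: gpu-N" line; for a two-GPU node
# pool this prints 2.
grep -c 'name: gpu-' slices.yaml
```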