Google Kubernetes Engine (GKE) DRANET is a managed GKE feature that builds upon the open-source DRANET project, which implements the Kubernetes DRA API for networking resources. DRANET lets you request and allocate high-performance networking resources for your Pods, including network interfaces that support Remote Direct Memory Access (RDMA). This approach provides a portable, upstream-aligned API for network resource management.
This document provides a conceptual overview of GKE DRANET and shows you how to allocate network resources to workloads in your GKE clusters.
This document is intended for Cloud architects and Networking specialists who design networks for their organizations. For an overview of all GKE documentation, see Explore GKE documentation. To learn about common roles and tasks referenced in Google Cloud content, see Common GKE user roles and tasks.
Before reading this document, ensure that you are familiar with the following:
- Dynamic Resource Allocation
- About dynamic resource allocation in GKE
- Kubernetes concepts
- GKE networking
- GKE Networking fundamentals
How GKE managed DRANET works
GKE managed DRANET is implemented through a networking-dra-driver
DaemonSet. This DaemonSet runs on nodes with GPUs or TPUs that have
GKE DRANET enabled. It functions as a node-level agent to make
network interfaces discoverable and allocatable to Pods through the Kubernetes Dynamic Resource Allocation (DRA) APIs.
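Because the driver publishes devices through the standard DRA APIs, you can inspect what it has discovered. The following check is a minimal sketch; the driver and device names that appear depend on your node's machine type:
kubectl get resourceslices
Each ResourceSlice lists the devices that a driver has made allocatable on a node.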
In GKE versions 1.34.1-gke.1829001 and later, GKE
automatically installs DeviceClass resources for networking. These classes
define the types of network devices that you can request. For example,
GKE creates the mrdma.google.com class for RDMA-capable devices
and the netdev.google.com class for other network devices.
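As a quick check, you can list the installed classes. This is a minimal sketch, assuming your cluster runs a supported GKE version:
kubectl get deviceclasses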
To use GKE DRANET, you first enable the GKE DRANET driver on a node pool with GPUs or TPUs.
To request network devices for a workload, define a ResourceClaimTemplate.
This template specifies the DeviceClass and the allocation mode, such as
requesting all available devices on a node. In your Pod specification, reference
this template in a resourceClaims field to grant your Pod access to the
requested network interfaces on the node.
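The following snippet is a minimal sketch of that relationship; the names are placeholders, and complete, working manifests appear later in this document:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: example-claim-template     # placeholder name
spec:
  spec:
    devices:
      requests:
      - name: req-netdev
        exactly:
          deviceClassName: netdev.google.com     # or mrdma.google.com
          allocationMode: All
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod     # placeholder name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9     # placeholder image
    resources:
      claims:
      - name: net     # refers to the resourceClaims entry below
  resourceClaims:
  - name: net
    resourceClaimTemplateName: example-claim-template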
When to use GKE managed DRANET
GKE DRANET provides a standardized way to manage network resources that are aware of topology and dependencies. This standardization makes it a suitable solution for AI and ML workloads that require high-performance networking.
Common use cases for requesting network interfaces for a Pod include:
- Requesting all available RDMA-capable interfaces.
- Requesting a specific number of RDMA-capable interfaces.
- Requesting all available non-RDMA interfaces.
- Requesting a specific number of non-RDMA interfaces.
Key considerations when using GKE managed DRANET for networking
Consider the following points when using GKE DRANET for networking:
Dedicated network interfaces
When you use GKE DRANET to claim a network interface for a Pod, that interface is dedicated to that Pod. Other Pods on the same node can't share it. This ensures that the Pod has sole access to the full bandwidth and resources of that interface, which is a key benefit for performance-sensitive workloads.
Use the GKE managed DRANET driver independently
You can enable the GKE DRANET driver to manage network resources without enabling other GKE DRA drivers. To do so, add the cloud.google.com/gke-networking-dra-driver=true label to a node pool with GPUs or TPUs.
Use other GKE DRA drivers
To achieve higher throughput in demanding AI/ML workloads, combine the DRA API for accelerators (like GPUs and TPUs) with GKE DRANET managed networking. This combined approach improves resource alignment and topology awareness. For guidance on using DRA for other resources, see Prepare GKE infrastructure for DRA workloads.
Avoid conflicting configurations
The GKE DRANET driver manages RDMA interfaces and gVNICs that don't have any secondary IP address ranges configured. Don't use both the GKE DRANET driver and the GKE multi-network API with a Network resource of type Device in the same cluster. Using the driver and the API together is not supported because both attempt to manage the same set of NICs, which can lead to an incorrect setup and unpredictable behavior.
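Before you enable the GKE DRANET driver in a cluster that already uses multi-networking, you can check for existing Network resources of type Device. The following command is a sketch; it assumes the multi-network Network CRD is installed as networks.networking.gke.io and exposes the type under .spec.type:
kubectl get networks.networking.gke.io -o custom-columns=NAME:.metadata.name,TYPE:.spec.type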
Requirements
To use GKE managed DRANET, your environment must meet the following requirements:
- GKE version 1.34.1-gke.1829001 or later.
- GKE Dataplane V2 is enabled on the cluster.
- (Preview) GKE DRANET is available on A4X Max machines.
Limitations
GKE DRANET has the following limitations:
- You can't use GKE DRANET to allocate the default network interface card (NIC) or virtual NICs (such as veth).
- Cluster autoscaling and Autopilot are not supported.
- You can't use GKE DRANET with interfaces where you have configured a secondaryPodRange.
Before you begin
To get the permissions to create clusters and allocate network
resources using GKE DRANET, ask your administrator to grant you
the Kubernetes Engine Admin (roles/container.admin) IAM role on
your project.
Before you begin, complete the following tasks:
Enable the Google Kubernetes Engine API:
gcloud services enable container.googleapis.com --project=PROJECT_ID
Replace PROJECT_ID with your Google Cloud project ID.
If you plan to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Create a GKE cluster with GKE Dataplane V2 and IP aliases enabled:
gcloud container clusters create CLUSTER_NAME \
--enable-dataplane-v2 \
--enable-ip-alias \
--region=REGION \
--project=PROJECT_ID \
--cluster-version=CLUSTER_VERSION
Replace the following:
- CLUSTER_NAME: the name of your new cluster.
- REGION: the Google Cloud region for your cluster.
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version of your cluster. This must be 1.34.1-gke.1829001 or later.
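Optionally, confirm that the cluster meets the requirements before you continue. The following check is a sketch; it assumes that the describe output exposes the datapath provider under networkConfig:
gcloud container clusters describe CLUSTER_NAME \
--region=REGION \
--format="value(currentMasterVersion,networkConfig.datapathProvider)"
A value of ADVANCED_DATAPATH indicates that GKE Dataplane V2 is enabled.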
Use RDMA interfaces from a GPU node pool
The following sections describe how to configure a GPU node pool and workload to use RDMA network interfaces with GKE DRANET.
Enable GKE managed DRANET driver on a GPU node pool
To enable the GKE DRANET driver on a GPU node pool that supports
RDMA, add the cloud.google.com/gke-networking-dra-driver=true label when you
create the node pool.
gcloud container node-pools create NODE_POOL_NAME \
--region=REGION \
--cluster=CLUSTER_NAME \
--node-locations=NODE_LOCATIONS \
--accelerator type=ACCELERATOR_TYPE,count=ACCELERATOR_COUNT,gpu-driver-version=DRIVER_VERSION \
--machine-type=MACHINE_TYPE \
--num-nodes=NUM_NODES \
--reservation-affinity=specific \
--reservation=projects/RESERVATION_PROJECT/reservations/RESERVATION_NAME/reservationBlocks/RESERVATION_BLOCK \
--accelerator-network-profile=auto \
--node-labels=cloud.google.com/gke-networking-dra-driver=true
Replace the following:
- NODE_POOL_NAME: the name of your new node pool.
- REGION: the Google Cloud region for your cluster.
- CLUSTER_NAME: the name of your cluster.
- NODE_LOCATIONS: the Google Cloud zones for the nodes in the node pool.
- ACCELERATOR_TYPE: the type of GPU accelerator. For example:
  - A4 VMs: enter nvidia-b200.
  - A3 Ultra VMs: enter nvidia-h200-141gb.
- ACCELERATOR_COUNT: the number of GPUs to attach to nodes in the node pool. For example, for both a4-highgpu-8g and a3-ultragpu-8g VMs, the number of GPUs is 8.
- DRIVER_VERSION: the GPU driver version to use. For example, default or latest.
- MACHINE_TYPE: the machine type for the node pool, for example, a3-ultragpu-8g.
- NUM_NODES: the number of nodes for the node pool. For flex-start, this value must be set to 0.
- RESERVATION_PROJECT: the project ID of the reservation.
- RESERVATION_NAME: the name of your reservation. To find this value, see View future reservation requests.
- RESERVATION_BLOCK: the name of a specific block within the reservation. To find this value, see View future reservation requests.
This command uses accelerator network profiles to automatically configure VPC networks and subnets for your accelerator VMs. Alternatively, you can explicitly specify your VPC network and subnets.
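To confirm that the nodes in the new node pool carry the driver label, you can list them by that label. This is a minimal check:
kubectl get nodes -l cloud.google.com/gke-networking-dra-driver=true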
Deploy a workload with RDMA resources
To allocate RDMA resources for a Pod, specify a ResourceClaimTemplate.
Create a ResourceClaimTemplate to define how to allocate the RDMA devices. The following manifest requests all available mrdma devices on the node. Save the manifest as all-mrdma-template.yaml:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-mrdma
spec:
  spec:
    devices:
      requests:
      - name: req-mrdma
        exactly:
          deviceClassName: mrdma.google.com
          allocationMode: All
Apply the manifest:
kubectl apply -f all-mrdma-template.yaml
Deploy your workload and reference the ResourceClaimTemplate. The following manifest deploys a Pod that references the all-mrdma template, which grants the Pod access to the RDMA interfaces on the node. Save the manifest as agnhost-rdma-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: agnhost-rdma
  namespace: default
  labels:
    app: agnhost
spec:
  containers:
  - name: agnhost
    image: registry.k8s.io/e2e-test-images/agnhost:2.39
    args: ["netexec", "--http-port", "80"]
    ports:
    - name: agnhost-port
      containerPort: 80
    resources:
      claims:
      - name: rdma
      limits:
        nvidia.com/gpu: 1
  resourceClaims:
  - name: rdma
    resourceClaimTemplateName: all-mrdma
Apply the manifest:
kubectl apply -f agnhost-rdma-pod.yaml
Verify that the additional allocated network interfaces are visible inside the Pod:
kubectl exec agnhost-rdma -- ls /sys/class/net
The following example output shows the default eth0 and lo interfaces, as well as the allocated RDMA interfaces, such as gpu0rdma0. The number and names of the network interfaces (NICs) vary based on the GKE node's machine type.
eth0
gpu0rdma0
gpu1rdma0
gpu2rdma0
gpu3rdma0
lo
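You can also confirm that the claim generated from the ResourceClaimTemplate was allocated. This is a minimal check; Kubernetes derives the claim name from the Pod name and the claim name, so the exact name in your cluster differs:
kubectl get resourceclaims -n default
The claim that belongs to the agnhost-rdma Pod should report an allocated state after the Pod is scheduled.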
Use non-RDMA network interfaces in a TPU node pool
The following sections describe how to configure a TPU node pool and workload to use non-RDMA network interfaces with GKE DRANET.
Verify networking DeviceClasses
Verify that the DeviceClass resources for networking exist in your cluster.
kubectl get deviceclass netdev.google.com
The output is similar to the following:
NAME AGE
netdev.google.com 2d22h
Enable GKE managed DRANET driver on a TPU slice node pool
To enable the GKE DRANET driver when creating a TPU slice node pool, add the cloud.google.com/gke-networking-dra-driver=true label.
gcloud container node-pools create NODE_POOL_NAME \
--location=LOCATION \
--cluster=CLUSTER_NAME \
--node-locations=NODE_LOCATIONS \
--machine-type=MACHINE_TYPE \
--tpu-topology=TPU_TOPOLOGY \
--num-nodes=NUM_NODES \
--accelerator-network-profile=auto \
--node-labels=cloud.google.com/gke-networking-dra-driver=true
Replace the following:
- NODE_POOL_NAME: the name of your new node pool.
- LOCATION: the Google Cloud region or zone for your cluster.
- CLUSTER_NAME: the name of your cluster.
- NODE_LOCATIONS: the Google Cloud zones for the nodes in the node pool.
- MACHINE_TYPE: the type of machine to use for nodes. For more information about TPU-compatible machine types, see Choose the TPU version.
- TPU_TOPOLOGY: the TPU topology, for example, 2x4x4. The format of the topology depends on the TPU version. To learn more about TPU topologies, see Choose a topology.
- NUM_NODES: the number of nodes in the node pool.
For more information, see Create a single-host TPU slice node pool.
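To verify that the GKE DRANET driver is running on the new TPU slice nodes, you can look for the driver's DaemonSet Pods. This check is a sketch and assumes that the Pod names contain networking-dra-driver:
kubectl get pods --all-namespaces -o wide | grep networking-dra-driver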
Deploy a workload claiming all network devices
To allocate non-RDMA network devices for a Pod, specify a ResourceClaimTemplate.
Create a ResourceClaimTemplate that references the netdev.google.com DeviceClass. The following manifest requests all available non-RDMA network devices on the node. Save the manifest as all-netdev-template.yaml:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-netdev
spec:
  spec:
    devices:
      requests:
      - name: req-netdev
        exactly:
          deviceClassName: netdev.google.com
          allocationMode: All
Apply the manifest:
kubectl apply -f all-netdev-template.yaml
Deploy your workload and reference the ResourceClaimTemplate. The following manifest deploys a Pod that uses the all-netdev template to grant the Pod access to all non-RDMA network devices on the node. Save the manifest as netdev-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: agnhost-netdev
  namespace: default
  labels:
    app: agnhost
spec:
  containers:
  - name: agnhost
    image: registry.k8s.io/e2e-test-images/agnhost:2.39
    args: ["netexec", "--http-port", "80"]
    ports:
    - name: agnhost-port
      containerPort: 80
    resources:
      claims:
      - name: netdev
      limits:
        google.com/tpu: 4
  nodeSelector:
    cloud.google.com/gke-tpu-accelerator: TPU_ACCELERATOR
    cloud.google.com/gke-tpu-topology: TPU_TOPOLOGY
  resourceClaims:
  - name: netdev
    resourceClaimTemplateName: all-netdev
Replace the following:
- TPU_ACCELERATOR: the TPU accelerator type, for example, tpu-v5p-slice.
- TPU_TOPOLOGY: the TPU topology, for example, 2x4x4.
Apply the manifest:
kubectl apply -f netdev-pod.yaml
Verify that the additional allocated network interfaces are visible inside the Pod:
kubectl exec agnhost-netdev -- ls /sys/class/net
The following example output shows the default eth0 and lo interfaces, along with the allocated network devices, which have names like eth1 and eth2. The number of NICs and their names vary based on the machine type of the GKE node.
eth0
eth1
eth2
lo
Request a specific number of network devices
The preceding examples show how to request all available network devices of a
certain type by setting allocationMode to All. If you need to request a
specific number of devices instead, you can set allocationMode to ExactCount
in your ResourceClaimTemplate.
The following example requests two RDMA network devices:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
name: two-mrdma
spec:
spec:
devices:
requests:
- name: req-mrdma
exactly:
deviceClassName: mrdma.google.com
allocationMode: ExactCount
count: 2
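A Pod references this template the same way as in the earlier examples, through the resourceClaims field. The following snippet is a minimal sketch that shows only the relevant fields:
spec:
  containers:
  - name: app
    resources:
      claims:
      - name: rdma
  resourceClaims:
  - name: rdma
    resourceClaimTemplateName: two-mrdma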
What's next
- Learn more about Dynamic Resource Allocation.
- Learn how to Configure automated networking for accelerator VMs.