Create an A4 GKE cluster

This document describes how to deploy an A4 GKE cluster. These clusters provide high-performance network topologies that let you efficiently run large-scale artificial intelligence (AI) and machine learning (ML) training workloads.

A4 accelerator-optimized machine types have NVIDIA B200 Blackwell GPUs attached and are ideal for foundation model training and serving.

To learn more about the A4 accelerator-optimized machine types, see the A4 machine type section in the Compute Engine documentation.

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

To use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set the compute/zone property instead. Setting a default location helps you avoid gcloud CLI errors like the following: One of [--zone, --region] must be supplied: Please specify location. If the location of your cluster differs from the default that you set, you might still need to specify the location in certain commands.
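
The setup tasks above can be sketched as a short shell script. This is a minimal sketch under stated assumptions, not an official procedure: PROJECT_ID, COMPUTE_REGION, the DRY_RUN switch, and the run helper are hypothetical names introduced for illustration, and us-central1 is only a placeholder region, not a statement about where A4 capacity is available. By default, the script prints each gcloud command instead of running it, so you can review the commands first.

```shell
# Placeholder values (assumptions): replace with your own project and region.
PROJECT_ID="${PROJECT_ID:-example-project}"
COMPUTE_REGION="${COMPUTE_REGION:-us-central1}"

# DRY_RUN=1 (the default here) prints commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Update an existing gcloud CLI installation to the latest version.
run gcloud components update

# Enable the Google Kubernetes Engine API on the project.
run gcloud services enable container.googleapis.com --project="$PROJECT_ID"

# Set a default region so later commands do not fail with
# "One of [--zone, --region] must be supplied".
run gcloud config set compute/region "$COMPUTE_REGION"
```

To run the commands for real, set DRY_RUN=0 in the environment before you run the script. If you primarily use zonal clusters, you would set compute/zone in the last command instead of compute/region.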

Required roles

To get the permissions that you need to deploy the cluster, ask your administrator to grant you the following IAM roles on the project:

- Kubernetes Engine Admin (roles/container.admin)
- Compute Admin (roles/compute.admin)
- Storage Admin (roles/storage.admin)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin)
- Service Account Admin (roles/iam.serviceAccountAdmin)
- Service Account User (roles/iam.serviceAccountUser)
- Service Usage Consumer (roles/serviceusage.serviceUsageConsumer)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
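
As an illustration of how an administrator might grant the roles listed above, the following sketch loops over the role IDs and builds one gcloud projects add-iam-policy-binding command per role. PROJECT_ID, MEMBER, the DRY_RUN switch, and the run and grant_roles helpers are hypothetical names chosen for this example, and the default values are placeholders. The script prints the commands by default rather than changing any IAM policy.

```shell
# Placeholder values (assumptions): replace with your project and principal.
PROJECT_ID="${PROJECT_ID:-example-project}"
MEMBER="${MEMBER:-user:developer@example.com}"

# DRY_RUN=1 (the default here) prints commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Grant each of the seven roles from the list above on the project.
grant_roles() {
  for role in \
    roles/container.admin \
    roles/compute.admin \
    roles/storage.admin \
    roles/resourcemanager.projectIamAdmin \
    roles/iam.serviceAccountAdmin \
    roles/iam.serviceAccountUser \
    roles/serviceusage.serviceUsageConsumer
  do
    run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="$MEMBER" --role="$role"
  done
}

grant_roles
```

Running the commands with DRY_RUN=0 requires an account that is itself allowed to change the project's IAM policy, which is why this is typically a task for your administrator.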

Choose a consumption option and obtain capacity

To obtain capacity, complete the following steps:

Choose a consumption option: Make your choice based on how you want to get and use GPU resources. To learn more, see Choose a consumption option.

For GKE, consider the following additional information when choosing a consumption option:
- For more information about flex-start (Preview) and GKE, see About GPU obtainability with flex-start.
- Flex-start uses best-effort compact placement. To examine your topology, see View the physical topology of nodes in your GKE cluster.
- When you use Spot VMs, topology information is available only if you configure compact placement.

Obtain capacity: The process to obtain capacity differs for each consumption option. To learn about the process for your chosen consumption option, see Capacity overview.

Requirements

The following requirements apply to an AI-optimized GKE cluster that uses A4 instances:

- GPU drivers: The B200 GPUs in A4 VM instances require GPU driver version R570 or later. By default, GKE installs this driver version on all A4 nodes that run GKE version 1.32.1-gke.1729000 or later, which is the minimum version for A4.
- GPUDirect RDMA: To use GPUDirect RDMA with A4, use GKE version 1.32.2-gke.1475000 or later.
- Node images: To use GPUDirect RDMA, the GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
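
The version requirements above can be checked with a small comparison helper. The following is a minimal sketch, assuming that GKE version strings have the form MAJOR.MINOR.PATCH-gke.BUILD as in the versions quoted above. The gke_version_at_least function and the NODE_VERSION variable are hypothetical names for this example, and the default NODE_VERSION is a sample value, not output from a real cluster.

```shell
# Minimum versions taken from the requirements above.
MIN_A4_VERSION="1.32.1-gke.1729000"
MIN_RDMA_VERSION="1.32.2-gke.1475000"

# Sample value (assumption): replace with the version your nodes run.
NODE_VERSION="${NODE_VERSION:-1.32.2-gke.1475000}"

# Succeeds (exit status 0) if version $1 is at least version $2.
# Both versions are split into four numeric fields
# (major, minor, patch, build) and compared left to right.
gke_version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    sub(/-gke\./, ".", have)
    sub(/-gke\./, ".", want)
    split(have, h, ".")
    split(want, w, ".")
    for (i = 1; i <= 4; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

if gke_version_at_least "$NODE_VERSION" "$MIN_A4_VERSION"; then
  echo "$NODE_VERSION meets the A4 minimum ($MIN_A4_VERSION)"
else
  echo "$NODE_VERSION is older than the A4 minimum ($MIN_A4_VERSION)"
fi

if gke_version_at_least "$NODE_VERSION" "$MIN_RDMA_VERSION"; then
  echo "$NODE_VERSION meets the GPUDirect RDMA minimum ($MIN_RDMA_VERSION)"
else
  echo "$NODE_VERSION is older than the GPUDirect RDMA minimum ($MIN_RDMA_VERSION)"
fi
```

The comparison is numeric per field, so a build number such as 1729000 is not compared as text. The check covers only the version requirements; it does not verify the node image type.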