This guide demonstrates how to deploy and serve a Stable Diffusion model on Google Kubernetes Engine (GKE) using TPUs, Ray Serve, and the Ray Operator add-on.
This guide is intended for Generative AI customers, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving models using Ray.
About Ray and Ray Serve
Ray is an open-source scalable compute framework for AI/ML applications. Ray Serve is a model serving library for Ray used for scaling and serving models in a distributed environment. For more information, see Ray Serve in the Ray documentation.
About TPUs
Tensor Processing Units (TPUs) are specialized hardware accelerators designed to significantly speed up the training and inference of large-scale machine learning models. Using Ray with TPUs lets you seamlessly scale high-performance ML applications. For more information about TPUs, see Introduction to Cloud TPU in the Cloud TPU documentation.
About the KubeRay TPU initialization webhook
As part of the Ray Operator add-on, GKE provides validating and mutating webhooks that handle TPU Pod scheduling and certain TPU environment variables required by frameworks like JAX for container initialization. The KubeRay TPU webhook mutates Pods with the app.kubernetes.io/name: kuberay label requesting TPUs with the following properties:
- TPU_WORKER_ID: A unique integer for each worker Pod in the TPU slice.
- TPU_WORKER_HOSTNAMES: A list of DNS hostnames for all TPU workers that need to communicate with each other within the slice. This variable is only injected for TPU Pods in a multi-host group.
- replicaIndex: A Pod label that contains a unique identifier for the worker-group replica the Pod belongs to. This is useful for multi-host worker groups, where multiple worker Pods might belong to the same replica, and is used by Ray to enable multi-host autoscaling.
- TPU_NAME: A string representing the GKE TPU PodSlice this Pod belongs to, set to the same value as the replicaIndex label.
- podAffinity: Ensures GKE schedules TPU Pods with matching replicaIndex labels on the same node pool. This lets GKE scale multi-host TPUs atomically by node pools, rather than single nodes.
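For illustration only, the following is a minimal sketch of what a mutated TPU worker Pod might contain after admission. The values shown are placeholders; the actual values are assigned by the webhook at admission time and their exact formats can differ:
metadata:
  labels:
    app.kubernetes.io/name: kuberay
    replicaIndex: tpu-group-0        # injected: identifies the worker-group replica (PodSlice)
spec:
  containers:
  - name: ray-worker
    resources:
      limits:
        google.com/tpu: "4"          # the TPU request that triggers the webhook
    env:
    - name: TPU_WORKER_ID
      value: "0"                     # injected: unique per worker Pod in the slice
    - name: TPU_WORKER_HOSTNAMES
      value: "..."                   # injected for multi-host groups only: DNS hostnames of all workers in the slice
    - name: TPU_NAME
      value: tpu-group-0             # injected: same value as the replicaIndex label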
Prepare your environment
To prepare your environment, follow these steps:
Launch a Cloud Shell session from the Google Cloud console by clicking
Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set environment variables:
export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
export COMPUTE_REGION=us-central2-b
export CLUSTER_VERSION=CLUSTER_VERSION
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
Clone the GitHub repository:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
Change to the working directory:
cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion
Create a cluster with a TPU node pool
Create a Standard GKE cluster with a TPU node pool:
Create a Standard mode cluster with the Ray Operator enabled:
gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --machine-type=n1-standard-8 \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}
Create a single-host TPU node pool:
gcloud container node-pools create tpu-pool \
    --location=${COMPUTE_REGION} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct4p-hightpu-4t \
    --num-nodes=1
To use TPUs with Standard mode, you must select:
- A Compute Engine location with capacity for TPU accelerators
- A compatible machine type for the TPU
- The physical topology of the TPU PodSlice
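If you want a multi-host TPU slice instead, create a node pool whose node count matches the number of TPU VM hosts in the slice. The following is a hedged sketch, assuming a v4 slice with a 2x2x4 topology (16 chips across 4 hosts, 4 chips per host); the machine type and topology values depend on the TPU generation and slice size you choose:
gcloud container node-pools create tpu-multihost-pool \
    --location=${COMPUTE_REGION} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct4p-hightpu-4t \
    --tpu-topology=2x2x4 \
    --num-nodes=4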
Configure a RayCluster resource with TPUs
Configure your RayCluster manifest to prepare your TPU workload:
Configure TPU nodeSelector
GKE uses Kubernetes nodeSelectors to ensure that TPU workloads are scheduled on the appropriate TPU topology and accelerator. For more information about selecting TPU nodeSelectors, see Deploy TPU workloads in GKE Standard.
Update the ray-cluster.yaml
manifest to schedule your Pod on a v4 TPU podslice
with a 2x2x1 topology:
nodeSelector:
cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
cloud.google.com/gke-tpu-topology: 2x2x1
Configure a TPU container resource
To use a TPU accelerator, you must specify the number of TPU chips that
GKE should allocate to each Pod by configuring the
google.com/tpu
resource limits
and requests
in the TPU container field
of your RayCluster manifest workerGroupSpecs
.
Update the ray-cluster.yaml
manifest with resource limits and requests:
resources:
limits:
cpu: "1"
ephemeral-storage: 10Gi
google.com/tpu: "4"
memory: "2G"
requests:
cpu: "1"
ephemeral-storage: 10Gi
google.com/tpu: "4"
memory: "2G"
Configure worker group numOfHosts
KubeRay v1.1.0 adds a numOfHosts
field to the RayCluster custom resource,
which specifies the number of TPU hosts to create per worker group replica.
For multi-host worker groups, replicas are treated as PodSlices rather than
individual workers, with numOfHosts
worker nodes being created per replica.
Update the ray-cluster.yaml
manifest with the following:
workerGroupSpecs:
# Several lines omitted
numOfHosts: 1 # the number of "hosts" or workers per replica
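For example, a minimal sketch of a multi-host worker group, assuming a v4 slice with a 2x2x4 topology (4 TPU VM hosts with 4 chips each), would set numOfHosts to the number of hosts in the slice and keep the per-Pod chip count at 4:
workerGroupSpecs:
- replicas: 1
  numOfHosts: 4 # one worker Pod per TPU VM host in the slice
  # other required worker group fields omitted
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x4
      containers:
      - name: ray-worker
        resources:
          limits:
            google.com/tpu: "4" # chips per host, not per slice
          requests:
            google.com/tpu: "4"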
Create a RayService custom resource
Create a RayService custom resource:
Review the ray-service-tpu.yaml manifest in the repository directory you cloned earlier:
This manifest describes a RayService custom resource that creates a RayCluster resource with 1 head node and a TPU worker group with a 2x2x1 topology, meaning each worker node will have 4 v4 TPU chips.
The TPU node belongs to a single v4 TPU podslice with a 2x2x1 topology. To create a multi-host worker group, replace the gke-tpu nodeSelector values, the google.com/tpu container limits and requests, and the numOfHosts value with your multi-host configuration, similar to the sketch in the previous section. For more information about TPU multi-host topologies, see System architecture in the Cloud TPU documentation.
Apply the manifest to your cluster:
kubectl apply -f ray-service-tpu.yaml
Verify the RayService resource is running:
kubectl get rayservices
The output is similar to the following:
NAME                   SERVICE STATUS   NUM SERVE ENDPOINTS
stable-diffusion-tpu   Running          2
In this output, Running in the SERVICE STATUS column indicates the RayService resource is ready.
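You can also confirm that the underlying RayCluster Pods are running. The Pod names are generated by KubeRay and will differ in your cluster, but you should see one head Pod and one worker Pod per TPU host:
kubectl get pods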
(Optional) View the Ray Dashboard
You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.
Establish a port-forwarding session to the Ray dashboard from the Ray head service:
kubectl port-forward svc/stable-diffusion-tpu-head-svc 8265:8265
In a web browser, go to http://localhost:8265/.
Click the Serve tab.
Send prompts to the model server
Establish a port-forwarding session to the Serve endpoint from the Ray head service:
kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000
Open a new Cloud Shell session.
Submit a text-to-image prompt to the Stable Diffusion model server:
python stable_diffusion_tpu_req.py --save_pictures
The results of the Stable Diffusion inference are saved to a file named diffusion_results.png.
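The image is written to the working directory of the Cloud Shell session where you ran the script. Assuming the cloudshell command-line helper is available in your session, you can download the file to your local machine:
cloudshell download diffusion_results.png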