This document describes how to set up the Google Kubernetes Engine (GKE)
multi-cluster Inference Gateway to intelligently load-balance
your AI/ML inference workloads across multiple GKE clusters,
which can span different regions. This setup uses Gateway API, Multi Cluster Ingress, and
custom resources like InferencePool and InferenceObjective to improve
scalability, help ensure high availability, and optimize resource utilization for
your model-serving deployments.
To understand this document, be familiar with the following:
- AI/ML orchestration on GKE.
- Generative AI terminology.
- GKE networking concepts, including:
- Load balancing in Google Cloud, especially how load balancers interact with GKE.
This document is for the following personas:
- Machine learning (ML) engineers, Platform admins and operators, or Data and AI specialists who want to use GKE's container orchestration capabilities for serving AI/ML workloads.
- Cloud architects or Networking specialists who interact with GKE networking.
To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Enable the Compute Engine API, the Google Kubernetes Engine API, the Model Armor API, and the Network Services API.
Go to Enable access to APIs and follow the instructions.
Enable the Autoscaling API.
Go to Autoscaling API and follow the instructions.
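If you prefer to use the gcloud CLI, you can enable these APIs in one step. The following command is a sketch: the first four service names are the standard ones for the Compute Engine, Google Kubernetes Engine, Model Armor, and Network Services APIs, and the autoscaling.googleapis.com entry is an assumption for the Autoscaling API; confirm the service names in your project before you run it.

gcloud services enable \
    compute.googleapis.com \
    container.googleapis.com \
    modelarmor.googleapis.com \
    networkservices.googleapis.com \
    autoscaling.googleapis.com \
    --project=PROJECT_ID

Replace PROJECT_ID with your project ID.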
Hugging Face prerequisites:
- Create a Hugging Face account if you don't already have one.
- Request and get approval for access to the Llama 3.1 model on Hugging Face.
- Sign the license consent agreement on the model's page on Hugging Face.
- Generate a Hugging Face access token with at least Read permissions.
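To optionally confirm that the token works before you store it in a cluster, you can call the Hugging Face Hub whoami endpoint. This check uses the public Hugging Face API and is independent of GKE:

# Returns your account details if the token is valid; an error indicates a bad or insufficiently scoped token.
curl -s -H "Authorization: Bearer HF_TOKEN" https://huggingface.co/api/whoami-v2

Replace HF_TOKEN with your Hugging Face access token.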
Requirements
- Ensure your project has sufficient quota for H100 GPUs. For more information, see Plan GPU quota and Allocation quotas.
- Use GKE version 1.34.1-gke.1127000 or later.
- Use gcloud CLI version 480.0.0 or later.
- Your node service accounts must have permissions to write metrics to the Autoscaling API.
- You must have the following IAM roles on the project: roles/container.admin and roles/iam.serviceAccountAdmin.
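If you need to grant these roles, a project administrator can do so with the gcloud CLI. The user:USER_EMAIL member format shown here is an example; adjust it if you use a service account or group.

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.admin"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/iam.serviceAccountAdmin"

Replace PROJECT_ID with your project ID and USER_EMAIL with the email address of the account that runs the commands in this document.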
Set up multi-cluster Inference Gateway
To set up the GKE multi-cluster Inference Gateway, follow these steps:
Create clusters and node pools
To host your AI/ML inference workloads and enable cross-regional load balancing, create two GKE clusters in different regions, each with an H100 GPU node pool.
Create the first cluster:
gcloud container clusters create CLUSTER_1_NAME \
    --region LOCATION \
    --project=PROJECT_ID \
    --gateway-api=standard \
    --release-channel "rapid" \
    --cluster-version=GKE_VERSION \
    --machine-type="MACHINE_TYPE" \
    --disk-type="DISK_TYPE" \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM \
    --hpa-profile=performance \
    --async # Allows the command to return immediately

Replace the following:

- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- LOCATION: the region for the first cluster, for example europe-west3.
- PROJECT_ID: your project ID.
- GKE_VERSION: the GKE version to use, for example 1.34.1-gke.1127000.
- MACHINE_TYPE: the machine type for the cluster nodes, for example c2-standard-16.
- DISK_TYPE: the disk type for the cluster nodes, for example pd-standard.
Create an H100 node pool for the first cluster:
gcloud container node-pools create NODE_POOL_NAME \
    --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
    --project=PROJECT_ID \
    --location=CLUSTER_1_ZONE \
    --node-locations=CLUSTER_1_ZONE \
    --cluster=CLUSTER_1_NAME \
    --machine-type=NODE_POOL_MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --spot \
    --async # Allows the command to return immediately

Replace the following:

- NODE_POOL_NAME: the name of the node pool, for example h100.
- PROJECT_ID: your project ID.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- NODE_POOL_MACHINE_TYPE: the machine type for the node pool, for example a3-highgpu-2g.
- NUM_NODES: the number of nodes in the node pool, for example 3.
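Because the create commands use the --async flag, they return before provisioning finishes. You can optionally confirm that the cluster and node pool are ready before you continue; both commands report RUNNING when provisioning is complete. The flags mirror the locations used in the create commands above.

gcloud container clusters describe CLUSTER_1_NAME \
    --region LOCATION \
    --project=PROJECT_ID \
    --format="value(status)"

gcloud container node-pools describe NODE_POOL_NAME \
    --cluster=CLUSTER_1_NAME \
    --location=CLUSTER_1_ZONE \
    --project=PROJECT_ID \
    --format="value(status)"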
Get the credentials:
gcloud container clusters get-credentials CLUSTER_1_NAME \
    --location CLUSTER_1_ZONE \
    --project=PROJECT_ID

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
On the first cluster, create a secret for the Hugging Face token:
kubectl create secret generic hf-token \
    --from-literal=token=HF_TOKEN

Replace the following:

- HF_TOKEN: your Hugging Face access token.

Create the second cluster in a different region from the first cluster:
gcloud container clusters create CLUSTER_2_NAME \
    --region LOCATION \
    --project=PROJECT_ID \
    --gateway-api=standard \
    --release-channel "rapid" \
    --cluster-version=GKE_VERSION \
    --machine-type="MACHINE_TYPE" \
    --disk-type="DISK_TYPE" \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM \
    --hpa-profile=performance \
    --async # Allows the command to return immediately while the cluster is created in the background.

Replace the following:

- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- LOCATION: the region for the second cluster. This must be a different region than the first cluster. For example, us-east4.
- PROJECT_ID: your project ID.
- GKE_VERSION: the GKE version to use, for example 1.34.1-gke.1127000.
- MACHINE_TYPE: the machine type for the cluster nodes, for example c2-standard-16.
- DISK_TYPE: the disk type for the cluster nodes, for example pd-standard.
Create an H100 node pool for the second cluster:
gcloud container node-pools create NODE_POOL_NAME \
    --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
    --project=PROJECT_ID \
    --location=CLUSTER_2_ZONE \
    --node-locations=CLUSTER_2_ZONE \
    --cluster=CLUSTER_2_NAME \
    --machine-type=NODE_POOL_MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --spot \
    --async # Allows the command to return immediately

Replace the following:

- NODE_POOL_NAME: the name of the node pool, for example h100.
- PROJECT_ID: your project ID.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- NODE_POOL_MACHINE_TYPE: the machine type for the node pool, for example a3-highgpu-2g.
- NUM_NODES: the number of nodes in the node pool, for example 3.
For the second cluster, get credentials and create a secret for the Hugging Face token:
gcloud container clusters get-credentials CLUSTER_2_NAME \
    --location CLUSTER_2_ZONE \
    --project=PROJECT_ID

kubectl create secret generic hf-token --from-literal=token=HF_TOKEN

Replace the following:

- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
- PROJECT_ID: your project ID.
- HF_TOKEN: your Hugging Face access token.
Register clusters to a fleet
To enable multi-cluster capabilities, such as the GKE Multi-cluster Inference Gateway, register your clusters to a fleet.
Register both clusters to your project's fleet:
gcloud container fleet memberships register CLUSTER_1_NAME \
    --gke-cluster CLUSTER_1_ZONE/CLUSTER_1_NAME \
    --location=global \
    --project=PROJECT_ID

gcloud container fleet memberships register CLUSTER_2_NAME \
    --gke-cluster CLUSTER_2_ZONE/CLUSTER_2_NAME \
    --location=global \
    --project=PROJECT_ID

Replace the following:

- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
- PROJECT_ID: your project ID.
- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
To allow a single Gateway to manage traffic across multiple clusters, enable the multi-cluster Ingress feature and designate a config cluster:
gcloud container fleet ingress enable \
    --config-membership=projects/PROJECT_ID/locations/global/memberships/CLUSTER_1_NAME

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
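To optionally confirm the fleet setup, list the memberships and check that the multi-cluster Ingress feature reports the expected config cluster:

gcloud container fleet memberships list --project=PROJECT_ID
gcloud container fleet ingress describe --project=PROJECT_ID

Replace PROJECT_ID with your project ID.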
Create proxy-only subnets
For an internal gateway, create a proxy-only subnet in each region. The internal Gateway's Envoy proxies use these dedicated subnets to handle traffic within your VPC network.
Create a subnet in the first cluster's region:
gcloud compute networks subnets create CLUSTER_1_REGION-subnet \
    --purpose=GLOBAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=CLUSTER_1_REGION \
    --network=default \
    --range=10.0.0.0/23 \
    --project=PROJECT_ID

Create a subnet in the second cluster's region:

gcloud compute networks subnets create CLUSTER_2_REGION-subnet \
    --purpose=GLOBAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=CLUSTER_2_REGION \
    --network=default \
    --range=10.5.0.0/23 \
    --project=PROJECT_ID

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_REGION: the region for the first cluster, for example europe-west3.
- CLUSTER_2_REGION: the region for the second cluster, for example us-east4.
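You can optionally verify that both proxy-only subnets exist with the expected purpose. The --filter and --format values shown here are one way to narrow the output:

gcloud compute networks subnets list \
    --project=PROJECT_ID \
    --filter="purpose=GLOBAL_MANAGED_PROXY" \
    --format="table(name,region,ipCidrRange)"

Replace PROJECT_ID with your project ID.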
Install the required CRDs
The GKE Multi-cluster Inference Gateway uses
custom resources such as InferencePool and InferenceObjective. The
GKE Gateway API controller manages the InferencePool Custom Resource
Definition (CRD). However, you must manually install the InferenceObjective
CRD, which is in alpha, on your clusters.
Define context variables for your clusters:
CLUSTER1_CONTEXT="gke_PROJECT_ID_CLUSTER_1_ZONE_CLUSTER_1_NAME"
CLUSTER2_CONTEXT="gke_PROJECT_ID_CLUSTER_2_ZONE_CLUSTER_2_NAME"

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
Install the InferenceObjective CRD on both clusters:

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
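To optionally confirm that the CRD is installed, check for it by name on each cluster:

kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER1_CONTEXT
kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER2_CONTEXT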
Deploy resources to the target clusters
To make your AI/ML inference workloads available on each cluster, deploy the
required resources, such as the model servers and InferenceObjective
custom resources.
Deploy the model servers to both clusters:
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
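The model servers can take several minutes to pull images and load model weights. You can optionally watch the Pods on each cluster; the app=vllm-llama3-8b-instruct label matches the selector that the InferencePool and metrics configuration use later in this document:

kubectl get pods -l app=vllm-llama3-8b-instruct --context=CLUSTER1_CONTEXT
kubectl get pods -l app=vllm-llama3-8b-instruct --context=CLUSTER2_CONTEXT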
Deploy the InferenceObjective resources to both clusters. Save the following sample manifest to a file named inference-objective.yaml:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: food-review
spec:
  priority: 10
  poolRef:
    name: vllm-llama3-8b-instruct
    group: "inference.networking.k8s.io"

Apply the manifest to both clusters:

kubectl apply -f inference-objective.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f inference-objective.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
Deploy the InferencePool resources to both clusters by using Helm:

helm install vllm-llama3-8b-instruct \
    --kube-context CLUSTER1_CONTEXT \
    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
    --set provider.name=gke \
    --version v1.1.0 \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

helm install vllm-llama3-8b-instruct \
    --kube-context CLUSTER2_CONTEXT \
    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
    --set provider.name=gke \
    --set inferenceExtension.monitoring.gke.enabled=true \
    --version v1.1.0 \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
Mark the InferencePool resources as exported on both clusters. This annotation makes the InferencePool available for import by the config cluster, which is a required step for multi-cluster routing.

kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
    --context=CLUSTER1_CONTEXT

kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
    --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
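To optionally confirm that the export annotation was applied, inspect the InferencePool annotations on each cluster:

kubectl get inferencepool vllm-llama3-8b-instruct -o jsonpath='{.metadata.annotations}' --context=CLUSTER1_CONTEXT
kubectl get inferencepool vllm-llama3-8b-instruct -o jsonpath='{.metadata.annotations}' --context=CLUSTER2_CONTEXT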
Deploy resources to the config cluster
To define how traffic is routed and load-balanced across the InferencePool
resources in all registered clusters, deploy the Gateway,
HTTPRoute, and HealthCheckPolicy resources. You deploy these resources
only to the designated config cluster, which is gke-west in this document.
Create a file named mcig.yaml with the following content:

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cross-region-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-cross-regional-internal-managed-mc
  addresses:
  - type: networking.gke.io/ephemeral-ipv4-address/europe-west3
    value: "europe-west3"
  - type: networking.gke.io/ephemeral-ipv4-address/us-east4
    value: "us-east4"
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-llama3-8b-instruct-default
spec:
  parentRefs:
  - name: cross-region-gateway
    kind: Gateway
  rules:
  - backendRefs:
    - group: networking.gke.io
      kind: GCPInferencePoolImport
      name: vllm-llama3-8b-instruct
---
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: health-check-policy
  namespace: default
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    config:
      type: HTTP
      httpHealthCheck:
        requestPath: /health
        port: 8000

Apply the manifest:

kubectl apply -f mcig.yaml --context=CLUSTER1_CONTEXT

Replace CLUSTER1_CONTEXT with the context for the first cluster (the config cluster), for example gke_my-project_europe-west3-c_gke-west.
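Provisioning the cross-regional internal load balancer can take several minutes. To optionally check progress, inspect the Gateway on the config cluster; when provisioning completes, you can expect the status to include the assigned addresses:

kubectl get gateway cross-region-gateway -n default --context=CLUSTER1_CONTEXT
kubectl get gateway cross-region-gateway -n default -o jsonpath='{.status.addresses}' --context=CLUSTER1_CONTEXT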
Enable custom metrics reporting
To enable custom metrics reporting and improve cross-regional load balancing, export KV cache usage metrics from all clusters. The load balancer uses this exported data as a custom load signal, which allows for more intelligent load balancing decisions based on each cluster's actual workload.
Create a file named metrics.yaml with the following content:

apiVersion: autoscaling.gke.io/v1beta1
kind: AutoscalingMetric
metadata:
  name: gpu-cache
  namespace: default
spec:
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  endpoints:
  - port: 8000
    path: /metrics
  metrics:
  - name: vllm:kv_cache_usage_perc # For vLLM versions v0.10.2 and newer
    exportName: kv-cache
  - name: vllm:gpu_cache_usage_perc # For vLLM versions v0.6.2 and newer
    exportName: kv-cache-old

Apply the metrics configuration to both clusters:

kubectl apply -f metrics.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f metrics.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
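To optionally confirm that the metric export configuration was accepted, list the AutoscalingMetric resources on each cluster. The lowercase resource name used here assumes the default naming for the AutoscalingMetric CRD:

kubectl get autoscalingmetrics --context=CLUSTER1_CONTEXT
kubectl get autoscalingmetrics --context=CLUSTER2_CONTEXT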
Configure the load balancing policy
To optimize how your AI/ML inference requests are distributed across your GKE clusters, configure a load balancing policy. Choosing the right balancing mode helps ensure efficient resource utilization, prevents overloading individual clusters, and improves the overall performance and responsiveness of your inference services.
Configure timeouts
If your requests are expected to have long durations, configure a longer timeout
for the load balancer. In the GCPBackendPolicy, set the timeoutSec field to
at least twice your estimated P99 request latency.
For example, the following manifest sets the load balancer timeout to 100 seconds.
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    timeoutSec: 100
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.kv-cache
      dryRun: false
For more information, see multi-cluster Gateway limitations.
Because the Custom metrics and In-flight requests load balancing modes are
mutually exclusive, configure only one of these modes in your
GCPBackendPolicy.
Choose a load balancing mode for your deployment.
Custom metrics
For optimal load balancing, start with a target utilization of 60%. To
achieve this target, set maxUtilization: 60 in your GCPBackendPolicy's
customMetrics configuration.
Create a file named backend-policy.yaml with the following content to enable load balancing based on the kv-cache custom metric:

apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.kv-cache
      dryRun: false
      maxUtilization: 60

Apply the new policy:

kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
In-flight requests
To use the in-flight balancing mode, estimate the number of in-flight requests each backend can handle and explicitly configure a capacity value.
Create a file named backend-policy.yaml with the following content to enable load balancing based on the number of in-flight requests:

kind: GCPBackendPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    balancingMode: IN_FLIGHT
    trafficDuration: LONG
    maxInFlightRequestsPerEndpoint: 1000
    dryRun: false

Apply the new policy:

kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
Verify the deployment
Because internal load balancers use private IP addresses, you must send requests from within your VPC network. To verify the internal load balancer, run a temporary Pod inside one of the clusters and send requests from it:
Start an interactive shell session in a temporary Pod:
kubectl run -it --rm --image=curlimages/curl curly --context=CLUSTER1_CONTEXT -- /bin/sh

Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.

From the new shell, get the Gateway IP address and send a test request:

GW_IP=$(kubectl get gateway/cross-region-gateway -n default -o jsonpath='{.status.addresses[0].value}')
curl -i -X POST ${GW_IP}:80/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "food-review-1",
    "prompt": "What is the best pizza in the world?",
    "max_tokens": 100,
    "temperature": 0
  }'

The following is an example of a successful response:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1704067200,
  "model": "food-review-1",
  "choices": [
    {
      "text": "The best pizza in the world is subjective, but many argue for Neapolitan pizza...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 100,
    "total_tokens": 110
  }
}
What's next
- Learn more about the GKE Gateway API.
- Learn more about GKE multi-cluster Inference Gateway.
- Learn more about Multi Cluster Ingress.