This document shows you how to configure elastic cross-region high availability for AI inference workloads by using Google Kubernetes Engine (GKE) Multi-Cluster Inference Gateway and the GKE Autoscaling feature. This setup lets you intelligently load balance workloads across multiple GKE clusters in different regions.
For more information about GKE Multi-Cluster Inference Gateway, see About GKE Multi-Cluster Inference Gateway.
Before you begin
Enable the Google Kubernetes Engine API.
Install and initialize the Google Cloud CLI if you plan to use it for this task.
Ensure your project has sufficient quota for H100 GPUs. For more information, see GPU quota and resource allocation.
Use GKE version 1.34.1-gke.1127000 or later.
Use gcloud CLI version 480.0.0 or later.
Make sure that the service account used by your nodes has the `roles/monitoring.metricWriter` and `roles/stackdriver.resourceMetadata.writer` IAM roles.
Make sure that you have the `roles/container.admin` and `roles/iam.serviceAccountAdmin` Identity and Access Management (IAM) roles on the project.
Complete the following Hugging Face prerequisites:
- Create a Hugging Face account.
- Request and get approval for access to the Llama 3.1 model on Hugging Face.
- Sign the license consent agreement on the model's page on Hugging Face.
- Generate a Hugging Face access token with at least Read permissions.
Create clusters and node pools
To create two GKE clusters in different regions and configure their node pools, follow these steps:
Create the first cluster:
```shell
gcloud container clusters create gke-west \
    --zone=CLUSTER_1_ZONE \
    --project=PROJECT_ID \
    --gateway-api=standard \
    --cluster-version=GKE_VERSION \
    --machine-type="MACHINE_TYPE" \
    --disk-type="DISK_TYPE" \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM \
    --hpa-profile=performance \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --async
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_1_ZONE`: the zone for the first cluster, for example `europe-west3-c`
- `GKE_VERSION`: the GKE version to use, for example `1.34.1-gke.1127000`
- `MACHINE_TYPE`: the machine type for the cluster nodes, for example `c2-standard-16`
- `DISK_TYPE`: the disk type for the cluster nodes, for example `pd-standard`
Create an H100 node pool in the first cluster:
```shell
gcloud container node-pools create h100 \
    --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
    --project=PROJECT_ID \
    --location=CLUSTER_1_ZONE \
    --node-locations=CLUSTER_1_ZONE \
    --cluster=CLUSTER_1_NAME \
    --machine-type=NODE_POOL_MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --spot \
    --min-nodes=MIN_NUM_NODES \
    --max-nodes=MAX_NUM_NODES \
    --enable-autoscaling \
    --async
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_1_ZONE`: the zone for the first cluster, for example `europe-west3-c`
- `CLUSTER_1_NAME`: the name of the first cluster, for example `gke-west`
- `NODE_POOL_MACHINE_TYPE`: the machine type for the node pool, for example `a3-highgpu-2g`
- `NUM_NODES`: the number of nodes in the node pool, for example `3`
- `MIN_NUM_NODES`: the minimum number of nodes for autoscaling in the node pool, for example `1`
- `MAX_NUM_NODES`: the maximum number of nodes for autoscaling in the node pool, for example `10`
Get credentials and create a Hugging Face token secret in the first cluster:
```shell
gcloud container clusters get-credentials CLUSTER_1_NAME \
    --location CLUSTER_1_ZONE \
    --project=PROJECT_ID
kubectl create secret generic hf-token \
    --from-literal=token=HF_TOKEN
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_1_NAME`: the name of the first cluster, for example `gke-west`
- `CLUSTER_1_ZONE`: the zone for the first cluster, for example `europe-west3-c`
- `HF_TOKEN`: your Hugging Face access token
Create the second cluster in a different region from the first cluster:
```shell
gcloud container clusters create gke-east \
    --zone=CLUSTER_2_ZONE \
    --project=PROJECT_ID \
    --gateway-api=standard \
    --cluster-version=GKE_VERSION \
    --machine-type="MACHINE_TYPE" \
    --disk-type="DISK_TYPE" \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM \
    --hpa-profile=performance \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --async
```

Replace `CLUSTER_2_ZONE` with the zone for the second cluster, for example `us-east4-a`. For definitions of other variables, refer to previous steps.

Create an H100 node pool for the second cluster:
```shell
gcloud container node-pools create h100 \
    --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
    --project=PROJECT_ID \
    --location=CLUSTER_2_ZONE \
    --node-locations=CLUSTER_2_ZONE \
    --cluster=CLUSTER_2_NAME \
    --machine-type=NODE_POOL_MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --spot \
    --min-nodes=MIN_NUM_NODES \
    --max-nodes=MAX_NUM_NODES \
    --enable-autoscaling \
    --async
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_2_ZONE`: the zone for the second cluster, for example `us-east4-a`
- `CLUSTER_2_NAME`: the name of the second cluster, for example `gke-east`
- `NODE_POOL_MACHINE_TYPE`: the machine type for the node pool, for example `a3-highgpu-2g`
- `NUM_NODES`: the number of nodes in the node pool, for example `3`
- `MIN_NUM_NODES`: the minimum number of nodes for autoscaling in the node pool, for example `1`
- `MAX_NUM_NODES`: the maximum number of nodes for autoscaling in the node pool, for example `10`
Get credentials and create a secret for the Hugging Face token on the second cluster:
```shell
gcloud container clusters get-credentials CLUSTER_2_NAME \
    --location CLUSTER_2_ZONE \
    --project=PROJECT_ID
kubectl create secret generic hf-token --from-literal=token=HF_TOKEN
```

Replace `HF_TOKEN` with your Hugging Face access token. For definitions of other variables, refer to previous steps.
Register clusters to a fleet
Register your clusters to your project's fleet:
```shell
gcloud container fleet memberships register CLUSTER_1_NAME \
    --gke-cluster CLUSTER_1_ZONE/CLUSTER_1_NAME \
    --location=global \
    --project=PROJECT_ID
gcloud container fleet memberships register CLUSTER_2_NAME \
    --gke-cluster CLUSTER_2_ZONE/CLUSTER_2_NAME \
    --location=global \
    --project=PROJECT_ID
```

Replace the following. For definitions of other variables, refer to previous steps:

- `CLUSTER_1_NAME`: the name of the first cluster, for example `gke-west`
- `CLUSTER_2_NAME`: the name of the second cluster, for example `gke-east`
Enable the Multi-Cluster Ingress feature and designate a configuration cluster:
```shell
gcloud container fleet ingress enable \
    --config-membership=projects/PROJECT_ID/locations/global/memberships/CLUSTER_1_NAME
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_1_NAME`: the name of the first cluster, for example `gke-west`
Create proxy-only subnets
Create a subnet in the first cluster's region:
```shell
gcloud compute networks subnets create CLUSTER_1_REGION-subnet \
    --purpose=GLOBAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=CLUSTER_1_REGION \
    --network=default \
    --range=SUBNET_RANGE_1 \
    --project=PROJECT_ID
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_1_REGION`: the region for the first cluster, for example `europe-west3`
- `SUBNET_RANGE_1`: the IP range for the proxy-only subnet in the first cluster's region, for example `10.0.0.0/23`
Create a subnet in the second cluster's region:
```shell
gcloud compute networks subnets create CLUSTER_2_REGION-subnet \
    --purpose=GLOBAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=CLUSTER_2_REGION \
    --network=default \
    --range=SUBNET_RANGE_2 \
    --project=PROJECT_ID
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_2_REGION`: the region for the second cluster, for example `us-east4`
- `SUBNET_RANGE_2`: the IP range for the proxy-only subnet in the second cluster's region, for example `10.5.0.0/23`
Install the required custom resources
Define context variables for your clusters:
```shell
CLUSTER1_CONTEXT="gke_PROJECT_ID_CLUSTER_1_ZONE_CLUSTER_1_NAME"
CLUSTER2_CONTEXT="gke_PROJECT_ID_CLUSTER_2_ZONE_CLUSTER_2_NAME"
```

Replace the following:

- `PROJECT_ID`: your project ID
- `CLUSTER_1_ZONE`: the zone for the first cluster, for example `europe-west3-c`
- `CLUSTER_1_NAME`: the name of the first cluster, for example `gke-west`
- `CLUSTER_2_ZONE`: the zone for the second cluster, for example `us-east4-a`
- `CLUSTER_2_NAME`: the name of the second cluster, for example `gke-east`
Install the `InferencePool` and `InferenceObjective` custom resources on both clusters:

```shell
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml \
    --context=$CLUSTER1_CONTEXT
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml \
    --context=$CLUSTER2_CONTEXT
```
Deploy resources to the target clusters
Deploy the model servers to both clusters:
```shell
kubectl apply -f \
    https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/release-1.0/config/manifests/vllm/gpu-deployment.yaml \
    --context=$CLUSTER1_CONTEXT
kubectl apply -f \
    https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/release-1.0/config/manifests/vllm/gpu-deployment.yaml \
    --context=$CLUSTER2_CONTEXT
```

Save the following manifest to a file named `inference-objective.yaml`:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: food-review
spec:
  priority: 10
  poolRef:
    name: llama3-8b-instruct
    group: "inference.networking.k8s.io"
```

Apply the manifest to both clusters:

```shell
kubectl apply -f inference-objective.yaml --context=$CLUSTER1_CONTEXT
kubectl apply -f inference-objective.yaml --context=$CLUSTER2_CONTEXT
```

Deploy the `InferencePool` resources to both clusters by using Helm:

```shell
helm install vllm-llama3-8b-instruct \
    --kube-context $CLUSTER1_CONTEXT \
    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
    --set provider.name=gke \
    --set inferenceExtension.monitoring.gke.enabled=true \
    --version v1.0.1 \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
helm install vllm-llama3-8b-instruct \
    --kube-context $CLUSTER2_CONTEXT \
    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
    --set provider.name=gke \
    --set inferenceExtension.monitoring.gke.enabled=true \
    --version v1.0.1 \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```

Mark the `InferencePool` resources as exported in both clusters:

```shell
kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
    --context=$CLUSTER1_CONTEXT
kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
    --context=$CLUSTER2_CONTEXT
```
Deploy the cross-region Inference Gateway
Save the following manifest as `mygateway.yaml`:

```yaml
---
kind: Gateway
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
  name: cross-region-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-cross-regional-internal-managed-mc
  addresses:
  - type: networking.gke.io/ephemeral-ipv4-address/europe-west3
    value: "europe-west3"
  - type: networking.gke.io/ephemeral-ipv4-address/us-east4
    value: "us-east4"
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      kinds:
      - kind: HTTPRoute
      namespaces:
        from: All
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: vllm-llama3-8b-instruct-default
spec:
  parentRefs:
  - name: cross-region-gateway
    kind: Gateway
  rules:
  - backendRefs:
    - group: networking.gke.io
      kind: GCPInferencePoolImport
      name: vllm-llama3-8b-instruct
---
kind: HealthCheckPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: health-check-policy
  namespace: default
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    config:
      type: HTTP
      httpHealthCheck:
        requestPath: /health
        port: 8000
```

Apply the manifest to the config cluster:

```shell
kubectl apply -f mygateway.yaml --context=$CLUSTER1_CONTEXT
```

The `CLUSTER1_CONTEXT` variable is defined in the Install the required custom resources section.
Enable custom metrics reporting
Create a file named `metrics.yaml` with the following content:

```yaml
apiVersion: autoscaling.gke.io/v1beta1
kind: AutoscalingMetric
metadata:
  name: gpu-cache
  namespace: default
spec:
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  endpoints:
  - port: 8000
    path: /metrics
  metrics:
  - name: vllm:kv_cache_usage_perc
    exportName: kv-cache
  - name: vllm:gpu_cache_usage_perc
    exportName: kv-cache-old
```

For each cluster, apply the metrics configuration:

```shell
kubectl apply -f metrics.yaml --context=$CLUSTER1_CONTEXT
kubectl apply -f metrics.yaml --context=$CLUSTER2_CONTEXT
```

The `CLUSTER1_CONTEXT` and `CLUSTER2_CONTEXT` variables are defined in the Install the required custom resources section.
Configure the load balancing policy
This section describes how to configure the load balancing policy. This policy defines how traffic is distributed across your inference pools based on custom metrics and regional preferences. The following configuration sets `us-east4` as the preferred region. Traffic spills over to other regions only when the `gke.named_metrics.kv-cache` custom metric in the `us-east4` region reaches 80% utilization.

- For vLLM versions v0.10.2 and later, use the `gke.named_metrics.kv-cache` metric.
- For earlier versions, use the `gke.named_metrics.kv-cache-old` metric.
Create a file named `backend-policy.yaml` with the following content:

```yaml
kind: GCPBackendPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    timeoutSec: 100
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.kv-cache
      maxUtilizationPercent: 80
      dryRun: false
    scopes:
    - selector:
        gke.io/region: "us-east4"
      backendPreference: PREFERRED
```

Apply the new policy:

```shell
kubectl apply -f backend-policy.yaml --context=$CLUSTER1_CONTEXT
```

The `CLUSTER1_CONTEXT` variable is defined in the Install the required custom resources section.
Configure autoscaling
To help ensure that each cluster can handle an increasing load before traffic spills over to other regions, configure the Horizontal Pod Autoscaler (HPA) for your model server deployments.
Key configuration principles
- Use the same custom metrics: configure the HPA to scale based on the same custom metrics that you use in the `GCPBackendPolicy` for the Multi-Cluster Inference Gateway (for example, `vllm:kv_cache_usage_perc`). This approach helps to ensure that both load balancing and scaling decisions are driven by the same signal from your inference servers. The metric that you choose must have a value between 0 and 1 to represent utilization. If the metric value is greater than 1, the load balancer interprets it as 100% utilization, which can cause unexpected routing behavior.
- Set a lower HPA target: set the target value for the metric in the HPA configuration lower than the `maxUtilizationPercent` setting that's defined in the `GCPBackendPolicy`. By setting the HPA's target utilization lower (for example, scaling at 50% average utilization), you let the cluster add replicas before the load balancer's threshold (for example, 80% utilization) is reached. This approach helps to maximize the capacity within the preferred region. A lower utilization target also helps to prevent premature traffic spillover, reserving cross-region failover for when the current region genuinely approaches its limits.
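To make this interaction concrete, the following sketch models average utilization as load grows. The numbers (HPA target of 0.5, spillover threshold of 0.8, one unit of capacity per replica) are illustrative assumptions, not values read from any live system:

```python
# Illustrative model: because the HPA target (0.5) is below the load
# balancer's spillover threshold (0.8), new replicas are added before
# the preferred region ever looks saturated to the load balancer.
HPA_TARGET = 0.5           # HPA scales out above 50% average utilization
SPILLOVER_THRESHOLD = 0.8  # GCPBackendPolicy maxUtilizationPercent: 80

def utilization(load_units, replicas, capacity_per_replica=1.0):
    """Average per-replica utilization for a given total load."""
    return load_units / (replicas * capacity_per_replica)

replicas = 2
for load in [0.8, 1.2, 1.6, 2.0]:
    u = utilization(load, replicas)
    if u > HPA_TARGET:
        replicas += 1  # HPA reacts before the spillover threshold is hit
        u = utilization(load, replicas)
    spills = u > SPILLOVER_THRESHOLD
    print(f"load={load:.1f} replicas={replicas} util={u:.2f} spillover={spills}")
```

In this toy run, utilization never reaches the 0.8 threshold because scaling always triggers first; with the HPA target set at or above the threshold, spillover would start before any replicas were added.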
Grant your user the ability to create the required authorization roles:

```shell
kubectl create clusterrolebinding cluster-admin-binding \
    --clusterrole cluster-admin \
    --user "$(gcloud config get-value account)" \
    --context=$CLUSTER1_CONTEXT
kubectl create clusterrolebinding cluster-admin-binding \
    --clusterrole cluster-admin \
    --user "$(gcloud config get-value account)" \
    --context=$CLUSTER2_CONTEXT
```

The `CLUSTER1_CONTEXT` and `CLUSTER2_CONTEXT` variables are defined in the Install the required custom resources section.

For each cluster, apply the Custom Metrics Stackdriver Adapter manifest:

```shell
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml \
    --context=$CLUSTER1_CONTEXT
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml \
    --context=$CLUSTER2_CONTEXT
```

Allow the `custom-metrics-stackdriver-adapter` service account to read Cloud Monitoring metrics:

```shell
PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
gcloud projects add-iam-policy-binding projects/PROJECT_ID \
    --role roles/monitoring.viewer \
    --member=principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter
```

Replace `PROJECT_ID` with your project ID.

Save the following manifest to a file named `pod-monitoring.yaml`:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: inference-server-podmon
spec:
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  endpoints:
  - port: 8000
    path: /metrics
    interval: 5s
```

Apply the manifest to both clusters:

```shell
kubectl apply -f pod-monitoring.yaml --context=$CLUSTER1_CONTEXT
kubectl apply -f pod-monitoring.yaml --context=$CLUSTER2_CONTEXT
```

Save the following manifest to a file named `hpa.yaml`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama3-8b-instruct
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|vllm:gpu_cache_usage_perc|gauge
      target:
        type: AverageValue
        averageValue: "0.5"
```

Apply the manifest to both clusters:

```shell
kubectl apply -f hpa.yaml --context=$CLUSTER1_CONTEXT
kubectl apply -f hpa.yaml --context=$CLUSTER2_CONTEXT
```
Verify the deployment
Get the Gateway IP address:
```shell
export GW_IP=$(kubectl get gateway/cross-region-gateway -n default \
    --context=$CLUSTER1_CONTEXT \
    -o jsonpath='{.status.addresses[0].value}')
echo ${GW_IP}
```

The `CLUSTER1_CONTEXT` variable is defined in the Install the required custom resources section.

Start an interactive `sh` session in a temporary Pod:

```shell
kubectl run -it --rm --image=curlimages/curl curly \
    --context=$CLUSTER1_CONTEXT -- /bin/sh
```

From inside the `curly` Pod, send a test request:

```shell
curl -i -X POST GW_IP:80/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{
    "model": "food-review-1",
    "prompt": "What is the best pizza in the world?",
    "max_tokens": 100,
    "temperature": 0
}'
```

Replace `GW_IP` with the gateway IP address from the previous step.
Load test the gateway
Apply a sustained load to the Gateway IP address using a load generator within the same VPC network.
Start with a moderate load that you expect the preferred region (`us-east4`) to handle. Then, gradually increase the request rate or concurrency of your load test.
While the load test runs, monitor the system in the Google Cloud console or by using `kubectl`:

- Pod scaling (HPA): check the number of Pods in the `vllm-llama3-8b-instruct` Deployment in both clusters.
- Node scaling (cluster autoscaler): monitor the number of nodes in the `h100` node pools in both clusters.
- Custom metrics: watch the `vllm:kv_cache_usage_perc` metric (for vLLM versions v0.10.2 and later) or the `vllm:gpu_cache_usage_perc` metric (for vLLM versions earlier than v0.10.2) in Monitoring for the model server deployments in both clusters.
- Load balancer metrics: examine the metrics for the load balancer that is associated with the `cross-region-gateway` in Monitoring.
As you increase the load, utilization in `us-east4` increases. When the HPA in the `gke-east` cluster scales out and the average utilization approaches the `maxUtilizationPercent` value (80%) that is defined in the `GCPBackendPolicy`, the load balancer begins to route requests to the `gke-west` cluster in `europe-west3`.
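Any HTTP load generator works for this test. As one illustration, the following Python sketch (a hypothetical helper, not an official tool; run it from a VM in the same VPC network) ramps up the number of concurrent workers over time. The request body mirrors the test request from the verification section; the URL and ramp parameters are assumptions you should tune:

```python
import json
import threading
import time
import urllib.request

def send_request(url, model="food-review-1",
                 prompt="What is the best pizza in the world?"):
    """Send one completion request; return the HTTP status, or None on error."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": 100, "temperature": 0}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except Exception:
        return None

def ramp_load(url, start_workers=2, max_workers=16, step=2, step_seconds=30):
    """Gradually raise concurrency: add `step` request loops every interval."""
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            send_request(url)

    threads = []
    workers = start_workers
    while workers <= max_workers:
        while len(threads) < workers:  # top up to the current concurrency
            t = threading.Thread(target=worker, daemon=True)
            t.start()
            threads.append(t)
        time.sleep(step_seconds)
        workers += step
    stop.set()

# Example (replace GW_IP with the gateway IP address):
# ramp_load("http://GW_IP:80/v1/completions")
```

While it runs, watch the HPA replica counts and the custom metrics described above to observe when spillover begins.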
What's next
- Learn more about the GKE Gateway API.
- Learn more about the GKE Multi-Cluster Inference Gateway.
- Learn more about Multi-Cluster Ingress.