This document describes how to set up the Google Kubernetes Engine (GKE)
multi-cluster Inference Gateway to intelligently load-balance
your AI/ML inference workloads across multiple GKE clusters,
which can span different regions. This setup uses Gateway API, Multi Cluster Ingress, and
custom resources like InferencePool and InferenceObjective to improve
scalability, help ensure high availability, and optimize resource utilization for
your model-serving deployments.
To understand this document, be familiar with the following:
- AI/ML orchestration on GKE.
- Generative AI terminology.
- GKE networking concepts, including:
- Load balancing in Google Cloud, especially how load balancers interact with GKE.
This document is for the following personas:
- Machine learning (ML) engineers, Platform admins and operators, or Data and AI specialists who want to use GKE's container orchestration capabilities for serving AI/ML workloads.
- Cloud architects or Networking specialists who interact with GKE networking.
To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
Enable the Compute Engine API, the Google Kubernetes Engine API, the Model Armor API, and the Network Services API.
Go to Enable access to APIs and follow the instructions.
Enable the Autoscaling API.
Go to Autoscaling API and follow the instructions.
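If you prefer to use the gcloud CLI, you can enable these APIs in one step. The following command is a sketch: the first four service names are the standard ones for the Compute Engine, Google Kubernetes Engine, Model Armor, and Network Services APIs, and the autoscaling.googleapis.com entry is an assumption for the Autoscaling API; confirm the service names in your project before you run it.

gcloud services enable \
    compute.googleapis.com \
    container.googleapis.com \
    modelarmor.googleapis.com \
    networkservices.googleapis.com \
    autoscaling.googleapis.com \
    --project=PROJECT_ID

Replace PROJECT_ID with your project ID.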
Hugging Face prerequisites:
- Create a Hugging Face account if you don't already have one.
- Request and get approval for access to the Llama 3.1 model on Hugging Face.
- Sign the license consent agreement on the model's page on Hugging Face.
- Generate a Hugging Face access token with at least Read permissions.
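To optionally confirm that the token works before you store it in a cluster, you can call the Hugging Face Hub whoami endpoint. This check uses the public Hugging Face API and is independent of GKE:

# Returns your account details if the token is valid; an error indicates a bad or insufficiently scoped token.
curl -s -H "Authorization: Bearer HF_TOKEN" https://huggingface.co/api/whoami-v2

Replace HF_TOKEN with your Hugging Face access token.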
Requirements
- Ensure your project has sufficient quota for H100 GPUs. For more information, see Plan GPU quota and Allocation quotas.
- Use GKE version 1.34.1-gke.1127000 or later.
- Use gcloud CLI version 480.0.0 or later.
- Your node service accounts must have permissions to write metrics to the Autoscaling API.
- You must have the following IAM roles on the project: roles/container.admin and roles/iam.serviceAccountAdmin.
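If you need to grant these roles, a project administrator can do so with the gcloud CLI. The user:USER_EMAIL member format shown here is an example; adjust it if you use a service account or group.

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/container.admin"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/iam.serviceAccountAdmin"

Replace PROJECT_ID with your project ID and USER_EMAIL with the email address of the account that runs the commands in this document.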
Set up multi-cluster Inference Gateway
To set up the GKE multi-cluster Inference Gateway, follow these steps:
Create clusters and node pools
To host your AI/ML inference workloads and enable cross-regional load balancing, create two GKE clusters in different regions, each with an H100 GPU node pool.
Create the first cluster:
gcloud container clusters create CLUSTER_1_NAME \
    --region LOCATION \
    --project=PROJECT_ID \
    --gateway-api=standard \
    --release-channel "rapid" \
    --cluster-version=GKE_VERSION \
    --machine-type="MACHINE_TYPE" \
    --disk-type="DISK_TYPE" \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM \
    --hpa-profile=performance \
    --async # Allows the command to return immediately

Replace the following:

- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- LOCATION: the region for the first cluster, for example europe-west3.
- PROJECT_ID: your project ID.
- GKE_VERSION: the GKE version to use, for example 1.34.1-gke.1127000.
- MACHINE_TYPE: the machine type for the cluster nodes, for example c2-standard-16.
- DISK_TYPE: the disk type for the cluster nodes, for example pd-standard.
Create an H100 node pool for the first cluster:
gcloud container node-pools create NODE_POOL_NAME \
    --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
    --project=PROJECT_ID \
    --location=CLUSTER_1_ZONE \
    --node-locations=CLUSTER_1_ZONE \
    --cluster=CLUSTER_1_NAME \
    --machine-type=NODE_POOL_MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --spot \
    --async # Allows the command to return immediately

Replace the following:

- NODE_POOL_NAME: the name of the node pool, for example h100.
- PROJECT_ID: your project ID.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- NODE_POOL_MACHINE_TYPE: the machine type for the node pool, for example a3-highgpu-2g.
- NUM_NODES: the number of nodes in the node pool, for example 3.
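Because the create commands use the --async flag, they return before provisioning finishes. You can optionally confirm that the cluster and node pool are ready before you continue; both commands report RUNNING when provisioning is complete. The flags mirror the locations used in the create commands above.

gcloud container clusters describe CLUSTER_1_NAME \
    --region LOCATION \
    --project=PROJECT_ID \
    --format="value(status)"

gcloud container node-pools describe NODE_POOL_NAME \
    --cluster=CLUSTER_1_NAME \
    --location=CLUSTER_1_ZONE \
    --project=PROJECT_ID \
    --format="value(status)"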
Get the credentials:
gcloud container clusters get-credentials CLUSTER_1_NAME \
    --location CLUSTER_1_ZONE \
    --project=PROJECT_ID

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
On the first cluster, create a secret for the Hugging Face token:
kubectl create secret generic hf-token \
    --from-literal=token=HF_TOKEN

Replace the following:

- HF_TOKEN: your Hugging Face access token.

Create the second cluster in a different region from the first cluster:
gcloud container clusters create CLUSTER_2_NAME \
    --region LOCATION \
    --project=PROJECT_ID \
    --gateway-api=standard \
    --release-channel "rapid" \
    --cluster-version=GKE_VERSION \
    --machine-type="MACHINE_TYPE" \
    --disk-type="DISK_TYPE" \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM \
    --hpa-profile=performance \
    --async # Allows the command to return immediately while the cluster is created in the background.

Replace the following:

- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- LOCATION: the region for the second cluster. This must be a different region than the first cluster. For example, us-east4.
- PROJECT_ID: your project ID.
- GKE_VERSION: the GKE version to use, for example 1.34.1-gke.1127000.
- MACHINE_TYPE: the machine type for the cluster nodes, for example c2-standard-16.
- DISK_TYPE: the disk type for the cluster nodes, for example pd-standard.
Create an H100 node pool for the second cluster:
gcloud container node-pools create NODE_POOL_NAME \
    --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
    --project=PROJECT_ID \
    --location=CLUSTER_2_ZONE \
    --node-locations=CLUSTER_2_ZONE \
    --cluster=CLUSTER_2_NAME \
    --machine-type=NODE_POOL_MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --spot \
    --async # Allows the command to return immediately

Replace the following:

- NODE_POOL_NAME: the name of the node pool, for example h100.
- PROJECT_ID: your project ID.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- NODE_POOL_MACHINE_TYPE: the machine type for the node pool, for example a3-highgpu-2g.
- NUM_NODES: the number of nodes in the node pool, for example 3.
For the second cluster, get credentials and create a secret for the Hugging Face token:
gcloud container clusters get-credentials CLUSTER_2_NAME \
    --location CLUSTER_2_ZONE \
    --project=PROJECT_ID

kubectl create secret generic hf-token --from-literal=token=HF_TOKEN

Replace the following:

- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
- PROJECT_ID: your project ID.
- HF_TOKEN: your Hugging Face access token.
Register clusters to a fleet
To enable multi-cluster capabilities, such as the GKE Multi-cluster Inference Gateway, register your clusters to a fleet.
Register both clusters to your project's fleet:
gcloud container fleet memberships register CLUSTER_1_NAME \
    --gke-cluster CLUSTER_1_ZONE/CLUSTER_1_NAME \
    --location=global \
    --project=PROJECT_ID

gcloud container fleet memberships register CLUSTER_2_NAME \
    --gke-cluster CLUSTER_2_ZONE/CLUSTER_2_NAME \
    --location=global \
    --project=PROJECT_ID

Replace the following:

- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
- PROJECT_ID: your project ID.
- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
To allow a single Gateway to manage traffic across multiple clusters, enable the multi-cluster Ingress feature and designate a config cluster:
gcloud container fleet ingress enable \
    --config-membership=projects/PROJECT_ID/locations/global/memberships/CLUSTER_1_NAME

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
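To optionally confirm the fleet setup, list the memberships and check that the multi-cluster Ingress feature reports the expected config cluster:

gcloud container fleet memberships list --project=PROJECT_ID
gcloud container fleet ingress describe --project=PROJECT_ID

Replace PROJECT_ID with your project ID.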
Create proxy-only subnets
For an internal gateway, create a proxy-only subnet in each region. The internal Gateway's Envoy proxies use these dedicated subnets to handle traffic within your VPC network.
Create a subnet in the first cluster's region:
gcloud compute networks subnets create CLUSTER_1_REGION-subnet \
    --purpose=GLOBAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=CLUSTER_1_REGION \
    --network=default \
    --range=10.0.0.0/23 \
    --project=PROJECT_ID

Create a subnet in the second cluster's region:

gcloud compute networks subnets create CLUSTER_2_REGION-subnet \
    --purpose=GLOBAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=CLUSTER_2_REGION \
    --network=default \
    --range=10.5.0.0/23 \
    --project=PROJECT_ID

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_REGION: the region for the first cluster, for example europe-west3.
- CLUSTER_2_REGION: the region for the second cluster, for example us-east4.
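You can optionally verify that both proxy-only subnets exist with the expected purpose. The --filter and --format values shown here are one way to narrow the output:

gcloud compute networks subnets list \
    --project=PROJECT_ID \
    --filter="purpose=GLOBAL_MANAGED_PROXY" \
    --format="table(name,region,ipCidrRange)"

Replace PROJECT_ID with your project ID.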
Install the required CRDs
The GKE Multi-cluster Inference Gateway uses
custom resources such as InferencePool and InferenceObjective. The
GKE Gateway API controller manages the InferencePool Custom Resource
Definition (CRD). However, you must manually install the InferenceObjective
CRD, which is in alpha, on your clusters.
Define context variables for your clusters:
CLUSTER1_CONTEXT="gke_PROJECT_ID_CLUSTER_1_ZONE_CLUSTER_1_NAME"
CLUSTER2_CONTEXT="gke_PROJECT_ID_CLUSTER_2_ZONE_CLUSTER_2_NAME"

Replace the following:

- PROJECT_ID: your project ID.
- CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
- CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
- CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
- CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
Install the InferenceObjective CRD on both clusters:

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
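To optionally confirm that the CRD is installed, check for it by name on each cluster:

kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER1_CONTEXT
kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER2_CONTEXT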
Deploy resources to the target clusters
To make your AI/ML inference workloads available on each cluster, deploy the
required resources, such as the model servers and InferenceObjective
custom resources.
Deploy the model servers to both clusters:
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
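The model servers can take several minutes to pull images and load model weights. You can optionally watch the Pods on each cluster; the app=vllm-llama3-8b-instruct label matches the selector that the InferencePool and metrics configuration use later in this document:

kubectl get pods -l app=vllm-llama3-8b-instruct --context=CLUSTER1_CONTEXT
kubectl get pods -l app=vllm-llama3-8b-instruct --context=CLUSTER2_CONTEXT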
Deploy the InferenceObjective resources to both clusters. Save the following sample manifest to a file named inference-objective.yaml:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: food-review
spec:
  priority: 10
  poolRef:
    name: vllm-llama3-8b-instruct
    group: "inference.networking.k8s.io"

Apply the manifest to both clusters:

kubectl apply -f inference-objective.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f inference-objective.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
Deploy the InferencePool resources to both clusters by using Helm:

helm install vllm-llama3-8b-instruct \
    --kube-context CLUSTER1_CONTEXT \
    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
    --set provider.name=gke \
    --version v1.1.0 \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

helm install vllm-llama3-8b-instruct \
    --kube-context CLUSTER2_CONTEXT \
    --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
    --set provider.name=gke \
    --set inferenceExtension.monitoring.gke.enabled=true \
    --version v1.1.0 \
    oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
Mark the InferencePool resources as exported on both clusters. This annotation makes the InferencePool available for import by the config cluster, which is a required step for multi-cluster routing.

kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
    --context=CLUSTER1_CONTEXT

kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
    --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
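To optionally confirm that the export annotation was applied, inspect the InferencePool annotations on each cluster:

kubectl get inferencepool vllm-llama3-8b-instruct -o jsonpath='{.metadata.annotations}' --context=CLUSTER1_CONTEXT
kubectl get inferencepool vllm-llama3-8b-instruct -o jsonpath='{.metadata.annotations}' --context=CLUSTER2_CONTEXT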
Deploy resources to the config cluster
To define how traffic is routed and load-balanced across the InferencePool
resources in all registered clusters, deploy the Gateway,
HTTPRoute, and HealthCheckPolicy resources. You deploy these resources
only to the designated config cluster, which is gke-west in this document.
Create a file named mcig.yaml with the following content:

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cross-region-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-cross-regional-internal-managed-mc
  addresses:
  - type: networking.gke.io/ephemeral-ipv4-address/europe-west3
    value: "europe-west3"
  - type: networking.gke.io/ephemeral-ipv4-address/us-east4
    value: "us-east4"
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vllm-llama3-8b-instruct-default
spec:
  parentRefs:
  - name: cross-region-gateway
    kind: Gateway
  rules:
  - backendRefs:
    - group: networking.gke.io
      kind: GCPInferencePoolImport
      name: vllm-llama3-8b-instruct
---
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: health-check-policy
  namespace: default
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    config:
      type: HTTP
      httpHealthCheck:
        requestPath: /health
        port: 8000

Apply the manifest:

kubectl apply -f mcig.yaml --context=CLUSTER1_CONTEXT

Replace CLUSTER1_CONTEXT with the context for the first cluster (the config cluster), for example gke_my-project_europe-west3-c_gke-west.
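Provisioning the cross-regional internal load balancer can take several minutes. To optionally check progress, inspect the Gateway on the config cluster; when provisioning completes, you can expect the status to include the assigned addresses:

kubectl get gateway cross-region-gateway -n default --context=CLUSTER1_CONTEXT
kubectl get gateway cross-region-gateway -n default -o jsonpath='{.status.addresses}' --context=CLUSTER1_CONTEXT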
Enable custom metrics reporting
To enable custom metrics reporting and improve cross-regional load balancing, export KV cache usage metrics from all clusters. The load balancer uses this exported data as a custom load signal, which allows for more intelligent load balancing decisions based on each cluster's actual workload.
Create a file named metrics.yaml with the following content:

apiVersion: autoscaling.gke.io/v1beta1
kind: AutoscalingMetric
metadata:
  name: gpu-cache
  namespace: default
spec:
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  endpoints:
  - port: 8000
    path: /metrics
  metrics:
  - name: vllm:kv_cache_usage_perc # For vLLM versions v0.10.2 and newer
    exportName: kv-cache
  - name: vllm:gpu_cache_usage_perc # For vLLM versions v0.6.2 and newer
    exportName: kv-cache-old

Apply the metrics configuration to both clusters:

kubectl apply -f metrics.yaml --context=CLUSTER1_CONTEXT
kubectl apply -f metrics.yaml --context=CLUSTER2_CONTEXT

Replace the following:

- CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
- CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
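To optionally confirm that the metric export configuration was accepted, list the AutoscalingMetric resources on each cluster. The lowercase resource name used here assumes the default naming for the AutoscalingMetric CRD:

kubectl get autoscalingmetrics --context=CLUSTER1_CONTEXT
kubectl get autoscalingmetrics --context=CLUSTER2_CONTEXT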
Configure the load balancing policy
To optimize how your AI/ML inference requests are distributed across your GKE clusters, configure a load balancing policy. Choosing the right balancing mode helps ensure efficient resource utilization, prevents overloading individual clusters, and improves the overall performance and responsiveness of your inference services.
Configure timeouts
If your requests are expected to have long durations, configure a longer timeout
for the load balancer. In the GCPBackendPolicy, set the timeoutSec field to
at least twice your estimated P99 request latency.
For example, the following manifest sets the load balancer timeout to 100 seconds.
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    timeoutSec: 100
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.kv-cache
      dryRun: false
For more information, see multi-cluster Gateway limitations.
Because the Custom metrics and In-flight requests load balancing modes are
mutually exclusive, configure only one of these modes in your
GCPBackendPolicy.
Choose a load balancing mode for your deployment.
Custom metrics
For optimal load balancing, start with a target utilization of 60%. To
achieve this target, set maxUtilization: 60 in your GCPBackendPolicy's
customMetrics configuration.
Create a file named backend-policy.yaml with the following content to enable load balancing based on the kv-cache custom metric:

apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.kv-cache
      dryRun: false
      maxUtilization: 60

Apply the new policy:

kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
In-flight requests
To use the in-flight balancing mode, estimate the number of in-flight requests each backend can handle and explicitly configure a capacity value.
Create a file named backend-policy.yaml with the following content to enable load balancing based on the number of in-flight requests:

kind: GCPBackendPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    balancingMode: IN_FLIGHT
    trafficDuration: LONG
    maxInFlightRequestsPerEndpoint: 1000
    dryRun: false

Apply the new policy:

kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT

Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
Verify the deployment
Because internal load balancers use private IP addresses, you must send requests from within your VPC network. To verify the internal load balancer, run a temporary Pod inside one of the clusters and send requests from it:
Start an interactive shell session in a temporary Pod:
kubectl run -it --rm --image=curlimages/curl curly --context=CLUSTER1_CONTEXT -- /bin/sh

Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.

From the new shell, get the Gateway IP address and send a test request:

GW_IP=$(kubectl get gateway/cross-region-gateway -n default -o jsonpath='{.status.addresses[0].value}')
curl -i -X POST ${GW_IP}:80/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "food-review-1",
    "prompt": "What is the best pizza in the world?",
    "max_tokens": 100,
    "temperature": 0
  }'

The following is an example of a successful response:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1704067200,
  "model": "food-review-1",
  "choices": [
    {
      "text": "The best pizza in the world is subjective, but many argue for Neapolitan pizza...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 100,
    "total_tokens": 110
  }
}
What's next
- Learn more about the GKE Gateway API.
- Learn more about GKE multi-cluster Inference Gateway.
- Learn more about Multi Cluster Ingress.