Set up the GKE multi-cluster Inference Gateway

This document describes how to set up the Google Kubernetes Engine (GKE) multi-cluster Inference Gateway to intelligently load-balance your AI/ML inference workloads across multiple GKE clusters, which can span different regions. This setup uses Gateway API, Multi Cluster Ingress, and custom resources like InferencePool and InferenceObjective to improve scalability, help ensure high availability, and optimize resource utilization for your model-serving deployments.

To understand this document, be familiar with the following:

This document is for the following personas:

  • Machine learning (ML) engineers, Platform admins and operators, or Data and AI specialists who want to use GKE's container orchestration capabilities for serving AI/ML workloads.
  • Cloud architects or Networking specialists who interact with GKE networking.

To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Enable the Compute Engine API, the Google Kubernetes Engine API, the Model Armor API, and the Network Services API.

    Go to Enable access to APIs and follow the instructions, or enable the APIs with the gcloud CLI as shown in the sketch after this list.

  • Enable the Autoscaling API.

    Go to Autoscaling API and follow the instructions.

  • Hugging Face prerequisites:

    • Create a Hugging Face account if you don't already have one.
    • Request and get approval for access to the Llama 3.1 model on Hugging Face.
    • Sign the license consent agreement on the model's page on Hugging Face.
    • Generate a Hugging Face access token with at least Read permissions.
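
As an alternative to the console pages, the following sketch enables the same APIs with the gcloud CLI. The service names, particularly modelarmor.googleapis.com and autoscaling.googleapis.com, are assumptions; verify them in the API Library if a command is rejected. Replace PROJECT_ID with your project ID.

# A minimal sketch; confirm each service name in the API Library before relying on it.
gcloud services enable \
    compute.googleapis.com \
    container.googleapis.com \
    networkservices.googleapis.com \
    modelarmor.googleapis.com \
    autoscaling.googleapis.com \
    --project=PROJECT_ID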

Requirements

  • Ensure your project has sufficient quota for H100 GPUs. For more information, see Plan GPU quota and Allocation quotas.
  • Use GKE version 1.34.1-gke.1127000 or later.
  • Use gcloud CLI version 480.0.0 or later.
  • Your node service accounts must have permissions to write metrics to the Autoscaling API.
  • You must have the following IAM roles on the project: roles/container.admin and roles/iam.serviceAccountAdmin.

Set up multi-cluster Inference Gateway

To set up the GKE multi-cluster Inference Gateway, follow these steps:

Create clusters and node pools

To host your AI/ML inference workloads and enable cross-regional load balancing, create two GKE clusters in different regions, each with an H100 GPU node pool. The commands in this section use the --async flag, so they return before provisioning completes; verify that the clusters and node pools are ready before you continue, as shown in the check after these steps.

  1. Create the first cluster:

    gcloud container clusters create CLUSTER_1_NAME \
        --region LOCATION \
        --project=PROJECT_ID \
        --gateway-api=standard \
        --release-channel "rapid" \
        --cluster-version=GKE_VERSION \
        --machine-type="MACHINE_TYPE" \
        --disk-type="DISK_TYPE" \
        --enable-managed-prometheus --monitoring=SYSTEM,DCGM \
        --hpa-profile=performance \
        --async # Allows the command to return immediately
    

    Replace the following:

    • CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
    • LOCATION: the region for the first cluster, for example europe-west3.
    • PROJECT_ID: your project ID.
    • GKE_VERSION: the GKE version to use, for example 1.34.1-gke.1127000.
    • MACHINE_TYPE: the machine type for the cluster nodes, for example c2-standard-16.
    • DISK_TYPE: the disk type for the cluster nodes, for example pd-standard.
  2. Create an H100 node pool for the first cluster:

    gcloud container node-pools create NODE_POOL_NAME \
        --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
        --project=PROJECT_ID \
        --location=CLUSTER_1_ZONE \
        --node-locations=CLUSTER_1_ZONE \
        --cluster=CLUSTER_1_NAME \
        --machine-type=NODE_POOL_MACHINE_TYPE \
        --num-nodes=NUM_NODES \
        --spot \
        --async # Allows the command to return immediately
    

    Replace the following:

    • NODE_POOL_NAME: the name of the node pool, for example h100.
    • PROJECT_ID: your project ID.
    • CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
    • CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
    • NODE_POOL_MACHINE_TYPE: the machine type for the node pool, for example a3-highgpu-2g.
    • NUM_NODES: the number of nodes in the node pool, for example 3.
  3. Get the credentials:

    gcloud container clusters get-credentials CLUSTER_1_NAME \
        --location CLUSTER_1_ZONE \
        --project=PROJECT_ID
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
    • CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
  4. On the first cluster, create a secret for the Hugging Face token:

    kubectl create secret generic hf-token \
        --from-literal=token=HF_TOKEN
    

    Replace HF_TOKEN with your Hugging Face access token.

  5. Create the second cluster in a different region from the first cluster:

    gcloud container clusters create CLUSTER_2_NAME \
        --region LOCATION \
        --project=PROJECT_ID \
        --gateway-api=standard \
        --release-channel "rapid" \
        --cluster-version=GKE_VERSION \
        --machine-type="MACHINE_TYPE" \
        --disk-type="DISK_TYPE" \
        --enable-managed-prometheus \
        --monitoring=SYSTEM,DCGM \
        --hpa-profile=performance \
        --async # Allows the command to return immediately
    

    Replace the following:

    • CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
    • LOCATION: the region for the second cluster. This must be a different region from the first cluster's region, for example us-east4.
    • PROJECT_ID: your project ID.
    • GKE_VERSION: the GKE version to use, for example 1.34.1-gke.1127000.
    • MACHINE_TYPE: the machine type for the cluster nodes, for example c2-standard-16.
    • DISK_TYPE: the disk type for the cluster nodes, for example pd-standard.
  6. Create an H100 node pool for the second cluster:

    gcloud container node-pools create h100 \
        --accelerator "type=nvidia-h100-80gb,count=2,gpu-driver-version=latest" \
        --project=PROJECT_ID \
        --location=CLUSTER_2_ZONE \
        --node-locations=CLUSTER_2_ZONE \
        --cluster=CLUSTER_2_NAME \
        --machine-type=NODE_POOL_MACHINE_TYPE \
        --num-nodes=NUM_NODES \
        --spot \
        --async # Allows the command to return immediately
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
    • CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
    • NODE_POOL_MACHINE_TYPE: the machine type for the node pool, for example a3-highgpu-2g.
    • NUM_NODES: the number of nodes in the node pool, for example 3.
  7. For the second cluster, get credentials and create a secret for the Hugging Face token:

    gcloud container clusters get-credentials CLUSTER_2_NAME \
        --location CLUSTER_2_ZONE \
        --project=PROJECT_ID
    
    kubectl create secret generic hf-token --from-literal=token=HF_TOKEN
    

    Replace the following:

    • CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
    • CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
    • PROJECT_ID: your project ID.
    • HF_TOKEN: your Hugging Face access token.
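
Because the cluster and node pool creation commands in the preceding steps use the --async flag, they return before provisioning finishes. The following optional check, which uses the same placeholder values as the steps above, confirms that both clusters are RUNNING and that the H100 node pools exist before you continue.

# Confirm both clusters report the RUNNING status.
gcloud container clusters list --project=PROJECT_ID

# Confirm the H100 node pools were created in each cluster.
gcloud container node-pools list --cluster=CLUSTER_1_NAME \
    --location=CLUSTER_1_ZONE --project=PROJECT_ID
gcloud container node-pools list --cluster=CLUSTER_2_NAME \
    --location=CLUSTER_2_ZONE --project=PROJECT_ID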

Register clusters to a fleet

To enable multi-cluster capabilities, such as the GKE multi-cluster Inference Gateway, register your clusters to a fleet.

  1. Register both clusters to your project's fleet:

    gcloud container fleet memberships register CLUSTER_1_NAME \
        --gke-cluster CLUSTER_1_ZONE/CLUSTER_1_NAME \
        --location=global \
        --project=PROJECT_ID
    
    gcloud container fleet memberships register CLUSTER_2_NAME \
        --gke-cluster CLUSTER_2_ZONE/CLUSTER_2_NAME \
        --location=global \
        --project=PROJECT_ID
    

    Replace the following:

    • CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
    • CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
    • PROJECT_ID: your project ID.
    • CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
    • CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
  2. To allow a single Gateway to manage traffic across multiple clusters, enable the multi-cluster Ingress feature and designate a config cluster:

    gcloud container fleet ingress enable \
        --config-membership=projects/PROJECT_ID/locations/global/memberships/CLUSTER_1_NAME
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
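
Optionally, confirm that both memberships are registered and that the multi-cluster Ingress feature points at the intended config cluster:

# List the fleet memberships for the project.
gcloud container fleet memberships list --project=PROJECT_ID

# Show the multi-cluster Ingress feature state, including the config membership.
gcloud container fleet ingress describe --project=PROJECT_ID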

Create proxy-only subnets

For an internal gateway, create a proxy-only subnet in each region. The internal Gateway's Envoy proxies use these dedicated subnets to handle traffic within your VPC network.

  1. Create a subnet in the first cluster's region:

    gcloud compute networks subnets create CLUSTER_1_REGION-subnet \
        --purpose=GLOBAL_MANAGED_PROXY \
        --role=ACTIVE \
        --region=CLUSTER_1_REGION \
        --network=default \
        --range=10.0.0.0/23 \
        --project=PROJECT_ID
    
  2. Create a subnet in the second cluster's region:

    gcloud compute networks subnets create CLUSTER_2_REGION-subnet \
        --purpose=GLOBAL_MANAGED_PROXY \
        --role=ACTIVE \
        --region=CLUSTER_2_REGION \
        --network=default \
        --range=10.5.0.0/23 \
        --project=PROJECT_ID
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • CLUSTER_1_REGION: the region for the first cluster, for example europe-west3.
    • CLUSTER_2_REGION: the region for the second cluster, for example us-east4.
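
To confirm that the proxy-only subnets were created with the expected purpose, you can list them:

# List subnets reserved for Envoy-based load balancer proxies.
gcloud compute networks subnets list \
    --filter="purpose=GLOBAL_MANAGED_PROXY" \
    --project=PROJECT_ID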

Install the required CRDs

The GKE multi-cluster Inference Gateway uses custom resources such as InferencePool and InferenceObjective. The GKE Gateway API controller manages the InferencePool Custom Resource Definition (CRD). However, you must manually install the InferenceObjective CRD, which is in alpha, on your clusters.

  1. Define context variables for your clusters:

    CLUSTER1_CONTEXT="gke_PROJECT_ID_CLUSTER_1_ZONE_CLUSTER_1_NAME"
    CLUSTER2_CONTEXT="gke_PROJECT_ID_CLUSTER_2_ZONE_CLUSTER_2_NAME"
    

    Replace the following:

    • PROJECT_ID: your project ID.
    • CLUSTER_1_ZONE: the zone for the first cluster, for example europe-west3-c.
    • CLUSTER_1_NAME: the name of the first cluster, for example gke-west.
    • CLUSTER_2_ZONE: the zone for the second cluster, for example us-east4-a.
    • CLUSTER_2_NAME: the name of the second cluster, for example gke-east.
  2. Install the InferenceObjective CRD on both clusters:

    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER1_CONTEXT
    
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml --context=CLUSTER2_CONTEXT
    

    Replace the following:

    • CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
    • CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
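
Optionally, verify that the InferenceObjective CRD is installed on both clusters before you deploy workloads that depend on it:

kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER1_CONTEXT
kubectl get crd inferenceobjectives.inference.networking.x-k8s.io --context=CLUSTER2_CONTEXT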

Deploy resources to the target clusters

To make your AI/ML inference workloads available on each cluster, deploy the required resources, such as the model servers and InferenceObjective custom resources.

  1. Deploy the model servers to both clusters:

    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER1_CONTEXT
    
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/v1.1.0/config/manifests/vllm/gpu-deployment.yaml --context=CLUSTER2_CONTEXT
    

    Replace the following:

    • CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
    • CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
  2. Deploy the InferenceObjective resources to both clusters. Save the following sample manifest to a file named inference-objective.yaml:

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: food-review
    spec:
      priority: 10
      poolRef:
        name: llama3-8b-instruct
        group: "inference.networking.k8s.io"
    
  3. Apply the manifest to both clusters:

    kubectl apply -f inference-objective.yaml --context=CLUSTER1_CONTEXT
    kubectl apply -f inference-objective.yaml --context=CLUSTER2_CONTEXT
    

    Replace the following:

    • CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
    • CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
  4. Deploy the InferencePool resources to both clusters by using Helm:

    helm install vllm-llama3-8b-instruct \
      --kube-context CLUSTER1_CONTEXT \
      --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
      --set provider.name=gke \
      --version v1.1.0 \
      oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
    
    helm install vllm-llama3-8b-instruct \
      --kube-context CLUSTER2_CONTEXT \
      --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
      --set provider.name=gke \
      --set inferenceExtension.monitoring.gke.enabled=true \
      --version v1.1.0 \
      oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
    

    Replace the following:

    • CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
    • CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
  5. Mark the InferencePool resources as exported on both clusters. This annotation makes the InferencePool available for import by the config cluster, which is a required step for multi-cluster routing.

    kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
        --context=CLUSTER1_CONTEXT
    
    kubectl annotate inferencepool vllm-llama3-8b-instruct networking.gke.io/export="True" \
        --context=CLUSTER2_CONTEXT
    

    Replace the following:

    • CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
    • CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
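
Before moving to the config cluster, you can confirm that the model servers, InferencePool, and InferenceObjective exist on both clusters and that the export annotation was applied:

# Confirm the workloads and custom resources exist on each cluster.
kubectl get deployment,inferencepool,inferenceobjective --context=CLUSTER1_CONTEXT
kubectl get deployment,inferencepool,inferenceobjective --context=CLUSTER2_CONTEXT

# Confirm the export annotation is present on the InferencePool.
kubectl get inferencepool vllm-llama3-8b-instruct \
    -o jsonpath='{.metadata.annotations}{"\n"}' --context=CLUSTER1_CONTEXT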

Deploy resources to the config cluster

To define how traffic is routed and load-balanced across the InferencePool resources in all registered clusters, deploy the Gateway, HTTPRoute, and HealthCheckPolicy resources. You deploy these resources only to the designated config cluster, which is gke-west in this document.

  1. Create a file named mcig.yaml with the following content. If your clusters are in regions other than the example regions (europe-west3 and us-east4), update the region names in the addresses section to match:

    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: cross-region-gateway
      namespace: default
    spec:
      gatewayClassName: gke-l7-cross-regional-internal-managed-mc
      addresses:
      - type: networking.gke.io/ephemeral-ipv4-address/europe-west3
        value: "europe-west3"
      - type: networking.gke.io/ephemeral-ipv4-address/us-east4
        value: "us-east4"
      listeners:
      - name: http
        protocol: HTTP
        port: 80
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: vllm-llama3-8b-instruct-default
    spec:
      parentRefs:
      - name: cross-region-gateway
        kind: Gateway
      rules:
      - backendRefs:
        - group: networking.gke.io
          kind: GCPInferencePoolImport
          name: vllm-llama3-8b-instruct
    ---
    apiVersion: networking.gke.io/v1
    kind: HealthCheckPolicy
    metadata:
      name: health-check-policy
      namespace: default
    spec:
      targetRef:
        group: "networking.gke.io"
        kind: GCPInferencePoolImport
        name: vllm-llama3-8b-instruct
      default:
        config:
          type: HTTP
          httpHealthCheck:
            requestPath: /health
            port: 8000
    
  2. Apply the manifest:

    kubectl apply -f mcig.yaml --context=CLUSTER1_CONTEXT
    

    Replace CLUSTER1_CONTEXT with the context for the first cluster (the config cluster), for example gke_my-project_europe-west3-c_gke-west.
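
Gateway provisioning can take several minutes. The following optional check, which assumes the standard Gateway API Programmed condition, waits for the Gateway and then prints the addresses assigned in each region:

# Wait for the cross-regional Gateway to be programmed (this can take several minutes).
kubectl wait --for=condition=Programmed gateway/cross-region-gateway \
    -n default --timeout=15m --context=CLUSTER1_CONTEXT

# Print the addresses assigned to the Gateway in each region.
kubectl get gateway cross-region-gateway -n default \
    -o jsonpath='{.status.addresses}{"\n"}' --context=CLUSTER1_CONTEXT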

Enable custom metrics reporting

To enable custom metrics reporting and improve cross-regional load balancing, export KV cache usage metrics from all clusters. The load balancer uses this exported data as a custom load signal, which enables more intelligent load-balancing decisions based on each cluster's actual workload.

  1. Create a file named metrics.yaml with the following content:

    apiVersion: autoscaling.gke.io/v1beta1
    kind: AutoscalingMetric
    metadata:
      name: gpu-cache
      namespace: default
    spec:
      selector:
        matchLabels:
          app: vllm-llama3-8b-instruct
      endpoints:
      - port: 8000
        path: /metrics
        metrics:
        - name: vllm:kv_cache_usage_perc # For vLLM versions v0.10.2 and newer
          exportName: kv-cache
        - name: vllm:gpu_cache_usage_perc # For vLLM versions v0.6.2 and newer
          exportName: kv-cache-old
    
  2. Apply the metrics configuration to both clusters:

    kubectl apply -f metrics.yaml --context=CLUSTER1_CONTEXT
    kubectl apply -f metrics.yaml --context=CLUSTER2_CONTEXT
    

    Replace the following:

    • CLUSTER1_CONTEXT: the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
    • CLUSTER2_CONTEXT: the context for the second cluster, for example gke_my-project_us-east4-a_gke-east.
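
To confirm that the model servers actually expose the metrics referenced above, you can port-forward one model server and query its /metrics endpoint. This sketch assumes the model server Deployment is named vllm-llama3-8b-instruct, matching the app label used in this document:

# Port-forward the model server Deployment (assumed name) in the background.
kubectl port-forward deployment/vllm-llama3-8b-instruct 8000:8000 \
    --context=CLUSTER1_CONTEXT &

# Give the port-forward a moment to establish, then check for the KV cache metric.
sleep 5
curl -s localhost:8000/metrics | grep -E 'vllm:(kv|gpu)_cache_usage_perc'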

Configure the load balancing policy

To optimize how your AI/ML inference requests are distributed across your GKE clusters, configure a load balancing policy. Choosing the right balancing mode helps ensure efficient resource utilization, prevents overloading individual clusters, and improves the overall performance and responsiveness of your inference services.

Configure timeouts

If your requests are expected to have long durations, configure a longer timeout for the load balancer. In the GCPBackendPolicy, set the timeoutSec field to at least twice your estimated P99 request latency.

For example, the following manifest sets the load balancer timeout to 100 seconds, which accommodates an estimated P99 request latency of up to 50 seconds.

apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: my-backend-policy
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: vllm-llama3-8b-instruct
  default:
    timeoutSec: 100
    balancingMode: CUSTOM_METRICS
    trafficDuration: LONG
    customMetrics:
    - name: gke.named_metrics.kv-cache
      dryRun: false

For more information, see multi-cluster Gateway limitations.

Because the Custom metrics and In-flight requests load balancing modes are mutually exclusive, configure only one of these modes in your GCPBackendPolicy.

Choose a load balancing mode for your deployment.

Custom metrics

For optimal load balancing, start with a target utilization of 60%. To achieve this target, set maxUtilization: 60 in your GCPBackendPolicy's customMetrics configuration.

  1. Create a file named backend-policy.yaml with the following content to enable load balancing based on the kv-cache custom metric:

    apiVersion: networking.gke.io/v1
    kind: GCPBackendPolicy
    metadata:
      name: my-backend-policy
    spec:
      targetRef:
        group: "networking.gke.io"
        kind: GCPInferencePoolImport
        name: vllm-llama3-8b-instruct
      default:
        balancingMode: CUSTOM_METRICS
        trafficDuration: LONG
        customMetrics:
          - name: gke.named_metrics.kv-cache
            dryRun: false
            maxUtilization: 60
    
  2. Apply the new policy:

    kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT
    

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.

In-flight requests

To use the in-flight balancing mode, estimate the number of in-flight requests each backend can handle and explicitly configure a capacity value.

  1. Create a file named backend-policy.yaml with the following content to enable load balancing based on the number of in-flight requests:

    kind: GCPBackendPolicy
    apiVersion: networking.gke.io/v1
    metadata:
      name: my-backend-policy
    spec:
      targetRef:
        group: "networking.gke.io"
        kind: GCPInferencePoolImport
        name: vllm-llama3-8b-instruct
      default:
        balancingMode: IN_FLIGHT
        trafficDuration: LONG
        maxInFlightRequestsPerEndpoint: 1000
        dryRun: false
    
  2. Apply the new policy:

    kubectl apply -f backend-policy.yaml --context=CLUSTER1_CONTEXT
    

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.
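
Whichever mode you configure, you can confirm that GKE accepted the policy by describing it on the config cluster:

kubectl describe gcpbackendpolicy my-backend-policy --context=CLUSTER1_CONTEXT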

Verify the deployment

To verify the internal load balancer, you must send requests from within your VPC network because internal load balancers use private IP addresses. Run a temporary Pod inside one of the clusters to send the test requests:

  1. Start an interactive shell session in a temporary Pod:

    kubectl run -it --rm --image=curlimages/curl curly --context=CLUSTER1_CONTEXT -- /bin/sh
    

    Replace CLUSTER1_CONTEXT with the context for the first cluster, for example gke_my-project_europe-west3-c_gke-west.

  2. From the new shell, get the Gateway IP address and send a test request:

    GW_IP=$(kubectl get gateway/cross-region-gateway -n default -o jsonpath='{.status.addresses[0].value}')
    
    curl -i -X POST ${GW_IP}:80/v1/completions -H 'Content-Type: application/json' -d '{
    "model": "food-review-1",
    "prompt": "What is the best pizza in the world?",
    "max_tokens": 100,
    "temperature": 0
    }'
    

    The following is an example of a successful response:

    {
      "id": "cmpl-...",
      "object": "text_completion",
      "created": 1704067200,
      "model": "food-review-1",
      "choices": [
        {
          "text": "The best pizza in the world is subjective, but many argue for Neapolitan pizza...",
          "index": 0,
          "logprobs": null,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 100,
        "total_tokens": 110
      }
    }
    

What's next