Serve an LLM with multi-cluster Ray Serve and GKE Inference Gateway

This document explains how to manage inference requests across multiple Ray Serve clusters on Google Kubernetes Engine (GKE) by configuring the Kubernetes Gateway API and GKE Inference Gateway. This configuration lets you centralize traffic management for multiple teams, distribute workloads across regions for higher capacity, and implement model-aware routing based on request body content.

Benefits of using GKE Inference Gateway and Ray Serve

Using GKE Inference Gateway and Ray Serve offers the following benefits:

  • Path routing: configure each RayService with a path prefix, then serve them all through a single Gateway that routes to multiple RayServices.
  • Model-aware routing: choose a RayService to route to based on the request body—for example, by extracting the requested model from an OpenAI-API JSON request.
  • Governance: require API keys to use your service, or enforce quota for users by using Apigee for authentication and API management.
  • Multi-region: split traffic across multiple GKE clusters with RayServices to attain higher availability or capacity with multi-cluster Gateways.
  • Separation of concerns: use separate RayServices, which can be administered by separate teams, follow separate rollouts, and run on different topologies.
  • Security: use Gateway to act as the SSL terminator to help secure your user traffic over the internet. For more information, see Gateway security.

To configure routing, you deploy a Gateway, an HTTPRoute, and one or more RayServices. KubeRay typically creates a Kubernetes Service for each target Ray cluster. Ray Serve spreads request load within each cluster, so you don't need to create an InferencePool or an Endpoint Picker.

Model-aware routing for Ray Serve on GKE

Model-aware routing is enabled by a body-based routing extension. Body-based routing directs traffic to different RayServices based on the model named in the user's request, so a single endpoint can serve many models hosted in multiple Ray clusters. Your users get simplified access, and your app developers retain control over the configuration of each Ray endpoint.

To configure model-aware routing, you deploy the following key components:

  • A body-based router extension to extract model names from JSON payloads. This router extension is deployed by using Helm.
  • A GKE Gateway (L7 regional internal Application Load Balancer) to handle the incoming traffic.
  • HTTPRoute rules to direct traffic to the correct Ray Service by using headers populated by the router extension.
  • Multiple Ray Serve clusters to manage the lifecycle and autoscaling of siloed models.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.

Prepare your environment

Set up environment variables:

export CLUSTER=$(whoami)-ray-bbr
export PROJECT_ID=$(gcloud config get-value project)
export LOCATION=us-central1-b
export REGION=us-central1
export HUGGING_FACE_TOKEN=YOUR_HUGGING_FACE_TOKEN

Replace YOUR_HUGGING_FACE_TOKEN with your Hugging Face access token.

Prepare your infrastructure

In this section, you set up a GKE cluster with the Ray Operator and Gateway API enabled, plus a node pool of L4 GPUs.

  1. Create a cluster with the Ray Operator and Gateway API enabled:

    gcloud container clusters create ${CLUSTER} \
        --project ${PROJECT_ID} \
        --location ${LOCATION} \
        --cluster-version 1.35 \
        --gateway-api standard \
        --addons HttpLoadBalancing,RayOperator \
        --enable-ray-cluster-logging \
        --enable-ray-cluster-monitoring \
        --machine-type e2-standard-4
    
  2. Create a GPU node pool for your model workloads:

    gcloud container node-pools create gpu-pool \
        --cluster=${CLUSTER} \
        --location=${LOCATION} \
        --accelerator="type=nvidia-l4,count=1,gpu-driver-version=latest" \
        --machine-type=g2-standard-8 \
        --num-nodes=4
    
  3. Create a proxy-only subnet for the regional internal Application Load Balancer, which is required by body-based routing:

    gcloud compute networks subnets create bbr-proxy-only-subnet \
        --purpose=REGIONAL_MANAGED_PROXY \
        --role=ACTIVE \
        --region=${REGION} \
        --network=default \
        --range=192.168.10.0/24
    
  4. Deploy your Hugging Face secret:

    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=${HUGGING_FACE_TOKEN}
    

Deploy the body-based router for model-aware routing

The body-based router extension intercepts requests, parses the JSON body, and extracts the model field into an X-Gateway-Model-Name header.

  1. Create a file named helm-values.yaml with the following content:

    bbr:
      plugins:
        - type: "body-field-to-header"
          name: "openai-model-extractor"
          json:
            field_name: "model"
            header_name: "X-Gateway-Model-Name"
    
  2. Install the body-based router by using Helm:

    helm install body-based-router \
        oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing \
        --version v1.4.0 \
        --set provider.name=gke \
        --set inferenceGateway.name=ray-multi-model-gateway \
        --values helm-values.yaml
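
The plugin configured in helm-values.yaml performs, conceptually, a field-to-header transformation. The following Python sketch only illustrates that mapping; the actual extension runs as an Envoy processing server, not as this function:

```python
import json

def extract_model_header(body: bytes,
                         field_name: str = "model",
                         header_name: str = "X-Gateway-Model-Name") -> dict:
    """Illustrative only: map a JSON body field to a routing header,
    mirroring the body-field-to-header plugin configuration."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return {}
    value = data.get(field_name) if isinstance(data, dict) else None
    return {header_name: value} if value else {}

# Example OpenAI-style chat completions request body:
body = b'{"model": "gemma-2b-it", "messages": [{"role": "user", "content": "Hi"}]}'
print(extract_model_header(body))
# {'X-Gateway-Model-Name': 'gemma-2b-it'}
```

Requests without a model field, or with a non-JSON body, pass through without the header and therefore don't match any model-specific route.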
    

Deploy RayServices

To deploy your models, you must apply the RayService manifests. Each manifest defines a Ray cluster that runs a specific LLM.

  1. Create a file named gemma-2b-it.yaml with the following content:

    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: gemma-2b-it
    spec:
      serveConfigV2: |
        applications:
        - name: llm_app
          route_prefix: "/"
          import_path: ray.serve.llm:build_openai_app
          args:
            llm_configs:
                - model_loading_config:
                    model_id: gemma-2b-it
                    model_source: google/gemma-2b-it
                  accelerator_type: L4
                  log_engine_metrics: true
                  deployment_config:
                    autoscaling_config:
                        min_replicas: 2
                        max_replicas: 2
                    health_check_period_s: 600
                    health_check_timeout_s: 300
      rayClusterConfig:
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
            num-cpus: "0"
          template:
            spec:
              containers:
                - name: ray-head
                  image: rayproject/ray-llm:2.54.0-py311-cu128
                  resources:
                    limits:
                      memory: "8Gi"
                      ephemeral-storage: "32Gi"
                    requests:
                      cpu: "2"
                      memory: "8Gi"
                      ephemeral-storage: "32Gi"
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265
                      name: dashboard
                    - containerPort: 10001
                      name: client
                    - containerPort: 8000
                      name: serve
                  env:
                    - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                      value: "1"
                    - name: RAY_SERVE_ENABLE_HA_PROXY
                      value: "1"
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
        rayVersion: 2.54.0
        workerGroupSpecs:
          - replicas: 2
            minReplicas: 2
            maxReplicas: 2
            groupName: gpu-group
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: llm
                    image: rayproject/ray-llm:2.54.0-py311-cu128
                    env:
                      - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                        value: "1"
                      - name: RAY_SERVE_ENABLE_HA_PROXY
                        value: "1"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                    resources:
                      limits:
                        nvidia.com/gpu: "1"
                        ephemeral-storage: "24Gi"
                      requests:
                        cpu: "6"
                        memory: "24Gi"
                        nvidia.com/gpu: "1"
                        ephemeral-storage: "24Gi"
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-l4
    
  2. Create a file named qwen2.5-3b.yaml with the following content:

    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: qwen-25-3b
    spec:
      serveConfigV2: |
        applications:
        - name: llm_app
          route_prefix: "/"
          import_path: ray.serve.llm:build_openai_app
          args:
            llm_configs:
                - model_loading_config:
                    model_id: qwen-2.5-3b
                    model_source: Qwen/Qwen2.5-3B
                  accelerator_type: L4
                  log_engine_metrics: true
                  deployment_config:
                    autoscaling_config:
                        min_replicas: 2
                        max_replicas: 2
                    health_check_period_s: 600
                    health_check_timeout_s: 300
      rayClusterConfig:
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
            num-cpus: "0"
          template:
            spec:
              containers:
                - name: ray-head
                  image: rayproject/ray-llm:2.54.0-py311-cu128
                  resources:
                    limits:
                      memory: "8Gi"
                      ephemeral-storage: "32Gi"
                    requests:
                      cpu: "2"
                      memory: "8Gi"
                      ephemeral-storage: "32Gi"
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265
                      name: dashboard
                    - containerPort: 10001
                      name: client
                    - containerPort: 8000
                      name: serve
                  env:
                    - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                      value: "1"
                    - name: RAY_SERVE_ENABLE_HA_PROXY
                      value: "1"
                    - name: HUGGING_FACE_HUB_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: hf-secret
                          key: hf_api_token
        rayVersion: 2.54.0
        workerGroupSpecs:
          - replicas: 2
            minReplicas: 2
            maxReplicas: 2
            groupName: gpu-group
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: llm
                    image: rayproject/ray-llm:2.54.0-py311-cu128
                    env:
                      - name: RAY_SERVE_THROUGHPUT_OPTIMIZED
                        value: "1"
                      - name: RAY_SERVE_ENABLE_HA_PROXY
                        value: "1"
                      - name: HUGGING_FACE_HUB_TOKEN
                        valueFrom:
                          secretKeyRef:
                            name: hf-secret
                            key: hf_api_token
                    resources:
                      limits:
                        nvidia.com/gpu: "1"
                        ephemeral-storage: "24Gi"
                      requests:
                        cpu: "6"
                        memory: "24Gi"
                        nvidia.com/gpu: "1"
                        ephemeral-storage: "24Gi"
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-l4
    
  3. Deploy the models:

    kubectl apply -f gemma-2b-it.yaml
    kubectl apply -f qwen2.5-3b.yaml
    

Configure health checks

To help ensure that the load balancer accurately monitors the health of your Ray Serve endpoints, you must apply a HealthCheckPolicy resource for each Ray Serve Service.

  1. Create a file named healthcheck-policy.yaml with the following content:

    apiVersion: networking.gke.io/v1
    kind: HealthCheckPolicy
    metadata:
      name: gemma-serve-healthcheck
      namespace: default
    spec:
      default:
        checkIntervalSec: 5
        timeoutSec: 5
        healthyThreshold: 2
        unhealthyThreshold: 2
        config:
          type: HTTP
          httpHealthCheck:
            port: 8000
            requestPath: /-/healthz
      targetRef:
        group: ""
        kind: Service
        name: gemma-2b-it-serve-svc
    ---
    apiVersion: networking.gke.io/v1
    kind: HealthCheckPolicy
    metadata:
      name: qwen-serve-healthcheck
      namespace: default
    spec:
      default:
        checkIntervalSec: 5
        timeoutSec: 5
        healthyThreshold: 2
        unhealthyThreshold: 2
        config:
          type: HTTP
          httpHealthCheck:
            port: 8000
            requestPath: /-/healthz
      targetRef:
        group: ""
        kind: Service
        name: qwen-25-3b-serve-svc
    
  2. Apply the health check policy:

    kubectl apply -f healthcheck-policy.yaml
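
The threshold fields in these policies mean that a backend changes state only after consecutive probe results, which prevents a single transient failure from removing a backend. The following Python sketch illustrates that semantics (it is not GKE's implementation; the real probe is an HTTP GET to /-/healthz on port 8000):

```python
def simulate(results, healthy_threshold=2, unhealthy_threshold=2):
    """Illustrative only: apply consecutive-result thresholds to a
    sequence of probe outcomes (True = probe succeeded)."""
    state, streak = "HEALTHY", 0
    for ok in results:
        # A result that matches the current state resets the streak.
        if ok == (state == "HEALTHY"):
            streak = 0
            continue
        streak += 1
        if state == "HEALTHY" and streak >= unhealthy_threshold:
            state, streak = "UNHEALTHY", 0
        elif state == "UNHEALTHY" and streak >= healthy_threshold:
            state, streak = "HEALTHY", 0
    return state

# One failed probe is not enough; two consecutive failures flip the state.
print(simulate([True, False, True]))         # HEALTHY
print(simulate([True, False, False]))        # UNHEALTHY
print(simulate([False, False, True, True]))  # HEALTHY
```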
    

Configure routing

To configure routing, you must apply the Gateway and HTTPRoute manifests. The HTTPRoute contains rules that match the X-Gateway-Model-Name header (populated by the body-based router) to route traffic to the appropriate Ray service.

  1. Create a file named gateway.yaml with the following content:

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: ray-multi-model-gateway
      namespace: default
    spec:
      gatewayClassName: gke-l7-rilb
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: ray-multi-model-route
    spec:
      parentRefs:
      - name: ray-multi-model-gateway
      rules:
      - matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: gemma-2b-it  # Must match model named in JSON request!
          path:
            type: PathPrefix
            value: /
        backendRefs:
        - name: gemma-2b-it-serve-svc  # Ray service name plus "-serve-svc".
          kind: Service
          port: 8000
    
      - matches:
        - headers:
          - type: Exact
            name: X-Gateway-Model-Name
            value: qwen-2.5-3b  # Matches another extracted model name
          path:
            type: PathPrefix
            value: /
        backendRefs:
        - name: qwen-25-3b-serve-svc  # Target Ray Service.
          kind: Service
          port: 8000
    
  2. Apply the gateway and route:

    kubectl apply -f gateway.yaml
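
The two HTTPRoute rules amount to an exact-match routing table keyed on the extracted header. Conceptually (a sketch only, not how the load balancer is implemented):

```python
# Illustrative routing table matching the HTTPRoute rules: the value of
# the X-Gateway-Model-Name header selects a backend Service and port.
ROUTES = {
    "gemma-2b-it": ("gemma-2b-it-serve-svc", 8000),
    "qwen-2.5-3b": ("qwen-25-3b-serve-svc", 8000),
}

def pick_backend(headers: dict):
    """Return (service, port) for a request, or None if no rule matches."""
    return ROUTES.get(headers.get("X-Gateway-Model-Name"))

print(pick_backend({"X-Gateway-Model-Name": "qwen-2.5-3b"}))
# ('qwen-25-3b-serve-svc', 8000)
```

Because the match is exact, the header value produced from the request body must equal the model_id configured in the corresponding RayService.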
    

Test the deployment

After the Gateway is provisioned and both Ray clusters are ready, you can test routing by sending requests with different model names in the JSON body.

  1. Get the Gateway IP address:

    kubectl get gateways ray-multi-model-gateway
    
  2. Start a shell in an environment that can reach the Gateway address. For example, you can run curl from one of the Ray head Pods:

    POD_NAME=$(kubectl get pods -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -it $POD_NAME -- bash
    
  3. Test routing to Gemma:

    curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
        --header 'Content-Type: application/json' \
        --data '{
        "model": "gemma-2b-it",
        "messages": [{"role": "user", "content": "Tell me about GKE."}]
        }'
    

    Replace GATEWAY_IP_ADDRESS with the Gateway IP address that you obtained in step 1.

    The output is similar to the following:

    {"id":"chatcmpl-594f7cab-f991-4522-9829-acdbb65d9f67","object":"chat.completion","created":1776379509,"model":"gemma-2b-it","choices":[{"index":0,"message":{"role":"assistant","content":"**Google Kubernetes Engine (GKE)** is a fully managed container orchestration service for Kubernetes [...]
    
  4. Test routing to Qwen:

    curl http://GATEWAY_IP_ADDRESS/v1/chat/completions \
        --header 'Content-Type: application/json' \
        --data '{
        "model": "qwen-2.5-3b",
        "messages": [{"role": "user", "content": "How does Ray Serve work?"}]
        }'
    

    The output is similar to the following:

    {"id":"chatcmpl-dfe3f3b7-45fc-481c-b53e-2fc09c033cdb","object":"chat.completion","created":1776380249,"model":"qwen-2.5-3b","choices":[{"index":0,"message":{"role":"assistant","content":"Ray Serve facilitates the hosting and deployment of scalable microservices. [...]
    

The body-based router automatically extracts the value of the model field and ensures each request reaches the correct backend service configured in the gateway.yaml file.
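
Clients can parse the responses as standard OpenAI-style chat completions. A minimal sketch, assuming the response schema shown in the sample outputs (the content string here is a hypothetical placeholder):

```python
import json

# Illustrative: pull the served model and the assistant's reply out of
# an OpenAI-style chat completion response returned by Ray Serve.
response_body = '''{"id": "chatcmpl-123", "object": "chat.completion",
  "model": "gemma-2b-it",
  "choices": [{"index": 0,
               "message": {"role": "assistant", "content": "GKE is ..."}}]}'''

resp = json.loads(response_body)
print(resp["model"])                             # gemma-2b-it
print(resp["choices"][0]["message"]["content"])  # GKE is ...
```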

Clean up

Delete the cluster:

gcloud container clusters delete ${CLUSTER}

What's next