The Google Kubernetes Engine (GKE) multi-cluster Inference Gateway load balances your AI/ML inference workloads across multiple GKE clusters. It integrates GKE multi-cluster gateways for cross-cluster traffic routing with Inference Gateway for AI/ML model serving. This integration improves the scalability and high availability of your deployments. This document explains the gateway's core concepts and benefits.
For more information about how to deploy GKE multi-cluster Inference Gateway, see Set up the GKE multi-cluster Inference Gateway.
To understand this document, you must be familiar with the following:
- AI/ML orchestration on GKE.
- Generative AI terminology.
- GKE networking concepts including Services, GKE multi-cluster gateway, and the Gateway API.
- Load balancing in Google Cloud, especially how load balancers interact with GKE.
This document targets the following personas:
- Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads.
- Cloud architects or Networking specialists who interact with Kubernetes networking.
To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
Benefits of the GKE multi-cluster Inference Gateway
The GKE multi-cluster Inference Gateway provides several benefits for managing your AI/ML inference workloads, including the following:
- Enhances high availability and fault tolerance through intelligent load balancing across multiple GKE clusters, even across different geographical regions. Your inference workloads remain available, and the system automatically reroutes requests if a cluster or region experiences issues, thereby minimizing downtime.
- Improves scalability and optimizes resource usage by pooling GPU and TPU resources from various clusters to handle increased demand. This pooling allows your workloads to burst beyond the capacity of a single cluster and efficiently use available resources across your fleet.
- Maximizes performance with globally optimized routing. The gateway uses advanced metrics, such as Key-Value (KV) Cache usage from all clusters, to make efficient routing decisions. This approach helps to ensure that requests go to the cluster best equipped to handle them, thereby maximizing overall performance for your AI/ML inference fleet.
Limitations
The GKE multi-cluster Inference Gateway has the following limitations:
- Model Armor integration: the GKE multi-cluster Inference Gateway doesn't support Model Armor integration.
- Envoy Proxy latency reporting: the Envoy Proxy only reports query latency for successful (`2xx`) requests. It ignores errors and timeouts. This behavior can cause the Global Server Load Balancer (GSLB) to underestimate the true load on failing backends, potentially directing more traffic to already overloaded services. To mitigate this issue, configure a larger request timeout; a value of `600s` is recommended (see the sketch that follows this list).
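For example, you might set a larger timeout on the `HTTPRoute` that sends traffic to your model backends. The following manifest is a minimal sketch that assumes the gateway honors the standard Gateway API `timeouts.request` field; the route, Gateway, and backend names are hypothetical, and the `group` and `kind` used for the imported backend are assumptions (the `GCPInferencePoolImport` resource is described in Key components):

```yaml
# Illustrative sketch only: set a 600s request timeout on the route that
# fronts the model backends, using the standard Gateway API timeouts field.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: inference-route          # hypothetical route name in the config cluster
spec:
  parentRefs:
  - name: inference-gateway      # hypothetical Gateway name
  rules:
  - backendRefs:
    - group: networking.gke.io   # assumption: group/kind of the imported pool
      kind: GCPInferencePoolImport
      name: my-model-pool        # hypothetical imported InferencePool name
    timeouts:
      request: 600s              # the larger timeout recommended above
```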
Key components
The GKE multi-cluster Inference Gateway uses several Kubernetes custom resources to manage inference workloads and traffic routing:
- `InferencePool`: groups identical model server backends in your target cluster. This resource simplifies the management and scaling of your model serving instances.
- `InferenceObjective`: defines routing priorities for specific models within an `InferencePool`. This routing helps to ensure that certain models receive traffic preference based on your requirements.
- `GCPInferencePoolImport`: makes your model backends available for routing configuration by using `HTTPRoute` in the config cluster. This resource is automatically created in your config cluster when you export an `InferencePool` from a target cluster. The config cluster acts as the central point of control for your multi-cluster environment.
- `GCPBackendPolicy`: customizes how traffic is load balanced to your backends. For example, you can enable load balancing based on custom metrics or set limits on in-flight requests per endpoint to protect your model servers.
- `AutoscalingMetric`: defines custom metrics, such as `vllm:kv_cache_usage_perc`, to export from your model servers. You can then use these metrics within `GCPBackendPolicy` to make more intelligent load balancing decisions, and optimize performance and resource utilization.
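The following manifests are a minimal sketch of how the first two resources might look in a target cluster. The field names follow the open source Gateway API Inference Extension (`inference.networking.x-k8s.io/v1alpha2`) and are assumptions; the names, label, port, and priority value are hypothetical, and the fields that GKE expects might differ:

```yaml
# Illustrative sketch only: group the model server Pods in this target
# cluster into an InferencePool, then give one model a routing priority.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-model-pool              # hypothetical pool name
spec:
  selector:
    app: vllm-model-server         # hypothetical label on the model server Pods
  targetPortNumber: 8000           # hypothetical port that the model servers listen on
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: my-model-objective
spec:
  poolRef:
    name: my-model-pool            # the InferencePool defined above
  priority: 10                     # assumption: higher values receive traffic preference
```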
How the GKE multi-cluster Inference Gateway works
The GKE multi-cluster Inference Gateway manages and routes traffic to your AI/ML models deployed across multiple GKE clusters. It works as follows:
- Centralized traffic management: a dedicated config cluster defines your traffic routing rules. The config cluster acts as the central point of control for your multi-cluster environment. You designate a GKE cluster as the config cluster when you enable multi-cluster Ingress for your fleet. This centralized approach lets you manage how requests are directed to your models across your entire fleet of GKE clusters from a single place.
- Flexible model deployment: your actual AI/ML models run in separate target clusters. This separation lets you deploy models where it makes the most sense (for example, closer to data or to clusters with specific hardware).
- Easy integration of models: when you deploy a model in a target cluster, you group its serving instances using an `InferencePool`. Exporting this `InferencePool` automatically makes it available for routing in your config cluster.
- Intelligent load balancing: the gateway doesn't only distribute traffic; it makes intelligent routing decisions. By configuring it to use various signals, including custom metrics from your model servers, the gateway helps to ensure that incoming requests are sent to the best-equipped cluster or model instance, which can maximize performance and resource utilization. For example, you can route requests to the cluster with the most available inference capacity based on metrics like Key-Value (KV) Cache usage (see the illustrative sketch that follows this list).
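To make the last two points concrete, the following sketch shows a `GCPBackendPolicy` in the config cluster that asks the load balancer to factor KV Cache usage into routing decisions for an exported pool. The `targetRef` group and kind, and the `balancingMode` and `customMetrics` fields, are assumptions for illustration and might not match the actual resource schema; the metric name corresponds to the `vllm:kv_cache_usage_perc` example from Key components:

```yaml
# Illustrative sketch only: weigh KV cache usage when load balancing traffic
# to the imported pool. Field names under "default" are assumptions.
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: my-model-pool-policy         # hypothetical policy name in the config cluster
spec:
  targetRef:
    group: networking.gke.io         # assumption: targets the imported pool
    kind: GCPInferencePoolImport
    name: my-model-pool
  default:
    balancingMode: CUSTOM_METRICS    # assumption: enables custom-metric balancing
    customMetrics:
    - name: vllm:kv_cache_usage_perc # metric exported through an AutoscalingMetric resource
      dryRun: false
```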
What's next
- To deploy the gateway, see Set up the GKE multi-cluster Inference Gateway.
- To learn how to use the `scopes` field in the `GCPBackendPolicy` resource, see Customize backend configurations with `GCPBackendPolicy` scopes.