In a multi-cluster Google Kubernetes Engine (GKE) Inference Gateway environment, you can apply different backend configurations to services deployed across multiple clusters. For example, you can set different maximum request rates or capacity scalers for backends in different regions or environments.
To understand this document, you should be familiar with the following:
- AI/ML orchestration on GKE.
- Generative AI terminology.
- GKE networking concepts including Services, GKE Multi Cluster Ingress, and the GKE Gateway API.
- Load balancing in Google Cloud, especially how load balancers interact with GKE.
This document targets the following personas:
- Machine learning (ML) engineers, Platform admins and operators, and Data and AI specialists interested in using Kubernetes container orchestration capabilities for serving AI/ML workloads.
- Cloud architects or Networking specialists who interact with Kubernetes networking.
To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE Enterprise user roles and tasks.
How GCPBackendPolicy scopes work
The scopes field in GCPBackendPolicy lets you tailor backend configurations
based on the specific clusters where your backends are running. You can apply
different settings to backends in different environments or regions, which gives
you fine-grained control over your distributed AI/ML workloads. The following
sections explain how to target resources, define policy scopes, and handle
conflict resolution.
Target Inference Gateway resources
To use Inference Gateway policies in a multi-cluster
GKE environment, the GCPBackendPolicy's targetRef field must
reference a GCPInferencePoolImport resource:
targetRef:
  group: networking.gke.io
  kind: GCPInferencePoolImport
  name: example
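For reference, the following is a minimal sketch of a complete policy built around this targetRef. The metadata name and the default setting are placeholders drawn from the examples later in this document, not required values.

```yaml
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: example-backend-policy       # Placeholder name for the policy
spec:
  targetRef:
    group: networking.gke.io         # API group of the imported inference pool
    kind: GCPInferencePoolImport
    name: example                    # Name of the GCPInferencePoolImport, as in the snippet above
  default:
    maxRatePerEndpoint: 50           # Example backend-level setting; see the fields table below
```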
Policy scope definition
The scopes field in GCPBackendPolicy lets you apply different backend
settings to specific groups of backends. By defining configuration objects
within default.scopes, you can use cluster labels to precisely target backends
and apply specific settings. For example, you can set unique capacity limits or
request rates for backends in different regions or clusters.
You can't specify the same backend-level fields (like maxRatePerEndpoint) in
both the main default section and within the default.scopes entries.
Attempting to do so causes the policy to be rejected, which helps to ensure clear and
consistent configurations.
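As an illustration of this rule, the following sketch is rejected because maxRatePerEndpoint appears both in the main default section and inside a scope entry. The selector label and values are only examples.

```yaml
default:
  maxRatePerEndpoint: 50            # Set once at the default level...
  scopes:
  - selector:
      gke.io/region: us-west2
    maxRatePerEndpoint: 40          # ...and again inside a scope, so the policy is rejected
```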
Conflict resolution
When a backend matches multiple scopes, the system applies the following rules to ensure predictable behavior:
- Prioritized matching: if a backend matches multiple selectors in your scopes list, the system applies only the settings from the first matching selector. Order your scopes from the most specific to the most general to help ensure your intended configuration takes effect, as shown in the sketch after this list.
- Precise targeting: when a single selector contains multiple labels (for example, gke.io/region: us-central1 and env: prod), the backend must satisfy all of those labels for the system to apply the scope's configuration. This approach lets you precisely target backends based on multiple criteria.
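The following sketch illustrates prioritized matching with scopes ordered from most specific to most general; the field values are only examples. A backend in zone us-central1-a of a cluster labeled env: prod matches both selectors, but only the first entry's settings apply.

```yaml
scopes:
# Most specific entry first: both labels must match for this scope to apply.
- maxRatePerEndpoint: 50
  selector:
    gke.io/zone: us-central1-a
    env: prod
# More general fallback: applies to prod backends that did not match the entry above.
- maxRatePerEndpoint: 40
  selector:
    env: prod
```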
Supported per-backend fields
The following table lists the backend-level fields that you can customize to control backend behavior in different environments or regions.
| Field Name | Description | Example Configuration |
|---|---|---|
| backendPreference | Specifies whether the backend is preferred (PREFERRED) or default (DEFAULT) during capacity chasing for multi-region load balancing. | backendPreference: PREFERRED |
| balancingMode | Specifies the balancing algorithm. Supported values are RATE, UTILIZATION, or CUSTOM_METRICS. | balancingMode: CUSTOM_METRICS |
| capacityScalerPercent | Configures traffic distribution based on capacity. This value is a percentage from 0 to 100, acting as a multiplier on the backend's configured target capacity. The default value is 100. | capacityScalerPercent: 20 |
| customMetrics | Specifies custom metrics used for load balancing when balancingMode is set to CUSTOM_METRICS. This field is a list of metric definitions. | customMetrics: [{ name: "my-metric", value: 0.8 }] |
| maxInFlightPerEndpoint | Sets the maximum number of concurrent requests or connections per endpoint. | maxInFlightPerEndpoint: 100 |
| maxRatePerEndpoint | Sets the maximum rate of requests per endpoint, in requests per second (RPS). | maxRatePerEndpoint: 50 |
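For example, the following sketch combines two related fields from this table in a policy's default section, reusing the example values shown above; the metric name is a placeholder.

```yaml
default:
  balancingMode: CUSTOM_METRICS     # Balance traffic based on custom metrics...
  customMetrics:                    # ...defined by this list of metric definitions
  - name: "my-metric"               # Placeholder metric name from the table above
    value: 0.8
```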
Specify scope selectors
The selectors field in each scope lets you control which backends receive specific policy settings. You can target backends based on their cluster labels, either built-in GKE labels or your own custom labels, to tailor configurations for different groups of backends. Scope-level settings must also be compatible with the settings in the main default section. The following example shows a policy that is rejected because a scope sets maxInFlightPerEndpoint while balancingMode: IN_FLIGHT is set at the default level:
kind: GCPBackendPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: echoserver-v2
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: test-inference-pool
  default:
    balancingMode: IN_FLIGHT # IN_FLIGHT mode is set at the default level
    scopes:
    - selector:
        gke.io/zone: "us-central1-a"
      maxInFlightPerEndpoint: 100 # Invalid: maxInFlightPerEndpoint cannot be set within a scope when balancingMode is IN_FLIGHT at the default level
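For comparison, the following sketch shows one arrangement that avoids this conflict by keeping maxInFlightPerEndpoint at the same level as balancingMode and using the scope only for capacityScalerPercent. This layout is an assumption based on the rules described in this document, not an exhaustive list of valid combinations.

```yaml
kind: GCPBackendPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: echoserver-v2
spec:
  targetRef:
    group: "networking.gke.io"
    kind: GCPInferencePoolImport
    name: test-inference-pool
  default:
    balancingMode: IN_FLIGHT        # Balancing mode and its per-endpoint limit stay together...
    maxInFlightPerEndpoint: 100     # ...at the default level instead of inside a scope
    scopes:
    - selector:
        gke.io/zone: "us-central1-a"
      capacityScalerPercent: 50     # Scope adjusts only a field that does not conflict with the default
```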
Implicit GKE labels
The following implicit labels are available to use as selectors. GKE automatically applies these labels to your clusters:
| Label | Description | Example Value |
|---|---|---|
| gke.io/cluster-name | The name of the GKE cluster. | my-cluster |
| gke.io/region | The region where the cluster is located. | us-central1 |
| gke.io/zone | The zone where the cluster is located. | us-central1-a |
Custom cluster labels
Custom cluster labels provide more flexibility for grouping and managing your
backends. By defining your own labels on your GKE clusters, you
can create highly specific selectors in your GCPBackendPolicy to apply unique
configurations. For example, you can base these configurations on criteria such
as different environments (dev, staging, or prod) or specific application
versions.
To add a custom label, such as environment=production, to a GKE
cluster, run the following command:
gcloud container clusters update CLUSTER_NAME \
--region=REGION \
--update-labels=LABEL_KEY=LABEL_VALUE
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- REGION: the region of your cluster.
- LABEL_KEY: the key for your custom label, for example, environment.
- LABEL_VALUE: the value for your custom label, for example, production.
You can then select backends in this cluster by using the custom label selector in your policy.
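For example, the following is a sketch of a default.scopes fragment that selects backends by using the environment=production label added by the preceding command; the rate limit value is illustrative.

```yaml
default:
  scopes:
  - maxRatePerEndpoint: 40          # Illustrative limit for production backends
    selector:
      environment: production       # Custom cluster label added with the gcloud command above
```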
Example GCPBackendPolicy with scope selectors
The following example defines a GCPBackendPolicy that targets an
GCPInferencePoolImport named experimental. The policy uses implicit and
custom labels to set values for backendPreference, maxRatePerEndpoint, and
capacityScalerPercent.
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: backend-policy
spec:
  targetRef:
    kind: GCPInferencePoolImport
    name: experimental
  default:
    scopes:
    # Selector 1: Targets backends in us-west2, sets capacity to 50%
    - capacityScalerPercent: 50
      selector:
        gke.io/region: us-west2
    # Selector 2: Targets backends in clusters labeled 'env: prod'
    - maxRatePerEndpoint: 40
      selector:
        env: prod
    # Selector 3: Targets backends in a specific us-central1 zone and marks them as PREFERRED
    - backendPreference: PREFERRED
      maxRatePerEndpoint: 50
      selector:
        gke.io/cluster-name: my-cluster
        gke.io/zone: us-central1-a
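If you want to try this example, you can save the manifest to a file and apply it with kubectl; the filename is only an example.

```sh
# Apply the policy manifest (example filename).
kubectl apply -f backend-policy.yaml

# Confirm that the policy exists and inspect its configuration.
kubectl get gcpbackendpolicy backend-policy -o yaml
```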
After applying this policy, you observe the following behaviors:
- Backends in clusters within the us-west2 region have their effective capacity scaled to 50%.
- Backends in clusters labeled with env: prod are limited to a maximum of 40 requests per second per endpoint.
- Backends in clusters specifically located in the us-central1-a zone are prioritized (PREFERRED) during load balancing and have a maximum rate of 50 requests per second per endpoint.
What's next
- Learn how to Configure a GCPBackendPolicy.
- Learn more about the GKE multi-cluster Inference Gateway.
- Learn how to Set up the GKE multi-cluster Inference Gateway.