Capacity buffers help you reduce Pod startup latency for your Google Kubernetes Engine (GKE) workloads by letting you proactively declare tiers of active or standby capacity in your cluster. By declaring spare capacity ahead of time, you can achieve faster workload startups in a cost-efficient manner.
This document explains how capacity buffers work. To learn how to enable and use capacity buffers, see Configure capacity buffers.
When to use a capacity buffer
Use a capacity buffer for applications that are sensitive to startup latency and need to scale rapidly. For sudden traffic spikes, an active buffer provides pre-provisioned capacity that's designed for low-latency scaling. For sustained traffic increases, a standby buffer lets Pods schedule onto resumed capacity at a lower cost than keeping fully pre-provisioned nodes.
Capacity buffers provide the following benefits:
- Minimize scaling latency: active buffers provide running VMs, which help minimize latency. Standby buffers resume quickly, making capacity available faster than provisioning new nodes.
- Cost-efficient over-provisioning: capacity buffers help you maintain a fixed-size safety net. For large-scale workloads, this approach is often more cost-efficient than other over-provisioning methods (for example, lowering HorizontalPodAutoscaler (HPA) utilization targets), which can increase idle capacity linearly as your cluster grows.
- Meet workload requirements: you have full control over the size of the capacity buffer. Your options include incorporating custom DaemonSets or data, and controlling image preloading and workload pre-starting to fit your needs.
We recommend capacity buffers for latency-sensitive workloads that require rapid scale-up, such as AI agents, AI inference, retail applications during sales events, or game servers during peak player activity.
We don't recommend capacity buffers for workloads that aren't sensitive to startup latency, such as batch processing jobs. For these workloads, over-provisioning resources provides no benefit.
How capacity buffers work
Implement a capacity buffer by using a Kubernetes CapacityBuffer custom resource to define a buffer of spare capacity. The GKE cluster autoscaler monitors CapacityBuffer resources and treats them as pending demand to help ensure that spare capacity is available. If your cluster doesn't have enough capacity to satisfy the resource requests defined in the buffer, the cluster autoscaler provisions additional nodes.
When a high-priority workload scales up, GKE schedules the workload on the available capacity in the buffer immediately. This immediate scheduling applies to the number of replicas or the resource amount that's reserved in the buffer, avoiding the typical delay associated with node provisioning. When a workload uses a buffer unit, the cluster autoscaler provisions a new node to refill the buffer.
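As a minimal sketch of this model, the following manifest reserves a fixed amount of spare capacity by referencing a Pod template. The `apiVersion`, field names, and values shown here are illustrative assumptions, not the confirmed schema; see the CapacityBuffer CRD reference for the authoritative field definitions.

```yaml
# Illustrative sketch only: the apiVersion and field names are assumptions;
# consult the CapacityBuffer CRD reference for the exact schema.
apiVersion: autoscaling.x-k8s.io/v1
kind: CapacityBuffer
metadata:
  name: web-frontend-buffer
spec:
  # A PodTemplate whose resource requests define one buffer unit.
  podTemplateRef:
    name: web-frontend-template
  # Keep spare capacity for three buffer Pods available at all times.
  replicas: 3
```

The cluster autoscaler treats these three buffer Pods as pending demand: if the cluster can't fit them, it provisions nodes, and when a real workload consumes a unit, the autoscaler refills the buffer.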
Capacity buffer strategies
You can configure capacity buffers by using different provisioning strategies based on your requirements for latency and cost.
Active buffer
An active buffer provides running VMs for low-latency scaling of workloads that fit within the reserved capacity. Because the nodes are already running, they provide minimal latency when the buffer is first consumed during a scale-up event.
We recommend this strategy for critical workloads where scale-up time is the highest priority.
Standby buffer
A standby buffer provides suspended VMs. The standby strategy is more cost-effective than the active strategy, but introduces a short delay to resume the VM before it accepts workloads.
We recommend this strategy for workloads that can tolerate a minor delay in scaling to optimize for cost.
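Both strategies are expressed on the same CapacityBuffer resource. In this hedged sketch, a hypothetical `provisioningStrategy` field selects between them; the actual field name and its accepted values may differ, so check the CapacityBuffer CRD reference before relying on this shape.

```yaml
# Sketch only: provisioningStrategy and its values are assumptions
# for illustration, not the confirmed GKE schema.
apiVersion: autoscaling.x-k8s.io/v1
kind: CapacityBuffer
metadata:
  name: inference-buffer
spec:
  podTemplateRef:
    name: inference-template
  replicas: 5
  # active:  running VMs, lowest latency, highest cost.
  # standby: suspended VMs, short resume delay, lower cost.
  provisioningStrategy: standby
```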
CapacityBuffer CRD
To configure a capacity buffer, you create a CapacityBuffer custom resource, whose type is defined by the CapacityBuffer CustomResourceDefinition (CRD). You can configure the capacity buffer to meet different criteria:
- Fixed replicas: Specify a fixed number of buffer Pods. This configuration is the simplest way to create a buffer of a known size.
- Percentage-based: Define the buffer size as a percentage of an existing scalable object that exposes a scale subresource (such as a Deployment, StatefulSet, or ReplicaSet). The buffer size adjusts dynamically as the referenced workload scales. You can't define a percentage-based buffer for Pod templates because they don't have a replicas field.
- Resource limits: Define the total amount of CPU and memory that the buffer should reserve. The controller calculates how many buffer Pods to create based on the resource requests of a referenced Pod template.
For more information, see the CapacityBuffer CRD reference documentation.
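For example, a percentage-based buffer might be sketched as follows. The field names (`scalableRef`, `percentage`) are illustrative assumptions for a hypothetical `checkout` Deployment; the CRD reference documentation defines the exact schema.

```yaml
# Illustrative sketch only: field names are assumptions; see the
# CapacityBuffer CRD reference for the authoritative schema.
apiVersion: autoscaling.x-k8s.io/v1
kind: CapacityBuffer
metadata:
  name: checkout-buffer
spec:
  # Size the buffer relative to an existing scalable workload:
  # reserve spare capacity equal to 20% of the Deployment's replicas.
  scalableRef:
    apiGroup: apps
    kind: Deployment
    name: checkout
  percentage: 20
```

As the referenced Deployment scales up or down, the buffer target is recomputed, keeping the headroom proportional to the workload instead of fixed.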
Requirements and limitations
Capacity buffers have the following requirements and limitations:
- Capacity buffers are available for GKE clusters running version 1.35.2-gke.1842000 or later for active buffers, and version 1.35.2-gke.1842002 or later for standby buffers.
- Capacity buffers support only workloads that use a node-based billing model. Capacity buffers don't support workloads that use the Pod-based billing model.
- We recommend that you enable node auto-provisioning on your clusters. Node auto-provisioning allows the cluster autoscaler to create new node pools based on the resource requests in your CapacityBuffer. If you don't enable node auto-provisioning, the cluster autoscaler only scales up existing node pools.
Standby buffers have the following additional limitations:
- They are supported only on Standard clusters.
- VMs with GPUs or TPUs attached are not supported.
- Local SSDs are not supported.
- Confidential Google Kubernetes Engine Nodes are not supported.
- Standby buffers are subject to the limitations of Compute Engine VM suspend and resume operations, such as memory size limits. You should be familiar with these limitations before using standby buffers.
What's next
- To learn how to implement a capacity buffer, see Configure capacity buffers.