About capacity buffers

Capacity buffers help you reduce Pod startup latency for your Google Kubernetes Engine (GKE) workloads by letting you proactively manage spare capacity in your cluster. By reserving spare capacity ahead of time, you help to ensure that nodes have capacity available when needed, reducing the time it takes to schedule new Pods during scaling events.

This document explains how capacity buffers work. To learn how to enable and use capacity buffers, see Configure capacity buffers.

When to use a capacity buffer

Use a capacity buffer for applications that are sensitive to startup latency and need to scale rapidly. When you experience sudden increases in traffic, a capacity buffer provides pre-provisioned capacity.

Capacity buffers provide the following benefits:

  • Minimize scaling latency: with capacity buffers, critical workloads land on pre-provisioned nodes immediately. Capacity buffers help eliminate the delay associated with VM startup and image pulling, helping you maintain strict service-level objectives (SLOs) during traffic spikes.
  • Cost-efficient over-provisioning: capacity buffers help you maintain a fixed-size safety net. For large-scale workloads, this approach is often more cost-efficient than other over-provisioning methods (for example, lowering HorizontalPodAutoscaler (HPA) utilization targets), which can increase idle capacity linearly as your cluster grows.
  • Meet workload requirements: you have full control over the size and contents of the capacity buffer. For example, you can include custom DaemonSets or data, preload images, and pre-start workloads.

We recommend capacity buffers for latency-sensitive workloads that require rapid scale-up, such as AI inference, retail applications during sales events, or game servers during peak player activity.

We don't recommend capacity buffers for workloads that aren't sensitive to startup latency, such as batch processing jobs. For these workloads, over-provisioning resources provides no benefit.

How capacity buffers work

Implement a capacity buffer by using a Kubernetes CapacityBuffer custom resource to define a buffer of spare capacity. The GKE cluster autoscaler monitors CapacityBuffer resources and treats them as pending demand to help ensure that spare capacity is available. If your cluster doesn't have enough capacity to satisfy the resource requests defined in the buffer, the cluster autoscaler provisions additional nodes.

When a high-priority workload scales up, GKE schedules the workload on the available capacity in the buffer immediately. This immediate scheduling applies to the number of replicas or the resource amount that's reserved in the buffer, avoiding the typical delay associated with node provisioning. When a workload uses a buffer unit, the cluster autoscaler provisions a new node to refill the buffer.

An active buffer provides running VMs for low-latency scaling of workloads that fit within the reserved capacity. Because the nodes are already running and ready, they provide the lowest possible latency for initial buffer consumption during a scale-up event.

CapacityBuffer CRD

To configure a capacity buffer, you create a custom resource based on the CapacityBuffer CustomResourceDefinition (CRD). You can configure the capacity buffer to meet different criteria:

  • Fixed replicas: Specify a fixed number of buffer Pods. This configuration is the simplest way to create a buffer of a known size.
  • Percentage-based: Define the buffer size as a percentage of an existing scalable workload, such as a Deployment. The buffer size adjusts dynamically as the referenced workload scales.
  • Resource limits: Define the total amount of CPU and memory that the buffer should reserve. The controller calculates how many buffer Pods to create based on the resource requests of a referenced Pod template.
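As an illustrative sketch, a fixed-replicas buffer manifest might look like the following. The `apiVersion`, field names, and the `my-pod-template` reference here are assumptions for illustration only, not the verified schema; see the CapacityBuffer CRD reference documentation for the actual fields.

```yaml
# Illustrative sketch only: apiVersion and field names are assumptions,
# not the verified CapacityBuffer schema.
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: CapacityBuffer
metadata:
  name: web-frontend-buffer
spec:
  # Fixed replicas: reserve spare capacity equal to three buffer Pods.
  replicas: 3
  # Reference to a Pod template whose resource requests define the
  # shape of each buffer unit; my-pod-template is a placeholder name.
  podTemplateRef:
    name: my-pod-template
```

In this sketch, the cluster autoscaler would treat the three buffer Pods as pending demand and keep enough node capacity available to schedule them at any time.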

For more information, see the CapacityBuffer CRD reference documentation.

Requirements and limitations

Capacity buffers have the following requirements and limitations:

  • Capacity buffers are available for GKE clusters running version 1.35.2-gke.1842000 or later.
  • Capacity buffers support only workloads that use the node-based billing model. Capacity buffers don't support workloads that use the Pod-based billing model.
  • We recommend that you enable node auto-provisioning on your clusters. Node auto-provisioning allows the cluster autoscaler to create new node pools based on the resource requests in your CapacityBuffer. If you don't enable node auto-provisioning, the cluster autoscaler only scales up existing node pools.
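If node auto-provisioning isn't already enabled, you can turn it on with a command like the following. The cluster name and the CPU and memory limits are placeholders; choose limits that fit your own workloads.

```shell
# Enable node auto-provisioning so the cluster autoscaler can create new
# node pools to satisfy CapacityBuffer resource requests.
# CLUSTER_NAME and the resource limits below are placeholder values.
gcloud container clusters update CLUSTER_NAME \
    --enable-autoprovisioning \
    --min-cpu 1 --max-cpu 64 \
    --min-memory 1 --max-memory 256
```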

What's next