Scale to zero using KEDA

This tutorial shows you how to scale your GKE workloads down to zero Pods by using KEDA. Scaling Deployments to zero Pods saves resources during periods of inactivity (such as weekends and non-office hours), and for intermittent workloads such as periodic jobs.

Install KEDA

KEDA is a component that complements the Kubernetes Horizontal Pod Autoscaler. With KEDA, you can scale a Deployment down to zero Pods, and back up from zero Pods to one Pod. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods distributed among the nodes in a cluster. The standard Horizontal Pod Autoscaler algorithm applies after GKE creates at least one Pod.

After GKE scales the Deployment to zero Pods, autoscaling can no longer rely on Pod metrics such as CPU utilization, because no Pods are running. To work around this, KEDA fetches metrics that originate outside the cluster through its implementation of the Kubernetes External Metrics API. You can use this API to autoscale based on metrics such as the number of outstanding messages on a Pub/Sub subscription. See the KEDA documentation for a list of all supported metric sources.

Install KEDA on your cluster with Helm or with kubectl.

Helm

Run the following commands to add the KEDA Helm repository, install the KEDA Helm chart, and give the KEDA service account read access to Cloud Monitoring:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --create-namespace --namespace keda

gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
     --role roles/monitoring.viewer \
     --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator

This command also sets up authorization rules that require the cluster to be configured with Workload Identity Federation for GKE.

kubectl

Run the following commands to install KEDA using kubectl apply and to give the KEDA service account read access to Cloud Monitoring:

kubectl apply --server-side -f https://github.com/kedacore/keda/releases/download/v2.15.1/keda-2.15.1.yaml

gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
     --role roles/monitoring.viewer \
     --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda/sa/keda-operator

This command also sets up authorization rules that require the cluster to be configured with Workload Identity Federation for GKE.

Confirm that all KEDA resources appear under the keda namespace:

kubectl get all -n keda
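
The output resembles the following (abridged; the exact resources can vary with the KEDA version):

```
NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/keda-admission-webhooks           1/1     1            1           1m
deployment.apps/keda-operator                     1/1     1            1           1m
deployment.apps/keda-operator-metrics-apiserver   1/1     1            1           1m
```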

For more information about KEDA design and resources, see the KEDA documentation.

Scale your Pub/Sub workload to zero

This section describes a workload that processes messages from a Pub/Sub subscription, handling each message and acknowledging its completion. The workload scales dynamically: as the number of unacknowledged messages increases, autoscaling instantiates more Pods to ensure timely processing.

Scaling to zero ensures that no Pods are instantiated when no messages have been received for a while. This saves resources as no Pods stay idle for long periods of time.

Deploy a Pub/Sub workload

Deploy a sample workload that processes messages queued on a Pub/Sub topic. To simulate a realistic workload, this sample program waits three seconds before acknowledging a message. The workload is configured to run under the keda-pubsub-sa service account.

Run the following commands to create the Pub/Sub topic and subscription, configure their permissions, and create the Deployment that runs the workload in the keda-pubsub namespace:

gcloud pubsub topics create keda-echo
gcloud pubsub subscriptions create keda-echo-read --topic=keda-echo
gcloud projects add-iam-policy-binding projects/${PROJECT_ID} \
    --role=roles/pubsub.subscriber \
    --member=principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/keda-pubsub/sa/keda-pubsub-sa

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-with-workload-identity.yaml

Configure scale-to-zero

To configure your Pub/Sub workload to scale to zero, use KEDA to define a ScaledObject resource that specifies how the Deployment should scale. KEDA then automatically creates and manages the underlying HorizontalPodAutoscaler (HPA) object.
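
Beyond the fields used in this tutorial, the ScaledObject specification exposes optional fields that control scale-to-zero behavior. The following sketch uses illustrative values and is not part of the tutorial's manifest; the defaults shown are those documented by KEDA:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example
spec:
  scaleTargetRef:
    name: example-deployment
  minReplicaCount: 0    # Default is 0, which enables scale-to-zero.
  maxReplicaCount: 5
  pollingInterval: 30   # Seconds between checks of the external metric (default: 30).
  cooldownPeriod: 300   # Seconds to wait after the last active trigger before
                        # scaling to zero (default: 300).
  triggers: []          # Trigger list omitted for brevity.
```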

  1. Create the ScaledObject resource to describe the expected autoscaling behavior:

    curl https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/cloud-pubsub/deployment/keda-pubsub-scaledobject.yaml | envsubst | kubectl apply -f -
    

    This creates the following object:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: keda-pubsub
      namespace: keda-pubsub
    spec:
      maxReplicaCount: 5
      scaleTargetRef:
        name: keda-pubsub
      triggers:
        - type: gcp-pubsub
          authenticationRef:
            name: keda-auth
          metadata:
            subscriptionName: "projects/${PROJECT_ID}/subscriptions/keda-echo-read"
    
  2. Inspect the HorizontalPodAutoscaler (HPA) object that KEDA creates based on the ScaledObject object:

    kubectl get hpa keda-hpa-keda-pubsub -n keda-pubsub -o yaml
    

    You can read more about autoscaling in the Kubernetes documentation.

  3. Wait until KEDA acknowledges that the Pub/Sub subscription is empty, and scales the Deployment to zero replicas.

    Inspect the workload autoscaler:

    kubectl describe hpa keda-hpa-keda-pubsub -n keda-pubsub
    

    Observe that in the command response, the ScalingActive condition is false. The associated message shows that the Horizontal Pod Autoscaler acknowledges that KEDA scaled the deployment to zero, at which point it stops operating until the Deployment scales back up to one Pod.

    Name:                                                  keda-hpa-keda-pubsub
    Namespace:                                             keda-pubsub
    Metrics:                                               ( current / target )
  "s0-gcp-ps-projects-[...]" (target average value):  0 / 10
    Min replicas:                                          1
    Max replicas:                                          5
    Deployment pods:                                       5 current / 5 desired
    Conditions:
      Type            Status  Reason               Message
      ----            ------  ------               -------
      AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one [...]
      ScalingActive   False   ScalingDisabled      scaling is disabled since the replica count of the target is zero
      ScalingLimited  True    TooManyReplicas      the desired replica count is more than the maximum replica count
    

Trigger the scale-up

To trigger the Deployment to scale up:

  1. Enqueue messages on the Pub/Sub topic:

    for num in {1..20}
    do
      gcloud pubsub topics publish keda-echo --project=${PROJECT_ID} --message="Test"
    done
    
  2. Verify that the Deployment is scaling up:

    kubectl get deployments -n keda-pubsub
    

    In the output, observe that the READY column shows one replica:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    keda-pubsub   1/1     1            1           2d
    

KEDA scales up the Deployment after it observes that the queue is not empty.
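
After the workload drains the queue and the cooldown period elapses, the Deployment returns to zero replicas. As an illustrative sketch (the wait_for_zero helper is hypothetical, and the timeout values are arbitrary), you can poll for this:

```shell
# Poll the keda-pubsub Deployment until it scales back to zero, or time out
# after about ten minutes (KEDA waits out its cooldown period first).
wait_for_zero() {
  for _ in $(seq 1 60); do
    replicas=$(kubectl get deployment keda-pubsub -n keda-pubsub \
      -o jsonpath='{.status.replicas}' 2>/dev/null)
    # An empty or "0" replica count means KEDA scaled the Deployment to zero.
    if [ -z "${replicas}" ] || [ "${replicas}" = "0" ]; then
      echo "Scaled to zero"
      return 0
    fi
    sleep 10
  done
  echo "Timed out waiting for scale-down"
  return 1
}
wait_for_zero
```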

Scale your LLM workload to zero

This section describes a Large Language Model (LLM) workload that deploys an Ollama server with attached GPU. Ollama allows running popular LLMs such as Gemma and Llama 2, and exposes its features primarily through HTTP.

Install KEDA-HTTP add-on

Scaling an HTTP service down to zero Pods during periods of inactivity causes request failures, since there's no backend to handle the requests.

This section shows how to solve this problem using the KEDA-HTTP add-on. KEDA-HTTP starts an HTTP proxy that receives user requests and forwards them to the Services configured to scale-to-zero. When the Service has no Pod, the proxy triggers the Service to scale up, and buffers the request until the Service has scaled up to at least one Pod.

Install the KEDA-HTTP add-on using Helm. For more information, refer to KEDA-HTTP documentation.

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Set the proxy timeout to 120s, giving Ollama time to start.
helm install http-add-on kedacore/keda-add-ons-http \
  --create-namespace --namespace keda \
  --set interceptor.responseHeaderTimeout=120s

Deploy an Ollama LLM workload

To deploy an Ollama LLM workload:

  1. Create a node pool containing g2-standard-4 nodes with attached GPUs, and configure cluster autoscaling to provide between zero and two nodes:

    gcloud container node-pools create gpu --machine-type=g2-standard-4 \
        --location=${LOCATION} --cluster=scale-to-zero \
        --min-nodes 0 --max-nodes 2 --num-nodes=1 --enable-autoscaling
    
  2. Add the official Ollama Helm chart repository, and update your local Helm client's repository:

    helm repo add ollama-helm https://otwld.github.io/ollama-helm/
    helm repo update
    
  3. Deploy the Ollama server using the Helm chart:

    helm install ollama ollama-helm/ollama --create-namespace --namespace ollama \
      -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/ollama/helm-values-ollama.yaml
    

    The helm-values-ollama.yaml configuration specifies the LLM models to load, the GPU requirements, and the TCP port for the Ollama server.

Configure scale-to-zero

To configure your Ollama workload to scale to zero, define an HTTPScaledObject resource, which the KEDA-HTTP add-on uses to manage scaling.

  1. Create the HTTPScaledObject resource to describe the expected autoscaling behavior:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/kubernetes-engine-samples/refs/heads/main/cost-optimization/gke-keda/ollama/keda-ollama-httpscaledobject.yaml
    

    This creates the HTTPScaledObject object that defines the following fields:

    • scaleTargetRef: specifies the Deployment and Service to which KEDA-HTTP forwards requests. In this example, requests with the host ollama.ollama are routed to the Ollama server.
    • scaledownPeriod: specifies how long (in seconds) the add-on waits after the last request before scaling down.
    • replicas: specifies the minimum and maximum number of Pods to maintain for the Ollama deployment.
    • scalingMetric: specifies the metric used to drive autoscaling; in this example, the request rate. For more metric options, see the KEDA-HTTP documentation.

    The created object looks like the following:
    kind: HTTPScaledObject
    apiVersion: http.keda.sh/v1alpha1
    metadata:
        namespace: ollama
        name: ollama
    spec:
        hosts:
        - ollama.ollama
        scaleTargetRef:
            name: ollama
            kind: Deployment
            apiVersion: apps/v1
            service: ollama
            port: 11434
        replicas:
            min: 0
            max: 2
        scaledownPeriod: 3600
        scalingMetric:
            requestRate:
                targetValue: 20
    
  2. Run the following command to verify that KEDA-HTTP has successfully processed the HTTPScaledObject created in the previous step:

    kubectl get hpa,scaledobject -n ollama
    

    The output shows the HorizontalPodAutoscaler (created by KEDA), and the ScaledObject (created by KEDA-HTTP) resources:

    NAME                                                  REFERENCE           TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
    horizontalpodautoscaler.autoscaling/keda-hpa-ollama   Deployment/ollama   0/100 (avg)   1         2         1          2d
    
    NAME                          SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS        AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
    scaledobject.keda.sh/ollama   apps/v1.Deployment   ollama            0     2     external-push                    True    False    False      Unknown   2d
    
  3. Verify that the Deployment scales down to zero Pods.

    Wait for the period of time set in the scaledownPeriod field, and then run the following command:

    kubectl get deployments -n ollama
    

    The output shows that KEDA scaled down the Ollama deployment, and that no Pods are running:

    NAME     READY   UP-TO-DATE   AVAILABLE   AGE
    ollama   0/0     0            0           2d
    

Trigger the scale-up

To trigger the Deployment to scale up, call the Ollama service through the proxy that the KEDA-HTTP add-on set up. This causes the request rate metric to increase, which triggers the creation of a first Pod.

Because the proxy is not exposed externally, use the kubectl port-forwarding capability to access it.

kubectl port-forward svc/keda-add-ons-http-interceptor-proxy -n keda 8080:8080 &

# Set the 'Host' HTTP header so that the proxy routes requests to the Ollama server.
curl -H "Host: ollama.ollama" \
  http://localhost:8080/api/generate \
  -d '{ "model": "gemma:7b", "prompt": "Hello!" }'

The curl command sends the prompt "Hello!" to a Gemma model. Observe the answer tokens coming back in the response. For the specification of the API, see the Ollama guide.