Configure custom scaling controls for services

By default, Cloud Run optimizes for high performance with a utilization target of 60% for both CPU and concurrency, and scales the number of instances automatically to handle all incoming requests. However, for some use cases, you might want the ability to configure which scaling factors to use, such as CPU only, and set custom targets for utilization.

Cloud Run provides scaling controls that give you more ownership of your service's scaling behavior, so you can scale your workload according to your requirements. You can opt in to enhanced scaling behavior while retaining the default utilization targets, or configure the following custom utilization targets:

  • Target utilization for CPU-based scaling
  • Target utilization for concurrency-based scaling

With scaling controls, you can optimize costs and improve predictability for your services. For more information about the default autoscaling behavior of Cloud Run services, see About instance autoscaling in Cloud Run services.

Configuration limits

The following limits apply to custom scaling targets:

Scaling driver                   Default   Minimum configurable   Maximum configurable
CPU target utilization           60%       10%                    95%
Concurrency target utilization   60%       10%                    95%
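As a minimal sketch of these limits, the following POSIX shell function checks a value before you pass it to gcloud. The function name is_valid_scaling_target is hypothetical and not part of the gcloud CLI. It assumes the decimal form used by the flags on this page (0.1 to 0.95, at most two digits after the decimal point) and is stricter than necessary in that it requires a leading "0.".

```shell
# Hypothetical helper, not part of the gcloud CLI.
# Succeeds (exit status 0) only for values from 0.1 to 0.95 with at most
# two digits after the decimal point.
is_valid_scaling_target() {
  # Accept only "0." followed by one or two digits.
  case "$1" in
    0.[0-9]|0.[0-9][0-9]) ;;
    *) return 1 ;;
  esac
  # The shell has no floating-point math, so compare the range with awk.
  awk -v v="$1" 'BEGIN { exit !(v + 0 >= 0.1 && v + 0 <= 0.95) }'
}
```

For example, `is_valid_scaling_target 0.75 && echo ok` prints ok, while 0.96, 0.05, and 0.755 are rejected.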

Opt in to enhanced scaling behavior

After you opt in, Cloud Run's autoscaler responds closely to the configured targets, even for services with a small number of instances. Consider opting in for improved scaling predictability, even if you intend to keep the default utilization targets of 60% for both CPU and concurrency.

To opt in, you can use the gcloud CLI or YAML when you deploy a new revision.

Any configuration change leads to the creation of a new revision. Subsequent revisions will also automatically get this configuration setting unless you make explicit updates to change it.

gcloud

Set the target CPU utilization and the target concurrency utilization values of a given revision by running the following gcloud beta run services update command:

gcloud beta run services update SERVICE --scaling-cpu-target=0.6 \
--scaling-concurrency-target=0.6

Replace SERVICE with the name of your service.

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml

  2. Add the run.googleapis.com/scaling-cpu-target and run.googleapis.com/scaling-concurrency-target attributes:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      annotations:
        run.googleapis.com/launch-stage: BETA
      name: SERVICE
    spec:
      template:
        metadata:
          annotations:
            run.googleapis.com/scaling-cpu-target: '0.6'
            run.googleapis.com/scaling-concurrency-target: '0.6'

    Replace SERVICE with the name of your service.

  3. Create or update the service using the following command:

    gcloud run services replace service.yaml

Configure custom targets

Define custom utilization targets to optimize costs or improve performance for your workloads by configuring specific CPU and concurrency utilization targets within the configuration limits.

Any configuration change leads to the creation of a new revision. Subsequent revisions will also automatically get this configuration setting unless you make explicit updates to change it.

You can configure scaling controls using the gcloud CLI or YAML when you deploy a new revision.

gcloud

Update the target CPU utilization and the target concurrency utilization values of a given revision by running the gcloud beta run services update command.

  • To update the target CPU utilization, run the following command:

    gcloud beta run services update SERVICE --scaling-cpu-target=CPU_TARGET

    Replace the following:

    • SERVICE: the name of your service.

    • CPU_TARGET: the target for CPU utilization. Specify a value from 0.1 to 0.95. You can only configure up to two digits after the decimal point.

  • To update the target concurrency utilization, run the following command:

    gcloud beta run services update SERVICE --scaling-concurrency-target=CONCURRENCY_TARGET

    Replace the following:

    • SERVICE: the name of your service.

    • CONCURRENCY_TARGET: the target for concurrency utilization. Specify a value from 0.1 to 0.95. You can only configure up to two digits after the decimal point.

  • To update both the target CPU and the concurrency utilization, run the following command:

    gcloud beta run services update SERVICE --scaling-cpu-target=CPU_TARGET \
    --scaling-concurrency-target=CONCURRENCY_TARGET

    Replace the following:

    • SERVICE: the name of your service.
    • CPU_TARGET: the target for CPU utilization. Specify a value from 0.1 to 0.95. You can only configure up to two digits after the decimal point.
    • CONCURRENCY_TARGET: the target for concurrency utilization. Specify a value from 0.1 to 0.95. You can only configure up to two digits after the decimal point.
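The following sketch shows the commands above with concrete values filled in. The helper print_scaling_update and the service name my-service are hypothetical. The helper only prints the command so that you can review the flags before running it yourself.

```shell
# Hypothetical helper: prints the update command instead of running it.
# Arguments: service name, CPU target, concurrency target.
print_scaling_update() {
  printf 'gcloud beta run services update %s --scaling-cpu-target=%s --scaling-concurrency-target=%s\n' \
    "$1" "$2" "$3"
}

# Example: a service named "my-service" (an assumed name) with a 75% CPU
# target and a 50% concurrency target.
print_scaling_update my-service 0.75 0.5
```

Both values are within the configurable range of 0.1 to 0.95 and use at most two digits after the decimal point.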

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml

  2. To update the target CPU and concurrency utilization, add the run.googleapis.com/scaling-cpu-target and run.googleapis.com/scaling-concurrency-target attributes:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      annotations:
        run.googleapis.com/launch-stage: BETA
      name: SERVICE
    spec:
      template:
        metadata:
          annotations:
            run.googleapis.com/scaling-cpu-target: 'CPU_TARGET'
            run.googleapis.com/scaling-concurrency-target: 'CONCURRENCY_TARGET'

    Replace the following:

    • SERVICE: the name of your service.
    • CPU_TARGET: the target for CPU utilization. Specify a value from 0.1 to 0.95. You can only configure up to two digits after the decimal point.
    • CONCURRENCY_TARGET: the target for concurrency utilization. Specify a value from 0.1 to 0.95. You can only configure up to two digits after the decimal point.

  3. Create or update the service using the following command:

    gcloud run services replace service.yaml

Disable scaling controls

You can disable either the CPU utilization target or the concurrency utilization target, but not both, because one scaling driver must always be active. When you disable a scaling driver, Cloud Run ignores that metric when making scaling decisions. To opt out of scaling controls entirely, restore the default utilization values instead of disabling a driver.
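As a minimal sketch of this rule, the following POSIX shell function rejects a pair of settings in which both drivers are disabled. The function name check_scaling_drivers is hypothetical. It only inspects the two values that you intend to pass to the flags and doesn't call gcloud.

```shell
# Hypothetical pre-flight check, not part of the gcloud CLI.
# Arguments: intended CPU target, intended concurrency target.
# Fails when both drivers would be disabled at the same time.
check_scaling_drivers() {
  if [ "$1" = "disabled" ] && [ "$2" = "disabled" ]; then
    echo "error: at least one scaling driver must stay active" >&2
    return 1
  fi
}
```

For example, `check_scaling_drivers 0.6 disabled` succeeds, and `check_scaling_drivers disabled disabled` fails with an error message.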

You can disable scaling controls using the gcloud CLI or YAML when you deploy a new revision.

gcloud

You can disable either the target CPU utilization or the target concurrency utilization by running the gcloud beta run services update command.

  • To scale only by CPU, disable the concurrency target by running the following command:

    gcloud beta run services update SERVICE --scaling-concurrency-target=disabled

    Replace SERVICE with the name of your service.

  • To scale only by concurrency, disable the CPU target by running the following command:

    gcloud beta run services update SERVICE --scaling-cpu-target=disabled

    Replace SERVICE with the name of your service.

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml

  2. To scale only by CPU, disable the concurrency target by setting the run.googleapis.com/scaling-concurrency-target attribute to disabled:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      annotations:
        run.googleapis.com/launch-stage: BETA
      name: SERVICE
    spec:
      template:
        metadata:
          annotations:
            run.googleapis.com/scaling-concurrency-target: disabled

    Replace SERVICE with the name of your service.

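  3. Alternatively, to scale only by concurrency, disable the CPU target instead by setting the run.googleapis.com/scaling-cpu-target attribute to disabled. Don't combine this step with the previous one, because one scaling driver must always be active: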

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      annotations:
        run.googleapis.com/launch-stage: BETA
      name: SERVICE
    spec:
      template:
        metadata:
          annotations:
            run.googleapis.com/scaling-cpu-target: disabled

    Replace SERVICE with the name of your service.

  4. Create or update the service using the following command:

    gcloud run services replace service.yaml

Restore to default values

When you restore the target CPU or the target concurrency utilization values to default, you opt out of the scaling controls feature. You can restore scaling controls to default using the gcloud CLI or YAML when you deploy a new revision.

gcloud

Restore the target CPU utilization and the target concurrency utilization to their defaults by running the gcloud beta run services update command.

  • To restore the target CPU utilization to its default value, run the following command:

    gcloud beta run services update SERVICE --scaling-cpu-target=default

    Replace SERVICE with the name of your service.

  • To restore the target concurrency utilization to its default value, run the following command:

    gcloud beta run services update SERVICE --scaling-concurrency-target=default

    Replace SERVICE with the name of your service.

  • To restore both the target CPU utilization and the target concurrency to their default values, run the following command:

    gcloud beta run services update SERVICE --scaling-cpu-target=default \
    --scaling-concurrency-target=default

    Replace SERVICE with the name of your service.

YAML

  1. If you are creating a new service, skip this step. If you are updating an existing service, download its YAML configuration:

    gcloud run services describe SERVICE --format export > service.yaml

  2. To restore CPU and concurrency utilization to their default targets, remove the run.googleapis.com/scaling-cpu-target and run.googleapis.com/scaling-concurrency-target attributes from your YAML file:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      annotations:
        run.googleapis.com/launch-stage: BETA
      name: SERVICE
    spec:
      template:
        metadata:
          # Remove the scaling target annotations to restore defaults
        ...

    Replace SERVICE with the name of your service.

  3. Create or update the service using the following command:

    gcloud run services replace service.yaml

View scaling configuration

You can view your scaling configuration using the Google Cloud console or the gcloud CLI.

Console

  1. In the Google Cloud console, go to the Cloud Run Services page:

    Go to Cloud Run

  2. Click your service to open the Service details panel.

  3. Click the Revisions tab.

  4. In the details panel at the right, view the Autoscaling metrics setting listed under the Containers tab.

gcloud

  1. Use the following command:

    gcloud run services describe SERVICE

    Replace SERVICE with the name of your service.

  2. Locate the values of the Target CPU utilization: and Target concurrency utilization: settings in the returned configuration.
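If you prefer to inspect the exported YAML, the following sketch filters it for the two scaling annotations. The function name show_scaling_annotations is hypothetical, and the sketch assumes that the annotations appear in the exported configuration in the same form as in the YAML examples on this page.

```shell
# Hypothetical helper: reads service YAML on standard input and prints only
# the scaling target annotations, with leading indentation removed.
show_scaling_annotations() {
  grep -E 'run\.googleapis\.com/scaling-(cpu|concurrency)-target' |
    sed 's/^[[:space:]]*//'
}

# Usage against a deployed service (requires gcloud and an existing service):
#   gcloud run services describe SERVICE --format export | show_scaling_annotations
```

If neither annotation is present, the function prints nothing, which is consistent with a service that uses the default targets.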

Best practices

You can optimize costs and prevent overscaling by raising utilization targets so that your service runs fewer instances, or you can improve performance by lowering targets so that your service scales more aggressively in response to specific drivers. To determine the optimal utilization targets for your workload, use the following strategies:

  • Before adjusting targets, identify which metric is triggering your service to scale. Follow these steps to identify the scaling metric:

    1. Go to Metrics Explorer in the Google Cloud console to review the monitoring chart for your CPU and concurrency utilization.

    2. Search for and select the run.googleapis.com/scaling/recommended_instances metric, and set Aggregation to Unaggregated to view the metric grouped by scaling driver.

    The driver with the highest value is the one controlling your service's instance count. If you want a different driver to take priority, or if you want to scale more or less aggressively, adjust the utilization target for that specific driver.

  • Adjust targets incrementally, and wait for a few minutes between adjustments to observe the effect on performance.

  • Use traffic splitting to test new scaling targets by directing a small percentage of your traffic to a separate revision before rolling them out to your entire service.
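The first strategy above comes down to comparing the recommended instance count per scaling driver and picking the largest. As a minimal sketch, assume that you copied the values from Metrics Explorer into lines of the form "driver value". The labels cpu and concurrency and the numbers below are illustrative assumptions, not actual metric output.

```shell
# Hypothetical helper: reads "driver value" pairs on standard input and
# prints the driver with the highest value. On a tie, the first driver wins.
controlling_driver() {
  awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; driver = $1 }
       END { if (NR > 0) print driver }'
}

# Illustrative values only: CPU recommends 4 instances, concurrency 7.
printf '%s\n' 'cpu 4' 'concurrency 7' | controlling_driver
```

With these sample values the function prints concurrency, which means the concurrency target is the one to adjust if you want to change the instance count.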

About low utilization targets

Lowering your utilization target to the minimum of 0.1 (10%) significantly changes how your service scales.

Benefits of setting a low utilization target include:

  • High service availability: Your service scales up much earlier, maintaining a large buffer of idle capacity to handle sudden traffic spikes without latency hits.

  • Faster scaling at low instance counts: Services scale more reliably before hitting high-utilization bottlenecks.

Drawbacks of setting low utilization targets include:

  • Potential for cost increase: You run more instances than strictly necessary for your current load, leading to higher billing.
  • More frequent scaling decisions: With a lower utilization target, Cloud Run has a lower tolerance for deviations from the target and doesn't wait as long before scaling, so the instance count changes more often.
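To get a rough sense of the cost trade-off, the following sketch uses a deliberately simplified model: the instance count is the load, expressed as a number of fully busy instances, divided by the target utilization and rounded up. This model is an assumption for illustration only. It is not Cloud Run's actual autoscaling algorithm, and the function name estimate_instances is hypothetical.

```shell
# Simplified model (an assumption, not Cloud Run's actual algorithm):
#   instances = ceil(busy_instances * 100 / target_percent)
# Arguments: load as a whole number of fully busy instances, target in percent.
estimate_instances() {
  echo $(( ($1 * 100 + $2 - 1) / $2 ))
}

estimate_instances 3 60   # default target of 60%: prints 5
estimate_instances 3 10   # minimum target of 10%: prints 30
```

Under this model, the same load needs six times as many instances at a 10% target as at the default 60% target, which illustrates both the larger idle buffer and the higher cost.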

What's next