Run agents with Gemma 4 models on Cloud Run

This guide describes how to deploy Gemma 4 open models on Cloud Run using a prebuilt container with vLLM inference library, and provides guidance on using the deployed Cloud Run service with AI agents built using Agent Development Kit.

Gemma 4 is Google's most efficient open-weight model family, delivering strong reasoning and agentic capabilities.

Long context, multimodality, reasoning, and tool calling allow Gemma 4 to handle complex logic, multi-step planning, coding, and agentic workflows.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  6. Set up your Cloud Run development environment in your Google Cloud project.
  7. Install and initialize the gcloud CLI.
  8. Ensure that you have the required IAM roles granted to your account.
  9. Learn how to grant the roles

    Console

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address that is used to deploy the Cloud Run service.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

    gcloud

    To grant the required IAM roles to your account on your project:

            gcloud projects add-iam-policy-binding PROJECT_ID \
                --member=PRINCIPAL \
                --role=ROLE
            

    Replace:

    • PROJECT_ID with your Google Cloud project ID.
    • PRINCIPAL with the account you are adding the binding for. This is typically the email address that is used to deploy the Cloud Run service.
    • ROLE with the role you are adding to the deployer account.
  10. If necessary, request the Total Nvidia RTX Pro 6000 GPU allocation, in milli GPU, without zonal redundancy, per project per region quota under Cloud Run Admin API on the Quotas and system limits page.
  11. Review the Cloud Run pricing page. To generate a cost estimate based on your projected usage, use the pricing calculator.

Deploy a Gemma 4 model with a vLLM container

Gemma 4 offers advanced agentic capabilities, including reasoning, function calling, code generation, and structured output.

Agent Development Kit (ADK) helps you build fully-functional AI agents with Gemma 4.

Use vLLM to serve Gemma as an OpenAI API endpoint. vLLM provides fast and efficient serving for generative models at scale, featuring state-of-the-art serving throughput, efficient memory management with PagedAttention, continuous batching of incoming requests, quantization support, and optimized CUDA kernels.

To deploy Gemma models on Cloud Run, use the following gcloud CLI command with the recommended settings:

CONTAINER_ARGS=(
    "serve"
    "MODEL_NAME"
    "--enable-chunked-prefill"
    "--enable-prefix-caching"
    "--generation-config=auto"
    "--enable-auto-tool-choice"
    "--tool-call-parser=gemma4"
    "--reasoning-parser=gemma4"
    "--dtype=bfloat16"
    "--max-num-seqs=64"
    "--gpu-memory-utilization=0.95"
    "--tensor-parallel-size=1"
    "--port=8080"
    "--host=0.0.0.0"
)
gcloud beta run deploy SERVICE_NAME \
    --image "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
    --project PROJECT \
    --region REGION \
    --execution-environment gen2 \
    --no-allow-unauthenticated \
    --cpu 20 \
    --memory 80Gi \
    --gpu 1 \
    --gpu-type nvidia-rtx-pro-6000 \
    --no-gpu-zonal-redundancy \
    --no-cpu-throttling \
    --max-instances 3 \
    --concurrency 64 \
    --timeout 600 \
    --startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=1,timeoutSeconds=240,periodSeconds=240 \
    --command "vllm" \
    --args=$(IFS=','; echo "${CONTAINER_ARGS[*]}")

Replace:

  • SERVICE_NAME with a unique name for the Cloud Run service.
  • PROJECT with your Google Cloud project ID.
  • REGION with a Google Cloud region where nvidia-rtx-pro-6000 GPUs are supported for Cloud Run, such as us-central1. For a full list of supported regions for GPU-enabled deployments, see GPU configuration.

  • MODEL_NAME with the full name of a Gemma 4 variant.

    • Gemma 4 2B: google/gemma-4-E2B-it
    • Gemma 4 4B: google/gemma-4-E4B-it
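The `--args` flag expects the container arguments as a single comma-separated string, which is what the shell snippet produces by joining the CONTAINER_ARGS array with `IFS=','`. As a minimal sketch of the same joining logic in Python (the argument list below is abbreviated, and MODEL_NAME is a placeholder):

```python
# Sketch: build the comma-separated --args value that the shell
# snippet produces with IFS=','.
container_args = [
    "serve",
    "MODEL_NAME",  # placeholder: replace with a Gemma 4 variant
    "--max-num-seqs=64",
    "--port=8080",
]

args_value = ",".join(container_args)
print(args_value)  # serve,MODEL_NAME,--max-num-seqs=64,--port=8080
```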

The other settings are as follows:

  • --concurrency: The maximum number of requests that can be processed simultaneously by a given instance, such as 64. See Set concurrency for optimal performance for recommendations on optimal request latency.
  • --cpu: The amount of allocated CPU for your service, such as 20.
  • --set-env-vars: The environment variables set for your service. For example, HF_TOKEN="...".
  • --gpu: The number of GPUs for your service, such as 1.
  • --gpu-type: The type of GPU to use for your service, such as nvidia-rtx-pro-6000.
  • --max-instances: The maximum number of container instances for your service, such as 3.
  • --memory: The amount of allocated memory for your service, such as 80Gi.
  • --no-invoker-iam-check: Disables invoker IAM checks. See the Secure Cloud Run services tutorial for recommendations on how to better secure your app.
  • --no-cpu-throttling: Disables CPU throttling when the container is not actively serving requests.
  • --timeout: The time within which a response must be returned, such as 600 seconds.
  • --startup-probe: Comma-separated settings for the startup probe, in the form KEY=VALUE. See Cloud Run startup probes for details. With the model sizes of Gemma 4, if you are not using Direct VPC egress, it's recommended to set your startup probe timeout to at least 240 seconds.
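The startup probe value is a flat list of KEY=VALUE pairs. As a quick sketch of how the probe string from the deploy command above decomposes:

```python
# Probe string copied from the deploy command above.
probe = ("tcpSocket.port=8080,initialDelaySeconds=240,"
         "failureThreshold=1,timeoutSeconds=240,periodSeconds=240")

# Split the comma-separated KEY=VALUE pairs into a dict.
settings = dict(pair.split("=", 1) for pair in probe.split(","))
print(settings["initialDelaySeconds"])  # 240
```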

If you need to modify the default settings or add more customized settings to your Cloud Run service, see Configure services.

When the deployment completes, a success message is displayed along with the Cloud Run endpoint URL, which ends with run.app.

Test the deployed Gemma service with curl

Now that you have deployed the Gemma service, you can send requests to it. However, if you send a request directly, Cloud Run responds with HTTP 401 Unauthorized. This is intentional, because an LLM inference API is intended for other services to call, such as a front-end application. For more information on service-to-service authentication on Cloud Run, refer to Authenticating service-to-service.

To send requests to the Gemma service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:

  1. Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:

    gcloud run services proxy SERVICE_NAME \
      --project PROJECT \
      --region REGION \
      --port=9090
  2. Run the following command to send a request in a separate terminal tab, leaving the proxy running. The proxy runs on localhost:9090. Specify the Gemma model you previously used:

    curl http://localhost:9090/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "MODEL_NAME",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "chat_template_kwargs": {
             "enable_thinking": true
         },
         "skip_special_tokens": false
      }'

    This command should provide an output similar to this:

    {
     "id": "chatcmpl-9cf1ab1450487047",
     "object": "chat.completion",
     "created": 1774904187,
     "model": "google/gemma-4-E2B-it",
     "choices": [
       {
         "index": 0,
         "message": {
           "role": "assistant",
           "content": "The short answer is a phenomenon called **Rayleigh scattering**...",
           "function_call": null,
           "tool_calls": [],
           "reasoning": "*   Question: \"Why is the sky blue?\"\n..."
         },
         "finish_reason": "stop",
         "stop_reason": 106
       }
     ],
     "usage": {
       "prompt_tokens": 21,
       "total_tokens": 877,
       "completion_tokens": 856
     }
    }
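The same request can be sent from Python instead of curl. The sketch below builds the request with the standard library, assuming the developer proxy from step 1 is still running on localhost:9090; the actual send is left commented so you can adapt it:

```python
import json
import urllib.request

# Mirror the curl payload; replace MODEL_NAME with your deployed variant.
payload = {
    "model": "MODEL_NAME",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {"enable_thinking": True},
    "skip_special_tokens": False,
}

req = urllib.request.Request(
    "http://localhost:9090/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the proxy running, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```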
    

Set concurrency for optimal performance

This section provides context on the recommended concurrency settings. For optimal request latency, ensure the --concurrency setting is equal to vLLM's --max-num-seqs command line argument.

  • --max-num-seqs determines how many sequences (requests) each vLLM instance can handle concurrently during inference.
  • --concurrency determines how many requests Cloud Run sends to a vLLM instance at the same time.

If --concurrency exceeds --max-num-seqs, Cloud Run can send more requests to a vLLM instance than it has available request slots for. This leads to request queuing within vLLM, increasing request latency for the queued requests. It also leads to less responsive auto scaling, as the queued requests don't trigger Cloud Run to scale out and start new instances.

To completely avoid request queuing on the vLLM instance, you should set --concurrency to match --max-num-seqs.

Note that increasing --max-num-seqs also makes parallel requests take longer and requires more GPU memory for the KV cache.

Optimize utilization

For optimal GPU utilization, increase --concurrency, keeping it within twice the value of --max-num-seqs. While this leads to request queuing in vLLM, it can help improve utilization: vLLM instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
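The interaction between the two settings can be sketched numerically (a hypothetical helper for illustration, not part of vLLM or Cloud Run):

```python
def queued_requests(concurrency: int, max_num_seqs: int) -> int:
    """Requests that wait in vLLM's internal queue when Cloud Run
    sends `concurrency` simultaneous requests to one instance."""
    return max(0, concurrency - max_num_seqs)

# Matched settings: every request gets a slot immediately.
print(queued_requests(64, 64))   # 0
# Concurrency at twice --max-num-seqs: half the requests queue,
# which can raise utilization at the cost of added latency.
print(queued_requests(128, 64))  # 64
```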

Build AI agents with Agent Development Kit using Gemma 4

After you have deployed your Cloud Run service, you can use the Cloud Run endpoint with Gemma 4 to create AI agents with the Agent Development Kit.

Before you use the Agent Development Kit, ensure that incoming requests pass the appropriate identity token. To learn more about using IAM authentication and Cloud Run, see Authenticating service-to-service.

The following example shows how to use the Agent Development Kit in Python with IAM authentication:

import subprocess
from google.adk.models.lite_llm import LiteLlm
from google.adk.agents import Agent

# Get the identity token using gcloud
id_token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True
).stdout.strip()

gemma_model = LiteLlm(
    # Replace MODEL_NAME with your Gemma 4 variant and
    # YOUR_CLOUD_RUN_SERVICE_URL with your Cloud Run service URL.
    model='openai/MODEL_NAME',
    base_url='https://YOUR_CLOUD_RUN_SERVICE_URL/v1',
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True
        },
        "skip_special_tokens": False
    },
    extra_headers={
        "Authorization": f"Bearer {id_token}",
    },
)

root_agent = Agent(
    model=gemma_model,
    name='assistant',
    instruction="You are a helpful assistant",
)

Clean up

To avoid incurring charges, delete the Google Cloud resources that you created in this guide, such as the Cloud Run service.

What's next