This guide describes how to deploy Gemma 4 open models on Cloud Run using a prebuilt container with the vLLM inference library, and provides guidance on using the deployed Cloud Run service with AI agents built with the Agent Development Kit (ADK).
Gemma 4 is Google's most efficient open-weight model family, delivering strong reasoning and agentic capabilities.
Long context, multimodality, reasoning, and tool calling let Gemma 4 handle complex logic, multi-step planning, coding, and agentic workflows.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project:
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Set up your Cloud Run development environment in your Google Cloud project.
- Install and initialize the gcloud CLI.
- Ensure you have the following IAM roles granted to your account:
  - Cloud Run Admin (roles/run.admin)
  - Project IAM Admin (roles/resourcemanager.projectIamAdmin)
  - Service Usage Consumer (roles/serviceusage.serviceUsageConsumer)
- Learn how to grant the roles:

  Console

  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - Click Grant access.
  - In the New principals field, enter your user identifier. This is typically the email address that is used to deploy the Cloud Run service.
  - In the Select a role list, select a role.
  - To grant additional roles, click Add another role and add each additional role.
  - Click Save.

  gcloud

  To grant the required IAM roles to your account on your project, run:

  ```bash
  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member=PRINCIPAL \
      --role=ROLE
  ```

  Replace:
  - PROJECT_ID with your Google Cloud project ID.
  - PRINCIPAL with the account you are adding the binding for. This is typically the email address that is used to deploy the Cloud Run service.
  - ROLE with the role you are adding to the deployer account.
- If necessary, request "Total Nvidia RTX Pro 6000 GPU allocation, in milli GPU, without zonal redundancy, per project per region" quota under the Cloud Run Admin API on the Quotas and system limits page.
- Review the Cloud Run pricing page. To generate a cost estimate based on your projected usage, use the pricing calculator.
Deploy a Gemma 4 model with a vLLM container
Gemma 4 offers advanced agentic capabilities, including reasoning, function calling, code generation, and structured output.
Agent Development Kit (ADK) helps you build fully-functional AI agents with Gemma 4.
Use vLLM to serve Gemma as an OpenAI API endpoint. vLLM provides fast and efficient serving for generative models at scale, featuring state-of-the-art serving throughput, efficient memory management with PagedAttention, continuous batching of incoming requests, quantization support, and optimized CUDA kernels.
To deploy Gemma models on Cloud Run, use the following gcloud CLI command with the recommended settings:
```bash
CONTAINER_ARGS=(
  "serve" "MODEL_NAME"
  "--enable-chunked-prefill"
  "--enable-prefix-caching"
  "--generation-config=auto"
  "--enable-auto-tool-choice"
  "--tool-call-parser=gemma4"
  "--reasoning-parser=gemma4"
  "--dtype=bfloat16"
  "--max-num-seqs=64"
  "--gpu-memory-utilization=0.95"
  "--tensor-parallel-size=1"
  "--port=8080"
  "--host=0.0.0.0"
)

gcloud beta run deploy SERVICE_NAME \
  --image "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
  --project PROJECT \
  --region REGION \
  --execution-environment gen2 \
  --no-allow-unauthenticated \
  --cpu 20 \
  --memory 80Gi \
  --gpu 1 \
  --gpu-type nvidia-rtx-pro-6000 \
  --no-gpu-zonal-redundancy \
  --no-cpu-throttling \
  --max-instances 3 \
  --concurrency 64 \
  --timeout 600 \
  --startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=1,timeoutSeconds=240,periodSeconds=240 \
  --command "vllm" \
  --args=$(IFS=','; echo "${CONTAINER_ARGS[*]}")
```
Replace:
- SERVICE_NAME with a unique name for the Cloud Run service.
- PROJECT with your Google Cloud project ID.
- REGION with a Google Cloud region where nvidia-rtx-pro-6000 GPUs are supported for Cloud Run, such as us-central1. For a full list of supported regions for GPU-enabled deployments, see GPU configuration.
- MODEL_NAME with the full name of a Gemma 4 variant:
  - Gemma 4 2B: google/gemma-4-E2B-it
  - Gemma 4 4B: google/gemma-4-E4B-it
The other settings are as follows:
| Option | Description |
|---|---|
| --concurrency | The maximum number of requests that can be processed simultaneously by a given instance, such as 64. |
| --cpu | The amount of allocated CPU for your service, such as 20. |
| --set-env-vars | The environment variables set for your service. |
| --gpu | The GPU value for your service, such as 1. |
| --gpu-type | The type of GPU to use for your service, such as nvidia-rtx-pro-6000. |
| --max-instances | The maximum number of container instances for your service, such as 3. |
| --memory | The amount of allocated memory for your service, such as 80Gi. |
| --no-invoker-iam-check | Disables invoker IAM checks. See the Secure Cloud Run services tutorial for recommendations on how to better secure your app. |
| --no-cpu-throttling | Disables CPU throttling when the container is not actively serving requests. |
| --timeout | The time within which a response must be returned, such as 600 seconds. |
| --startup-probe | Comma-separated settings for the startup probe, in the form KEY=VALUE. See Cloud Run startup probes for details. With Gemma 4 model sizes, if you are not using Direct VPC egress, it's recommended to set your startup probe timeout to at least 240 seconds. |
If you need to modify the default settings or add more customized settings to your Cloud Run service, see Configure services.
After the service finishes deploying, a success message is displayed along
with the Cloud Run endpoint URL, which ends with run.app.
Test the deployed Gemma service with curl
Now that you have deployed the Gemma service, you can send
requests to it. However, if you send a request directly, Cloud Run
responds with HTTP 401 Unauthorized. This is intentional, because an LLM
inference API is intended for other services to call, such as a front-end
application. For more information on service-to-service
authentication on Cloud Run, refer to Authenticating service-to-service.
To send requests to the Gemma service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:
- Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:

  ```bash
  gcloud run services proxy SERVICE_NAME \
    --project PROJECT \
    --region REGION \
    --port=9090
  ```

- In a separate terminal tab, leave the proxy running and send a request. The proxy runs on localhost:9090. Specify the Gemma model you previously used:

  ```bash
  curl http://localhost:9090/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MODEL_NAME",
      "messages": [{"role": "user", "content": "Why is the sky blue?"}],
      "chat_template_kwargs": { "enable_thinking": true },
      "skip_special_tokens": false
    }'
  ```

  This command should produce output similar to the following:
```json
{
  "id": "chatcmpl-9cf1ab1450487047",
  "object": "chat.completion",
  "created": 1774904187,
  "model": "google/gemma-4-E2B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The short answer is a phenomenon called **Rayleigh scattering**...",
        "function_call": null,
        "tool_calls": [],
        "reasoning": "* Question: \"Why is the sky blue?\"\n..."
      },
      "finish_reason": "stop",
      "stop_reason": 106
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 877,
    "completion_tokens": 856
  }
}
```
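If you prefer to script the test, the same request can be sent from Python using only the standard library. This is an illustrative sketch, not part of the product: the build_chat_request and chat helper names are assumptions, and it assumes the developer proxy from the previous step is running on localhost:9090.

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, enable_thinking: bool = True) -> dict:
    """Build the JSON body for vLLM's OpenAI-compatible chat completions API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
        "skip_special_tokens": False,
    }

def chat(base_url: str, model: str, prompt: str) -> dict:
    """POST a chat completion request to the service and return the parsed JSON."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the developer proxy from the previous step running on port 9090:
#   result = chat("http://localhost:9090", "MODEL_NAME", "Why is the sky blue?")
#   print(result["choices"][0]["message"]["content"])
```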
Set concurrency for optimal performance
This section provides context on the recommended concurrency settings. For optimal
request latency, ensure the --concurrency setting is equal to vLLM's
--max-num-seqs command line argument.
- --max-num-seqs determines how many sequences (requests) each vLLM instance can handle concurrently during inference.
- --concurrency determines how many requests Cloud Run sends to a vLLM instance at the same time.
If --concurrency exceeds --max-num-seqs, Cloud Run can send
more requests to a vLLM instance than it has available request slots for.
This leads to request queuing within vLLM, increasing request latency for the
queued requests. It also leads to less responsive auto scaling, as the queued
requests don't trigger Cloud Run to scale out and start new instances.
To completely avoid request queuing on the vLLM instance, you should set
--concurrency to match --max-num-seqs.
Note that increasing --max-num-seqs also makes parallel requests take longer and requires more GPU memory for the KV cache.
Optimize utilization
For optimal GPU utilization, increase --concurrency, keeping it within
twice the value of --max-num-seqs. While this leads to request queuing in vLLM, it can help improve utilization: vLLM instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
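To make the tradeoff concrete, the following sketch estimates how many requests wait in a saturated vLLM instance's queue for a given pair of settings (the helper name and example numbers are illustrative, not part of Cloud Run or vLLM):

```python
def queued_requests(concurrency: int, max_num_seqs: int) -> int:
    """Requests waiting in vLLM's internal queue when the instance is saturated.

    Cloud Run sends up to `concurrency` requests to an instance, but vLLM only
    runs `max_num_seqs` of them at once; any excess queues inside vLLM.
    """
    return max(0, concurrency - max_num_seqs)

# Matching the two settings avoids queuing entirely (lowest request latency):
print(queued_requests(concurrency=64, max_num_seqs=64))   # 0

# Setting concurrency up to 2x max-num-seqs trades some queuing for utilization:
print(queued_requests(concurrency=128, max_num_seqs=64))  # 64
```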
Build AI agents with Agent Development Kit using Gemma 4
After you have deployed your Cloud Run service, you can use the Cloud Run endpoint serving Gemma 4 to create AI agents with the Agent Development Kit.
Before you use the Agent Development Kit, ensure that incoming requests pass the appropriate identity token. To learn more about using IAM authentication and Cloud Run, see Authenticating service-to-service.
The following example shows how to use the Agent Development Kit in Python with IAM authentication:
```python
import subprocess

from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

# Get an identity token for the Cloud Run service using gcloud.
id_token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

gemma_model = LiteLlm(
    # Replace MODEL_NAME with the Gemma 4 variant you deployed,
    # for example google/gemma-4-E2B-it.
    model='openai/MODEL_NAME',
    # Replace with your Cloud Run service URL (ending in run.app).
    base_url='https://YOUR_CLOUD_RUN_SERVICE_URL/v1',
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True
        },
        "skip_special_tokens": False
    },
    extra_headers={
        "Authorization": f"Bearer {id_token}",
    },
)

root_agent = Agent(
    model=gemma_model,
    name='assistant',
    instruction="You are a helpful assistant",
)
```
Clean up
Delete the Google Cloud resources that you created, such as the deployed Cloud Run service.
What's next
- For larger models and faster startup times with Run:ai Model Streamer, see the codelab Run inference of Gemma 4 model on Cloud Run with RTX 6000 Pro GPU with vLLM
- Configure GPU
- Best practices: AI inference on Cloud Run with GPUs
- Run Gemma 4 models with various AI runtime frameworks