Offline evaluation allows you to measure the performance, safety, and quality of your agents by analyzing historical data captured during development or production. You can evaluate individual Traces (single execution paths) or full Sessions (multi-turn conversation histories) against a set of predefined or custom metrics.
Traces vs. sessions
- Trace: A factual, immutable record of the agent's behavior, including model inputs, responses, and tool calls. A trace represents a single execution path.
- Session: Encompasses the entire multi-turn interaction between a user and an agent. Use sessions to evaluate context retention and conversational flow over time.
Before you begin
To ensure you have the necessary data and environment for offline evaluation, complete the following:
- Ensure you have a working Agent Runtime deployed with Cloud Trace enabled.
- Set up a Cloud Storage bucket to store evaluation results. You only need to provide this path once; it will be pre-filled for future runs.
- If you plan to use the Agent Platform SDK for evaluation, initialize the client as described in Evaluate your agents.
Telemetry requirements
Offline evaluation requires your agent to export specific OpenTelemetry signals to provide the necessary context for assessment. These requirements are identical to those for Online Monitors:
Invoke agent span: Must include the following attributes:
- gen_ai.agent.name: The identifier for the agent.
- gen_ai.agent.description: A brief description of the agent's purpose.
- gen_ai.conversation.id: A unique identifier for the specific conversation session.
Inference events: The gen_ai.client.inference.operation.details event must capture:
- gen_ai.input.messages: The prompts sent to the agent.
- gen_ai.output.messages: The responses generated by the agent.
- gen_ai.system_instructions: The underlying system prompts.
- gen_ai.tool.definitions: Metadata about any tools available to the agent.
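The required attribute and event-field names above can be checked before you run an evaluation. The following sketch validates that an exported span and its inference event carry every required key; the sample values are illustrative only, and the helper function is a hypothetical convenience, not part of any SDK:

```python
# Required attribute and event-field names from the telemetry
# requirements above. Sample values used later are illustrative only.
REQUIRED_SPAN_ATTRIBUTES = {
    "gen_ai.agent.name",
    "gen_ai.agent.description",
    "gen_ai.conversation.id",
}
REQUIRED_EVENT_FIELDS = {
    "gen_ai.input.messages",
    "gen_ai.output.messages",
    "gen_ai.system_instructions",
    "gen_ai.tool.definitions",
}

def missing_telemetry(span_attributes: dict, event_fields: dict) -> set:
    """Return the set of required keys absent from a span and its event."""
    missing = REQUIRED_SPAN_ATTRIBUTES - span_attributes.keys()
    missing |= REQUIRED_EVENT_FIELDS - event_fields.keys()
    return missing

# Example: a span that omits gen_ai.conversation.id fails the check.
sample_span = {
    "gen_ai.agent.name": "support-agent",          # illustrative value
    "gen_ai.agent.description": "Answers billing questions",
}
sample_event = {k: [] for k in REQUIRED_EVENT_FIELDS}
print(missing_telemetry(sample_span, sample_event))  # → {'gen_ai.conversation.id'}
```

A check like this is useful in a pre-deployment test, because spans missing any of these keys cannot be evaluated offline.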
If you are using the Agent Development Kit, you must enable these telemetry capabilities by setting the following environment variables:
```
OTEL_SEMCONV_STABILITY_OPT_IN='gen_ai_latest_experimental'
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT='EVENT_ONLY'
```
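As one option, you can set these variables in Python before the Agent Development Kit initializes its instrumentation; exporting them in your deployment environment is equivalent. This is a minimal sketch, not ADK-specific API:

```python
import os

# Opt in to the experimental GenAI semantic conventions and capture
# message content as events. These assignments must run before the
# OpenTelemetry instrumentation is initialized in the process.
os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"] = "gen_ai_latest_experimental"
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "EVENT_ONLY"
```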
Recording media in Cloud Storage
If your agent uses multimodal data, such as images or large documents, then we recommend recording the inputs and outputs in a Cloud Storage bucket instead of embedding them directly in trace spans. Configure the following environment variables to enable this:
```
OTEL_INSTRUMENTATION_GENAI_UPLOAD_FORMAT='jsonl'
OTEL_INSTRUMENTATION_GENAI_COMPLETION_HOOK='upload'
OTEL_INSTRUMENTATION_GENAI_UPLOAD_BASE_PATH='gs://STORAGE_BUCKET_NAME/PATH'
```
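The same variables can be set in Python before instrumentation starts. The bucket name and prefix below are hypothetical placeholders; substitute your own:

```python
import os

# Hypothetical bucket name and object prefix, for illustration only.
STORAGE_BUCKET_NAME = "my-agent-telemetry"
PATH = "multimodal-uploads"

# Upload multimodal inputs/outputs to Cloud Storage as JSONL instead of
# embedding them in trace spans. Set these before instrumentation starts.
os.environ["OTEL_INSTRUMENTATION_GENAI_UPLOAD_FORMAT"] = "jsonl"
os.environ["OTEL_INSTRUMENTATION_GENAI_COMPLETION_HOOK"] = "upload"
os.environ["OTEL_INSTRUMENTATION_GENAI_UPLOAD_BASE_PATH"] = (
    f"gs://{STORAGE_BUCKET_NAME}/{PATH}"
)
```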
For more information, see Collect multimodal prompts and responses.
Create an evaluation from the registry
- In the Google Cloud console, navigate to the Agent Platform > Agents > Evaluation page.
- Click New evaluation.
- Select the Traces or Sessions tab based on your assessment goal.
- Use the filter icon and time picker to filter the data (for example, by Version or by "Last 2 weeks"), and then select the specific IDs that you want to evaluate.
- Click Continue.
- (Optional) In the Evaluation name field, enter a name for the assessment or use the pre-filled default.
- In the Output private data path field, enter your Cloud Storage bucket URI. After the first use, this path is pre-filled for future runs.
- By default, all four core metrics are added. You can add or remove metrics as needed.
- Click Evaluate agent.
Evaluate a single trace or session
You can trigger evaluations directly while inspecting individual execution paths:
- In the Google Cloud console, navigate to the Agent Platform > Agents page.
- In the left navigation menu, select Deployments.
- Select your agent.
- Select the Traces tab.
- Click Session view or Trace view to inspect the execution path.
- Select a specific row from the table to open the details panel.
- Select the Evaluation tab.
- If the trace or session has not been evaluated, click Evaluate to run an ad-hoc assessment.
View evaluation results
After the evaluation completes, you can analyze the results to identify performance gaps and systemic issues:
- View results for a run: In the Google Cloud console, go to the Agent Platform > Agents > Evaluation page and select the Evaluations tab. Click an evaluation name to view the detailed report.
- Drill down to traces: From a results report, click any row to navigate directly to the associated trace and inspect the reasoning (rationales) behind the scores.
For more information, see Analyze evaluation results.