This document describes how to use agent evaluation to measure and improve the performance, safety, and quality of your agents.
To learn more about model evaluation, see Gen AI evaluation service overview.
Procedure summary
| Phase | Activity | Goal |
|---|---|---|
| Design | Define eval cases | Specify agent tasks and expected outcomes. |
| Execution | Run inferences | Generate real-world or simulated conversation traces. |
| Scoring | Compute metrics | Grade traces using automated raters (Task Success, Safety). |
| Refinement | Optimize agent | Propose and verify improvements to instructions or tools. |
Evaluation process
Evaluation follows a structured, iterative workflow:
- Define eval cases: An eval case is a specification that defines an agent's task. An eval case can include one or more conversation steps, the conversation context (the agent's state), and a specification for simulating user responses during inference.
- Run inferences: Inference is the execution of an eval case. If an eval case contains a conversation plan, user responses are simulated during inference.
- Generate traces: Each inference run captures the agent's behavior in a trace. A trace is a factual, immutable record of the agent's behavior, including model inputs, responses, and tool calls.
- Compute metrics: Metrics are scores computed for each trace using prebuilt or custom raters. Some metrics, like Exact Match, are reference-based and require an eval case with a reference answer. Others, like Helpfulness, are reference-free and evaluate the trace on its own. This automated evaluation lets you score traces captured from production traffic or external logs, independent of a managed test environment.
- Conduct analysis: Analyze metrics, rubrics, and verdicts to identify key agent issues, trace each issue back to the eval cases that exposed it, and generate insights for improvement.
- Optimize the agent: Use optimization to manage the entire evaluation cycle. This automated process analyzes results, proposes improvements to the agent, and iteratively reruns the evaluation to verify performance gains.
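The relationship between eval cases, traces, and metrics in the list above can be sketched in plain Python. This is a minimal illustration, not the evaluation service's API: `EvalCase`, `Trace`, `run_inference`, `exact_match`, `non_empty_response`, and `toy_agent` are all hypothetical names introduced here, and the reference-free metric is a trivial stand-in for a rater such as Helpfulness.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical eval case: a task plus an optional reference answer."""
    case_id: str
    prompt: str
    reference: Optional[str] = None


@dataclass(frozen=True)
class Trace:
    """Hypothetical immutable record of one inference run."""
    case_id: str
    model_input: str
    response: str
    tool_calls: tuple = ()


def run_inference(case: EvalCase, agent: Callable[[str], str]) -> Trace:
    """Execute an eval case against an agent and capture the result as a trace."""
    return Trace(case_id=case.case_id, model_input=case.prompt, response=agent(case.prompt))


def exact_match(trace: Trace, case: EvalCase) -> float:
    """Reference-based metric: needs the eval case's reference answer."""
    if case.reference is None:
        raise ValueError("exact_match needs a reference answer")
    return 1.0 if trace.response.strip() == case.reference.strip() else 0.0


def non_empty_response(trace: Trace) -> float:
    """Reference-free stand-in metric: scores the trace on its own."""
    return 1.0 if trace.response.strip() else 0.0


def toy_agent(prompt: str) -> str:
    # Stand-in agent for illustration only; it knows a single answer.
    return "4" if prompt == "What is 2 + 2?" else ""


cases = [
    EvalCase("c1", "What is 2 + 2?", reference="4"),
    EvalCase("c2", "Name the capital of France.", reference="Paris"),
]
traces = [run_inference(case, toy_agent) for case in cases]
scores = {
    case.case_id: {
        "exact_match": exact_match(trace, case),
        "non_empty": non_empty_response(trace),
    }
    for case, trace in zip(cases, traces)
}
print(scores)
```

Because the reference-free metric takes only a trace, the same function could score traces that were captured elsewhere, such as production logs, without an accompanying eval case.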
Evaluation workflow
You can integrate evaluation into two main stages of your workflow:
- Local development iteration: Evaluate an agent built with Agent Development Kit (ADK) locally to rapidly iterate on prompt engineering and tool configurations.
- Deployed agent assessment: Measure the quality of deployed agents by analyzing historical traces or running synthetic benchmarks against agent endpoints.
Core capabilities
Agent evaluation helps you build an initial evaluation suite, even without existing test data. The following features help automate the process of generating test cases and refining your agentic systems:
- Scenario generation and user simulation: Automatically generate diverse, multi-turn synthetic test scenarios based on your agent's instructions and tool definitions. This automation allows you to start testing immediately by eliminating the need to manually author initial test cases.
- Environment simulation: Intercept specific tool calls to inject custom behaviors, mocked data, or simulated errors (such as HTTP 503 errors or latency spikes). This simulation lets you validate agent resilience without impacting production backends.
- Multi-turn evaluation: Automatically evaluate entire conversation histories using multi-turn autoraters. These raters analyze intent extraction, dynamically generate rubrics, and provide objective validation verdicts to help ensure instruction adherence.
- Prompt optimization: Programmatically generate and validate refined system instructions by using prompt optimization. The optimization framework identifies points of failure and iteratively proposes targeted updates.
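To make the environment simulation idea concrete, the following sketch shows one way a tool call could be intercepted to inject a simulated HTTP 503 error. It is an illustration under stated assumptions, not the product's mechanism: `ToolInterceptor`, `SimulatedToolError`, `get_weather`, and `agent_step` are hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict


class SimulatedToolError(Exception):
    """Hypothetical error standing in for a failed backend call."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status


class ToolInterceptor:
    """Routes a tool call to a registered mock, or to the real tool otherwise."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self._tools = dict(tools)
        self._mocks: Dict[str, Callable[..., Any]] = {}

    def mock(self, name: str, behavior: Callable[..., Any]) -> None:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        self._mocks[name] = behavior

    def call(self, name: str, **kwargs: Any) -> Any:
        handler = self._mocks.get(name, self._tools[name])
        return handler(**kwargs)


def get_weather(city: str) -> str:
    # Stand-in for a real backend call.
    return f"Sunny in {city}"


def unavailable(**kwargs: Any) -> str:
    # Injected behavior: always fail as if the backend returned HTTP 503.
    raise SimulatedToolError(503, "Service Unavailable")


def agent_step(tools: ToolInterceptor, city: str) -> str:
    """Toy agent step that degrades gracefully when the tool fails."""
    try:
        return tools.call("get_weather", city=city)
    except SimulatedToolError as err:
        return f"Weather lookup is unavailable right now (status {err.status})."


tools = ToolInterceptor({"get_weather": get_weather})
normal = agent_step(tools, "Oslo")

tools.mock("get_weather", unavailable)
degraded = agent_step(tools, "Oslo")

print(normal)
print(degraded)
```

Only the mocked tool is rerouted, so the real `get_weather` backend is never touched while the failure path is exercised.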
Evaluate with AI coding assistants
If you use Gemini CLI or another AI coding assistant, you can install Agent skills that teach your assistant the agent evaluation methodology described on this page. Each skill provides the eval workflow, dataset schema, metric selection guidance, and failure analysis steps directly in your coding session, so your assistant can build, grade, and improve evaluations without leaving your editor.
Installation instructions follow each skill.
Agents CLI eval skill
A CLI-driven workflow to evaluate and optimize Agent Development Kit (ADK) agents using the agents-cli eval commands. This skill covers:
- Preparing eval datasets and synthesizing multi-turn scenarios with user simulation
- Running inference, grading traces, and analyzing failure clusters
- Iterating on prompts and tools with the eval-fix loop
To install, run the following command:
npx skills add https://github.com/google/agents-cli --skill google-agents-cli-eval
Agent Platform GenAI Evaluation Service flywheel skill
An SDK-driven playbook to evaluate and improve models and agents through the Agent Platform GenAI Evaluation Service, using the Agent Platform GenAI Evaluation SDK (client.evals.evaluate()). This skill covers:
- Building eval datasets from session traces, DataFrames, or synthetic generation
- Selecting, configuring, and writing custom metrics with LLM-as-judge scoring
- Analyzing rubric verdicts and loss patterns to drive concrete improvements
To install, run the following command:
npx skills add https://github.com/google/skills --skill agent-platform-eval-flywheel