Prepare your evaluation dataset

This page describes how to prepare your dataset for the Gen AI Evaluation Service.

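Overview

The Gen AI Evaluation Service automatically detects and handles several common data formats. This means you can often use your data as-is, without manual conversion.

The fields that you need to provide in your dataset depend on your goal: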

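Goal	Required data	SDK workflow
Generate new responses, then evaluate them	`prompt`	`run_inference()` → `evaluate()`
Evaluate existing responses	`prompt` and `response`	`evaluate()`
Generate new agent runs, then evaluate them	`prompt`	`run_inference()` → `evaluate()`
Evaluate existing agent responses and intermediate events	`prompt`, `response`, and `intermediate_events`	`evaluate()`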

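When you run client.evals.evaluate() or client.evals.create_evaluation_run(), the Gen AI Evaluation Service automatically looks for the following common fields in your dataset:

prompt: (Required) The input to the model that you want to evaluate. For best results, provide example prompts that represent the kinds of inputs your model handles in production.
response: (Required) The output generated by the model or application being evaluated.
reference: (Optional) The ground truth or "golden" answer that the model's response can be compared against. This field is usually required for computation-based metrics such as bleu and rouge.
conversation_history: (Optional) A list of the preceding turns in a multi-turn conversation. The Gen AI Evaluation Service automatically extracts this field from supported formats. For more information, see Handling multi-turn conversations.
session_inputs: (Optional) The inputs used to initialize a session for running an agent. For the run_inference() → evaluate() workflow, this field is optional.
intermediate_events: (Optional) The agent trace for a single conversation turn in an agent run, including function calls, function responses, and intermediate model responses. For the run_inference() → evaluate() workflow, this field is not required.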
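
Before you submit a dataset, it can help to check locally that it has the fields your goal requires. The following is a minimal sketch that encodes the goal table above. The `REQUIRED_FIELDS` mapping, its goal keys, and the `missing_fields` helper are hypothetical names used for illustration; they are not part of the Gen AI Evaluation Service SDK.

```python
# Hypothetical helper: not part of the Gen AI Evaluation Service SDK.
# Required fields per goal, copied from the goal table on this page.
REQUIRED_FIELDS = {
    "generate_then_evaluate": {"prompt"},
    "evaluate_existing_responses": {"prompt", "response"},
    "generate_agent_runs_then_evaluate": {"prompt"},
    "evaluate_existing_agent_responses": {
        "prompt",
        "response",
        "intermediate_events",
    },
}


def missing_fields(goal, columns):
    """Return the fields required for `goal` that are absent from `columns`, sorted."""
    return sorted(REQUIRED_FIELDS[goal] - set(columns))


# A dataset with only prompt and reference cannot be passed straight to evaluate().
print(missing_fields("evaluate_existing_responses", ["prompt", "reference"]))
# prints ['response']
```

If the check reports a missing `response`, one option is to generate responses with run_inference() first, as the table describes.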

Supported data formats

The Gen AI Evaluation Service supports the following formats:

Pandas DataFrame (flattened format)
Gemini batch prediction format (JSONL)
OpenAI Chat Completion format (JSONL)

Pandas DataFrame

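For straightforward evaluations, you can use a pandas.DataFrame. The Gen AI Evaluation Service looks for common column names such as prompt, response, and reference. This format is fully backward compatible.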

import pandas as pd

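# Assumes `client` is an already initialized Vertex AI SDK client and
# `types` is the SDK's types module; neither is defined in this snippet.

# Example DataFrame with prompts and ground truth references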
prompts_df = pd.DataFrame({
    "prompt": [
        "What is the capital of France?",
        "Who wrote 'Hamlet'?",
    ],
    "reference": [
        "Paris",
        "William Shakespeare",
    ]
})

# You can use this DataFrame directly with run_inference or evaluate
eval_dataset = client.evals.run_inference(model="gemini-2.5-flash", src=prompts_df)
eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()

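Gemini batch prediction format

You can directly use the output of a Vertex AI batch prediction job. This output is typically a set of JSONL files stored in Cloud Storage, where each line contains a request object and a response object. The Gen AI Evaluation Service parses this structure automatically, which makes it easier to integrate with other Vertex AI services.

The following is an example of a single line in a JSONL file: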

{"request": {"contents": [{"role": "user", "parts": [{"text": "Why is the sky blue?"}]}]}, "response": {"candidates": [{"content": {"role": "model", "parts": [{"text": "The sky appears blue to the human eye as a result of a phenomenon known as Rayleigh scattering."}]}}]}}

You can then evaluate the pre-generated responses from the batch job directly:

# Cloud Storage path to your batch prediction output file
batch_job_output_uri = "gs://path/to/your/batch_output.jsonl"

# Evaluate the pre-generated responses directly
eval_result = client.evals.evaluate(
    dataset=batch_job_output_uri,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()
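
To spot-check a batch output file before you evaluate it, you can read a line yourself. The following sketch uses only the Python standard library and assumes the line has the request and response shape shown in the example above. The `extract_text_pair` helper is a hypothetical name for illustration and does not reflect how the Gen AI Evaluation Service parses the file internally.

```python
import json


def extract_text_pair(jsonl_line):
    """Return (prompt_text, response_text) from one batch prediction output line.

    Hypothetical helper. Assumes text-only parts and at least one candidate.
    """
    record = json.loads(jsonl_line)

    # The last entry in request.contents is the turn the model answered.
    last_turn = record["request"]["contents"][-1]
    prompt_text = "".join(part.get("text", "") for part in last_turn["parts"])

    # Use the first candidate as the response.
    first_candidate = record["response"]["candidates"][0]
    response_text = "".join(
        part.get("text", "") for part in first_candidate["content"]["parts"]
    )
    return prompt_text, response_text


line = (
    '{"request": {"contents": [{"role": "user", "parts": '
    '[{"text": "Why is the sky blue?"}]}]}, '
    '"response": {"candidates": [{"content": {"role": "model", "parts": '
    '[{"text": "Rayleigh scattering."}]}}]}}'
)
prompt_text, response_text = extract_text_pair(line)
print(prompt_text)
print(response_text)
```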

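OpenAI Chat Completion format

To evaluate or compare third-party models, such as those from OpenAI and Anthropic, the Gen AI Evaluation Service supports the OpenAI Chat Completion format. You can provide a dataset in which each line is a JSON object structured like an OpenAI API request. The Gen AI Evaluation Service detects this format automatically.

The following is an example of a single line in this format: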

{"request": {"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What's the capital of France?"}], "model": "gpt-4o"}}

You can use this data to generate responses from a third-party model and then evaluate those responses:

# Ensure your third-party API key is set
# e.g., os.environ['OPENAI_API_KEY'] = 'Your API Key'

openai_request_uri = "gs://path/to/your/openai_requests.jsonl"

# Generate responses using a LiteLLM-supported model string
openai_responses = client.evals.run_inference(
    model="gpt-4o",  # LiteLLM compatible model string
    src=openai_request_uri,
)

# The resulting openai_responses object can then be evaluated
eval_result = client.evals.evaluate(
    dataset=openai_responses,
    metrics=[types.PrebuiltMetric.GENERAL_QUALITY]
)
eval_result.show()
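
If your prompts are not yet in this format, you can build the request lines yourself. The following sketch uses only the Python standard library and produces lines with the same shape as the example above. The `to_openai_request_line` helper and the sample prompts are hypothetical; writing the text to a file and uploading it to Cloud Storage are left out.

```python
import json


def to_openai_request_line(user_prompt, model, system_prompt=None):
    """Serialize one prompt as a JSON line in the OpenAI Chat Completion request shape.

    Hypothetical helper, not part of any SDK.
    """
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return json.dumps({"request": {"messages": messages, "model": model}})


prompts = ["What's the capital of France?", "Who wrote 'Hamlet'?"]
lines = [
    to_openai_request_line(
        p, model="gpt-4o", system_prompt="You are a helpful assistant."
    )
    for p in prompts
]

# One JSON object per line, ready to be saved as a .jsonl file.
jsonl_text = "\n".join(lines) + "\n"
print(jsonl_text)
```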

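Handling multi-turn conversations

The Gen AI Evaluation Service automatically parses multi-turn conversation data in the supported formats. If your input data includes a conversation history (such as the request.contents field in the Gemini format or request.messages in the OpenAI format), the Gen AI Evaluation Service identifies the preceding turns and processes them as conversation_history.

This means that you don't need to manually separate the current prompt from the earlier conversation, because the evaluation metrics can use the conversation history to understand the context of the model's response.

Consider the following example of a multi-turn conversation in the Gemini format: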

{
  "request": {
    "contents": [
      {"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]},
      {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]},
      {"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}
    ]
  },
  "response": {
    "candidates": [
      {"content": {"role": "model", "parts": [{"text": "For spring in Paris, you should definitely visit the Eiffel Tower, the Louvre Museum, and wander through Montmartre."}]}}
    ]
  }
}

The multi-turn conversation is parsed automatically as follows:

prompt: The last user message is identified as the current prompt ({"role": "user", "parts": [{"text": "I'm thinking next spring. What are some must-see sights?"}]}).
conversation_history: The preceding messages are automatically extracted and provided as the conversation history ([{"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]}, {"role": "model", "parts": [{"text": "That sounds wonderful! What time of year are you going?"}]}]).
response: The model's response is taken from the response field ({"role": "model", "parts": [{"text": "For spring in Paris..."}]}).
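
The split described above can be sketched in a few lines of Python. This is an illustration of the documented behavior under simple assumptions (a Gemini-format record with at least one user turn and one candidate), not the service's actual implementation. The `split_multi_turn` helper is a hypothetical name.

```python
def split_multi_turn(record):
    """Split a Gemini-format record into prompt, conversation_history, and response.

    Hypothetical helper that mirrors the parsing rules listed above.
    """
    contents = record["request"]["contents"]

    # The last user message is treated as the current prompt.
    last_user_index = max(
        i for i, turn in enumerate(contents) if turn["role"] == "user"
    )
    return {
        "prompt": contents[last_user_index],
        # Everything before the current prompt is the conversation history.
        "conversation_history": contents[:last_user_index],
        # The response is the content of the first candidate.
        "response": record["response"]["candidates"][0]["content"],
    }


record = {
    "request": {
        "contents": [
            {"role": "user", "parts": [{"text": "I'm planning a trip to Paris."}]},
            {"role": "model", "parts": [{"text": "What time of year are you going?"}]},
            {"role": "user", "parts": [{"text": "Next spring. What should I see?"}]},
        ]
    },
    "response": {
        "candidates": [
            {"content": {"role": "model", "parts": [{"text": "Visit the Louvre."}]}}
        ]
    },
}

parsed = split_multi_turn(record)
print(parsed["prompt"]["parts"][0]["text"])
print(len(parsed["conversation_history"]))
```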

What's next

Run an evaluation.