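Define your evaluation metrics

After you create an evaluation dataset, the next step is to define the metrics used to measure model performance. Generative AI models can power applications for a wide range of tasks, and the Gen AI Evaluation Service uses a test-driven framework that turns evaluation from subjective ratings into objective, actionable results.

For details about metric types, see the Evaluation metrics section on the Gen AI Evaluation Service overview page.

General quality metric

You can access adaptive rubrics through the SDK. We recommend starting with GENERAL_QUALITY as the default.

Depending on the input prompt, GENERAL_QUALITY generates a set of rubrics covering a variety of tasks, such as instruction following, formatting, tone, and style. You can combine rubric generation and validation in a single call: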

from vertexai import types

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.GENERAL_QUALITY,
    ],
)

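You can also generate the rubrics separately first (to review them or reuse them across models and agents), and then use them to evaluate model responses: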

from vertexai import types

# Use GENERAL_QUALITY recipe to generate rubrics, and store them
# as a rubric group named "general_quality_rubrics".
data_with_rubrics = client.evals.generate_rubrics(
    src=eval_dataset_df,
    rubric_group_name="general_quality_rubrics",
    predefined_spec_name=types.RubricMetric.GENERAL_QUALITY,
)

# Specify the group of rubrics to use for the evaluation.
eval_result = client.evals.evaluate(
    dataset=data_with_rubrics,
    metrics=[types.RubricMetric.GENERAL_QUALITY(
      rubric_group_name="general_quality_rubrics",
    )],
)

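You can also steer GENERAL_QUALITY with natural-language guidelines so that rubric generation focuses on the criteria that matter most to you. The Gen AI Evaluation Service then generates rubrics that cover both its default tasks and the guidelines you specify.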

from vertexai import types

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.GENERAL_QUALITY(
            metric_spec_parameters={
                "guidelines": "The response must maintain a professional tone and must not provide financial advice."
            }
        )
    ],
)

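Targeted quality metrics

If you need to evaluate a more targeted aspect of model quality, you can use metrics that generate rubrics focused on a specific area. For example: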

from vertexai import types

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.TEXT_QUALITY,
        types.RubricMetric.INSTRUCTION_FOLLOWING,
    ],
)

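The Gen AI Evaluation Service provides the following types of adaptive rubrics:

INSTRUCTION_FOLLOWING: Measures how well the response adheres to the specific constraints and instructions in the prompt.
TEXT_QUALITY: Focuses specifically on the linguistic quality of the response, assessing fluency, coherence, and grammar.

Multi-turn conversation

multi_turn_general_quality: Evaluates the overall conversational quality of a multi-turn dialogue.
multi_turn_text_quality: Evaluates the text quality of the responses in a multi-turn dialogue.

Agent evaluation

final_response_reference_free: Evaluates the quality of an agent's final answer without requiring a reference answer.
final_response_quality: Uses adaptive rubrics to evaluate the quality of an agent's final answer based on the agent's configuration and tool usage.
hallucination: Evaluates whether an agent's text responses are grounded, based on the agent's configuration and tool usage.
tool_use_quality: Evaluates the correctness of the function calls that an agent makes in response to a user prompt.

For details about targeted adaptive rubrics, see Adaptive rubric details.

Static rubrics

A static rubric applies one fixed set of scoring guidelines to every example in the dataset. This score-driven approach is useful when you need to measure performance against a consistent benchmark across all prompts.

For example, the following static rubric rates text quality on a scale of 1 to 5: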

5: (Very good). Exceptionally clear, coherent, fluent, and concise. Fully adheres to instructions and stays grounded.
4: (Good). Well-written, coherent, and fluent. Mostly adheres to instructions and stays grounded. Minor room for improvement.
3: (Ok). Adequate writing with decent coherence and fluency. Partially fulfills instructions and may contain minor ungrounded information. Could be more concise.
2: (Bad). Poorly written, lacking coherence and fluency. Struggles to adhere to instructions and may include ungrounded information. Issues with conciseness.
1: (Very bad). Very poorly written, incoherent, and non-fluent. Fails to follow instructions and contains substantial ungrounded information. Severely lacking in conciseness.

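The Gen AI Evaluation Service provides the following static rubric metrics:

GROUNDING: Checks for factuality and consistency against a provided source text (the ground truth). This metric is crucial for RAG systems.
SAFETY: Assesses whether the model response violates safety policies, for example by containing hate speech or dangerous content.

You can also use metric prompt templates, such as FLUENCY.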

from vertexai import types

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.RubricMetric.SAFETY,
        types.RubricMetric.GROUNDING,
        types.RubricMetric.FLUENCY,
    ],
)

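Custom static rubrics

For highly specialized needs, you can create your own static rubric. This method offers the most control, but it requires you to design the evaluation prompt carefully to get consistent and reliable results. We recommend trying guidelines with GENERAL_QUALITY before you write a custom static rubric.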

from vertexai import types

# Define a custom metric to evaluate language simplicity
simplicity_metric = types.LLMMetric(
    name='language_simplicity',
    prompt_template=types.MetricPromptBuilder(
        instruction="Evaluate the story's simplicity for a 5-year-old.",
        criteria={
            "Vocabulary": "Uses simple words.",
            "Sentences": "Uses short sentences.",
        },
        rating_scores={
            "5": "Excellent: Very simple, ideal for a 5-year-old.",
            "4": "Good: Mostly simple, with minor complex parts.",
            "3": "Fair: Mix of simple and complex; may be challenging for a 5-year-old.",
            "2": "Poor: Largely too complex, with difficult words/sentences.",
            "1": "Very Poor: Very complex, unsuitable for a 5-year-old."
        }
    )
)

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        simplicity_metric
    ],
)
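
To make the roles of instruction, criteria, and rating_scores more concrete, the following sketch flattens the same three parts into a single judge prompt. The render_judge_prompt helper and its output layout are hypothetical and exist only for illustration; this is not the text that MetricPromptBuilder produces.

```python
def render_judge_prompt(
    instruction: str,
    criteria: dict[str, str],
    rating_scores: dict[str, str],
    response: str,
) -> str:
    """Hypothetical helper: flatten rubric parts into one judge prompt."""
    lines = ["Instruction:", instruction, "", "Criteria:"]
    lines += [f"- {name}: {description}" for name, description in criteria.items()]
    lines += ["", "Rating scores:"]
    # List the highest score first so the scale reads top-down.
    for score in sorted(rating_scores, key=int, reverse=True):
        lines.append(f"{score}: {rating_scores[score]}")
    lines += ["", "Response to evaluate:", response]
    return "\n".join(lines)


prompt_text = render_judge_prompt(
    instruction="Evaluate the story's simplicity for a 5-year-old.",
    criteria={
        "Vocabulary": "Uses simple words.",
        "Sentences": "Uses short sentences.",
    },
    rating_scores={
        "1": "Very Poor: Very complex, unsuitable for a 5-year-old.",
        "5": "Excellent: Very simple, ideal for a 5-year-old.",
    },
    response="The cat sat on the mat.",
)
print(prompt_text)
```

Writing out the full prompt like this can help you spot criteria that overlap or score descriptions that are hard to tell apart before you run an evaluation.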

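Computation-based metrics

Computation-based metrics use deterministic algorithms to score a model response by comparing it to a reference answer. They require a ground truth in your dataset and are ideal for tasks where a "correct" answer is well defined.

Recall-Oriented Understudy for Gisting Evaluation (rouge_l, rouge_1): Measures the overlap between the model response and a reference text. rouge_1 counts shared unigrams (single words), and rouge_l is based on the longest common subsequence. It is commonly used to evaluate text summarization.
Bilingual Evaluation Understudy (bleu): Measures how similar a response is to a high-quality reference text by counting matching n-grams (contiguous sequences of words). It is the standard metric for translation quality, but it can also be used for other text generation tasks.
Exact Match (exact_match): Measures the percentage of responses that are identical to the reference answer. This is useful for fact-based question answering or for tasks that have only one correct response.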

from vertexai import types

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[
        types.Metric(name='bleu'),
        types.Metric(name='rouge_l'),
        types.Metric(name='exact_match')
    ],
)
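
To build intuition for what n-gram overlap and exact matching measure, here is a minimal pure-Python sketch. It is a simplified illustration, not the service's implementation: it assumes whitespace tokenization and lowercasing, computes only the recall side of ROUGE-N, and leaves out rouge_l's longest-common-subsequence logic and BLEU's brevity penalty. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(response: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also appear in the response."""
    ref = ngrams(reference.lower().split(), n)
    resp = ngrams(response.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & resp).values())
    return overlap / total


def exact_match(response: str, reference: str) -> float:
    """1.0 when the two strings are identical, otherwise 0.0."""
    return 1.0 if response == reference else 0.0


print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=1))  # 0.5
print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=2))  # 0.4
print(exact_match("4", "4"))  # 1.0
```

In the first call the reference has six unigrams and the response matches three of them, which gives 0.5. Under the same assumptions, the bigram score is lower (2 of 5) because word order now matters.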

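Custom function metrics

You can also implement custom evaluation logic by passing a custom Python function to the custom_function parameter. The Gen AI Evaluation Service executes this function for each row of your dataset.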

from vertexai import types

# Define a custom function to check for the presence of a keyword
def contains_keyword(instance: dict) -> dict:
    keyword = "magic"
    response_text = instance.get("response", "")
    score = 1.0 if keyword in response_text.lower() else 0.0
    return {"score": score}

keyword_metric = types.Metric(
    name="keyword_check",
    custom_function=contains_keyword
)

eval_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[keyword_metric]
)
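
Because a custom function is plain Python, you can check it on a few hand-written rows before you pass it to the service. The sketch below repeats the contains_keyword function from the example above and runs it on sample dictionaries. The sample rows are invented for illustration, and the exact shape of the instance the service passes in is assumed to match the dictionary used here.

```python
def contains_keyword(instance: dict) -> dict:
    keyword = "magic"
    response_text = instance.get("response", "")
    score = 1.0 if keyword in response_text.lower() else 0.0
    return {"score": score}


# Hypothetical rows covering a match, a miss, and a missing response field.
sample_rows = [
    {"prompt": "Tell me a story.", "response": "The Magic lamp glowed."},
    {"prompt": "Tell me a story.", "response": "The lamp glowed."},
    {"prompt": "Tell me a story."},
]

scores = [contains_keyword(row)["score"] for row in sample_rows]
print(scores)  # [1.0, 0.0, 0.0]
```

The third row shows why the function reads the response with instance.get: a row with no response scores 0.0 instead of raising a KeyError.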

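Remote custom function metrics

In addition to running custom evaluation logic locally, you can implement custom evaluation logic that executes securely in a remote sandboxed environment. This is useful when you want to integrate evaluation into a model tuning workflow, or when you have user-specific scenarios that existing evaluation metrics don't cover. To do this, pass a Python code snippet as a string to the remote_custom_function parameter of the Metric class. The Gen AI Evaluation Service executes this function remotely for each row of your dataset.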

import pandas as pd
from vertexai import types

code_snippet = """
def evaluate(instance):
    if instance['response'] == instance['reference']:
        return 1.0
    return 0.0
"""

custom_metric = types.Metric(
    name="my_custom_code_metric",
    remote_custom_function=code_snippet,
)

prompts_df = pd.DataFrame(
    {
        "prompt": ["What is 2+2?", "What is 3+3?"],
        "response": ["4", "5"],
        "reference": ["4", "6"],
    }
)

eval_dataset = types.EvaluationDataset(
    eval_dataset_df=prompts_df,
    candidate_name="test_model",
)

evaluation_result = client.evals.evaluate(
    dataset=eval_dataset,
    metrics=[custom_metric],
)

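Evaluation instance input

The evaluate function takes a single instance dictionary as its argument. instance represents the evaluation instance, and any field populated in EvaluationInstance is available to the function as instance[field_name]. The available fields include:

prompt: The user prompt provided to the model.
response: The output generated by the model.
reference: The ground truth answer to compare the response against.
rubric_groups: Named groups of rubrics associated with the prompt.
other_data: Additional data used to fill placeholders by key.
agent_eval_data: Data specific to agent evaluation, such as the agent configuration and traces.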
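
Since the snippet is passed as a string, a syntax error or a misspelled field name only surfaces at evaluation time. One way to catch such mistakes early is a local dry run, sketched below with plain dictionaries standing in for evaluation instances. The load_evaluate helper is hypothetical, and running the snippet with exec locally only approximates the remote sandbox (no isolation, possibly different package versions).

```python
code_snippet = """
def evaluate(instance):
    if instance['response'] == instance['reference']:
        return 1.0
    return 0.0
"""


def load_evaluate(snippet: str):
    """Hypothetical helper: run the snippet and return its evaluate function."""
    namespace: dict = {}
    exec(snippet, namespace)  # Only use this with code you wrote and trust.
    return namespace["evaluate"]


evaluate = load_evaluate(code_snippet)

# Stand-ins for evaluation instances, using the fields listed above.
rows = [
    {"prompt": "What is 2+2?", "response": "4", "reference": "4"},
    {"prompt": "What is 3+3?", "response": "5", "reference": "6"},
]

local_scores = [evaluate(row) for row in rows]
print(local_scores)  # [1.0, 0.0]
```

The first row scores 1.0 because the response equals the reference, and the second scores 0.0 because "5" differs from "6".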

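Technical limitations

Execution environment: Custom code runs in a sandboxed environment with no network access.
Execution time limit: The scoring execution itself is limited to 1 minute.
Memory limit: The total size of the uploaded code, combined with any data loaded during execution, must not exceed 1.5 GB.

The following third-party packages are available at execution time: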

altair
chess
cv2
deepdiff
editdistance
jsonschema
matplotlib
mpmath
nltk
numpy
pandas
pdfminer
pydantic
rdkit
reportlab
scipy
seaborn
sklearn
sqlparse
statsmodels
striprtf
sympy
tabulate
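
Before submitting a snippet, it can help to confirm that it imports only packages from the list above. The following sketch parses the snippet with the standard ast module and reports any top-level import that is neither on that list nor part of the standard library. The unsupported_imports helper is hypothetical, and treating standard-library modules as available in the sandbox is an assumption, not something this page states.

```python
import ast
import sys

# Third-party packages listed above as available at execution time.
ALLOWED_THIRD_PARTY = {
    "altair", "chess", "cv2", "deepdiff", "editdistance", "jsonschema",
    "matplotlib", "mpmath", "nltk", "numpy", "pandas", "pdfminer",
    "pydantic", "rdkit", "reportlab", "scipy", "seaborn", "sklearn",
    "sqlparse", "statsmodels", "striprtf", "sympy", "tabulate",
}


def unsupported_imports(snippet: str) -> set[str]:
    """Return top-level modules the snippet imports that are not on the list."""
    # sys.stdlib_module_names needs Python 3.10+; on older versions the
    # fallback is empty, so standard-library imports are reported as well.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    found: set[str] = set()
    for node in ast.walk(ast.parse(snippet)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return {name for name in found
            if name not in ALLOWED_THIRD_PARTY and name not in stdlib}


snippet = """
import numpy as np
import requests
from sklearn.metrics import f1_score

def evaluate(instance):
    return 1.0
"""
print(unsupported_imports(snippet))  # {'requests'}
```

The check is static, so it misses dynamic imports such as importlib.import_module calls.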