自 2025 年 4 月 29 日起，Gemini 1.5 Pro 和 Gemini 1.5 Flash 模型將無法用於先前未使用這些模型的專案，包括新專案。詳情請參閱「模型版本和生命週期」。

本頁面由 Cloud Translation API 翻譯而成。

準備評估資料集

對於 Gen AI 評估服務，評估資料集通常包含您要評估的模型回應、用於產生回應的輸入資料，以及可能的真值回應。

評估資料集架構

對於一般以模型為基礎的指標用途，資料集必須提供下列資訊：

輸入類型	輸入欄位內容
提示	使用者輸入的生成式 AI 模型或應用程式。在某些情況下為選填屬性。
回應	要評估的 LLM 推論回應。
baseline_model_response (逐對指標必填)	用於比較 LLM 回覆與成對評估結果的基準 LLM 推論回覆

如果您使用 Vertex AI SDK for Python 的 Gen AI Evaluation 模組，Gen AI 評估服務就能根據您指定的模型，自動產生 response 和 baseline_model_response。

對於其他評估用途，您可能需要提供更多資訊：

多輪對話或即時通訊

輸入類型	輸入欄位內容
記錄	使用者和模型在目前回合之前的對話記錄。
提示	在目前回合中，使用者對生成式 AI 模型或應用程式輸入的內容。
回應	系統會根據歷史記錄和目前的轉彎提示，評估您的 LLM 推論回應。
baseline_model_response (逐對指標必填)	用於比較 LLM 回覆與成對評估的基準 LLM 推論回覆，這項評估會根據歷史記錄和目前的回合提示。

計算指標

資料集必須提供大型語言模型的回覆內容，以及用於比較的參考資料。

輸入類型	輸入欄位內容
回應	要評估的 LLM 推論回應。
參考資料	用於比較 LLM 回覆的基準真相。

翻譯指標

您的資料集必須提供模型的回覆。視用途而定，您還需要提供參考資料進行比較、原始語言的輸入內容，或兩者皆提供。

輸入類型	輸入欄位內容
來源	系統預測的來源文字，以系統翻譯的來源語言為準。
回應	要評估的 LLM 推論回應。
參考資料	與 LLM 回覆進行比較的基準真相。這項資訊的語言與回應相同。

視用途而定，您也可以將輸入的使用者提示分解為細項，例如 instruction 和 context，並透過提供提示範本將這些細項組合起來，以便進行推論。如有需要，您也可以提供參考資料或實際資料：

輸入類型	輸入欄位內容
指示	輸入使用者提示的一部分。指的是傳送至 LLM 的推論指令。例如：「請摘要以下文字」是指示。
context	在目前回合中，使用者對生成式 AI 模型或應用程式輸入的內容。
參考資料	用於比較 LLM 回覆的基準真相。

評估資料集的必要輸入內容應與指標一致。如要進一步瞭解如何自訂指標，請參閱「定義評估指標」和「執行評估」。如要進一步瞭解如何在模型指標中加入參考資料，請參閱「根據輸入資料調整指標提示範本」。

匯入評估用資料集

您可以使用下列格式匯入資料集：

儲存在 Cloud Storage 中的 JSONL 或 CSV 檔案
BigQuery 資料表
Pandas DataFrame

評估資料集範例

本節將說明使用 Pandas Dataframe 格式的資料集範例。請注意，這裡僅顯示幾個資料記錄做為範例，評估資料集通常會有 100 個以上的資料點。如需準備資料集的最佳做法，請參閱「最佳做法」一節。

以點為單位的以模型為準指標

以下是摘要示例，說明點狀模型式指標的範例資料集：

prompts = [
    # Example 1
    (
        "Summarize the text in one sentence: As part of a comprehensive"
        " initiative to tackle urban congestion and foster sustainable urban"
        " living, a major city has revealed ambitious plans for an extensive"
        " overhaul of its public transportation system. The project aims not"
        " only to improve the efficiency and reliability of public transit but"
        " also to reduce the city's carbon footprint and promote eco-friendly"
        " commuting options. City officials anticipate that this strategic"
        " investment will enhance accessibility for residents and visitors"
        " alike, ushering in a new era of efficient, environmentally conscious"
        " urban transportation."
    ),
    # Example 2
    (
        "Summarize the text such that a five-year-old can understand: A team of"
        " archaeologists has unearthed ancient artifacts shedding light on a"
        " previously unknown civilization. The findings challenge existing"
        " historical narratives and provide valuable insights into human"
        " history."
    ),
]

responses = [
    # Example 1
    (
        "A major city is revamping its public transportation system to fight"
        " congestion, reduce emissions, and make getting around greener and"
        " easier."
    ),
    # Example 2
    (
        "Some people who dig for old things found some very special tools and"
        " objects that tell us about people who lived a long, long time ago!"
        " What they found is like a new puzzle piece that helps us understand"
        " how people used to live."
    ),
]

eval_dataset = pd.DataFrame({
    "prompt": prompts,
    "response": responses,
})

以模型為基準的逐對指標

以下範例為開放式問答案例，說明成對模型式指標的範例資料集。

prompts = [
    # Example 1
    (
        "Based on the context provided, what is the hardest material? Context:"
        " Some might think that steel is the hardest material, or even"
        " titanium. However, diamond is actually the hardest material."
    ),
    # Example 2
    (
        "Based on the context provided, who directed The Godfather? Context:"
        " Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The"
        " Godfather, and the latter directed it as well."
    ),
]

responses = [
    # Example 1
    "Diamond is the hardest material. It is harder than steel or titanium.",
    # Example 2
    "Francis Ford Coppola directed The Godfather.",
]

baseline_model_responses = [
    # Example 1
    "Steel is the hardest material.",
    # Example 2
    "John Smith.",
]

eval_dataset = pd.DataFrame(
  {
    "prompt":  prompts,
    "response":  responses,
    "baseline_model_response": baseline_model_responses,
  }
)

計算指標

對於以運算為基礎的指標，通常需要 reference。

eval_dataset = pd.DataFrame({
  "response": ["The Roman Senate was filled with exuberance due to Pompey's defeat in Asia."],
  "reference": ["The Roman Senate was filled with exuberance due to successes against Catiline."],
})

工具使用 (函式呼叫) 指標

以下範例顯示計算型工具使用指標的輸入資料：

json_responses = ["""{
    "content": "",
    "tool_calls":[{
      "name":"get_movie_info",
      "arguments": {"movie":"Mission Impossible", "time": "today 7:30PM"}
    }]
  }"""]

json_references = ["""{
    "content": "",
    "tool_calls":[{
      "name":"book_tickets",
      "arguments":{"movie":"Mission Impossible", "time": "today 7:30PM"}
      }]
  }"""]

eval_dataset = pd.DataFrame({
    "response": json_responses,
    "reference": json_references,
})

翻譯用途

以下範例顯示翻譯指標的輸入資料：

  source = [
      "Dem Feuer konnte Einhalt geboten werden",
      "Schulen und Kindergärten wurden eröffnet.",
  ]

  response = [
      "The fire could be stopped",
      "Schools and kindergartens were open",
  ]

  reference = [
      "They were able to control the fire.",
      "Schools and kindergartens opened",
  ]

  eval_dataset = pd.DataFrame({
      "source": source,
      "response": response,
      "reference": reference,
  })

最佳做法

定義評估資料集時，請遵循下列最佳做法：

提供代表模型在實際工作中處理的輸入類型的範例。
資料集至少須包含一個評估範例。建議您提供約 100 個範例，確保取得高品質的匯總指標和具統計顯著性的結果。這個大小有助於在匯總評估結果中建立更高的信心水準，盡量減少異常值的影響，並確保效能指標反映模型在不同情境下的真實能力。提供的樣本數量超過 400 個時，匯總指標品質改善率通常會下降。

後續步驟

執行評估作業。
試用評估範例筆記本。

準備評估資料集 透過集合功能整理內容 你可以依據偏好儲存及分類內容。

評估資料集架構

多輪對話或即時通訊

計算指標

翻譯指標

匯入評估用資料集

評估資料集範例

以點為單位的以模型為準指標

以模型為基準的逐對指標

計算指標

工具使用 (函式呼叫) 指標

翻譯用途

最佳做法

後續步驟

準備評估資料集