Gen AI evaluation service API

The Gen AI evaluation service lets you evaluate your large language models (LLMs) across several metrics with your own criteria. You can provide inference-time inputs, LLM responses, and additional parameters, and the Gen AI evaluation service returns metrics specific to the evaluation task.

Metrics include model-based metrics, such as `PointwiseMetric` and `PairwiseMetric`, and in-memory computed metrics, such as `rouge`, `bleu`, and tool function-call metrics. `PointwiseMetric` and `PairwiseMetric` are generic model-based metrics that you can customize with your own criteria. Because the service takes the prediction results directly from models as input, the evaluation service can perform both inference and subsequent evaluation on all models supported by Vertex AI.

For more information about evaluating a model, see Gen AI evaluation service overview.

Limitations

The following are limitations of the evaluation service:

The evaluation service may have a propagation delay on your first call.
Most model-based metrics consume Gemini 2.5 Flash quota, because the Gen AI evaluation service uses gemini-2.5-flash as the underlying judge model to compute these model-based metrics.
Some model-based metrics, such as MetricX and COMET, use different machine learning models, so they don't consume Gemini quota.

Example syntax

Syntax to send an evaluation call.

curl

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}:evaluateInstances" \
  -d '{
  "pointwise_metric_input" : {
    "metric_spec" : {
      ...
    },
    "instance": {
      ...
    }
  }
}'

Python

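import json
import os

from google import auth
from google.auth.transport import requests as google_auth_requests

# Read the project and region from the same variables the curl sample uses.
PROJECT_ID = os.environ['PROJECT_ID']
LOCATION = os.environ['LOCATION']

creds, _ = auth.default(
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

# Request body holding one metric input, for example {"pointwise_metric_input": {...}}.
data = {
  ...
}

uri = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}:evaluateInstances'
result = google_auth_requests.AuthorizedSession(creds).post(uri, json=data)
result.raise_for_status()  # Surface HTTP errors instead of printing an error body.

print(json.dumps(result.json(), indent=2))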
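Both samples post a JSON body holding one metric input to the same regional endpoint. As a minimal stdlib-only sketch, the URL and a single-`instance` body can be assembled as shown below; the helper names `evaluate_instances_url` and `build_request` are hypothetical (not part of any Google library), and `my-project` and `us-central1` are placeholders.

```python
import json


def evaluate_instances_url(project_id: str, location: str) -> str:
    """Return the regional evaluateInstances endpoint used in the samples above."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}:evaluateInstances"
    )


def build_request(input_key: str, metric_spec: dict, instance: dict) -> dict:
    """Wrap a metric spec and a single instance under the given input key."""
    return {input_key: {"metric_spec": metric_spec, "instance": instance}}


if __name__ == "__main__":
    url = evaluate_instances_url("my-project", "us-central1")
    body = build_request("fluency_input", {}, {"prediction": "The sky is blue."})
    print(url)
    print(json.dumps(body, indent=2))
```

This shape fits inputs that take a single `instance`; inputs such as `exact_match_input` take an `instances` list instead.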

Parameter list

Parameters
`exact_match_input`	`ExactMatchInput` (optional) Input to assess whether the prediction exactly matches the reference.
`bleu_input`	`BleuInput` (optional) Input to compute the BLEU score by comparing the prediction against the reference.
`rouge_input`	`RougeInput` (optional) Input to compute `rouge` scores by comparing the prediction against the reference. `rouge_type` selects which `rouge` score is computed.
`fluency_input`	`FluencyInput` (optional) Input to assess a single response's language mastery.
`coherence_input`	`CoherenceInput` (optional) Input to assess a single response's ability to provide a coherent, easy-to-follow reply.
`safety_input`	`SafetyInput` (optional) Input to assess a single response's level of safety.
`groundedness_input`	`GroundednessInput` (optional) Input to assess a single response's ability to provide or reference only information included in the input text.
`fulfillment_input`	`FulfillmentInput` (optional) Input to assess a single response's ability to completely fulfill instructions.
`summarization_quality_input`	`SummarizationQualityInput` (optional) Input to assess a single response's overall ability to summarize text.
`pairwise_summarization_quality_input`	`PairwiseSummarizationQualityInput` (optional) Input to compare the overall summarization quality of two responses.
`summarization_helpfulness_input`	`SummarizationHelpfulnessInput` (optional) Input to assess a single response's ability to provide a summary that contains the details needed to substitute for the original text.
`summarization_verbosity_input`	`SummarizationVerbosityInput` (optional) Input to assess a single response's ability to provide a succinct summary.
`question_answering_quality_input`	`QuestionAnsweringQualityInput` (optional) Input to assess a single response's overall ability to answer questions, given a body of text to reference.
`pairwise_question_answering_quality_input`	`PairwiseQuestionAnsweringQualityInput` (optional) Input to compare two responses' overall ability to answer questions, given a body of text to reference.
`question_answering_relevance_input`	`QuestionAnsweringRelevanceInput` (optional) Input to assess a single response's ability to respond with relevant information when asked a question.
`question_answering_helpfulness_input`	`QuestionAnsweringHelpfulnessInput` (optional) Input to assess a single response's ability to provide key details when answering a question.
`question_answering_correctness_input`	`QuestionAnsweringCorrectnessInput` (optional) Input to assess a single response's ability to answer a question correctly.
`pointwise_metric_input`	`PointwiseMetricInput` (optional) Input for a generic pointwise evaluation.
`pairwise_metric_input`	`PairwiseMetricInput` (optional) Input for a generic pairwise evaluation.
`tool_call_valid_input`	`ToolCallValidInput` (optional) Input to assess a single response's ability to predict a valid tool call.
`tool_name_match_input`	`ToolNameMatchInput` (optional) Input to assess a single response's ability to predict a tool call with the right tool name.
`tool_parameter_key_match_input`	`ToolParameterKeyMatchInput` (optional) Input to assess a single response's ability to predict a tool call with correct parameter names.
`tool_parameter_kv_match_input`	`ToolParameterKVMatchInput` (optional) Input to assess a single response's ability to predict a tool call with correct parameter names and values.
`comet_input`	`CometInput` (optional) Input to evaluate using COMET.
`metricx_input`	`MetricxInput` (optional) Input to evaluate using MetricX.

`ExactMatchInput`

{
  "exact_match_input": {
    "metric_spec": {},
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	`ExactMatchSpec` (optional) Metric spec, defining the metric's behavior.
`instances`	`ExactMatchInstance[]` (optional) Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	`string` (optional) LLM response.
`instances.reference`	`string` (optional) Golden LLM response for reference.
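Because `instances` is a list, one request can carry many prediction/reference pairs. A minimal sketch that assembles the body from two parallel lists (the helper name `build_exact_match_input` is hypothetical):

```python
import json


def build_exact_match_input(predictions: list[str], references: list[str]) -> dict:
    """Pair each prediction with its reference in one exact_match_input request."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    instances = [
        {"prediction": p, "reference": r} for p, r in zip(predictions, references)
    ]
    return {"exact_match_input": {"metric_spec": {}, "instances": instances}}


request = build_exact_match_input(["Paris", "4"], ["Paris", "four"])
print(json.dumps(request, indent=2))
```

Checking the lengths up front avoids `zip` silently dropping unpaired items.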

`ExactMatchResults`

{
  "exact_match_results": {
    "exact_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Response
`exact_match_metric_values`	`ExactMatchMetricValue[]` Evaluation results per instance input.
`exact_match_metric_values.score`	`float` One of the following: `0`: Instance was not an exact match, `1`: Exact match.
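Per-instance scores are usually aggregated into a single exact-match rate. A sketch, assuming the response keys are spelled as in the schema above; JSON returned over REST may use lowerCamelCase instead (for example `exactMatchResults`), and a `score` of 0 may be omitted, so the sketch accepts both spellings and defaults missing scores to 0. The helper names are hypothetical.

```python
def _field(obj: dict, snake_name: str):
    """Return obj[snake_name], falling back to its lowerCamelCase spelling."""
    if snake_name in obj:
        return obj[snake_name]
    head, *rest = snake_name.split("_")
    return obj[head + "".join(word.capitalize() for word in rest)]


def exact_match_rate(response: dict) -> float:
    """Fraction of instances scored 1 (exact match) in an ExactMatchResults body."""
    results = _field(response, "exact_match_results")
    values = _field(results, "exact_match_metric_values")
    if not values:
        return 0.0
    # Treat a missing score as 0, since default values may be left out of the JSON.
    return sum(v.get("score", 0.0) for v in values) / len(values)


sample = {
    "exact_match_results": {
        "exact_match_metric_values": [{"score": 1.0}, {"score": 0.0}]
    }
}
print(exact_match_rate(sample))  # 0.5
```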

`BleuInput`

{
  "bleu_input": {
    "metric_spec": {
      "use_effective_order": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	`BleuSpec` (optional) Metric spec, defining the metric's behavior.
`metric_spec.use_effective_order`	`bool` (optional) Whether to take into account n-gram orders without any match.
`instances`	`BleuInstance[]` (optional) Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	`string` (optional) LLM response.
`instances.reference`	`string` (optional) Golden LLM response for reference.
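`use_effective_order` matters mostly for short, sentence-level inputs. The sketch below counts clipped n-gram matches per order using plain whitespace tokenization (a simplification, not the service's tokenizer). In the example the trigram has no match and there are no 4-grams at all, so a BLEU score that takes an unsmoothed geometric mean over orders 1 to 4 would be 0. Exactly how the service applies `use_effective_order` is not specified in this reference.

```python
from collections import Counter


def ngram_precisions(prediction: str, reference: str, max_n: int = 4) -> list[tuple[int, int]]:
    """Return (clipped matches, total prediction n-grams) for n = 1..max_n."""
    pred, ref = prediction.split(), reference.split()
    results = []
    for n in range(1, max_n + 1):
        pred_counts = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each predicted n-gram count by how often it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        results.append((matches, sum(pred_counts.values())))
    return results


print(ngram_precisions("the cat sat", "a cat sat"))
# [(2, 3), (1, 2), (0, 1), (0, 0)]
```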

`BleuResults`

{
  "bleu_results": {
    "bleu_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Response
`bleu_metric_values`	`BleuMetricValue[]` Evaluation results per instance input.
`bleu_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is more similar to the reference.

`RougeInput`

{
  "rouge_input": {
    "metric_spec": {
      "rouge_type": string,
      "use_stemmer": bool,
      "split_summaries": bool
    },
    "instances": [
      {
        "prediction": string,
        "reference": string
      }
    ]
  }
}

Parameters
`metric_spec`	`RougeSpec` (optional) Metric spec, defining the metric's behavior.
`metric_spec.rouge_type`	`string` (optional) Acceptable values: `rougen[1-9]`: compute `rouge` scores based on the overlap of n-grams between the prediction and the reference. `rougeL`: compute `rouge` scores based on the longest common subsequence (LCS) between the prediction and the reference. `rougeLsum`: first splits the prediction and the reference into sentences, then computes the LCS for each tuple. The final `rougeLsum` score is the average of these individual LCS scores.
`metric_spec.use_stemmer`	`bool` (optional) Whether to use the Porter stemmer to strip word suffixes to improve matching.
`metric_spec.split_summaries`	`bool` (optional) Whether to add newlines between sentences for `rougeLsum`.
`instances`	`RougeInstance[]` (optional) Evaluation input, consisting of the LLM response and the reference.
`instances.prediction`	`string` (optional) LLM response.
`instances.reference`	`string` (optional) Golden LLM response for reference.
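To make the `rougeL` option concrete, here is a minimal sketch of an LCS-based ROUGE-L F1 score using whitespace tokens and no stemmer. It illustrates the technique only; the service's tokenization and the exact statistic it reports in `score` may differ.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """LCS-based ROUGE-L F1 over whitespace tokens (no stemming)."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(round(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"), 4))  # 0.8333
```

Here the LCS is "the cat on the mat" (5 tokens), so precision and recall are both 5/6.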

`RougeResults`

{
  "rouge_results": {
    "rouge_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Response
`rouge_metric_values`	`RougeValue[]` Evaluation results per instance input.
`rouge_metric_values.score`	`float`: `[0, 1]`, where higher scores mean the prediction is more similar to the reference.

`FluencyInput`

{
  "fluency_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	`FluencySpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`FluencyInstance` (optional) Evaluation input, consisting of the LLM response.
`instance.prediction`	`string` (optional) LLM response.
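Unlike `exact_match_input`, which accepts an `instances` list, `fluency_input` takes a single `instance`, so scoring several responses means sending one request per response. A sketch (the helper name `per_response_requests` is hypothetical); `coherence_input` and `safety_input` have the same shape, so the same helper should work with a different key.

```python
import json


def per_response_requests(input_key: str, predictions: list[str]) -> list[str]:
    """Build one JSON request body per prediction for single-instance metrics."""
    return [
        json.dumps({input_key: {"metric_spec": {}, "instance": {"prediction": p}}})
        for p in predictions
    ]


for body in per_response_requests("fluency_input", ["Hi there.", "Me go store yesterday."]):
    print(body)
```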

`FluencyResult`

{
  "fluency_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: Inarticulate, `2`: Somewhat inarticulate, `3`: Neutral, `4`: Somewhat fluent, `5`: Fluent.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.
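A small sketch that turns a `fluency_result` into a readable line using the score labels in the table above. It assumes the keys are spelled as in the schema, with a fallback to the lowerCamelCase spelling (`fluencyResult`) that REST JSON may use; the helper and constant names are hypothetical.

```python
FLUENCY_LABELS = {
    1: "Inarticulate",
    2: "Somewhat inarticulate",
    3: "Neutral",
    4: "Somewhat fluent",
    5: "Fluent",
}


def summarize_fluency(response: dict) -> str:
    """Format a FluencyResult as 'label (score=..., confidence=...)'."""
    result = response.get("fluency_result") or response["fluencyResult"]
    score = result["score"]
    label = FLUENCY_LABELS.get(int(score), "Unknown")
    return f"{label} (score={score}, confidence={result.get('confidence')})"


sample = {"fluency_result": {"score": 4.0, "explanation": "Minor slips.", "confidence": 0.9}}
print(summarize_fluency(sample))
```

The same pattern applies to the other single-response results, which share the `score`, `explanation`, and `confidence` fields but use different label tables.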

`CoherenceInput`

{
  "coherence_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	`CoherenceSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`CoherenceInstance` (optional) Evaluation input, consisting of the LLM response.
`instance.prediction`	`string` (optional) LLM response.

`CoherenceResult`

{
  "coherence_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: Incoherent, `2`: Somewhat incoherent, `3`: Neutral, `4`: Somewhat coherent, `5`: Coherent.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SafetyInput`

{
  "safety_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string
    }
  }
}

Parameters
`metric_spec`	`SafetySpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`SafetyInstance` (optional) Evaluation input, consisting of the LLM response.
`instance.prediction`	`string` (optional) LLM response.

`SafetyResult`

{
  "safety_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `0`: Unsafe, `1`: Safe.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`GroundednessInput`

{
  "groundedness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`GroundednessSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`GroundednessInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`GroundednessResult`

{
  "groundedness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `0`: Ungrounded, `1`: Grounded.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`FulfillmentInput`

{
  "fulfillment_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string
    }
  }
}

Parameters
`metric_spec`	`FulfillmentSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`FulfillmentInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.

`FulfillmentResult`

{
  "fulfillment_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: No fulfillment, `2`: Poor fulfillment, `3`: Some fulfillment, `4`: Good fulfillment, `5`: Complete fulfillment.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SummarizationQualityInput`

{
  "summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`SummarizationQualitySpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`SummarizationQualityInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`SummarizationQualityResult`

{
  "summarization_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: Very bad, `2`: Bad, `3`: OK, `4`: Good, `5`: Very good.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PairwiseSummarizationQualityInput`

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`PairwiseSummarizationQualitySpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`PairwiseSummarizationQualityInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding responses.
`instance.baseline_prediction`	`string` (optional) Baseline model LLM response.
`instance.prediction`	`string` (optional) Candidate model LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`PairwiseSummarizationQualityResult`

{
  "pairwise_summarization_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Response
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: Baseline prediction is better, `CANDIDATE`: Candidate prediction is better, `TIE`: Tie between the baseline and candidate predictions.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.
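Pairwise results are typically tallied over many comparisons. A sketch that mirrors the three `PairwiseChoice` values above and computes a candidate win rate; counting a tie as half a win is a design choice of this sketch, not service behavior, and any other enum value the API might return would raise `ValueError` here. The helper names are hypothetical, and `result_key` can be set to `pairwise_question_answering_quality_result` for that metric.

```python
from collections import Counter
from enum import Enum


class PairwiseChoice(str, Enum):
    """Mirror of the PairwiseChoice values documented above."""
    BASELINE = "BASELINE"
    CANDIDATE = "CANDIDATE"
    TIE = "TIE"


def tally(results: list[dict], result_key: str = "pairwise_summarization_quality_result") -> Counter:
    """Count how often each choice appears across pairwise results."""
    return Counter(PairwiseChoice(r[result_key]["pairwise_choice"]) for r in results)


def candidate_win_rate(counts: Counter) -> float:
    """Candidate wins over all comparisons, counting a tie as half a win."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts[PairwiseChoice.CANDIDATE] + 0.5 * counts[PairwiseChoice.TIE]) / total


results = [
    {"pairwise_summarization_quality_result": {"pairwise_choice": c}}
    for c in ["CANDIDATE", "BASELINE", "TIE", "CANDIDATE"]
]
print(candidate_win_rate(tally(results)))  # 0.625
```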

`SummarizationHelpfulnessInput`

{
  "summarization_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`SummarizationHelpfulnessSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`SummarizationHelpfulnessInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`SummarizationHelpfulnessResult`

{
  "summarization_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: Unhelpful, `2`: Somewhat unhelpful, `3`: Neutral, `4`: Somewhat helpful, `5`: Helpful.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`SummarizationVerbosityInput`

{
  "summarization_verbosity_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`SummarizationVerbositySpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`SummarizationVerbosityInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`SummarizationVerbosityResult`

{
  "summarization_verbosity_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `-2`: Terse, `-1`: Somewhat terse, `0`: Optimal, `1`: Somewhat verbose, `2`: Verbose.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringQualityInput`

{
  "question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`QuestionAnsweringQualitySpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`QuestionAnsweringQualityInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`QuestionAnsweringQualityResult`

{
  "question_answering_quality_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: Very bad, `2`: Bad, `3`: OK, `4`: Good, `5`: Very good.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PairwiseQuestionAnsweringQualityInput`

{
  "pairwise_question_answering_quality_input": {
    "metric_spec": {},
    "instance": {
      "baseline_prediction": string,
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`PairwiseQuestionAnsweringQualitySpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`PairwiseQuestionAnsweringQualityInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding responses.
`instance.baseline_prediction`	`string` (optional) Baseline model LLM response.
`instance.prediction`	`string` (optional) Candidate model LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`PairwiseQuestionAnsweringQualityResult`

{
  "pairwise_question_answering_quality_result": {
    "pairwise_choice": PairwiseChoice,
    "explanation": string,
    "confidence": float
  }
}

Response
`pairwise_choice`	`PairwiseChoice`: Enum with the following possible values: `BASELINE`: Baseline prediction is better, `CANDIDATE`: Candidate prediction is better, `TIE`: Tie between the baseline and candidate predictions.
`explanation`	`string`: Justification for the `pairwise_choice` assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringRelevanceInput`

{
  "question_answering_relevance_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`QuestionAnsweringRelevanceSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`QuestionAnsweringRelevanceInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`QuestionAnsweringRelevancyResult`

{
  "question_answering_relevancy_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: Irrelevant, `2`: Somewhat irrelevant, `3`: Neutral, `4`: Somewhat relevant, `5`: Relevant.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringHelpfulnessInput`

{
  "question_answering_helpfulness_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`QuestionAnsweringHelpfulnessSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`QuestionAnsweringHelpfulnessInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.

`QuestionAnsweringHelpfulnessResult`

{
  "question_answering_helpfulness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `1`: Unhelpful, `2`: Somewhat unhelpful, `3`: Neutral, `4`: Somewhat helpful, `5`: Helpful.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`QuestionAnsweringCorrectnessInput`

{
  "question_answering_correctness_input": {
    "metric_spec": {
      "use_reference": bool
    },
    "instance": {
      "prediction": string,
      "reference": string,
      "instruction": string,
      "context": string
    }
  }
}

Parameters
`metric_spec`	`QuestionAnsweringCorrectnessSpec` (optional) Metric spec, defining the metric's behavior.
`metric_spec.use_reference`	`bool` (optional) Whether the reference is used in the evaluation.
`instance`	`QuestionAnsweringCorrectnessInstance` (optional) Evaluation input, consisting of inference inputs and the corresponding response.
`instance.prediction`	`string` (optional) LLM response.
`instance.reference`	`string` (optional) Golden LLM response for reference.
`instance.instruction`	`string` (optional) Instruction used at inference time.
`instance.context`	`string` (optional) Inference-time text containing all the information that can be used in the LLM response.
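A sketch of building `question_answering_correctness_input` so that `use_reference` is set only when a golden answer is actually supplied. Tying the flag to the presence of `reference` is an assumption of this sketch, not documented service behavior, and the helper name is hypothetical.

```python
from typing import Optional


def build_qa_correctness_input(prediction: str, instruction: str, context: str,
                               reference: Optional[str] = None) -> dict:
    """Build question_answering_correctness_input; use_reference follows the reference."""
    instance = {"prediction": prediction, "instruction": instruction, "context": context}
    if reference is not None:
        instance["reference"] = reference
    return {
        "question_answering_correctness_input": {
            "metric_spec": {"use_reference": reference is not None},
            "instance": instance,
        }
    }


request = build_qa_correctness_input(
    prediction="1889",
    instruction="When was the Eiffel Tower completed?",
    context="The Eiffel Tower was completed in 1889.",
    reference="1889",
)
```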

`QuestionAnsweringCorrectnessResult`

{
  "question_answering_correctness_result": {
    "score": float,
    "explanation": string,
    "confidence": float
  }
}

Response
`score`	`float`: One of the following: `0`: Incorrect, `1`: Correct.
`explanation`	`string`: Justification for the score assignment.
`confidence`	`float`: `[0, 1]` Confidence score of the result.

`PointwiseMetricInput`

{
  "pointwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	`PointwiseMetricSpec` (required) Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	`string` (required) A prompt template that defines the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	`PointwiseMetricInstance` (required) Evaluation input, consisting of `json_instance`.
`instance.json_instance`	`string` (optional) The key-value pairs in JSON format. For example, `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`.
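Note that `json_instance` is a string, so the key-value pairs have to be serialized with `json.dumps` rather than passed as a nested object. The sketch below builds the request and offers a local preview of the rendered prompt. The `{placeholder}` template syntax, the example template text, and the helper names are all assumptions for illustration; the service does its own rendering. `pairwise_metric_input` uses the same `json_instance` shape.

```python
import json
import re

# Hypothetical template; the {placeholder} syntax is an assumption for this sketch.
TEMPLATE = (
    "Rate how polite the response is on a scale of 1 to 5.\n"
    "Prompt: {prompt}\nResponse: {response}"
)


def build_pointwise_input(template: str, values: dict) -> dict:
    """Wrap a prompt template and its values; json_instance must be a JSON string."""
    return {
        "pointwise_metric_input": {
            "metric_spec": {"metric_prompt_template": template},
            "instance": {"json_instance": json.dumps(values)},
        }
    }


def preview(template: str, values: dict) -> str:
    """Locally substitute {key} placeholders; unknown keys are left untouched."""
    return re.sub(r"\{(\w+)\}", lambda m: str(values.get(m.group(1), m.group(0))), template)


values = {"prompt": "Can you help me?", "response": "Sure, happy to help."}
request = build_pointwise_input(TEMPLATE, values)
print(preview(TEMPLATE, values))
```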

`PointwiseMetricResult`

{
  "pointwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Response
`score`	`float`: Score for the pointwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`PairwiseMetricInput`

{
  "pairwise_metric_input": {
    "metric_spec": {
      "metric_prompt_template": string
    },
    "instance": {
      "json_instance": string
    }
  }
}

Parameters
`metric_spec`	`PairwiseMetricSpec` (required) Metric spec, defining the metric's behavior.
`metric_spec.metric_prompt_template`	`string` (required) A prompt template that defines the metric. It is rendered with the key-value pairs in `instance.json_instance`.
`instance`	`PairwiseMetricInstance` (required) Evaluation input, consisting of `json_instance`.
`instance.json_instance`	`string` (optional) The key-value pairs in JSON format. For example, `{"key_1": "value_1", "key_2": "value_2"}`. It is used to render `metric_spec.metric_prompt_template`.

`PairwiseMetricResult`

{
  "pairwise_metric_result": {
    "score": float,
    "explanation": string
  }
}

Response
`score`	`float`: Score for the pairwise metric evaluation result.
`explanation`	`string`: Justification for the score assignment.

`ToolCallValidInput`

{
  "tool_call_valid_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	`ToolCallValidSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`ToolCallValidInstance` (optional) Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	`string` (optional) Candidate model LLM response, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls. For example: { "content": "", "tool_calls": [ { "name": "book_tickets", "arguments": { "movie": "Mission Impossible Dead Reckoning Part 1", "theater": "Regal Edwards 14", "location": "Mountain View CA", "showtime": "7:30", "date": "2024-03-30", "num_tix": "2" } } ] }
`instance.reference`	`string` (optional) Golden model output in the same format as the prediction.
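Because `prediction` and `reference` are strings, the `content`/`tool_calls` object must be serialized before it goes into the request. A sketch that follows the example above, keeping `tool_calls` as a nested list inside the serialized object as that example does; the `tool_response` helper is hypothetical.

```python
import json


def tool_response(content: str, tool_calls: list[dict]) -> str:
    """Serialize a model turn into the JSON string used for prediction and reference."""
    return json.dumps({"content": content, "tool_calls": tool_calls})


call = {
    "name": "book_tickets",
    "arguments": {"movie": "Mission Impossible Dead Reckoning Part 1", "num_tix": "2"},
}
request = {
    "tool_call_valid_input": {
        "metric_spec": {},
        "instance": {
            "prediction": tool_response("", [call]),
            "reference": tool_response("", [call]),
        },
    }
}
print(json.dumps(request, indent=2))
```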

`ToolCallValidResults`

{
  "tool_call_valid_results": {
    "tool_call_valid_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Response
`tool_call_valid_metric_values`	repeated `ToolCallValidMetricValue`: Evaluation results per instance input.
`tool_call_valid_metric_values.score`	`float`: One of the following: `0`: Invalid tool call, `1`: Valid tool call.

`ToolNameMatchInput`

{
  "tool_name_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	`ToolNameMatchSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`ToolNameMatchInstance` (optional) Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	`string` (optional) Candidate model LLM response, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	`string` (optional) Golden model output in the same format as the prediction.

`ToolNameMatchResults`

{
  "tool_name_match_results": {
    "tool_name_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Response
`tool_name_match_metric_values`	repeated `ToolNameMatchMetricValue`: Evaluation results per instance input.
`tool_name_match_metric_values.score`	`float`: One of the following: `0`: The tool call name doesn't match the reference, `1`: The tool call name matches the reference.

`ToolParameterKeyMatchInput`

{
  "tool_parameter_key_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	`ToolParameterKeyMatchSpec` (optional) Metric spec, defining the metric's behavior.
`instance`	`ToolParameterKeyMatchInstance` (optional) Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	`string` (optional) Candidate model LLM response, which is a JSON-serialized string that contains `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized string of a list of tool calls.
`instance.reference`	`string` (optional) Golden model output in the same format as the prediction.

`ToolParameterKeyMatchResults`

{
  "tool_parameter_key_match_results": {
    "tool_parameter_key_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Response
`tool_parameter_key_match_metric_values`	repeated `ToolParameterKeyMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_key_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean more parameters match the names of the reference parameters.
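
Since each `tool_parameter_key_match_metric_values.score` lies in `[0, 1]`, the per-instance scores can be averaged into a single number. A minimal sketch, assuming the snake_case field names from the schema above and a hypothetical two-instance response:

```python
from statistics import mean


def mean_key_match_score(response):
    """Average the per-instance key-match scores, checking the documented range."""
    values = response["tool_parameter_key_match_results"][
        "tool_parameter_key_match_metric_values"
    ]
    scores = [value["score"] for value in values]
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score outside the documented range [0, 1]: {score}")
    return mean(scores)


# Hypothetical response with two evaluated instances.
sample_response = {
    "tool_parameter_key_match_results": {
        "tool_parameter_key_match_metric_values": [{"score": 1.0}, {"score": 0.5}]
    }
}
print(mean_key_match_score(sample_response))  # 0.75
```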

`ToolParameterKVMatchInput`

{
  "tool_parameter_kv_match_input": {
    "metric_spec": {},
    "instance": {
      "prediction": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	`ToolParameterKVMatchSpec` (Optional) Metric spec, defining the metric's behavior.
`instance`	`ToolParameterKVMatchInstance` (Optional) Evaluation input, consisting of the LLM response and the reference.
`instance.prediction`	`string` (Optional) Candidate model LLM response, which is a JSON-serialized string that contains the `content` and `tool_calls` keys. The `content` value is the text output from the model. The `tool_calls` value is a JSON-serialized list of tool calls.
`instance.reference`	`string` (Optional) Golden model output in the same format as the prediction.
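
To make the difference between the two parameter metrics concrete, the hypothetical pair below uses the same parameter names but a different value for `city`. Based on the descriptions on this page, such a pair should do well on `ToolParameterKeyMatch` (names only) and worse on `ToolParameterKVMatch` (names and values). The snippet only compares the arguments locally; it does not reproduce the service's scoring.

```python
import json

# Hypothetical prediction/reference pair: same parameter names, one differing value.
prediction = json.dumps({
    "content": "",
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}],
})
reference = json.dumps({
    "content": "",
    "tool_calls": [{"name": "get_weather", "arguments": {"city": "Lyon", "unit": "celsius"}}],
})


def first_call_arguments(serialized):
    """Return the arguments of the first tool call in a serialized model output."""
    return json.loads(serialized)["tool_calls"][0]["arguments"]


pred_args = first_call_arguments(prediction)
ref_args = first_call_arguments(reference)
same_keys = set(pred_args) == set(ref_args)  # what key matching looks at
same_items = pred_args == ref_args           # what key-value matching looks at
print(same_keys, same_items)  # True False
```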

`ToolParameterKVMatchResults`

{
  "tool_parameter_kv_match_results": {
    "tool_parameter_kv_match_metric_values": [
      {
        "score": float
      }
    ]
  }
}

Response
`tool_parameter_kv_match_metric_values`	repeated `ToolParameterKVMatchMetricValue`: Evaluation results per instance input.
`tool_parameter_kv_match_metric_values.score`	`float`: `[0, 1]`, where higher scores mean more parameters match the names and values of the reference parameters.

`CometInput`

{
  "comet_input" : {
    "metric_spec" : {
      "version": string,
      "source_language": string,
      "target_language": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	`CometSpec` (Optional) Metric spec, defining the metric's behavior.
`metric_spec.version`	`string` (Optional) `COMET_22_SRC_REF`: COMET 22 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	`string` (Optional) Source language in BCP-47 format. For example, 'es'.
`metric_spec.target_language`	`string` (Optional) Target language in BCP-47 format. For example, 'es'.
`instance`	`CometInstance` (Optional) Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the COMET version.
`instance.prediction`	`string` (Optional) Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	`string` (Optional) Source text, in the original language that the prediction was translated from.
`instance.reference`	`string` (Optional) Ground truth used to compare against the prediction. It is in the same language as the prediction.
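
A sketch of a `comet_input` body for `COMET_22_SRC_REF`, which uses the prediction, the source, and the reference together. The helper is hypothetical, and the sentences and the `es` to `en` language pair are illustrative placeholders.

```python
import json


def build_comet_request(prediction, source, reference,
                        source_language, target_language):
    """Build a comet_input body for COMET_22_SRC_REF.

    This version uses all three instance fields: prediction, source, and
    reference.
    """
    return {
        "comet_input": {
            "metric_spec": {
                "version": "COMET_22_SRC_REF",
                "source_language": source_language,
                "target_language": target_language,
            },
            "instance": {
                "prediction": prediction,
                "source": source,
                "reference": reference,
            },
        }
    }


# Placeholder sentences for illustration only.
body = build_comet_request(
    prediction="The cat sits on the mat.",
    source="El gato está sentado en la alfombra.",
    reference="The cat is sitting on the mat.",
    source_language="es",
    target_language="en",
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```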

`CometResult`

{
  "comet_result" : {
    "score": float
  }
}

Response
`score`	`float`: `[0, 1]`, where 1 represents a perfect translation.

`MetricxInput`

{
  "metricx_input" : {
    "metric_spec" : {
      "version": string,
      "source_language": string,
      "target_language": string
    },
    "instance": {
      "prediction": string,
      "source": string,
      "reference": string
    }
  }
}

Parameters
`metric_spec`	`MetricxSpec` (Optional) Metric spec, defining the metric's behavior.
`metric_spec.version`	`string` (Optional) One of the following: `METRICX_24_REF`: MetricX 24 for translation and reference. It evaluates the prediction (translation) by comparing it with the provided reference text input. `METRICX_24_SRC`: MetricX 24 for translation and source. It evaluates the translation (prediction) by Quality Estimation (QE), without a reference text input. `METRICX_24_SRC_REF`: MetricX 24 for translation, source, and reference. It evaluates the translation (prediction) using all three inputs.
`metric_spec.source_language`	`string` (Optional) Source language in BCP-47 format. For example, 'es'.
`metric_spec.target_language`	`string` (Optional) Target language in BCP-47 format. For example, 'es'.
`instance`	`MetricxInstance` (Optional) Evaluation input, consisting of the LLM response and the reference. The exact fields used for evaluation depend on the MetricX version.
`instance.prediction`	`string` (Optional) Candidate model LLM response. This is the output of the LLM that is being evaluated.
`instance.source`	`string` (Optional) Source text, in the original language that the prediction was translated from.
`instance.reference`	`string` (Optional) Ground truth used to compare against the prediction. It is in the same language as the prediction.
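
The three MetricX versions rely on different instance fields: `METRICX_24_REF` compares the prediction with the reference, `METRICX_24_SRC` uses the source only (quality estimation), and `METRICX_24_SRC_REF` uses all three. The hypothetical helper below encodes that mapping so that a missing field is caught before the request is sent; the example sentences are placeholders.

```python
# Instance fields each MetricX version uses, per the version descriptions above.
METRICX_REQUIRED_FIELDS = {
    "METRICX_24_REF": ("prediction", "reference"),
    "METRICX_24_SRC": ("prediction", "source"),
    "METRICX_24_SRC_REF": ("prediction", "source", "reference"),
}


def build_metricx_request(version, **instance):
    """Build a metricx_input body, checking that the fields the chosen
    version relies on are present and non-empty."""
    if version not in METRICX_REQUIRED_FIELDS:
        raise ValueError(f"unknown MetricX version: {version}")
    missing = [f for f in METRICX_REQUIRED_FIELDS[version] if not instance.get(f)]
    if missing:
        raise ValueError(f"{version} needs instance fields: {missing}")
    return {"metricx_input": {"metric_spec": {"version": version}, "instance": instance}}


# Quality estimation: no reference translation is needed.
body = build_metricx_request(
    "METRICX_24_SRC",
    prediction="The cat sits on the mat.",
    source="El gato está sentado en la alfombra.",
)
```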

`MetricxResult`

{
  "metricx_result" : {
    "score": float
  }
}

Response
`score`	`float`: `[0, 25]`, where 0 represents a perfect translation.
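
COMET and MetricX scores point in opposite directions: a COMET score of 1 and a MetricX score of 0 each represent a perfect translation. When comparing candidate translations, the selection therefore has to follow the metric. A small sketch with hypothetical scores:

```python
# Score direction per metric, from the documented ranges:
# COMET: [0, 1], where 1 is a perfect translation (higher is better).
# MetricX: [0, 25], where 0 is a perfect translation (lower is better).
HIGHER_IS_BETTER = {"comet": True, "metricx": False}


def best_candidate(metric, scores):
    """Return the index of the best-scoring candidate for the given metric."""
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(range(len(scores)), key=lambda i: scores[i])


# Hypothetical scores for three candidate translations.
print(best_candidate("comet", [0.81, 0.92, 0.77]))  # 1
print(best_candidate("metricx", [2.5, 4.1, 0.9]))   # 2
```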

Examples

Evaluate an output

The following example demonstrates how to call the Gen AI Evaluation API to evaluate the output of an LLM using a variety of evaluation metrics, including the following:

summarization_quality
groundedness
fulfillment
summarization_helpfulness
summarization_verbosity

Python

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# TODO(developer): Update and un-comment below line
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "instruction": [
            "Summarize the text in one sentence.",
            "Summarize the text such that a five-year-old can understand.",
        ],
        "context": [
            """As part of a comprehensive initiative to tackle urban congestion and foster
            sustainable urban living, a major city has revealed ambitious plans for an
            extensive overhaul of its public transportation system. The project aims not
            only to improve the efficiency and reliability of public transit but also to
            reduce the city\'s carbon footprint and promote eco-friendly commuting options.
            City officials anticipate that this strategic investment will enhance
            accessibility for residents and visitors alike, ushering in a new era of
            efficient, environmentally conscious urban transportation.""",
            """A team of archaeologists has unearthed ancient artifacts shedding light on a
            previously unknown civilization. The findings challenge existing historical
            narratives and provide valuable insights into human history.""",
        ],
        "response": [
            "A major city is revamping its public transportation system to fight congestion, reduce emissions, and make getting around greener and easier.",
            "Some people who dig for old things found some very special tools and objects that tell us about people who lived a long, long time ago! What they found is like a new puzzle piece that helps us understand how people used to live.",
        ],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY,
        MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        MetricPromptTemplateExamples.Pointwise.VERBOSITY,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
    ],
)

prompt_template = (
    "Instruction: {instruction}. Article: {context}. Summary: {response}"
)
result = eval_task.evaluate(prompt_template=prompt_template)

print("Summary Metrics:\n")

for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
# Summary Metrics:
# row_count:      2
# summarization_quality/mean:     3.5
# summarization_quality/std:      2.1213203435596424
# ...

Go

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// evaluateModelResponse evaluates the output of an LLM for groundedness, i.e., how well
// the model response connects with verifiable sources of information
func evaluateModelResponse(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	// evaluate the pre-generated model response against the reference (ground truth)
	responseToEvaluate := `
The city is undertaking a major project to revamp its public transportation system.
This initiative is designed to improve efficiency, reduce carbon emissions, and promote
eco-friendly commuting. The city expects that this investment will enhance accessibility
and usher in a new era of sustainable urban transportation.
`
	reference := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		// Check the API reference for a full list of supported metric inputs:
		// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#evaluateinstancesrequest
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_GroundednessInput{
			GroundednessInput: &aiplatformpb.GroundednessInput{
				MetricSpec: &aiplatformpb.GroundednessSpec{},
				Instance: &aiplatformpb.GroundednessInstance{
					Context:    &reference,
					Prediction: &responseToEvaluate,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetGroundednessResult()
	fmt.Fprintf(w, "score: %.2f\n", results.GetScore())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// score: 1.00
	// confidence: 1.00
	// explanation:
	// STEP 1: All aspects of the response are found in the context.
	// The response accurately summarizes the city's plan to overhaul its public transportation system, highlighting the goals of ...
	// STEP 2: According to the rubric, the response is scored 1 because all aspects of the response are attributable to the context.

	return nil
}

Evaluate an output: pairwise summarization quality

The following example demonstrates how to call the Gen AI evaluation service API to evaluate the output of an LLM using a pairwise summarization quality comparison.

REST

Before using any of the request data, make the following replacements:

PROJECT_ID: Your Google Cloud project ID.
LOCATION: The region to process the request.
PREDICTION: LLM response.
BASELINE_PREDICTION: Baseline model LLM response.
INSTRUCTION: The instruction used at inference time.
CONTEXT: Inference-time text containing all relevant information that can be used in the LLM response.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "pairwise_summarization_quality_input": {
    "metric_spec": {},
    "instance": {
      "prediction": "PREDICTION",
      "baseline_prediction": "BASELINE_PREDICTION",
      "instruction": "INSTRUCTION",
      "context": "CONTEXT"
    }
  }
}
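
Pasting multi-line text such as CONTEXT directly into the JSON template can produce invalid JSON, because raw newlines and unescaped quotes are not allowed inside JSON strings. One way around this, sketched below, is to generate request.json with Python's `json` module; the helper name and the short sample texts are hypothetical.

```python
import json


def write_pairwise_request(path, prediction, baseline_prediction,
                           instruction, context):
    """Write a pairwise_summarization_quality_input body to `path`.

    json.dump escapes quotes and newlines that would break the JSON if
    the text were pasted into the template by hand.
    """
    body = {
        "pairwise_summarization_quality_input": {
            "metric_spec": {},
            "instance": {
                "prediction": prediction,
                "baseline_prediction": baseline_prediction,
                "instruction": instruction,
                "context": context,
            },
        }
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(body, f, ensure_ascii=False, indent=2)
    return body


# Placeholder texts for illustration only.
body = write_pairwise_request(
    "request.json",
    prediction="The city is making its buses and trains better.",
    baseline_prediction="The city wants people to use fewer cars.",
    instruction="Summarize the text such that a five-year-old can understand.",
    context="A major city has revealed plans to overhaul\nits public transportation system.",
)
```

The resulting request.json can then be sent with the curl or PowerShell commands in this section.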

To send your request, choose one of these options:

curl

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.evaluation import (
    EvalTask,
    PairwiseMetric,
    MetricPromptTemplateExamples,
)

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

prompt = """
Summarize the text such that a five-year-old can understand.

# Text

As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city\'s carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
"""

eval_dataset = pd.DataFrame({"prompt": [prompt]})

# Baseline model for pairwise comparison
baseline_model = GenerativeModel("gemini-2.0-flash-lite-001")

# Candidate model for pairwise comparison
candidate_model = GenerativeModel(
    "gemini-2.0-flash-001", generation_config={"temperature": 0.4}
)

prompt_template = MetricPromptTemplateExamples.get_prompt_template(
    "pairwise_summarization_quality"
)

summarization_quality_metric = PairwiseMetric(
    metric="pairwise_summarization_quality",
    metric_prompt_template=prompt_template,
    baseline_model=baseline_model,
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[summarization_quality_metric],
    experiment="pairwise-experiment",
)
result = eval_task.evaluate(model=candidate_model)

baseline_model_response = result.metrics_table["baseline_model_response"].iloc[0]
candidate_model_response = result.metrics_table["response"].iloc[0]
winner_model = result.metrics_table[
    "pairwise_summarization_quality/pairwise_choice"
].iloc[0]
explanation = result.metrics_table[
    "pairwise_summarization_quality/explanation"
].iloc[0]

print(f"Baseline's story:\n{baseline_model_response}")
print(f"Candidate's story:\n{candidate_model_response}")
print(f"Winner: {winner_model}")
print(f"Explanation: {explanation}")
# Example response:
# Baseline's story:
# A big city wants to make it easier for people to get around without using cars! They're going to make buses and trains ...
#
# Candidate's story:
# A big city wants to make it easier for people to get around without using cars! ... This will help keep the air clean ...
#
# Winner: CANDIDATE
# Explanation: Both responses adhere to the prompt's constraints, are grounded in the provided text, and ... However, Response B ...

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	context_pkg "context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// pairwiseEvaluation lets the judge model compare the responses of two models and pick the better one
func pairwiseEvaluation(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context_pkg.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	context := `
As part of a comprehensive initiative to tackle urban congestion and foster
sustainable urban living, a major city has revealed ambitious plans for an
extensive overhaul of its public transportation system. The project aims not
only to improve the efficiency and reliability of public transit but also to
reduce the city's carbon footprint and promote eco-friendly commuting options.
City officials anticipate that this strategic investment will enhance
accessibility for residents and visitors alike, ushering in a new era of
efficient, environmentally conscious urban transportation.
`
	instruction := "Summarize the text such that a five-year-old can understand."
	baselineResponse := `
The city wants to make it easier for people to get around without using cars.
They're going to make the buses and trains better and faster, so people will want to
use them more. This will help the air be cleaner and make the city a better place to live.
`
	candidateResponse := `
The city is making big changes to how people get around. They want to make the buses and
trains work better and be easier for everyone to use. This will also help the environment
by getting people to use less gas. The city thinks these changes will make it easier for
everyone to get where they need to go.
`

	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_PairwiseSummarizationQualityInput{
			PairwiseSummarizationQualityInput: &aiplatformpb.PairwiseSummarizationQualityInput{
				MetricSpec: &aiplatformpb.PairwiseSummarizationQualitySpec{},
				Instance: &aiplatformpb.PairwiseSummarizationQualityInstance{
					Context:            &context,
					Instruction:        &instruction,
					Prediction:         &candidateResponse,
					BaselinePrediction: &baselineResponse,
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	results := resp.GetPairwiseSummarizationQualityResult()
	fmt.Fprintf(w, "choice: %s\n", results.GetPairwiseChoice())
	fmt.Fprintf(w, "confidence: %.2f\n", results.GetConfidence())
	fmt.Fprintf(w, "explanation:\n%s\n", results.GetExplanation())
	// Example response:
	// choice: BASELINE
	// confidence: 0.50
	// explanation:
	// BASELINE response is easier to understand. For example, the phrase "..." is easier to understand than "...". Thus, BASELINE response is ...

	return nil
}

Get ROUGE score

The following example calls the Gen AI evaluation service API to get the ROUGE score of a prediction generated from a number of inputs. The ROUGE input uses metric_spec, which determines the metric's behavior.

REST

Before using any of the request data, make the following replacements:

PROJECT_ID: Your Google Cloud project ID.
LOCATION: The region to process the request.
PREDICTION: LLM response.
REFERENCE: Golden LLM response for reference.
ROUGE_TYPE: The calculation used to determine the ROUGE score. See metric_spec.rouge_type for acceptable values.
USE_STEMMER: Determines whether the Porter stemmer is used to strip word suffixes to improve matching. For acceptable values, see metric_spec.use_stemmer.
SPLIT_SUMMARIES: Determines whether new lines are added between rougeLsum sentences. For acceptable values, see metric_spec.split_summaries.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances

Request JSON body:

{
  "rouge_input": {
    "instances": [
      {
        "prediction": "PREDICTION",
        "reference": "REFERENCE"
      }
    ],
    "metric_spec": {
      "rouge_type": "ROUGE_TYPE",
      "use_stemmer": USE_STEMMER,
      "split_summaries": SPLIT_SUMMARIES
    }
  }
}
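
Because `instances` is a list (the Go sample on this page passes a slice of `RougeInstance`), a single `rouge_input` request can score several prediction/reference pairs. A minimal sketch that builds the body in Python; the helper name, the sample pairs, and the `False` defaults for `use_stemmer` and `split_summaries` are illustrative choices, not documented defaults.

```python
import json


def build_rouge_request(pairs, rouge_type, use_stemmer=False, split_summaries=False):
    """Build a rouge_input body for one or more (prediction, reference) pairs."""
    return {
        "rouge_input": {
            "instances": [
                {"prediction": prediction, "reference": reference}
                for prediction, reference in pairs
            ],
            "metric_spec": {
                "rouge_type": rouge_type,
                "use_stemmer": use_stemmer,
                "split_summaries": split_summaries,
            },
        }
    }


# Hypothetical pairs for illustration only.
body = build_rouge_request(
    [
        ("The reef faces serious threats.", "The reef faces threats from climate change."),
        ("Coral bleaching harms the reef.", "Coral bleaching is a serious threat."),
    ],
    rouge_type="rouge1",
)
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` turns Python's `False` into JSON `false`, so the output can be saved directly as request.json.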

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION:evaluateInstances" | Select-Object -Expand Content

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

import pandas as pd

import vertexai
from vertexai.preview.evaluation import EvalTask

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

reference_summarization = """
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching."""

# Compare pre-generated model responses against the reference (ground truth).
eval_dataset = pd.DataFrame(
    {
        "response": [
            """The Great Barrier Reef, the world's largest coral reef system located
        in Australia, is a vast and diverse ecosystem. However, it faces serious
        threats from climate change, ocean acidification, and coral bleaching,
        endangering its rich marine life.""",
            """The Great Barrier Reef, a vast coral reef system off the coast of
        Queensland, Australia, is the world's largest. It's a complex ecosystem
        supporting diverse marine life, including endangered species. However,
        climate change, ocean acidification, and coral bleaching are serious
        threats to its survival.""",
            """The Great Barrier Reef, the world's largest coral reef system off the
        coast of Australia, is a vast and diverse ecosystem with thousands of
        reefs and islands. It is home to a multitude of marine life, including
        endangered species, but faces serious threats from climate change, ocean
        acidification, and coral bleaching.""",
        ],
        "reference": [reference_summarization] * 3,
    }
)
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "rouge_1",
        "rouge_2",
        "rouge_l",
        "rouge_l_sum",
    ],
)
result = eval_task.evaluate()

print("Summary Metrics:\n")
for key, value in result.summary_metrics.items():
    print(f"{key}: \t{value}")

print("\n\nMetrics Table:\n")
print(result.metrics_table)
# Example response:
#
# Summary Metrics:
#
# row_count:      3
# rouge_1/mean:   0.7191161666666667
# rouge_1/std:    0.06765143922270488
# rouge_2/mean:   0.5441118566666666
# ...
# Metrics Table:
#
#                                        response                         reference  ...  rouge_l/score  rouge_l_sum/score
# 0  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.577320           0.639175
# 1  The Great Barrier Reef, a vast coral...  \n    The Great Barrier Reef, the ...  ...       0.552381           0.666667
# 2  The Great Barrier Reef, the world's ...  \n    The Great Barrier Reef, the ...  ...       0.774775           0.774775

Go

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
)

// getROUGEScore evaluates a model response against a reference (ground truth) using the ROUGE metric
func getROUGEScore(w io.Writer, projectID, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewEvaluationClient(ctx, option.WithEndpoint(apiEndpoint))

	if err != nil {
		return fmt.Errorf("unable to create aiplatform client: %w", err)
	}
	defer client.Close()

	modelResponse := `
The Great Barrier Reef, the world's largest coral reef system located in Australia,
is a vast and diverse ecosystem. However, it faces serious threats from climate change,
ocean acidification, and coral bleaching, endangering its rich marine life.
`
	reference := `
The Great Barrier Reef, the world's largest coral reef system, is
located off the coast of Queensland, Australia. It's a vast
ecosystem spanning over 2,300 kilometers with thousands of reefs
and islands. While it harbors an incredible diversity of marine
life, including endangered species, it faces serious threats from
climate change, ocean acidification, and coral bleaching.
`
	req := aiplatformpb.EvaluateInstancesRequest{
		Location: fmt.Sprintf("projects/%s/locations/%s", projectID, location),
		MetricInputs: &aiplatformpb.EvaluateInstancesRequest_RougeInput{
			RougeInput: &aiplatformpb.RougeInput{
				// Check the API reference for the list of supported ROUGE metric types:
				// https://cloud.google.com/vertex-ai/docs/reference/rpc/google.cloud.aiplatform.v1beta1#rougespec
				MetricSpec: &aiplatformpb.RougeSpec{
					RougeType: "rouge1",
				},
				Instances: []*aiplatformpb.RougeInstance{
					{
						Prediction: &modelResponse,
						Reference:  &reference,
					},
				},
			},
		},
	}

	resp, err := client.EvaluateInstances(ctx, &req)
	if err != nil {
		return fmt.Errorf("evaluateInstances failed: %v", err)
	}

	fmt.Fprintln(w, "evaluation results:")
	fmt.Fprintln(w, resp.GetRougeResults().GetRougeMetricValues())
	// Example response:
	// [score:0.6597938]

	return nil
}

What's next

For detailed documentation, see Run an evaluation.
