Metrics for general pairwise text generation evaluation results.
`modelWinRate` (number)
Percentage of time the autorater decided the model had the better response.
`baselineModelWinRate` (number)
Percentage of time the autorater decided the baseline model had the better response.
`humanPreferenceModelWinRate` (number)
Percentage of time humans decided the model had the better response.
`humanPreferenceBaselineModelWinRate` (number)
Percentage of time humans decided the baseline model had the better response.
`truePositiveCount` (string)
Number of examples where both the autorater and humans decided that the model had the better response.
`falsePositiveCount` (string)
Number of examples where the autorater chose the model, but humans preferred the baseline model.
`falseNegativeCount` (string)
Number of examples where the autorater chose the baseline model, but humans preferred the model.
`trueNegativeCount` (string)
Number of examples where both the autorater and humans decided that the model had the worse response.
`accuracy` (number)
Fraction of cases where the autorater agreed with the human raters.
`precision` (number)
Fraction of cases where both the autorater and humans thought the model had the better response, out of all cases where the autorater thought so: true positives divided by all autorater positives.
`recall` (number)
Fraction of cases where both the autorater and humans thought the model had the better response, out of all cases where the humans thought so: true positives divided by all human positives.
`f1Score` (number)
Harmonic mean of precision and recall.
`cohensKappa` (number)
A measure of agreement between the autorater and the human raters that accounts for the likelihood of the two agreeing by chance. All of the metrics above can be derived from the four counts, as sketched below.
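The win rates and agreement metrics above all follow from the four counts. The following Python sketch shows the standard formulas; it assumes every example received both an autorater and a human verdict and that ties are not counted (this page does not specify tie handling), and the function name is illustrative rather than part of any API.

```python
def pairwise_eval_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the metrics above from the four agreement counts.

    "Positive" means a preference for the candidate model; the autorater
    acts as the predictor and the human raters as the ground truth.
    Assumes the counts are large enough that no denominator is zero.
    """
    total = tp + fp + fn + tn

    # Win rates (assumption: no ties, every example rated by both sides).
    model_win_rate = (tp + fp) / total           # autorater preferred the model
    baseline_win_rate = (fn + tn) / total        # autorater preferred the baseline
    human_model_win_rate = (tp + fn) / total     # humans preferred the model
    human_baseline_win_rate = (fp + tn) / total  # humans preferred the baseline

    # Agreement metrics, treating human preference as ground truth.
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_score = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for the agreement the two
    # raters would reach by chance, given their individual win rates.
    p_chance = (model_win_rate * human_model_win_rate
                + baseline_win_rate * human_baseline_win_rate)
    cohens_kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "modelWinRate": model_win_rate,
        "baselineModelWinRate": baseline_win_rate,
        "humanPreferenceModelWinRate": human_model_win_rate,
        "humanPreferenceBaselineModelWinRate": human_baseline_win_rate,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1Score": f1_score,
        "cohensKappa": cohens_kappa,
    }
```

For example, `pairwise_eval_metrics(tp=42, fp=18, fn=13, tn=27)` yields an accuracy of 0.69, a precision of 0.70, and a Cohen's kappa of about 0.37.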
**JSON representation**

```json
{
  "modelWinRate": number,
  "baselineModelWinRate": number,
  "humanPreferenceModelWinRate": number,
  "humanPreferenceBaselineModelWinRate": number,
  "truePositiveCount": string,
  "falsePositiveCount": string,
  "falseNegativeCount": string,
  "trueNegativeCount": string,
  "accuracy": number,
  "precision": number,
  "recall": number,
  "f1Score": number,
  "cohensKappa": number
}
```