Tool: create_evaluation_dataset
Creates a new evaluation dataset.
The following sample demonstrates how to use curl to invoke the create_evaluation_dataset MCP tool.
```shell
curl --location 'https://ces.[REGION].rep.googleapis.com/mcp' \
  --header 'content-type: application/json' \
  --header 'accept: application/json, text/event-stream' \
  --data '{
    "method": "tools/call",
    "params": {
      "name": "create_evaluation_dataset",
      "arguments": {
        // provide these details according to the MCP specification for this tool
      }
    },
    "jsonrpc": "2.0",
    "id": 1
  }'
```
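The same call can be made programmatically. Below is a minimal Python sketch using only the standard library; the region value and the argument payload are placeholders, and any authentication headers your deployment requires are omitted.

```python
import json
import urllib.request


def build_payload(arguments: dict) -> dict:
    """JSON-RPC 2.0 envelope for a tools/call invocation of this tool."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": "create_evaluation_dataset", "arguments": arguments},
    }


def call_tool(region: str, arguments: dict) -> bytes:
    """POST the envelope to the MCP endpoint. `region` is a placeholder."""
    req = urllib.request.Request(
        f"https://ces.{region}.rep.googleapis.com/mcp",
        data=json.dumps(build_payload(arguments)).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "accept": "application/json, text/event-stream",
        },
    )
    with urllib.request.urlopen(req) as resp:  # network call, not exercised here
        return resp.read()
```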
Input Schema
Request message for EvaluationService.CreateEvaluationDataset.
CreateEvaluationDatasetRequest
JSON representation
```
{
  "parent": string,
  "evaluationDatasetId": string,
  "evaluationDataset": {
    object (EvaluationDataset)
  }
}
```
| Fields | |
|---|---|
| `parent` | Required. The app to create the evaluation dataset for. Format: |
| `evaluationDatasetId` | Optional. The ID to use for the evaluation dataset, which will become the final component of the evaluation dataset's resource name. If not provided, a unique ID is assigned automatically. |
| `evaluationDataset` | Required. The evaluation dataset to create. |
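Putting the input schema together, a hypothetical `arguments` object might look like the following. The exact `parent` resource format is truncated in the table above, so the value here is illustrative only.

```python
# Hypothetical arguments for create_evaluation_dataset. The exact `parent`
# resource format is not shown in the docs, so this value is illustrative only.
arguments = {
    "parent": "apps/my-app",                     # illustrative; use the real resource format
    "evaluationDatasetId": "regression-suite",   # optional; auto-assigned if omitted
    "evaluationDataset": {
        "displayName": "Regression suite",       # required; unique within an App
        "evaluations": [],                       # optional list of evaluation resource names
    },
}
```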
EvaluationDataset
JSON representation
```
{
  "name": string,
  "displayName": string,
  "evaluations": [
    string
  ],
  "createTime": string,
  "updateTime": string,
  "etag": string,
  "createdBy": string,
  "lastUpdatedBy": string,
  "aggregatedMetrics": {
    object (AggregatedMetrics)
  }
}
```
| Fields | |
|---|---|
| `name` | Identifier. The unique identifier of this evaluation dataset. Format: |
| `displayName` | Required. User-defined display name of the evaluation dataset. Unique within an App. |
| `evaluations[]` | Optional. Evaluations that are included in this dataset. |
| `createTime` | Output only. Timestamp when the evaluation dataset was created. Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: `"2014-10-02T15:01:23Z"`, `"2014-10-02T15:01:23.045123456Z"` or `"2014-10-02T15:01:23+05:30"`. |
| `updateTime` | Output only. Timestamp when the evaluation dataset was last updated. Uses RFC 3339, where generated output will always be Z-normalized and use 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted. Examples: `"2014-10-02T15:01:23Z"`, `"2014-10-02T15:01:23.045123456Z"` or `"2014-10-02T15:01:23+05:30"`. |
| `etag` | Output only. Etag used to ensure the object hasn't changed during a read-modify-write operation. If the etag is empty, the update will overwrite any concurrent changes. |
| `createdBy` | Output only. The user who created the evaluation dataset. |
| `lastUpdatedBy` | Output only. The user who last updated the evaluation dataset. |
| `aggregatedMetrics` | Output only. The aggregated metrics for this evaluation dataset across all runs. |
Timestamp
JSON representation
```
{
  "seconds": string,
  "nanos": integer
}
```
| Fields | |
|---|---|
| `seconds` | Represents seconds of UTC time since Unix epoch 1970-01-01T00:00:00Z. Must be between -62135596800 and 253402300799 inclusive (which corresponds to 0001-01-01T00:00:00Z to 9999-12-31T23:59:59Z). |
| `nanos` | Non-negative fractions of a second at nanosecond resolution. This field is the nanosecond portion of the duration, not an alternative to seconds. Negative second values with fractions must still have non-negative nanos values that count forward in time. Must be between 0 and 999,999,999 inclusive. |
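The `seconds`/`nanos` pair above maps to the RFC 3339 strings used by fields such as `createTime` and `updateTime`. A minimal sketch of the Z-normalized rendering rule (0, 3, 6, or 9 fractional digits), assuming a non-negative timestamp:

```python
from datetime import datetime, timezone


def timestamp_to_rfc3339(seconds: int, nanos: int = 0) -> str:
    """Render a Timestamp as a Z-normalized RFC 3339 string with
    0, 3, 6, or 9 fractional digits, as described above."""
    base = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    if nanos == 0:
        frac = ""
    elif nanos % 1_000_000 == 0:      # milliseconds suffice
        frac = ".%03d" % (nanos // 1_000_000)
    elif nanos % 1_000 == 0:          # microseconds suffice
        frac = ".%06d" % (nanos // 1_000)
    else:                             # full nanosecond precision
        frac = ".%09d" % nanos
    return base + frac + "Z"
```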
AggregatedMetrics
JSON representation
```
{
  "metricsByAppVersion": [
    {
      object (MetricsByAppVersion)
    }
  ]
}
```

| Fields | |
|---|---|
| `metricsByAppVersion[]` | Output only. Aggregated metrics, grouped by app version ID. |
MetricsByAppVersion
JSON representation
```
{
  "appVersionId": string,
  "toolMetrics": [
    {
      object (ToolMetrics)
    }
  ],
  "semanticSimilarityMetrics": [
    {
      object (SemanticSimilarityMetrics)
    }
  ],
  "hallucinationMetrics": [
    {
      object (HallucinationMetrics)
    }
  ],
  "toolCallLatencyMetrics": [
    {
      object (ToolCallLatencyMetrics)
    }
  ],
  "turnLatencyMetrics": [
    {
      object (TurnLatencyMetrics)
    }
  ],
  "passCount": integer,
  "failCount": integer,
  "metricsByTurn": [
    {
      object (MetricsByTurn)
    }
  ]
}
```

| Fields | |
|---|---|
| `appVersionId` | Output only. The app version ID. |
| `toolMetrics[]` | Output only. Metrics for each tool within this app version. |
| `semanticSimilarityMetrics[]` | Output only. Metrics for semantic similarity within this app version. |
| `hallucinationMetrics[]` | Output only. Metrics for hallucination within this app version. |
| `toolCallLatencyMetrics[]` | Output only. Metrics for tool call latency within this app version. |
| `turnLatencyMetrics[]` | Output only. Metrics for turn latency within this app version. |
| `passCount` | Output only. The number of times the evaluation passed. |
| `failCount` | Output only. The number of times the evaluation failed. |
| `metricsByTurn[]` | Output only. Metrics aggregated per turn within this app version. |
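Given an `AggregatedMetrics` object in the JSON shape above, the per-version pass rate can be derived from `passCount` and `failCount`. A sketch (field names follow the schema above; `None` marks versions with no recorded runs):

```python
def summarize_by_version(aggregated: dict) -> dict:
    """Map each app version ID to its overall pass rate (None when no runs)."""
    out = {}
    for entry in aggregated.get("metricsByAppVersion", []):
        passes = entry.get("passCount", 0)
        total = passes + entry.get("failCount", 0)
        out[entry["appVersionId"]] = passes / total if total else None
    return out
```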
ToolMetrics
JSON representation
```
{
  "tool": string,
  "passCount": integer,
  "failCount": integer
}
```

| Fields | |
|---|---|
| `tool` | Output only. The name of the tool. |
| `passCount` | Output only. The number of times the tool passed. |
| `failCount` | Output only. The number of times the tool failed. |
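A list of `ToolMetrics` entries in this shape can be reduced to per-tool pass rates; a small sketch under the same field-name assumptions:

```python
def tool_pass_rates(tool_metrics: list) -> dict:
    """Per-tool pass rate from ToolMetrics entries (None when no calls)."""
    rates = {}
    for m in tool_metrics:
        passes = m.get("passCount", 0)
        total = passes + m.get("failCount", 0)
        rates[m["tool"]] = passes / total if total else None
    return rates
```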
SemanticSimilarityMetrics
JSON representation
```
{
  "score": number
}
```

| Fields | |
|---|---|
| `score` | Output only. The average semantic similarity score (0-4). |
HallucinationMetrics
JSON representation
```
{
  "score": number
}
```

| Fields | |
|---|---|
| `score` | Output only. The average hallucination score (0 to 1). |
ToolCallLatencyMetrics
JSON representation
```
{
  "tool": string,
  "averageLatency": string
}
```

| Fields | |
|---|---|
| `tool` | Output only. The name of the tool. |
| `averageLatency` | Output only. The average latency of the tool calls. A duration in seconds with up to nine fractional digits, ending with 's'. Example: `"3.5s"`. |
Duration
JSON representation
```
{
  "seconds": string,
  "nanos": integer
}
```

| Fields | |
|---|---|
| `seconds` | Signed seconds of the span of time. Must be from -315,576,000,000 to +315,576,000,000 inclusive. Note: these bounds are computed from: 60 sec/min * 60 min/hr * 24 hr/day * 365.25 days/year * 10000 years |
| `nanos` | Signed fractions of a second at nanosecond resolution of the span of time. Durations less than one second are represented with a 0 `seconds` field and a positive or negative `nanos` field. For durations of one second or more, a non-zero value for the `nanos` field must be of the same sign as the `seconds` field. Must be from -999,999,999 to +999,999,999 inclusive. |
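The JSON form of a Duration is the 's'-suffixed decimal string described above (e.g. `"3.5s"`). A sketch of splitting that string back into the `seconds`/`nanos` pair:

```python
def parse_duration(s: str) -> tuple:
    """Split a JSON Duration like '3.5s' into a (seconds, nanos) pair.
    Assumes the 's'-suffixed decimal form described above."""
    if not s.endswith("s"):
        raise ValueError("Duration must end with 's'")
    value = s[:-1]
    sign = -1 if value.startswith("-") else 1
    digits = value.lstrip("+-")
    whole, _, frac = digits.partition(".")
    seconds = sign * int(whole or "0")
    # Right-pad the fractional part to nine digits (nanosecond resolution).
    nanos = sign * int((frac + "000000000")[:9]) if frac else 0
    return seconds, nanos
```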
TurnLatencyMetrics
JSON representation
```
{
  "averageLatency": string
}
```

| Fields | |
|---|---|
| `averageLatency` | Output only. The average latency of the turns. A duration in seconds with up to nine fractional digits, ending with 's'. Example: `"3.5s"`. |
MetricsByTurn
JSON representation
```
{
  "turnIndex": integer,
  "toolMetrics": [
    {
      object (ToolMetrics)
    }
  ],
  "semanticSimilarityMetrics": [
    {
      object (SemanticSimilarityMetrics)
    }
  ],
  "hallucinationMetrics": [
    {
      object (HallucinationMetrics)
    }
  ],
  "toolCallLatencyMetrics": [
    {
      object (ToolCallLatencyMetrics)
    }
  ],
  "turnLatencyMetrics": [
    {
      object (TurnLatencyMetrics)
    }
  ]
}
```

| Fields | |
|---|---|
| `turnIndex` | Output only. The turn index (0-based). |
| `toolMetrics[]` | Output only. Metrics for each tool within this turn. |
| `semanticSimilarityMetrics[]` | Output only. Metrics for semantic similarity within this turn. |
| `hallucinationMetrics[]` | Output only. Metrics for hallucination within this turn. |
| `toolCallLatencyMetrics[]` | Output only. Metrics for tool call latency within this turn. |
| `turnLatencyMetrics[]` | Output only. Metrics for turn latency within this turn. |
Output Schema
An evaluation dataset represents a set of evaluations that are grouped together based on shared tags.
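Because the request advertises `accept: application/json, text/event-stream`, the response may arrive either as a plain JSON body or as an event stream. A sketch of pulling the JSON-RPC payloads out of an SSE body, assuming each event carries a single `data:` line of JSON:

```python
import json


def parse_sse_json(body: str) -> list:
    """Extract JSON payloads from a text/event-stream response body.
    Assumes each event carries exactly one `data:` line of JSON."""
    events = []
    for line in body.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events
```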
EvaluationDataset
The response body is the `EvaluationDataset` message, identical to the `EvaluationDataset` defined under Input Schema above, together with its nested `Timestamp`, `AggregatedMetrics`, `MetricsByAppVersion`, `ToolMetrics`, `SemanticSimilarityMetrics`, `HallucinationMetrics`, `ToolCallLatencyMetrics`, `Duration`, `TurnLatencyMetrics`, and `MetricsByTurn` messages.
Tool Annotations
Destructive Hint: ❌ | Idempotent Hint: ❌ | Read Only Hint: ❌ | Open World Hint: ❌