The ML.EVALUATE function
This document describes the ML.EVALUATE function, which lets you
evaluate model metrics.
Supported models
You can use the ML.EVALUATE function with all model types except for the ones
described in the Limitations section of this document.
Syntax
The ML.EVALUATE function syntax differs depending on the type of model that
you use the function with. Choose the option appropriate for your use case.
Time series
ML.EVALUATE(
MODEL `PROJECT_ID.DATASET.MODEL`
[, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
STRUCT(
[PERFORM_AGGREGATION AS perform_aggregation]
[, HORIZON AS horizon]
[, CONFIDENCE_LEVEL AS confidence_level])
)
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned. If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned. If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query.
- PERFORM_AGGREGATION: a BOOL value that indicates the level of evaluation for forecasting accuracy. If you specify TRUE, the forecasting accuracy is on the time series level. If you specify FALSE, the forecasting accuracy is on the timestamp level. The default value is TRUE.
- HORIZON: an INT64 value that specifies the number of forecasted time points against which the evaluation metrics are computed. The default value is the horizon value specified in the CREATE MODEL statement for the time series model, or 1000 if unspecified. When evaluating multiple time series at the same time, this parameter applies to each time series. You can only use the HORIZON argument when the following conditions are met:
  - The model type is ARIMA_PLUS.
  - You have specified a value for either the TABLE or QUERY_STATEMENT argument.
- CONFIDENCE_LEVEL: a FLOAT64 value that specifies the percentage of the future values that fall in the prediction interval. The default value is 0.95. The valid input range is [0, 1). You can only use the CONFIDENCE_LEVEL argument when the following conditions are met:
  - The model type is ARIMA_PLUS.
  - You have specified a value for either the TABLE or QUERY_STATEMENT argument.
  - The PERFORM_AGGREGATION argument value is FALSE.

  The value of the CONFIDENCE_LEVEL argument affects the upper_bound and lower_bound values in the output.
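For example, the following query sketch evaluates time-series-level forecasting accuracy against an evaluation table. The model and table names (`mydataset.my_arima_model` and `mydataset.eval_table`) are hypothetical, and the table is assumed to contain the model's timestamp and data columns:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_model`,
    TABLE `mydataset.eval_table`,
    STRUCT(TRUE AS perform_aggregation, 7 AS horizon))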
Classification & regression
ML.EVALUATE(
MODEL `PROJECT_ID.DATASET.MODEL`
[, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
STRUCT(
[THRESHOLD AS threshold]
[, TRIAL_ID AS trial_id])
)
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned. If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules. The table must have a column that matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned. If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query. The query must have a column that matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used.
- THRESHOLD: a FLOAT64 value that specifies a custom threshold for the evaluation. You can only use the THRESHOLD argument with binary classification models. The default value is 0.5. A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives. You must specify a value for either the TABLE or QUERY_STATEMENT argument in order to specify a threshold.
- TRIAL_ID: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The ML.EVALUATE function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
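As an illustrative sketch, the following query evaluates one hyperparameter tuning trial of a binary classification model with a custom threshold. The model, table, and column names are hypothetical, and the query is assumed to return the label column used during training:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_classifier`,
    (
      SELECT label, feature1, feature2
      FROM `mydataset.eval_table`),
    STRUCT(0.6 AS threshold, 2 AS trial_id))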
Remote over Gemini
ML.EVALUATE(
MODEL `PROJECT_ID.DATASET.MODEL`
[, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
STRUCT(
[TASK_TYPE AS task_type]
[, MAX_OUTPUT_TOKENS AS max_output_tokens]
[, TEMPERATURE AS temperature]
[, TOP_P AS top_p])
)
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If the remote model isn't configured to use supervised tuning, the following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.

  If the remote model is configured to use supervised tuning, the following column naming requirements apply:

  - The table must have a column whose name matches the prompt column name that is provided during model training. You can provide this value by using the prompt_col option during model training. If prompt_col is unspecified, the column named prompt in the training data is used. An error is returned if there is no column named prompt.
  - The table must have a column whose name matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used. An error is returned if there is no column named label.

  You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console. For more information, see AS SELECT.

- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  If the remote model isn't configured to use supervised tuning, the following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

  If the remote model is configured to use supervised tuning, the following column naming requirements apply:

  - The query must have a column whose name matches the prompt column name that is provided during model training. You can provide this value by using the prompt_col option during model training. If prompt_col is unspecified, the column named prompt in the training data is used. An error is returned if there is no column named prompt.
  - The query must have a column whose name matches the label column name that is provided during model training. You can provide this value by using the input_label_cols option during model training. If input_label_cols is unspecified, the column named label in the training data is used. An error is returned if there is no column named label.

  You can find information about the label and prompt columns by looking at the model schema information in the Google Cloud console. For more information, see AS SELECT.

- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.

- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The default value is 1024. The MAX_OUTPUT_TOKENS value must be in the range [1,8192].
- TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected. The TEMPERATURE value must be in the range [0.0,1.0]. The default value is 1.0.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The default value is 0.95.
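For illustration, the following sketch evaluates a Gemini-based remote model on a question answering task. The model and table names are hypothetical, and the model is assumed not to use supervised tuning, so the query aliases its columns to input_text and output_text:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_gemini_model`,
    (
      SELECT question AS input_text, answer AS output_text
      FROM `mydataset.qa_eval_table`),
    STRUCT(
      'QUESTION_ANSWERING' AS task_type,
      256 AS max_output_tokens,
      0.0 AS temperature,
      0.95 AS top_p))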
Remote over Claude
ML.EVALUATE(
MODEL `PROJECT_ID.DATASET.MODEL`
[, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
STRUCT(
[TASK_TYPE AS task_type]
[, MAX_OUTPUT_TOKENS AS max_output_tokens]
[, TOP_K AS top_k]
[, TOP_P AS top_p])
)
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.

- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.

- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The default value is 1024. The MAX_OUTPUT_TOKENS value must be in the range [1,4096].
- TOP_K: an INT64 value in the range [1,40] that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one. A TOP_K value of 1 means the next selected token is the most probable among all tokens in the model's vocabulary, while a TOP_K value of 3 means that the next token is selected from among the three most probable tokens by using the TEMPERATURE value. For each token selection step, the TOP_K tokens with the highest probabilities are sampled. Then tokens are further filtered based on the TOP_P value, with the final token selected using temperature sampling.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.
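As an illustrative sketch with hypothetical model and table names, the following query evaluates a Claude-based remote model on a summarization task with custom sampling settings; the query supplies the required input_text and output_text columns:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_claude_model`,
    (
      SELECT article AS input_text, reference_summary AS output_text
      FROM `mydataset.summarization_eval`),
    STRUCT(
      'SUMMARIZATION' AS task_type,
      512 AS max_output_tokens,
      40 AS top_k,
      0.9 AS top_p))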
Remote over Llama or Mistral AI
ML.EVALUATE(
MODEL `PROJECT_ID.DATASET.MODEL`
[, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
STRUCT(
[TASK_TYPE AS task_type]
[, MAX_OUTPUT_TOKENS AS max_output_tokens]
[, TEMPERATURE AS temperature]
[, TOP_P AS top_p])
)
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.

- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.

- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The default value is 1024. The MAX_OUTPUT_TOKENS value must be in the range [1,4096].
- TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected. The TEMPERATURE value must be in the range [0.0,1.0]. The default value is 1.0.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.
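For example, a call like the following (hypothetical model and table names; the query supplies input_text and output_text) evaluates a Llama or Mistral AI remote model on a text generation task with a low temperature:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_llama_model`,
    (
      SELECT prompt_text AS input_text, expected_text AS output_text
      FROM `mydataset.text_gen_eval`),
    STRUCT(
      'TEXT_GENERATION' AS task_type,
      1024 AS max_output_tokens,
      0.2 AS temperature))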
Remote over open models
ML.EVALUATE(
MODEL `PROJECT_ID.DATASET.MODEL`
[, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
STRUCT(
[TASK_TYPE AS task_type]
[, MAX_OUTPUT_TOKENS AS max_output_tokens]
[, TEMPERATURE AS temperature]
[, TOP_K AS top_k]
[, TOP_P AS top_p])
)
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The table must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The table must have a column named output_text that contains the generated text that you would expect to be returned by the model.

- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned.

  The following column naming requirements apply:

  - The query must have a column named input_text that contains the prompt text to use when evaluating the model.
  - The query must have a column named output_text that contains the generated text that you would expect to be returned by the model.

- TASK_TYPE: a STRING value that specifies the type of task for which you want to evaluate the model's performance. The valid options are the following:

  - TEXT_GENERATION
  - CLASSIFICATION
  - SUMMARIZATION
  - QUESTION_ANSWERING

  The default value is TEXT_GENERATION.

- MAX_OUTPUT_TOKENS: an INT64 value that sets the maximum number of tokens output by the model. Specify a lower value for shorter responses and a higher value for longer responses. A token might be smaller than a word and is approximately four characters. 100 tokens correspond to approximately 60-80 words. The model determines an appropriate value if you don't specify one. The MAX_OUTPUT_TOKENS value must be in the range [1,4096].
- TEMPERATURE: a FLOAT64 value that is used for sampling during the response generation. It controls the degree of randomness in token selection. Lower TEMPERATURE values are good for prompts that require a more deterministic and less open-ended or creative response, while higher TEMPERATURE values can lead to more diverse or creative results. A TEMPERATURE value of 0 is deterministic, meaning that the highest probability response is always selected. The TEMPERATURE value must be in the range [0.0,1.0]. The model determines an appropriate value if you don't specify one.
- TOP_K: an INT64 value in the range [1,40] that changes how the model selects tokens for output. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one. A TOP_K value of 1 means the next selected token is the most probable among all tokens in the model's vocabulary, while a TOP_K value of 3 means that the next token is selected from among the three most probable tokens by using the TEMPERATURE value. For each token selection step, the TOP_K tokens with the highest probabilities are sampled. Then tokens are further filtered based on the TOP_P value, with the final token selected using temperature sampling.
- TOP_P: a FLOAT64 value in the range [0.0,1.0] that changes how the model selects tokens for output. Tokens are selected from the most to least probable until the sum of their probabilities equals the TOP_P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the TOP_P value is 0.5, then the model selects either A or B as the next token by using the TEMPERATURE value and doesn't consider C. Specify a lower value for less random responses and a higher value for more random responses. The model determines an appropriate value if you don't specify one.
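As a sketch with hypothetical names (the query supplies input_text and output_text), the following query evaluates a remote model over an open model on a classification task:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_open_model`,
    (
      SELECT review AS input_text, sentiment AS output_text
      FROM `mydataset.sentiment_eval`),
    STRUCT(
      'CLASSIFICATION' AS task_type,
      64 AS max_output_tokens,
      0.0 AS temperature,
      40 AS top_k,
      0.95 AS top_p))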
All other models
ML.EVALUATE(
MODEL `PROJECT_ID.DATASET.MODEL`
[, { TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }],
STRUCT(
[THRESHOLD AS threshold]
[, TRIAL_ID AS trial_id])
)
Arguments
ML.EVALUATE takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the dataset that contains the resource.
- MODEL: the name of the model.
- TABLE: the name of the input table that contains the evaluation data. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned. If you specify a TABLE value, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules.
- QUERY_STATEMENT: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the QUERY_STATEMENT clause in GoogleSQL, see Query syntax. If you don't specify a table or query to provide input data, the evaluation metrics that are generated for the model during training are returned. If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then you can only specify the input columns present in the TRANSFORM clause in the query.
- THRESHOLD: a FLOAT64 value that specifies a custom threshold for the evaluation. You can only use the THRESHOLD argument with binary classification models. The default value is 0.5. A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives. You must specify a value for either the TABLE or QUERY_STATEMENT argument in order to specify a threshold.
- TRIAL_ID: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The ML.EVALUATE function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model. You can't use the TRIAL_ID argument with PCA models.
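For example, a minimal call like the following (hypothetical model, table, and column names) evaluates a k-means model against new data without any of the optional arguments:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_kmeans_model`,
    (
      SELECT feature1, feature2, feature3
      FROM `mydataset.new_observations`))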
Output
ML.EVALUATE returns a single row of metrics applicable to the
type of model specified.
For models that return them, the precision, recall, f1_score, log_loss,
and roc_auc metrics are macro-averaged for all of the class labels. For a
macro-average, metrics are calculated for each label and then an unweighted
average is taken of those values.
Time series
ML.EVALUATE returns the following columns for ARIMA_PLUS or
ARIMA_PLUS_XREG models when input data is provided and
perform_aggregation is FALSE:
- time_series_id_col or time_series_id_cols: a value that contains the identifiers of a time series. time_series_id_col can be an INT64 or STRING value. time_series_id_cols can be an ARRAY<INT64> or ARRAY<STRING> value. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement. ARIMA_PLUS_XREG models don't support this column.
- time_series_timestamp_col: a STRING value that contains the timestamp column for a time series. The column name and type are inherited from the TIME_SERIES_TIMESTAMP_COL option as specified in the CREATE MODEL statement.
- time_series_data_col: a STRING value that contains the data column for a time series. The column name and type are inherited from the TIME_SERIES_DATA_COL option as specified in the CREATE MODEL statement.
- forecasted_time_series_data_col: a STRING value that contains the same data as time_series_data_col but with forecasted_ prefixed to the column name.
- lower_bound: a FLOAT64 value that contains the lower bound of the prediction interval.
- upper_bound: a FLOAT64 value that contains the upper bound of the prediction interval.
- absolute_error: a FLOAT64 value that contains the absolute value of the difference between the forecasted value and the actual data value.
- absolute_percentage_error: a FLOAT64 value that contains the absolute value of the absolute error divided by the actual value.
ML.EVALUATE returns the following columns for ARIMA_PLUS or
ARIMA_PLUS_XREG models when input data is provided and
perform_aggregation is TRUE:
- time_series_id_col or time_series_id_cols: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement. ARIMA_PLUS_XREG models don't support this column.
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- root_mean_squared_error: a FLOAT64 value that contains the root mean squared error for the model.
- mean_absolute_percentage_error: a FLOAT64 value that contains the mean absolute percentage error for the model.
- symmetric_mean_absolute_percentage_error: a FLOAT64 value that contains the symmetric mean absolute percentage error for the model.
ML.EVALUATE returns the following columns for an ARIMA_PLUS model when
input data isn't provided:
- time_series_id_col or time_series_id_cols: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement.
- non_seasonal_p: an INT64 value that contains the order for the autoregressive model. For more information, see Autoregressive integrated moving average.
- non_seasonal_d: an INT64 value that contains the degree of differencing for the non-seasonal model. For more information, see Autoregressive integrated moving average.
- non_seasonal_q: an INT64 value that contains the order for the moving average model. For more information, see Autoregressive integrated moving average.
- has_drift: a BOOL value that indicates whether the model includes a linear drift term.
- log_likelihood: a FLOAT64 value that contains the log likelihood for the model.
- aic: a FLOAT64 value that contains the Akaike information criterion for the model.
- variance: a FLOAT64 value that measures how far the observed value differs from the predicted value mean.
- seasonal_periods: a STRING value that contains the seasonal period for the model.
- has_holiday_effect: a BOOL value that indicates whether the model includes any holiday effects.
- has_spikes_and_dips: a BOOL value that indicates whether the model performs automatic spikes and dips detection and cleanup.
- has_step_changes: a BOOL value that indicates whether the model has step changes.
Classification
The following types of models are classification models:
- Logistic regressor
- Boosted tree classifier
- Random forest classifier
- DNN classifier
- Wide & Deep classifier
- AutoML Tables classifier
ML.EVALUATE returns the following columns for classification models:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.
- precision: a FLOAT64 value that contains the precision for the model.
- recall: a FLOAT64 value that contains the recall for the model.
- accuracy: a FLOAT64 value that contains the accuracy for the model. accuracy is computed as a global total or micro-average. For a micro-average, the metric is calculated globally by counting the total number of correctly predicted rows.
- f1_score: a FLOAT64 value that contains the F1 score for the model.
- log_loss: a FLOAT64 value that contains the logistic loss for the model.
- roc_auc: a FLOAT64 value that contains the area under the receiver operating characteristic curve for the model.
Regression
The following types of models are regression models:
- Linear regression
- Boosted tree regressor
- Random forest regressor
- Deep neural network (DNN) regressor
- Wide & Deep regressor
- AutoML Tables regressor
ML.EVALUATE returns the following columns for regression models:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
- median_absolute_error: a FLOAT64 value that contains the median absolute error for the model.
- r2_score: a FLOAT64 value that contains the R2 score for the model.
- explained_variance: a FLOAT64 value that contains the explained variance for the model.
K-means
ML.EVALUATE returns the following columns for k-means models:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
- davies_bouldin_index: a FLOAT64 value that contains the Davies-Bouldin index for the model.
- mean_squared_distance: a FLOAT64 value that contains the mean squared distance for the model, which is the average of the distances between training data points and their closest centroid.
Matrix factorization
ML.EVALUATE returns the following columns for matrix factorization models
with implicit feedback:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
- recall: a FLOAT64 value that contains the recall for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- normalized_discounted_cumulative_gain: a FLOAT64 value that contains the normalized discounted cumulative gain for the model.
- average_rank: a FLOAT64 value that contains the average rank (PDF download) for the model.
ML.EVALUATE returns the following columns for matrix factorization models
with explicit feedback:
- trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
- median_absolute_error: a FLOAT64 value that contains the median absolute error for the model.
- r2_score: a FLOAT64 value that contains the R2 score for the model.
- explained_variance: a FLOAT64 value that contains the explained variance for the model.
Remote over pre-trained models
This section describes the output for the following types of models:
- Gemini
- Anthropic Claude
- Mistral AI
- Llama
- Open models
ML.EVALUATE returns different columns depending on the task_type value
that you specify.
When you specify the TEXT_GENERATION task type, the following columns are
returned:
- bleu4_score: a FLOAT64 column that contains the bilingual evaluation understudy (BLEU4) score for the model.
- rouge-l_precision: a FLOAT64 column that contains the Recall-oriented understudy for gisting evaluation (ROUGE-L) precision for the model.
- rouge-l_recall: a FLOAT64 column that contains the ROUGE-L recall for the model.
- rouge-l_f1: a FLOAT64 column that contains the ROUGE-L F1 score for the model.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the number of total input rows.
When you specify the CLASSIFICATION task type, the following columns are
returned:
- precision: a FLOAT64 column that contains the precision for the model.
- recall: a FLOAT64 column that contains the recall for the model.
- f1: a FLOAT64 column that contains the F1 score for the model.
- label: a STRING column that contains the label generated for the input data.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the number of total input rows.
When you specify the SUMMARIZATION task type, the following columns are
returned:
- rouge-l_precision: a FLOAT64 column that contains the Recall-oriented understudy for gisting evaluation (ROUGE-L) precision for the model.
- rouge-l_recall: a FLOAT64 column that contains the ROUGE-L recall for the model.
- rouge-l_f1: a FLOAT64 column that contains the ROUGE-L F1 score for the model.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the number of total input rows.
When you specify the QUESTION_ANSWERING task type, the following columns are
returned:
- exact_match: a FLOAT64 column that indicates whether the generated text exactly matches the ground truth. This value is 1 if the generated text equals the ground truth, otherwise it is 0. This metric is an average across all of the input rows.
- evaluation_status: a STRING column in JSON format that contains the following elements:
  - num_successful_rows: the number of successful inference rows returned from Vertex AI.
  - num_total_rows: the number of total input rows.
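Because evaluation_status is returned as a JSON-formatted string, you can parse its row counts with JSON functions such as JSON_VALUE. The following is an illustrative sketch with hypothetical model and table names, using the CLASSIFICATION task type:

SELECT
  precision,
  recall,
  f1,
  JSON_VALUE(evaluation_status, '$.num_successful_rows') AS num_successful_rows,
  JSON_VALUE(evaluation_status, '$.num_total_rows') AS num_total_rows
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_llm`,
    (
      SELECT prompt, label
      FROM `mydataset.eval_table`),
    STRUCT('CLASSIFICATION' AS task_type))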
Remote over custom models
ML.EVALUATE returns the following column for remote models over
custom models deployed to Vertex AI:
- remote_eval_metrics: a JSON column that contains the appropriate metrics for the model type.
PCA
ML.EVALUATE returns the following column for PCA models:
- total_explained_variance_ratio: a FLOAT64 value that contains the percentage of the cumulative variance explained by all the returned principal components. For more information, see the ML.PRINCIPAL_COMPONENT_INFO function.
Autoencoder
ML.EVALUATE returns the following columns for autoencoder models:
- mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
- mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
- mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
Limitations
ML.EVALUATE is subject to the following limitations:
- ML.EVALUATE doesn't support imported TensorFlow models or remote models over Cloud AI services.
- For remote models over Vertex AI endpoints, ML.EVALUATE fetches evaluation results from the Vertex AI endpoint and doesn't take any input data.
Costs
When used with remote models over Vertex AI LLMs,
ML.EVALUATE costs are calculated based on the following:
- The bytes processed from the input table. These charges are billed from BigQuery to your project. For more information, see BigQuery pricing.
- The input to and output from the LLM. These charges are billed from Vertex AI to your project. For more information, see Vertex AI pricing.
Examples
The following examples show how to use ML.EVALUATE.
ML.EVALUATE with no input data specified
The following query evaluates a model with no input data specified:
SELECT * FROM ML.EVALUATE(MODEL `mydataset.mymodel`)
ML.EVALUATE with a custom threshold and input data
The following query evaluates a model with input data and a custom
threshold of 0.55:
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.mymodel`,
    (
      SELECT custom_label, column1, column2
      FROM `mydataset.mytable`),
    STRUCT(0.55 AS threshold))
ML.EVALUATE to calculate forecasting accuracy of a time series
The following query evaluates the 30-point forecasting accuracy for a time series model:
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_model`,
    (
      SELECT timeseries_date, timeseries_metric
      FROM `mydataset.mytable`),
    STRUCT(TRUE AS perform_aggregation, 30 AS horizon))
ML.EVALUATE to calculate ARIMA_PLUS forecasting accuracy for each forecasted timestamp
The following query evaluates the forecasting accuracy for each of the 30
forecasted points of a time series model. It also computes the prediction
interval based on a confidence level of 0.9.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_model`,
    (
      SELECT timeseries_date, timeseries_metric
      FROM `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))
ML.EVALUATE to calculate ARIMA_PLUS_XREG forecasting accuracy for each forecasted timestamp
The following query evaluates the forecasting accuracy for each of the 30
forecasted points of a time series model. It also computes the prediction
interval based on a confidence level of 0.9. Note that you need to include the
side features in the evaluation data.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_arima_xreg_model`,
    (
      SELECT timeseries_date, timeseries_metric, feature1, feature2
      FROM `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level, 30 AS horizon))
ML.EVALUATE to calculate LLM text generation accuracy
The following query evaluates the LLM text generation accuracy for the classification task type for each label from the evaluation table.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `mydataset.my_llm`,
    (
      SELECT prompt, label
      FROM `mydataset.mytable`),
    STRUCT('classification' AS task_type))
What's next
- For more information about model evaluation, see BigQuery ML model evaluation overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.