The ML.PREDICT function
This document describes the ML.PREDICT function, which you can use to
predict outcomes by using a model. ML.PREDICT works with the following models:
- Linear and logistic regression models
- Boosted tree models
- Random forest models
- Deep neural network (DNN) models
- Wide-and-deep models
- K-means models
- Principal component analysis (PCA) models
- Autoencoder models
- Imported models:
- Vertex AI hosted models
For PCA and autoencoder models, you can use the
ML.GENERATE_EMBEDDING function as an alternative to the ML.PREDICT
function.
ML.GENERATE_EMBEDDING generates the same embedding data as ML.PREDICT
as an array in a single column, rather than in a series of columns. Having all
of the embeddings in a single column lets you directly use the
VECTOR_SEARCH function
on theML.GENERATE_EMBEDDING output.
You can run prediction during model creation, after model creation, or after a
failure (as long as at least one iteration is finished). ML.PREDICT always uses
the model weights from the last successful iteration.
Syntax
ML.PREDICT(
MODEL `PROJECT_ID.DATASET.MODEL_NAME`,
{ TABLE `PROJECT_ID.DATASET.TABLE` | (QUERY_STATEMENT) }
STRUCT(
[THRESHOLD AS threshold]
[, KEEP_ORIGINAL_COLUMNS AS keep_original_columns]
[, TRIAL_ID AS trial_id])
)
Arguments
ML.PREDICT takes the following arguments:
PROJECT_ID: the project that contains the resource.DATASET: the dataset that contains the resource.MODEL: the name of the model.TABLE: The name of the input table that contains the evaluation data.If
TABLEis specified, the input column names in the table must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.For TensorFlow Lite, Open Neural Network Exchange (ONNX), and XGBoost models, the input must be convertible to the type expected by the model.
For remote models, the input columns must contain all Vertex AI endpoint input fields.
If there are unused columns from the table, they are passed through as output columns.
QUERY_STATEMENT: The GoogleSQL query that is used to generate the evaluation data. See the GoogleSQL query syntax page for the supported SQL syntax of theQUERY_STATEMENTclause.If
QUERY_STATEMENTis specified, the input column names from the query must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.For TensorFlow Lite, ONNX, and XGBoost models, the input must be convertible to the type expected by the model.
For remote models, the input columns must contain all Vertex AI endpoint input fields.
If there are unused columns from the query, they are passed through as output columns.
If you used the
TRANSFORMclause in theCREATE MODELstatement that created the model, then only the input columns present in theTRANSFORMclause must appear inQUERY_STATEMENT.If you are running inference on image data from an object table, you must use the
ML.DECODE_IMAGEfunction to convert image bytes to a multi-dimensionalARRAYrepresentation. You can useML.DECODE_IMAGEoutput directly in anML.PREDICTstatement, or you can write the results fromML.DECODE_IMAGEto a table column and reference that column when you callML.PREDICT. For more information, see Predict an outcome from image data with an imported TensorFlow model.THRESHOLD: a
FLOAT64value that specifies a custom threshold for a binary classification model. It is used as the cutoff between the two labels. Predictions above the threshold are positive predictions. Predictions below the threshold are negative predictions. The default value is0.5.KEEP_ORIGINAL_COLUMNS: aBOOLvalue that specifies whether to output the input table columns. IfTRUE, the columns from the input table are output. The default value isFALSE.KEEP_ORIGINAL_COLUMNSonly applies to principal component analysis (PCA) models.TRIAL_ID: anINT64value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
Output
The output of the ML.PREDICT function has as many rows as the input table, and
it includes all columns from the input table and all output columns from the
model. The output column names for the model are predicted_<label_column_name>
and, for classification models, predicted_<label_column_name>_probs. In
both columns, label_column_name is the name of the input label column that's
used during training.
Regression models
For the following types of regression models:
- Linear regression
- Boosted tree regressor
- Random forest regressor
- DNN regressor
- Wide-and-deep regressor
The following column is returned:
predicted_<label_column_name>: aSTRINGvalue that contains the predicted value of the label.
Classification models
For the following types of binary-class classification models:
- Logistic regression
- Boosted tree classifier
- Random forest classifier
- DNN classifier
- Wide-and-deep classifier
The following columns are returned:
- The
predicted_<label_column_name>: aSTRINGvalue that contains one of the two input labels, depending on which label has the higher predicted probability. - The
predicted_<label_column_name>_probs: anARRAY<STRUCT>value in the form[<label, probability>]that contains the predicted probability of each label.
For the following types of multiclass classification models:
- Logistic regression
- Boosted tree classifier
- Random forest classifier
- DNN classifier
- Wide-and-deep classifier
The following columns are returned:
- The
predicted_<label_column_name>: aSTRINGvalue that contains the label with the highest predicted probability score. - The
predicted_<label_column_name>_probs: aFLOAT64value that contains the probability for each class label, calculated using a softmax function.
K-means models
For k-means models, the following columns are returned:
centroid_id: anINT64value that identifies the centroid.nearest_centroids_distance: anARRAY<STRUCT>value that contains the distances to the nearestkclusters, wherekis equal to the lesser ofnum_clustersor5. If the model was created with thestandardize_featuresoption set toTRUE, then the model computes these distances using standardized features; otherwise, it computes these distances using non-standardized features.
PCA models
For PCA models, the following columns are returned:
principal_component_<index>: anINT64value that represents the projection of the input data onto each principal component. These values can also be considered as embedded low-dimensional features in the space that is spanned by the principal components.
The original input columns are appended if the keep_original_columns
argument is set to TRUE.
Autoencoder models
For autoencoder models, the following columns are returned:
latent_col_<index>: anINT64value that represents the dimensions of the latent space.
The original input columns are appended after the latent space columns.
Imported models
For TensorFlow Lite models, the output is the output of the TensorFlow Lite model's predict method.
For ONNX models, the output is the output of the ONNX model's predict method.
For XGBoost models, the output is the output of the XGBoost model's predict method.
Remote models
For remote models, the output columns contain all Vertex AI endpoint
output fields, and also a remote_model_status field that contains status
messages from Vertex AI endpoint.
Missing data imputation
In statistics, imputation is used to replace missing data with substituted
values. When you train a model in BigQuery ML, NULL values are
treated as missing data. When you predict outcomes in BigQuery ML,
missing values can occur when BigQuery ML encounters a NULL
value or a previously unseen value. BigQuery ML handles missing
data differently, based on the type of data in the column.
| Column type | Imputation method |
|---|---|
| Numeric | In both training and prediction, NULL values in numeric
columns are replaced with the mean value of the given column, as calculated
by the feature column in the original input data. |
| One-hot/Multi-hot encoded | In both training and prediction, NULL values in the
encoded columns are mapped to an additional category that is added to
the data. Previously unseen data is assigned a weight of 0 during
prediction. |
TIMESTAMP |
TIMESTAMP columns use a mixture of imputation methods
from both standardized and one-hot encoded columns. For the generated Unix
time column, BigQuery ML replaces values with the mean Unix
time across the original columns. For other generated values,
BigQuery ML assigns them to the respective NULL
category for each extracted feature. |
STRUCT |
In both training and prediction, each field of the STRUCT
is imputed according to its type. |
Permissions
You must have the bigquery.models.getData
Identity and Access Management (IAM) permission
in order to run ML.PREDICT.
Examples
The following examples assume your model and input table are in your default project.
Predict an outcome
The following example predicts an outcome and returns the following columns:
predicted_labellabelcolumn1column2
SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, ( SELECT label, column1, column2 FROM `mydataset.mytable`))
Compare predictions from two different models
The following example creates two models and then compares their output:
Create the first model:
CREATE MODEL `mydataset.mymodel1` OPTIONS (model_type='linear_reg', input_label_cols=['label'], ) AS SELECT label, input_column1 FROM `mydataset.mytable`
Create the second model:
CREATE MODEL `mydataset.mymodel2` OPTIONS (model_type='linear_reg', input_label_cols=['label'], ) AS SELECT label, input_column2 FROM `mydataset.mytable`
Compare the output of the two models:
SELECT label, predicted_label1, predicted_label AS predicted_label2 FROM ML.PREDICT(MODEL `mydataset.mymodel2`, ( SELECT * EXCEPT (predicted_label), predicted_label AS predicted_label1 FROM ML.PREDICT(MODEL `mydataset.mymodel1`, TABLE `mydataset.mytable`)))
Specify a custom threshold
The following example runs prediction with input data and
a custom threshold of 0.55:
SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, ( SELECT custom_label, column1, column2 FROM `mydataset.mytable`), STRUCT(0.55 AS threshold))
Predict an outcome from structured data with an imported TensorFlow model
The following query predicts outcomes using an imported
TensorFlow model. The input_data table contains inputs in the
schema expected by my_model. See
the CREATE MODEL statement for TensorFlow models
for more information.
SELECT * FROM ML.PREDICT(MODEL `my_project.my_dataset.my_model`, (SELECT * FROM input_data))
Predict an outcome from image data with an imported TensorFlow model
If you are running inference on image data from an
object table, you must use the
ML.DECODE_IMAGE function
to convert image bytes to a multi-dimensional ARRAY representation. You can
use ML.DECODE_IMAGE output directly in an ML.PREDICT function,
or you can write the results from ML.DECODE_IMAGE to a table column and
reference that column when you call ML.PREDICT. You can also pass
ML.DECODE_IMAGE output to another image processing function for
additional preprocessing during either of these procedures.
You can join the object table to standard BigQuery tables to limit the data used in inference, or to provide additional input to the model.
The following examples show different ways you can use the ML.PREDICT
function with image data.
Example 1
The following example uses the ML.DECODE_IMAGE function directly in the
ML.PREDICT function. It returns the inference results for all images in the
object table, for a model with an input field of input and an output
field of feature:
SELECT * FROM ML.PREDICT( MODEL `my_dataset.vision_model`, (SELECT uri, ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data), 480, 480, FALSE) AS input FROM `my_dataset.object_table`) );
Example 2
The following example uses the ML.DECODE_IMAGE function directly in the
ML.PREDICT function, and uses the ML.CONVERT_COLOR_SPACE function in the
ML.PREDICT function to convert
the image color space from RBG to YIQ. It also shows how to
use object table fields to filter the objects included in inference.
It returns the inference results for all JPG images in the
object table, for a model with an input field of input and an output
field of feature:
SELECT * FROM ML.PREDICT( MODEL `my_dataset.vision_model`, (SELECT uri, ML.CONVERT_COLOR_SPACE(ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data), 224, 280, TRUE), 'YIQ') AS input FROM `my_dataset.object_table` WHERE content_type = 'image/jpeg') );
Example 3
The following example uses results from ML.DECODE_IMAGE that have been
written to a table column but not processed any further. It uses
ML.RESIZE_IMAGE and ML.CONVERT_IMAGE_TYPE in the ML.PREDICT function to
process the image data. It returns the inference results for all images in the
decoded images table, for a model with an input field of input and an output
field of feature.
Create the decoded images table:
CREATE OR REPLACE TABLE `my_dataset.decoded_images` AS (SELECT ML.DECODE_IMAGE(data) AS decoded_image FROM `my_dataset.object_table`);
Run inference on the decoded images table:
SELECT * FROM ML.PREDICT( MODEL`my_dataset.vision_model`, (SELECT uri, ML.CONVERT_IMAGE_TYPE(ML.RESIZE_IMAGE(decoded_image, 480, 480, FALSE)) AS input FROM `my_dataset.decoded_images`) );
Example 4
The following example uses results from ML.DECODE_IMAGE that have been
written to a table column and preprocessed using
ML.RESIZE_IMAGE. It returns the inference results for all images in the
decoded images table, for a model with an input field of input and an output
field of feature.
Create the table:
CREATE OR REPLACE TABLE `my_dataset.decoded_images` AS (SELECT ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data) 480, 480, FALSE) AS decoded_image FROM `my_dataset.object_table`);
Run inference on the decoded images table:
SELECT * FROM ML.PREDICT( MODEL `my_dataset.vision_model`, (SELECT uri, decoded_image AS input FROM `my_dataset.decoded_images`) );
Example 5
The following example uses the ML.DECODE_IMAGE function directly in the
ML.PREDICT function. In this example, the model has an output field of
embeddings and two input fields: one that expects an
image, f_img, and one that expects a string, f_txt. The image
input comes from the object table and the string input comes from a
standard BigQuery table that is joined with the object table
by using the uri column.
SELECT * FROM ML.PREDICT( MODEL `my_dataset.mixed_model`, (SELECT uri, ML.RESIZE_IMAGE(ML.DECODE_IMAGE(my_dataset.my_object_table.data), 224, 224, FALSE) AS f_img, my_dataset.image_description.description AS f_txt FROM `my_dataset.object_table` JOIN `my_dataset.image_description` ON object_table.uri = image_description.uri) );
Predict an outcome with a model trained with the TRANSFORM clause
The following example trains a model using the
TRANSFORM clause:
CREATE MODEL `mydataset.mymodel` TRANSFORM(f1 + f2 as c, label) OPTIONS(...) AS SELECT f1, f2, f3, label FROM t;
Because the f3 column doesn't appear in the TRANSFORM clause,
the following prediction query omits that column in the QUERY_STATEMENT:
SELECT * FROM ML.PREDICT( MODEL `mydataset.mymodel`, (SELECT f1, f2 FROM t1));
If f3 is provided in the SELECT statement, it isn't used for calculating
predictions but is instead passed through for use in the rest of the
SQL statement.
Predict dimensionality reduction results (latent space) with an autoencoder model
The following example runs prediction against a previously built autoencoder model, where the input was 4 dimensional (4 input columns) and the dimensionality reduction had 2 dimensions (2 output columns):
SELECT * FROM ML.PREDICT( MODEL `mydataset.mymodel`, (SELECT f1, f2, f3, f4 FROM t1));
What's next
- For more information about model inference, see Model inference overview.
- For more information about supported SQL statements and functions for ML models, see End-to-end user journeys for ML models.