Use ML and AI with BigQuery DataFrames
BigQuery DataFrames provides ML and AI capabilities through the
bigframes.ml library.
You can preprocess data, create estimators to train models, create ML pipelines, and split data into training and testing datasets.
Required roles
To get the permissions that you need to complete the tasks in this document, ask your administrator to grant you the following IAM roles on your project:
- Use remote models or AI functionalities: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
- Use BigQuery DataFrames in a BigQuery notebook:
  - BigQuery User (roles/bigquery.user)
  - Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
  - Code Creator (roles/dataform.codeCreator)
- Use default BigQuery connection:
  - BigQuery Data Editor (roles/bigquery.dataEditor)
  - BigQuery Connection Admin (roles/bigquery.connectionAdmin)
  - Cloud Functions Developer (roles/cloudfunctions.developer)
  - Service Account User (roles/iam.serviceAccountUser)
  - Storage Object Viewer (roles/storage.objectViewer)
- Use BigQuery DataFrames ML remote models: BigQuery Connection Admin (roles/bigquery.connectionAdmin)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
ML locations
The bigframes.ml library supports the same locations as
BigQuery ML. BigQuery ML model prediction and other
ML functions are supported in all BigQuery regions. Support for
model training varies by region. For more information, see
BigQuery ML locations.
Preprocess data
Create transformers to prepare data for use in estimators (models) by
using the
bigframes.ml.preprocessing module
and the
bigframes.ml.compose module.
BigQuery DataFrames offers the following transformations:
- To bin continuous data into intervals, use the KBinsDiscretizer class in the bigframes.ml.preprocessing module.
- To normalize the target labels as integer values, use the LabelEncoder class in the bigframes.ml.preprocessing module.
- To scale each feature to the range [-1, 1] by its maximum absolute value, use the MaxAbsScaler class in the bigframes.ml.preprocessing module.
- To standardize features by scaling each feature to the range [0, 1], use the MinMaxScaler class in the bigframes.ml.preprocessing module.
- To standardize features by removing the mean and scaling to unit variance, use the StandardScaler class in the bigframes.ml.preprocessing module.
- To transform categorical values into numeric format, use the OneHotEncoder class in the bigframes.ml.preprocessing module.
- To apply transformers to DataFrames columns, use the ColumnTransformer class in the bigframes.ml.compose module.
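To make the difference between the three scaling transformers concrete, the following plain-Python sketch applies each rule by hand to a small list. It illustrates the arithmetic only and does not call bigframes. The sample values are made up, and the use of the population standard deviation is an assumption for this illustration; the library's exact convention may differ.

```python
from statistics import mean, pstdev

# Made-up sample values for one feature column.
values = [-4.0, 0.0, 2.0, 6.0]

# MaxAbsScaler rule: divide by the largest absolute value, giving [-1, 1].
max_abs = max(abs(v) for v in values)
max_abs_scaled = [v / max_abs for v in values]

# MinMaxScaler rule: shift and rescale so the minimum maps to 0 and the maximum to 1.
lo, hi = min(values), max(values)
min_max_scaled = [(v - lo) / (hi - lo) for v in values]

# StandardScaler rule: subtract the mean, then divide by the standard deviation.
# Assumption: population standard deviation.
mu, sigma = mean(values), pstdev(values)
standardized = [(v - mu) / sigma for v in values]

print(max_abs_scaled)
print(min_max_scaled)
print(standardized)
```

MaxAbsScaler keeps the sign of each value and leaves zero at zero, MinMaxScaler maps the smallest value to 0, and StandardScaler centers the values around 0.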
Train models
You can create estimators to train models in BigQuery DataFrames.
Clustering models
You can create estimators for clustering models by using the
bigframes.ml.cluster module.
To create K-means clustering models, use the
KMeans class. Use these models for data
segmentation, for example, to identify customer segments. K-means is
an unsupervised learning technique, so model training doesn't
require labels or splitting the data into training and evaluation sets.
The following code sample shows how to use the KMeans class from the
bigframes.ml.cluster module to create a k-means clustering model for data segmentation:
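A minimal sketch of such a sample, assuming the public table bigquery-public-data.ml_datasets.penguins and a hypothetical choice of three numeric feature columns. The function name cluster_penguins and the default of three clusters are illustrative assumptions. Running it requires the bigframes package and a configured Google Cloud project.

```python
def cluster_penguins(n_clusters: int = 3):
    """Fit a k-means model on penguin measurements and return cluster assignments."""
    # Imported lazily: these calls need bigframes and Google Cloud credentials.
    import bigframes.pandas as bpd
    from bigframes.ml.cluster import KMeans

    # Load data from BigQuery (assumed public dataset).
    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

    # Assumed feature columns; rows with missing values are dropped.
    features = df[
        ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]
    ].dropna()

    # K-means is unsupervised, so fit() takes features only.
    model = KMeans(n_clusters=n_clusters)
    model.fit(features)

    # predict() returns the input rows with their cluster assignment.
    return model.predict(features)
```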
Decomposition models
You can create estimators for decomposition models by using the
bigframes.ml.decomposition module.
To create principal component analysis (PCA) models, use the PCA
class. Use these models to compute principal components and to perform
a change of basis on the data with them. The PCA class reduces
dimensionality by projecting each data point onto only the first few
principal components, which yields lower-dimensional data while
preserving as much of the data's variation as possible.
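As a rough sketch of the workflow, the following function fits a PCA model and projects the input onto the leading components. The function name reduce_dimensions and the default of two components are assumptions for illustration; features is expected to be a BigQuery DataFrames DataFrame of numeric columns.

```python
def reduce_dimensions(features, n_components: int = 2):
    """Project `features` onto its first `n_components` principal components."""
    # Imported lazily: requires the bigframes package.
    from bigframes.ml.decomposition import PCA

    model = PCA(n_components=n_components)
    model.fit(features)

    # predict() returns the lower-dimensional representation of each row.
    return model.predict(features)
```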
Ensemble models
You can create estimators for ensemble models by using the
bigframes.ml.ensemble module.
- To create random forest classifier models, use the RandomForestClassifier class. Use these models to build an ensemble of decision trees for classification.
- To create random forest regression models, use the RandomForestRegressor class. Use these models to build an ensemble of decision trees for regression.
- To create gradient boosted tree classifier models, use the XGBClassifier class. Use these models to additively build an ensemble of decision trees for classification.
- To create gradient boosted tree regression models, use the XGBRegressor class. Use these models to additively build an ensemble of decision trees for regression.
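The ensemble estimators follow the same fit and predict pattern as the other estimators. The sketch below uses RandomForestClassifier; the function name, its arguments, and the default of 50 trees are illustrative assumptions, and the inputs are expected to be BigQuery DataFrames objects prepared by the caller.

```python
def train_forest_classifier(X_train, y_train, X_test, n_estimators: int = 50):
    """Train a random forest classifier and return predictions for X_test."""
    # Imported lazily: requires the bigframes package.
    from bigframes.ml.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    return model.predict(X_test)
```

Swapping in RandomForestRegressor, XGBClassifier, or XGBRegressor should only require changing the import and the class name.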
Forecasting models
You can create estimators for forecasting models by using the
bigframes.ml.forecasting module.
To create time series forecasting models, use the
ARIMAPlus class.
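A minimal sketch of a forecasting workflow, assuming a DataFrame with one timestamp column and one numeric value column. The function name forecast_series, the column arguments, and the default horizon and confidence level are hypothetical choices for illustration.

```python
def forecast_series(df, time_col: str, value_col: str, horizon: int = 7):
    """Fit an ARIMA_PLUS model on one time series and forecast `horizon` steps."""
    # Imported lazily: requires the bigframes package.
    from bigframes.ml.forecasting import ARIMAPlus

    model = ARIMAPlus()

    # X holds the timestamps and y holds the values to forecast.
    model.fit(df[[time_col]], df[[value_col]])

    # Forecast the next `horizon` time points.
    return model.predict(horizon=horizon, confidence_level=0.9)
```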
Imported models
You can create estimators for imported models by using the
bigframes.ml.imported module.
- To import Open Neural Network Exchange (ONNX) models, use the ONNXModel class.
- To import TensorFlow models, use the TensorFlowModel class.
- To import XGBoost models, use the XGBoostModel class.
Linear models
Create estimators for linear models by using the
bigframes.ml.linear_model module.
- To create linear regression models, use the LinearRegression class. Use these models for forecasting, such as forecasting the sales of an item on a given day.
- To create logistic regression models, use the LogisticRegression class. Use these models for the classification of two or more possible values, such as whether an input is low-value, medium-value, or high-value.
The following code sample shows how to use bigframes.ml to do the
following:
- Load data from BigQuery.
- Clean and prepare training data.
- Create and apply a bigframes.ml.linear_model.LinearRegression regression model.
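A minimal sketch of these three steps, assuming the public table bigquery-public-data.ml_datasets.penguins and predicting body mass for one species. The function name, the species filter, and the chosen feature columns are illustrative assumptions. Running it requires the bigframes package and a configured Google Cloud project.

```python
def train_penguin_mass_model():
    """Train a linear regression on penguin data; return (score, predictions)."""
    # Imported lazily: these calls need bigframes and Google Cloud credentials.
    import bigframes.pandas as bpd
    from bigframes.ml.linear_model import LinearRegression

    # Load data from BigQuery (assumed public dataset).
    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

    # Clean and prepare: keep one species, then drop the species column.
    adelie = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
    adelie = adelie.drop(columns=["species"])

    # Rows without nulls are used for training.
    training = adelie.dropna()
    X = training[
        ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
    ]
    y = training[["body_mass_g"]]

    # Rows with a missing label are the ones to predict.
    to_predict = adelie[adelie.body_mass_g.isnull()]

    # Create, train, evaluate, and apply the model.
    model = LinearRegression()
    model.fit(X, y)
    score = model.score(X, y)
    predictions = model.predict(to_predict)
    return score, predictions
```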
Large language models
You can create estimators for LLMs by using the
bigframes.ml.llm module.
- To create Gemini text generator models, use the GeminiTextGenerator class. Use these models for text generation tasks.
- To create estimators for remote large language models (LLMs), use the bigframes.ml.llm module.
The following code sample shows how to use the GeminiTextGenerator class
from the bigframes.ml.llm module to create a Gemini model for code generation:
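A minimal sketch of such a sample. The helper names build_prompts and generate_code, the prompt prefix, and the token limit are assumptions for illustration. The model name is left as a caller-supplied argument because the set of supported Gemini model names depends on the installed bigframes version.

```python
from typing import List, Optional

# Hypothetical prompt prefix used for every request.
PROMPT_PREFIX = "Generate pandas sample code for "


def build_prompts(api_names: List[str]) -> List[str]:
    """Build one code-generation prompt per API name."""
    return [PROMPT_PREFIX + name for name in api_names]


def generate_code(
    api_names: List[str], model_name: str, connection_name: Optional[str] = None
):
    """Send the prompts to a Gemini model and return the prediction DataFrame."""
    # Imported lazily: these calls need bigframes and Google Cloud credentials.
    import bigframes.pandas as bpd
    from bigframes.ml.llm import GeminiTextGenerator

    # With connection_name=None, the default BigQuery connection is used.
    model = GeminiTextGenerator(model_name=model_name, connection_name=connection_name)

    prompts = bpd.DataFrame({"prompt": build_prompts(api_names)})
    return model.predict(prompts, max_output_tokens=1024)
```

For example, build_prompts(["DataFrame.merge"]) produces the single prompt "Generate pandas sample code for DataFrame.merge".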
Remote models
To use BigQuery DataFrames ML remote models (bigframes.ml.remote
or bigframes.ml.llm), you must enable the following APIs:
When you use BigQuery DataFrames ML remote models, you need the
Project IAM Admin role (roles/resourcemanager.projectIamAdmin)
if you use a default BigQuery connection, or the
Browser role (roles/browser)
if you use a pre-configured connection. You can avoid this requirement by
setting the bigframes.pandas.options.bigquery.skip_bq_connection_check option
to True, in which case the connection (default or pre-configured) is used
as-is without any existence or permission check. If you use the
pre-configured connection and skip the connection check, verify the
following:
- The connection is created in the right location.
- If you use BigQuery DataFrames ML remote models, the service account has the Vertex AI User role (roles/aiplatform.user) on the project.
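The following sketch shows one way to point BigQuery DataFrames at a pre-configured connection and skip the connection check. The helper names and the example identifiers are hypothetical. The sketch assumes that the options are set before the first BigQuery DataFrames operation starts a session.

```python
def connection_path(project: str, location: str, connection_id: str) -> str:
    """Build a connection name of the form PROJECT.LOCATION.CONNECTION_ID."""
    return f"{project}.{location}.{connection_id}"


def use_preconfigured_connection(project: str, location: str, connection_id: str):
    """Use an existing connection as-is, without existence or permission checks."""
    # Imported lazily: requires the bigframes package.
    import bigframes.pandas as bpd

    bpd.options.bigquery.bq_connection = connection_path(
        project, location, connection_id
    )
    # Skipping the check means you are responsible for verifying the
    # connection's location and its service account roles yourself.
    bpd.options.bigquery.skip_bq_connection_check = True
```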
Creating a remote model in BigQuery DataFrames creates a
BigQuery connection.
By default, a connection of the name bigframes-default-connection is used. You
can use a pre-configured BigQuery connection if you prefer,
in which case the connection creation is skipped. The service account
for the default connection is granted the
Vertex AI User role (roles/aiplatform.user) on the project.
Create pipelines
You can create ML pipelines by using the
bigframes.ml.pipeline module.
Pipelines let you assemble several ML steps to be cross-validated together while
setting different parameters. This simplifies your code, and lets you deploy
data preprocessing steps and an estimator together.
To create a pipeline of transforms with a final estimator, use the
Pipeline class.
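A minimal sketch of a pipeline that combines preprocessing with a final estimator. The step names and the column names are assumptions for illustration and must match the columns of your own DataFrame.

```python
def build_pipeline():
    """Return an unfitted pipeline: column preprocessing, then linear regression."""
    # Imported lazily: requires the bigframes package.
    from bigframes.ml.compose import ColumnTransformer
    from bigframes.ml.linear_model import LinearRegression
    from bigframes.ml.pipeline import Pipeline
    from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler

    # Encode categorical columns and standardize numeric columns (assumed names).
    preprocessing = ColumnTransformer(
        [
            ("onehot", OneHotEncoder(), ["island", "sex"]),
            ("scale", StandardScaler(), ["culmen_length_mm", "flipper_length_mm"]),
        ]
    )

    # The last step is the estimator; earlier steps are transformers.
    return Pipeline([("preproc", preprocessing), ("linreg", LinearRegression())])
```

Calling fit(X, y) on the returned pipeline trains the preprocessing steps and the estimator together, and predict(X) applies the same preprocessing before prediction.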
Select models
To split your training and testing datasets and select the best models, use the
bigframes.ml.model_selection module:
- To split the data into training and testing (evaluation) sets, use the train_test_split function, as shown in the following code sample:

      from bigframes.ml.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

- To create multi-fold training and testing sets to train and evaluate models, use the KFold class and the KFold.split method, as shown in the following code sample. This feature is valuable for small datasets.

      from bigframes.ml.model_selection import KFold
      kf = KFold(n_splits=5)
      for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
          # Train and evaluate models with training and testing sets
          ...

- To automatically create multi-fold training and testing sets, train and evaluate the model, and get the result of each fold, use the cross_validate function, as shown in the following code sample:

      from bigframes.ml.model_selection import cross_validate
      scores = cross_validate(model, X, y, cv=5)
What's next
- Learn about the BigQuery DataFrames data type system.
- Learn how to generate BigQuery DataFrames code with Gemini.
- Learn how to analyze package downloads from PyPI with BigQuery DataFrames.
- View BigQuery DataFrames source code, sample notebooks, and samples on GitHub.
- Explore the BigQuery DataFrames API reference.