Use ML and AI with BigQuery DataFrames

BigQuery DataFrames provides ML and AI capabilities through the bigframes.ml library.

You can preprocess data, create estimators to train models in BigQuery DataFrames, create ML pipelines, and split training and testing datasets.

Required roles

To get the permissions that you need to complete the tasks in this document, ask your administrator to grant you the following IAM roles on your project:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

ML locations

The bigframes.ml library supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.

Preprocess data

Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations:

  • To bin continuous data into intervals, use the KBinsDiscretizer class in the bigframes.ml.preprocessing module.

  • To normalize the target labels as integer values, use the LabelEncoder class in the bigframes.ml.preprocessing module.

  • To scale each feature to the range [-1, 1] by its maximum absolute value, use the MaxAbsScaler class in the bigframes.ml.preprocessing module.

  • To scale each feature to the range [0, 1], use the MinMaxScaler class in the bigframes.ml.preprocessing module.

  • To standardize features by removing the mean and scaling to unit variance, use the StandardScaler class in the bigframes.ml.preprocessing module.

  • To transform categorical values into numeric format, use the OneHotEncoder class in the bigframes.ml.preprocessing module.

  • To apply transformers to DataFrames columns, use the ColumnTransformer class in the bigframes.ml.compose module.
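
For example, the following sketch combines two of these transformers with a ColumnTransformer. It uses the penguins public dataset that appears elsewhere on this page; the column choices are illustrative assumptions:

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler
import bigframes.pandas as bpd

# Load data from BigQuery and keep only complete rows
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# One-hot encode the categorical columns and standardize the numeric ones
transformer = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), ["island", "sex"]),
        ("scaler", StandardScaler(), ["culmen_length_mm", "flipper_length_mm"]),
    ]
)

# Fit the transformers and produce the transformed columns
transformer.fit(bq_df)
transformed = transformer.transform(bq_df)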

Train models

You can create estimators to train models in BigQuery DataFrames.

Clustering models

You can create estimators for clustering models by using the bigframes.ml.cluster module. To create K-means clustering models, use the KMeans class. Use these models for data segmentation, such as identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels, and you don't need to split the data for training and evaluation.

The following code sample shows using the bigframes.ml.cluster KMeans class to create a k-means clustering model for data segmentation:

from bigframes.ml.cluster import KMeans
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create the KMeans model
cluster_model = KMeans(n_clusters=10)
cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

# Predict using the model
result = cluster_model.predict(bq_df)
# Score the model
score = cluster_model.score(bq_df)

Decomposition models

You can create estimators for decomposition models by using the bigframes.ml.decomposition module. To create principal component analysis (PCA) models, use the PCA class. Use these models to compute principal components and use them to perform a change of basis on the data. The PCA class provides dimensionality reduction by projecting each data point onto only the first few principal components, producing lower-dimensional data while preserving as much of the data's variation as possible.
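
For example, the following sketch fits a PCA model on the numeric columns of the penguins public dataset used elsewhere on this page; the column choices and the number of components are illustrative assumptions:

from bigframes.ml.decomposition import PCA
import bigframes.pandas as bpd

# Load data from BigQuery and keep only the numeric measurement columns
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
features = bq_df[
    ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]
].dropna()

# Create a PCA model that keeps the first two principal components
pca_model = PCA(n_components=2)
pca_model.fit(features)

# Project the data onto the principal components
components = pca_model.predict(features)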

Ensemble models

You can create estimators for ensemble models by using the bigframes.ml.ensemble module.

  • To create random forest classifier models, use the RandomForestClassifier class. Use these models to build an ensemble of decision trees for classification.

  • To create random forest regression models, use the RandomForestRegressor class. Use these models to build an ensemble of decision trees for regression.

  • To create gradient boosted tree classifier models, use the XGBClassifier class. Use these models to additively build an ensemble of decision trees for classification.

  • To create gradient boosted tree regression models, use the XGBRegressor class. Use these models to additively build an ensemble of decision trees for regression.
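
For example, the following sketch trains a gradient boosted tree regression model on the penguins public dataset; the feature and label columns are illustrative assumptions:

from bigframes.ml.ensemble import XGBRegressor
import bigframes.pandas as bpd

# Load data from BigQuery and drop rows with missing values
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Predict body mass from the other measurements
feature_columns = bq_df[
    ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]
]
label_columns = bq_df[["body_mass_g"]]

# Create and train the gradient boosted tree regression model
model = XGBRegressor()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)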

Forecasting models

You can create estimators for forecasting models by using the bigframes.ml.forecasting module. To create time series forecasting models, use the ARIMAPlus class.
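
For example, the following sketch trains a time series forecasting model; the table name and column names are placeholders, and passing a forecast horizon to predict is an assumption about the API:

from bigframes.ml.forecasting import ARIMAPlus
import bigframes.pandas as bpd

# Load a time series table (placeholder table and column names)
df = bpd.read_gbq("your-project.your_dataset.daily_sales")

# X holds the time points, y holds the observed values
X = df[["date"]]
y = df[["sales"]]

# Create and train the time series forecasting model
model = ARIMAPlus()
model.fit(X, y)

# Forecast future time points (horizon is assumed to be a predict parameter)
forecast = model.predict(horizon=30)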

Imported models

You can create estimators for imported models by using the bigframes.ml.imported module.
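
For example, a minimal sketch that wraps a saved TensorFlow model stored in Cloud Storage, assuming the TensorFlowModel class and its model_path parameter; the Cloud Storage path and table name are placeholders:

from bigframes.ml.imported import TensorFlowModel
import bigframes.pandas as bpd

# Wrap a saved TensorFlow model stored in Cloud Storage (placeholder path)
tf_model = TensorFlowModel(model_path="gs://your-bucket/your-model/*")

# Run prediction on a DataFrame whose columns match the model's inputs
input_df = bpd.read_gbq("your-project.your_dataset.your_table")
predictions = tf_model.predict(input_df)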

Linear models

You can create estimators for linear models by using the bigframes.ml.linear_model module.

  • To create linear regression models, use the LinearRegression class. Use these models for forecasting, such as forecasting the sales of an item on a given day.

  • To create logistic regression models, use the LogisticRegression class. Use these models for the classification of two or more possible values such as whether an input is low-value, medium-value, or high-value.

The following code sample shows using bigframes.ml to load data from BigQuery, prepare and clean the training data, and then create and use a linear regression model to predict penguin body mass:

from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Filter the data down to the Adelie Penguin species
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Specify your feature (or input) columns and the label (or output) column:
feature_columns = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
label_columns = training_data[["body_mass_g"]]

# Use the rows that are missing a body mass value as the data to predict on
test_data = adelie_data[adelie_data.body_mass_g.isnull()]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)

# Predict using the model
result = model.predict(test_data)

Large language models

You can create estimators for remote large language models (LLMs) by using the bigframes.ml.llm module.

  • To create Gemini text generator models, use the GeminiTextGenerator class. Use these models for text generation tasks.

The following code sample shows using the bigframes.ml.llm GeminiTextGenerator class to create a Gemini model for code generation:

from bigframes.ml.llm import GeminiTextGenerator
import bigframes.pandas as bpd

# Create the Gemini LLM model
session = bpd.get_global_session()
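# PROJECT_ID, REGION, and CONN_NAME identify an existing BigQuery connection;
# set these variables before running this sample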
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
model = GeminiTextGenerator(
    session=session, connection_name=connection, model_name="gemini-2.0-flash-001"
)

df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

# Prepare the prompts and send them to the LLM model for prediction
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
df_prompt = df_prompt_prefix + df_api["API"]

# Predict using the model
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)

Remote models

To use BigQuery DataFrames ML remote models (bigframes.ml.remote or bigframes.ml.llm), you must enable the following APIs:

When you use BigQuery DataFrames ML remote models, you need the Project IAM Admin role (roles/resourcemanager.projectIamAdmin) if you use a default BigQuery connection, or the Browser role (roles/browser) if you use a pre-configured connection. You can avoid this requirement by setting the bigframes.pandas.options.bigquery.skip_bq_connection_check option to True, in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you use the pre-configured connection and skip the connection check, verify the following:

  • The connection is created in the right location.
  • If you use BigQuery DataFrames ML remote models, the service account has the Vertex AI User role (roles/aiplatform.user) on the project.
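
For example, the following sketch sets the option described earlier so that the connection is used as-is:

import bigframes.pandas as bpd

# Use the connection as-is, without checking that it exists or has permissions
bpd.options.bigquery.skip_bq_connection_check = True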

Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection of the name bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User role (roles/aiplatform.user) on the project.

Create pipelines

You can create ML pipelines by using the bigframes.ml.pipeline module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and lets you deploy data preprocessing steps and an estimator together.

To create a pipeline of transforms with a final estimator, use the Pipeline class.
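
For example, the following sketch chains a preprocessing step with a linear regression estimator; the column choices from the penguins public dataset are illustrative assumptions:

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler
import bigframes.pandas as bpd

# Load data from BigQuery and keep only complete rows
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Encode the categorical columns and standardize the numeric ones
preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), ["island", "species", "sex"]),
        ("scaler", StandardScaler(), ["culmen_depth_mm", "flipper_length_mm"]),
    ]
)

# Chain the preprocessing step with a linear regression estimator
pipeline = Pipeline(
    [
        ("preprocess", preprocessor),
        ("model", LinearRegression()),
    ]
)

# Train the whole pipeline, then use it to predict body mass
pipeline.fit(bq_df, bq_df[["body_mass_g"]])
predictions = pipeline.predict(bq_df)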

Select models

To split your training and testing datasets and select the best models, use the bigframes.ml.model_selection module:

  • To split the data into training and testing (evaluation) sets, as shown in the following code sample, use the train_test_split function:

    from bigframes.ml.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
  • To create multi-fold training and testing sets to train and evaluate models, as shown in the following code sample, use the KFold class and the KFold.split method. This feature is valuable for small datasets.

    from bigframes.ml.model_selection import KFold

    kf = KFold(n_splits=5)
    for i, (X_train, X_test, y_train, y_test) in enumerate(kf.split(X, y)):
        # Train and evaluate models with the training and testing sets
    
  • To automatically create multi-fold training and testing sets, train and evaluate the model, and get the result of each fold, as shown in the following code sample, use the cross_validate function:

    from bigframes.ml.model_selection import cross_validate

    scores = cross_validate(model, X, y, cv=5)
    

What's next