Use BigLake metastore with Spark and BigQuery by using the Iceberg REST catalog and BigQuery catalog federation

Learn how to create a BigLake Iceberg table by running a Google Cloud Serverless for Apache Spark PySpark job that connects to a BigLake metastore catalog using the Iceberg REST catalog.

The table is created with BigQuery catalog federation, which lets you query it directly from BigQuery without needing to create a separate connection or external table definition.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery and Dataproc APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs
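
    Alternatively, you can enable these APIs from the command line. The following is a minimal sketch, assuming the gcloud CLI is authenticated and configured for your project:

    gcloud services enable bigquery.googleapis.com dataproc.googleapis.com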

To get the permissions that you need to use the Apache Iceberg REST catalog in BigLake metastore, ask your administrator to grant you the following IAM roles:

  • Perform administrative tasks, such as managing catalog user access and storage access:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Grant IAM roles

To allow the Spark job to interact with BigLake and BigQuery, grant the required IAM roles to the Compute Engine default service account.

  1. In the Google Cloud console, click Activate Cloud Shell.

    Activate Cloud Shell

  2. Click Authorize.

  3. Grant the Dataproc Worker role to the Compute Engine default service account, which Dataproc uses by default.

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:$(gcloud projects describe PROJECT_ID --format='value(projectNumber)')-compute@developer.gserviceaccount.com" \
        --role="roles/dataproc.worker"
  4. Grant the Service Usage Consumer role to the Compute Engine default service account.

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:$(gcloud projects describe PROJECT_ID --format='value(projectNumber)')-compute@developer.gserviceaccount.com" \
        --role="roles/serviceusage.serviceUsageConsumer"
  5. Grant the BigQuery Data Editor role to the Compute Engine default service account.

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:$(gcloud projects describe PROJECT_ID --format='value(projectNumber)')-compute@developer.gserviceaccount.com" \
        --role="roles/bigquery.dataEditor"
  6. Grant the BigQuery User role to your user account. This lets you run query jobs from the BigQuery console.

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:USER_EMAIL" \
        --role="roles/bigquery.user"
  7. Grant the BigQuery Data Viewer role to your user account. This lets you read data from the BigQuery table.

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:USER_EMAIL" \
        --role="roles/bigquery.dataViewer"

    Replace the following:

    • PROJECT_ID: Your Google Cloud project ID.
    • USER_EMAIL: The email address of your Google Cloud user account.
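
To confirm the bindings before you run the Spark job, you can optionally inspect the project's IAM policy. The following is a minimal sketch, where PROJECT_NUMBER is your project number (shown in the console, or returned by gcloud projects describe PROJECT_ID):

    gcloud projects get-iam-policy PROJECT_ID \
        --flatten="bindings[].members" \
        --format="table(bindings.role)" \
        --filter="bindings.members:PROJECT_NUMBER-compute@developer.gserviceaccount.com"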

Create a Cloud Storage bucket

If you don't have a Cloud Storage bucket to use for this quickstart, create one. BigQuery catalog federation doesn't require a BigLake catalog resource, but you need a bucket to store the Iceberg namespace data and the PySpark script.

  1. In the Google Cloud console, go to the Cloud Storage Buckets page.

    Go to Buckets

  2. Click Create.

  3. On the Create a bucket page, enter a unique name for your bucket.

  4. Click Create.

Make note of this bucket name to replace BUCKET_NAME in the following steps.
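
Alternatively, you can create the bucket from Cloud Shell instead of the console. The following is a minimal sketch, assuming you use the same REGION that you plan to use for the Spark job:

    gcloud storage buckets create gs://BUCKET_NAME --location=REGION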

Create and run a Spark job

To create and query an Iceberg table, first create a PySpark job with the necessary Spark SQL statements. Then run the job with Serverless for Apache Spark.

Create a namespace, a table, and insert data into the table

In a text editor, create a file named quickstart.py with the following content. This script creates a namespace, a table, and inserts data into the table.

Replace BUCKET_NAME with the name of your Cloud Storage bucket and REGION with the region where you will run the Spark job (for example, us-central1).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quickstart").getOrCreate()

# Create a namespace, which corresponds to a BigQuery dataset.
# The LOCATION must be a Cloud Storage path in the bucket that you
# created in the previous step (BUCKET_NAME).
spark.sql("CREATE NAMESPACE IF NOT EXISTS `quickstart-catalog`.quickstart_namespace LOCATION 'gs://BUCKET_NAME/quickstart_namespace' WITH DBPROPERTIES ('gcp-region' = 'REGION')")

# Create a table within the namespace.
# The table's data is stored in Cloud Storage under the namespace's location.
spark.sql("""
CREATE TABLE IF NOT EXISTS `quickstart-catalog`.quickstart_namespace.quickstart_table (
    id INT,
    name STRING
)
USING iceberg
""")

# Insert data into the table.
spark.sql("""
INSERT INTO `quickstart-catalog`.quickstart_namespace.quickstart_table
VALUES (1, 'one'), (2, 'two'), (3, 'three')
""")

# Query the table and show the results.
df = spark.sql("SELECT * FROM `quickstart-catalog`.quickstart_namespace.quickstart_table")
df.show()

Upload the script to your Cloud Storage bucket

After you create the quickstart.py script, upload it to the Cloud Storage bucket.

  1. In the Google Cloud console, go to Cloud Storage Buckets.

    Go to Buckets

  2. Click the name of your bucket.

  3. On the Objects tab, click Upload > Upload files.

  4. In the file browser, select the quickstart.py file, and then click Open.
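
Alternatively, if you created quickstart.py in Cloud Shell, you can copy it to the bucket from the command line. The following is a minimal sketch:

    gcloud storage cp quickstart.py gs://BUCKET_NAME/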

Run the Spark job

After you upload the quickstart.py script, run it as a Serverless for Apache Spark batch job.

This job creates a namespace and a table, inserts data, and displays the table contents in the job output. This quickstart uses BigQuery catalog federation, which is enabled by setting spark.sql.catalog.quickstart-catalog.warehouse=bq://projects/PROJECT_ID. Credential vending is not supported or required for catalog federation.

  1. In Cloud Shell, run the following Serverless for Apache Spark batch job using the quickstart.py script.

    gcloud dataproc batches submit pyspark gs://BUCKET_NAME/quickstart.py \
        --project=PROJECT_ID \
        --region=REGION \
        --version=2.2 \
        --properties="\
    spark.sql.defaultCatalog=quickstart-catalog,\
    spark.sql.catalog.quickstart-catalog=org.apache.iceberg.spark.SparkCatalog,\
    spark.sql.catalog.quickstart-catalog.type=rest,\
    spark.sql.catalog.quickstart-catalog.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,\
    spark.sql.catalog.quickstart-catalog.warehouse=bq://projects/PROJECT_ID,\
    spark.sql.catalog.quickstart-catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO,\
    spark.sql.catalog.quickstart-catalog.header.x-goog-user-project=PROJECT_ID,\
    spark.sql.catalog.quickstart-catalog.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager,\
    spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,\
    spark.sql.catalog.quickstart-catalog.rest-metrics-reporting-enabled=false"
    

    Replace the following:

    • BUCKET_NAME: the name of the Cloud Storage bucket that contains your PySpark application file.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the region to run the Dataproc batch workload in.

    When the job completes, it displays an output similar to the following:

    Batch [cb9d84e9489d408baca4f9e7ab4c64ff] finished.
    metadata:
    '@type': type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata
    batch: projects/your-project/locations/us-central1/batches/cb9d84e9489d408baca4f9e7ab4c64ff
    batchUuid: 54b0b9d2-f0a1-4fdf-ae44-eead3f8e60e9
    createTime: '2026-01-24T00:10:50.224097Z'
    description: Batch
    labels:
        goog-dataproc-batch-id: cb9d84e9489d408baca4f9e7ab4c64ff
        goog-dataproc-batch-uuid: 54b0b9d2-f0a1-4fdf-ae44-eead3f8e60e9
        goog-dataproc-drz-resource-uuid: batch-54b0b9d2-f0a1-4fdf-ae44-eead3f8e60e9
        goog-dataproc-location: us-central1
    operationType: BATCH
    name: projects/your-project/regions/us-central1/operations/32287926-5f61-3572-b54a-fbad8940d6ef
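
    You can also check the status of the batch workload from Cloud Shell. The following is a minimal sketch, where BATCH_ID is the batch identifier shown in the job output:

    gcloud dataproc batches list --project=PROJECT_ID --region=REGION
    gcloud dataproc batches describe BATCH_ID --project=PROJECT_ID --region=REGION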
    

Query the table from BigQuery

Because you configured the catalog with BigQuery catalog federation, you can query the table directly from BigQuery.

  1. In the Google Cloud console, go to BigQuery.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    SELECT * FROM `quickstart_namespace.quickstart_table`;
    
  3. Click Run.

    The query results show the data that you inserted with the Spark job.
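
If you prefer the command line, you can run the same query with the bq command-line tool from Cloud Shell. The following is a minimal sketch, assuming your default project is set to PROJECT_ID:

    bq query --use_legacy_sql=false \
        'SELECT * FROM `quickstart_namespace.quickstart_table`;'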

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

  1. In the Google Cloud console, go to BigQuery.

    Go to BigQuery

  2. In the query editor, run the following statements to delete the table and the namespace, which corresponds to a BigQuery dataset:

    DROP TABLE `quickstart_namespace.quickstart_table`;
    DROP SCHEMA `quickstart_namespace`;
    
  3. In the Google Cloud console, go to the Cloud Storage Buckets page.

    Go to Buckets

  4. Select your bucket and click Delete.
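
You can also perform the cleanup from Cloud Shell instead of the console. The following is a minimal sketch, assuming the bucket contains only the resources that you created for this quickstart:

    # Drop the table and the namespace (BigQuery dataset).
    bq query --use_legacy_sql=false 'DROP TABLE `quickstart_namespace.quickstart_table`;'
    bq query --use_legacy_sql=false 'DROP SCHEMA `quickstart_namespace`;'

    # Delete the Cloud Storage bucket and all of its contents.
    gcloud storage rm --recursive gs://BUCKET_NAME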

What's next