Use BigLake metastore with Spark and BigQuery using the Iceberg REST catalog
Learn how to use the BigLake metastore Iceberg REST catalog to create and query a BigLake Iceberg table with a Google Cloud Serverless for Apache Spark PySpark batch job, and then query the table from BigQuery.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  Roles required to select or create a project

  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the BigLake and Dataproc APIs.

  Roles required to enable APIs

  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Grant IAM roles
To allow the Spark job to interact with BigLake and BigQuery, grant the required Identity and Access Management (IAM) roles to the Compute Engine default service account.
In the Google Cloud console, click Activate Cloud Shell.
Click Authorize.
Grant the Dataproc Worker role to the Compute Engine default service account, which Dataproc uses by default.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:$(gcloud projects describe PROJECT_ID --format='value(projectNumber)')-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"

Grant the Service Usage Consumer role to the Compute Engine default service account.

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:$(gcloud projects describe PROJECT_ID --format='value(projectNumber)')-compute@developer.gserviceaccount.com" \
    --role="roles/serviceusage.serviceUsageConsumer"

Grant the BigQuery Data Editor role to the Compute Engine default service account.

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:$(gcloud projects describe PROJECT_ID --format='value(projectNumber)')-compute@developer.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"

Replace the following:

PROJECT_ID: your Google Cloud project ID.
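The commands above use gcloud to look up your project number and build the Compute Engine default service account email from it. If you prefer to do the same lookup from Python, the following sketch uses the google-cloud-resource-manager client library, which is an extra dependency that this tutorial doesn't otherwise require; the PROJECT_ID value is a placeholder.

# A minimal sketch, assuming the google-cloud-resource-manager package is
# installed, that looks up the project number and builds the Compute Engine
# default service account email.
from google.cloud import resourcemanager_v3

PROJECT_ID = "your-project-id"  # placeholder: replace with your project ID

client = resourcemanager_v3.ProjectsClient()
project = client.get_project(name=f"projects/{PROJECT_ID}")

# project.name has the form "projects/PROJECT_NUMBER".
project_number = project.name.split("/")[1]
service_account = f"{project_number}-compute@developer.gserviceaccount.com"
print(service_account)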
Create a BigLake catalog
Create a BigLake catalog to manage metadata for your Iceberg tables. You connect to this catalog in your Spark job.
In the Google Cloud console, go to BigLake.
Click Create catalog.
The Create catalog page opens.
For Select a Cloud Storage bucket, click Browse, and then click Create new bucket.
Enter a unique name for your bucket. Remember this name; you use it in the BUCKET_NAME variable later in this tutorial.

The console creates your bucket.
From the bucket list, select your bucket and click Select.
Your catalog ID inherits the name of your bucket. Remember this name; you use it in the CATALOG_ID variable later in this tutorial.

For Authentication method, select Credential vending mode.
Click Create.
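The rest of this tutorial connects to this catalog from Spark, but the same Iceberg REST endpoint can also be reached from other Iceberg clients. As a rough illustration only, the following sketch assumes the pyiceberg and google-auth packages are installed and uses an OAuth access token from Application Default Credentials to list the catalog's namespaces. The catalog name passed to load_catalog is just a local identifier for the client, and PROJECT_ID is a placeholder.

# A minimal sketch, assuming the pyiceberg and google-auth packages are
# installed, that connects to the BigLake Iceberg REST catalog endpoint with
# an OAuth 2.0 access token from Application Default Credentials.
import google.auth
from google.auth.transport.requests import Request
from pyiceberg.catalog import load_catalog

PROJECT_ID = "your-project-id"  # placeholder: replace with your project ID

# Obtain a short-lived access token for the REST catalog requests.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(Request())

catalog = load_catalog(
    "quickstart-catalog",  # local name used by the PyIceberg client
    **{
        "type": "rest",
        "uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        "warehouse": f"bq://projects/{PROJECT_ID}",
        "token": credentials.token,
        "header.x-goog-user-project": PROJECT_ID,
    },
)

# List the namespaces that exist in the catalog.
print(catalog.list_namespaces())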
Create and run a Spark job
To create and query an Iceberg table, first create a PySpark job with the necessary Spark SQL statements. Then run the job with Serverless for Apache Spark.
Create a namespace, a table, and insert data into the table
In a text editor, create a file named quickstart.py with the following
content. This script creates a namespace and a table, and then inserts data into the table.
Replace BUCKET_NAME with the name of your
Cloud Storage bucket.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("quickstart").getOrCreate()
# Create a namespace, which corresponds to a BigQuery dataset.
# The LOCATION must be a Cloud Storage path. Use the name of the bucket
# (BUCKET_NAME) that you created in the previous step.
spark.sql("CREATE NAMESPACE IF NOT EXISTS `quickstart-catalog`.quickstart_namespace LOCATION 'gs://BUCKET_NAME/quickstart_namespace'")
# Create a table within the namespace.
# The table's data is stored in Cloud Storage under the namespace's location.
spark.sql("""
CREATE TABLE `quickstart-catalog`.quickstart_namespace.quickstart_table (
id INT,
name STRING
)
USING iceberg
""")
# Insert data into the table.
spark.sql("""
INSERT INTO `quickstart-catalog`.quickstart_namespace.quickstart_table
VALUES (1, 'one'), (2, 'two'), (3, 'three')
""")
# Query the table and show the results.
df = spark.sql("SELECT * FROM `quickstart-catalog`.quickstart_namespace.quickstart_table")
df.show()
Upload the script to your Cloud Storage bucket
After you create the quickstart.py script, upload it to the Cloud Storage
bucket.
In the Google Cloud console, go to Cloud Storage buckets.
Click the name of your bucket.
On the Objects tab, click Upload > Upload files.
In the file browser, select the quickstart.py file, and then click Open.
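If you prefer to upload the script programmatically instead of through the console, the following sketch assumes the google-cloud-storage client library is installed; BUCKET_NAME is a placeholder for the bucket you created earlier.

# A minimal sketch, assuming the google-cloud-storage package is installed,
# that uploads quickstart.py to the bucket created earlier in this tutorial.
from google.cloud import storage

BUCKET_NAME = "your-bucket-name"  # placeholder: replace with your bucket name

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Upload the local quickstart.py file as an object named quickstart.py.
blob = bucket.blob("quickstart.py")
blob.upload_from_filename("quickstart.py")

print(f"Uploaded to gs://{BUCKET_NAME}/{blob.name}")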
Run the Spark job
After you upload the quickstart.py script, run it as a Serverless for Apache Spark batch job.
In Cloud Shell, run the following Serverless for Apache Spark batch job using the quickstart.py script. This job creates a namespace and a table, inserts data, and displays the table contents in the job output. This command configures the Iceberg REST catalog using BigQuery catalog federation.

gcloud dataproc batches submit pyspark gs://BUCKET_NAME/quickstart.py \
    --project=PROJECT_ID \
    --region=REGION \
    --version=2.2 \
    --properties="spark.sql.defaultCatalog=quickstart-catalog,spark.sql.catalog.quickstart-catalog=org.apache.iceberg.spark.SparkCatalog,spark.sql.catalog.quickstart-catalog.type=rest,spark.sql.catalog.quickstart-catalog.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog,spark.sql.catalog.quickstart-catalog.warehouse=bq://projects/PROJECT_ID,spark.sql.catalog.quickstart-catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO,spark.sql.catalog.quickstart-catalog.header.x-goog-user-project=PROJECT_ID,spark.sql.catalog.quickstart-catalog.rest.auth.type=org.apache.iceberg.gcp.auth.GoogleAuthManager,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,spark.sql.catalog.quickstart-catalog.rest-metrics-reporting-enabled=false,spark.sql.catalog.quickstart-catalog.header.X-Iceberg-Access-Delegation=vended-credentials"

Replace the following:

BUCKET_NAME: the name of the Cloud Storage bucket that contains your PySpark application file.
PROJECT_ID: your Google Cloud project ID.
REGION: the region to run the Dataproc batch workload in.
When the job completes, it displays output similar to the following:

Batch [cb9d84e9489d408baca4f9e7ab4c64ff] finished.
metadata:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata
  batch: projects/your-project/locations/us-central1/batches/cb9d84e9489d408baca4f9e7ab4c64ff
  batchUuid: 54b0b9d2-f0a1-4fdf-ae44-eead3f8e60e9
  createTime: '2026-01-24T00:10:50.224097Z'
  description: Batch
  labels:
    goog-dataproc-batch-id: cb9d84e9489d408baca4f9e7ab4c64ff
    goog-dataproc-batch-uuid: 54b0b9d2-f0a1-4fdf-ae44-eead3f8e60e9
    goog-dataproc-drz-resource-uuid: batch-54b0b9d2-f0a1-4fdf-ae44-eead3f8e60e9
    goog-dataproc-location: us-central1
  operationType: BATCH
name: projects/your-project/regions/us-central1/operations/32287926-5f61-3572-b54a-fbad8940d6ef
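The gcloud command is the path this tutorial takes, but you can submit an equivalent batch programmatically. The following sketch assumes the google-cloud-dataproc client library is installed and mirrors the same runtime version and Spark properties; PROJECT_ID, REGION, and BUCKET_NAME are placeholders.

# A minimal sketch, assuming the google-cloud-dataproc package is installed,
# that submits the same PySpark batch workload programmatically.
from google.cloud import dataproc_v1

PROJECT_ID = "your-project-id"    # placeholder
REGION = "us-central1"            # placeholder: region for the batch workload
BUCKET_NAME = "your-bucket-name"  # placeholder

catalog = "spark.sql.catalog.quickstart-catalog"
properties = {
    "spark.sql.defaultCatalog": "quickstart-catalog",
    catalog: "org.apache.iceberg.spark.SparkCatalog",
    f"{catalog}.type": "rest",
    f"{catalog}.uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
    f"{catalog}.warehouse": f"bq://projects/{PROJECT_ID}",
    f"{catalog}.io-impl": "org.apache.iceberg.gcp.gcs.GCSFileIO",
    f"{catalog}.header.x-goog-user-project": PROJECT_ID,
    f"{catalog}.rest.auth.type": "org.apache.iceberg.gcp.auth.GoogleAuthManager",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    f"{catalog}.rest-metrics-reporting-enabled": "false",
    f"{catalog}.header.X-Iceberg-Access-Delegation": "vended-credentials",
}

# The Batch Controller client must use the regional Dataproc endpoint.
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri=f"gs://{BUCKET_NAME}/quickstart.py"
    ),
    runtime_config=dataproc_v1.RuntimeConfig(version="2.2", properties=properties),
)

operation = client.create_batch(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}", batch=batch
)
print(operation.result())  # blocks until the batch workload finishes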
Query the table from BigQuery
Because you configured the catalog with BigQuery catalog federation, you can query the table directly from BigQuery.
In the Google Cloud console, go to BigQuery.
In the query editor, enter the following statement:
SELECT * FROM `quickstart_namespace.quickstart_table`;

Click Run.
The query results show the data that you inserted with the Spark job.
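You can also run the same query outside the console. The following sketch assumes the google-cloud-bigquery client library is installed and that your default credentials and project are the ones used in this tutorial.

# A minimal sketch, assuming the google-cloud-bigquery package is installed,
# that runs the same query against the table created by the Spark job.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

query = "SELECT * FROM `quickstart_namespace.quickstart_table`"
for row in client.query(query).result():
    print(row.id, row.name)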
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
In the Google Cloud console, go to BigQuery.
In the query editor, run the following statement to delete the table:
DROP TABLE `quickstart_namespace.quickstart_table`;

Go to BigLake.

Select the quickstart-catalog catalog, and then click Delete.

Go to Cloud Storage Buckets.

Select your bucket and click Delete. (A scripted alternative for deleting the bucket is sketched after these steps.)
If you created a custom firewall rule and subnet for the Spark batch workload, delete them in Cloud Shell.

Delete the firewall rule:

gcloud compute firewall-rules delete allow-s8s-egress --quiet

Delete the subnet. Replace REGION with the region you used to create the subnet.

gcloud compute networks subnets delete spark-s8s --region=REGION --quiet
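As referenced in the bucket deletion step, you can also remove the bucket and its objects from a script instead of the console. The following sketch assumes the google-cloud-storage client library is installed; BUCKET_NAME is a placeholder.

# A minimal sketch, assuming the google-cloud-storage package is installed,
# that deletes the tutorial bucket and the objects it contains.
from google.cloud import storage

BUCKET_NAME = "your-bucket-name"  # placeholder: the bucket created for this tutorial

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# force=True deletes the bucket's objects first; it only works for buckets
# with a small number of objects, which is the case in this tutorial.
bucket.delete(force=True)
print(f"Deleted gs://{BUCKET_NAME}")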
What's next
- Learn how to manage catalogs.
- Learn about BigLake tables for Apache Iceberg.