Google Cloud Lakehouse hosts high-quality public datasets served through the Apache Iceberg REST Catalog, making them available to the general public as part of the Google Cloud Public Dataset Program.
These datasets are read-only. You can query them and integrate them into your applications using Apache Spark, Trino, Flink, or BigQuery. Google pays for the storage of these datasets and provides public access to the data through Lakehouse. You pay only for the queries that you perform on the data.
The goal of these public datasets is to lower the barrier to entry for Iceberg. You don't need to manage infrastructure to learn Iceberg; you only need to connect. You can use these datasets to:
- Use BigQuery (through Lakehouse) to query these tables directly using SQL, combining them with your private data.
- Test your OSS engine (for example, Spark, Trino, or Flink) configurations against a live REST Catalog.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  - Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Lakehouse API. To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Before you connect with Apache Spark, you must have the following:
- Application Default Credentials (ADC) set up in your environment.
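For local development, one common way to set up ADC is with the gcloud CLI. The following is a setup sketch; the exact flow depends on your environment, and PROJECT_ID is a placeholder for your project ID:

```shell
# Authenticate and write Application Default Credentials
# to the well-known location in your environment.
gcloud auth application-default login

# Optional: set a quota project so requests made with ADC
# are billed against your project.
gcloud auth application-default set-quota-project PROJECT_ID
```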
Public dataset locations
Each public dataset is stored in a specific location, such as US or EU. The Lakehouse public datasets are stored in the US multi-region location. When you query a public dataset, ensure that your processing location is compatible with the dataset location.
Access public datasets using Apache Spark
Because Lakehouse public datasets are served through the Iceberg REST catalog, you can access them from Apache Spark and other compatible engines. You can connect to the public dataset from any standard Spark environment, whether on-premises, Managed Service for Apache Spark, or another cloud vendor.
Connect with Apache Spark
Use the following configuration flags when starting your Spark SQL session. These flags configure a catalog named bqms that points to the public REST endpoint.
spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-gcp-bundle:1.10.0 \
--conf spark.hadoop.hive.cli.print.header=true \
--conf spark.sql.catalog.bqms=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.bqms.type=rest \
--conf spark.sql.catalog.bqms.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog \
--conf spark.sql.catalog.bqms.warehouse=gs://biglake-public-nyc-taxi-iceberg \
--conf spark.sql.catalog.bqms.header.x-goog-user-project=PROJECT_ID \
--conf spark.sql.catalog.bqms.rest.auth.type=google \
--conf spark.sql.catalog.bqms.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO \
--conf spark.sql.catalog.bqms.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.defaultCatalog=bqms
Replace PROJECT_ID with your Google Cloud project ID.
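To confirm that the catalog is reachable, you can run a quick discovery query in the same Spark SQL session. This is a sketch; the namespace and table names follow the examples used on this page:

```sql
-- List namespaces exposed by the REST catalog.
SHOW NAMESPACES IN bqms;

-- List tables in the public_data namespace.
SHOW TABLES IN bqms.public_data;
```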
Example queries
Once connected, you have read-only SQL access to the datasets. The NYC Taxi dataset is available, modeled as an Iceberg table to demonstrate partitioning and metadata capabilities.
The following query aggregates millions of records to find the average fare and trip distance by passenger count. It demonstrates how Iceberg uses partition pruning to scan only the relevant data files, without listing directories:
SELECT
passenger_count,
COUNT(1) AS num_trips,
ROUND(AVG(total_amount), 2) AS avg_fare,
ROUND(AVG(trip_distance), 2) AS avg_distance
FROM
bqms.public_data.nyc_taxicab
WHERE
data_file_year = 2021
AND passenger_count > 0
GROUP BY
passenger_count
ORDER BY
num_trips DESC;
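To see the partition layout that makes this pruning possible, you can inspect Iceberg's built-in metadata tables. The following sketch uses Iceberg's standard partitions metadata table, queried through the same catalog:

```sql
-- Summarize the table's partitions: row and file counts per partition.
SELECT partition, record_count, file_count
FROM bqms.public_data.nyc_taxicab.partitions
ORDER BY record_count DESC;
```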
One of Iceberg's most powerful features is time travel: you can query the table as it existed at a specific point in the past. The following query lets you audit changes by comparing the row count of the current version with a specific snapshot:
-- Compare the row count of the current version vs. a specific snapshot
SELECT
'Current State' AS version,
COUNT(*) AS count
FROM bqms.public_data.nyc_taxicab
UNION ALL
SELECT
'Past State' AS version,
COUNT(*) AS count
FROM bqms.public_data.nyc_taxicab VERSION AS OF 2943559336503196801;
By querying the history metadata table (for example, SELECT * FROM
bqms.public_data.nyc_taxicab.history), you can find snapshot IDs and travel
back to see how the dataset grew over time.
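A sketch of that workflow: first find a snapshot in the history metadata table, then query the table as of that snapshot. SNAPSHOT_ID below is a placeholder for a snapshot ID returned by your own history query:

```sql
-- Find available snapshots and when each became current.
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM bqms.public_data.nyc_taxicab.history
ORDER BY made_current_at;

-- Query the table as it existed at that snapshot.
SELECT COUNT(*) AS past_count
FROM bqms.public_data.nyc_taxicab VERSION AS OF SNAPSHOT_ID;
```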
For more information, see Use Lakehouse runtime catalog with Spark, Iceberg REST Catalog & Cloud Storage.
Available datasets
Lakehouse provides sample tables that you can query as Lakehouse tables.
The biglake-public-nyc-taxi-iceberg dataset includes the following tables in
Apache Iceberg format:
| Name | Description |
|---|---|
| nyc_taxicab | NYC Taxi and Limousine Commission (TLC) Trip Record Data. |
What's next
- Learn more about Lakehouse Iceberg tables.
- Learn more about Lakehouse runtime catalog Iceberg REST Catalog.