Lakehouse public datasets

Google Cloud Lakehouse hosts high-quality public datasets served through the Apache Iceberg REST Catalog, making them available to the general public as part of the Google Cloud Public Dataset Program.

These datasets are available for read-only access, and you can access them and integrate them into your applications using Apache Spark, Trino, Flink, or BigQuery. Google pays for the storage of these datasets and provides public access to the data through Lakehouse. You pay only for the queries that you perform on the data.

The goal of these public datasets is to lower the barrier to entry for Iceberg. You don't need to manage infrastructure to learn Iceberg; you only need to connect. You can use these datasets to:

  • Use BigQuery (through Lakehouse) to query these tables directly using SQL, combining them with your private data.
  • Test your OSS engine (for example, Spark, Trino, or Flink) configurations against a live REST Catalog.
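
Before configuring a full engine, you can check that the REST endpoint is reachable. The following sketch builds the request for the catalog's /v1/config endpoint, which is defined by the Iceberg REST Catalog specification. The commented call at the end assumes the requests and google-auth libraries plus Application Default Credentials; my-project is a placeholder project ID.

```python
# Sketch: build a request against the Iceberg REST Catalog config endpoint.
CATALOG_URI = "https://biglake.googleapis.com/iceberg/v1/restcatalog"

def config_request(project_id, warehouse="gs://biglake-public-nyc-taxi-iceberg"):
    """Return the URL, query params, and headers for a /v1/config call."""
    url = f"{CATALOG_URI}/v1/config"
    params = {"warehouse": warehouse}
    headers = {"x-goog-user-project": project_id}
    return url, params, headers

# To actually call the endpoint (requires requests and google-auth):
# import requests
# import google.auth
# from google.auth.transport.requests import Request
# creds, _ = google.auth.default()
# creds.refresh(Request())
# url, params, headers = config_request("my-project")
# headers["Authorization"] = f"Bearer {creds.token}"
# print(requests.get(url, params=params, headers=headers).json())
```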

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Lakehouse API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

Complete the preceding steps before you connect with Apache Spark.

Public dataset locations

Each public dataset is stored in a specific location such as US or EU. The Lakehouse public datasets are stored in the US multi-region location. When you query a public dataset, ensure your processing location is compatible with the dataset location.

Access public datasets using Apache Spark

Because Lakehouse public datasets are served through the Iceberg REST catalog, you can access them from Apache Spark and other compatible engines. You can connect to the public datasets from any standard Spark environment, whether on-premises, Managed Service for Apache Spark, or another cloud vendor.

Connect with Apache Spark

Use the following configuration flags when starting your Spark SQL session. These flags configure a catalog named lakehouse-sample pointing to the public REST endpoint.

spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-gcp-bundle:1.10.0 \
  --conf spark.hadoop.hive.cli.print.header=true \
  --conf spark.sql.catalog.lakehouse-sample=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lakehouse-sample.type=rest \
  --conf spark.sql.catalog.lakehouse-sample.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog \
  --conf spark.sql.catalog.lakehouse-sample.warehouse=gs://biglake-public-nyc-taxi-iceberg \
  --conf spark.sql.catalog.lakehouse-sample.header.x-goog-user-project=PROJECT_ID \
  --conf spark.sql.catalog.lakehouse-sample.rest.auth.type=google \
  --conf spark.sql.catalog.lakehouse-sample.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO \
  --conf spark.sql.catalog.lakehouse-sample.header.X-Iceberg-Access-Delegation=vended-credentials \
  --conf spark.sql.defaultCatalog=lakehouse-sample

Replace PROJECT_ID with your Google Cloud project ID.
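
If you prefer PySpark over the spark-sql shell, the same catalog settings can be expressed as a configuration dict. The following sketch assumes the catalog name lakehouse-sample and the public endpoint and warehouse shown on this page; my-project is a placeholder project ID, and the commented usage assumes pyspark and the Iceberg packages are available.

```python
# Sketch: the Lakehouse public catalog settings as a Spark conf dict.
def lakehouse_catalog_confs(project_id, catalog="lakehouse-sample"):
    """Build the Spark SQL configuration for the public Iceberg REST catalog."""
    p = f"spark.sql.catalog.{catalog}"
    return {
        p: "org.apache.iceberg.spark.SparkCatalog",
        f"{p}.type": "rest",
        f"{p}.uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        f"{p}.warehouse": "gs://biglake-public-nyc-taxi-iceberg",
        f"{p}.header.x-goog-user-project": project_id,
        f"{p}.rest.auth.type": "google",
        f"{p}.io-impl": "org.apache.iceberg.gcp.gcs.GCSFileIO",
        f"{p}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        "spark.sql.defaultCatalog": catalog,
    }

# Usage with PySpark (assumes pyspark and the Iceberg runtime packages):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("lakehouse-public")
# for key, value in lakehouse_catalog_confs("my-project").items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```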

Example queries

Once connected, you have read-only SQL access to the datasets. The NYC Taxi dataset is modeled as an Iceberg table to demonstrate partitioning and metadata capabilities.

The following query aggregates millions of records to find the average fare and trip distance by passenger count. It demonstrates how Iceberg uses partition pruning to scan only the relevant data files, without listing directories:

SELECT
    passenger_count,
    COUNT(1) AS num_trips,
    ROUND(AVG(total_amount), 2) AS avg_fare,
    ROUND(AVG(trip_distance), 2) AS avg_distance
FROM
    `lakehouse-sample`.public_data.nyc_taxicab
WHERE
    data_file_year = 2021
    AND passenger_count > 0
GROUP BY
    passenger_count
ORDER BY
    num_trips DESC;

One of Iceberg's most powerful features is Time Travel. You can query the table as it existed at a specific point in the past. The following query lets you audit changes by comparing the row count of the current version against a specific snapshot:

-- Compare the row count of the current version vs. a specific snapshot
SELECT
    'Current State' AS version,
    COUNT(*) AS count
FROM `lakehouse-sample`.public_data.nyc_taxicab
UNION ALL
SELECT
    'Past State' AS version,
    COUNT(*) AS count
FROM `lakehouse-sample`.public_data.nyc_taxicab VERSION AS OF 2943559336503196801;

By querying the history metadata table (for example, SELECT * FROM `lakehouse-sample`.public_data.nyc_taxicab.history), you can find snapshot IDs and travel back to see how the dataset grew over time.
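
The history lookup can be sketched in Python: given rows shaped like the history metadata table's (made_current_at, snapshot_id) columns, find the snapshot that was current at a chosen time, then use it in a VERSION AS OF query. The rows below are placeholder values, not real snapshots from the public dataset.

```python
from datetime import datetime, timezone

def snapshot_as_of(history, when):
    """Return the snapshot_id that was current at `when`, or None if none was."""
    current = None
    for made_current_at, snapshot_id in sorted(history):
        if made_current_at <= when:
            current = snapshot_id
        else:
            break
    return current

# Illustrative (made_current_at, snapshot_id) rows; placeholder values only.
history = [
    (datetime(2024, 1, 1, tzinfo=timezone.utc), 1111111111111111111),
    (datetime(2024, 6, 1, tzinfo=timezone.utc), 2222222222222222222),
]
snapshot_as_of(history, datetime(2024, 3, 1, tzinfo=timezone.utc))
```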

For more information, see Use Lakehouse runtime catalog with Spark, Iceberg REST Catalog & Cloud Storage.

Available datasets

Lakehouse provides sample tables that you can query as Lakehouse tables.

The biglake-public-nyc-taxi-iceberg dataset includes the following tables in Apache Iceberg format:

Name          Description
nyc_taxicab   NYC Taxi and Limousine Commission (TLC) Trip Record Data.

What's next