You can use a Dataflow job builder blueprint to add existing Apache Parquet files from cloud-based storage (Cloud Storage or Amazon S3) to an Apache Iceberg table in Lakehouse.
This process uses the IcebergAddFiles transform. If your Parquet files are in Cloud Storage, the transform registers the files with Lakehouse without moving or rewriting the underlying data. If your files are in an external storage system such as Amazon S3, they are first copied to Cloud Storage for faster querying through Lakehouse, and then registered.
Use the following information to add Parquet files from cloud-based storage to an Apache Iceberg table in Lakehouse.
Before you begin
Enable the Dataflow, BigQuery, and Lakehouse APIs.
To get the permissions that you need to create the resources, ask your administrator to grant you the required Identity and Access Management (IAM) roles on your project.
Create a Google Cloud Lakehouse catalog, namespace, and table to import data into.
Create a cloud-based storage bucket (Cloud Storage or Amazon S3) and upload your Parquet files to the bucket.
If the cloud-based storage bucket you're using isn't a Cloud Storage bucket, also create a Cloud Storage bucket to store your job error logs.
Support and limitations
Importing Parquet files in cloud-based storage to Google Cloud Lakehouse using Dataflow has the following limitations:
- The source data must be in Apache Parquet format and stored in Cloud Storage or Amazon S3.
- This feature supports only batch pipelines.
Import Parquet files to Lakehouse
Use the following steps to import Parquet files from cloud-based storage to an Iceberg table in Lakehouse using the Dataflow job builder UI.
In the Google Cloud console, go to the Google Cloud Lakehouse page.
Select the catalog, namespace, and table that you want to import data into.
On the Table details page, click Import table.
In the Import configuration dialog, select Import Apache Parquet files into Lakehouse (Batch).
The Dataflow Job builder page opens.
In the Sources section:
Open the CreateGlobalInput source entry that is already created.
In the YAML source configuration editor, enter one or more paths to your Parquet files in the `elements` sequence. To improve import efficiency, specify multiple sets of files (globs) when you're registering a large number of files. For example:

```yaml
reshuffle: true
elements:
  - gs://BUCKET_NAME/restaurant-data/2023/*.parquet
  - gs://BUCKET_NAME/restaurant-data/2024/*.parquet
```

Click Done.
In the Transforms section:
Click the IcebergAddFiles transform section to open it.
In the Iceberg table field, enter the namespace and table name. For example: NAMESPACE.TABLE_NAME.
Under Catalog properties, configure the following items:
- warehouse: The Cloud Storage location of your catalog. For example, gs://CATALOG_PATH.
- header.x-goog-user-project: Your Google Cloud project ID. For example, PROJECT_ID.
Click Done.
In the Sinks section:
Click the Write results sink to open it.
In the JSON location field, specify the Cloud Storage location and filename to write error results to. For example: gs://BUCKET_NAME/errors/errors.json

Click Done.
In the Dataflow Options section, click Run job.
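If you switch to the YAML editor, the assembled pipeline resembles the following. This is a hedged sketch only: the IcebergAddFiles and WriteToJson field names shown here are assumptions pieced together from the steps above, not a verified schema.

```yaml
# Hypothetical sketch of the pipeline that the job builder assembles from
# the steps above. Transform and field names are assumptions.
pipeline:
  transforms:
    - type: Create
      name: CreateGlobalInput
      config:
        reshuffle: true
        elements:
          - gs://BUCKET_NAME/restaurant-data/2023/*.parquet
          - gs://BUCKET_NAME/restaurant-data/2024/*.parquet
    - type: IcebergAddFiles
      name: AddFiles
      input: CreateGlobalInput
      config:
        table: NAMESPACE.TABLE_NAME
        catalog_properties:
          warehouse: gs://CATALOG_PATH
          "header.x-goog-user-project": PROJECT_ID
    - type: WriteToJson
      name: WriteResults
      input: AddFiles
      config:
        path: gs://BUCKET_NAME/errors/errors.json
```

Substitute your own bucket, catalog, namespace, and project values before running.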
If you need to further customize the Dataflow pipeline used to register Parquet files, you can do that using the job builder form or the YAML editor.
Examine the job output
After the job completes, you can verify that the data was registered with the Iceberg table by querying it in BigQuery.
In the Dataflow job list, check that the job status is Succeeded.
If the job fails or has errors, check the JSON error log file in Cloud Storage for details.
In the Google Cloud console, go to the BigQuery Studio page.
In the query editor, enter a SQL query to inspect the table. You can use the PROJECT_ID.CATALOG>NAMESPACE.TABLE_NAME convention to query. For example:

```sql
SELECT * FROM `PROJECT_ID.CATALOG>NAMESPACE.TABLE_NAME` LIMIT 10
```

Click Run.
Review the Query results to ensure the data was processed correctly.
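As an additional sanity check, a row count can confirm that the expected number of records is visible, using the same table-naming convention as above:

```sql
SELECT COUNT(*) AS row_count
FROM `PROJECT_ID.CATALOG>NAMESPACE.TABLE_NAME`;
```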
What's next
- Learn more in About Lakehouse runtime catalog.
- Learn more in the Dataflow Job builder UI overview.