You can install additional components like Hudi when you create a Managed Service for Apache Spark cluster using the Optional components feature. This page describes how you can optionally install the Hudi component on a Managed Service for Apache Spark cluster.
When installed on a Managed Service for Apache Spark cluster, the Apache Hudi component installs Hudi libraries and configures Spark and Hive in the cluster to work with Hudi.
Compatible Managed Service for Apache Spark image versions
You can install the Hudi component on Managed Service for Apache Spark clusters created with the following Managed Service for Apache Spark image versions:
Hudi-related properties
When you create a Managed Service for Apache Spark cluster with the Hudi component, the following Spark and Hive properties are configured to work with Hudi.
| Config file | Property | Default value |
|---|---|---|
| /etc/spark/conf/spark-defaults.conf | spark.serializer | org.apache.spark.serializer.KryoSerializer |
| | spark.sql.catalog.spark_catalog | org.apache.spark.sql.hudi.catalog.HoodieCatalog |
| | spark.sql.extensions | org.apache.spark.sql.hudi.HoodieSparkSessionExtension |
| | spark.driver.extraClassPath | /usr/lib/hudi/lib/hudi-spark&lt;spark-version&gt;-bundle_&lt;scala-version&gt;-&lt;hudi-version&gt;.jar |
| | spark.executor.extraClassPath | /usr/lib/hudi/lib/hudi-spark&lt;spark-version&gt;-bundle_&lt;scala-version&gt;-&lt;hudi-version&gt;.jar |
| /etc/hive/conf/hive-site.xml | hive.aux.jars.path | file:///usr/lib/hudi/lib/hudi-hadoop-mr-bundle-&lt;version&gt;.jar |
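As a quick way to confirm this configuration, you can read the properties back from a Spark session on the cluster. The following is a minimal sketch, assuming it runs in a PySpark session on a cluster created with the Hudi component:

```python
from pyspark.sql import SparkSession

# On a cluster created with the Hudi component, these properties are
# already set in /etc/spark/conf/spark-defaults.conf, so no extra
# session configuration is needed.
spark = SparkSession.builder.getOrCreate()
for key in (
    "spark.serializer",
    "spark.sql.extensions",
    "spark.sql.catalog.spark_catalog",
):
    print(key, "=", spark.conf.get(key))
```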
Install the component
Install the Hudi component when you create a Managed Service for Apache Spark cluster.
The Managed Service for Apache Spark image release version pages list the Hudi component version included in each Managed Service for Apache Spark image release.
Console
- Enable the component:
  - In the Google Cloud console, open the Managed Service for Apache Spark Create a cluster page. The Set up cluster panel is selected.
  - In the Components section, under Optional components, select the Hudi component.
gcloud command
To create a Managed Service for Apache Spark cluster that includes the Hudi component,
use the `gcloud dataproc clusters create` command with the `--optional-components` flag.
```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --optional-components=HUDI \
    --image-version=DATAPROC_VERSION \
    --properties=PROPERTIES
```
Replace the following:
- CLUSTER_NAME: Required. The new cluster name.
- REGION: Required. The cluster region.
- DATAPROC_VERSION: Optional. You can use this flag to specify a non-default Managed Service for Apache Spark image version (see Default Managed Service for Apache Spark image version).
- PROPERTIES: Optional. You can use this flag to set Hudi component properties, which are specified with the `hudi:` file prefix. Example: `--properties=hudi:hoodie.datasource.write.table.type=COPY_ON_WRITE`.
  - Hudi component version property: You can optionally specify the `dataproc:hudi.version` property. Note: The Hudi component version is set by Managed Service for Apache Spark to be compatible with the Managed Service for Apache Spark cluster image version. If you set this property, cluster creation can fail if the specified version is not compatible with the cluster image.
  - Spark and Hive properties: Managed Service for Apache Spark sets Hudi-related Spark and Hive properties when the cluster is created. You do not need to set them when creating the cluster or submitting jobs.
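For example, the following command creates a cluster with the Hudi component and sets a Hudi write property. The cluster name and region are illustrative:

```
gcloud dataproc clusters create hudi-cluster \
    --region=us-central1 \
    --optional-components=HUDI \
    --properties=hudi:hoodie.datasource.write.table.type=COPY_ON_WRITE
```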
REST API
The Hudi component can be installed through the Managed Service for Apache Spark API using SoftwareConfig.Component as part of a clusters.create request.
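As a sketch, a minimal clusters.create request body that enables the component could look like the following; the cluster name is illustrative:

```json
{
  "clusterName": "hudi-cluster",
  "config": {
    "softwareConfig": {
      "optionalComponents": ["HUDI"]
    }
  }
}
```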
Submit a job to read and write Hudi tables
After creating a cluster with the Hudi component, you can submit Spark and Hive jobs that read and write Hudi tables.
gcloud CLI example:
```
gcloud dataproc jobs submit pyspark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    JOB_FILE \
    -- JOB_ARGS
```
Sample PySpark job
The following PySpark file creates, reads, and writes a Hudi table.
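The following is a minimal sketch of such a pyspark_hudi_example.py, using standard Hudi datasource write options; the schema, record key, and option values shown are illustrative rather than the exact sample file:

```python
"""Minimal PySpark Hudi example: write, then read back, a Hudi table.

Expects two arguments, a table name and a GCS base path, matching the
submit command shown below.
"""
import sys

from pyspark.sql import SparkSession


def main():
    table_name = sys.argv[1]
    table_uri = sys.argv[2]

    # Hudi-related Spark properties are preconfigured on the cluster,
    # so no extra session configuration is needed.
    spark = SparkSession.builder.appName("pyspark-hudi-example").getOrCreate()

    # Create a small DataFrame to write as a copy-on-write Hudi table.
    df = spark.createDataFrame(
        [("id1", "driver1", 10.5), ("id2", "driver2", 20.0)],
        ["uuid", "driver", "fare"],
    )

    hudi_options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "uuid",
        "hoodie.datasource.write.partitionpath.field": "driver",
        "hoodie.datasource.write.precombine.field": "fare",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }
    df.write.format("hudi").options(**hudi_options).mode("overwrite").save(table_uri)

    # Read the table back and display its contents.
    spark.read.format("hudi").load(table_uri).show()

    spark.stop()


if __name__ == "__main__":
    main()
```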
The following gcloud CLI command submits the sample PySpark file to Managed Service for Apache Spark.
```
gcloud dataproc jobs submit pyspark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    gs://BUCKET_NAME/pyspark_hudi_example.py \
    -- TABLE_NAME gs://BUCKET_NAME/TABLE_NAME
```
Use the Hudi CLI
The Hudi CLI is located at /usr/lib/hudi/cli/hudi-cli.sh on the
Managed Service for Apache Spark cluster master node. You can use the Hudi CLI
to view Hudi table schemas, commits, and statistics, and to manually perform
administrative operations, such as scheduling compactions (see
Using hudi-cli).
To start the Hudi CLI and connect to a Hudi table:
- SSH into the master node.
- Run `/usr/lib/hudi/cli/hudi-cli.sh`. The command prompt changes to `hudi->`.
- Run `connect --path gs://my-bucket/my-hudi-table`.
- Run commands, such as `desc`, which describes the table schema, or `commits show`, which shows the commit history.
- To stop the CLI session, run `exit`.
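For example, the following session sketch walks through these steps; the cluster, zone, bucket, and table names are illustrative:

```
# From your workstation, SSH into the cluster master node.
gcloud compute ssh hudi-cluster-m --zone=us-central1-a

# On the master node, start the CLI:
/usr/lib/hudi/cli/hudi-cli.sh

# At the hudi-> prompt, connect to a table, inspect it, and exit:
connect --path gs://my-bucket/my-hudi-table
desc
commits show
exit
```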
What's next
- See the Hudi Quick Start guide.