"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Run Spark jobs with DataprocFileOutputCommitter

The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to an output location.

Limitations

The DataprocFileOutputCommitter feature supports Spark jobs that run on Managed Service for Apache Spark on Compute Engine clusters created with the following image versions:

2.1 image versions 2.1.10 and higher
2.0 image versions 2.0.62 and higher
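
The version limits above can be checked before submitting a job. The following is a minimal sketch, not an official tool: `supports_committer` is a hypothetical helper name, and the sketch assumes a `sort` that supports `-V` (version sort), as GNU coreutils does. Image tracks that are not listed above are reported as unsupported only because this page gives no information about them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the given image version meets the
# minimums listed above (2.1.10 for the 2.1 track, 2.0.62 for the 2.0 track).
supports_committer() {
  local version="${1%%-*}"   # drop an OS suffix such as "-debian11"
  local minimum
  case "$version" in
    2.0.*) minimum="2.0.62" ;;
    2.1.*) minimum="2.1.10" ;;
    *) return 1 ;;           # track not covered by the list above
  esac
  # With version sort, the minimum must come first (or equal the version).
  [ "$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n 1)" = "$minimum" ]
}

# Example usage (commented out because it needs gcloud and a real cluster;
# CLUSTER_NAME and REGION are placeholders):
# image="$(gcloud dataproc clusters describe CLUSTER_NAME --region=REGION \
#     --format='value(config.softwareConfig.imageVersion)')"
# supports_committer "$image" && echo "image supports the committer"
```

The helper compares only the numeric part of the image version, so a value such as `2.1.10-debian11` is treated as `2.1.10`.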

Use `DataprocFileOutputCommitter`

To use this feature:

Create a Managed Service for Apache Spark on Compute Engine cluster that uses image version 2.1.10 or higher, or 2.0.62 or higher.

Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.

Google Cloud CLI example:

gcloud dataproc jobs submit spark \
    --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
    --region=REGION \
    other args ...

Code example:

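// Keys set directly on the Hadoop configuration omit the "spark.hadoop." prefix;
// Spark strips that prefix only when it copies properties from the Spark configuration.
sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs","false")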

When you use DataprocFileOutputCommitter, you must set spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false to avoid conflicts between the success marker files (_SUCCESS) created during concurrent writes. You can also set this property in spark-defaults.conf.
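
As an illustration of the spark-defaults.conf option, the sketch below appends the properties to a configuration file. `SPARK_CONF_FILE` is a hypothetical variable introduced here, not a Spark or gcloud setting; point it at the spark-defaults.conf file that your Spark installation reads. Adding the committer factory class alongside the marker property is an assumption of this sketch, and it would apply the committer to every job that reads this file.

```shell
#!/usr/bin/env bash
# SPARK_CONF_FILE is a hypothetical variable for this sketch; it defaults to a
# file in the current directory so that nothing system-wide is modified.
CONF_FILE="${SPARK_CONF_FILE:-./spark-defaults.conf}"

# Append the committer properties. Each line is "key value", separated by whitespace.
cat >> "$CONF_FILE" <<'EOF'
spark.hadoop.mapreduce.outputcommitter.factory.class org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs false
EOF
```

Properties passed at job submission normally take precedence over values in spark-defaults.conf, so a single job can still override these defaults.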
