The DataprocFileOutputCommitter feature is an enhanced
version of the open source FileOutputCommitter. It
enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs that run on
Dataproc on Compute Engine clusters created with
the following image versions:
- 2.1 image versions 2.1.10 and higher
- 2.0 image versions 2.0.62 and higher
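
The version minimums above can be checked in a script before a job is submitted. The following is a minimal sketch, not an official tool: `supports_committer` is a hypothetical helper name, and the sketch assumes image version strings of the form `MAJOR.MINOR.PATCH` with an optional OS suffix such as `-debian11`. It knows only the two image tracks listed above and reports any other track as unsupported.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gcloud or Dataproc): succeeds (exit 0)
# when the given image version meets the minimums listed above.
supports_committer() {
  local version="${1%%-*}"      # drop an OS suffix: "2.1.10-debian11" -> "2.1.10"
  local patch="${version##*.}"  # text after the last dot: "2.1.10" -> "10"
  case "$version" in
    2.1.*) [ "$patch" -ge 10 ] 2>/dev/null ;;  # 2.1 track: 2.1.10 and higher
    2.0.*) [ "$patch" -ge 62 ] 2>/dev/null ;;  # 2.0 track: 2.0.62 and higher
    *) return 1 ;;                             # tracks not listed above: treated as unsupported here
  esac
}

# Example call with a literal version string.
if supports_committer "2.1.10-debian11"; then
  echo "supported"
fi
```

To check a real cluster, the image version can be read with `gcloud dataproc clusters describe CLUSTER_NAME --region=REGION --format='value(config.softwareConfig.imageVersion)'`, where CLUSTER_NAME and REGION are placeholders, and the result passed to the helper.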
Use DataprocFileOutputCommitter
To use this feature:
1. Create a Dataproc on Compute Engine cluster using image version
2.1.10 or higher, or 2.0.62 or higher.
2. Set
spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory
and
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false
as job properties when you submit a Spark job to the cluster.
- Google Cloud CLI example:
gcloud dataproc jobs submit spark \
    --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
    --region=REGION \
    other args ...
- Code example:
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.outputcommitter.factory.class", "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
sc.hadoopConfiguration.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")