The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc Compute Engine clusters created with the following image versions:
- 2.1 image versions 2.1.10 and higher
- 2.0 image versions 2.0.62 and higher
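The minimums above can be checked before a job is submitted. The following is a minimal sketch, assuming the image version is a string such as `2.1.10` or `2.1.10-debian11`. The helper name `supports_dataproc_committer` is hypothetical and is not part of any Dataproc API. Image tracks that are not listed above are conservatively reported as unsupported, because this page does not mention them.

```python
import re

# Minimum patch release per image track, taken from the list above.
MIN_PATCH = {(2, 1): 10, (2, 0): 62}


def supports_dataproc_committer(image_version: str) -> bool:
    """Return True if image_version meets the minimums listed above.

    Hypothetical helper: assumes versions look like "2.1.10" or
    "2.1.10-debian11". Tracks not listed above return False.
    """
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)(?:-.*)?$", image_version)
    if match is None:
        return False
    major, minor, patch = (int(part) for part in match.groups())
    minimum = MIN_PATCH.get((major, minor))
    # Compare the patch number as an integer, not as text.
    return minimum is not None and patch >= minimum
```

Comparing the components as integers matters here: as plain strings, `"2.1.9"` sorts after `"2.1.10"`, which would wrongly accept an image that is too old.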
Use DataprocFileOutputCommitter
To use this feature:
- Create a Dataproc on Compute Engine cluster using image version 2.1.10 or higher (2.1 images) or 2.0.62 or higher (2.0 images).
- Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.
  - Google Cloud CLI example:
    gcloud dataproc jobs submit spark \
        --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
        --region=REGION \
        other args ...
  - Code example:
    // Set directly on the Hadoop configuration, the keys take no "spark.hadoop." prefix.
    sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.class", "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
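When jobs are submitted from a script, the two properties can be kept in one place and joined into the `--properties` flag. The following is a minimal sketch that only builds and prints the command line and does not run `gcloud`. The function name `build_submit_command`, the region `us-central1`, and the cluster name `example-cluster` are illustrative assumptions, not values from this page.

```python
import shlex

# The two job properties named in the steps above.
COMMITTER_PROPERTIES = {
    "spark.hadoop.mapreduce.outputcommitter.factory.class":
        "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory",
    "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def build_submit_command(region, other_args=()):
    """Build the argument list for `gcloud dataproc jobs submit spark`.

    Hypothetical helper: other_args carries the remaining job arguments
    (cluster, main class or jar, and so on), passed through unchanged.
    """
    # gcloud expects a single comma-separated list of key=value pairs.
    properties = ",".join(
        f"{key}={value}" for key, value in COMMITTER_PROPERTIES.items()
    )
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--properties={properties}",
        f"--region={region}",
        *other_args,
    ]


# Placeholder region and cluster name; replace with your own values.
command = build_submit_command("us-central1", ["--cluster=example-cluster"])
print(shlex.join(command))
```

Returning a list rather than a single string lets the result be passed to `subprocess.run` without extra shell quoting.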

