The DataprocFileOutputCommitter feature is an enhanced
version of the open source FileOutputCommitter. It
enables concurrent writes by Apache Spark jobs to an output location.
Limitations
The DataprocFileOutputCommitter feature supports Spark jobs run on
Dataproc Compute Engine clusters created with
the following image versions:
-  2.1 image versions 2.1.10 and higher 
-  2.0 image versions 2.0.62 and higher 
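
The version limits above can be checked in a script before a job is submitted. The following is a minimal sketch, not part of Dataproc: the function name supports_dataproc_committer is hypothetical, and it assumes the image version is given as MAJOR.MINOR.PATCH with an optional OS suffix such as -debian11. Any image track not listed above is reported as unsupported.

```shell
# Hypothetical helper: succeeds (exit status 0) only for image versions in the
# list above: 2.1.10 and higher on the 2.1 track, 2.0.62 and higher on 2.0.
# Written for POSIX sh, so the variables are not function-local.
supports_dataproc_committer() {
  version=${1%%-*}              # drop an assumed OS suffix such as "-debian11"
  case "$version" in
    2.1.*) min_patch=10 ;;
    2.0.*) min_patch=62 ;;
    *)     return 1 ;;          # tracks not listed above count as unsupported
  esac
  patch=${version#*.*.}         # keep what follows "MAJOR.MINOR."
  case "$patch" in
    ''|*[!0-9]*) return 1 ;;    # reject an empty or non-numeric patch component
  esac
  [ "$patch" -ge "$min_patch" ]
}
```

For example, supports_dataproc_committer 2.1.12-debian11 succeeds, while supports_dataproc_committer 2.0.61 fails.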
Use DataprocFileOutputCommitter

To use this feature:
-  Create a Dataproc on Compute Engine cluster using image version 2.1.10 or higher (2.1 images) or 2.0.62 or higher (2.0 images).
-  Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.
   -  Google Cloud CLI example:

          gcloud dataproc jobs submit spark \
              --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
              --region=REGION \
              other args ...

   -  Code example:

          sc.hadoopConfiguration.set("spark.hadoop.mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
          sc.hadoopConfiguration.set("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
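
Because the --properties value is long, it can help to assemble it from variables. The sketch below does that and prints the resulting command instead of running it. The variable names and the default REGION and CLUSTER values are hypothetical placeholders, and the printed command is deliberately incomplete: the main class or jar and any job arguments still need to be appended.

```shell
# The two committer settings from the step above, joined into a single
# comma-separated value for --properties.
FACTORY_KEY="spark.hadoop.mapreduce.outputcommitter.factory.class"
FACTORY_CLASS="org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory"
SUCCESS_KEY="spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs"
COMMITTER_PROPS="${FACTORY_KEY}=${FACTORY_CLASS},${SUCCESS_KEY}=false"

# Hypothetical placeholders: replace with your own region and cluster name.
REGION="${REGION:-us-central1}"
CLUSTER="${CLUSTER:-example-cluster}"

# Dry run: echo the command so it can be reviewed before it is executed.
echo gcloud dataproc jobs submit spark \
  "--cluster=${CLUSTER}" \
  "--region=${REGION}" \
  "--properties=${COMMITTER_PROPS}"
```

Printing the command first makes it easy to confirm that both properties appear in one --properties flag, separated by a comma.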

