Run Spark jobs with DataprocFileOutputCommitter

The DataprocFileOutputCommitter feature is an enhanced version of the open source FileOutputCommitter. It enables concurrent writes by Apache Spark jobs to a shared output location.

Limitations

The DataprocFileOutputCommitter feature supports Spark jobs run on Dataproc Compute Engine clusters created with the following image versions:

  • 2.1 image versions 2.1.10 and higher

  • 2.0 image versions 2.0.62 and higher

Use DataprocFileOutputCommitter

To use this feature:

  1. Create a Dataproc on Compute Engine cluster with image version 2.1.10 or later, or 2.0.62 or later.

  2. Set spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory and spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false as job properties when you submit a Spark job to the cluster.

    • Google Cloud CLI example:
    gcloud dataproc jobs submit spark \
        --properties=spark.hadoop.mapreduce.outputcommitter.factory.class=org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory,spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
        --region=REGION \
        other args ...
    • Code example:
    // When setting the properties directly on the Hadoop configuration,
    // omit the "spark.hadoop." prefix; Spark strips that prefix only when
    // copying job properties into the Hadoop configuration.
    sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.class","org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
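The properties can also be set once when the Spark session is created, so they apply to every job the application runs. A minimal sketch in Scala, assuming a standard SparkSession-based application (the application name is a placeholder, not from the original):

```scala
import org.apache.spark.sql.SparkSession

// The "spark.hadoop." prefix tells Spark to copy each property into the
// Hadoop Configuration (with the prefix stripped) for the jobs it runs.
val spark = SparkSession.builder()
  .appName("committer-example") // placeholder name
  .config("spark.hadoop.mapreduce.outputcommitter.factory.class",
    "org.apache.hadoop.mapreduce.lib.output.DataprocFileOutputCommitterFactory")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
  .getOrCreate()
```

Setting the values at session creation avoids the timing pitfall of mutating sc.hadoopConfiguration after some jobs have already started.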