Cloud Profiler

Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information.

Requirements:

  • Profiler supports only Dataproc Hadoop and Spark job types (Spark, PySpark, SparkSql, and SparkR).

  • Jobs must run longer than 3 minutes to allow Profiler to collect and upload data to your project.

Dataproc recognizes cloud.profiler.enable and the other cloud.profiler.* properties (see the Profiler options table below), and then appends the relevant profiler JVM options to the following configurations (how this interacts with properties you set yourself is sketched after the list):

  • Spark: spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
  • MapReduce: mapreduce.task.profile and other mapreduce.task.profile.* properties
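
Since Dataproc appends the profiler JVM options to these configurations, extraJavaOptions values that you set yourself on the same job should still be honored. A minimal sketch, assuming a Spark job; the class, jar, and -verbose:gc flag are illustrative placeholders, not part of the Profiler integration:

gcloud dataproc jobs submit spark \
    --cluster=cluster-name \
    --region=region \
    --class=org.example.MyApp \
    --jars=gs://bucket-name/my-app.jar \
    --properties=cloud.profiler.enable=true,spark.driver.extraJavaOptions=-verbose:gc,spark.executor.extraJavaOptions=-verbose:gc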

Enable profiling

Complete the following steps to enable and use the Profiler on your Dataproc Spark and Hadoop jobs.

  1. Enable the Profiler (a sample gcloud command follows the cluster-creation example in step 2).

  2. Create a Dataproc cluster with service account scopes that allow the cluster to talk to the Profiler service (the following example uses the cloud-platform scope, which includes the monitoring scope).

gcloud

gcloud dataproc clusters create cluster-name \
    --scopes=cloud-platform \
    --region=region \
    other args ...
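
For step 1, one way to enable the Profiler from the command line is to enable its API on your project. A minimal sketch; cloudprofiler.googleapis.com is the service name of the Cloud Profiler API:

gcloud services enable cloudprofiler.googleapis.com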

Submit a Dataproc job with Profiler options

  1. Submit a Dataproc Spark or Hadoop job with one or more of the following Profiler options:
    cloud.profiler.enable
      Description: Enable profiling of the job
      Value: true or false
      Required/Optional: Required
      Default: false

    cloud.profiler.name
      Description: Name used to create the profile on the Profiler service
      Value: profile-name
      Required/Optional: Optional
      Default: Dataproc job UUID

    cloud.profiler.service.version
      Description: A user-supplied string to identify and distinguish profiler results
      Value: Profiler service version
      Required/Optional: Optional
      Default: Dataproc job UUID

    mapreduce.task.profile.maps
      Description: Numeric range of map tasks to profile (example: for up to 100, specify "0-100")
      Value: number range
      Required/Optional: Optional
      Default: 0-10000
      Notes: Applies to Hadoop MapReduce jobs only

    mapreduce.task.profile.reduces
      Description: Numeric range of reduce tasks to profile (example: for up to 100, specify "0-100")
      Value: number range
      Required/Optional: Optional
      Default: 0-10000
      Notes: Applies to Hadoop MapReduce jobs only

PySpark Example

gcloud

Example PySpark job submission with profiling enabled:

gcloud dataproc jobs submit pyspark python-job-file \
    --cluster=cluster-name \
    --region=region \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args

Two profiles will be created:

  1. profiler_name-driver to profile Spark driver tasks
  2. profiler_name-executor to profile Spark executor tasks

For example, if the profiler_name is "spark_word_count_job", spark_word_count_job-driver and spark_word_count_job-executor profiles are created.

Hadoop Example

gcloud

Example Hadoop (TeraGen MapReduce) job submission with profiling enabled:

gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- teragen 100000 gs://bucket-name
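
To profile only a subset of tasks in a Hadoop MapReduce job, you can also pass the mapreduce.task.profile.maps and mapreduce.task.profile.reduces options from the table above. A minimal sketch based on the TeraGen example; the 0-2 ranges are illustrative placeholders:

gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,mapreduce.task.profile.maps=0-2,mapreduce.task.profile.reduces=0-2 \
    -- teragen 100000 gs://bucket-name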

View profiles

View your profiles on the Profiler page in the Google Cloud console.

What's next