Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information.
Requirements:
-  Profiler supports only Dataproc Hadoop and Spark job types (Spark, PySpark, SparkSql, and SparkR). 
-  Jobs must run longer than 3 minutes to allow Profiler to collect and upload data to your project. 
Dataproc recognizes cloud.profiler.enable and the other cloud.profiler.* properties (see Profiler options), and then appends the relevant profiler JVM options to the following configurations:
- Spark: spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
- MapReduce: mapreduce.task.profile and other mapreduce.task.profile.* properties
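For example, with profiling enabled a Spark driver could end up with an extraJavaOptions value along these lines (an illustrative sketch, not output copied from a cluster; the agent path, service name, version, and project ID shown are assumptions):

    spark.driver.extraJavaOptions=-agentpath:/opt/cprof/profiler_java_agent.so=-cprof_service=my_job-driver,-cprof_service_version=v1,-cprof_project_id=my-project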
Enable profiling
Complete the following steps to enable and use the Profiler on your Dataproc Spark and Hadoop jobs.
- Create a Dataproc cluster with service account scopes set to monitoring to allow the cluster to talk to the profiler service.
- If you are using a custom VM service account, grant the Cloud Profiler Agent role to the custom VM service account (see the role-granting example after the cluster creation command below). This role contains the required profiler service permissions.
gcloud
gcloud dataproc clusters create cluster-name \
    --scopes=cloud-platform \
    --region=region \
    other args ...
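If you are using a custom VM service account, the Cloud Profiler Agent role (roles/cloudprofiler.agent) can be granted with a command along these lines (a sketch; project-id and the service account address are placeholders to replace with your own values):

    # Grant the Cloud Profiler Agent role to the custom VM service account
    gcloud projects add-iam-policy-binding project-id \
        --member=serviceAccount:service-account-name@project-id.iam.gserviceaccount.com \
        --role=roles/cloudprofiler.agent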
Submit a Dataproc job with Profiler options
- Submit a Dataproc Spark or Hadoop job with one or more of the following Profiler options:

| Option | Description | Value | Required/Optional | Default | Notes |
|---|---|---|---|---|---|
| cloud.profiler.enable | Enable profiling of the job | true or false | Required | false | |
| cloud.profiler.name | Name used to create the profile on the Profiler service | profile-name | Optional | Dataproc job UUID | |
| cloud.profiler.service.version | A user-supplied string to identify and distinguish profiler results | Profiler Service Version | Optional | Dataproc job UUID | |
| mapreduce.task.profile.maps | Numeric range of map tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
| mapreduce.task.profile.reduces | Numeric range of reducer tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
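As an illustration, a Spark job packaged in a jar could be submitted with profiling enabled roughly as follows (a sketch; the SparkPi example class and the spark-examples.jar path on the cluster image are assumptions, as are the profiler name and version values):

    # Submit a Spark job with Cloud Profiler enabled
    gcloud dataproc jobs submit spark \
        --cluster=cluster-name \
        --region=region \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        --properties=cloud.profiler.enable=true,cloud.profiler.name=spark_pi_job,cloud.profiler.service.version=v1 \
        -- 1000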
PySpark Example
Google Cloud CLI
Example of submitting a PySpark job with profiling:
gcloud dataproc jobs submit pyspark python-job-file \
    --cluster=cluster-name \
    --region=region \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args
Two profiles will be created:
- profiler_name-driver to profile Spark driver tasks
- profiler_name-executor to profile Spark executor tasks
For example, if the profiler_name is "spark_word_count_job", spark_word_count_job-driver and spark_word_count_job-executor profiles are created.
Hadoop Example
gcloud CLI
Example of submitting a Hadoop (teragen mapreduce) job with profiling:
gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- teragen 100000 gs://bucket-name
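To profile only a subset of tasks in a Hadoop mapreduce job, the mapreduce.task.profile.maps and mapreduce.task.profile.reduces options from the table above can be added to the same command (a sketch; the 0-2 ranges are arbitrary example values):

    # Profile only the first three map and reduce tasks
    gcloud dataproc jobs submit hadoop \
        --cluster=cluster-name \
        --region=region \
        --jar=jar-file \
        --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,mapreduce.task.profile.maps=0-2,mapreduce.task.profile.reduces=0-2 \
        -- teragen 100000 gs://bucket-name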
View profiles
View the collected profiles on the Profiler page in the Google Cloud console.
What's next
- See the Monitoring documentation
- See the Logging documentation
- Explore Google Cloud Observability

