Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information.
Requirements:

- Profiler supports only Dataproc Hadoop and Spark job types (Spark, PySpark, SparkSql, and SparkR).
- Jobs must run longer than 3 minutes to allow Profiler to collect and upload data to your project.
Dataproc recognizes `cloud.profiler.enable` and the other `cloud.profiler.*` properties (see Profiler options), and then appends the relevant profiler JVM options to the following configurations:

- Spark: `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`
- MapReduce: `mapreduce.task.profile` and other `mapreduce.task.profile.*` properties
Enable profiling
Complete the following steps to enable and use the Profiler on your Dataproc Spark and Hadoop jobs.
- Create a Dataproc cluster with service account scopes set to `monitoring` to allow the cluster to talk to the Profiler service. The broad `cloud-platform` scope used below includes the `monitoring` scope.

  ```shell
  gcloud dataproc clusters create cluster-name \
      --scopes=cloud-platform \
      --region=region \
      other args ...
  ```
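As a sketch, the command with example values filled in might look like the following. The cluster name and region are hypothetical placeholders, not requirements, and this snippet only prints the command (a dry run) instead of executing it:

```shell
#!/bin/sh
# Hypothetical cluster name and region -- substitute your own values.
CLUSTER="profiling-demo"
REGION="us-central1"

# The broad cloud-platform scope includes the monitoring scope that
# Profiler requires. Print the command rather than running it.
echo gcloud dataproc clusters create "$CLUSTER" \
  --scopes=cloud-platform \
  --region="$REGION"
```

Run the printed command once you have confirmed the values are correct for your project.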
Submit a Dataproc job with Profiler options

- Submit a Dataproc Spark or Hadoop job with one or more of the following Profiler options:
| Option | Description | Value | Required/Optional | Default | Notes |
|---|---|---|---|---|---|
| `cloud.profiler.enable` | Enable profiling of the job | `true` or `false` | Required | `false` | |
| `cloud.profiler.name` | Name used to create the profile on the Profiler service | profile-name | Optional | Dataproc job UUID | |
| `cloud.profiler.service.version` | A user-supplied string to identify and distinguish profiler results | Profiler Service Version | Optional | Dataproc job UUID | |
| `mapreduce.task.profile.maps` | Numeric range of map tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
| `mapreduce.task.profile.reduces` | Numeric range of reducer tasks to profile (example: for up to 100, specify "0-100") | number range | Optional | 0-10000 | Applies to Hadoop mapreduce jobs only |
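The options in the table are passed as a single comma-separated `--properties` value at job submit time. A minimal sketch of assembling that string; the profile name and version here are made-up examples:

```shell
#!/bin/sh
# Illustrative values only; pick your own profile name and version string.
PROFILER_NAME="my_job_profile"
SERVICE_VERSION="v1"

# Build the comma-separated value for --properties.
PROPS="cloud.profiler.enable=true"
PROPS="${PROPS},cloud.profiler.name=${PROFILER_NAME}"
PROPS="${PROPS},cloud.profiler.service.version=${SERVICE_VERSION}"
echo "$PROPS"
```

The resulting string plugs directly into `--properties=` in the submit commands below.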
PySpark Example

PySpark job submit with profiling example:

```shell
gcloud dataproc jobs submit pyspark python-job-file \
    --cluster=cluster-name \
    --region=region \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args
```
Two profiles will be created:

- profiler_name-driver to profile Spark driver tasks
- profiler_name-executor to profile Spark executor tasks
For example, if the profiler_name is "spark_word_count_job", the spark_word_count_job-driver and spark_word_count_job-executor profiles are created.
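The naming convention above can be sketched in shell; "spark_word_count_job" is the example name from the text:

```shell
#!/bin/sh
# Dataproc appends -driver and -executor to cloud.profiler.name.
PROFILER_NAME="spark_word_count_job"
DRIVER_PROFILE="${PROFILER_NAME}-driver"
EXECUTOR_PROFILE="${PROFILER_NAME}-executor"
echo "$DRIVER_PROFILE"    # spark_word_count_job-driver
echo "$EXECUTOR_PROFILE"  # spark_word_count_job-executor
```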
Hadoop Example

Hadoop (teragen mapreduce) job submit with profiling example:

```shell
gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- teragen 100000 gs://bucket-name
```
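For MapReduce jobs, the `--properties` value can also carry the Hadoop-only task ranges from the options table. A sketch, using a hypothetical profile name and the "0-100" range from the table's example:

```shell
#!/bin/sh
# Hypothetical profile name; the ranges mirror the table's example values.
PROPS="cloud.profiler.enable=true,cloud.profiler.name=teragen_profile"
PROPS="${PROPS},mapreduce.task.profile.maps=0-100"
PROPS="${PROPS},mapreduce.task.profile.reduces=0-100"
echo "$PROPS"
```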
View profiles
View profiles from the Profiler page in the Google Cloud console.
What's next
- See the Monitoring documentation
- See the Logging documentation
- Explore Google Cloud Observability