"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Cloud Profiler

Cloud Profiler continuously gathers and reports application CPU usage and memory-allocation information.

Requirements:

Profiler supports only Managed Service for Apache Spark Hadoop and Spark job types (Spark, PySpark, SparkSql, and SparkR).
Jobs must run longer than 3 minutes to allow Profiler to collect and upload data to your project.

Managed Service for Apache Spark recognizes cloud.profiler.enable and the other cloud.profiler.* properties (see Profiler options), and then appends the relevant profiler JVM options to the following configurations:

Spark: spark.driver.extraJavaOptions and spark.executor.extraJavaOptions
MapReduce: mapreduce.task.profile and other mapreduce.task.profile.* properties

Enable profiling

Complete the following steps to enable and use Profiler with your Spark and Hadoop jobs on Managed Service for Apache Spark.

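Enable the Cloud Profiler API for your project.
Create a Managed Service for Apache Spark cluster whose service account access scopes include monitoring, so that the cluster can communicate with the Profiler service. The example command below uses the broader cloud-platform scope.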
If you are using a custom VM service account, grant the Cloud Profiler Agent role to the custom VM service account. This role contains required profiler service permissions.

gcloud

gcloud dataproc clusters create cluster-name \
    --scopes=cloud-platform \
    --region=region \
    other args ...
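
Steps 1 and 3 above can also be carried out from the command line. The following is a minimal sketch rather than a definitive procedure. PROJECT_ID and SA_EMAIL are hypothetical placeholders, and the `run` helper is introduced here only so that the commands can be previewed before they are executed. The sketch assumes that the Cloud Profiler API service name is cloudprofiler.googleapis.com and that the Cloud Profiler Agent role ID is roles/cloudprofiler.agent.

```shell
# Placeholder values (assumptions); replace them with your own.
PROJECT_ID="${PROJECT_ID:-my-project}"
SA_EMAIL="${SA_EMAIL:-my-sa@${PROJECT_ID}.iam.gserviceaccount.com}"

# Hypothetical helper: print each command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Step 1: enable the Cloud Profiler API in the project.
run gcloud services enable cloudprofiler.googleapis.com \
    --project="${PROJECT_ID}"

# Step 3: grant the Cloud Profiler Agent role to a custom VM service account.
run gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/cloudprofiler.agent"
```

By default the sketch only prints the two commands. Set DRY_RUN=0 to execute them once the values look right.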

Submit a Managed Service for Apache Spark job with Profiler options

Submit a Spark or Hadoop job with one or more of the following Profiler options:

Option	Description	Value	Required/Optional	Default	Notes
`cloud.profiler.enable`	Enable profiling of the job	`true` or `false`	Required	`false`
`cloud.profiler.name`	Name used to create profile on the Profiler Service	`profile-name`	Optional	Managed Service for Apache Spark job UUID
`cloud.profiler.service.version`	A user-supplied string to identify and distinguish profiler results.	`Profiler Service Version`	Optional	Managed Service for Apache Spark job UUID
`mapreduce.task.profile.maps`	Numeric range of map tasks to profile (example: to profile up to 100 tasks, specify "0-100")	`number range`	Optional	0-10000	Applies to Hadoop MapReduce jobs only
`mapreduce.task.profile.reduces`	Numeric range of reducer tasks to profile (example: to profile up to 100 tasks, specify "0-100")	`number range`	Optional	0-10000	Applies to Hadoop MapReduce jobs only
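
The cloud.profiler.* options in the table are passed to a job as a single comma-separated `--properties` value. As a rough sketch, the helper below assembles that value. The function name profiler_properties is hypothetical and is not part of the gcloud CLI. It always sets cloud.profiler.enable=true and adds the optional name and version only when they are supplied.

```shell
# Hypothetical helper (not part of gcloud): build a --properties value
# from the Profiler options listed in the table.
# Usage: profiler_properties [profile-name] [service-version]
profiler_properties() {
  props="cloud.profiler.enable=true"   # required to turn profiling on
  if [ -n "${1:-}" ]; then
    props="${props},cloud.profiler.name=${1}"
  fi
  if [ -n "${2:-}" ]; then
    props="${props},cloud.profiler.service.version=${2}"
  fi
  printf '%s\n' "${props}"
}

# Print a value that could be passed as --properties=... when submitting a job.
profiler_properties spark_word_count_job v1
```

When the name or version is omitted, the corresponding property is left out, so the defaults listed in the table apply.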

PySpark Example

Google Cloud CLI

Example: submit a PySpark job with profiling enabled:

gcloud dataproc jobs submit pyspark python-job-file \
    --cluster=cluster-name \
    --region=region \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- job args

Two profiles are created:

profiler_name-driver, which profiles Spark driver tasks
profiler_name-executor, which profiles Spark executor tasks

For example, if profiler_name is "spark_word_count_job", the spark_word_count_job-driver and spark_word_count_job-executor profiles are created.
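
The naming rule can be written out as a small sketch. The profile_names function is a hypothetical helper for illustration only. It mirrors the -driver and -executor suffixes described above and does not query the Profiler service.

```shell
# Hypothetical helper: list the two profile names expected for a
# given cloud.profiler.name value on a Spark or PySpark job.
profile_names() {
  printf '%s-driver\n%s-executor\n' "$1" "$1"
}

profile_names spark_word_count_job
# prints:
# spark_word_count_job-driver
# spark_word_count_job-executor
```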

Hadoop Example

gcloud CLI

Example: submit a Hadoop (TeraGen MapReduce) job with profiling enabled:

gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties=cloud.profiler.enable=true,cloud.profiler.name=profiler_name,cloud.profiler.service.version=version \
    -- teragen 100000 gs://bucket-name
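
To limit which map tasks are profiled, a mapreduce.task.profile.maps range from the options table can be added to the same `--properties` value. The sketch below only assembles and prints the command. The profile name teragen_profile and the 0-100 range are illustrative assumptions, and cluster-name, region, jar-file, and bucket-name remain placeholders.

```shell
# Illustrative values (assumptions); adjust them for your job.
PROFILE_NAME="teragen_profile"
MAP_RANGE="0-100"   # numeric range of map tasks to profile

# Assemble the comma-separated properties value step by step.
PROPS="cloud.profiler.enable=true"
PROPS="${PROPS},cloud.profiler.name=${PROFILE_NAME}"
PROPS="${PROPS},mapreduce.task.profile.maps=${MAP_RANGE}"

# Print the command for review; remove the leading "echo" to run it.
echo gcloud dataproc jobs submit hadoop \
    --cluster=cluster-name \
    --region=region \
    --jar=jar-file \
    --properties="${PROPS}" \
    -- teragen 100000 gs://bucket-name
```

The mapreduce.task.profile.reduces option takes a range in the same format.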

View profiles

View profiles in Profiler in the Google Cloud console.