This document describes how to create a Managed Service for Apache Spark zero-scale cluster.
Managed Service for Apache Spark zero-scale clusters provide a cost-effective way to use Managed Service for Apache Spark clusters. Unlike standard Managed Service for Apache Spark clusters that require at least two primary workers, Managed Service for Apache Spark zero-scale clusters use only secondary workers that can be scaled down to zero.
Managed Service for Apache Spark zero-scale clusters are ideal for use as long-running clusters that experience idle periods, such as a cluster that hosts a Jupyter notebook. They provide improved resource utilization through the use of zero-scale autoscaling policies.
Requirements and limitations
A Managed Service for Apache Spark zero-scale cluster has the following requirements and limitations:
- Requires image version 2.2.61 or later.
- Requires using Cloud Storage, not the HDFS file system.
- Supports secondary workers only, not primary workers.
- Can't be converted to or from a standard cluster.
- Doesn't support the Oozie component.
Create a Managed Service for Apache Spark zero-scale cluster
You can create a zero-scale cluster using the gcloud CLI or the Managed Service for Apache Spark API.
gcloud
Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.
gcloud dataproc clusters create CLUSTER_NAME \
--region=REGION \
--cluster-type=zero-scale \
--autoscaling-policy=AUTOSCALING_POLICY \
--properties=core:fs.defaultFS=gs://BUCKET_NAME \
...other args
Replace the following:
- CLUSTER_NAME: name of the Managed Service for Apache Spark zero-scale cluster.
- REGION: an available Compute Engine region.
- AUTOSCALING_POLICY (Optional): If you create an autoscaling policy to apply to the zero-scale cluster, use this flag to specify the ID or resource URI of the autoscaling policy. When creating the policy:
  - Set the clusterType to ZERO_SCALE.
  - Configure an autoscaling policy for the secondaryWorkerConfig only.
- Cloud Storage file system: You must set core:fs.defaultFS to a Cloud Storage bucket to set the zero-scale cluster file system to Cloud Storage instead of the default HDFS.
- BUCKET_NAME: name of a Cloud Storage bucket. The bucket name must be unique for each zero-scale cluster.
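The gcloud steps above might look like the following end-to-end sketch. The policy name, cluster name, region, bucket name, and the policy YAML values are illustrative placeholders, and the exact autoscaling policy schema may differ from this outline; check the autoscaling policy reference for your image version before using it.

```shell
# Hedged example: all names and values below are placeholders, not values
# prescribed by this guide.

# 1. Define a ZERO_SCALE autoscaling policy that configures secondary
#    workers only (zero-scale clusters have no primary workers).
cat > zero-scale-policy.yaml <<'EOF'
clusterType: ZERO_SCALE
secondaryWorkerConfig:
  minInstances: 0        # allows scaling down to zero workers
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
EOF

# 2. Import the policy into your region.
gcloud dataproc autoscaling-policies import my-zero-scale-policy \
    --region=us-central1 \
    --source=zero-scale-policy.yaml

# 3. Create the zero-scale cluster, attaching the policy and pointing the
#    default file system at a unique Cloud Storage bucket instead of HDFS.
gcloud dataproc clusters create my-zero-scale-cluster \
    --region=us-central1 \
    --cluster-type=zero-scale \
    --autoscaling-policy=my-zero-scale-policy \
    --properties=core:fs.defaultFS=gs://my-unique-cluster-bucket
```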
REST
- Cluster type: Set ClusterConfig.ClusterType to ZERO_SCALE.
- Autoscaling policy (Optional): If you create an autoscaling policy to apply to the zero-scale cluster, set AutoscalingConfig.policyUri to the ZERO_SCALE autoscaling policy ID. When creating the policy:
  - Set the clusterType to ZERO_SCALE.
  - Configure an autoscaling policy for the secondaryWorkerConfig only.
- Cloud Storage file system: You must set core:fs.defaultFS to a Cloud Storage bucket to set the zero-scale cluster file system to Cloud Storage instead of the default HDFS. Add the core:fs.defaultFS:gs://BUCKET_NAME SoftwareConfig property. Replace BUCKET_NAME with the name of your Cloud Storage bucket. Specify a unique bucket name for each zero-scale cluster.
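As a rough sketch, a REST request that applies these settings could look like the following curl call. The project ID, region, cluster name, policy name, and bucket name are placeholders, and the JSON field spellings below (clusterType, autoscalingConfig.policyUri, softwareConfig.properties) are an assumption based on the field names described above; confirm them against the clusters.create API reference before relying on this.

```shell
# Hedged example: placeholder names throughout; verify field names against
# the API reference for your image version.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/us-central1/clusters" \
  -d '{
    "clusterName": "my-zero-scale-cluster",
    "config": {
      "clusterType": "ZERO_SCALE",
      "autoscalingConfig": {
        "policyUri": "projects/PROJECT_ID/regions/us-central1/autoscalingPolicies/my-zero-scale-policy"
      },
      "softwareConfig": {
        "properties": {
          "core:fs.defaultFS": "gs://my-unique-cluster-bucket"
        }
      }
    }
  }'
```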
What's next
- Learn more about Managed Service for Apache Spark autoscaling.