This document describes how to create a Managed Service for Apache Spark zero-scale cluster.
Managed Service for Apache Spark zero-scale clusters provide a cost-effective way to use Managed Service for Apache Spark clusters. Unlike standard Managed Service for Apache Spark clusters that require at least two primary workers, Managed Service for Apache Spark zero-scale clusters use only secondary workers that can be scaled down to zero.
Managed Service for Apache Spark zero-scale clusters are ideal for use as long-running clusters that experience idle periods, such as a cluster that hosts a Jupyter notebook. They provide improved resource utilization through the use of zero-scale autoscaling policies.
Characteristics and limitations
A Managed Service for Apache Spark zero-scale cluster shares similarities with a standard cluster, but has the following unique characteristics and limitations:
- Requires image version `2.2.53` or later.
- Supports only secondary workers, not primary workers.
- Includes services such as YARN, but doesn't support the HDFS file system.
  - To use Cloud Storage as the default file system, set the `core:fs.defaultFS` cluster property to a Cloud Storage bucket location (`gs://BUCKET_NAME`).
  - If you disable a component during cluster creation, also disable HDFS.
- Can't be converted to or from a standard cluster.
- Requires an autoscaling policy for `ZERO_SCALE` cluster types.
- Requires selecting flexible VMs as the machine type.
- Doesn't support the Oozie component.
- Can't be created from the Google Cloud console.
Optional: Configure an autoscaling policy
You can configure an autoscaling policy to define secondary worker scaling for a zero-scale cluster. When doing so, note the following:
- Set the cluster type to `ZERO_SCALE`.
- Apply the autoscaling policy to the secondary worker configuration only.
For more information, see Create an autoscaling policy.
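As a sketch, the steps above can be expressed as a policy file imported with the gcloud CLI. The policy ID, instance bounds, scaling factors, and region below are illustrative values, and the exact placement of the `clusterType` field is an assumption to confirm against the autoscaling policy reference:

```shell
# Illustrative zero-scale autoscaling policy. All values are examples;
# the clusterType field placement is an assumption, not a confirmed schema.
cat > zero-scale-policy.yaml <<'EOF'
clusterType: ZERO_SCALE
secondaryWorkerConfig:
  minInstances: 0        # permits scaling down to zero workers
  maxInstances: 10
basicAlgorithm:
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 3600s
EOF

# Import the policy so it can be referenced by ID at cluster creation.
gcloud dataproc autoscaling-policies import zero-scale-policy \
    --source=zero-scale-policy.yaml \
    --region=us-central1
```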
Create a Managed Service for Apache Spark zero-scale cluster
Create a zero-scale cluster using the gcloud CLI or the Dataproc API.
gcloud
Run the `gcloud dataproc clusters create` command locally in a terminal window or in Cloud Shell.
gcloud dataproc clusters create CLUSTER_NAME \
--region=REGION \
--cluster-type=zero-scale \
--autoscaling-policy=AUTOSCALING_POLICY \
--properties=core:fs.defaultFS=gs://BUCKET_NAME \
--secondary-worker-machine-types="type=MACHINE_TYPE1[,type=MACHINE_TYPE2...][,rank=RANK]"
...other args
Replace the following:
- CLUSTER_NAME: name of the Managed Service for Apache Spark zero-scale cluster.
- REGION: an available Compute Engine region.
- AUTOSCALING_POLICY: the ID or resource URI of the autoscaling policy.
- BUCKET_NAME: name of your Cloud Storage bucket.
- MACHINE_TYPE: a specific Compute Engine machine type, such as `n1-standard-4` or `e2-standard-8`.
- RANK: defines the priority of a list of machine types.
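Filled in with illustrative values, the command might look like the following. The cluster name, region, policy ID, bucket, and machine types are hypothetical placeholders:

```shell
# Example invocation with placeholder values substituted.
# "zero-scale-policy", "my-bucket", and the machine types are illustrative.
gcloud dataproc clusters create my-zero-scale-cluster \
    --region=us-central1 \
    --cluster-type=zero-scale \
    --autoscaling-policy=zero-scale-policy \
    --properties=core:fs.defaultFS=gs://my-bucket \
    --secondary-worker-machine-types="type=n1-standard-4,type=e2-standard-8,rank=0"
```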
REST
Create a zero-scale cluster using a Managed Service for Apache Spark REST API clusters.create request:
- Set `ClusterConfig.ClusterType` for the `secondaryWorkerConfig` to `ZERO_SCALE`.
- Set `AutoscalingConfig.policyUri` with the `ZERO_SCALE` autoscaling policy ID.
- Add the `core:fs.defaultFS:gs://BUCKET_NAME` `SoftwareConfig.property`. Replace BUCKET_NAME with the name of your Cloud Storage bucket.
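A minimal sketch of such a request, sent with curl, might look like the following. PROJECT_ID, the region, the cluster and policy names, and the bucket are placeholders, and the exact position of the `clusterType` field in the request body is an assumption to verify against the API reference:

```shell
# Sketch of a clusters.create request. All names are placeholders, and the
# clusterType field location is an assumption, not a confirmed schema.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/us-central1/clusters" \
  -d '{
    "clusterName": "my-zero-scale-cluster",
    "config": {
      "clusterType": "ZERO_SCALE",
      "secondaryWorkerConfig": {
        "numInstances": 0
      },
      "autoscalingConfig": {
        "policyUri": "projects/PROJECT_ID/regions/us-central1/autoscalingPolicies/zero-scale-policy"
      },
      "softwareConfig": {
        "properties": {
          "core:fs.defaultFS": "gs://my-bucket"
        }
      }
    }
  }'
```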
What's next
- Learn more about Managed Service for Apache Spark autoscaling.