Create a Managed Service for Apache Spark zero-scale cluster

This document describes how to create a Managed Service for Apache Spark zero-scale cluster.

Managed Service for Apache Spark zero-scale clusters provide a cost-effective way to use Managed Service for Apache Spark clusters. Unlike standard clusters, which require at least two primary workers, zero-scale clusters use only secondary workers, which can be scaled down to zero.

Managed Service for Apache Spark zero-scale clusters are well suited to long-running clusters that experience idle periods, such as a cluster that hosts a Jupyter notebook. Zero-scale autoscaling policies improve resource utilization by letting the cluster scale its workers down to zero while it is idle.

Requirements and limitations

A Managed Service for Apache Spark zero-scale cluster has the following requirements and limitations:

  • Requires image version 2.2.61 or later.
  • Requires using Cloud Storage, not the HDFS file system.
  • Supports secondary workers only, not primary workers.
  • Can't be converted to or from a standard cluster.
  • Doesn't support the Oozie component.
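
The image version requirement can be checked before you create a cluster. The following is a minimal sketch, not an official tool: it assumes a sort command that supports -V (as GNU sort does), and the function name version_at_least and the sample image version are hypothetical.

```shell
#!/bin/sh
# Minimum image version for zero-scale clusters, as listed above.
MIN_IMAGE_VERSION="2.2.61"

# Succeeds if image version $1 is at least $2.
# Any OS suffix (for example "-debian12") is stripped before comparing.
# A version without a patch component (for example "2.2") is treated as
# older than 2.2.61, because the exact patch level is not known.
version_at_least() {
  have="${1%%-*}"
  want="${2%%-*}"
  # sort -V orders version strings numerically per component.
  lowest="$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)"
  [ "$lowest" = "$want" ]
}

# Hypothetical sample value; set IMAGE_VERSION to the version you plan to use.
IMAGE_VERSION="${IMAGE_VERSION:-2.2.61-debian12}"
if version_at_least "$IMAGE_VERSION" "$MIN_IMAGE_VERSION"; then
  echo "Image version $IMAGE_VERSION meets the zero-scale minimum."
else
  echo "Image version $IMAGE_VERSION is older than $MIN_IMAGE_VERSION." >&2
fi
```

The check covers only the version requirement; the other limitations in the list still apply.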

Create a Managed Service for Apache Spark zero-scale cluster

You can create a zero-scale cluster using the gcloud CLI or the Managed Service for Apache Spark API.

gcloud

Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --cluster-type=zero-scale \
    --autoscaling-policy=AUTOSCALING_POLICY \
    --properties=core:fs.defaultFS=gs://BUCKET_NAME \
    ...other args

Replace the following:

  • CLUSTER_NAME: name of the Managed Service for Apache Spark zero-scale cluster.
  • REGION: an available Compute Engine region.
  • AUTOSCALING_POLICY (Optional): the ID or resource URI of an autoscaling policy to apply to the zero-scale cluster. Omit the --autoscaling-policy flag if you don't use a policy. When you create the policy:
    • Set the clusterType to ZERO_SCALE.
    • Configure an autoscaling policy for the secondaryWorkerConfig only.
  • Cloud Storage file system: You must set the core:fs.defaultFS property to a Cloud Storage bucket so that the zero-scale cluster uses Cloud Storage as its file system instead of the default HDFS.
    • BUCKET_NAME: name of a Cloud Storage bucket. The bucket name must be unique for each zero-scale cluster.
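
The steps above can be sketched as a short script. This is an illustration under stated assumptions, not a definitive procedure: the cluster name, region, bucket, policy ID, file path, and all numeric policy values are hypothetical placeholders. The clusterType and secondaryWorkerConfig settings come from the list above; the remaining policy fields follow the general autoscaling policy format and may need adjusting for your workload. The script only writes the policy file and prints the commands, so you can review them before running anything.

```shell
#!/bin/sh
# Sketch only: every value below is a hypothetical placeholder.
CLUSTER_NAME="${CLUSTER_NAME:-example-zero-scale}"
REGION="${REGION:-us-central1}"
BUCKET_NAME="${BUCKET_NAME:-example-zero-scale-bucket}"
POLICY_ID="${POLICY_ID:-example-zero-scale-policy}"
POLICY_FILE="${POLICY_FILE:-${TMPDIR:-/tmp}/zero-scale-policy.yaml}"

# Write a policy that sets clusterType to ZERO_SCALE and configures
# secondary workers only. Instance counts and timings are illustrative.
write_policy_file() {
  cat > "$POLICY_FILE" <<'EOF'
clusterType: ZERO_SCALE
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF
}

# Print the command that imports the policy file as POLICY_ID.
import_policy_command() {
  printf 'gcloud dataproc autoscaling-policies import %s --source=%s --region=%s\n' \
    "$POLICY_ID" "$POLICY_FILE" "$REGION"
}

# Print the command that creates the zero-scale cluster with the policy
# and with Cloud Storage as the default file system.
create_cluster_command() {
  printf 'gcloud dataproc clusters create %s --region=%s --cluster-type=zero-scale --autoscaling-policy=%s --properties=core:fs.defaultFS=gs://%s\n' \
    "$CLUSTER_NAME" "$REGION" "$POLICY_ID" "$BUCKET_NAME"
}

write_policy_file
import_policy_command
create_cluster_command
```

Printing the commands instead of running them keeps the sketch safe to try; copy the printed lines into your terminal once the values match your project.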

REST

Set the following fields in a clusters.create request:

  • Cluster type: Set ClusterConfig.ClusterType to ZERO_SCALE.
  • Autoscaling policy (Optional): To apply an autoscaling policy to the zero-scale cluster, set AutoscalingConfig.policyUri to the ID of the ZERO_SCALE autoscaling policy. When you create the policy:
    • Set the clusterType to ZERO_SCALE.
    • Configure an autoscaling policy for the secondaryWorkerConfig only.
  • Cloud Storage file system: You must set the core:fs.defaultFS property to a Cloud Storage bucket so that the zero-scale cluster uses Cloud Storage as its file system instead of the default HDFS.
    • Add the core:fs.defaultFS property with the value gs://BUCKET_NAME to SoftwareConfig.properties. Replace BUCKET_NAME with the name of your Cloud Storage bucket. Specify a unique bucket name for each zero-scale cluster.
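
The settings above might be combined into a request body as follows. This is a hedged sketch: the JSON field names clusterType, autoscalingConfig.policyUri, and softwareConfig.properties are assumed to be the JSON forms of the fields named in the list, and the project, region, cluster, bucket, and policy values are hypothetical placeholders. The script prints the body only; the commented curl call shows one way the body could be sent to the clusters endpoint.

```shell
#!/bin/sh
# Sketch only: every value below is a hypothetical placeholder.
PROJECT_ID="${PROJECT_ID:-example-project}"
REGION="${REGION:-us-central1}"
CLUSTER_NAME="${CLUSTER_NAME:-example-zero-scale}"
BUCKET_NAME="${BUCKET_NAME:-example-zero-scale-bucket}"
# Optional. Leave empty to create the cluster without an autoscaling policy.
POLICY_URI="${POLICY_URI:-}"

# Print a JSON request body for a zero-scale cluster.
# The autoscalingConfig block is included only when POLICY_URI is set.
build_request_body() {
  if [ -n "$POLICY_URI" ]; then
    autoscaling="\"autoscalingConfig\": { \"policyUri\": \"$POLICY_URI\" },"
  else
    autoscaling=""
  fi
  cat <<EOF
{
  "clusterName": "$CLUSTER_NAME",
  "config": {
    "clusterType": "ZERO_SCALE",
    $autoscaling
    "softwareConfig": {
      "properties": {
        "core:fs.defaultFS": "gs://$BUCKET_NAME"
      }
    }
  }
}
EOF
}

build_request_body

# To send the request, uncomment the following lines:
# build_request_body | curl -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d @- \
#   "https://dataproc.googleapis.com/v1/projects/$PROJECT_ID/regions/$REGION/clusters"
```

For example, setting POLICY_URI to projects/example-project/locations/us-central1/autoscalingPolicies/example-zero-scale-policy (a hypothetical value) before running the script adds the autoscalingConfig block to the printed body.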

What's next