Create a Managed Service for Apache Spark zero-scale cluster

This document describes how to create a Managed Service for Apache Spark zero-scale cluster.

Managed Service for Apache Spark zero-scale clusters provide a cost-effective way to use Managed Service for Apache Spark clusters. Unlike standard Managed Service for Apache Spark clusters that require at least two primary workers, Managed Service for Apache Spark zero-scale clusters use only secondary workers that can be scaled down to zero.

Managed Service for Apache Spark zero-scale clusters are ideal for use as long-running clusters that experience idle periods, such as a cluster that hosts a Jupyter notebook. They provide improved resource utilization through the use of zero-scale autoscaling policies.

Characteristics and limitations

A Managed Service for Apache Spark zero-scale cluster shares similarities with a standard cluster, but has the following unique characteristics and limitations:

  • Requires image version 2.2.53 or later.
  • Supports only secondary workers, not primary workers.
  • Includes services such as YARN, but doesn't support the HDFS file system.
    • To use Cloud Storage as the default file system, set the core:fs.defaultFS cluster property to a Cloud Storage bucket location (gs://BUCKET_NAME).
    • If you disable a component during cluster creation, also disable HDFS.
  • Can't be converted to or from a standard cluster.
  • Requires an autoscaling policy for ZERO_SCALE cluster types.
  • Requires selecting flexible VMs as the machine type.
  • Doesn't support the Oozie component.
  • Can't be created from the Google Cloud console.

Optional: Configure an autoscaling policy

You can configure an autoscaling policy to define secondary worker scaling for a zero-scale cluster. When doing so, note the following:

  • Set the cluster type to ZERO_SCALE.
  • Apply the autoscaling policy to the secondary worker config only.

For more information, see Create an autoscaling policy.
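As a rough illustration, the two points above could translate into a policy file like the one this script writes. The file name, instance limits, and timing values are placeholder assumptions, and the clusterType field name is assumed from the ZERO_SCALE setting described above, so check the autoscaling policy reference before relying on it. The script only writes the file and prints an import command; it does not call gcloud.

```shell
#!/bin/sh
set -eu

# Hypothetical file name; the values below are placeholders, not recommendations.
POLICY_FILE="zero-scale-policy.yaml"

# Write a policy that scales only secondary workers, down to zero.
# clusterType is an assumed field name based on the ZERO_SCALE setting above.
# No workerConfig block is included because zero-scale clusters have no primary workers.
cat > "${POLICY_FILE}" <<'EOF'
clusterType: ZERO_SCALE
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 10
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

# Print the import command instead of running it, so the sketch has no side effects.
# POLICY_ID and REGION are left as placeholders to fill in.
echo "gcloud dataproc autoscaling-policies import POLICY_ID --source=${POLICY_FILE} --region=REGION"
```

Setting minInstances to 0 in the secondary worker config is what lets the cluster shrink to no workers during idle periods.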

Create a Managed Service for Apache Spark zero-scale cluster

Create a zero-scale cluster using the gcloud CLI or the Dataproc API.

gcloud

Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --cluster-type=zero-scale \
    --autoscaling-policy=AUTOSCALING_POLICY \
    --properties=core:fs.defaultFS=gs://BUCKET_NAME \
    --secondary-worker-machine-types="type=MACHINE_TYPE1[,type=MACHINE_TYPE2...][,rank=RANK]" \
    ...other args

Replace the following:

  • CLUSTER_NAME: name of the Managed Service for Apache Spark zero-scale cluster.
  • REGION: an available Compute Engine region.
  • AUTOSCALING_POLICY: the ID or resource URI of the autoscaling policy.
  • BUCKET_NAME: name of your Cloud Storage bucket.
  • MACHINE_TYPE1, MACHINE_TYPE2: specific Compute Engine machine types, such as n1-standard-4 or e2-standard-8.
  • RANK: defines the priority of a list of machine types.
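For illustration, the following sketch fills in the placeholders above with hypothetical values and prints the resulting command rather than running it. The cluster name, region, policy ID, bucket name, and rank value are all assumptions chosen for the example.

```shell
#!/bin/sh
set -eu

# Hypothetical values for illustration only.
CLUSTER_NAME="notebook-zero-scale"
REGION="us-central1"
AUTOSCALING_POLICY="zero-scale-policy"
BUCKET_NAME="example-bucket"

# Two candidate machine types sharing one rank, following the flag syntax above.
MACHINE_TYPES="type=n1-standard-4,type=e2-standard-8,rank=0"

# Assemble the command as a single string so it can be reviewed before it is run.
CMD="gcloud dataproc clusters create ${CLUSTER_NAME} \
--region=${REGION} \
--cluster-type=zero-scale \
--autoscaling-policy=${AUTOSCALING_POLICY} \
--properties=core:fs.defaultFS=gs://${BUCKET_NAME} \
--secondary-worker-machine-types=${MACHINE_TYPES}"

# Dry run: print only. Copy the output into a terminal to create the cluster.
echo "${CMD}"
```

Printing the command first makes it easy to confirm that the default file system points at the intended bucket before any resources are created.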

REST

Create a zero-scale cluster using a Managed Service for Apache Spark REST API clusters.create request.
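A minimal sketch of what such a request might look like follows. The JSON field names are assumptions modeled on the gcloud flags in this document, and the project ID, region, cluster name, policy ID, and bucket name are hypothetical; verify every field against the clusters.create reference before use. The script writes the request body to a file and prints a curl command without sending anything.

```shell
#!/bin/sh
set -eu

# Hypothetical values for illustration only.
PROJECT_ID="example-project"
REGION="us-central1"
REQUEST_FILE="zero-scale-request.json"

# Request body sketch. Field names are assumptions; check the API reference.
cat > "${REQUEST_FILE}" <<EOF
{
  "clusterName": "notebook-zero-scale",
  "config": {
    "clusterType": "ZERO_SCALE",
    "autoscalingConfig": {
      "policyUri": "projects/${PROJECT_ID}/regions/${REGION}/autoscalingPolicies/zero-scale-policy"
    },
    "softwareConfig": {
      "properties": { "core:fs.defaultFS": "gs://example-bucket" }
    },
    "secondaryWorkerConfig": {
      "instanceFlexibilityPolicy": {
        "instanceSelectionList": [
          { "machineTypes": ["n1-standard-4", "e2-standard-8"], "rank": 0 }
        ]
      }
    }
  }
}
EOF

# Regional clusters endpoint of the Dataproc v1 API.
URL="https://dataproc.googleapis.com/v1/projects/${PROJECT_ID}/regions/${REGION}/clusters"

# Print the request instead of sending it.
echo "curl -X POST -H \"Authorization: Bearer \$(gcloud auth print-access-token)\" -H \"Content-Type: application/json\" -d @${REQUEST_FILE} \"${URL}\""
```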

What's next