Use Dynamic Workload Scheduler

You can use Dynamic Workload Scheduler (DWS) Flex Start mode to improve the obtainability of scarce GPU resources for your Managed Service for Apache Spark batch workloads.

Overview

Dynamic Workload Scheduler is a capacity-aware scheduler that manages scarce compute resources globally. In Flex Start mode, Managed Service for Apache Spark can queue your request for a configurable duration when GPUs are not immediately available due to regional stockouts.

Flex Start is the default behavior for all GPU-enabled workloads submitted to Managed Service for Apache Spark runtime version 3.0+. When resources are unavailable, the request is queued instead of failing with a stockout error. Once capacity is obtained, DWS provisions the entire cluster "all-at-once" before execution begins. For more information, see Introducing Dynamic Workload Scheduler.

Benefits

  • Increased availability: Substantial reduction in job failures caused by transient GPU capacity shortages.
  • Atomic provisioning: Provisioning of cluster workers as a single unit, ensuring cluster integrity and preventing scenarios where only a partial set of workers is created.
  • Default reliability: Improved resource acquisition without manual parameter configuration.

Quota requirements

To use DWS Flex Start, your project must have sufficient quota:

  • Preemptible GPU quota: DWS Flex Start draws from the preemptible GPU quota pool (for example, PREEMPTIBLE_NVIDIA_L4_GPUS) rather than the standard GPU quota pool. For more information, see Managed Service for Apache Spark resource quotas.
  • Local SSD quota: When using the performance storage class with GPU-accelerated machines, Managed Service for Apache Spark provisions Local SSDs for high-speed shuffle and temporary storage. Your project must have sufficient local SSD quota in the target region.
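
Before submitting a workload, it can help to confirm that the preemptible GPU quota has headroom. The following is a minimal sketch, assuming the quota is surfaced as a Compute Engine regional quota and that the quota listing is piped in as whitespace-separated rows of metric, limit, and usage. The helper name quota_row is hypothetical, and REGION is a placeholder.

```shell
# Hypothetical helper: print the limit and usage for one quota metric.
# Reads whitespace-separated rows of "METRIC LIMIT USAGE" from stdin.
quota_row() {
  awk -v m="$1" '$1 == m { print "limit=" $2, "usage=" $3 }'
}

# Usage (assumption: requires gcloud and access to the project):
# gcloud compute regions describe REGION \
#     --flatten="quotas[]" \
#     --format="value(quotas.metric,quotas.limit,quotas.usage)" \
#     | quota_row PREEMPTIBLE_NVIDIA_L4_GPUS
```

If the helper prints nothing, the metric was not present in the listing, which may mean the quota is tracked elsewhere for your project.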

Configuration

DWS Flex Start is enabled by default for GPU workloads on runtime version 3.0+. You can use the following Spark properties to customize the timeout or disable the feature:

Property: spark.dataproc.[driver|executor].provisioning.mode
Description: Specifies the provisioning model. Use queue for DWS Flex Start (default for GPUs) or default to disable DWS and use on-demand provisioning.
Values: queue, default
Default: queue (for GPUs on 3.0+)

Property: spark.dataproc.[driver|executor].provisioning.allocationTimeout
Description: The maximum duration for the node pool to wait in the queue for capacity. The default is 1 hour (3600s) and the maximum is 2 hours (7200s). Note: Values must end with an 's' suffix.
Values: duration in seconds (for example, 1800s)
Default: 3600s
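
The allocationTimeout constraints described above (a value in seconds with an 's' suffix, at most 7200s) can be checked locally before a job is submitted. The following is a minimal sketch of such a check. The function name validate_allocation_timeout is hypothetical and is not part of gcloud or the service.

```shell
# Hypothetical helper: validate an allocationTimeout value such as "1800s".
# Returns 0 if the value is acceptable, 1 otherwise.
validate_allocation_timeout() {
  vat_value="$1"
  # The value must end with a literal 's'.
  case "$vat_value" in
    *s) ;;
    *) echo "invalid: '$vat_value' must end with 's'" >&2; return 1 ;;
  esac
  # Everything before the suffix must be a non-empty run of digits.
  vat_seconds="${vat_value%s}"
  case "$vat_seconds" in
    ''|*[!0-9]*) echo "invalid: '$vat_value' is not a whole number of seconds" >&2; return 1 ;;
  esac
  # The documented maximum is 2 hours (7200s).
  if [ "$vat_seconds" -gt 7200 ]; then
    echo "invalid: '$vat_value' exceeds the 7200s maximum" >&2
    return 1
  fi
  return 0
}
```

For example, validate_allocation_timeout 1800s succeeds, while 1800 (no suffix), 30m, and 7201s are rejected. The sketch does not enforce a minimum because none is stated here.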

Example: GPU batch workload with DWS Flex Start

The following example submits a PySpark batch job using NVIDIA L4 GPUs. With DWS Flex Start active by default, the command sets a 30-minute (1800s) queuing timeout for both the driver and executors:

gcloud dataproc batches submit pyspark \
    gs://my-bucket/path/to/your-script.py \
    --project="PROJECT_ID" \
    --region="REGION" \
    --version="3.0" \
    --properties="spark.dataproc.driver.resource.accelerator.type=l4,\
spark.dataproc.driver.provisioning.allocationTimeout=1800s,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.executor.provisioning.allocationTimeout=1800s,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.disk.tier=premium"
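
To opt out of queuing, the provisioning.mode property from the Configuration section can be set to default for both the driver and executors. The following is a minimal sketch that builds the provisioning portion of the --properties string. The function name provisioning_properties is hypothetical, and the commented gcloud call reuses the placeholders from the example above.

```shell
# Hypothetical helper: emit provisioning properties for driver and executor.
# $1 = provisioning mode (queue or default)
# $2 = optional allocationTimeout (for example, 1800s)
provisioning_properties() {
  pp_mode="$1"
  pp_timeout="${2:-}"
  pp_out=""
  for pp_role in driver executor; do
    # Append the mode property, adding a comma only between entries.
    pp_out="${pp_out:+$pp_out,}spark.dataproc.${pp_role}.provisioning.mode=${pp_mode}"
    if [ -n "$pp_timeout" ]; then
      pp_out="${pp_out},spark.dataproc.${pp_role}.provisioning.allocationTimeout=${pp_timeout}"
    fi
  done
  printf '%s\n' "$pp_out"
}

# Usage: disable DWS Flex Start and use on-demand provisioning.
# gcloud dataproc batches submit pyspark \
#     gs://my-bucket/path/to/your-script.py \
#     --project="PROJECT_ID" \
#     --region="REGION" \
#     --version="3.0" \
#     --properties="spark.dataproc.driver.resource.accelerator.type=l4,spark.dataproc.executor.resource.accelerator.type=l4,$(provisioning_properties default)"
```

With on-demand provisioning, a request made while GPUs are unavailable is expected to fail with a stockout error rather than wait in the queue.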