You can use Dynamic Workload Scheduler (DWS) Flex Start mode to improve the obtainability of scarce GPU resources for your Managed Service for Apache Spark batch workloads.
Overview
Dynamic Workload Scheduler is a capacity-aware scheduler that manages scarce compute resources globally. In Flex Start mode, Managed Service for Apache Spark can queue your request for a configurable duration when GPUs are not immediately available due to regional stockouts.
Flex Start is the default behavior for all GPU-enabled workloads submitted to
Managed Service for Apache Spark on Spark runtime version 3.0+. When resources are
unavailable, the request is queued instead of failing with a stockout error.
Once capacity is obtained, DWS provisions the entire cluster "all-at-once" before
execution begins. For more information, see Introducing Dynamic Workload
Scheduler.
Benefits
- Increased availability: Substantial reduction in job failures caused by transient GPU capacity shortages.
- Atomic provisioning: Provisioning of cluster workers as a single unit, ensuring cluster integrity and preventing scenarios where only a partial set of workers is created.
- Default reliability: Improved resource acquisition without manual parameter configuration.
Quota requirements
To use DWS Flex Start, your project must have sufficient quota:
- Preemptible GPU quota: DWS Flex Start consumes the preemptible
version of the GPU quota pool (for example, PREEMPTIBLE_NVIDIA_L4_GPUS)
rather than the standard pool. For more information, see Managed Service for Apache Spark resource quotas.
- Local SSD quota: When using the performance storage class with GPU-accelerated machines, Managed Service for Apache Spark provisions Local SSDs for high-speed shuffle and temporary storage. Your project must have sufficient Local SSD quota in the target region.
Configuration
DWS Flex Start is enabled by default for GPU workloads on runtime version 3.0+.
You can use the following Spark properties to customize the timeout or disable
the feature:
| Property | Description | Values | Default |
|---|---|---|---|
| spark.dataproc.[driver\|executor].provisioning.mode | Specifies the provisioning model. Use "queue" for DWS Flex Start (the default for GPUs) or "default" to disable DWS and use on-demand provisioning. | queue, default | queue (for GPUs on 3.0+) |
| spark.dataproc.[driver\|executor].provisioning.allocationTimeout | The maximum duration for the node pool to wait in the queue for capacity. The default is 1 hour (3600s) and the maximum is 2 hours (7200s). Note: Values must end with an 's' suffix. | Duration in seconds (for example, 1800s) | 3600s |
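The allocationTimeout rules in the table (an 's' suffix and a 7200s maximum) can be checked locally before a batch is submitted. The following is a minimal sketch in POSIX shell, not part of gcloud: validate_allocation_timeout is a hypothetical helper name, and the whole-seconds requirement is an assumption, since the table states only the suffix and the maximum. No lower bound is documented, so none is enforced.

```shell
#!/bin/sh
# Hypothetical helper (not a gcloud feature): checks a value intended for
# spark.dataproc.[driver|executor].provisioning.allocationTimeout.
validate_allocation_timeout() {
  at_value="$1"
  # Strip one trailing 's'; if nothing changed, the suffix was missing.
  at_seconds="${at_value%s}"
  if [ "$at_seconds" = "$at_value" ]; then
    echo "invalid: missing 's' suffix"
    return 1
  fi
  # Assumption: only whole seconds are accepted (digits only, not empty).
  case "$at_seconds" in
    ''|*[!0-9]*)
      echo "invalid: not a whole number of seconds"
      return 1
      ;;
  esac
  # The documented maximum is 2 hours (7200s).
  if [ "$at_seconds" -gt 7200 ]; then
    echo "invalid: exceeds the 7200s maximum"
    return 1
  fi
  echo "ok"
}

validate_allocation_timeout 1800s          # prints: ok
validate_allocation_timeout 30m || true    # prints: invalid: missing 's' suffix
```

A check like this only catches formatting mistakes early; the service remains the authority on which values it accepts.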
Example: GPU batch workload with DWS Flex Start
The following example submits a PySpark batch job using NVIDIA L4 GPUs. With DWS
Flex Start active by default, the command sets a 30-minute (1800s) queuing
timeout for both the driver and executors:
gcloud dataproc batches submit pyspark \
gs://my-bucket/path/to/your-script.py \
--project="PROJECT_ID" \
--region="REGION" \
--version="3.0" \
--properties="spark.dataproc.driver.resource.accelerator.type=l4,\
spark.dataproc.driver.provisioning.allocationTimeout=1800s,\
spark.dataproc.executor.resource.accelerator.type=l4,\
spark.dataproc.executor.provisioning.allocationTimeout=1800s,\
spark.dataproc.executor.compute.tier=premium,\
spark.dataproc.executor.disk.tier=premium"
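
To opt out of DWS Flex Start, the Configuration table indicates that provisioning.mode can be set to default for the driver and executors. The sketch below only assembles the corresponding --properties value; on_demand_gpu_properties is a hypothetical helper name, and the accelerator type is carried over from the example above rather than being a requirement.

```shell
#!/bin/sh
# Hypothetical helper: prints a --properties value that requests GPUs for the
# driver and executors but sets provisioning.mode=default, which the
# Configuration table describes as disabling DWS and using on-demand provisioning.
on_demand_gpu_properties() {
  odp_accelerator="$1"  # for example: l4
  printf '%s,%s,%s,%s\n' \
    "spark.dataproc.driver.resource.accelerator.type=${odp_accelerator}" \
    "spark.dataproc.driver.provisioning.mode=default" \
    "spark.dataproc.executor.resource.accelerator.type=${odp_accelerator}" \
    "spark.dataproc.executor.provisioning.mode=default"
}

# Print the value so it can be inspected before anything is submitted.
on_demand_gpu_properties l4
```

The printed value can be passed to the submit command as --properties="$(on_demand_gpu_properties l4)". With queuing disabled, a request made while GPUs are unavailable is expected to fail with a stockout error instead of waiting for capacity.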