When you enable Dataproc cluster caching, the cluster caches Cloud Storage data frequently accessed by your Spark jobs.
Benefits
- Improved performance: Caching can improve job performance by reducing the amount of time spent retrieving data from storage.
- Reduced storage costs: Since hot data is cached on local disk, fewer API calls are made to storage to retrieve data.
- Spark job applicability: When cluster caching is enabled on a cluster, it applies to all Spark jobs run on the cluster, whether submitted to the Dataproc service or run independently on the cluster.
Limitations and requirements
- Caching applies to Dataproc Spark jobs only.
- Only Cloud Storage data is cached.
- Caching only applies to clusters that meet the following requirements:
  - The cluster has one master and n workers (High Availability (HA) and single node clusters are not supported).
  - This feature is available in Dataproc on Compute Engine image versions 2.0.72+, 2.1.20+, and 2.2.0+.
  - Each cluster node must have local SSDs attached with the NVME (Non-Volatile Memory Express) interface (Persistent Disks (PDs) are not supported). Data is cached on NVME local SSDs only.
  - The cluster uses the default VM service account for authentication. Custom VM service accounts are not supported.
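The requirements above can be checked before a cluster is created. The following is a minimal sketch, not part of Dataproc: `ClusterSpec` and `caching_problems` are hypothetical names introduced here for illustration, and the version thresholds are copied from the list above. It conservatively reports any image track other than 2.0, 2.1, and 2.2 as not listed.

```python
import re
from dataclasses import dataclass

# Minimum sub-minor version per image track, copied from the requirements list.
MIN_PATCH_BY_TRACK = {(2, 0): 72, (2, 1): 20, (2, 2): 0}


@dataclass
class ClusterSpec:
    """Hypothetical summary of a planned cluster (not a Dataproc API type)."""
    image_version: str                  # for example "2.1.25-debian11"
    num_masters: int
    num_workers: int
    master_local_ssds: int
    worker_local_ssds: int
    local_ssd_interface: str            # "NVME" or "SCSI"
    uses_default_service_account: bool


def caching_problems(spec: ClusterSpec) -> list[str]:
    """Return the documented requirements that the planned cluster violates."""
    problems = []

    # Image version: must be on a listed track at or above its minimum.
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", spec.image_version)
    if not match:
        problems.append("image version must be a full version such as 2.1.20")
    else:
        major, minor, patch = (int(part) for part in match.groups())
        minimum = MIN_PATCH_BY_TRACK.get((major, minor))
        if minimum is None:
            problems.append(f"image track {major}.{minor} is not in the documented list")
        elif patch < minimum:
            problems.append(f"image {major}.{minor}.{patch} is older than {major}.{minor}.{minimum}")

    # Topology: one master and at least one worker (no HA, no single node).
    if spec.num_masters != 1:
        problems.append("exactly one master is required (HA is not supported)")
    if spec.num_workers < 1:
        problems.append("at least one worker is required (single node is not supported)")

    # Storage: every node needs local SSDs attached with the NVME interface.
    if spec.master_local_ssds < 1 or spec.worker_local_ssds < 1:
        problems.append("every node needs at least one local SSD")
    if spec.local_ssd_interface.upper() != "NVME":
        problems.append("local SSDs must use the NVME interface")

    # Authentication: only the default VM service account is supported.
    if not spec.uses_default_service_account:
        problems.append("custom VM service accounts are not supported")

    return problems
```

An empty list means the planned cluster meets every requirement listed above; otherwise each entry names one requirement to fix before enabling caching.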
Enable cluster caching
You can enable cluster caching when you create a Dataproc cluster using the Google Cloud console, Google Cloud CLI, or the Dataproc API.
Google Cloud console
- Open the Dataproc Create a cluster on Compute Engine page in the Google Cloud console.
- The Set up cluster panel is selected. In the Spark performance enhancements section, select Enable Google Cloud Storage caching.
- After confirming and specifying cluster details in the cluster create panels, click Create.
gcloud CLI
Run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell using the dataproc:dataproc.cluster.caching.enabled=true cluster property.
Example:
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --properties dataproc:dataproc.cluster.caching.enabled=true \
    --num-master-local-ssds=2 \
    --master-local-ssd-interface=NVME \
    --num-worker-local-ssds=2 \
    --worker-local-ssd-interface=NVME \
    other args ...
REST API
Set SoftwareConfig.properties to include the "dataproc:dataproc.cluster.caching.enabled": "true" cluster property as part of a clusters.create request.
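As an illustration, the request body for clusters.create might be assembled as in the sketch below. The helper `build_create_cluster_body` is a hypothetical name, and the worker count and SSD count are placeholder values mirroring the gcloud example. The JSON field names (`masterConfig`, `workerConfig`, `diskConfig`, `numLocalSsds`, `localSsdInterface`) reflect my understanding of the Dataproc v1 Cluster resource and should be checked against the API reference before use.

```python
import json


def build_create_cluster_body(project_id, cluster_name, num_workers=2, local_ssds_per_node=2):
    """Build a clusters.create request body with cluster caching enabled.

    Field names are assumed from the Dataproc v1 Cluster resource.
    """
    # Caching requires NVME local SSDs on every node.
    disk_config = {"numLocalSsds": local_ssds_per_node, "localSsdInterface": "nvme"}
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            # One master and n workers; HA and single node are not supported.
            "masterConfig": {"numInstances": 1, "diskConfig": dict(disk_config)},
            "workerConfig": {"numInstances": num_workers, "diskConfig": dict(disk_config)},
            "softwareConfig": {
                # The cluster property that turns caching on.
                "properties": {"dataproc:dataproc.cluster.caching.enabled": "true"}
            },
        },
    }


# Print the body so it can be pasted into a REST client.
print(json.dumps(build_create_cluster_body("PROJECT_ID", "CLUSTER_NAME"), indent=2))
```

The property value is the string "true" rather than a JSON boolean, matching the form shown above.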