Attach hyperdisks to Managed Service for Apache Spark VMs

This page shows you how to attach hyperdisks to virtual machines (VMs) in a Managed Service for Apache Spark cluster. You can configure the disks independently for master, primary worker, and secondary worker node groups. These disks are attached to cluster nodes in addition to the boot disk and any local SSDs attached to cluster nodes.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that you have the permissions required to complete this guide.

  4. Verify that billing is enabled for your Google Cloud project.

  5. Enable the Managed Service for Apache Spark API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API


Required roles

Certain IAM roles are required to run the examples on this page. Depending on your organization's policies, these roles might already be granted to you. To check your role grants, see Do you need to grant roles?.

For more information about granting roles, see Manage access to projects, folders, and organizations.

User roles

To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the required IAM roles on the project.

Service account role

To ensure that the Compute Engine default service account has the necessary permissions to create a Managed Service for Apache Spark cluster, ask your administrator to grant the Dataproc Worker (roles/dataproc.worker) IAM role to the Compute Engine default service account on the project.
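As a sketch, the role can be granted to the default service account with the gcloud CLI. The project number below is a placeholder; substitute your project's numeric ID and project ID.

```shell
# The Compute Engine default service account is named
# PROJECT_NUMBER-compute@developer.gserviceaccount.com.
PROJECT_NUMBER=123456789012   # placeholder; use your project's numeric ID
MEMBER="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
echo "${MEMBER}"
# Grant the Dataproc Worker role on the project (requires IAM admin rights):
# gcloud projects add-iam-policy-binding PROJECT_ID \
#     --member="${MEMBER}" --role="roles/dataproc.worker"
```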

Disk characteristics

Attached disks have the following characteristics:

  • Lifecycle: The lifetime of an attached disk matches the lifetime of the VM that it is attached to. Managed Service for Apache Spark creates the disk when it creates the VM and deletes the disk when it deletes the VM.
  • Immutability: You cannot update the properties of an attached disk, such as size, IOPS, or throughput, after the cluster is created.
  • Mounting and usage: Managed Service for Apache Spark mounts disks at /mnt/N, where N is a positive integer (for example, /mnt/1, /mnt/2). HDFS and scratch data, such as shuffle outputs, use the attached disks instead of the boot persistent disk.
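The mount convention above can be checked from a cluster node with `df -h`. The following sketch filters mount points that match the `/mnt/N` pattern; the mount table shown here is simulated for illustration.

```shell
# Simulated 'df' output for illustration; on a real cluster node, run: df -h
df_output='/dev/sda1      /
/dev/nvme0n1   /mnt/1
/dev/nvme0n2   /mnt/2'
# Attached disks appear as /mnt/1, /mnt/2, and so on.
mounts=$(echo "$df_output" | grep -Eo '/mnt/[0-9]+')
echo "$mounts"
```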

Disk configuration

You can specify the following disk configuration parameters when you attach disks to Managed Service for Apache Spark cluster nodes:

  • Disk type - required: The type of disk to attach to VM instances. The following hyperdisks are supported:

    • hyperdisk-balanced
    • hyperdisk-extreme
    • hyperdisk-ml
    • hyperdisk-throughput

    Hyperdisk Balanced High Availability volumes and persistent disks can't be attached to cluster nodes.

  • Size - optional: The size of the disk. The value must be a whole number followed by GB (gigabytes) or TB (terabytes). For example, 10GB attaches a 10-gigabyte disk. For more information, see Hyperdisk size limits.

  • IOPS - optional: The number of I/O operations per second (IOPS) to provision for the attached disk. This parameter sets the limit for disk I/O operations per second. For more information, see Default performance levels.

  • Throughput - optional: The throughput, in MiB per second, to provision for the attached disk. This parameter sets the disk's throughput limit. For more information, see Default performance levels.

Attach disks to a cluster

You can attach disks when you create a Managed Service for Apache Spark cluster by specifying disk configurations with the gcloud CLI or the Dataproc API.

gcloud CLI

  • To attach disks when you create a cluster, use the --master-attached-disks, --worker-attached-disks, or --secondary-worker-attached-disks flag with the gcloud dataproc clusters create command.

  • Each flag accepts a list of disk configurations separated by semicolons. Each disk configuration is a comma-separated list of key-value pairs for type, size, iops, and throughput (see Disk configuration).

Example: The following command creates a cluster and attaches two hyperdisks to each primary worker node.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --worker-attached-disks='type=hyperdisk-balanced,size=100GB,iops=5000,throughput=200;type=hyperdisk-throughput,size=9000GB'
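The semicolon-and-comma flag syntax can also be assembled in a script, which keeps long disk specs readable. This is a sketch; the variable names are placeholders.

```shell
# Assemble the attached-disks flag value from individual disk specs.
# Each spec is a comma-separated list of key=value pairs;
# multiple specs are joined with ';'.
DISK1='type=hyperdisk-balanced,size=100GB,iops=5000,throughput=200'
DISK2='type=hyperdisk-throughput,size=9000GB'
WORKER_DISKS="${DISK1};${DISK2}"
echo "${WORKER_DISKS}"
# Pass the value at cluster creation (CLUSTER_NAME and REGION are placeholders):
# gcloud dataproc clusters create CLUSTER_NAME --region=REGION \
#     --worker-attached-disks="${WORKER_DISKS}"
```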

API

  • To attach disks, include an attachedDiskConfigs array in the diskConfig object for the masterConfig, workerConfig, or secondaryWorkerConfig instance group.

  • Provide the configuration in the body of a clusters.create API request.

Example: The following JSON snippet shows an attachedDiskConfigs array that attaches two hyperdisks:

[
  {
    "diskType": "HYPERDISK_BALANCED",
    "diskSizeGb": 100,
    "provisionedIops": 5000,
    "provisionedThroughput": 200
  },
  {
    "diskType": "HYPERDISK_THROUGHPUT",
    "diskSizeGb": 9000
  }
]
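As a sketch of where the array sits in a full clusters.create request body, the attachedDiskConfigs array nests inside the instance group's diskConfig object. The project and cluster names below are placeholders.

```json
{
  "projectId": "my-project",
  "clusterName": "my-cluster",
  "config": {
    "workerConfig": {
      "numInstances": 2,
      "diskConfig": {
        "attachedDiskConfigs": [
          {
            "diskType": "HYPERDISK_BALANCED",
            "diskSizeGb": 100,
            "provisionedIops": 5000,
            "provisionedThroughput": 200
          },
          {
            "diskType": "HYPERDISK_THROUGHPUT",
            "diskSizeGb": 9000
          }
        ]
      }
    }
  }
}
```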