This page shows you how to attach hyperdisks to virtual machines (VMs) in a Managed Service for Apache Spark cluster. You can configure the disks independently for master, primary worker, and secondary worker node groups. These disks are attached to cluster nodes in addition to the boot disk and any local SSDs attached to cluster nodes.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that you have the permissions required to complete this guide.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Managed Service for Apache Spark API.
  Roles required to enable APIs: To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Required roles
Certain IAM roles are required to run the examples on this page. Depending on organization policies, these roles may have already been granted. To check role grants, see Do you need to grant roles?.
For more information about granting roles, see Manage access to projects, folders, and organizations.
User roles
To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the following IAM roles:
- Dataproc Editor (roles/dataproc.editor) on the project
- Service Account User (roles/iam.serviceAccountUser) on the Compute Engine default service account
Service account role
To ensure that the Compute Engine default service account has the necessary
permissions to create a Managed Service for Apache Spark cluster,
ask your administrator to grant the
Dataproc Worker (roles/dataproc.worker)
IAM role to the Compute Engine default service account on the project.
Disk characteristics
Attached disks have the following characteristics:
- Lifecycle: The lifetime of an attached disk matches the lifetime of the VM that it is attached to. Managed Service for Apache Spark creates the disk when it creates the VM and deletes the disk when it deletes the VM.
- Immutability: You cannot update the properties of an attached disk, such as size, IOPS, or throughput, after the cluster is created.
- Mounting and usage: Managed Service for Apache Spark mounts disks at /mnt/N, where N is a positive integer (for example, /mnt/1, /mnt/2). HDFS and scratch data, such as shuffle outputs, use the attached disks instead of the boot persistent disk.
Disk configuration
You can specify the following disk configuration parameters when you attach disks to Managed Service for Apache Spark cluster nodes:
Disk type - required: The type of disk to attach to VM instances. The following hyperdisks are supported:
- hyperdisk-balanced
- hyperdisk-extreme
- hyperdisk-ml
- hyperdisk-throughput
The hyperdisk-balanced-high-availability type and persistent disks can't be attached to cluster nodes.
Size - optional: The size of the disk. The value must be a whole number followed by GB for gigabytes or TB for terabytes. For example, 10GB attaches a 10 gigabyte disk. For more information, see Hyperdisk size limits.
IOPS - optional: The IOPS to provision for the attached disk. This parameter sets the limit for disk I/O operations per second. For more information, see Default performance levels.
Throughput - optional: The throughput to provision for the attached disk. This parameter sets the limit for throughput in MiB per second. For more information, see Default performance levels.
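As an illustration of the accepted size format, the following Python helper is a hypothetical sketch (not part of any Google Cloud SDK) that validates a size string and converts it to gigabytes, assuming a decimal terabyte (1 TB = 1,000 GB):

```python
import re

def parse_disk_size(size):
    """Parse a disk size string such as '10GB' or '2TB' into gigabytes.

    The accepted format is a whole number immediately followed by
    GB or TB, matching the Size parameter described above.
    This helper is illustrative only.
    """
    match = re.fullmatch(r"(\d+)(GB|TB)", size)
    if not match:
        raise ValueError(f"invalid disk size: {size!r}")
    number, unit = int(match.group(1)), match.group(2)
    # Assume a decimal terabyte: 1 TB = 1,000 GB.
    return number * 1000 if unit == "TB" else number

print(parse_disk_size("10GB"))  # 10
print(parse_disk_size("2TB"))   # 2000
```

Note that a value like "10 GB" (with a space) or a bare "10" would be rejected, because the format requires the number and unit to be run together.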
Attach disks to a cluster
You can attach disks when you create a Managed Service for Apache Spark cluster by specifying disk configurations with the gcloud CLI or the Dataproc API.
gcloud CLI
To attach disks when you create a cluster, use the --master-attached-disks, --worker-attached-disks, or --secondary-worker-attached-disks flag with the gcloud dataproc clusters create command.
Each flag accepts a list of disk configurations separated by semicolons. Each disk configuration is a comma-separated list of key-value pairs for type, size, iops, and throughput (see Disk configuration).
Example: The following command creates a cluster and attaches two hyperdisks to each primary worker node.
gcloud dataproc clusters create CLUSTER_NAME \
--region=REGION \
--worker-attached-disks='type=hyperdisk-balanced,size=100GB,iops=5000,throughput=200;type=hyperdisk-throughput,size=9000GB'
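To make the flag syntax concrete, here is a small hypothetical Python helper (the function name is illustrative, not part of any SDK) that serializes a list of disk configurations into the semicolon- and comma-separated flag value described above:

```python
def format_attached_disks(disks):
    """Serialize disk configurations into the attached-disks flag value.

    Each disk is a dict with a required 'type' key and optional
    'size', 'iops', and 'throughput' keys. Disks are joined with
    semicolons; key-value pairs within a disk with commas.
    Illustrative helper only.
    """
    parts = []
    for disk in disks:
        pairs = [f"{key}={disk[key]}"
                 for key in ("type", "size", "iops", "throughput")
                 if key in disk]
        parts.append(",".join(pairs))
    return ";".join(parts)

flag_value = format_attached_disks([
    {"type": "hyperdisk-balanced", "size": "100GB",
     "iops": 5000, "throughput": 200},
    {"type": "hyperdisk-throughput", "size": "9000GB"},
])
print(flag_value)
# type=hyperdisk-balanced,size=100GB,iops=5000,throughput=200;type=hyperdisk-throughput,size=9000GB
```

The output matches the value passed to --worker-attached-disks in the preceding example.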
API
To attach disks, include an attachedDiskConfigs array in the diskConfig object for the masterConfig, workerConfig, or secondaryWorkerConfig instance group. Provide the configuration in the body of a clusters.create API request.
Example: The following JSON snippet shows an attachedDiskConfigs array
that attaches two hyperdisks:
[
{
"diskType": "HYPERDISK_BALANCED",
"diskSizeGb": 100,
"provisionedIops": 5000,
"provisionedThroughput": 200
},
{
"diskType": "HYPERDISK_THROUGHPUT",
"diskSizeGb": 9000
}
]
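For context, the following fragment sketches where the array might sit in a clusters.create request body when the disks are attached to primary workers, per the nesting described above (the cluster name and surrounding fields are placeholders; confirm the exact shape against the Dataproc API reference):

```json
{
  "clusterName": "CLUSTER_NAME",
  "config": {
    "workerConfig": {
      "diskConfig": {
        "attachedDiskConfigs": [
          {
            "diskType": "HYPERDISK_BALANCED",
            "diskSizeGb": 100,
            "provisionedIops": 5000,
            "provisionedThroughput": 200
          },
          {
            "diskType": "HYPERDISK_THROUGHPUT",
            "diskSizeGb": 9000
          }
        ]
      }
    }
  }
}
```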