"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Customer managed encryption keys (CMEK)

When you use Managed Service for Apache Spark, cluster and job data is stored on persistent disks associated with the Compute Engine VMs in your cluster and in a Cloud Storage staging bucket. By default, this persistent disk and bucket data is encrypted using a Google-generated data encryption key (DEK) and key encryption key (KEK).

If you want to control and manage the key encryption key (KEK), you can use customer-managed encryption keys (CMEK). Google continues to control the data encryption key (DEK). For more information on Google data encryption keys, see Encryption at Rest.

CMEK cluster data encryption

You can use customer-managed encryption keys (CMEK) to encrypt the following cluster data:

- Data on persistent disks attached to Managed Service for Apache Spark cluster VMs
- Job argument data submitted to your cluster, such as a query string submitted with a Spark SQL job
- Cluster metadata, job driver output, and other data written to your Managed Service for Apache Spark cluster staging bucket

Before you begin

Sign in to your Google Cloud account.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc, Cloud Key Management Service, Compute Engine, and Cloud Storage APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Install the Google Cloud CLI.

Note: If you installed the gcloud CLI previously, make sure you have the latest version by running gcloud components update.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

```
gcloud init
```

Create keys

To protect your Managed Service for Apache Spark resources with CMEK, you can automate the creation of keys or create keys manually.

Automated key creation

Use Autokey to automate CMEK provisioning and assignment. Autokey generates key rings and keys on demand when resources are created. Service agents use the keys in encrypt and decrypt operations. If needed, Autokey creates the agents and grants the required Identity and Access Management (IAM) roles to them. For more information, see Autokey overview.

Manual key creation

Follow these steps to manually create keys for CMEK encryption of cluster data:

Create one or more keys using Cloud KMS. The resource name of a key, also called its resource ID, which you use in the next steps, is constructed as follows:
```
projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
```
Use the Cryptographic Keys page in the Google Cloud console to copy a key resource ID to the clipboard.
The key (CMEK) must be located in the same location as the encrypted resource. For example, the CMEK used to encrypt a resource in the us-central1 region must also be located in the us-central1 region.
To ensure that the Compute Engine Service Agent, Cloud Storage Service Agent, and Managed Service for Apache Spark Service Agent service accounts have the necessary permissions to protect resources by using Cloud KMS keys, ask your administrator to grant the Cloud KMS CryptoKey Encrypter/Decrypter (roles/cloudkms.cryptoKeyEncrypterDecrypter) IAM role to each of these service accounts on your project.

Example assignment of the Cloud KMS CryptoKey Encrypter/Decrypter role to the Managed Service for Apache Spark Service Agent service account using the Google Cloud CLI:
```
gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
--member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter
```
Replace the following:

KMS_PROJECT_ID: the ID of your Google Cloud project that contains the Cloud KMS key.

PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Managed Service for Apache Spark resources.
If the Managed Service for Apache Spark Service Agent role is not attached to the Managed Service for Apache Spark Service Agent service account, add the serviceusage.services.use permission to a custom role attached to the Managed Service for Apache Spark Service Agent service account.
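
The key resource ID format and the same-location rule described in these steps can be checked before a cluster is created. The following is a minimal Python sketch, not part of the product tooling: the helper names `key_resource_id`, `key_location`, and `key_matches_region` are hypothetical, and the check implements only the rule as stated above (key location equals resource region).

```python
import re

# Pattern for the key resource ID format shown in the steps above.
_KEY_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/keyRings/(?P<key_ring>[^/]+)/cryptoKeys/(?P<key>[^/]+)$"
)


def key_resource_id(project_id: str, region: str, key_ring: str, key: str) -> str:
    """Build a key resource ID from its four components."""
    return (
        f"projects/{project_id}/locations/{region}"
        f"/keyRings/{key_ring}/cryptoKeys/{key}"
    )


def key_location(resource_id: str) -> str:
    """Return the location segment of a key resource ID, or raise ValueError."""
    match = _KEY_PATTERN.match(resource_id)
    if match is None:
        raise ValueError(f"not a key resource ID: {resource_id!r}")
    return match.group("location")


def key_matches_region(resource_id: str, resource_region: str) -> bool:
    """True when the key is in the same location as the resource it encrypts."""
    return key_location(resource_id) == resource_region


# Hypothetical project, key ring, and key names, for illustration only.
example_key = key_resource_id("my-project", "us-central1", "my-ring", "my-key")
print(example_key)
print(key_matches_region(example_key, "us-central1"))
```

A check like this catches a mismatched region early, before the cluster creation request is sent.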

Create a cluster with CMEK

Pass the resource ID of your key when you create the Managed Service for Apache Spark cluster.

gcloud CLI

To encrypt cluster persistent disk data using your key, pass the resource ID of your key to the --gce-pd-kms-key flag when you create the cluster.

```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --gce-pd-kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
    other arguments ...
```
You can verify the key setting from the gcloud command-line tool.
```
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION
```
Command output snippet:
```
...
configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
...
```

To encrypt cluster persistent disk data and job argument data using your key, pass the resource ID of the key to the --kms-key flag when you create the cluster. See [Cluster.EncryptionConfig.kmsKey](/managed-spark/docs/reference/rest/v1/ClusterConfig#EncryptionConfig.FIELDS.kms_key) for a list of job types and arguments that are encrypted with the `--kms-key` flag.
```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
    other arguments ...
```
You can verify key settings with the gcloud CLI dataproc clusters describe command. The key resource ID is set on gcePdKmsKeyName and kmsKey to use your key with the encryption of cluster persistent disk and job argument data.
```
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION
```
Command output snippet:
```
...
configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
...
```
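
The verification step can also be scripted. The following Python sketch assumes the describe command is run with the gcloud CLI `--format=json` flag and that `encryptionConfig` is nested under a top-level `config` object in that output; the function name `encryption_keys` and the sample payload are hypothetical.

```python
import json


def encryption_keys(describe_json: str) -> dict:
    """Extract the CMEK settings from JSON-formatted describe output."""
    cluster = json.loads(describe_json)
    # Assumption: encryptionConfig sits under the top-level "config" object.
    encryption = cluster.get("config", {}).get("encryptionConfig", {})
    return {
        "gcePdKmsKeyName": encryption.get("gcePdKmsKeyName"),
        "kmsKey": encryption.get("kmsKey"),
    }


# Hand-written sample payload, not captured from a real cluster.
SAMPLE = json.dumps(
    {
        "clusterName": "example-cluster",
        "config": {
            "encryptionConfig": {
                "gcePdKmsKeyName": "projects/p/locations/r/keyRings/kr/cryptoKeys/k",
                "kmsKey": "projects/p/locations/r/keyRings/kr/cryptoKeys/k",
            }
        },
    }
)

print(encryption_keys(SAMPLE))
```

A value of `None` for either entry means that field was not present in the output.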
You can use either the --gce-pd-kms-key or the --kms-key flag, but not both, to encrypt cluster data using your key.
To encrypt cluster metadata, job driver, and other output data written to your Managed Service for Apache Spark staging bucket in Cloud Storage:
- Create your own bucket with CMEK. When adding the key to the bucket, use a key that you created earlier (see Manual key creation).
- Pass the bucket name to the --bucket flag when you create the cluster.
```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --bucket=CMEK_BUCKET_NAME \
    other arguments ...
```
You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:
```
gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
    --region=REGION \
    --cluster=CLUSTER_NAME \
    -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
```
- Managed Service for Apache Spark doesn't manage customer-managed encryption keys on your Cloud Storage bucket.
- Using a bucket with a customer-managed encryption key can slow write times to large files.

REST API

To encrypt cluster VM persistent disk data using your key, include the ClusterConfig.EncryptionConfig.gcePdKmsKeyName field as part of a cluster.create request.
You can verify the key setting with the gcloud CLI dataproc clusters describe command.
```
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION
```
Command output snippet:
```
...
configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
...
```
To encrypt cluster VM persistent disk data and job argument data using your key, include the Cluster.EncryptionConfig.kmsKey field as part of a cluster.create request. See Cluster.EncryptionConfig.kmsKey for a list of job types and arguments that are encrypted with the kmsKey field.
You can include either the Cluster.EncryptionConfig.gcePdKmsKeyName field or the Cluster.EncryptionConfig.kmsKey field, but not both, with your cluster creation request.
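
As an illustration of the either-or rule, the following Python sketch builds a partial cluster.create request body that carries exactly one of the two fields. It assumes the body nests `encryptionConfig` under `config` and names the cluster with `clusterName`; the function `cluster_create_body` is hypothetical, and all other cluster settings are left to the caller.

```python
import json
from typing import Optional


def cluster_create_body(
    cluster_name: str,
    *,
    gce_pd_kms_key: Optional[str] = None,
    kms_key: Optional[str] = None,
) -> dict:
    """Build a partial cluster.create body that carries one CMEK field."""
    # The two fields are mutually exclusive, so require exactly one of them.
    if (gce_pd_kms_key is None) == (kms_key is None):
        raise ValueError("set exactly one of gce_pd_kms_key or kms_key")
    if gce_pd_kms_key is not None:
        encryption = {"gcePdKmsKeyName": gce_pd_kms_key}
    else:
        encryption = {"kmsKey": kms_key}
    # Assumption: other cluster fields are added by the caller.
    return {"clusterName": cluster_name, "config": {"encryptionConfig": encryption}}


# Placeholder key resource ID in the format used throughout this page.
KEY = "projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME"
print(json.dumps(cluster_create_body("example-cluster", kms_key=KEY), indent=2))
```

Rejecting the invalid combination on the client side gives a clearer error than waiting for the request to fail.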

You can verify key settings with the gcloud CLI dataproc clusters describe command. The key resource ID is set on gcePdKmsKeyName and kmsKey to use your key with the encryption of cluster persistent disk and job argument data.
```
gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION
```
Command output snippet:
```
...
configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
...
```
To encrypt cluster metadata, job driver, and other output data written to your Managed Service for Apache Spark staging bucket in Cloud Storage:
- Create your own bucket with CMEK. When adding the key to the bucket, use a key that you created earlier (see Manual key creation).
- Pass the bucket name to the ClusterConfig.configBucket field as part of a cluster.create request.
```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --bucket=CMEK_BUCKET_NAME \
    other arguments ...
```
- Managed Service for Apache Spark doesn't manage customer-managed encryption keys on your Cloud Storage bucket.
- Using a bucket with a customer-managed encryption key can slow write times to large files.
You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:
```
gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
    --region=REGION \
    --cluster=CLUSTER_NAME \
    -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
```
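
For the REST path, the staging bucket is set through ClusterConfig.configBucket rather than a command-line flag. The following Python sketch shows one way to add it to a request body; the helper `with_staging_bucket` is hypothetical, and the sketch assumes the field sits under `config` and takes a bucket name rather than a `gs://` URI.

```python
import copy


def with_staging_bucket(body: dict, bucket_name: str) -> dict:
    """Return a copy of a cluster.create body with configBucket set."""
    # Assumption: configBucket expects a bucket name, not a gs:// URI.
    if bucket_name.startswith("gs://"):
        raise ValueError("pass the bucket name without the gs:// prefix")
    updated = copy.deepcopy(body)
    updated.setdefault("config", {})["configBucket"] = bucket_name
    return updated


# Hypothetical cluster and bucket names, for illustration only.
base = {"clusterName": "example-cluster", "config": {}}
print(with_staging_bucket(base, "cmek-bucket"))
```

The helper copies the body instead of changing it in place, so the same base body can be reused for several requests.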

Use CMEK with workflow template data

Managed Service for Apache Spark workflow template job argument data, such as the query string of a Spark SQL job, can be encrypted using CMEK. Follow steps 1, 2, and 3 in this section to use CMEK with your Managed Service for Apache Spark workflow template. See WorkflowTemplate.EncryptionConfig.kmsKey for a list of workflow template job types and arguments that are encrypted using CMEK when this feature is enabled.

Create a key using Cloud KMS. The resource name of the key, which you use in the next steps, is constructed as follows:
```
projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
```
Use the Cryptographic Keys page of the Google Cloud console to copy a key resource ID to the clipboard.
To enable the Managed Service for Apache Spark service accounts to use your key:
1. Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Managed Service for Apache Spark Service Agent service account. You can use the gcloud CLI to assign the role:
```
 gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
 --member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
 --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```
  Replace the following:
  
  KMS_PROJECT_ID: the ID of your Google Cloud project that runs Cloud KMS. This project can also be the project that runs Managed Service for Apache Spark resources.
  
  PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Managed Service for Apache Spark resources.
2. Enable the Cloud KMS API on the project that runs Managed Service for Apache Spark resources.
3. If the Managed Service for Apache Spark Service Agent role is not attached to the Managed Service for Apache Spark Service Agent service account, then add the serviceusage.services.use permission to the custom role attached to the Managed Service for Apache Spark Service Agent service account. If the Managed Service for Apache Spark Service Agent role is attached to the Managed Service for Apache Spark Service Agent service account, you can skip this step.
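
The role assignment in this list depends on passing a project number, not a project ID, in the service account address. The following Python sketch assembles the same gcloud command as an argument list and rejects a non-numeric project number; the function name `grant_kms_role_command` and the sample values are hypothetical, and the sketch only builds the command without running it.

```python
import shlex
from typing import List

ROLE = "roles/cloudkms.cryptoKeyEncrypterDecrypter"


def grant_kms_role_command(kms_project_id: str, project_number: str) -> List[str]:
    """Build the add-iam-policy-binding command for the service agent."""
    # The service agent address uses the numeric project number, not the project ID.
    if not project_number.isdigit():
        raise ValueError("project_number must be numeric, not a project ID")
    member = (
        f"serviceAccount:service-{project_number}"
        "@dataproc-accounts.iam.gserviceaccount.com"
    )
    return [
        "gcloud", "projects", "add-iam-policy-binding", kms_project_id,
        "--member", member,
        "--role", ROLE,
    ]


# Hypothetical project ID and project number, for illustration only.
print(shlex.join(grant_kms_role_command("my-kms-project", "123456789012")))
```

The list form can be passed to `subprocess.run` if you choose to execute the command from a script.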
You can use the gcloud CLI or the Dataproc API to set the key you created in Step 1 on a workflow. Once the key is set on a workflow, all the workflow job arguments and queries are encrypted using the key for any of the job types and arguments listed in WorkflowTemplate.EncryptionConfig.kmsKey.

gcloud CLI

Pass the resource ID of your key to the --kms-key flag when you create the workflow template with the gcloud dataproc workflow-templates create command.

Example:
```
gcloud dataproc workflow-templates create my-template-name \
    --region=region \
    --kms-key='projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name' \
    other arguments ...
```
You can verify the key setting with the gcloud CLI.
```
gcloud dataproc workflow-templates describe TEMPLATE_NAME \
    --region=REGION
```
Command output snippet:
```
...
id: my-template-name
encryptionConfig:
  kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
...
```

REST API

Use WorkflowTemplate.EncryptionConfig.kmsKey as part of a workflowTemplates.create request.

You can verify the key setting by issuing a workflowTemplates.get request. The returned JSON contains the kmsKey:
```
...
"id": "my-template-name",
"encryptionConfig": {
  "kmsKey": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
},
```
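
The check on the returned JSON can be automated. The following Python sketch builds the encryption portion of a workflowTemplates.create body and confirms that a workflowTemplates.get response carries the expected key; both function names are hypothetical, and the required template fields other than `id` and `encryptionConfig` are omitted here and must be supplied by the caller.

```python
import json


def template_encryption_fragment(template_id: str, kms_key: str) -> dict:
    """Return the id and encryptionConfig portion of a template body."""
    # Assumption: the caller merges in the remaining template fields.
    return {"id": template_id, "encryptionConfig": {"kmsKey": kms_key}}


def template_uses_key(response_json: str, expected_key: str) -> bool:
    """True when the returned template JSON carries the expected kmsKey."""
    template = json.loads(response_json)
    return template.get("encryptionConfig", {}).get("kmsKey") == expected_key


# Placeholder key resource ID in the format used throughout this page.
KEY = "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
fragment = template_encryption_fragment("my-template-name", KEY)
print(template_uses_key(json.dumps(fragment), KEY))
```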

Cloud External Key Manager

Cloud External Key Manager (Cloud EKM) lets you protect Managed Service for Apache Spark data using keys managed by a supported external key management partner. The steps you follow to use Cloud EKM in Managed Service for Apache Spark are the same as those you use to set up CMEK keys, with the following difference: your key points to a URI for the externally managed key (see Cloud EKM Overview).

Cloud EKM errors

When you use Cloud EKM, an attempt to create a cluster can fail due to errors associated with inputs, Cloud EKM, the external key management partner system, or communications between Cloud EKM and the external system. If you use the REST API or the Google Cloud console, errors are logged in Cloud Logging. You can examine the failed cluster's errors from the View Log tab.
