Customer managed encryption keys (CMEK)

When you use Dataproc, cluster and job data is stored on persistent disks associated with the Compute Engine VMs in your cluster and in a Cloud Storage staging bucket. By default, this persistent disk and bucket data is encrypted using a Google-generated data encryption key (DEK) and key encryption key (KEK).

If you want to control and manage the key encryption key (KEK), you can use customer-managed encryption keys (CMEK). Google continues to control the data encryption key (DEK). For more information about Google data encryption keys, see Encryption at Rest.

CMEK cluster data encryption

You can use customer-managed encryption keys (CMEK) to encrypt the following cluster data:

  • Data on persistent disks attached to Dataproc cluster VMs
  • Job argument data submitted to your cluster, such as a query string submitted with a Spark SQL job
  • Cluster metadata, job driver output, and other data written to your Dataproc cluster staging bucket

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc, Cloud Key Management Service, Compute Engine, and Cloud Storage APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

  5. Install the Google Cloud CLI.

  6. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  7. To initialize the gcloud CLI, run the following command:

    gcloud init
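The API-enablement step can also be done from the gcloud CLI. A minimal sketch, assuming the target project is already set as the active gcloud project:

```shell
# Enable the four APIs this guide requires on the active project.
gcloud services enable \
    dataproc.googleapis.com \
    cloudkms.googleapis.com \
    compute.googleapis.com \
    storage.googleapis.com
```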

Create keys

To protect your Dataproc resources with CMEK, you can automate the creation of keys or create keys manually.

Automated key creation

Use Autokey to automate CMEK provisioning and assignment. Autokey generates key rings and keys on demand when resources are created. Service agents use the keys in encrypt and decrypt operations. If needed, Autokey creates the agents and grants them the required Identity and Access Management (IAM) roles. For more information, see Autokey overview.

Manual key creation

Follow these steps to manually create keys for CMEK encryption of cluster data:

  1. Create one or more keys using Cloud KMS. The resource name (also called the resource ID) of a key, which you use in the next steps, is constructed as follows:

    projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    
    The key (CMEK) must be located in the same location as the encrypted resource. For example, the CMEK used to encrypt a resource in the us-central1 region must also be located in the us-central1 region.

  2. To ensure that the Compute Engine Service Agent, Cloud Storage Service Agent, and Dataproc Service Agent service accounts have the necessary permissions to protect resources by using Cloud KMS keys, ask your administrator to grant each of these service accounts the Cloud KMS CryptoKey Encrypter/Decrypter (roles/cloudkms.cryptoKeyEncrypterDecrypter) IAM role on your project.

    Example assignment of the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataproc Service Agent service account using the Google Cloud CLI:

    gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
    --member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter
    

    Replace the following:

    KMS_PROJECT_ID: the ID of your Google Cloud project that contains the Cloud KMS key.

    PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Dataproc resources.

  3. If the Dataproc Service Agent role is not attached to the Dataproc Service Agent service account, add the serviceusage.services.use permission to a custom role attached to the Dataproc Service Agent service account.
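The manual steps above can be sketched end to end with the gcloud CLI. The key ring and key names here are placeholders, and the compute-system and gs-project-accounts addresses are the standard well-known service agent formats; verify the agent addresses in your project before granting roles:

```shell
# Step 1: create a key ring and key in the region where the cluster will run.
gcloud kms keyrings create my-key-ring --location=us-central1
gcloud kms keys create my-key \
    --location=us-central1 \
    --keyring=my-key-ring \
    --purpose=encryption

# Step 2: grant the Encrypter/Decrypter role to each service agent.
# PROJECT_NUMBER is the number of the project that runs Dataproc resources.
for SA in \
    service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
    service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com \
    service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com; do
  gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
      --member="serviceAccount:${SA}" \
      --role=roles/cloudkms.cryptoKeyEncrypterDecrypter
done
```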

Create a cluster with CMEK

Pass the resource ID of your key when you create the Dataproc cluster.

gcloud CLI

  • To encrypt cluster persistent disk data using your key, pass the resource ID of your key to the --gce-pd-kms-key flag when you create the cluster.
    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --gce-pd-kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
        other arguments ...
        

    You can verify the key setting with the gcloud CLI dataproc clusters describe command.

    gcloud dataproc clusters describe CLUSTER_NAME \
        --region=REGION
        

    Command output snippet:

    ...
    configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
    ...
        
  • To encrypt cluster persistent disk data and job argument data using your key, pass the resource ID of the key to the --kms-key flag when you create the cluster. See Cluster.EncryptionConfig.kmsKey for a list of job types and arguments that are encrypted with the --kms-key flag.
    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
        other arguments ...
          

    You can verify key settings with the gcloud CLI dataproc clusters describe command. The key resource ID is set on both gcePdKmsKeyName and kmsKey, so your key is used to encrypt cluster persistent disk data and job argument data.

    gcloud dataproc clusters describe CLUSTER_NAME \
        --region=REGION
          

    Command output snippet:

    ...
    configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
        kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    ...
        
  • To encrypt cluster metadata, job driver output, and other data written to your Dataproc staging bucket in Cloud Storage, pass the name of a CMEK-enabled bucket to the --bucket flag when you create the cluster:
    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --bucket=CMEK_BUCKET_NAME \
        other arguments ...
            

    You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:

    gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
        --region=region \
        --cluster=cluster-name \
        -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
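The --bucket example above assumes that CMEK_BUCKET_NAME already has your key set as its default encryption key. As a sketch of creating such a bucket with the gcloud CLI (bucket name, region, and key path are placeholders to adapt):

```shell
# Create a staging bucket whose default encryption key is your CMEK.
# The bucket must be in the same location as the key.
gcloud storage buckets create gs://cmek-bucket \
    --location=us-central1 \
    --default-encryption-key=projects/PROJECT_ID/locations/us-central1/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
```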
          

REST API

  • To encrypt cluster VM persistent disk data using your key, include the ClusterConfig.EncryptionConfig.gcePdKmsKeyName field as part of a clusters.create request.

    You can verify the key setting with the gcloud CLI dataproc clusters describe command.

    gcloud dataproc clusters describe CLUSTER_NAME \
        --region=REGION
        

    Command output snippet:

    ...
    configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    ...
        
  • To encrypt cluster VM persistent disk data and job argument data using your key, include the Cluster.EncryptionConfig.kmsKey field as part of a clusters.create request. See Cluster.EncryptionConfig.kmsKey for a list of job types and arguments that are encrypted with the kmsKey field.

    You can verify key settings with the gcloud CLI dataproc clusters describe command. The key resource ID is set on both gcePdKmsKeyName and kmsKey, so your key is used to encrypt cluster persistent disk data and job argument data.

    gcloud dataproc clusters describe CLUSTER_NAME \
        --region=REGION
        

    Command output snippet:

    ...
    configBucket: dataproc- ...
      encryptionConfig:
        gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
        kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    ...
        
  • To encrypt cluster metadata, job driver output, and other data written to your Dataproc staging bucket in Cloud Storage, pass the name of a CMEK-enabled bucket to the --bucket flag when you create the cluster:
    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --bucket=CMEK_BUCKET_NAME \
        other arguments ...
        

    You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:

    gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
        --region=region \
        --cluster=cluster-name \
        -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
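As a sketch, a clusters.create request that sets both encryption fields might look like the following curl call. The endpoint shape is the standard Dataproc v1 REST surface; the project, region, cluster name, and key path are placeholders:

```shell
# Create a cluster with CMEK on persistent disks and job argument data.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters" \
    -d '{
      "clusterName": "CLUSTER_NAME",
      "config": {
        "encryptionConfig": {
          "gcePdKmsKeyName": "projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME",
          "kmsKey": "projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME"
        }
      }
    }'
```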
          

Use CMEK with workflow template data

Dataproc workflow template job argument data, such as the query string of a Spark SQL job, can be encrypted using CMEK. Follow steps 1, 2, and 3 in this section to use CMEK with your Dataproc workflow template. See WorkflowTemplate.EncryptionConfig.kmsKey for a list of workflow template job types and arguments that are encrypted using CMEK when this feature is enabled.

  1. Create a key using Cloud KMS. The resource name of the key, which you use in the next steps, is constructed as follows:
    projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
    
  2. To enable the Dataproc service accounts to use your key:

    1. Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataproc Service Agent service account. You can use the gcloud CLI to assign the role:

       gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
       --member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
       --role roles/cloudkms.cryptoKeyEncrypterDecrypter
      

      Replace the following:

      KMS_PROJECT_ID: the ID of your Google Cloud project that contains the Cloud KMS key. This project can also be the project that runs Dataproc resources.

      PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Dataproc resources.

    2. Enable the Cloud KMS API on the project that runs Dataproc resources.

    3. If the Dataproc Service Agent role is not attached to the Dataproc Service Agent service account, add the serviceusage.services.use permission to the custom role attached to that service account. If the Dataproc Service Agent role is attached, you can skip this step.

  3. You can use the gcloud CLI or the Dataproc API to set the key you created in Step 1 on a workflow. Once the key is set on a workflow, all the workflow job arguments and queries are encrypted using the key for any of the job types and arguments listed in WorkflowTemplate.EncryptionConfig.kmsKey.

    gcloud CLI

    Pass the resource ID of your key to the --kms-key flag when you create the workflow template with the gcloud dataproc workflow-templates create command.

    Example:

    gcloud dataproc workflow-templates create my-template-name \
        --region=region \
        --kms-key='projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name' \
        other arguments ...
    
    You can verify the key setting with the gcloud CLI dataproc workflow-templates describe command.
    gcloud dataproc workflow-templates describe TEMPLATE_NAME \
        --region=REGION
    
    ...
    id: my-template-name
    encryptionConfig:
      kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    ...
    

    REST API

    Use WorkflowTemplate.EncryptionConfig.kmsKey as part of a workflowTemplates.create request.

    You can verify the key setting by issuing a workflowTemplates.get request. The returned JSON contains the kmsKey:

    ...
    "id": "my-template-name",
    "encryptionConfig": {
      "kmsKey": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
    },
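A minimal sketch of such a request with curl. The endpoint follows the standard Dataproc v1 REST surface, and the placeholders are assumptions to adapt; the placement and jobs fields required by the API are omitted here for brevity:

```shell
# Create a workflow template with a CMEK for job argument encryption.
# Add the required "placement" and "jobs" fields to the body before use.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/workflowTemplates" \
    -d '{
      "id": "my-template-name",
      "encryptionConfig": {
        "kmsKey": "projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME"
      }
    }'
```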
    

Cloud External Key Manager

Cloud External Key Manager (Cloud EKM) lets you protect Dataproc data using keys managed by a supported external key management partner. The steps you follow to use Cloud EKM in Dataproc are the same as the CMEK setup steps, with one difference: your key points to a URI for the externally managed key (see Cloud EKM Overview).
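As a hedged sketch, an externally backed key can be created and then used wherever this guide takes a key resource ID. The external key URI comes from your key management partner; the flag and algorithm names here should be verified against the current gcloud kms reference:

```shell
# Create a key whose material lives in an external key manager.
gcloud kms keys create my-ekm-key \
    --location=us-central1 \
    --keyring=my-key-ring \
    --purpose=encryption \
    --protection-level=external \
    --skip-initial-version-creation \
    --default-algorithm=external-symmetric-encryption

# Add a version that points at the externally managed key's URI.
gcloud kms keys versions create \
    --key=my-ekm-key \
    --keyring=my-key-ring \
    --location=us-central1 \
    --external-key-uri=EXTERNAL_KEY_URI
```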

Cloud EKM errors

When you use Cloud EKM, an attempt to create a cluster can fail due to errors associated with inputs, Cloud EKM, the external key management partner system, or communications between Cloud EKM and the external system. If you use the REST API or the Google Cloud console, errors are logged in Cloud Logging. You can examine the failed cluster's errors from the View Log tab.