Redact confidential data

This tutorial shows you how to use the Cloud Data Fusion plugin for Cloud DLP to redact sensitive data.

Scenario

Consider the following scenario, in which some sensitive customer information must be redacted:

Your support team documents the details of each support case they handle in a support ticket. All of the information in the support ticket is pulled into a CSV file. The support technicians are not supposed to document any customer information that's considered sensitive, but sometimes they do so by mistake. You notice that some customers' phone numbers appear in the CSV file.

You want to go through the CSV file and hide all phone numbers. You create a Cloud Data Fusion pipeline that redacts the sensitive customer data by using the Cloud DLP plugin.

In this tutorial, you create a pipeline that does the following:

  • Redacts customer phone numbers by masking them with the # character.
  • Stores the masked sensitive data and the non-sensitive data in a Cloud Storage bucket.

Create the pipeline

Create a pipeline that redacts sensitive customer data. The pipeline you build does the following:

  • Reads the input data using the Cloud Storage source plugin.
  • Redacts the sensitive data using the Cloud DLP plugin, which you deploy from the Hub.
  • Writes the output data using a Cloud Storage sink plugin.

Load the customer data

This tutorial uses the input dataset, CallCenterRecords.csv, provided in a publicly available Cloud Storage bucket.

  1. Open your Cloud Data Fusion instance and click Menu > Studio.

  2. In the Source menu, click the Cloud Storage plugin.

  3. On the Cloud Storage node, click Properties.

  4. In the Reference name field, enter a name.

  5. In the Path field, enter gs://datafusion-sample-datasets/CallCenterRecords.csv.

  6. In the Format field, select CSV.

  7. For the Output Schema, delete the offset and body fields. Click Add and enter the following fields:

    • Date
    • Bank
    • State
    • Zip
    • Notes

  8. Click Validate to check for errors.

  9. Click Close.
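
To make the schema step concrete, here is a minimal local sketch of how five positional CSV columns map to the field names entered above. It assumes the file has no header row, and the sample rows are invented for illustration; they are not taken from CallCenterRecords.csv.

```python
import csv
import io

# Field names from the Output Schema step, in column order.
FIELDS = ["Date", "Bank", "State", "Zip", "Notes"]

# Hypothetical sample rows (not real data from the tutorial's dataset).
SAMPLE = (
    "2019-01-05,Example Bank,CA,94043,Customer asked about fees\n"
    '2019-01-06,Example Bank,NY,10001,"Call back at 555-010-0199, after 5pm"\n'
)


def read_records(text):
    """Parse headerless CSV text into dicts keyed by the schema field names."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    return list(reader)


records = read_records(SAMPLE)
print(records[1]["Notes"])
```

The Notes field is the free-text column, which is why it is the one the rest of the tutorial targets for redaction.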

Redact sensitive data

The Cloud DLP Redact plugin identifies sensitive records in your input stream of data and applies transformations that you define to those records. A record of data is considered sensitive if it matches predefined Cloud DLP filters you choose or a custom template you define.

In this tutorial, you want to redact customer phone numbers that some support technicians on your team accidentally took note of. They entered the sensitive information in the Notes section of the support tickets, which appears as the Notes column in the CSV file. You create a custom Cloud DLP template, and then provide the template ID in the properties menu of the plugin.
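
As a rough local illustration of what masking does to a Notes value, the following sketch replaces every character of a phone-like match with #. The regular expression is a deliberately simplified assumption covering only one US-style format. It is not how Cloud DLP detects phone numbers, and the sample note is invented.

```python
import re

# Simplified, assumed pattern: three digits, three digits, four digits,
# separated by a dash, dot, or space. Cloud DLP's PHONE_NUMBER infoType
# covers far more formats than this.
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def mask_phones(text, mask_char="#"):
    """Replace each character of every phone-like match with mask_char."""
    return PHONE_RE.sub(lambda m: mask_char * len(m.group()), text)


print(mask_phones("Call back at 555-010-0199 after 5pm"))
```

The text around the match is left untouched, so the rest of the note stays readable after redaction.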

Deploy the Cloud DLP plugin

  1. In your Cloud Data Fusion instance, click Hub.

  2. Click the Cloud DLP plugin.

  3. Click Deploy.

  4. Click Finish.

  5. Click Close to exit the Cloud DLP dialog.

  6. Click Close to exit the Hub.

Create a custom template

  1. In the Google Cloud console, go to the Cloud DLP page.

    Go to Cloud DLP

  2. From the Create menu, choose Template.

  3. In the Template ID field, enter an ID for your template.

  4. Click Continue.

  5. In the Configure detection field, click Manage infotypes.

  6. In the Built-in tab, use the filter to search for "phone number".

  7. Select PHONE_NUMBER.

  8. Click Done > Create.

Learn more about creating Cloud DLP templates.
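
If you prefer to create the template programmatically, the console steps above correspond roughly to the following sketch. It assumes the google-cloud-dlp Python client library is installed and that credentials are configured. The project ID and template ID shown are placeholders, and `build_inspect_template_request` and `create_template` are helper names made up for this example.

```python
def build_inspect_template_request(project_id, template_id, info_types):
    """Build the request body for creating an inspection template."""
    return {
        "parent": f"projects/{project_id}",
        "template_id": template_id,
        "inspect_template": {
            "inspect_config": {
                # One entry per built-in infoType to detect.
                "info_types": [{"name": name} for name in info_types],
            },
        },
    }


def create_template(request):
    """Send the request. Needs google-cloud-dlp and valid credentials."""
    from google.cloud import dlp_v2  # imported lazily; optional dependency

    client = dlp_v2.DlpServiceClient()
    return client.create_inspect_template(request=request)


# Placeholder identifiers for illustration only.
request = build_inspect_template_request(
    "my-project", "redact-phone-numbers", ["PHONE_NUMBER"]
)
print(request["inspect_template"]["inspect_config"]["info_types"])
```

Only the request builder runs here. Call `create_template(request)` yourself when you are ready to create the template in your project.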

Apply the Cloud DLP Redact transformation

  1. Go to the Cloud Data Fusion Studio page and click to expand the Transform menu.

  2. Click the Cloud DLP Redact plugin.

  3. Drag a connection arrow from the Cloud Storage node to the Redact node.

  4. Hold the pointer over the Redact node and click Properties.

    1. Set Custom Template to Yes.

    2. In the Template ID field, enter the template ID of the custom template you created.

    3. In the Matching field, apply Masking on Custom template within Notes.

    4. In the Masking Character field, enter #.

    5. Click Validate to check for errors.

    6. Click Close.
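
For orientation, the settings above resemble a Cloud DLP de-identification configuration that applies character masking to findings in a single field. The sketch below builds such a configuration as a plain dictionary. It shows how the DLP API expresses this kind of masking and is not a description of what the plugin sends internally. `build_mask_config` is a hypothetical helper name.

```python
def build_mask_config(field_name, masking_character="#"):
    """Build a de-identify config that masks findings within one field."""
    mask = {
        "primitive_transformation": {
            "character_mask_config": {"masking_character": masking_character}
        }
    }
    return {
        "record_transformations": {
            "field_transformations": [
                {
                    # Restrict the transformation to the named column.
                    "fields": [{"name": field_name}],
                    # Mask only the detected infoTypes, not the whole value.
                    "info_type_transformations": {"transformations": [mask]},
                }
            ]
        }
    }


config = build_mask_config("Notes")
print(config["record_transformations"]["field_transformations"][0]["fields"])
```

Scoping the transformation to Notes mirrors the "within Notes" choice in the plugin properties, so the other columns pass through unchanged.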

Store the output data

Store the results of your pipeline in a Cloud Storage file.

  1. From the Studio page, click to expand the Sink menu.

  2. Click Cloud Storage.

  3. Drag a connection arrow from the Redact node to the Cloud Storage2 node.

  4. Hold the pointer over the Cloud Storage2 node and click Properties.

    1. In the Reference name field, enter a name.

    2. In the Path field, enter the path of a Cloud Storage bucket where you'd like to store the pipeline results. Cloud Data Fusion creates the bucket for you. Be sure to follow the bucket naming guidelines.

    3. In the Format field, select CSV.

    4. Click Validate to ensure that there are no errors.

    5. Click Close.
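
Before entering the sink path, you can sanity-check the bucket name locally. The sketch below covers only a simplified subset of the bucket naming guidelines (length, allowed characters, and the restrictions on "goog" and "google"), so treat a passing result as a hint rather than a guarantee. The path shown is a placeholder.

```python
import re

# Simplified rule: 3 to 63 characters; lowercase letters, digits, dashes,
# underscores, and dots; must start and end with a letter or digit.
BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def bucket_from_path(path):
    """Extract the bucket name from a gs:// path."""
    if not path.startswith("gs://"):
        raise ValueError("expected a path that starts with gs://")
    return path[len("gs://"):].split("/", 1)[0]


def looks_like_valid_bucket(name):
    """Check a name against the simplified subset of rules above."""
    return (
        BUCKET_RE.fullmatch(name) is not None
        and not name.startswith("goog")
        and "google" not in name
    )


bucket = bucket_from_path("gs://my-redacted-output/results")  # placeholder path
print(bucket, looks_like_valid_bucket(bucket))
```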

Run the pipeline in preview mode

Run the pipeline in preview mode before you deploy it.

  1. Click Preview, and then click Run.

    After you click Run, the pipeline status is displayed: it starts with Starting, then changes to Stop, and then to Run.

  2. When the preview run completes, on the Redact node, click Preview Data to see a side-by-side comparison of the input and output data. Check that phone numbers have been masked with the # character.

Redact another data type

While examining the preview run results, you notice that there's still sensitive information that appears in the Notes column: email addresses. You go back and edit the Cloud DLP template to redact email addresses as well.

  1. In the Google Cloud console, go to the Cloud DLP page.

    Open the Cloud DLP page

  2. On the Configuration tab, select your template.

  3. Click Edit.

  4. Click Manage infotypes.

  5. In the Built-in tab, use the filter to search for "phone number" OR "email address".

  6. Select all and click Done.

  7. Click Save.

  8. Run your pipeline in preview mode again. Cloud Data Fusion automatically uses the updated Cloud DLP template.

  9. Check that both phone numbers and email addresses have been masked with the # character.

Deploy and run the pipeline

  1. Make sure Preview mode is unchecked.

  2. Click Save. When prompted, enter a name for your pipeline, and then click OK.

  3. Click Deploy.

  4. When deployment completes, click Run. Running your pipeline can take a few minutes. While you wait, you can observe the Status of the pipeline transition from Provisioning to Starting to Running to Deprovisioning to Succeeded.

View the results

  1. In the Google Cloud console, go to the Cloud Storage page.

    Go to Cloud Storage

  2. In the Storage browser, navigate to the sink Cloud Storage bucket you specified in the sink Cloud Storage plugin properties.

  3. In Link URL, click the link to download the CSV file with the results. Check that the phone numbers and email addresses have been masked with the # character.
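
Instead of checking the downloaded file by eye, you could scan it for values that still look like phone numbers or email addresses. This is a minimal sketch with simplified, assumed patterns. It assumes that Notes is the fifth column, and the sample rows are invented. A clean result does not prove that every sensitive value was redacted.

```python
import csv
import io
import re

# Simplified, assumed patterns for spotting leftovers; not Cloud DLP's detectors.
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_unmasked(csv_text, column_index=4):
    """Return (row_number, value) pairs whose Notes value still matches a pattern."""
    leaks = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) <= column_index:
            continue  # skip short or empty rows
        notes = row[column_index]
        if PHONE_RE.search(notes) or EMAIL_RE.search(notes):
            leaks.append((row_number, notes))
    return leaks


# Hypothetical output rows: one fully masked, one with a leftover email address.
masked_row = "2019-01-05,Example Bank,CA,94043,Call back at ############\n"
leaky_row = "2019-01-06,Example Bank,NY,10001,Email jane@example.com\n"

print(find_unmasked(masked_row + leaky_row))
```

To check a real result, read the downloaded file's contents into a string and pass it to `find_unmasked`.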
