This tutorial shows how to implement an automated data quarantine and classification system using Cloud Storage and other Google Cloud products. The tutorial assumes that you are familiar with Google Cloud and basic shell programming.
In every organization, data protection officers like you face an ever-increasing amount of data that must be protected and treated appropriately. Quarantining and classifying that data can be complicated and time-consuming, especially when hundreds or thousands of files arrive each day.
What if you could take each file, upload it to a quarantine location, and have it automatically classified and moved to the appropriate location based on the classification result? This tutorial shows you how to implement such a system by using Cloud Run functions, Cloud Storage, and Sensitive Data Protection.
Granting permissions to service accounts
Your first step is to grant permissions to two service accounts: the Cloud Run functions service account and the Cloud Data Loss Prevention Service Agent.
Grant permissions to the App Engine default service account
In the Google Cloud console, open the IAM & Admin page and select the project you created:
Locate the App Engine service account. This account has the format [PROJECT_ID]@appspot.gserviceaccount.com. Replace [PROJECT_ID] with your project ID.
Select the edit icon next to the service account.
Add the following roles:
- DLP Administrator
- DLP API Service Agent
Click Save.
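If you prefer to script these grants rather than use the console, the following minimal sketch assembles the equivalent gcloud commands without running them. It assumes that the console role names DLP Administrator and DLP API Service Agent correspond to the role IDs roles/dlp.admin and roles/dlp.serviceAgent; the project ID my-project and the helper names are hypothetical.

```python
import shlex

# Assumed mapping from the console role names above to IAM role IDs.
APP_ENGINE_ROLES = {
    "DLP Administrator": "roles/dlp.admin",
    "DLP API Service Agent": "roles/dlp.serviceAgent",
}


def app_engine_service_account(project_id):
    """Return the App Engine default service account email for a project."""
    return f"{project_id}@appspot.gserviceaccount.com"


def iam_binding_commands(project_id):
    """Build one gcloud command string per role; nothing is executed here."""
    member = f"serviceAccount:{app_engine_service_account(project_id)}"
    commands = []
    for role in APP_ENGINE_ROLES.values():
        args = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}", f"--role={role}",
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    for command in iam_binding_commands("my-project"):
        print(command)
```

Printing the commands first lets you review each binding before you paste it into Cloud Shell.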
Grant permissions to the Cloud Data Loss Prevention Service Agent
The Cloud Data Loss Prevention Service Agent is created the first time it is needed.
In Cloud Shell, create the Cloud Data Loss Prevention Service Agent by making a call to InspectContent:
curl --request POST \
  "https://dlp.googleapis.com/v2/projects/PROJECT_ID/locations/us-central1/content:inspect" \
  --header "X-Goog-User-Project: PROJECT_ID" \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{"item":{"value":"google@google.com"}}' \
  --compressed
Replace PROJECT_ID with your project ID.
In the Google Cloud console, open the IAM & Admin page and select the project you created:
Select the Include Google-provided role grants checkbox.
Locate the Cloud Data Loss Prevention Service Agent. This account has the format service-[PROJECT_NUMBER]@dlp-api.iam.gserviceaccount.com. Replace [PROJECT_NUMBER] with your project number.
Select the edit icon next to the service account.
Add the role Project > Viewer, and then click Save.
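The InspectContent call that creates the service agent can also be expressed in Python. The sketch below mirrors the curl request from this section by using only the standard library. It builds the request without sending it; the access token is passed in as a parameter (for example, the output of gcloud auth print-access-token), and the function names are hypothetical.

```python
import json
import urllib.request


def build_inspect_request(project_id, access_token, text="google@google.com"):
    """Build the same POST request that the curl command in this section sends."""
    url = (
        f"https://dlp.googleapis.com/v2/projects/{project_id}"
        "/locations/us-central1/content:inspect"
    )
    body = json.dumps({"item": {"value": text}}).encode("utf-8")
    headers = {
        "X-Goog-User-Project": project_id,
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send_inspect_request(request):
    """Send the request and return the decoded JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call makes it possible to inspect the URL and payload before anything is sent.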
Building the quarantine and classification pipeline
In this section, you build the quarantine and classification pipeline shown in the following diagram.
The numbers in this pipeline correspond to these steps:
1. You upload files to Cloud Storage.
2. You invoke a Cloud Function.
3. Sensitive Data Protection inspects and classifies the data.
4. The file is moved to the appropriate bucket.
Create Cloud Storage buckets
Following the guidance outlined in the bucket naming guidelines, create three uniquely named buckets, which you use throughout this tutorial:
- Bucket 1: Replace [YOUR_QUARANTINE_BUCKET] with a unique name.
- Bucket 2: Replace [YOUR_SENSITIVE_DATA_BUCKET] with a unique name.
- Bucket 3: Replace [YOUR_NON_SENSITIVE_DATA_BUCKET] with a unique name.
console
In the Google Cloud console, open the Cloud Storage browser:
Click Create bucket.
In the Bucket name text box, enter the name you selected for [YOUR_QUARANTINE_BUCKET], and then click Create.
Repeat for the [YOUR_SENSITIVE_DATA_BUCKET] and [YOUR_NON_SENSITIVE_DATA_BUCKET] buckets.
gcloud
Open Cloud Shell:
Create three buckets using the following commands:
gcloud storage buckets create gs://[YOUR_QUARANTINE_BUCKET]
gcloud storage buckets create gs://[YOUR_SENSITIVE_DATA_BUCKET]
gcloud storage buckets create gs://[YOUR_NON_SENSITIVE_DATA_BUCKET]
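Before you create the three buckets, it can help to check the candidate names locally. The following sketch covers only a subset of the bucket naming guidelines as I understand them (length of 3 to 63 characters, lowercase letters, digits, dashes, underscores, and dots, a letter or digit at each end, no "goog" prefix, and no "google" substring). It can't check global uniqueness, and the example names are hypothetical; the bucket naming guidelines remain the authority.

```python
import re

# Lowercase letters, digits, dashes, underscores, and dots; 3 to 63 characters;
# must start and end with a letter or digit. This is a partial check only.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_plausible_bucket_name(name):
    """Return True if the name passes a partial, local-only naming check."""
    if not _BUCKET_NAME.match(name):
        return False
    if name.startswith("goog") or "google" in name:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical names for the three buckets used in this tutorial.
    for candidate in ("acme-quarantine", "acme-sensitive", "acme-non-sensitive"):
        print(candidate, is_plausible_bucket_name(candidate))
```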
Create a Pub/Sub topic and subscription
console
Open the Pub/Sub Topics page:
Click Create topic.
In the text box, enter a topic name.
Select the Add a default subscription check box.
Click Create Topic.
gcloud
Open Cloud Shell:
Create a topic, replacing [PUB/SUB_TOPIC] with a name of your choosing:
gcloud pubsub topics create [PUB/SUB_TOPIC]
Create a subscription, replacing [PUB/SUB_SUBSCRIPTION] with a name of your choosing:
gcloud pubsub subscriptions create [PUB/SUB_SUBSCRIPTION] --topic [PUB/SUB_TOPIC]
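The topic and subscription can also be created from Python. The sketch below assumes that the google-cloud-pubsub client library is installed and that credentials are available in the environment. The library is imported inside the function so that the path helpers work on their own, and all function names are hypothetical.

```python
def topic_path(project_id, topic_id):
    """Return the fully qualified Pub/Sub topic name."""
    return f"projects/{project_id}/topics/{topic_id}"


def subscription_path(project_id, subscription_id):
    """Return the fully qualified Pub/Sub subscription name."""
    return f"projects/{project_id}/subscriptions/{subscription_id}"


def create_topic_and_subscription(project_id, topic_id, subscription_id):
    """Create a topic and a pull subscription attached to it."""
    # Imported lazily: requires the google-cloud-pubsub package.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic = topic_path(project_id, topic_id)
    subscription = subscription_path(project_id, subscription_id)

    publisher.create_topic(request={"name": topic})
    subscriber.create_subscription(request={"name": subscription, "topic": topic})
    return topic, subscription
```

The fully qualified topic name in the form projects/PROJECT_ID/topics/TOPIC is the same form that a DLP job uses when it publishes a completion notification.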
Create the Cloud Run functions
This section steps through deploying the Python script containing the following two Cloud Run functions:
- A function that is invoked when an object is uploaded to Cloud Storage.
- A function that is invoked when a message is received in the Pub/Sub queue.
The Python script that you use to complete this tutorial is contained in a GitHub repository. To create the first Cloud Function, you must enable the correct APIs.
To enable the APIs, do the following:
- If you are working in the console, when you click Create function, you will see a guide on how to enable the APIs that you need to use Cloud Functions.
- If you are working in the gcloud CLI, you must manually enable the following APIs:
  - Artifact Registry API
  - Eventarc API
  - Cloud Run Admin API
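For the gcloud CLI path, the three APIs listed above can be enabled with a single gcloud services enable command. The sketch below only assembles that command. The mapping from API display names to service names is my assumption of the standard identifiers, and the project ID is a placeholder.

```python
import shlex

# Assumed service names for the APIs listed above.
REQUIRED_APIS = {
    "Artifact Registry API": "artifactregistry.googleapis.com",
    "Eventarc API": "eventarc.googleapis.com",
    "Cloud Run Admin API": "run.googleapis.com",
}


def enable_apis_command(project_id):
    """Return one gcloud command that enables all required APIs."""
    args = ["gcloud", "services", "enable", *REQUIRED_APIS.values(),
            f"--project={project_id}"]
    return shlex.join(args)


if __name__ == "__main__":
    print(enable_apis_command("my-project"))
```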
Creating the first function
console
Open the Cloud Run functions Overview page:
Select the project for which you enabled Cloud Run functions.
Click Create function.
In the Function name box, replace the default name with create_DLP_job.
In the Trigger field, select Cloud Storage.
In the Event type field, select Finalize/Create.
In the Bucket field, click browse, select your quarantine bucket by highlighting the bucket in the drop-down list, and then click Select.
Click Save.
Click Next.
Under Runtime, select Python 3.7.
Under Source code, check Inline editor.
Replace the text in the main.py box with the contents of the following file:
https://github.com/GoogleCloudPlatform/dlp-cloud-functions-tutorials/blob/master/gcs-dlp-classification-python/main.py
Replace the following:
- [PROJECT_ID_DLP_JOB & TOPIC]: the project ID that is hosting your Cloud Run function and Pub/Sub topic.
- [YOUR_QUARANTINE_BUCKET]: the name of the bucket you will be uploading the files to be processed to.
- [YOUR_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving sensitive files to.
- [YOUR_NON_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving nonsensitive files to.
- [PUB/SUB_TOPIC]: the name of the Pub/Sub topic that you created earlier.
In the Entry point text box, replace the default text with create_DLP_job.
Replace the text in the requirements.txt text box with the contents of the following file:
https://github.com/GoogleCloudPlatform/dlp-cloud-functions-tutorials/blob/master/gcs-dlp-classification-python/requirements.txt
Click Deploy.
A green checkmark beside the function indicates a successful deployment.
gcloud
Open a Cloud Shell session and clone the GitHub repository that contains the code and some sample data files:
git clone https://github.com/GoogleCloudPlatform/dlp-cloud-functions-tutorials.git
Change directories to the folder the repository has been cloned to:
cd ~/dlp-cloud-functions-tutorials/gcs-dlp-classification-python/
Make the following replacements in the main.py file:
- [PROJECT_ID_DLP_JOB & TOPIC]: the project ID that is hosting your Cloud Run function and Pub/Sub topic.
- [YOUR_QUARANTINE_BUCKET]: the name of the bucket you will be uploading the files to be processed to.
- [YOUR_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving sensitive files to.
- [YOUR_NON_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving nonsensitive files to.
- [PUB/SUB_TOPIC]: the name of the Pub/Sub topic that you created earlier.
Deploy the function, replacing [YOUR_QUARANTINE_BUCKET] with your bucket name:
gcloud functions deploy create_DLP_job --runtime python37 \
  --trigger-resource [YOUR_QUARANTINE_BUCKET] \
  --trigger-event google.storage.object.finalize
Validate that the function has successfully deployed:
gcloud functions describe create_DLP_job
A successful deployment is indicated by a ready status similar to the following:
status: READY
timeout: 60s
When the Cloud Function has successfully deployed, continue to the next section to create the second Cloud Function.
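To make the role of the first function more concrete, the following sketch shows the kind of inspection job that a function like create_DLP_job might submit when a file lands in the quarantine bucket. It is not the code from the repository. The helper names, the default likelihood, and the findings limit are assumptions, and submitting the job requires the google-cloud-dlp client library.

```python
def build_inspect_job(project_id, bucket, file_name, topic, info_types,
                      min_likelihood="POSSIBLE", max_findings=0):
    """Build an inspect job config for one Cloud Storage object.

    The job scans gs://bucket/file_name and publishes a notification to the
    given Pub/Sub topic when it finishes.
    """
    return {
        "inspect_config": {
            "info_types": [{"name": name} for name in info_types],
            "min_likelihood": min_likelihood,
            "limits": {"max_findings_per_request": max_findings},
        },
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": f"gs://{bucket}/{file_name}"}
            }
        },
        "actions": [
            {"pub_sub": {"topic": f"projects/{project_id}/topics/{topic}"}}
        ],
    }


def submit_inspect_job(project_id, inspect_job):
    """Submit the job to Sensitive Data Protection (needs credentials)."""
    # Imported lazily: requires the google-cloud-dlp package.
    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    return client.create_dlp_job(
        request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
    )
```

The Pub/Sub action is what connects the two functions: the second function is triggered by the message that the finished job publishes to the topic.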
Creating the second function
console
Open the Cloud Run functions Overview page:
Select the project for which you enabled Cloud Run functions.
Click Create function.
In the Function name box, replace the default name with resolve_DLP.
In the Trigger field, select Pub/Sub.
In the Select a Cloud Pub/Sub Topic field, search for the Pub/Sub topic you created earlier.
Click Save.
Click Next.
Under Runtime, select Python 3.7.
Under Source code, select Inline editor.
In the Entry point text box, replace the default text with resolve_DLP.
Replace the text in the main.py box with the contents of the following file:
https://github.com/GoogleCloudPlatform/dlp-cloud-functions-tutorials/blob/master/gcs-dlp-classification-python/main.py
Make the following replacements:
- [PROJECT_ID_DLP_JOB & TOPIC]: the project ID that is hosting your Cloud Run function and Pub/Sub topic.
- [YOUR_QUARANTINE_BUCKET]: the name of the bucket you will be uploading the files to be processed to.
- [YOUR_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving sensitive files to.
- [YOUR_NON_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving nonsensitive files to.
- [PUB/SUB_TOPIC]: the name of the Pub/Sub topic that you created earlier.
Click Deploy.
A green checkmark beside the function indicates a successful deployment.
gcloud
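Open (or reopen) a Cloud Shell session and, if you have not already done so, clone the GitHub repository that contains the code and some sample data files:
git clone https://github.com/GoogleCloudPlatform/dlp-cloud-functions-tutorials.git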
Change directories to the folder with the Python code:
cd ~/dlp-cloud-functions-tutorials/gcs-dlp-classification-python/
Make the following replacements in the main.py file:
- [PROJECT_ID_DLP_JOB & TOPIC]: the project ID that is hosting your Cloud Run function and Pub/Sub topic.
- [YOUR_QUARANTINE_BUCKET]: the name of the bucket you will be uploading the files to be processed to.
- [YOUR_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving sensitive files to.
- [YOUR_NON_SENSITIVE_DATA_BUCKET]: the name of the bucket you will be moving nonsensitive files to.
- [PUB/SUB_TOPIC]: the name of the Pub/Sub topic that you created earlier.
Deploy the function, replacing [PUB/SUB_TOPIC] with your Pub/Sub topic:
gcloud functions deploy resolve_DLP --runtime python37 --trigger-topic [PUB/SUB_TOPIC]
Validate that the function has successfully deployed:
gcloud functions describe resolve_DLP
A successful deployment is indicated by a ready status similar to the following:
status: READY
timeout: 60s
When the Cloud Function has successfully deployed, continue to the next section.
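The second function reacts to the Pub/Sub message that signals a finished inspection job and then moves the file. The following sketch illustrates that flow under stated assumptions. It assumes that the job name arrives in the DlpJobName message attribute and that any finding at all marks a file as sensitive, which might not match the thresholds in the repository's main.py. The helper names are hypothetical, and moving objects requires the google-cloud-storage client library.

```python
def job_name_from_event(event):
    """Extract the DLP job name from a Pub/Sub-triggered function event."""
    return event["attributes"]["DlpJobName"]


def total_findings(info_type_counts):
    """Sum per-infoType finding counts, for example {"EMAIL_ADDRESS": 2}."""
    return sum(info_type_counts.values())


def choose_destination(findings, sensitive_bucket, non_sensitive_bucket):
    """Assumed rule: one or more findings means the file is sensitive."""
    return sensitive_bucket if findings > 0 else non_sensitive_bucket


def move_blob(source_bucket_name, destination_bucket_name, blob_name):
    """Copy an object to the destination bucket, then delete the original."""
    # Imported lazily: requires the google-cloud-storage package.
    from google.cloud import storage

    client = storage.Client()
    source_bucket = client.bucket(source_bucket_name)
    destination_bucket = client.bucket(destination_bucket_name)
    blob = source_bucket.blob(blob_name)
    source_bucket.copy_blob(blob, destination_bucket, blob_name)
    blob.delete()
```

Cloud Storage has no in-place move between buckets in this client flow, so the sketch copies the object and then deletes the source, which leaves the quarantine bucket empty after every file is classified.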
Upload sample files to the quarantine bucket
The GitHub repository associated with this article includes sample data files. The sample_data folder contains some files that have sensitive data and other files that have nonsensitive data. Sensitive data is classified as containing one or more of the following INFO_TYPES values:
US_SOCIAL_SECURITY_NUMBER
EMAIL_ADDRESS
PERSON_NAME
LOCATION
PHONE_NUMBER
The data types that are used to classify the sample files are defined in the INFO_TYPES constant in the main.py file, which is initially set to 'FIRST_NAME,PHONE_NUMBER,EMAIL_ADDRESS,US_SOCIAL_SECURITY_NUMBER'.
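If you want to classify on additional data types, you adjust the INFO_TYPES constant. The following sketch treats the constant as the comma-separated string quoted above and shows one way to add an infoType without introducing duplicates. The helper names are hypothetical, and the repository's main.py might store or parse the value differently.

```python
# Initial value of the constant as quoted in this tutorial.
INFO_TYPES = "FIRST_NAME,PHONE_NUMBER,EMAIL_ADDRESS,US_SOCIAL_SECURITY_NUMBER"


def parse_info_type_names(raw):
    """Split a comma-separated INFO_TYPES string into a list of names."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def add_info_type(raw, name):
    """Return a new comma-separated string that includes the given name once."""
    names = parse_info_type_names(raw)
    if name not in names:
        names.append(name)
    return ",".join(names)


if __name__ == "__main__":
    print(add_info_type(INFO_TYPES, "PERSON_NAME"))
```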
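If you have not already cloned the repository, open Cloud Shell and clone the GitHub repository that contains the code and some sample data files:
git clone https://github.com/GoogleCloudPlatform/dlp-cloud-functions-tutorials.git
Change to the folder that contains the sample data files: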
cd ~/dlp-cloud-functions-tutorials/sample_data/
Copy the sample data files to the quarantine bucket by using the cp command, replacing [YOUR_QUARANTINE_BUCKET] with the name of your quarantine bucket:
gcloud storage cp * gs://[YOUR_QUARANTINE_BUCKET]/
Sensitive Data Protection inspects and classifies each file uploaded to the quarantine bucket and moves it to the appropriate target bucket based on its classification.
In the Cloud Storage console, open the Storage Browser page:
Select one of the target buckets that you created earlier and review the uploaded files. Also review the other buckets that you created.