This page describes the sensitive data discovery service. This service helps you determine where sensitive and high-risk data reside in your organization.
Overview
The discovery service lets you protect data across your organization by identifying where sensitive and high-risk data reside. When you create a discovery scan configuration, Sensitive Data Protection scans your resources to identify the data in scope for profiling. Then, it generates profiles of your data. As long as the discovery configuration is active, Sensitive Data Protection automatically profiles data that you add and modify. You can generate data profiles across the entire organization, individual folders, and individual projects.
Each data profile is a set of insights and metadata that the discovery service gathers from scanning a supported resource. Insights include the predicted infoTypes and the calculated data risk and sensitivity levels of your data. Use these insights to make informed decisions about how you protect, share, and use your data.
Data profiles are generated at various levels of detail. For example, when you profile BigQuery data, profiles are generated at the project, table, and column levels.
The following image shows a list of column-level data profiles.
For a list of insights and metadata included in each data profile, see Metrics reference.
For more information about the Google Cloud resource hierarchy, see Resource hierarchy.
Data profile generation
To start generating data profiles, you create a discovery scan configuration (also called a data profile configuration). This scan configuration is where you set the scope of the discovery operation and the type of data that you want to profile. In the scan configuration, you can set filters to specify subsets of data that you want to profile or skip. You can also set the profiling schedule.
When creating a scan configuration, you also set the inspection template to use. The inspection template is where you specify the types of sensitive data (also called infoTypes) that Sensitive Data Protection must scan for.
When Sensitive Data Protection creates data profiles, it analyzes your data based on your scan configuration and inspection template.
Sensitive Data Protection reprofiles data as described in Frequency of data profile generation. You can customize the profiling frequency in your scan configuration by creating a schedule. To force the discovery service to reprofile your data, see Force a reprofile operation.
Discovery types
This section describes the types of discovery operations that you can perform and the supported data resources.
Discovery for BigQuery and BigLake
When you profile BigQuery data, data profiles are generated at the project, table, and column levels. After profiling a BigQuery table, you can further investigate the findings by performing a deep inspection.
Sensitive Data Protection profiles tables that are supported by the BigQuery Storage Read API, including the following:
- Standard BigQuery tables
- Table snapshots
- BigLake tables stored in Cloud Storage
The following aren't supported:
- BigQuery Omni tables.
- Tables where the serialized data size of an individual row exceeds the maximum that the BigQuery Storage Read API supports (128 MB).
- Non-BigLake external tables, like Google Sheets.
For information about how to profile BigQuery data, see the following:
- Profile BigQuery data in a single project
- Profile BigQuery data in an organization or folder
For more information about BigQuery, see the BigQuery documentation.
Discovery for Cloud SQL
When you profile Cloud SQL data, data profiles are generated at the project, table, and column levels. Before discovery can begin, you need to provide the connection details for each Cloud SQL instance to be profiled.
For information about how to profile Cloud SQL data, see the following:
- Profile Cloud SQL data in a single project
- Profile Cloud SQL data in an organization or folder
For more information about Cloud SQL, see the Cloud SQL documentation.
Discovery for Cloud Storage
When you profile Cloud Storage data, data profiles are generated at the bucket level. Sensitive Data Protection groups the detected files into file clusters and provides a summary for each cluster.
For information about how to profile Cloud Storage data, see the following:
- Profile Cloud Storage data in a single project
- Profile Cloud Storage data in an organization or folder
For more information about Cloud Storage, see the Cloud Storage documentation.
Discovery for Vertex AI
When you profile a Vertex AI dataset, Sensitive Data Protection generates a file store data profile or a table data profile, depending on where your training data is stored: Cloud Storage or BigQuery.
For more information, see the following:
- Sensitive data discovery for Vertex AI
- Profile Vertex AI data in a single project
- Profile Vertex AI data in an organization or folder
For more information about Vertex AI, see the Vertex AI documentation.
Discovery for other cloud providers
When you profile Amazon S3 data, data profiles are generated at the bucket level. When you profile Azure Blob Storage data, data profiles are generated at the container level.
In both cases, Sensitive Data Protection groups the detected files into file clusters and provides a summary for each cluster.
For more information, see the following:
Cloud Run environment variables
The discovery service can detect the presence of secrets in Cloud Run functions and Cloud Run service revision environment variables, and send any findings to Security Command Center. No data profiles are generated.
For more information, see Report secrets in environment variables to Security Command Center.
Roles required to configure and view data profiles
The following sections list the required user roles, categorized according to their purpose. Depending on how your organization is set up, you might decide to have different people perform different tasks. For example, the person who configures data profiles might be different from the person who regularly monitors them.
Roles required to work with data profiles at the organization or folder level
These roles let you configure and view data profiles at the organization or folder level.
Make sure these roles are granted to the proper people at the organization level. Alternatively, your Google Cloud administrator can create custom roles that only have the relevant permissions.
| Purpose | Predefined role | Relevant permissions |
|---|---|---|
| Create a discovery scan configuration and view data profiles | DLP Administrator (roles/dlp.admin) | |
| Create a project to be used as the service agent container¹ | Project Creator (roles/resourcemanager.projectCreator) | |
| Grant discovery access² | One of the following: Organization Administrator (roles/resourcemanager.organizationAdmin) or Security Admin (roles/iam.securityAdmin) | |
| View data profiles (read-only) | DLP Data Profiles Reader (roles/dlp.dataProfilesReader) | |
| View data profiles (read-only) | DLP Reader (roles/dlp.reader) | |

¹ If you don't have the Project Creator (roles/resourcemanager.projectCreator) role, you can still create a scan configuration, but the service agent container that you use must be an existing project.

² If you don't have the Organization Administrator (roles/resourcemanager.organizationAdmin) or Security Admin (roles/iam.securityAdmin) role, you can still create a scan configuration. After you create the scan configuration, someone in your organization who has one of these roles must grant discovery access to the service agent.
Roles required to work with data profiles at the project level
These roles let you configure and view data profiles at the project level.
Make sure these roles are granted to the proper people at the project level. Alternatively, your Google Cloud administrator can create custom roles that only have the relevant permissions.
| Purpose | Predefined role | Relevant permissions |
|---|---|---|
| Configure and view data profiles | DLP Administrator (roles/dlp.admin) | |
| View data profiles (read-only) | DLP Data Profiles Reader (roles/dlp.dataProfilesReader) | |
| View data profiles (read-only) | DLP Reader (roles/dlp.reader) | |
Discovery scan configuration
A discovery scan configuration (sometimes called discovery configuration or scan configuration) specifies how Sensitive Data Protection should profile your data. It includes the following settings:
- Scope (organization, folder, or project) of the discovery operation
- Type of resource to profile
- Inspection templates to use
- Scan frequency
- Specific subsets of data that should be included in or excluded from discovery
- Actions that you want Sensitive Data Protection to take after discovery—for example, which Google Cloud services to publish the profiles to
- Service agent to use for discovery operations
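As a rough illustration, the settings listed above can be pictured as a single record. The following Python sketch is a hypothetical data structure for reasoning about a configuration; the class name, field names, and validation rules are assumptions made for this example and are not the Sensitive Data Protection API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Scopes named in the settings list above (hypothetical string values).
SCOPES = {"organization", "folder", "project"}


@dataclass
class DiscoveryScanConfig:
    """Hypothetical container for the settings of a discovery scan configuration."""

    scope: str                       # organization, folder, or project
    resource_type: str               # for example "bigquery" or "cloud_storage"
    inspection_templates: List[str]  # names of the inspection templates to use
    frequency: Optional[str] = None  # None means "use the service defaults"
    include_filters: List[str] = field(default_factory=list)
    exclude_filters: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    service_agent: Optional[str] = None

    def __post_init__(self) -> None:
        # Basic sanity checks chosen for this sketch only.
        if self.scope not in SCOPES:
            raise ValueError(f"unknown scope: {self.scope!r}")
        if not self.inspection_templates:
            raise ValueError("at least one inspection template is required")


config = DiscoveryScanConfig(
    scope="folder",
    resource_type="bigquery",
    inspection_templates=["example-template"],  # hypothetical template name
    exclude_filters=["scratch_*"],              # hypothetical filter pattern
)
print(config.scope, config.resource_type)
```

Keeping the settings in one record makes it easy to see which values are required (scope, resource type, inspection templates) and which are optional in this simplified picture.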
For information about how to create a discovery scan configuration, see the following pages:
- Discovery for BigQuery data 
- Discovery for Cloud SQL data 
- Discovery for Cloud Storage data 
- Discovery for Vertex AI data 
- Report secrets in Cloud Run environment variables to Security Command Center (no profiles generated) 
Scan configuration scopes
You can create a scan configuration at the following levels:
- Organization
- Folder
- Project
- Single data resource
At the organization and folder levels, if two or more active scan configurations have the same project in their scope, Sensitive Data Protection determines which scan configuration can generate profiles for that project. For more information, see Overriding scan configurations on this page.
A project-level scan configuration can always profile the target project and does not compete with other configurations at the level of the parent folder or organization.
A single-resource scan configuration is intended to help you explore and test profiling on a single data resource.
Scan configuration location
The first time you create a scan configuration, you specify where you want Sensitive Data Protection to store it. All subsequent scan configurations that you create are stored in that same region.
For example, if you create a scan configuration for Folder A and store it in the us-west1 region, then any scan configuration that you later create for any other resource is also stored in that region.
Metadata about the data to be profiled is copied to the same region as your scan configurations, but the data itself isn't moved or copied. For more information, see Data residency considerations.
Inspection template
An inspection template specifies what information types (or infoTypes) Sensitive Data Protection looks for while scanning your data. Here, you provide a combination of built-in infoTypes and optional custom infoTypes.
You can also provide a likelihood level to narrow down what Sensitive Data Protection considers to be a match. You can add rule sets to exclude unwanted findings or include additional findings.
By default, if you change an inspection template that your scan configuration uses, the changes are applied only to future scans. Your action doesn't cause a reprofile operation on your data.
If you want inspection template changes to trigger reprofile operations on the affected data, add or update a schedule in your scan configuration, and turn on the option to reprofile the data when the inspection template changes. For more information, see Frequency of data profile generation.
You must have an inspection template in each region where you have data to be profiled. If you want to use a single template for multiple regions, you can use a template that is stored in the global region. If organizational policies prevent you from creating an inspection template in the global region, then you must set a dedicated inspection template for each region. For more information, see Data residency considerations.
Inspection templates are a core component of the Sensitive Data Protection platform. Data profiles use the same inspection templates that you can use across all Sensitive Data Protection services. For more information on inspection templates, see Templates.
Service agent container and service agent
When you create a scan configuration for your organization or for a folder, Sensitive Data Protection requires you to provide a service agent container. A service agent container is a Google Cloud project that Sensitive Data Protection uses to track billed charges related to organization- and folder-level profiling operations.
The service agent container contains a service agent, which Sensitive Data Protection uses to profile data on your behalf. You need a service agent to authenticate to Sensitive Data Protection and other APIs. Your service agent must have all the required permissions to access and profile your data. The service agent's ID is in the following format:
service-PROJECT_NUMBER@dlp-api.iam.gserviceaccount.com
Here, the PROJECT_NUMBER is the numerical identifier of the service agent container.
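The ID format above can be assembled with a small helper. This is a minimal sketch; the function name is hypothetical, and the numeric check is an assumption based on the description of PROJECT_NUMBER as a numerical identifier.

```python
from typing import Union


def service_agent_id(project_number: Union[int, str]) -> str:
    """Build the service agent ID for a given service agent container."""
    number = str(project_number)
    # PROJECT_NUMBER is the numerical identifier of the project, not its name.
    if not number.isdigit():
        raise ValueError(f"project number must be numeric, got {project_number!r}")
    return f"service-{number}@dlp-api.iam.gserviceaccount.com"


print(service_agent_id(123456789012))
# service-123456789012@dlp-api.iam.gserviceaccount.com
```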
When setting the service agent container, you can choose an existing project. If the project you select contains a service agent, Sensitive Data Protection grants the required IAM permissions to that service agent. If the project doesn't have a service agent, Sensitive Data Protection creates one and automatically grants data profiling permissions to it.
Alternatively, you can choose to have Sensitive Data Protection automatically create the service agent container and service agent. Sensitive Data Protection automatically grants data profiling permissions to the service agent.
In both cases, if Sensitive Data Protection fails to grant data profiling access to your service agent, it shows an error when you view the scan configuration details.
For project-level scan configurations, you don't need a service agent container. The project that you're profiling acts as the service agent container, and Sensitive Data Protection uses that project's own service agent to run profiling operations.
Data profiling access at the organization or folder level
When you configure profiling at the organization or folder level, Sensitive Data Protection attempts to automatically grant data profiling access to your service agent. However, if you don't have the permissions to grant IAM roles, Sensitive Data Protection can't do this action on your behalf. Someone with those permissions in your organization, such as a Google Cloud administrator, must grant data profiling access to your service agent.
Frequency of data profile generation
After you create a discovery scan configuration for a particular resource, Sensitive Data Protection performs an initial scan, profiling the data in the scope of your scan configuration.
After the initial scan, Sensitive Data Protection continuously monitors the profiled resource. Data added to the resource is automatically profiled shortly after it is added.
Default reprofiling frequency
The default reprofiling frequency differs depending on the discovery type of your scan configuration:
- BigQuery profiling: for each table, wait 30 days and then reprofile the table if it has changes in the schema, table rows, or inspection template.
- Cloud SQL profiling: for each table, wait 30 days and then reprofile the table if it has changes in the schema or inspection template.
- Vertex AI profiling: for each dataset, wait 30 days and then reprofile the dataset if the inspection template has changes.
- File store profiling: for each file store in Google Cloud or in other clouds, wait 30 days and then reprofile the file store if the inspection template has changes.
  - Sensitive Data Protection uses the term file store to refer to a file storage bucket or container.
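The defaults above amount to a lookup from discovery type to the changes that trigger a reprofile once 30 days have passed. The following Python sketch encodes that lookup. The key names and the function are hypothetical labels chosen for this example; the service evaluates these conditions itself.

```python
# Default wait, in days, before a resource is considered for reprofiling.
DEFAULT_WAIT_DAYS = 30

# Changes that trigger a reprofile by default, per discovery type.
# The string labels are hypothetical names for the conditions in the list above.
DEFAULT_TRIGGERS = {
    "bigquery": {"schema", "table_rows", "inspection_template"},
    "cloud_sql": {"schema", "inspection_template"},
    "vertex_ai": {"inspection_template"},
    "file_store": {"inspection_template"},
}


def due_by_default(discovery_type, days_since_profile, changes):
    """Return True if the default rules call for reprofiling the resource."""
    waited_long_enough = days_since_profile >= DEFAULT_WAIT_DAYS
    relevant_changes = DEFAULT_TRIGGERS[discovery_type] & set(changes)
    return waited_long_enough and bool(relevant_changes)


# New rows count for BigQuery tables but not for Cloud SQL tables.
print(due_by_default("bigquery", 31, {"table_rows"}))   # True
print(due_by_default("cloud_sql", 31, {"table_rows"}))  # False
```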
Customizing the reprofiling frequency
In your scan configuration, you can customize the reprofiling frequency by creating one or more schedules for different subsets of your data.
The following reprofiling frequencies are available:
- Do not reprofile: Never reprofile after the initial profiles are generated.
- Reprofile daily: Wait 24 hours before reprofiling.
- Reprofile weekly: Wait 7 days before reprofiling.
- Reprofile monthly: Wait 30 days before reprofiling.
Reprofiling on a schedule
In your scan configuration, you can specify whether a subset of data should be reprofiled regularly regardless of whether the data underwent changes. The frequency you set specifies how much time must pass between profiling operations. For example, if you set the frequency to weekly, Sensitive Data Protection profiles a data resource seven days after it was last profiled.
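To make the waiting periods concrete, the following sketch computes the earliest time at which a resource becomes eligible for its next profile under each frequency. The frequency labels and the function name are assumptions for illustration. The actual start time also depends on when the service picks up the work.

```python
from datetime import datetime, timedelta
from typing import Optional

# Waiting period for each reprofiling frequency described above.
WAIT_BY_FREQUENCY = {
    "never": None,                  # Do not reprofile
    "daily": timedelta(hours=24),   # Reprofile daily
    "weekly": timedelta(days=7),    # Reprofile weekly
    "monthly": timedelta(days=30),  # Reprofile monthly
}


def earliest_next_profile(last_profiled: datetime, frequency: str) -> Optional[datetime]:
    """Return the earliest time the next profile can start, or None if never."""
    wait = WAIT_BY_FREQUENCY[frequency]
    if wait is None:
        return None
    return last_profiled + wait


last = datetime(2024, 1, 1)
print(earliest_next_profile(last, "weekly"))   # 2024-01-08 00:00:00
print(earliest_next_profile(last, "monthly"))  # 2024-01-31 00:00:00
```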
Reprofiling on update
In your scan configuration, you can specify events that can trigger reprofiling operations. Examples of such events are inspection template updates.
When you select these events, the schedule you set specifies the longest time Sensitive Data Protection waits for updates to accumulate before it reprofiles your data. If no applicable changes—like schema changes or inspection template changes—occur within your specified period, no data is reprofiled. When the next applicable change occurs, the affected data is reprofiled at the next opportunity, which is determined by various factors (such as the available machine capacity or the subscription units purchased). Sensitive Data Protection then starts waiting for updates to accumulate again according to your set schedule.
For example, suppose your scan configuration is set to reprofile monthly on schema change. The data profiles were first created on day 0. No schema changes occur by day 30, so no data is reprofiled. On day 35, the first schema change occurs. Sensitive Data Protection reprofiles the updated data at the next opportunity. The system then waits another 30 days for schema updates to accumulate before it reprofiles any updated data.
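The example above can be traced with a simplified model. The sketch below is an assumption-laden approximation, not the service's scheduler: it treats the schedule as the full wait (the text describes it as the longest wait), and it models "the next opportunity" as the same day the change occurs.

```python
def reprofile_days(change_days, period_days, first_profile_day=0):
    """Return the days on which changed data becomes eligible for reprofiling.

    change_days: days on which an applicable change (for example, a schema
    change) occurs. period_days: the schedule, for example 30 for monthly.
    """
    runs = []
    window_start = first_profile_day
    pending = sorted(d for d in change_days if d > first_profile_day)
    while pending:
        window_end = window_start + period_days
        # Changes inside the waiting window accumulate until the window closes.
        # A change after the window has closed triggers a reprofile right away
        # (modeled here as the day of the change).
        run_day = max(pending[0], window_end)
        runs.append(run_day)
        # Every change up to the run day is covered by this reprofile.
        pending = [d for d in pending if d > run_day]
        # The service starts waiting for updates to accumulate again.
        window_start = run_day
    return runs


# Profiles created on day 0, monthly schedule, first schema change on day 35.
print(reprofile_days([35], period_days=30))      # [35]
# A second change on day 40 waits for the new 30-day window to close.
print(reprofile_days([35, 40], period_days=30))  # [35, 65]
```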
From the time reprofiling begins, it can take up to 24 hours for the operation to complete. If the delay lasts longer than 24 hours and you're in subscription pricing mode, confirm whether you have remaining capacity for the month.
For example scenarios, see Data profiling pricing examples.
To force the discovery service to reprofile your data, see Force a reprofile operation.
Profiling performance
The time it takes to profile your data varies depending on several factors, including, but not limited to, the following:
- Number of data resources being profiled
- Sizes of the data resources
- For tables, the number of columns
- For tables, the data types in the columns
Therefore, Sensitive Data Protection's performance in a past inspection or profiling task isn't indicative of how it will perform in future profiling tasks.
Retention of data profiles
Sensitive Data Protection retains the latest version of a data profile for 13 months. When Sensitive Data Protection reprofiles a data resource, the system replaces that data resource's existing profiles with new ones.
In the following example scenarios, assume that the default profiling frequency for BigQuery is in effect:
- On January 1, Sensitive Data Protection profiles Table A. Table A does not change in over a year, and so it's not profiled again. In this case, Sensitive Data Protection retains the data profiles for Table A for 13 months before deleting them. 
- On January 1, Sensitive Data Protection profiles Table A. Within the month, someone in your organization updates the schema of that table. Because of this change, the following month, Sensitive Data Protection automatically reprofiles Table A. The newly generated data profiles overwrite the ones that were created in January. 
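The retention window in these scenarios can be estimated with simple date arithmetic. The following sketch assumes that the 13 months are counted in calendar months from the most recent profiling date; the helper names are hypothetical, and the exact deletion time is determined by the service.

```python
import calendar
from datetime import date

RETENTION_MONTHS = 13  # retention period for the latest version of a profile


def add_months(d: date, months: int) -> date:
    """Add calendar months to a date, clamping to the end of shorter months."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def retention_end(last_profiled: date) -> date:
    """Estimate when the latest profile version reaches the end of retention."""
    return add_months(last_profiled, RETENTION_MONTHS)


# Scenario 1: profiled on January 1 and never reprofiled.
print(retention_end(date(2024, 1, 1)))  # 2025-02-01
# Scenario 2: reprofiled the following month, so the estimate moves forward.
print(retention_end(date(2024, 2, 1)))  # 2025-03-01
```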
For information on how Sensitive Data Protection charges for profiling data, see Discovery pricing.
If you want to retain data profiles indefinitely or keep a record of the changes they undergo, consider saving the data profiles to BigQuery when you configure profiling. You choose which BigQuery dataset to save the profiles to, and you control the table expiration policy for that dataset.
Overriding scan configurations
You can create only one scan configuration for each combination of scope and discovery type. For example, you can create only one organization-level scan configuration for BigQuery data profiling and one organization-level scan configuration for secrets discovery. Similarly, you can create only one project-level scan configuration for BigQuery data profiling and one project-level scan configuration for secrets discovery.
If two or more active scan configurations have the same project and discovery type in their scope, the following rules apply:
- Among organization-level and folder-level scan configurations, the one that is closest to the project will be able to run discovery for that project. This rule applies even if a project-level scan configuration with the same discovery type also exists.
- Sensitive Data Protection treats project-level scan configurations independently of organization-level and folder-level configurations. A scan configuration that you create at the project level can't override one that you create for a parent folder or organization.
Consider the following example, where there are three active scan configurations. Assume that all of these scan configurations are for BigQuery data profiling.
Here, Scan configuration 1 applies to the entire organization, Scan configuration 2 applies to the Team B folder, and Scan configuration 3 applies to the Production project. In this example:
- Sensitive Data Protection profiles all tables in projects that aren't in the Team B folder according to Scan configuration 1.
- Sensitive Data Protection profiles all tables in projects in the Team B folder—including tables in the Production project—according to Scan configuration 2.
- Sensitive Data Protection profiles all tables in the Production project according to Scan configuration 3.
In this example, Sensitive Data Protection generates two sets of profiles for the Production project—one set for each of the following scan configurations:
- Scan configuration 2
- Scan configuration 3
However, even though there are two sets of profiles for the same project, you don't see them all together in your dashboard. You only see the profiles that were generated in the resource—organization, folder, or project—and region that you're viewing.
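The override rules can be expressed as a short selection routine. The sketch below is a simplified model with hypothetical names and resource IDs: it assumes that all scan configurations are active and share the same discovery type, and that each project's ancestry is listed from the project up to the organization.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScanConfig:
    name: str
    level: str     # "organization", "folder", or "project"
    resource: str  # hypothetical ID of the resource the configuration targets


def effective_configs(ancestry: List[str], configs: List[ScanConfig]) -> List[str]:
    """Return the names of the configurations that profile a project.

    ancestry: resource IDs ordered from the project up to the organization.
    """
    project, parents = ancestry[0], ancestry[1:]
    selected = []
    # Among organization- and folder-level configurations, the closest one wins.
    for resource in parents:
        matches = [c for c in configs if c.level != "project" and c.resource == resource]
        if matches:
            selected.append(matches[0])
            break
    # Project-level configurations are treated independently of the others.
    selected += [c for c in configs if c.level == "project" and c.resource == project]
    return [c.name for c in selected]


configs = [
    ScanConfig("Scan configuration 1", "organization", "org"),
    ScanConfig("Scan configuration 2", "folder", "team-b"),
    ScanConfig("Scan configuration 3", "project", "production"),
]

# The Production project sits in the Team B folder.
print(effective_configs(["production", "team-b", "org"], configs))
# ['Scan configuration 2', 'Scan configuration 3']

# A project outside the Team B folder falls back to the organization level.
print(effective_configs(["other-project", "team-a", "org"], configs))
# ['Scan configuration 1']
```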
For more information on Google Cloud's resource hierarchy, see Resource hierarchy.
Data profile snapshots
Each data profile includes a snapshot of the scan configuration and the inspection template that were used to generate it. You can use this snapshot to check the settings that you used to generate a particular data profile.
Data residency considerations
Data residency considerations differ depending on whether you are scanning Google Cloud data or data from other cloud providers.
Data residency considerations for Google Cloud data
This section applies only to sensitive data discovery for Google Cloud resources. For data residency considerations related to resources from other cloud providers, see Data residency considerations for data from other cloud providers on this page.
Sensitive Data Protection is designed to support data residency. If you must comply with data residency requirements, consider the following points:
Regional inspection templates
Sensitive Data Protection processes your data in the same region where that data is stored. That is, your data doesn't leave its current region.
Furthermore, an inspection template can only be used to profile data that resides in the same region as that template. For example, if you configure discovery to use an inspection template that is stored in the us-west1 region, Sensitive Data Protection can only profile data in that region.
You can set a dedicated inspection template for each region where you have data. If you provide an inspection template that's stored in the global region, Sensitive Data Protection uses that template for data in regions with no dedicated inspection template.
The following table provides example scenarios:
| Scenario | Support |
|---|---|
| Scan data in the us region using an inspection template from the us region. | Supported |
| Scan data in the global region using an inspection template from the us region. | Not supported |
| Scan data in the us region using an inspection template from the global region. | Supported |
| Scan data in the us region using an inspection template from the us-east1 region. | Not supported |
| Scan data in the us-east1 region using an inspection template from the us region. | Not supported |
| Scan data in the us region using an inspection template from the asia region. | Not supported |
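The scenarios in the table follow one rule: a template is usable when it is stored in the same region as the data or in the global region. The following sketch captures that rule, along with the fallback from a dedicated template to a global one. The function names and template names are hypothetical, and regions are compared as plain strings.

```python
from typing import Dict, Optional


def template_supported(data_region: str, template_region: str) -> bool:
    """Return True if a template in template_region can profile data_region."""
    return template_region == "global" or template_region == data_region


def pick_template(data_region: str, templates_by_region: Dict[str, str]) -> Optional[str]:
    """Prefer the dedicated template for a region, else fall back to global."""
    if data_region in templates_by_region:
        return templates_by_region[data_region]
    return templates_by_region.get("global")


print(template_supported("us", "global"))    # True
print(template_supported("us", "us-east1"))  # False

templates = {"us-west1": "west-template", "global": "global-template"}
print(pick_template("us-west1", templates))  # west-template
print(pick_template("asia", templates))      # global-template
```

Note that multi-regions such as us are compared as their own region here, so a us template does not match data in us-east1.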
Data profile configuration
When Sensitive Data Protection creates data profiles, it takes a snapshot of your scan configuration and inspection template and stores them in each table data profile or file store data profile.
If you configure discovery to use an inspection template from the global region, then Sensitive Data Protection copies that template to any region that has data to be profiled. Similarly, it copies the scan configuration to those regions.
Consider this example: Project A contains Table 1. Table 1 is in the us-west1 region; the scan configuration is in the us-west2 region; and the inspection template is in the global region.
When Sensitive Data Protection scans Project A, it creates data profiles for Table 1 and stores them in the us-west1 region. Table 1's table data profile contains copies of the scan configuration and the inspection template used in the profiling operation.
If you don't want your inspection template to be copied to other regions, don't configure Sensitive Data Protection to scan data in those regions.
Regional storage of data profiles
Sensitive Data Protection processes your data in the region or multi-region where they reside and stores the generated data profiles in the same region or multi-region.
To view data profiles in the Google Cloud console, you must first select the region where they reside. If you have data in multiple regions, then you must switch regions to view each set of profiles.
Unsupported regions
If you have data in a region that Sensitive Data Protection doesn't support, then the discovery service skips those data resources and shows an error when you view the data profiles.
Multi-regions
Sensitive Data Protection treats a multi-region as one region, and not a collection of regions. For example, the us multi-region and the us-west1 region are treated as two separate regions as far as data residency is concerned.
Zonal resources
Sensitive Data Protection is a regional and multi-regional service; it doesn't distinguish between zones. For a supported zonal resource like a Cloud SQL instance, the data is processed in its current region, but not necessarily its current zone. For example, if a Cloud SQL instance is stored in the us-central1-a zone, Sensitive Data Protection processes and stores the data profiles in the us-central1 region.
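Mapping a zone to the region where its profiles are stored is a matter of dropping the zone suffix. This is a minimal sketch that assumes the usual Google Cloud naming pattern, in which a zone name is the region name followed by a hyphen and a single letter; the function name is hypothetical.

```python
def region_of_zone(zone: str) -> str:
    """Return the region that contains a zone, for example us-central1-a."""
    region, separator, suffix = zone.rpartition("-")
    # Expect "<region>-<letter>"; anything else is not treated as a zone name.
    if not separator or not region or len(suffix) != 1 or not suffix.isalpha():
        raise ValueError(f"not a zone name: {zone!r}")
    return region


print(region_of_zone("us-central1-a"))  # us-central1
```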
For general information about Google Cloud locations, see Geography and regions.
Data residency considerations for data from other cloud providers
Consider the following when you plan to profile data from other cloud providers:
- The data profiles are stored alongside the discovery scan configuration. In contrast, when you profile Google Cloud data, the profiles are stored in the same region as the data to be profiled.
- If you store your inspection template in the global region, an in-memory copy of that template is read in the region where you store the discovery scan configuration.
- Your data is not modified. An in-memory copy of your data is read in the region where you store the discovery scan configuration. However, Sensitive Data Protection makes no guarantees about where the data passes through after it reaches the public internet. The data is encrypted with SSL.
Compliance
For information on how Sensitive Data Protection handles your data and helps you meet compliance requirements, see Data security.
What's next
- Read the Identity & Security blog post Automatic data risk management for BigQuery using Sensitive Data Protection. 
- Learn how to estimate data profiling cost. 
- Learn how to profile data at the organization, folder, or project level. 
- Learn about how Sensitive Data Protection calculates the data risk and sensitivity levels when profiling your data. 
- Learn how to remediate discovery findings. 
- Learn how to troubleshoot issues with the data profiler. 
 
  