Data profile scans for unstructured data use Vertex AI Gemini 2.5 Pro models during a Cloud Storage discovery scan to transform raw, unstructured files in Cloud Storage (such as PDFs) into structured, queryable assets in BigQuery. This automated workflow is designed for users starting with raw files in Cloud Storage. If you already have BigQuery object tables or want to guide the extraction using a customized prompt, see Use data profile for unstructured data.
This document describes how to set up the necessary permissions, prepare your unstructured files, create a Cloud Storage discovery scan with semantic inference enabled using the Google Cloud console or the REST API, view the generated insights, curate graph profiles, and extract the data into BigQuery.
Before you begin
Before you create a discovery scan, ensure you have the required permissions and APIs enabled.
Enable APIs
Enable the following APIs in your project:
- dataplex.googleapis.com
- bigquery.googleapis.com
- aiplatform.googleapis.com (Vertex AI)
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
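The three services can be enabled in a single `gcloud services enable` call. The following Python sketch only assembles that command as an argument list and prints it; it does not run anything. The project ID `example-project` and the function name `enable_apis_command` are placeholders chosen for illustration.

```python
import shlex

# APIs named in this section.
REQUIRED_APIS = [
    "dataplex.googleapis.com",
    "bigquery.googleapis.com",
    "aiplatform.googleapis.com",  # Vertex AI
]


def enable_apis_command(project_id):
    """Return the argument list that enables all required APIs in one gcloud call."""
    return ["gcloud", "services", "enable", *REQUIRED_APIS, f"--project={project_id}"]


if __name__ == "__main__":
    # "example-project" is a placeholder project ID.
    print(shlex.join(enable_apis_command("example-project")))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.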
Required roles and permissions
To configure and run data profile scans for unstructured data, you must satisfy the baseline permissions for a discovery scan and then grant additional roles for semantic inference across multiple service agents.
Baseline discovery scan roles
Ensure that you and the service accounts used by Knowledge Catalog have the baseline permissions required for a standard discovery scan. For a complete list, see Discover and catalog Cloud Storage data.
Additional roles for semantic inference
In addition to baseline discovery roles, ensure that you and the service accounts have the following additional Identity and Access Management (IAM) roles.
Summary of additional identities and roles
| Identity type | Typical principal format | Required IAM roles | Core purpose |
|---|---|---|---|
| End user | Your Google Cloud user account | roles/dataplex.dataScanEditor, roles/dataplex.catalogEditor, roles/bigquery.dataEditor, roles/bigquery.jobUser | You use these additional roles to configure scans, view AI-generated results, curate graph profiles, and trigger the final data extraction. |
| Dataplex discovery service agent | service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com | roles/aiplatform.user, roles/dataplex.discoveryServiceAgent, roles/bigquery.jobUser, roles/bigquery.dataViewer | This Google-managed service agent uses these additional roles to call Vertex AI to generate inferred schemas and metadata. |
| BigQuery connection service account | A unique identity associated with your connection (for example, bqcx-PROJECT_NUMBER-ID@gcp-sa-bigquery-condel.iam.gserviceaccount.com) | roles/storage.objectViewer (on the bucket), roles/aiplatform.user (on the project) | It connects BigQuery to external storage, allowing BigQuery to read the raw files, create object tables, and run AI inference without exposing your personal user credentials. |
| Pipeline execution service account (Optional) | A user-managed service account | roles/bigquery.dataEditor, roles/bigquery.jobUser, roles/bigquery.user, roles/aiplatform.user | If you choose to extract data using an automated pipeline, this identity runs the background jobs to materialize the AI-generated entities into BigQuery tables. |
| Default Dataform service account (Optional) | service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com | roles/iam.serviceAccountTokenCreator (on the pipeline execution service account) | When using the pipeline extraction method, Dataform requires permission to impersonate your pipeline execution service account to orchestrate the workflow. |
End user roles and permissions
To ensure that your user account has the necessary permissions to create scans, view insights, curate graph profiles, and extract data, ask your administrator to grant the following IAM roles to your user account on the project:
- Create scans and view insights:
  - Dataplex DataScan Editor (roles/dataplex.dataScanEditor)
  - Dataplex Catalog Editor (roles/dataplex.catalogEditor)
- Extract data using SQL or pipeline:
  - BigQuery Data Editor (roles/bigquery.dataEditor)
  - BigQuery Job User (roles/bigquery.jobUser)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create scans, view insights, curate graph profiles, and extract data. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create scans, view insights, curate graph profiles, and extract data:
- DataScans:
  - dataplex.datascans.create
  - dataplex.datascans.get
  - dataplex.datascans.getData
  - dataplex.datascans.list
  - dataplex.datascans.update
- Data extraction:
  - bigquery.tables.create
  - bigquery.tables.update
  - bigquery.tables.getData
  - bigquery.jobs.create
Your administrator might also be able to give your user account these permissions with custom roles or other predefined roles.
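Before requesting broader roles, it can help to compare the permissions an account already holds against the list above. The sketch below is a minimal, offline example: `granted` stands for whatever permission names a check such as the Resource Manager `projects.testIamPermissions` method reports for your account (making that call is not shown), and the function name `missing_permissions` is illustrative.

```python
# Permissions listed in this section for the end user.
REQUIRED_END_USER_PERMISSIONS = {
    "dataplex.datascans.create",
    "dataplex.datascans.get",
    "dataplex.datascans.getData",
    "dataplex.datascans.list",
    "dataplex.datascans.update",
    "bigquery.tables.create",
    "bigquery.tables.update",
    "bigquery.tables.getData",
    "bigquery.jobs.create",
}


def missing_permissions(granted):
    """Return the required permissions that are absent from `granted`, sorted by name."""
    return sorted(REQUIRED_END_USER_PERMISSIONS - set(granted))


if __name__ == "__main__":
    # Hypothetical result of a permission check for one user.
    granted = ["dataplex.datascans.get", "dataplex.datascans.list", "bigquery.jobs.create"]
    for permission in missing_permissions(granted):
        print("missing:", permission)
```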
Dataplex discovery service agent roles and permissions
The Dataplex discovery service agent (usually service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com) is a Google-managed service agent that runs scans and performs semantic inference using Vertex AI.
To ensure that the Dataplex discovery service agent has the necessary permissions to run scans and perform semantic inference using Vertex AI, ask your administrator to grant it the following IAM roles on the project:
- Vertex AI User (roles/aiplatform.user)
- Dataplex Discovery Service Agent (roles/dataplex.discoveryServiceAgent)
- BigQuery Job User (roles/bigquery.jobUser)
- BigQuery Data Viewer (roles/bigquery.dataViewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to run scans and perform semantic inference using Vertex AI. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run scans and perform semantic inference using Vertex AI:
- aiplatform.endpoints.predict
- bigquery.datasets.create
- bigquery.datasets.get
- bigquery.tables.get
- bigquery.tables.getData
- storage.buckets.get
- storage.objects.get
- storage.objects.list
Your administrator might also be able to give the Dataplex discovery service agent these permissions with custom roles or other predefined roles.
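An administrator who manages project IAM through the policy document (read the policy, modify it, write it back) can add the service agent to each role as sketched below. This is an illustrative helper, not an official client: fetching and writing the policy is omitted, the function names are hypothetical, and the project number is a placeholder. The policy shape (`bindings` with `role` and `members`) follows the standard IAM policy format.

```python
import copy

# Roles listed above for the Dataplex discovery service agent.
DISCOVERY_AGENT_ROLES = [
    "roles/aiplatform.user",
    "roles/dataplex.discoveryServiceAgent",
    "roles/bigquery.jobUser",
    "roles/bigquery.dataViewer",
]


def discovery_agent_member(project_number):
    """Build the IAM member string for the usual discovery service agent address."""
    return f"serviceAccount:service-{project_number}@gcp-sa-dataplex.iam.gserviceaccount.com"


def add_bindings(policy, member, roles):
    """Return a copy of an IAM policy with `member` added to each role in `roles`."""
    updated = copy.deepcopy(policy)  # leave the caller's policy untouched
    bindings = updated.setdefault("bindings", [])
    for role in roles:
        # Reuse only unconditional bindings; conditional bindings are left alone.
        binding = next(
            (b for b in bindings if b["role"] == role and "condition" not in b), None
        )
        if binding is None:
            bindings.append({"role": role, "members": [member]})
        elif member not in binding.setdefault("members", []):
            binding["members"].append(member)
    return updated


if __name__ == "__main__":
    # Placeholder project number and a minimal existing policy.
    existing = {"etag": "abc", "bindings": []}
    print(add_bindings(existing, discovery_agent_member("123456789012"), DISCOVERY_AGENT_ROLES))
```

Because the helper copies the policy and keeps fields such as `etag`, the result can be written back as-is, and calling it twice does not duplicate members.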
BigQuery connection service account roles and permissions
A BigQuery Cloud resource connection lets Knowledge Catalog access unstructured data stored in Cloud Storage. When you create a connection, BigQuery automatically creates a dedicated service account on your behalf. This service account serves as the identity used to connect to your external data source.
By default, this service account doesn't have any permissions. You must explicitly grant this service account the required IAM roles on the Cloud Storage buckets containing your data. You can use an existing BigQuery connection or create a new one in the same location as your source Cloud Storage bucket. For more information about sharing connections, see Share a connection with users.
To ensure that the BigQuery connection service account has the necessary permissions to read object tables and run inference, ask your administrator to grant it the following IAM roles. You can retrieve the service account ID from the Connection info section of your connection details.
- Storage Object Viewer (roles/storage.objectViewer) on the bucket containing unstructured data
- Vertex AI User (roles/aiplatform.user) on the project
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to read object tables and run inference. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to read object tables and run inference:
- storage.buckets.get on the bucket containing unstructured data
- storage.objects.get on the bucket containing unstructured data
- aiplatform.endpoints.predict on the project
Your administrator might also be able to give the BigQuery connection service account these permissions with custom roles or other predefined roles.
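The two roles for the connection service account are granted at different scopes: one on the bucket, one on the project. The sketch below assembles the corresponding `gcloud` commands as argument lists and prints them without running anything. The bucket, project, and service account values are placeholders, and `connection_grant_commands` is an illustrative name.

```python
import shlex


def connection_grant_commands(project_id, bucket_name, connection_sa):
    """Return gcloud argument lists for the bucket-level and project-level grants."""
    member = f"--member=serviceAccount:{connection_sa}"
    return [
        # Bucket scope: lets the connection read the unstructured files.
        ["gcloud", "storage", "buckets", "add-iam-policy-binding",
         f"gs://{bucket_name}", member, "--role=roles/storage.objectViewer"],
        # Project scope: lets the connection call Vertex AI for inference.
        ["gcloud", "projects", "add-iam-policy-binding",
         project_id, member, "--role=roles/aiplatform.user"],
    ]


if __name__ == "__main__":
    # All three values are placeholders.
    commands = connection_grant_commands(
        "example-project",
        "example-bucket",
        "bqcx-123456789012-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com",
    )
    for command in commands:
        print(shlex.join(command))
```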
Pipeline execution service account roles and permissions (Optional)
If you choose to extract the inferred data using an automated pipeline, you must create or provide a dedicated service account to run the pipeline. This execution service account acts as the identity that authenticates and runs the background data extraction and analysis tasks in BigQuery. Additionally, you must grant the default Dataform service account permission to impersonate this execution service account.
To ensure that the pipeline execution service account has the necessary permissions to extract the inferred entities and relationships using a pipeline, ask your administrator to grant the following IAM roles to the pipeline execution service account on the project:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Job User (roles/bigquery.jobUser)
- BigQuery User (roles/bigquery.user)
- Vertex AI User (roles/aiplatform.user)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to extract the inferred entities and relationships using a pipeline. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to extract the inferred entities and relationships using a pipeline:
- bigquery.tables.create
- bigquery.tables.update
- bigquery.tables.get
- bigquery.tables.getData
- bigquery.jobs.create
- aiplatform.endpoints.predict
Your administrator might also be able to give the pipeline execution service account these permissions with custom roles or other predefined roles.
To ensure that the default Dataform service account (service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com) has the necessary permissions to impersonate the pipeline execution service account, ask your administrator to grant it the following IAM role on the pipeline execution service account:
- Service Account Token Creator (roles/iam.serviceAccountTokenCreator)
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permission required to impersonate the pipeline execution service account. To see the exact permission that is required, expand the Required permissions section:
Required permissions
The following permission is required to impersonate the pipeline execution service account:
- iam.serviceAccounts.getAccessToken
Your administrator might also be able to give the default Dataform service account this permission with custom roles or other predefined roles.
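Unlike the project-level grants, the impersonation role is granted on the pipeline execution service account itself, so the service account is the resource and the Dataform service agent is the member. The sketch below assembles that `gcloud iam service-accounts add-iam-policy-binding` command and prints it. The project values and the service account name `pipeline-runner` are hypothetical.

```python
import shlex


def dataform_agent(project_number):
    """Return the default Dataform service account address for a project number."""
    return f"service-{project_number}@gcp-sa-dataform.iam.gserviceaccount.com"


def impersonation_grant_command(project_id, project_number, pipeline_sa_email):
    """Return the gcloud argument list that lets Dataform impersonate the pipeline account."""
    return [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding",
        pipeline_sa_email,  # the resource being shared
        f"--member=serviceAccount:{dataform_agent(project_number)}",
        "--role=roles/iam.serviceAccountTokenCreator",
        f"--project={project_id}",
    ]


if __name__ == "__main__":
    # Placeholder project and a hypothetical user-managed service account.
    print(shlex.join(impersonation_grant_command(
        "example-project",
        "123456789012",
        "pipeline-runner@example-project.iam.gserviceaccount.com",
    )))
```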
Prepare unstructured data
Before you run a discovery scan, you must upload your unstructured data to a Cloud Storage bucket. Data profile scans for unstructured data are optimized for analyzing PDF documents.
For more information about storing and managing files in Cloud Storage, see Upload objects.
Create a Cloud resource connection
To publish discovery scan results as a BigQuery object table, you must create a Cloud resource connection and grant its service account access to your unstructured data in Cloud Storage.
- Create a Cloud resource connection.
- Grant the Storage Object Viewer (roles/storage.objectViewer) role to the service account associated with the connection on the Cloud Storage bucket containing your unstructured data. For more information, see Grant access to the service account.
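The two steps above can be sketched with the `bq` command-line tool. The Python below only builds the commands and parses output you supply; it does not call `bq` itself. The connection, project, and location values are placeholders, and the assumption that the JSON printed by `bq show` mirrors the Connection resource (with the service account under `cloudResource.serviceAccountId`) should be checked against your own output.

```python
import json
import shlex


def create_connection_command(project_id, location, connection_id):
    """Return the bq argument list that creates a Cloud resource connection."""
    return [
        "bq", "mk", "--connection",
        f"--location={location}",
        f"--project_id={project_id}",
        "--connection_type=CLOUD_RESOURCE",
        connection_id,
    ]


def show_connection_command(project_id, location, connection_id):
    """Return the bq argument list that prints the connection details as JSON."""
    return ["bq", "--format=json", "show", "--connection",
            f"{project_id}.{location}.{connection_id}"]


def service_account_from_show_output(output):
    """Extract the connection's service account from the JSON details (assumed layout)."""
    return json.loads(output)["cloudResource"]["serviceAccountId"]


if __name__ == "__main__":
    # Placeholder identifiers.
    print(shlex.join(create_connection_command("example-project", "us-central1", "example-conn")))
    print(shlex.join(show_connection_command("example-project", "us-central1", "example-conn")))
```

The service account returned by `service_account_from_show_output` is the principal that needs the Storage Object Viewer role on the bucket.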
Create a discovery scan for unstructured data
To extract semantic insights from your unstructured data, you must first create a Cloud Storage discovery scan. This scan automatically locates your unstructured files in Cloud Storage and catalogs them into an object table. By enabling semantic inference during this process, Knowledge Catalog uses Vertex AI Gemini 2.5 Pro models to analyze the files and generate inferred metadata, schemas, and relationships.
You can create a Cloud Storage discovery scan with semantic inference enabled using the Google Cloud console or the REST API.
Console
In the Google Cloud console, go to the Metadata curation page.
In the Cloud Storage discovery tab, click Create.
Enter a name for the scan.
To select the Cloud Storage bucket containing your unstructured data, click Browse.
For Unstructured data options, select the Enable semantic inference checkbox.
In the Connection ID field, specify the BigQuery connection used to access the files.
The discovery scan automatically catalogs unstructured data into BigQuery by creating object tables. Because object tables securely decouple the data access credentials from the user executing queries, a connection is required to authenticate with Cloud Storage and read the files.
Click Run now (for an on-demand scan) or Create (for a scheduled scan).
For details on all available configurations, see Discover and catalog Cloud Storage data.
Knowledge Catalog creates an object table and enriches the catalog entry with AI-generated metadata. This process usually takes a few minutes for standard datasets.
REST
To create a Cloud Storage discovery scan with semantic inference enabled using the REST API, use the dataScans.create method with a dataDiscoverySpec.

POST https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN

{
  "description": "Cloud Storage discovery scan with semantic inference",
  "data": {
    "resource": "//storage.googleapis.com/BUCKET_NAME"
  },
  "executionSpec": {
    "trigger": {
      "onDemand": {}
    }
  },
  "dataDiscoverySpec": {
    "bigqueryPublishingConfig": {
      "tableType": "OBJECT_TABLE",
      "connection": "projects/PROJECT_ID/locations/LOCATION/connections/CONNECTION_ID"
    },
    "unstructuredDataEventsConfig": {
      "enabled": true
    }
  }
}
Replace the following:
- PROJECT_ID: the ID of your Google Cloud project.
- LOCATION: the Google Cloud region (must support Gemini 2.5 Pro).
- DATASCAN: the name of the discovery scan.
- BUCKET_NAME: the Cloud Storage bucket containing unstructured data.
- CONNECTION_ID: the BigQuery connection ID.
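Filling in the placeholders by hand is error-prone because PROJECT_ID and LOCATION each appear in both the URL and the connection path. The following sketch builds the URL and request body from one set of inputs, using the same fields as the request in this section. It does not send the request; the function name `build_create_scan_request` and the example values are illustrative.

```python
import json
from urllib.parse import quote, urlencode


def build_create_scan_request(project_id, location, datascan_id, bucket_name, connection_id):
    """Return (url, body) for a dataScans.create call with semantic inference enabled."""
    url = (
        "https://dataplex.googleapis.com/v1/"
        f"projects/{quote(project_id)}/locations/{quote(location)}/dataScans?"
        + urlencode({"dataScanId": datascan_id})
    )
    body = {
        "description": "Cloud Storage discovery scan with semantic inference",
        "data": {"resource": f"//storage.googleapis.com/{bucket_name}"},
        "executionSpec": {"trigger": {"onDemand": {}}},
        "dataDiscoverySpec": {
            "bigqueryPublishingConfig": {
                "tableType": "OBJECT_TABLE",
                # Same project and location as the scan, as in the request in this section.
                "connection": (
                    f"projects/{project_id}/locations/{location}"
                    f"/connections/{connection_id}"
                ),
            },
            "unstructuredDataEventsConfig": {"enabled": True},
        },
    }
    return url, body


if __name__ == "__main__":
    # Placeholder values.
    url, body = build_create_scan_request(
        "example-project", "us-central1", "example-scan", "example-bucket", "example-conn"
    )
    print("POST", url)
    print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and sent with any HTTP client that attaches an OAuth access token.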
Run the discovery scan
If you configured your discovery scan to run on demand, you must manually trigger the scan to locate your unstructured data and generate insights.
You can trigger a discovery scan using the Google Cloud console or the REST API.
Console
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Metadata curation.
In the Cloud Storage discovery pane, click the discovery scan that you want to run.
Click Run now.
REST
To run an on-demand discovery scan using the REST API, use the
dataScans.run
method:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{}' \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN:run"
Replace the following:
- PROJECT_ID: the ID of your Google Cloud project.
- LOCATION: the Google Cloud region where the discovery scan is located.
- DATASCAN: the name of the discovery scan.
Knowledge Catalog runs the discovery scan, creates an object table, and enriches the catalog entry with AI-generated metadata. This process usually takes a few minutes for standard datasets.
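Because the scan takes a few minutes, scripts that trigger it usually need to wait for the resulting job to finish. The sketch below shows one way to poll, under these assumptions: `get_job` is a hypothetical callable that you supply (for example, a wrapper around an authenticated GET on the job resource) and that returns the job as a dictionary; and the state names are the ones the DataScanJob resource is understood to report, which you should confirm against the API reference.

```python
import time

# Assumed terminal states of a data scan job.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_url(project_id, location, datascan_id):
    """Return the dataScans.run URL used in this section."""
    return (
        "https://dataplex.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/dataScans/{datascan_id}:run"
    )


def wait_for_job(get_job, job_name, poll_seconds=30.0, max_polls=40, sleep=time.sleep):
    """Poll get_job(job_name) until the job reaches a terminal state and return that state."""
    for attempt in range(max_polls):
        state = get_job(job_name).get("state", "STATE_UNSPECIFIED")
        if state in TERMINAL_STATES:
            return state
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


if __name__ == "__main__":
    # Stand-in for a real API call: reports RUNNING twice, then SUCCEEDED.
    fake_states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
    final = wait_for_job(lambda name: {"state": next(fake_states)}, "example-job", sleep=lambda s: None)
    print(run_url("example-project", "us-central1", "example-scan"))
    print("final state:", final)
```

Passing `sleep` and `get_job` as parameters keeps the polling logic testable without network access.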
Locate the object table
After the discovery scan completes, Knowledge Catalog creates one or more object tables and adds a corresponding catalog entry, enriched with AI-generated metadata, for each table. When a discovery scan creates multiple entries, each entry has its own Insights tab. You can view the automated table description, the inferred schemas, and the relationship graphs.
In the Google Cloud console, go to the BigQuery page.
In the navigation menu, click Governance > Metadata curation.
In the Cloud Storage discovery pane, click the discovery scan that you ran for unstructured data.
- The Scan details section shows details about the discovery scan.
- The Scan status section shows the discovery results of the latest scan job.
Click the link for Published dataset.
In the list of tables displayed for the BigQuery dataset, select the object table generated for the discovery data scan.
Copy the table ID. You'll need it in the next section.
Explore discovery scan results
You can view the object table and its inferred semantic graphs in Knowledge Catalog.
In the Google Cloud console, go to the Knowledge Catalog Search page.
Paste and search for the object table whose ID you selected in the previous section.
In the search results, click the table to open its entry page.
On the Details tab, under Aspects, verify the presence of the Graph Profile aspect (dataplex-types.global.graph-profile). This aspect contains the inferred schemas for entities and relationships.
Click the Insights tab. On the Insights tab, you can view the following information:
Semantic extraction. A banner indicates that extractable entities and relationships were detected. It includes an Extract button to materialize the data using SQL or pipeline deployment.
Description. An AI-generated, human-readable summary explains the unstructured data contents. It describes the primary nodes (entities) discovered and how they map to each other through edges (relationships).
Pipelines. A list of previously deployed data extraction pipelines associated with this resource. You can view the display name, region, creation time, and the user who created the pipeline.
Inferred entities and relationships. A visual, interactive graph displays the discovered semantic structure of your unstructured data. The graph contains nodes representing distinct entities, for example, Recipe and Ingredient, and edges representing the connections between them, for example, HasAllergenStatus. You can use the legend to filter and explore specific nodes and edges.
Entities. A detailed list of the discovered primary entities. You can expand each entity to view its AI-generated description and its inferred schema, which includes field names, data types, and field descriptions.
Relationships. A detailed list of the discovered connections between entities. You can expand each relationship to view its description and the schema defining how the entities map to one another.
Update inferred insights
Inferred insights are stored in Knowledge Catalog as an aspect attached to the object table. You can update these insights manually using the REST API.
REST
To update inferred insights using the REST API, follow these steps:
Create a file named payload.json and add the JSON content of the aspect you want to update. For example:

{
  "aspects": {
    "dataplex-types.global.graph-profile": {
      "data": {
        "nodeTypes": [],
        "edgeTypes": []
      }
    }
  }
}

Run the following command in your terminal:

curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d @payload.json \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/entryGroups/ENTRY_GROUP_ID/entries/ENTRY_ID?updateMask=aspects"

Replace the following:
- PROJECT_ID: the ID of your project, for example, example-project
- LOCATION: the location of the entry, for example, us-central1
- ENTRY_GROUP_ID: the ID of the entry group, for example, example-entry-group (for BigQuery object tables, use @bigquery)
- ENTRY_ID: the ID of the entry, for example, example-entry (retrieve this from the Overview tab of the entry details page in the Google Cloud console)
For more information and code samples in other languages, see Update an entry aspect.
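When the graph profile is edited programmatically, generating payload.json from code avoids hand-editing nested JSON. The sketch below builds the PATCH URL and payload from the same placeholders used in this section and rejects a profile that lacks the two top-level lists shown in the example. It does not send the request. The check on `nodeTypes` and `edgeTypes` is based only on the example payload here, not on a published schema, and the function name is illustrative.

```python
import json
from urllib.parse import quote

GRAPH_PROFILE_ASPECT = "dataplex-types.global.graph-profile"


def build_aspect_update(project_id, location, entry_group_id, entry_id, graph_profile):
    """Return (url, payload) for updating the graph profile aspect of an entry."""
    for key in ("nodeTypes", "edgeTypes"):
        if not isinstance(graph_profile.get(key), list):
            raise ValueError(f"graph profile needs a list under '{key}'")
    url = (
        "https://dataplex.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/"
        # Keep '@' and '/' readable in the path, for example in "@bigquery".
        f"entryGroups/{quote(entry_group_id, safe='@')}/"
        f"entries/{quote(entry_id, safe='/')}"
        "?updateMask=aspects"
    )
    payload = {"aspects": {GRAPH_PROFILE_ASPECT: {"data": graph_profile}}}
    return url, payload


if __name__ == "__main__":
    # Placeholder identifiers and an empty profile, as in the example payload.
    url, payload = build_aspect_update(
        "example-project", "us-central1", "@bigquery", "example-entry",
        {"nodeTypes": [], "edgeTypes": []},
    )
    print("PATCH", url)
    print(json.dumps(payload, indent=2))
```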
Extract data to BigQuery
You can materialize the inferred entities and relationships into structured tables or views in BigQuery using SQL or an automated pipeline.
In the Google Cloud console, go to the Knowledge Catalog Search page.
Search for the object table generated by your scan.
In the search results, click the table to open its entry page.
Click the Insights tab.
On the Insights tab, click Extraction.
Choose one of the following methods based on your analytical needs and the scale of your unstructured data:
Extract by SQL: Choose this option for rapid, ad hoc analysis, small-to-medium datasets, or when you want a zero-infrastructure approach using BigQuery remote models.
To extract using SQL, follow these steps:
- Select Extract by SQL.
- In the Extract with SQL pane, select a destination dataset. The dataset must be in the same location as the source.
- Click Extract.
- In the BigQuery Editor, a pre-populated query opens that uses the ML.PROCESS_DOCUMENT function. Run the query to create standard tables and views.
For more information about using SQL to extract document insights, see Process documents with the ML.PROCESS_DOCUMENT function.
Extract by pipeline: Choose this option for massive-scale data processing or when you require robust retry logic, error handling, and automated orchestration to handle large volumes of documents.
To extract using a pipeline, follow these steps:
- Select Extract by pipeline.
- In the Extract with pipeline pane, enter a display name for the pipeline.
- Select a region.
- Select a destination dataset. The dataset must be in the same location as the source.
- Click Extract. This creates a BigQuery pipeline that orchestrates the data materialization using Dataform.
- Run all tasks in the pipeline to generate structured node and edge views.
For more information about running data workflows, see Introduction to Dataform.
After you extract and materialize the semantic insights into BigQuery, you can perform the following tasks:
Query the structured data. Run standard SQL queries against the newly created tables to analyze the extracted entities and relationships.
Join with existing data. Combine the qualitative insights extracted from your unstructured files with your existing structured BigQuery datasets (such as joining parsed invoice data with your accounting tables).
Explore data insights. Use the Data insights feature in BigQuery Studio to automatically generate natural language questions and SQL queries for your new structured assets.
Analyze with Gemini. Use Gemini in BigQuery to perform conversational analysis, summarize trends, or create dashboards in Data Studio based on the extracted data.
What's next
- Learn how to use data profile for unstructured data.
- Learn more about Discovering data.
- Read About data profiling.