As of April 10, 2026, Dataplex Universal Catalog is now called Knowledge Catalog. The API, client library, CLI, and IAM names remain unchanged. For more information, see Introducing the Google Cloud Knowledge Catalog.

Use discovery scan for unstructured data

Data profile scans for unstructured data use Vertex AI Gemini 2.5 Pro models during a Cloud Storage discovery scan to transform raw, unstructured files in Cloud Storage (such as PDFs) into structured, queryable assets in BigQuery. This automated workflow is designed for users starting with raw files in Cloud Storage. If you already have existing BigQuery object tables or want to guide the extraction using a customized prompt, see Use data profile for unstructured data.

This document describes how to set up the necessary permissions, prepare your unstructured files, create a Cloud Storage discovery scan with semantic inference enabled by using the Google Cloud console or the REST API, view the generated insights, curate graph profiles, and extract the data into BigQuery.

Before you begin

Before you create a discovery scan, ensure you have the required permissions and APIs enabled.

Enable APIs

Enable the following APIs in your project:

dataplex.googleapis.com
bigquery.googleapis.com
aiplatform.googleapis.com (Vertex AI)
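
If you prefer the command line, you can enable the three APIs with a single `gcloud services enable` call. The following Python sketch only assembles that command and prints it so that you can review it before running it. The project ID `example-project` is a placeholder, not a value from this document.

```python
import shlex

# APIs listed in this section.
REQUIRED_APIS = [
    "dataplex.googleapis.com",
    "bigquery.googleapis.com",
    "aiplatform.googleapis.com",
]


def enable_apis_command(project_id):
    """Return the argv that enables all required APIs in one project."""
    return ["gcloud", "services", "enable", *REQUIRED_APIS, f"--project={project_id}"]


# Print the command instead of running it; pass the list to subprocess.run to execute it.
print(shlex.join(enable_apis_command("example-project")))
```
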

Roles required to enable APIs

To enable APIs, you need the serviceusage.services.enable permission. If you created the project, then you likely already have this permission through the Owner role (roles/owner). Otherwise, you can get this permission through the Service Usage Admin role (roles/serviceusage.serviceUsageAdmin). Learn how to grant roles.

Enable the APIs

Required roles and permissions

To configure and run data profile scans for unstructured data, you must satisfy the baseline permissions for a discovery scan and then grant additional roles for semantic inference across multiple service agents.

Baseline discovery scan roles

Ensure that you and the service accounts used by Knowledge Catalog have the baseline permissions required for a standard discovery scan. For a complete list, see Discover and catalog Cloud Storage data.

Additional roles for semantic inference

In addition to baseline discovery roles, ensure that you and the service accounts have the following additional Identity and Access Management (IAM) roles.

Summary of additional identities and roles

Identity type	Typical principal format	Required IAM roles	Core purpose
End user	Your Google Cloud user account	Dataplex DataScan Editor; Dataplex Catalog Editor; BigQuery Data Editor; BigQuery Job User	You use these additional roles to configure scans, view AI-generated results, curate graph profiles, and trigger the final data extraction.
Dataplex discovery service agent	`service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com`	Agent Platform User; BigQuery Job User; BigQuery Data Viewer	This Google-managed service agent uses these additional roles to call Vertex AI to generate inferred schemas and metadata.
BigQuery connection service account	A unique identity associated with your connection (for example, `bqcx-PROJECT_NUMBER-ID@gcp-sa-bigquery-condel.iam.gserviceaccount.com`)	Storage Object Viewer (on the source bucket); Agent Platform User (on the project)	This service account connects BigQuery to external storage, allowing BigQuery to read the raw files, create object tables, and run AI inference without exposing your personal user credentials.
Pipeline execution service account (Optional)	A user-managed service account	BigQuery Data Editor; BigQuery Job User; BigQuery User; Agent Platform User	If you choose to extract data using an automated pipeline, this identity runs the background jobs to materialize the AI-generated entities into BigQuery tables.
Default Dataform service account (Optional)	`service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com`	Service Account Token Creator (granted on the pipeline execution service account)	When using the pipeline extraction method, Dataform requires permission to impersonate your pipeline execution service account to orchestrate the workflow.

End user roles and permissions

To ensure that your user account has the necessary permissions to create scans, view insights, curate graph profiles, and extract data, ask your administrator to grant the following IAM roles to your user account on the project:

Create scans and view insights:
- Dataplex DataScan Editor (roles/dataplex.dataScanEditor)
- Dataplex Catalog Editor (roles/dataplex.catalogEditor)
Extract data using SQL or pipeline:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Job User (roles/bigquery.jobUser)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to create scans, view insights, curate graph profiles, and extract data. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to create scans, view insights, curate graph profiles, and extract data:

DataScans:
- dataplex.datascans.create
- dataplex.datascans.get
- dataplex.datascans.getData
- dataplex.datascans.list
- dataplex.datascans.update
Data extraction:
- bigquery.tables.create
- bigquery.tables.update
- bigquery.tables.getData
- bigquery.jobs.create

Your administrator might also be able to give your user account these permissions with custom roles or other predefined roles.
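
An administrator who grants the end user roles from the command line could use one `gcloud projects add-iam-policy-binding` call per role. The following Python sketch builds those commands and prints them without running anything. The project ID and the user email are placeholders.

```python
import shlex

# Predefined roles that this section lists for the end user.
END_USER_ROLES = [
    "roles/dataplex.dataScanEditor",
    "roles/dataplex.catalogEditor",
    "roles/bigquery.dataEditor",
    "roles/bigquery.jobUser",
]


def project_binding_commands(project_id, member, roles):
    """Return one `gcloud projects add-iam-policy-binding` argv per role."""
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role}"]
        for role in roles
    ]


# Placeholder project and user; the commands are printed, not executed.
for argv in project_binding_commands("example-project", "user:alex@example.com", END_USER_ROLES):
    print(shlex.join(argv))
```
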

Dataplex discovery service agent roles and permissions

The Dataplex discovery service agent is a service agent that needs access to run scans and perform semantic inference using Vertex AI.

To ensure that the Dataplex discovery service agent (usually service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com) has the necessary permissions to run scans and perform semantic inference using Vertex AI, ask your administrator to grant it the following IAM roles on the project:

Important: You must grant these roles to the Dataplex discovery service agent (usually service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com), not to your user account. Failure to grant the roles to the correct principal might result in permission errors.

All:
- Agent Platform User (roles/aiplatform.user)
- Dataplex Discovery Service Agent (roles/dataplex.discoveryServiceAgent)
- BigQuery Job User (roles/bigquery.jobUser)
- BigQuery Data Viewer (roles/bigquery.dataViewer)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run scans and perform semantic inference using Vertex AI. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run scans and perform semantic inference using Vertex AI:

All:
- aiplatform.endpoints.predict
- bigquery.datasets.create
- bigquery.datasets.get
- bigquery.tables.get
- bigquery.tables.getData
- storage.buckets.get
- storage.objects.get
- storage.objects.list

Your administrator might also be able to give the Dataplex discovery service agent these permissions with custom roles or other predefined roles.
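
Because these roles must go to the service agent rather than to your user account, it can help to check which of them are still missing. The following Python sketch inspects a project IAM policy in the JSON shape printed by `gcloud projects get-iam-policy --format=json`. It is a simplified check: it looks only at direct project-level bindings and ignores inherited roles and IAM conditions. The project number is a placeholder.

```python
# Predefined roles that this section lists for the discovery service agent.
AGENT_ROLES = {
    "roles/aiplatform.user",
    "roles/dataplex.discoveryServiceAgent",
    "roles/bigquery.jobUser",
    "roles/bigquery.dataViewer",
}


def discovery_agent_member(project_number):
    """Build the IAM member string for the usual discovery service agent."""
    return f"serviceAccount:service-{project_number}@gcp-sa-dataplex.iam.gserviceaccount.com"


def missing_roles(policy, member, required=AGENT_ROLES):
    """Return the required roles that the policy does not bind to the member."""
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return sorted(set(required) - granted)


# Example policy with a placeholder project number and a single binding.
agent = discovery_agent_member(123456789012)
policy = {"bindings": [{"role": "roles/aiplatform.user", "members": [agent]}]}
print(missing_roles(policy, agent))
```
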

BigQuery connection service account roles and permissions

A BigQuery Cloud resource connection lets Knowledge Catalog access unstructured data stored in Cloud Storage. When you create a connection, BigQuery automatically creates a dedicated service account on your behalf. This service account serves as the identity used to connect to your external data source.

By default, this service account doesn't have any permissions. You must explicitly grant this service account the required IAM roles on the Cloud Storage buckets containing your data. You can use an existing BigQuery connection or create a new one in the same location as your source Cloud Storage bucket. For more information about sharing connections, see Share a connection with users.

To ensure that the BigQuery connection service account has the necessary permissions to read object tables and run inference, ask your administrator to grant it the following IAM roles. To find the service account ID, see the Connection info section of your connection details.

All:
- Storage Object Viewer (roles/storage.objectViewer) on the bucket containing unstructured data
- Agent Platform User (roles/aiplatform.user) on the project

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to read object tables and run inference. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to read object tables and run inference:

All:
- storage.buckets.get on the bucket containing unstructured data
- storage.objects.get on the bucket containing unstructured data
- aiplatform.endpoints.predict on the project

Your administrator might also be able to give the BigQuery connection service account these permissions with custom roles or other predefined roles.
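
The two roles for the connection service account are granted at different scopes: Storage Object Viewer on the bucket and Agent Platform User on the project. The following Python sketch builds one `gcloud` command for each scope and prints them without running anything. The project ID, bucket name, and service account address are placeholders.

```python
import shlex


def connection_grant_commands(project_id, bucket, connection_sa):
    """Return argv lists that grant each role at the scope this section names."""
    member = f"serviceAccount:{connection_sa}"
    return [
        # Bucket-level grant: read access to the unstructured files only.
        ["gcloud", "storage", "buckets", "add-iam-policy-binding", f"gs://{bucket}",
         f"--member={member}", "--role=roles/storage.objectViewer"],
        # Project-level grant: permission to call Vertex AI for inference.
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", "--role=roles/aiplatform.user"],
    ]


# Placeholder values; the commands are printed, not executed.
for argv in connection_grant_commands(
    "example-project",
    "example-bucket",
    "bqcx-123456789012-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com",
):
    print(shlex.join(argv))
```
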

Pipeline execution service account roles and permissions (Optional)

If you choose to extract the inferred data using an automated pipeline, you must create or provide a dedicated service account to run the pipeline. This execution service account acts as the identity that authenticates and runs the background data extraction and analysis tasks in BigQuery. Additionally, you must grant the default Dataform service account permission to impersonate this execution service account.

To ensure that the pipeline execution service account has the necessary permissions to extract the inferred entities and relationships using a pipeline, ask your administrator to grant the following IAM roles to the pipeline execution service account on the project:

All:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Job User (roles/bigquery.jobUser)
- BigQuery User (roles/bigquery.user)
- Agent Platform User (roles/aiplatform.user)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to extract the inferred entities and relationships using a pipeline. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to extract the inferred entities and relationships using a pipeline:

All:
- bigquery.tables.create
- bigquery.tables.update
- bigquery.tables.get
- bigquery.tables.getData
- bigquery.jobs.create
- aiplatform.endpoints.predict

Your administrator might also be able to give the pipeline execution service account these permissions with custom roles or other predefined roles.

To ensure that the default Dataform service account (service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com) has the necessary permissions to impersonate the pipeline execution service account, ask your administrator to grant it the following IAM roles on the pipeline execution service account:

Important: You must grant these roles to the default Dataform service account (service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com), not to your user account. Failure to grant the roles to the correct principal might result in permission errors.

All: Service Account Token Creator (roles/iam.serviceAccountTokenCreator)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to impersonate the pipeline execution service account. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to impersonate the pipeline execution service account:

All: iam.serviceAccounts.getAccessToken

Your administrator might also be able to give the default Dataform service account these permissions with custom roles or other predefined roles.
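
Unlike the project-level grants earlier in this section, the Service Account Token Creator role is granted on the pipeline execution service account itself. The following Python sketch builds the corresponding `gcloud iam service-accounts add-iam-policy-binding` command and prints it. The project number and the pipeline service account address are placeholders.

```python
import shlex


def dataform_impersonation_command(project_number, pipeline_sa):
    """Return the argv that lets the default Dataform service account impersonate pipeline_sa."""
    dataform_sa = f"service-{project_number}@gcp-sa-dataform.iam.gserviceaccount.com"
    return [
        "gcloud", "iam", "service-accounts", "add-iam-policy-binding", pipeline_sa,
        f"--member=serviceAccount:{dataform_sa}",
        "--role=roles/iam.serviceAccountTokenCreator",
    ]


# Placeholder values; the command is printed, not executed.
print(shlex.join(dataform_impersonation_command(
    123456789012, "pipeline-runner@example-project.iam.gserviceaccount.com")))
```
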

Prepare unstructured data

Before you run a discovery scan, you must upload your unstructured data to a Cloud Storage bucket. Data profile scans for unstructured data are optimized for analyzing PDF documents.

For more information about storing and managing files in Cloud Storage, see Upload objects.
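
As one way to stage files, the following Python sketch collects the PDF files in a local folder and builds a `gcloud storage cp` command that copies them to a bucket. It only returns the command; nothing is uploaded until you run it. The folder, bucket, and prefix names are whatever you pass in, and the helper name is hypothetical.

```python
from pathlib import Path


def upload_pdfs_command(local_dir, bucket, prefix=""):
    """Return the argv that copies every PDF in local_dir to the bucket, or None if there are none."""
    pdfs = sorted(
        str(path) for path in Path(local_dir).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
    if not pdfs:
        return None
    # A trailing slash makes the destination an explicit folder-like prefix.
    cleaned = prefix.strip("/")
    destination = f"gs://{bucket}/" + (f"{cleaned}/" if cleaned else "")
    return ["gcloud", "storage", "cp", *pdfs, destination]
```

For example, `upload_pdfs_command("./invoices", "example-bucket", "raw")` returns a command that copies the PDFs in `./invoices` to `gs://example-bucket/raw/`; pass the list to `subprocess.run` to execute it.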

Create a Cloud resource connection

To publish discovery scan results as a BigQuery object table, you must create a Cloud resource connection and grant its service account access to your unstructured data in Cloud Storage.

Create a Cloud resource connection.
Grant the Storage Object Viewer (roles/storage.objectViewer) role to the service account associated with the connection on the Cloud Storage bucket containing your unstructured data. For more information, see Grant access to the service account.
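
The two steps above can also be started from the `bq` command-line tool. The following Python sketch builds the `bq mk --connection` command that creates a Cloud resource connection and the `bq show --connection` command that displays it, and then extracts the service account from the connection's JSON. It assumes that the JSON follows the BigQuery Connection API resource, where the account appears under `cloudResource.serviceAccountId`; verify this against your own output. All IDs are placeholders.

```python
import json


def create_connection_command(project_id, location, connection_id):
    """Return the argv that creates a Cloud resource connection."""
    return ["bq", "mk", "--connection", f"--location={location}",
            f"--project_id={project_id}", "--connection_type=CLOUD_RESOURCE",
            connection_id]


def show_connection_command(project_id, location, connection_id):
    """Return the argv that prints the connection details as JSON."""
    return ["bq", "show", "--format=prettyjson", "--connection",
            f"{project_id}.{location}.{connection_id}"]


def connection_service_account(connection_json):
    """Extract the service account ID from the JSON text of a connection.

    Assumes the Connection API field layout (cloudResource.serviceAccountId).
    """
    return json.loads(connection_json)["cloudResource"]["serviceAccountId"]
```

The returned service account is the principal that needs the Storage Object Viewer role on the bucket.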

Create a discovery scan for unstructured data

To extract semantic insights from your unstructured data, you must first create a Cloud Storage discovery scan. This scan automatically locates your unstructured files in Cloud Storage and catalogs them into an object table. By enabling semantic inference during this process, Knowledge Catalog uses Vertex AI Gemini 2.5 Pro models to analyze the files and generate inferred metadata, schemas, and relationships.

You can create a Cloud Storage discovery scan with semantic inference enabled using the Google Cloud console or the REST API.

Console

In the Google Cloud console, go to the Metadata curation page.

Go to Metadata curation
In the Cloud Storage discovery tab, click Create.
Enter a name for the scan.
To select the Cloud Storage bucket containing your unstructured data, click Browse.
For Unstructured data options, select the Enable semantic inference checkbox.
In the Connection ID field, specify the BigQuery connection used to access the files.

The discovery scan automatically catalogs unstructured data into BigQuery by creating object tables. Because object tables securely decouple the data access credentials from the user executing queries, a connection is required to authenticate with Cloud Storage and read the files.
Click Run now (for an on-demand scan) or Create (for a scheduled scan).

For details on all available configurations, see Discover and catalog Cloud Storage data.

Knowledge Catalog creates an object table and enriches the catalog entry with AI-generated metadata. This process usually takes a few minutes for standard datasets.

REST

To create a Cloud Storage discovery scan with semantic inference enabled using the REST API, use the dataScans.create method with a dataDiscoverySpec.

POST https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN
{
  "description": "Cloud Storage discovery scan with semantic inference",
  "data": {
    "resource": "//storage.googleapis.com/BUCKET_NAME"
  },
  "executionSpec": {
    "trigger": {
      "onDemand": {}
    }
  },
  "dataDiscoverySpec": {
    "bigqueryPublishingConfig": {
      "tableType": "OBJECT_TABLE",
      "connection": "projects/PROJECT_ID/locations/LOCATION/connections/CONNECTION_ID"
    },
    "unstructuredDataEventsConfig": {
      "enabled": true
    }
  }
}

Replace the following:

PROJECT_ID: the ID of your Google Cloud project.
LOCATION: the Google Cloud region in which to create the discovery scan. The region must support Gemini 2.5 Pro.
DATASCAN: the name of the discovery scan.
BUCKET_NAME: the Cloud Storage bucket containing unstructured data.
CONNECTION_ID: the BigQuery connection ID.
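
If you want to send the request from a script, the following Python sketch assembles the same `dataScans.create` call by using only the standard library. The field names and resource formats are copied from the request above; the function name and its parameters are hypothetical, and the access token is assumed to come from a command such as `gcloud auth print-access-token`. The function builds the request without sending it.

```python
import json
import urllib.request

API_ROOT = "https://dataplex.googleapis.com/v1"


def build_create_scan_request(project_id, location, scan_id, bucket, connection_id, token):
    """Build (without sending) the dataScans.create request for a discovery scan."""
    body = {
        "description": "Cloud Storage discovery scan with semantic inference",
        "data": {"resource": f"//storage.googleapis.com/{bucket}"},
        "executionSpec": {"trigger": {"onDemand": {}}},
        "dataDiscoverySpec": {
            "bigqueryPublishingConfig": {
                "tableType": "OBJECT_TABLE",
                "connection": (
                    f"projects/{project_id}/locations/{location}"
                    f"/connections/{connection_id}"
                ),
            },
            # Field name taken from the request body in this document.
            "unstructuredDataEventsConfig": {"enabled": True},
        },
    }
    url = (f"{API_ROOT}/projects/{project_id}/locations/{location}"
           f"/dataScans?dataScanId={scan_id}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and read the JSON response.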

Run the discovery scan

If you configured your discovery scan to run on demand, you must manually trigger the scan to locate your unstructured data and generate insights.

You can trigger a discovery scan using the Google Cloud console or the REST API.

Console

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the navigation menu, click Governance > Metadata curation.
In the Cloud Storage discovery pane, click the discovery scan that you want to run.
Click Run now.

REST

To run an on-demand discovery scan using the REST API, use the dataScans.run method:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{}' \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN:run"

Replace the following:

PROJECT_ID: the ID of your Google Cloud project.
LOCATION: the Google Cloud region where the discovery scan is located.
DATASCAN: the name of the discovery scan.

Knowledge Catalog runs the discovery scan, creates an object table, and enriches the catalog entry with AI-generated metadata. This process usually takes a few minutes for standard datasets.
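
Because the scan takes a few minutes, a script that triggers it usually needs to wait for the resulting job to finish. The following Python sketch shows one way to poll. It assumes that you supply a `get_state` function that fetches the scan job (for example, with the `dataScans.jobs.get` method) and returns its `state` string, and it treats `SUCCEEDED`, `FAILED`, and `CANCELLED` as the final states. The helper name and the default intervals are illustrative.

```python
import time

# States after which a data scan job no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Call get_state() until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"scan job still {state} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.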

Locate the object table

After the discovery scan completes, Knowledge Catalog creates one or more object tables and adds a corresponding catalog entry, enriched with AI-generated metadata, for each table. When a discovery scan creates multiple entries, each entry has its own Insights tab. You can view the automated table description, the inferred schemas, and the relationship graphs.

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the navigation menu, click Governance > Metadata curation.
In the Cloud Storage discovery pane, click the discovery scan that you ran for unstructured data.
- The Scan details section shows details about the discovery scan.
- The Scan status section shows the discovery results of the latest scan job.
Click the link for Published dataset.
In the list of tables displayed for the BigQuery dataset, select the object table generated by the discovery scan.
Copy the table ID. You'll need it in the next section.

Explore discovery scan results

You can view the object table and its inferred semantic graphs in Knowledge Catalog.

In the Google Cloud console, go to the Knowledge Catalog Search page.

Go to Search
Paste the table ID that you copied in the previous section, and then search for the object table.
In the search results, click the table to open its entry page.
On the Details tab, under Aspects, verify the presence of the Graph Profile aspect (dataplex-types.global.graph-profile). This aspect contains the inferred schemas for entities and relationships.
Click the Insights tab. On the Insights tab, you can view the following information:
- Semantic extraction. A banner indicates that extractable entities and relationships were detected. It includes an Extract button to materialize the data using SQL or pipeline deployment.
- Description. An AI-generated, human-readable summary explains the unstructured data contents. It describes the primary nodes (entities) discovered and how they map to each other through edges (relationships).
- Pipelines. A list of previously deployed data extraction pipelines associated with this resource. You can view the display name, region, creation time, and the user who created the pipeline.
- Inferred entities and relationships. A visual, interactive graph displays the discovered semantic structure of your unstructured data. The graph contains nodes representing distinct entities, for example, Recipe and Ingredient, and edges representing the connections between them, for example, HasAllergenStatus. You can use the legend to filter and explore specific nodes and edges.
- Entities. A detailed list of the discovered primary entities. You can expand each entity to view its AI-generated description and its inferred schema, which includes field names, data types, and field descriptions.
- Relationships. A detailed list of the discovered connections between entities. You can expand each relationship to view its description and the schema defining how the entities map to one another.

Update inferred insights

Inferred insights are stored in Knowledge Catalog as an aspect attached to the object table. You can update these insights manually using the REST API.

REST

To update inferred insights using the REST API, follow these steps:

Create a file named payload.json and add the JSON content of the aspect you want to update. For example:

{
  "aspects": {
    "dataplex-types.global.graph-profile": {
      "data": {
        "nodeTypes": [],
        "edgeTypes": []
      }
    }
  }
}

Run the following command in your terminal:
```
curl -X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d @payload.json \
"https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/entryGroups/ENTRY_GROUP_ID/entries/ENTRY_ID?updateMask=aspects"
```
Replace the following:
- PROJECT_ID: the ID of your project—for example, example-project
- LOCATION: the location of the entry—for example, us-central1
- ENTRY_GROUP_ID: the ID of the entry group—for example, example-entry-group (for BigQuery object tables, use @bigquery)
- ENTRY_ID: the ID of the entry—for example, example-entry (retrieve this from the Overview tab of the entry details page in the Google Cloud console)

For more information and code samples in other languages, see Update an entry aspect.
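
To avoid hand-editing the payload and the URL, you can generate both. The following Python sketch builds the PATCH URL and the JSON body for the Graph Profile aspect in the same shape as the example above. The function name is hypothetical, and the contents of `node_types` and `edge_types` are passed through unchanged because their schema isn't described in this document. Because the request likely replaces the aspect data, consider starting from the current aspect content rather than from empty lists.

```python
import json

API_ROOT = "https://dataplex.googleapis.com/v1"
GRAPH_PROFILE_KEY = "dataplex-types.global.graph-profile"


def build_graph_profile_patch(project_id, location, entry_group_id, entry_id,
                              node_types, edge_types):
    """Return the PATCH URL and JSON body for updating the Graph Profile aspect."""
    url = (f"{API_ROOT}/projects/{project_id}/locations/{location}"
           f"/entryGroups/{entry_group_id}/entries/{entry_id}?updateMask=aspects")
    payload = {
        "aspects": {
            GRAPH_PROFILE_KEY: {
                "data": {
                    "nodeTypes": list(node_types),
                    "edgeTypes": list(edge_types),
                }
            }
        }
    }
    return url, json.dumps(payload, indent=2)
```

You can write the returned body to payload.json and pass the returned URL to the curl command shown earlier in this section.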

Extract data to BigQuery

You can materialize the inferred entities and relationships into structured tables or views in BigQuery using SQL or an automated pipeline.

In the Google Cloud console, go to the Knowledge Catalog Search page.

Go to Search
Search for the object table generated by your scan.
In the search results, click the table to open its entry page.
Click the Insights tab.
On the Insights tab, click Extraction.
Choose one of the following methods based on your analytical needs and the scale of your unstructured data:
- Extract by SQL: Choose this option for rapid, ad hoc analysis, small-to-medium datasets, or when you want a zero-infrastructure approach using BigQuery remote models.
  
  To extract using SQL, follow these steps:
  1. Select Extract by SQL.
  2. In the Extract with SQL pane, select a destination dataset. The dataset must be in the same location as the source.
  3. Click Extract.
  4. In the BigQuery editor, a prepopulated query that uses the ML.PROCESS_DOCUMENT function opens. Run the query to create standard tables and views.
  For more information about using SQL to extract document insights, see Process documents with the ML.PROCESS_DOCUMENT function.
- Extract by pipeline: Choose this option for massive-scale data processing or when you require robust retry logic, error handling, and automated orchestration to handle large volumes of documents.
  
  To extract using a pipeline, follow these steps:
  1. Select Extract by pipeline.
  2. In the Extract with pipeline pane, enter a display name for the pipeline.
  3. Select a region.
  4. Select a destination dataset. The dataset must be in the same location as the source.
  5. Click Extract. This creates a BigQuery pipeline that orchestrates the data materialization using Dataform.
  6. Run all tasks in the pipeline to generate structured node and edge views.
  For more information about running data workflows, see Introduction to Dataform.

After you extract and materialize the semantic insights into BigQuery, you can perform the following tasks:

Query the structured data. Run standard SQL queries against the newly created tables to analyze the extracted entities and relationships.
Join with existing data. Combine the qualitative insights extracted from your unstructured files with your existing structured BigQuery datasets (such as joining parsed invoice data with your accounting tables).
Explore data insights. Use the Data insights feature in BigQuery Studio to automatically generate natural language questions and SQL queries for your new structured assets.
Analyze with Gemini. Use Gemini in BigQuery to perform conversational analysis, summarize trends, or create dashboards in Data Studio based on the extracted data.
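
As an illustration of joining extracted data with existing data, the following Python sketch builds a SQL statement that joins an extracted table to an existing table on a shared column. Every table and column name in it is hypothetical; the actual names depend on the entities that your scan inferred and on your own datasets. The validation patterns are deliberately conservative and reject some identifiers that BigQuery accepts.

```python
import re

# Conservative patterns; real BigQuery identifiers allow more characters.
_TABLE = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+")
_COLUMN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_join_query(extracted_table, existing_table, key_column):
    """Return SQL that joins an extracted table to an existing table on one column."""
    for table in (extracted_table, existing_table):
        if not _TABLE.fullmatch(table):
            raise ValueError(f"unexpected table ID: {table!r}")
    if not _COLUMN.fullmatch(key_column):
        raise ValueError(f"unexpected column name: {key_column!r}")
    return (
        "SELECT *\n"
        f"FROM `{extracted_table}` AS extracted\n"
        f"JOIN `{existing_table}` AS existing\n"
        f"USING ({key_column})"
    )


print(build_join_query(
    "example-project.extracted.invoice",    # hypothetical extracted entity table
    "example-project.accounting.payments",  # hypothetical existing table
    "invoice_id",                           # hypothetical shared column
))
```
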

What's next

Learn how to use data profile for unstructured data.
Learn more about Discovering data.
Read About data profiling.
