Data insights for unstructured data uses Vertex AI to transform raw, unstructured files in Cloud Storage into structured, queryable assets in BigQuery. Data insights for unstructured data is optimized for PDF files.
This document describes how to set up the necessary permissions, discover unstructured data, view the generated insights, and extract the data into BigQuery.
Before you begin
Before you use data insights for unstructured data, ensure you have the required permissions and APIs enabled.
Enable APIs
Enable the following APIs in your project:
- dataplex.googleapis.com
- bigquery.googleapis.com
- aiplatform.googleapis.com (Vertex AI)
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
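If you prefer the command line, all three APIs can be enabled in one step. The following is a minimal sketch using the gcloud CLI, assuming your account holds the Service Usage Admin role; PROJECT_ID is a placeholder for your project ID:

```shell
# Enable the APIs required by data insights for unstructured data.
gcloud services enable \
    dataplex.googleapis.com \
    bigquery.googleapis.com \
    aiplatform.googleapis.com \
    --project=PROJECT_ID
```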
Required roles and permissions
To configure and run data insights for unstructured data, ensure that you and the service accounts used by Knowledge Catalog and BigQuery have the required Identity and Access Management (IAM) roles and permissions.
A discovery scan is required to automatically locate your unstructured files in Cloud Storage and catalog them into BigLake object tables so that they can be analyzed. For general permissions required to run discovery scans on Cloud Storage buckets, see Discover and catalog Cloud Storage data.
Summary of required identities and roles
| Identity type | Typical principal format | Required IAM roles | Core purpose |
|---|---|---|---|
| End user | Your Google Cloud user account | Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin); Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer); BigQuery Data Editor (roles/bigquery.dataEditor); BigQuery Job User (roles/bigquery.jobUser) | You use these roles to enable APIs, configure and view discovery scans, and trigger the final data extraction. |
| Knowledge Catalog discovery service agent | service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com | Vertex AI User (roles/aiplatform.user); Discovery service agent (roles/dataplex.discoveryServiceAgent) | This Google-managed service agent locates your unstructured files in Cloud Storage, catalogs them, and calls Vertex AI to generate inferred schemas and metadata. |
| BigQuery connection service account | service-PROJECT_NUMBER@gcp-sa-bigqueryconnection.iam.gserviceaccount.com | Storage Object Viewer (roles/storage.objectViewer) on the bucket; Vertex AI User (roles/aiplatform.user) on the project | It connects BigQuery to external storage, allowing BigQuery to read the raw files, create BigLake object tables, and run AI inference without exposing your personal user credentials. |
| Pipeline execution service account (Optional) | A user-managed service account | BigQuery Data Editor (roles/bigquery.dataEditor); BigQuery Job User (roles/bigquery.jobUser); BigQuery User (roles/bigquery.user); Vertex AI User (roles/aiplatform.user) | If you choose to extract data using an automated pipeline, this identity runs the background jobs to materialize the AI-generated entities into BigQuery tables. |
| Default Dataform service account (Optional) | service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com | Service Account Token Creator (roles/iam.serviceAccountTokenCreator) on the pipeline execution service account | When using the pipeline extraction method, Dataform requires permission to impersonate your pipeline execution service account to orchestrate the workflow. |
End user roles and permissions
To ensure that your user account has the necessary permissions to create discovery scans, view insights, and extract data, ask your administrator to grant the following IAM roles to your user account on the project:
- Create and manage discovery scans: Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin)
- View discovery scans and insights: Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer)
- Extract data using SQL or pipeline:
  - BigQuery Data Editor (roles/bigquery.dataEditor)
  - BigQuery Job User (roles/bigquery.jobUser)
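As a sketch, these project-level grants can also be made with the gcloud CLI; PROJECT_ID and USER_EMAIL are placeholders for your values:

```shell
# Grant each predefined role to the end user at the project level.
for role in roles/dataplex.dataScanAdmin \
            roles/dataplex.dataScanDataViewer \
            roles/bigquery.dataEditor \
            roles/bigquery.jobUser; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="user:USER_EMAIL" \
      --role="$role"
done
```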
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create discovery scans, view insights, and extract data. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create discovery scans, view insights, and extract data:
- Discovery scans:
  - dataplex.datascans.create
  - dataplex.datascans.get
  - dataplex.datascans.getData
  - dataplex.datascans.list
- Data extraction:
  - bigquery.tables.create
  - bigquery.tables.update
  - bigquery.tables.getData
  - bigquery.jobs.create
Your administrator might also be able to give your user account these permissions with custom roles or other predefined roles.
Knowledge Catalog discovery service agent roles and permissions
The Knowledge Catalog discovery service agent (usually service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com) needs access to run discovery scans and perform inference using Vertex AI. To grant it that access, ask your administrator to grant the following IAM roles to the service agent on the project:
- Vertex AI User (roles/aiplatform.user)
- Discovery service agent (roles/dataplex.discoveryServiceAgent)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to run discovery scans and perform inference using Vertex AI. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run discovery scans and perform inference using Vertex AI:
- aiplatform.endpoints.predict
- bigquery.datasets.create
- bigquery.datasets.get
- storage.buckets.get
- storage.objects.get
- storage.objects.list
Your administrator might also be able to give the Knowledge Catalog discovery service agent these permissions with custom roles or other predefined roles.
BigQuery connection service account roles and permissions
A BigQuery Cloud resource connection allows Knowledge Catalog to securely access and discover unstructured data stored outside of BigQuery, such as in Cloud Storage. When you create a connection, BigQuery automatically creates a dedicated service account on your behalf. This service account serves as the identity used to connect to your external data source.
By default, this service account doesn't have any permissions. You must explicitly grant this service account the required IAM roles on the Cloud Storage buckets containing your data. You can use an existing BigQuery connection or create a new one in the same location as your source Cloud Storage bucket.
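For example, a new Cloud resource connection can be created with the bq CLI; the connection name and location below are illustrative:

```shell
# Create a Cloud resource connection in the same location as the bucket.
bq mk --connection \
    --location=us-central1 \
    --project_id=PROJECT_ID \
    --connection_type=CLOUD_RESOURCE \
    unstructured_data_connection

# Show the connection to find its auto-created service account ID.
bq show --connection PROJECT_ID.us-central1.unstructured_data_connection
```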
To ensure that the BigQuery connection service account (usually service-PROJECT_NUMBER@gcp-sa-bigqueryconnection.iam.gserviceaccount.com) has the necessary permissions to create BigLake object tables and run inference, ask your administrator to grant it the following IAM roles:
- Storage Object Viewer (roles/storage.objectViewer) on the bucket containing unstructured data
- Vertex AI User (roles/aiplatform.user) on the project
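Because one grant is bucket-level and the other project-level, they target different resources. A sketch with the gcloud CLI, where BUCKET_NAME, CONNECTION_SA_EMAIL, and PROJECT_ID are placeholders:

```shell
# Grant the connection's service account read access on the bucket.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="serviceAccount:CONNECTION_SA_EMAIL" \
    --role="roles/storage.objectViewer"

# Grant Vertex AI User on the project for inference calls.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:CONNECTION_SA_EMAIL" \
    --role="roles/aiplatform.user"
```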
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create BigLake object tables and run inference. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create BigLake object tables and run inference:
- storage.buckets.get on the bucket containing unstructured data
- storage.objects.get on the bucket containing unstructured data
- aiplatform.endpoints.predict on the project
Your administrator might also be able to give the BigQuery connection service account these permissions with custom roles or other predefined roles.
Pipeline execution service account roles and permissions (Optional)
If you choose to extract the inferred data using an automated pipeline, you must create or provide a dedicated service account to execute the pipeline. This execution service account acts as the identity that securely authenticates and runs the background data extraction and analysis tasks in BigQuery. Additionally, you must grant the default Dataform service account permission to impersonate this execution service account.
To ensure that the pipeline execution service account has the necessary permissions to extract the inferred entities and relationships using a pipeline, ask your administrator to grant the following IAM roles to the pipeline execution service account on the project:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Job User (roles/bigquery.jobUser)
- BigQuery User (roles/bigquery.user)
- Vertex AI User (roles/aiplatform.user)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to extract the inferred entities and relationships using a pipeline. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to extract the inferred entities and relationships using a pipeline:
- bigquery.tables.create
- bigquery.tables.update
- bigquery.tables.get
- bigquery.tables.getData
- bigquery.jobs.create
- aiplatform.endpoints.predict
Your administrator might also be able to give the pipeline execution service account these permissions with custom roles or other predefined roles.
To ensure that the default Dataform service account (service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com) has the necessary permission to impersonate the pipeline execution service account, ask your administrator to grant it the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the pipeline execution service account.
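Note that this grant is made on the service account resource itself, not on the project. A sketch with the gcloud CLI, where PIPELINE_SA_EMAIL and PROJECT_NUMBER are placeholders:

```shell
# Allow the default Dataform service account to impersonate the
# pipeline execution service account (resource-level binding).
gcloud iam service-accounts add-iam-policy-binding PIPELINE_SA_EMAIL \
    --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountTokenCreator"
```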
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the iam.serviceAccounts.getAccessToken permission, which is required to impersonate the pipeline execution service account.
Your administrator might also be able to give the default Dataform service account this permission with custom roles or other predefined roles.
Prepare unstructured data
Before you run a discovery scan, you must upload your unstructured data to a Cloud Storage bucket. Data insights for unstructured data is optimized for analyzing PDF documents.
For more information about storing and managing files in Cloud Storage, see Upload objects.
Create a discovery scan for unstructured data
To extract semantic insights from your unstructured data, you must first create a discovery scan. This scan automatically locates your unstructured files in Cloud Storage and catalogs them into a BigLake object table. By enabling the data insights option during this process, Knowledge Catalog uses Vertex AI to analyze the files and generate inferred metadata, schemas, and relationships.
1. In the Google Cloud console, go to the Metadata curation page.
2. In the Cloud Storage discovery tab, click Create.
3. Enter a name for the scan.
4. To select the Cloud Storage bucket containing your unstructured data, click Browse.
5. For Unstructured data options, select the Enable semantic inference checkbox.
6. In the Connection ID field, specify the BigQuery connection used to access the files.

   The discovery scan automatically catalogs unstructured data into BigQuery by creating BigLake object tables. Because BigLake object tables securely decouple the data access credentials from the user executing queries, a connection is required to authenticate with Cloud Storage and read the files.
7. Click Run now (for an on-demand scan) or Create (for a scheduled scan).
For full details on all available configurations, see Discover and catalog Cloud Storage data.
Knowledge Catalog creates a BigLake object table and enriches the catalog entry with AI-generated metadata. This process usually takes a few minutes for standard datasets.
Locate the BigLake object table
After the discovery scan completes, Knowledge Catalog creates one or more BigLake object tables and populates the catalog with a corresponding entry for each table, enriched with AI-generated metadata. When a scan creates multiple entries, each entry has its own Insights tab. You can view the automated table description, the inferred schemas, and the relationship graphs.
1. In the Google Cloud console, go to the BigQuery page.
2. In the navigation menu, click Governance > Metadata curation.
3. In the Cloud Storage discovery pane, click the discovery scan that you ran for unstructured data.
   - The Scan details section shows details about the discovery scan.
   - The Scan status section shows the discovery results of the latest scan job.
4. Click the link for Published dataset.
5. In the list of tables displayed for the BigQuery dataset, select the BigLake object table generated for the discovery scan.
6. Copy the table ID. You'll need it in the next section.
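You can also confirm the table from the command line by inspecting it with the bq CLI; the dataset and table names below are placeholders:

```shell
# Inspect the BigLake object table created by the discovery scan.
bq show --format=prettyjson PROJECT_ID:DATASET_NAME.OBJECT_TABLE_NAME
```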
View inferred entity graphs
You can view the BigLake object table for the discovery scan in Knowledge Catalog.
1. In the Google Cloud console, go to the Knowledge Catalog Search page.
2. Paste and search for the BigLake object table ID that you copied in the previous section.
3. In the search results, click the table to open its entry page.
4. On the Details tab, under Aspects, verify the presence of the Graph Profile aspect. This aspect contains the inferred schemas for entities and relationships.
5. Click the Insights tab. On the Insights tab, you can view the following information:
   - Semantic extraction. A banner indicates that extractable entities and relationships were detected. It includes an Extract button to materialize the data using SQL or pipeline deployment.
   - Description. An AI-generated, human-readable summary explains the unstructured data contents. It describes the primary nodes (entities) discovered and how they map to each other through edges (relationships).
   - Pipelines. A list of previously deployed data extraction pipelines associated with this resource. You can view the display name, region, creation time, and the user who created the pipeline.
   - Inferred entities and relationships. A visual, interactive graph displays the discovered semantic structure of your unstructured data. The graph contains nodes representing distinct entities, for example, 'Recipe' and 'Ingredient', and edges representing the connections between them, for example, 'HasAllergenStatus'. You can use the legend to filter and explore specific nodes and edges.
   - Entities. A detailed list of the discovered primary entities. You can expand each entity to view its AI-generated description and its inferred schema, which includes field names, data types, and field descriptions.
   - Relationships. A detailed list of the discovered connections between entities. You can expand each relationship to view its description and the schema defining how the entities map to one another.
Update inferred insights
Inferred insights are stored in Knowledge Catalog as an aspect attached to the BigLake object table. You can update these insights manually by using the Google Cloud console or the entry.patch API.
Console
To update inferred insights in the Google Cloud console, follow these steps:

1. In the Google Cloud console, go to the Knowledge Catalog Search page.
2. Paste and search for the BigLake object table ID.
3. In the search results, click the table to open its entry page.
4. Click the Insights tab.
5. Next to Inferred entities and relationships, click Edit.
6. In the JSON editor, modify the graph-profile aspect.
7. Click Save.
REST
To update inferred insights using the REST API, follow these steps:
1. Create a file named payload.json and add the JSON content of the aspect that you want to update. For example:

   ```json
   {
     "aspects": {
       "dataplex-types.global.graph-profile": {
         "data": {
           // Your updated inferred insights data
         }
       }
     }
   }
   ```

2. Run the following command in your terminal:

   ```shell
   curl -X PATCH \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -d @payload.json \
     "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/entryGroups/ENTRY_GROUP_ID/entries/ENTRY_ID?updateMask=aspects"
   ```

   Replace the following:

   - PROJECT_ID: the ID of your project, for example, example-project
   - LOCATION: the location of the entry, for example, us-central1
   - ENTRY_GROUP_ID: the ID of the entry group, for example, example-entry-group
   - ENTRY_ID: the ID of the entry, for example, example-entry
For more information and code samples in other languages, see Update an entry aspect.
Extract data to BigQuery
You can materialize the inferred entities and relationships into structured tables or views in BigQuery using SQL or an automated pipeline.
From the Insights tab, click Extraction.
Choose one of the following methods based on your analytical needs and the scale of your unstructured data:
Extract by SQL: Choose this option for rapid, ad-hoc analysis, small-to-medium datasets, or when you want a zero-infrastructure approach using BigQuery remote models.
To extract using SQL, follow these steps:
- Select Extract by SQL.
- In the Extract with SQL pane, select a destination dataset. The dataset must be in the same location as the source.
- Click Extract.
- In the BigQuery Editor, a pre-populated query opens. Run the query to create standard tables and views.
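The generated query is specific to your scan, but conceptually it builds on BigQuery's document-processing functions. A minimal, hypothetical sketch run through the bq CLI, where the remote model and object table names are assumptions:

```shell
# Process documents in the object table with a remote model.
bq query --use_legacy_sql=false '
-- document_processor is a hypothetical remote model;
-- unstructured_objects is the BigLake object table from the scan.
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `my_dataset.document_processor`,
  TABLE `my_dataset.unstructured_objects`)'
```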
For more information about using SQL to extract document insights, see Process documents with the ML.PROCESS_DOCUMENT function.

Extract by pipeline: Choose this option for massive-scale data processing or when you require robust retry logic, error handling, and automated orchestration to handle large volumes of documents.
To extract using a pipeline, follow these steps:
- Select Extract by pipeline.
- In the Extract with pipeline pane, enter a display name for the pipeline.
- Select a region.
- Select a destination dataset. The dataset must be in the same location as the source.
- Click Extract. This creates a BigQuery pipeline that orchestrates the data materialization.
- Run all tasks in the pipeline to generate structured node and edge views.
For more information about running data workflows, see Introduction to Dataform.
After you extract and materialize the semantic insights into BigQuery, you can perform the following tasks:
Query the structured data. Run standard SQL queries against the newly created tables to analyze the extracted entities and relationships.
Join with existing data. Combine the qualitative insights extracted from your unstructured files with your existing structured BigQuery datasets (such as joining parsed invoice data with your accounting tables).
Explore data insights. Use the Data insights feature in BigQuery Studio to automatically generate natural language questions and SQL queries for your new structured assets.
Analyze with Gemini. Use Gemini in BigQuery to perform conversational analysis, summarize trends, or create dashboards in Looker Studio based on the extracted data.
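Once the node and edge views are materialized, querying them is standard SQL. A sketch using the bq CLI and the Recipe/Ingredient example from the entity graph above; the view and column names are purely illustrative:

```shell
# Join hypothetical entity and relationship views produced by extraction.
bq query --use_legacy_sql=false '
SELECT r.name AS recipe, i.name AS ingredient
FROM `my_dataset.recipe_nodes` AS r
JOIN `my_dataset.has_ingredient_edges` AS e ON r.id = e.recipe_id
JOIN `my_dataset.ingredient_nodes` AS i ON e.ingredient_id = i.id
LIMIT 10'
```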