Data insights for unstructured data uses Vertex AI to transform raw, unstructured files in Cloud Storage into structured, queryable assets in BigQuery. Data insights for unstructured data is optimized for PDF files.
This document describes how to set up the necessary permissions, discover unstructured data, view the generated insights, and extract the data into BigQuery.
Before you begin
Before you use data insights for unstructured data, ensure you have the required permissions and APIs enabled.
Enable APIs
Enable the following APIs in your project:
- dataplex.googleapis.com
- bigquery.googleapis.com
- aiplatform.googleapis.com (Vertex AI)
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
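If you prefer the command line, all three APIs can be enabled in one step. The following is a minimal sketch using the gcloud CLI, assuming your account holds the Service Usage Admin role; PROJECT_ID is a placeholder for your project ID:

```shell
# Enable the APIs required by data insights for unstructured data.
gcloud services enable \
    dataplex.googleapis.com \
    bigquery.googleapis.com \
    aiplatform.googleapis.com \
    --project=PROJECT_ID
```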
Required roles and permissions
To configure and run data insights for unstructured data, ensure that you and the service accounts used by Knowledge Catalog and BigQuery have the required Identity and Access Management (IAM) roles and permissions.
A discovery scan is required to automatically locate your unstructured files in Cloud Storage and catalog them into BigLake object tables so that they can be analyzed. For general permissions required to run discovery scans on Cloud Storage buckets, see Discover and catalog Cloud Storage data.
Summary of required identities and roles
| Identity type | Typical principal format | Required IAM roles | Core purpose |
|---|---|---|---|
| End user | Your Google Cloud user account | Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin); Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer); BigQuery Data Editor (roles/bigquery.dataEditor); BigQuery Job User (roles/bigquery.jobUser) | You use these roles to enable APIs, configure and view discovery scans, and trigger the final data extraction. |
| Knowledge Catalog discovery service agent | service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com | Vertex AI User (roles/aiplatform.user); Discovery service agent (roles/dataplex.discoveryServiceAgent) | This Google-managed service agent locates your unstructured files in Cloud Storage, catalogs them, and calls Vertex AI to generate inferred schemas and metadata. |
| BigQuery connection service account | service-PROJECT_NUMBER@gcp-sa-bigqueryconnection.iam.gserviceaccount.com | Storage Object Viewer (roles/storage.objectViewer) on the bucket; Vertex AI User (roles/aiplatform.user) on the project | It connects BigQuery to external storage, allowing BigQuery to read the raw files, create BigLake object tables, and run AI inference without exposing your personal user credentials. |
| Pipeline execution service account (Optional) | A user-managed service account | BigQuery Data Editor (roles/bigquery.dataEditor); BigQuery Job User (roles/bigquery.jobUser); BigQuery User (roles/bigquery.user); Vertex AI User (roles/aiplatform.user) | If you choose to extract data using an automated pipeline, this identity runs the background jobs to materialize the AI-generated entities into BigQuery tables. |
| Default Dataform service account (Optional) | service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com | Service Account Token Creator (roles/iam.serviceAccountTokenCreator) on the pipeline execution service account | When using the pipeline extraction method, Dataform requires permission to impersonate your pipeline execution service account to orchestrate the workflow. |
End user roles and permissions
To ensure that your user account has the necessary permissions to create discovery scans, view insights, and extract data, ask your administrator to grant the following IAM roles to your user account on the project:
- Create and manage discovery scans: Dataplex DataScan Administrator (roles/dataplex.dataScanAdmin)
- View discovery scans and insights: Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer)
- Extract data using SQL or pipeline:
  - BigQuery Data Editor (roles/bigquery.dataEditor)
  - BigQuery Job User (roles/bigquery.jobUser)
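As a sketch, these project-level grants can also be made with the gcloud CLI; PROJECT_ID and USER_EMAIL are placeholders for your values:

```shell
# Grant each predefined role to the end user at the project level.
for role in roles/dataplex.dataScanAdmin \
            roles/dataplex.dataScanDataViewer \
            roles/bigquery.dataEditor \
            roles/bigquery.jobUser; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="user:USER_EMAIL" \
      --role="$role"
done
```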
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create discovery scans, view insights, and extract data. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create discovery scans, view insights, and extract data:
- Discovery scans:
  - dataplex.datascans.create
  - dataplex.datascans.get
  - dataplex.datascans.getData
  - dataplex.datascans.list
- Data extraction:
  - bigquery.tables.create
  - bigquery.tables.update
  - bigquery.tables.getData
  - bigquery.jobs.create
Your administrator might also be able to give your user account these permissions with custom roles or other predefined roles.
Knowledge Catalog discovery service agent roles and permissions
The Knowledge Catalog discovery service agent (usually service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com) needs access to run discovery scans and perform inference using Vertex AI. To grant it that access, ask your administrator to grant the following IAM roles to the service agent on the project:
- Vertex AI User (roles/aiplatform.user)
- Discovery service agent (roles/dataplex.discoveryServiceAgent)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to run discovery scans and perform inference using Vertex AI. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run discovery scans and perform inference using Vertex AI:
- aiplatform.endpoints.predict
- bigquery.datasets.create
- bigquery.datasets.get
- storage.buckets.get
- storage.objects.get
- storage.objects.list
Your administrator might also be able to give the Knowledge Catalog discovery service agent these permissions with custom roles or other predefined roles.
BigQuery connection service account roles and permissions
A BigQuery Cloud resource connection allows Knowledge Catalog to securely access and discover unstructured data stored outside of BigQuery, such as in Cloud Storage. When you create a connection, BigQuery automatically creates a dedicated service account on your behalf. This service account serves as the identity used to connect to your external data source.
By default, this service account doesn't have any permissions. You must explicitly grant this service account the required IAM roles on the Cloud Storage buckets containing your data. You can use an existing BigQuery connection or create a new one in the same location as your source Cloud Storage bucket.
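For example, a new Cloud resource connection can be created with the bq CLI; the connection name and location below are illustrative:

```shell
# Create a Cloud resource connection in the same location as the bucket.
bq mk --connection \
    --location=us-central1 \
    --project_id=PROJECT_ID \
    --connection_type=CLOUD_RESOURCE \
    unstructured_data_connection

# Show the connection to find its auto-created service account ID.
bq show --connection PROJECT_ID.us-central1.unstructured_data_connection
```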
To ensure that the BigQuery connection service account (usually service-PROJECT_NUMBER@gcp-sa-bigqueryconnection.iam.gserviceaccount.com) has the necessary permissions to create BigLake object tables and run inference, ask your administrator to grant it the following IAM roles:
- Storage Object Viewer (roles/storage.objectViewer) on the bucket containing unstructured data
- Vertex AI User (roles/aiplatform.user) on the project
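Because one grant is bucket-level and the other project-level, they target different resources. A sketch with the gcloud CLI, where BUCKET_NAME, CONNECTION_SA_EMAIL, and PROJECT_ID are placeholders:

```shell
# Grant the connection's service account read access on the bucket.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="serviceAccount:CONNECTION_SA_EMAIL" \
    --role="roles/storage.objectViewer"

# Grant Vertex AI User on the project for inference calls.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:CONNECTION_SA_EMAIL" \
    --role="roles/aiplatform.user"
```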
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to create BigLake object tables and run inference. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create BigLake object tables and run inference:
- storage.buckets.get on the bucket containing unstructured data
- storage.objects.get on the bucket containing unstructured data
- aiplatform.endpoints.predict on the project
Your administrator might also be able to give the BigQuery connection service account these permissions with custom roles or other predefined roles.
Pipeline execution service account roles and permissions (Optional)
If you choose to extract the inferred data using an automated pipeline, you must create or provide a dedicated service account to execute the pipeline. This execution service account acts as the identity that securely authenticates and runs the background data extraction and analysis tasks in BigQuery. Additionally, you must grant the default Dataform service account permission to impersonate this execution service account.
To ensure that the pipeline execution service account has the necessary permissions to extract the inferred entities and relationships using a pipeline, ask your administrator to grant the following IAM roles to the pipeline execution service account on the project:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Job User (roles/bigquery.jobUser)
- BigQuery User (roles/bigquery.user)
- Vertex AI User (roles/aiplatform.user)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to extract the inferred entities and relationships using a pipeline. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to extract the inferred entities and relationships using a pipeline:
- bigquery.tables.create
- bigquery.tables.update
- bigquery.tables.get
- bigquery.tables.getData
- bigquery.jobs.create
- aiplatform.endpoints.predict
Your administrator might also be able to give the pipeline execution service account these permissions with custom roles or other predefined roles.
To ensure that the default Dataform service account (service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com) has the necessary permission to impersonate the pipeline execution service account, ask your administrator to grant it the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the pipeline execution service account.
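Note that this grant is made on the service account resource itself, not on the project. A sketch with the gcloud CLI, where PIPELINE_SA_EMAIL and PROJECT_NUMBER are placeholders:

```shell
# Allow the default Dataform service account to impersonate the
# pipeline execution service account (resource-level binding).
gcloud iam service-accounts add-iam-policy-binding PIPELINE_SA_EMAIL \
    --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataform.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountTokenCreator"
```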
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the iam.serviceAccounts.getAccessToken permission, which is required to impersonate the pipeline execution service account.
Your administrator might also be able to give the default Dataform service account this permission with custom roles or other predefined roles.
Prepare unstructured data
Before you run a discovery scan, you must upload your unstructured data to a Cloud Storage bucket. Data insights for unstructured data is optimized for analyzing PDF documents.
For more information about storing and managing files in Cloud Storage, see Upload objects.
Create a discovery scan for unstructured data
To extract semantic insights from your unstructured data, you must first create a discovery scan. This scan automatically locates your unstructured files in Cloud Storage and catalogs them into a BigLake object table. By enabling the data insights option during this process, Knowledge Catalog uses Vertex AI to analyze the files and generate inferred metadata, schemas, and relationships.
1. In the Google Cloud console, go to the Metadata curation page.
2. In the Cloud Storage discovery tab, click Create.
3. Enter a name for the scan.
4. To select the Cloud Storage bucket containing your unstructured data, click Browse.
5. For Unstructured data options, select the Enable semantic inference checkbox.
6. In the Connection ID field, specify the BigQuery connection used to access the files.

   The discovery scan automatically catalogs unstructured data into BigQuery by creating BigLake object tables. Because BigLake object tables securely decouple the data access credentials from the user executing queries, a connection is required to authenticate with Cloud Storage and read the files.
7. Click Run now (for an on-demand scan) or Create (for a scheduled scan).
For full details on all available configurations, see Discover and catalog Cloud Storage data.
Knowledge Catalog creates a BigLake object table and enriches the catalog entry with AI-generated metadata. This process usually takes a few minutes for standard datasets.
Locate the BigLake object table
After the discovery scan completes, Knowledge Catalog creates one or more BigLake object tables and populates the catalog with a corresponding entry for each table, enriched with AI-generated metadata. When a scan creates multiple entries, each entry has its own Insights tab. You can view the automated table description, the inferred schemas, and the relationship graphs.
1. In the Google Cloud console, go to the BigQuery page.
2. In the navigation menu, click Governance > Metadata curation.
3. In the Cloud Storage discovery pane, click the discovery scan that you ran for unstructured data.
   - The Scan details section shows details about the discovery scan.
   - The Scan status section shows the discovery results of the latest scan job.
4. Click the link for Published dataset.
5. In the list of tables displayed for the BigQuery dataset, select the BigLake object table generated for the discovery scan.
6. Copy the table ID. You'll need it in the next section.
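You can also confirm the table from the command line by inspecting it with the bq CLI; the dataset and table names below are placeholders:

```shell
# Inspect the BigLake object table created by the discovery scan.
bq show --format=prettyjson PROJECT_ID:DATASET_NAME.OBJECT_TABLE_NAME
```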
View inferred entity graphs
You can view the BigLake object table for the discovery scan in Knowledge Catalog.
1. In the Google Cloud console, go to the Knowledge Catalog Search page.
2. Paste and search for the BigLake object table ID that you copied in the previous section.
3. In the search results, click the table to open its entry page.
4. On the Details tab, under Aspects, verify the presence of the Graph Profile aspect. This aspect contains the inferred schemas for entities and relationships.
5. Click the Insights tab. On the Insights tab, you can view the following information:
   - Semantic extraction. A banner indicates that extractable entities and relationships were detected. It includes an Extract button to materialize the data using SQL or pipeline deployment.
   - Description. An AI-generated, human-readable summary explains the unstructured data contents. It describes the primary nodes (entities) discovered and how they map to each other through edges (relationships).
   - Pipelines. A list of previously deployed data extraction pipelines associated with this resource. You can view the display name, region, creation time, and the user who created the pipeline.
   - Inferred entities and relationships. A visual, interactive graph displays the discovered semantic structure of your unstructured data. The graph contains nodes representing distinct entities, for example, 'Recipe' and 'Ingredient', and edges representing the connections between them, for example, 'HasAllergenStatus'. You can use the legend to filter and explore specific nodes and edges.
   - Entities. A detailed list of the discovered primary entities. You can expand each entity to view its AI-generated description and its inferred schema, which includes field names, data types, and field descriptions.
   - Relationships. A detailed list of the discovered connections between entities. You can expand each relationship to view its description and the schema defining how the entities map to one another.
Update inferred insights
Inferred insights are stored in Knowledge Catalog as an aspect attached to the BigLake object table. You can update these insights manually by using the Google Cloud console or the entry.patch API.
Console
To update inferred insights in the Google Cloud console, follow these steps:

1. In the Google Cloud console, go to the Knowledge Catalog Search page.
2. Paste and search for the BigLake object table ID.
3. In the search results, click the table to open its entry page.
4. Click the Insights tab.
5. Next to Inferred entities and relationships, click Edit.
6. In the JSON editor, modify the graph-profile aspect.
7. Click Save.
REST
To update inferred insights using the REST API, follow these steps:
1. Create a file named payload.json and add the JSON content of the aspect that you want to update. For example:

   ```json
   {
     "aspects": {
       "dataplex-types.global.graph-profile": {
         "data": {
           // Your updated inferred insights data
         }
       }
     }
   }
   ```

2. Run the following command in your terminal:

   ```shell
   curl -X PATCH \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json" \
     -d @payload.json \
     "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/entryGroups/ENTRY_GROUP_ID/entries/ENTRY_ID?updateMask=aspects"
   ```

   Replace the following:

   - PROJECT_ID: the ID of your project, for example, example-project
   - LOCATION: the location of the entry, for example, us-central1
   - ENTRY_GROUP_ID: the ID of the entry group, for example, example-entry-group
   - ENTRY_ID: the ID of the entry, for example, example-entry
For more information and code samples in other languages, see Update an entry aspect.
Extract data to BigQuery
You can materialize the inferred entities and relationships into structured tables or views in BigQuery using SQL or an automated pipeline.
From the Insights tab, click Extraction.
Choose one of the following methods based on your analytical needs and the scale of your unstructured data:
Extract by SQL: Choose this option for rapid, ad-hoc analysis, small-to-medium datasets, or when you want a zero-infrastructure approach using BigQuery remote models.
To extract using SQL, follow these steps:
- Select Extract by SQL.
- In the Extract with SQL pane, select a destination dataset. The dataset must be in the same location as the source.
- Click Extract.
- In the BigQuery Editor, a pre-populated query opens. Run the query to create standard tables and views.
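The generated query is specific to your scan, but conceptually it builds on BigQuery's document-processing functions. A minimal, hypothetical sketch run through the bq CLI, where the remote model and object table names are assumptions:

```shell
# Process documents in the object table with a remote model.
bq query --use_legacy_sql=false '
-- document_processor is a hypothetical remote model;
-- unstructured_objects is the BigLake object table from the scan.
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `my_dataset.document_processor`,
  TABLE `my_dataset.unstructured_objects`)'
```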
For more information about using SQL to extract document insights, see Process documents with the ML.PROCESS_DOCUMENT function.

Extract by pipeline: Choose this option for massive-scale data processing or when you require robust retry logic, error handling, and automated orchestration to handle large volumes of documents.
To extract using a pipeline, follow these steps:
- Select Extract by pipeline.
- In the Extract with pipeline pane, enter a display name for the pipeline.
- Select a region.
- Select a destination dataset. The dataset must be in the same location as the source.
- Click Extract. This creates a BigQuery pipeline that orchestrates the data materialization.
- Run all tasks in the pipeline to generate structured node and edge views.
For more information about running data workflows, see Introduction to Dataform.
After you extract and materialize the semantic insights into BigQuery, you can perform the following tasks:
Query the structured data. Run standard SQL queries against the newly created tables to analyze the extracted entities and relationships.
Join with existing data. Combine the qualitative insights extracted from your unstructured files with your existing structured BigQuery datasets (such as joining parsed invoice data with your accounting tables).
Explore data insights. Use the Data insights feature in BigQuery Studio to automatically generate natural language questions and SQL queries for your new structured assets.
Analyze with Gemini. Use Gemini in BigQuery to perform conversational analysis, summarize trends, or create dashboards in Looker Studio based on the extracted data.
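Once the node and edge views are materialized, querying them is standard SQL. A sketch using the bq CLI and the Recipe/Ingredient example from the entity graph above; the view and column names are purely illustrative:

```shell
# Join hypothetical entity and relationship views produced by extraction.
bq query --use_legacy_sql=false '
SELECT r.name AS recipe, i.name AS ingredient
FROM `my_dataset.recipe_nodes` AS r
JOIN `my_dataset.has_ingredient_edges` AS e ON r.id = e.recipe_id
JOIN `my_dataset.ingredient_nodes` AS i ON e.ingredient_id = i.id
LIMIT 10'
```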