Generate table insights
This document describes how to generate table insights for BigQuery tables, BigLake tables, and external tables. Table insights help you understand the data within a single table by generating natural language questions and SQL queries based on table metadata.
Table insights can help you accelerate data exploration by letting you quickly understand the structure and content of a new or unfamiliar table without writing complex queries. You can also generate SQL queries based on natural language questions, which reduces the time and effort required to write queries manually. Table insights can also help non-SQL users understand and analyze data through these natural language queries.
For an overview of table and dataset insights, see Data insights overview.
Modes for generating table data insights
When generating table insights, BigQuery provides two modes:
| Mode | Description | Usage |
|---|---|---|
| Generate and publish (Preview) | Persists the generated table insights in Dataplex Universal Catalog as metadata aspects. You must have the necessary permissions to publish. | Use this mode for enterprise-wide data documentation that persists and is reusable, or when you're building catalog-driven governance workflows. |
| Generate without publish | Creates table insights such as descriptions, natural language questions, and SQL queries on demand. This mode doesn't publish insights to Dataplex Universal Catalog. | Use this mode for quick, ad hoc exploration to avoid cluttering the catalog. |
Before you begin
Data insights are generated using Gemini in BigQuery. To start generating insights, you must first set up Gemini in BigQuery.
Enable APIs
To use data insights, enable the following APIs in your project: Dataplex API, BigQuery API, and Gemini for Google Cloud API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM
role (roles/serviceusage.serviceUsageAdmin), which
contains the serviceusage.services.enable permission. Learn how to grant
roles.
For more information about enabling the Gemini for Google Cloud API, see Enable the Gemini for Google Cloud API in a Google Cloud project.
Roles and permissions
To create, manage, and retrieve data insights, ask your administrator to grant you the following Identity and Access Management (IAM) roles:
- Dataplex Universal Catalog DataScan Editor (roles/dataplex.dataScanEditor) or Dataplex Universal Catalog DataScan Administrator (roles/dataplex.dataScanAdmin) on the project where you want to generate insights.
- BigQuery Data Viewer (roles/bigquery.dataViewer) on the BigQuery tables for which you want to generate insights.
- BigQuery Data Editor (roles/bigquery.dataEditor) on the BigQuery tables for which you want to generate insights.
- BigQuery User (roles/bigquery.user) or BigQuery Studio User (roles/bigquery.studioUser) on the project where you want to generate insights.
To get read-only access to the generated insights, ask your administrator to grant you the following IAM role:
- Dataplex Universal Catalog DataScan DataViewer (roles/dataplex.dataScanDataViewer) on the project containing the BigQuery tables for which you want to view insights.
To publish data insights to Dataplex Universal Catalog, ask your administrator to grant you the following IAM roles on the resource:
- Publish descriptions as aspects: Dataplex Universal Catalog Catalog Editor (roles/dataplex.catalogEditor)
- Publish queries as aspects: Dataplex Universal Catalog Entry and EntryLink Owner (roles/dataplex.entryOwner)
To enable APIs, ask your administrator to grant you the following IAM role:
- Service Usage Admin (roles/serviceusage.serviceUsageAdmin) on the project where you want to generate insights.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles. To see the exact permissions that are required to generate insights, expand the Required permissions section:
Required permissions
- bigquery.jobs.create
- bigquery.tables.get
- bigquery.tables.getData
- dataplex.datascans.create
- dataplex.datascans.get
- dataplex.datascans.getData
- dataplex.datascans.run
Generate insights for a BigQuery table
To generate insights for BigQuery tables, you can use the
Google Cloud console or the DATA_DOCUMENTATION scan type supported by the
Dataplex Universal Catalog
DataScans API.
These scans generate metadata, SQL queries for data exploration, schema
descriptions, and table-level summaries.
Console
To generate insights for a BigQuery table, you must access the table entry in BigQuery using BigQuery Studio.
In the Google Cloud console, go to BigQuery Studio.
In the Explorer pane, select the project, dataset, and then the table you want to generate insights for.
Click the Insights tab. If the tab is empty, insights haven't been generated for this table yet.
To generate insights and publish them to Dataplex Universal Catalog, click Generate and publish (Preview).
To generate insights without publishing them to Dataplex Universal Catalog, click Generate without publish.
For more information about the differences between the Generate and publish and Generate without publish modes, see Modes for generating insights.
Select a region to generate insights and click Generate.
It takes a few minutes for the insights to be populated.
If published data profiling results for the table are available, they're used to generate insights. Otherwise, insights are generated based on the column names and descriptions. For more information, see Best practices to improve generated insights.
In the Insights tab, explore the generated natural language questions.
To view the SQL query that answers a question, click the question.
To open a query in BigQuery, click Copy to Query.
To ask follow-up questions, do the following:
Click Ask a follow-up. The query opens in a new data canvas.
Click Run, then click Query these results.
To ask a follow-up question, enter a prompt in the Natural language prompt field or edit the SQL in the query editor.
To generate a new set of queries, click Generate insights and trigger the pipeline again.
After you have generated insights for a table, anyone with the
dataplex.datascans.getData permission and access to the table
can view those insights.
REST
To generate insights programmatically, use the Dataplex Universal Catalog DataScans API. To do this, complete the following steps:
- Optional: Create a data profile scan for the table
- Generate a data documentation datascan for the BigQuery table
- Check the data documentation scan status
- Publish the data documentation scan results to the BigQuery table
Optional: Create a data profile scan for the table
Data profile scans reduce hallucinations and approximations by Gemini because they ground the output in real values present in the data.
To create and run a data profile scan, follow these steps:
Create a data profile scan using the dataScans.create method.

Run the data profile scan using the dataScans.run method. Wait for the run to complete.

Publish the scan results to the BigQuery table by attaching the following data profiling labels to the table (a command-line sketch of all three steps follows this list):

- dataplex-dp-published-scan: DATASCAN_ID
- dataplex-dp-published-project: PROJECT_ID
- dataplex-dp-published-location: LOCATION
For more information, see Add labels to tables and views.
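The following is a minimal sketch of these three steps with curl, following the same endpoint pattern as the data documentation examples later in this document. It assumes an on-demand DATA_PROFILE scan with default profiling options and uses the BigQuery tables.patch endpoint to attach the labels; adjust the placeholders and the dataProfileSpec for your environment.

```
# 1. Create an on-demand data profile scan for the table (default profiling options).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN_ID" \
  -d '{
    "data": {
      "resource": "//bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID/tables/TABLE_ID"
    },
    "executionSpec": { "trigger": { "onDemand": {} } },
    "type": "DATA_PROFILE",
    "dataProfileSpec": {}
  }'

# 2. Run the scan, then wait for the returned job to complete.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN_ID:run"

# 3. Attach the data profiling labels to the table. This sketch sets only these
#    labels; keep any existing labels that you still need, and use lowercase values.
curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/datasets/DATASET_ID/tables/TABLE_ID" \
  -d '{
    "labels": {
      "dataplex-dp-published-scan": "DATASCAN_ID",
      "dataplex-dp-published-project": "PROJECT_ID",
      "dataplex-dp-published-location": "LOCATION"
    }
  }'
```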
Generate a data documentation datascan for the BigQuery table
You can choose to run a standard managed scan or a streamlined one-time scan.
Option A: Standard managed scan (create + run)
Use this method if you want to manage the scan resource over time.
Create a data documentation data scan using the dataScans.create method. You can customize the scope of the generation to include schema, descriptions, queries, or a combination by using the generation_scopes parameter:

To generate the schema, table descriptions, and SQL queries, leave data_documentation_spec empty or set generation_scopes to ALL. For example:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN_ID" \
  -d '{
    "data": {
      "resource": "//bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID/tables/TABLE_ID"
    },
    "executionSpec": { "trigger": { "onDemand": {} } },
    "type": "DATA_DOCUMENTATION",
    "dataDocumentationSpec": { "generation_scopes": "ALL" }
  }'
```

Replace the following:
- PROJECT_ID: the ID of your Google Cloud project where the dataset resides
- LOCATION: the region where the data scan runs
- DATASCAN_ID: a unique name you provide for this scan
- DATASET_ID: the ID of the BigQuery dataset being scanned
- TABLE_ID: the ID of the BigQuery table being scanned
To generate the schema, table descriptions, and column descriptions, without SQL queries, set generation_scopes to TABLE_AND_COLUMN_DESCRIPTIONS. For example:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN_ID" \
  -d '{
    "data": {
      "resource": "//bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID/tables/TABLE_ID"
    },
    "executionSpec": { "trigger": { "onDemand": {} } },
    "type": "DATA_DOCUMENTATION",
    "dataDocumentationSpec": { "generation_scopes": "TABLE_AND_COLUMN_DESCRIPTIONS" }
  }'
```

To generate SQL queries without descriptions, set generation_scopes to SQL_QUERIES. For example:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN_ID" \
  -d '{
    "data": {
      "resource": "//bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID/tables/TABLE_ID"
    },
    "executionSpec": { "trigger": { "onDemand": {} } },
    "type": "DATA_DOCUMENTATION",
    "dataDocumentationSpec": { "generation_scopes": "SQL_QUERIES" }
  }'
```
Start the data documentation scan job using the dataScans.run method. For example:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN_ID:run"
```

This request returns a unique job ID along with the initial state.
Option B: One-time scan (streamlined)
Use this method to initiate and complete a scan in a single API call. This method removes the need to call the run method separately and allows for automatic deletion of the scan resource using Time to Live (TTL) functionality.
Create and trigger the scan using the dataScans.create method. For
example:
```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans?dataScanId=DATASCAN_ID" \
  -d '{
    "data": {
      "resource": "//bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID/tables/TABLE_ID"
    },
    "type": "DATA_DOCUMENTATION",
    "dataDocumentationSpec": { "generation_scopes": "ALL" },
    "executionSpec": {
      "trigger": {
        "one_time": {
          "ttl_after_scan_completion": { "seconds": TTL_TIME }
        }
      }
    }
  }'
```
Replace the following:
- PROJECT_ID: the ID of your Google Cloud project where the dataset resides
- LOCATION: the region where the data scan runs
- DATASCAN_ID: a unique name you provide for this scan
- DATASET_ID: the ID of the BigQuery dataset being scanned
- TABLE_ID: the ID of the BigQuery table being scanned
- TTL_TIME: the duration in seconds after which the scan resource should be automatically deleted (for example, 3600 for one hour)
Check the data documentation scan status
Check completion of the scan job run by using the dataScans.jobs.get method. Use the job ID to fetch the status of the job. For example:

```
curl -X GET \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN_ID/jobs/JOB_ID"
```

The job is complete when the state is either SUCCEEDED or FAILED.
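If you're checking the status from a script, a simple polling loop like the following sketch can wait for a terminal state. It assumes that the DataScanJob response exposes the job state in a top-level state field, that the terminal states are SUCCEEDED, FAILED, and CANCELLED, and that jq is available.

```
# Poll the data documentation scan job until it reaches a terminal state.
while true; do
  STATE=$(curl -s -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/dataScans/DATASCAN_ID/jobs/JOB_ID" \
    | jq -r '.state')
  echo "Job state: ${STATE}"
  case "${STATE}" in
    SUCCEEDED|FAILED|CANCELLED) break ;;
  esac
  sleep 30
done
```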
Publish the data documentation scan results to the BigQuery table
To publish the scan results to the BigQuery table, attach the following data documentation labels to the table:
- dataplex-data-documentation-published-scan: DATASCAN_ID
- dataplex-data-documentation-published-project: PROJECT_ID
- dataplex-data-documentation-published-location: LOCATION
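One way to attach these labels is with the bq command-line tool's --set_label flag, as in the following sketch; you can also attach them in the Google Cloud console or through the BigQuery API, as shown earlier for the data profiling labels.

```
# Attach the data documentation labels to the table (label values must be lowercase).
bq update \
  --set_label dataplex-data-documentation-published-scan:DATASCAN_ID \
  --set_label dataplex-data-documentation-published-project:PROJECT_ID \
  --set_label dataplex-data-documentation-published-location:LOCATION \
  PROJECT_ID:DATASET_ID.TABLE_ID
```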
Generate insights for a BigQuery external table
BigQuery data insights supports generating insights for
BigQuery external tables with data in Cloud Storage.
You and the Dataplex Universal Catalog service account for the
current project must have the
Storage Object Viewer role (roles/storage.objectViewer)
on the Cloud Storage bucket that contains the data. For more
information, see
Add a principal to a bucket-level policy.
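For example, a bucket-level grant might look like the following sketch. It assumes the Dataplex Universal Catalog service account follows the service agent format service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com; confirm the exact account for your project before granting the role.

```
# Grant Storage Object Viewer on the bucket to your user account.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member="user:YOUR_EMAIL" \
  --role="roles/storage.objectViewer"

# Grant the same role to the Dataplex service account.
# The service agent address below is an assumption; verify it for your project.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
```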
To generate insights for a BigQuery external table, follow the instructions described in the Generate insights for a BigQuery table section of this document.
Generate insights for a BigLake table
To generate insights for a BigLake table, follow these steps:
Enable the BigQuery Connection API in your project.
Create a BigQuery connection. For more information, see Manage connections.
Grant the Storage Object Viewer (roles/storage.objectViewer) IAM role to the service account corresponding to the BigQuery connection that you created. You can retrieve the service account ID from the connection details, as shown in the sketch after these steps.
To generate insights, follow the instructions described in the Generate insights for a BigQuery table section of this document.
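The following sketch shows the grant step with the bq and gcloud command-line tools. CONNECTION_ID and BUCKET_NAME are hypothetical placeholders, and the service account value comes from the connection details returned by the first command.

```
# Look up the service account associated with the BigQuery connection.
bq show --connection PROJECT_ID.LOCATION.CONNECTION_ID

# Grant Storage Object Viewer on the bucket to that service account.
# Replace CONNECTION_SERVICE_ACCOUNT with the value from the previous command.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
  --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT" \
  --role="roles/storage.objectViewer"
```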
Generate table and column descriptions
Gemini in BigQuery automatically generates table and column descriptions when you generate data insights. You can edit these descriptions as necessary, and then save them to the table's metadata. The saved descriptions are used to generate future insights.
Control generation language
You can guide Gemini to generate table and column descriptions in a specific language. To do this, add a short directive (for example, "Generate table and column descriptions using the French language") to your table's existing description before you generate the data insights.
When you generate insights, Gemini interprets this directive and produces the metadata in the requested language. This mechanism works because Gemini uses existing table descriptions as context when generating new ones.
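For example, the following sketch adds such a directive with the bq command-line tool. The description text here is illustrative; because --description replaces the current value, include your table's existing description along with the directive.

```
# Append a language directive to the table description so that generated
# table and column descriptions are produced in French.
bq update \
  --description "Tracks customer subscription and churn data. Generate table and column descriptions using the French language." \
  PROJECT_ID:DATASET_ID.TABLE_ID
```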
For a list of supported languages, see Gemini language support.
Generate descriptions
To generate table and column descriptions, follow these steps:
Generate insights by following the instructions described in the relevant section of this document.
Click the Schema tab.
Click View column descriptions.
The table description and column descriptions that were generated are displayed.
To edit and save the generated table description, do the following:
In the Table description section, click Save to details.
To replace the current description with the generated description, click Copy suggested description.
Edit the table description as necessary, and then click Save to details.
The table description is updated immediately.
To edit and save the generated column descriptions, do the following:
In the Column descriptions section, click Save to schema.
The column descriptions that were generated are populated in the New description field for each column.
Edit the column descriptions as necessary, and then click Save.
The column descriptions are updated immediately.
To close the preview panel, click Close.
Best practices for generating data insights
To enhance the precision of your generated insights, adhere to the following recommendations:
Provide comprehensive descriptions. Ensure both tables and columns within the dataset have clear, detailed descriptions.
Ground insights with profiling. If descriptions are unavailable, ensure a profile scan is linked to each table in the dataset to aid in grounding the generated insights.
Explicitly define rules. In the respective table's description, include any relationships or business logic that you want the insights module to use to influence relationship generation.
Ground insights to data profiling results
In generative AI, grounding is the ability to connect model output to verifiable sources of information. You can ground generated table insights to data profiling results. Data profiling analyzes the columns in your BigQuery tables and identifies common statistical characteristics, such as typical data values and data distribution.
When you create a data profiling scan for a table, you can choose to publish the scan results to the BigQuery and Dataplex Universal Catalog pages in the Google Cloud console. Insights uses data profiling results to create more accurate, relevant queries by doing the following:
Analyzes the data profiling results to identify interesting patterns, trends, or outliers in the data.
Generates queries that focus on these patterns, trends, or outliers to uncover insights.
Validates the generated queries against the data profiling results to ensure that the queries return meaningful results.
Without data profiling scans, the following things happen:
The generated queries are more likely to include inaccurate clauses or produce meaningless results.
The generated column descriptions are based only on the column name.
Ensure that the data profiling scan for your table is up-to-date and that the results are published to BigQuery.
You can adjust your data profiling settings to increase the sampling size and filter out rows and columns. After you run a new data profiling scan, regenerate insights.
The quality of dataset insights also improves significantly if the tables in the dataset have data profiling results.
Add a table description
Detailed table descriptions that describe what you want to analyze in your table can help Gemini in BigQuery to produce more relevant insights for both table and dataset insights. After you add a table description, regenerate insights.
For example, you might add the following description to a telco_churn table
such as the one used in
Example of table data insights:
This table tracks customer churn data, including subscription details, tenure,
and service usage, to predict customer churn behavior.
If you save the table description that Gemini generates, then that description is used to generate future insights.
Add a column description
Column descriptions that explain what each column is, or how one column relates to another, can improve the quality of your insights for both table and dataset insights. After you update the column descriptions in your table, regenerate insights.
For example, you might add the following descriptions to specific columns of a
telco_churn table such as the one used in
Example of table data insights:
For the tenure column:
The number of months the customer has been with the service.
For the churn column:
Whether the customer has stopped using the service. TRUE indicates the customer
no longer uses the service, FALSE indicates the customer is active.
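One way to set these column descriptions is with the ALTER COLUMN SET OPTIONS DDL statement, run here through the bq command-line tool as a sketch; the dataset name is a hypothetical placeholder.

```
# Set the column descriptions on the telco_churn table with DDL.
bq query --use_legacy_sql=false '
ALTER TABLE DATASET_ID.telco_churn
  ALTER COLUMN tenure
  SET OPTIONS (description = "The number of months the customer has been with the service.");

ALTER TABLE DATASET_ID.telco_churn
  ALTER COLUMN churn
  SET OPTIONS (description = "Whether the customer has stopped using the service. TRUE indicates the customer no longer uses the service, FALSE indicates the customer is active.");
'
```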
If you save the column descriptions that Gemini generates, then those descriptions are used to generate future insights.
What's next
- Learn about Data insights overview.
- Learn how to Generate dataset insights.
- Learn more about Dataplex Universal Catalog data profiling.