Using a Human-in-the-Loop approach, where AI drafts the initial rules and you review, refine, and validate them, you can quickly translate profile statistics into a data quality framework.
Objectives
- Flatten nested BigQuery data with materialized views to enable Knowledge Catalog profiling.
- Run Knowledge Catalog profile scans using the Python client library.
- Use the Gemini CLI to generate data quality rules based on profile statistics.
- Validate and deploy AI-generated rules as Knowledge Catalog quality scans using a Human-in-the-Loop review process.
Before you begin
Before you begin, make sure you have a Google Cloud project with billing enabled.
Prepare your environment
The following steps use Cloud Shell, a command-line environment running in the cloud.
In the Google Cloud console, click Activate Cloud Shell in the top right toolbar. The environment takes a few moments to provision and connect.
In Cloud Shell, set up your project ID and environment variables:
export PROJECT_ID=$(gcloud config get-value project)
gcloud config set project $PROJECT_ID
export LOCATION="us-central1"
export BQ_LOCATION="us"
export DATASET_ID="dataplex_dq_codelab"
export TABLE_ID="ga4_transactions"
Use us (multi-region) as the BigQuery location because the public sample data is also stored in the us multi-region. For BigQuery queries, the source data and the destination table must be in the same location.
Enable the required services:
gcloud services enable dataplex.googleapis.com \
  bigquery.googleapis.com \
  serviceusage.googleapis.com
Create a BigQuery dataset to store sample data and results:
bq --location=us mk --dataset $PROJECT_ID:$DATASET_ID
Prepare the sample data, which comes from a public ecommerce dataset for the Google Merchandise Store.
The following bq command creates a new table, ga4_transactions, in your dataplex_dq_codelab dataset. To ensure scans run quickly, it copies data from only one day (2021-01-31).
bq query \
  --use_legacy_sql=false \
  --destination_table=$PROJECT_ID:$DATASET_ID.$TABLE_ID \
  --replace=true \
  'SELECT * FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`'
Clone the GitHub repository that contains the folder structure and supporting files for this tutorial:
# Perform a shallow clone to get only the latest repository structure without the full history
git clone --depth 1 --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/devrel-demos.git
cd devrel-demos
# Specify and download only the folder we need for this lab
git sparse-checkout set data-analytics/programmatic-dq
cd data-analytics/programmatic-dq
This directory is your active working area.
Profile nested data
With data profiling, Knowledge Catalog computes statistics for top-level columns, such as null percentages, uniqueness, and value distributions, to help you understand your data.
To get statistics for nested fields, you can flatten the data using a set of materialized views. This turns each nested field into a top-level column that Knowledge Catalog can profile.
Get the nested schema
Get the full schema of your source table, including all nested structures, and save the output as a JSON file:
bq show --schema --format=json $PROJECT_ID:$DATASET_ID.$TABLE_ID > bq_schema.json
View the schema:
jq . bq_schema.json
The bq_schema.json file reveals complex structures: many GA4 fields, such as event_params, device, geo, and items, are nested RECORD columns, and some of them are REPEATED.
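If you want a quick inventory of the nested fields, a small helper script (hypothetical, not part of the cloned repo) can walk bq_schema.json and print every RECORD path:
# list_nested_fields.py: hypothetical helper, not part of the repo.
# Walks the schema exported by `bq show --schema` and prints nested fields.
import json

def walk(fields, prefix=""):
    # Each field is a dict with name, type, mode, and (for RECORDs) fields.
    for field in fields:
        path = prefix + field["name"]
        if field.get("type") == "RECORD":
            print(f"{path} ({field.get('mode', 'NULLABLE')} RECORD)")
            walk(field.get("fields", []), prefix=path + ".")

with open("bq_schema.json") as f:
    walk(json.load(f))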
Flatten data with a materialized view
When you flatten nested data, it's important not to unnest multiple independent arrays in the same view. Doing so performs an implicit cross join (Cartesian product) between the arrays, which multiplies rows incorrectly and corrupts your data.
It's best to create multiple views instead, each built for a specific purpose. Each view should keep a single, clear level of detail. In this step, you create the following materialized views:
- Session flat view (mv_ga4_user_session_flat.sql): one row per event.
- Transactions view (mv_ga4_ecommerce_transactions.sql): one row per transaction.
- Items view (mv_ga4_ecommerce_items.sql): one row per item.
The project repository provides three SQL files in the devrel-demos/data-analytics/programmatic-dq directory that define these views.
Run these files from Cloud Shell using the following commands. The envsubst command substitutes your environment variables (such as $PROJECT_ID and $DATASET_ID) into each SQL template before piping it to bq query.
envsubst < mv_ga4_user_session_flat.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_transactions.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_items.sql | bq query --use_legacy_sql=false
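For context, each file contains a CREATE MATERIALIZED VIEW statement that unnests exactly one array. The following sketch shows the same idea driven from the google-cloud-bigquery Python client instead of the shell; the column list is illustrative, and the repo's SQL files remain the source of truth:
# flatten_items_sketch.py: illustrative only; the repo's SQL files are authoritative.
import os
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
project = os.environ["PROJECT_ID"]
dataset = os.environ["DATASET_ID"]

# Unnest a single array (items) so each nested field becomes a top-level
# column. Keeping one UNNEST per view avoids the implicit cross join.
ddl = f"""
CREATE MATERIALIZED VIEW IF NOT EXISTS
  `{project}.{dataset}.mv_ga4_ecommerce_items_sketch` AS
SELECT
  event_timestamp,
  user_pseudo_id,
  item.item_id,
  item.item_name,
  item.price
FROM `{project}.{dataset}.ga4_transactions`,
  UNNEST(items) AS item
"""
client.query(ddl).result()  # wait for the DDL job to complete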
Run profile scans with the Python client
You can now create and run Knowledge Catalog data profile scans for each materialized view. The following Python script uses the google-cloud-dataplex client library to automate this process.
Before you run the script, create an isolated Python virtual environment in your project directory.
# Create the virtual environment
python3 -m venv dq_venv
# Activate the environment
source dq_venv/bin/activate
Install the Knowledge Catalog client library inside the virtual environment.
# Install the Dataplex client library
pip install google-cloud-dataplex
Now that you've set up the environment and installed the library, you're ready to use the 1_run_dataplex_scans.py script. This script profiles your three materialized views by creating and running a scan for each one. When it finishes, it outputs a rich statistical summary that you use in the next step to generate AI-powered data quality rules.
Run the script from your Cloud Shell terminal.
python3 1_run_dataplex_scans.py
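For reference, the core of the script boils down to a handful of dataplex_v1 calls per view. The following simplified sketch shows one view; the script itself loops over all three and is the source of truth:
# profile_scan_sketch.py: simplified sketch of what 1_run_dataplex_scans.py does.
import os
from google.cloud import dataplex_v1

project = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]
dataset = os.environ["DATASET_ID"]
view = "mv_ga4_user_session_flat"

client = dataplex_v1.DataScanServiceClient()

scan = dataplex_v1.DataScan(
    data=dataplex_v1.DataSource(
        resource=f"//bigquery.googleapis.com/projects/{project}"
                 f"/datasets/{dataset}/tables/{view}"
    ),
    # An empty DataProfileSpec profiles every supported column.
    data_profile_spec=dataplex_v1.DataProfileSpec(),
)

# create_data_scan returns a long-running operation; block until it's done.
operation = client.create_data_scan(
    parent=f"projects/{project}/locations/{location}",
    data_scan=scan,
    data_scan_id=f"profile-scan-{view.replace('_', '-')}",
)
created = operation.result()

# Trigger an on-demand run of the new scan.
response = client.run_data_scan(name=created.name)
print(f"Started profile job: {response.job.name}")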
Check your profile scans
You can check out the new profile scans in the Google Cloud console.
- In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
- Find your three profile scans listed, along with their latest job status. Click a scan to explore its detailed results.
Export profile results to JSON
For the Gemini CLI to read your profile scan results, you need to export them to a local file.
Use the 2_dq_profile_save.py script to find the latest successful scan for the mv_ga4_user_session_flat view, download the profile data, and save it to a file named dq_profile_results.json.
python3 2_dq_profile_save.py
When the script finishes, it creates a dq_profile_results.json file in the directory. This file holds the detailed statistical metadata you need to generate data quality rules. Take a look at its contents by running the following command:
cat dq_profile_results.json
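Under the hood, the export amounts to fetching the scan with its latest result attached and serializing it to JSON, roughly like the following sketch (the script in the repo is the source of truth):
# export_profile_sketch.py: simplified sketch of what 2_dq_profile_save.py does.
import os
from google.cloud import dataplex_v1

project = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]
scan_name = (
    f"projects/{project}/locations/{location}"
    "/dataScans/profile-scan-mv-ga4-user-session-flat"
)

client = dataplex_v1.DataScanServiceClient()

# view=FULL attaches the latest job's data_profile_result to the scan.
scan = client.get_data_scan(
    request=dataplex_v1.GetDataScanRequest(
        name=scan_name,
        view=dataplex_v1.GetDataScanRequest.DataScanView.FULL,
    )
)

# proto-plus messages serialize to JSON directly.
with open("dq_profile_results.json", "w") as f:
    f.write(dataplex_v1.DataScan.to_json(scan))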
Generate data quality rules with the Gemini CLI
Now you can use the Gemini CLI to read the local profile scan results.
Manually writing data quality rules for complex datasets is time-consuming and error-prone. Generative AI accelerates this workflow by generating a comprehensive initial data quality configuration in seconds. This helps you pivot from manual task execution to high-level oversight.
To start the Gemini CLI, use the following command:
gemini
Now you're ready to generate quality rules. Because the CLI can read files in your current directory, it can use your new profile scan data directly.
Prompt Gemini to create a plan
Ask Gemini to act as an expert analyst and propose a plan for creating your data quality rules. Tell Gemini not to write the YAML file yet so that it focuses on analysis. Gemini analyzes the JSON file and returns a structured plan.
You are an expert Google Cloud Dataplex engineer.
Your first task is to create a plan. I have a file in the current directory named ./dq_profile_results.json.
Based on the statistical data within that file, propose a step-by-step plan to create a Dataplex data quality rules file.
Your plan should identify which specific columns are good candidates for rules like nonNullExpectation, setExpectation, or rangeExpectation, and explain why based on the metrics (for example, "Plan to create a nonNullExpectation for column X because its null percentage is 0%").
Do not write the YAML file yet. Just provide the plan.
Generate data quality rules
Gemini's plan relies entirely on statistical patterns and lacks your specific business knowledge.
Review the plan and ask yourself the following questions:
- Does it align with your business goals and context?
- Are any statistically sound rules actually impractical (like a strict rowCount for a growing table)?
Refine the plan with Gemini, or approve it as-is. The following example prompt starts with feedback (pick or adapt one of the openers below), then instructs Gemini to generate the dq_rules.yaml file in your working directory and to conform to the DataQualityRule specification, because Knowledge Catalog requires a precise YAML structure. This helps prevent syntax errors and the use of outdated schema versions.
- "The plan looks good. Please proceed."
- "The rowCount rule is not necessary, as the table size changes daily. The rest of the plan is approved. Please proceed."
- "For the setExpectation on the geo_continent column, please also include 'Antarctica'."
Once you have incorporated my feedback, please generate the `dq_rules.yaml` file.
You must adhere to the following strict requirements:
- Schema Compliance: The YAML structure must strictly conform to the DataQualityRule specification. For a definitive source of truth, you must refer to the sample_rule.yaml file in the current directory and the DataQualityRule class definition. Search for the `data_quality.py` file inside the `./dq_venv/lib/` directory to read this class definition.
- Data-Driven Values: All rule parameters, such as thresholds or expected values, must be derived directly from the statistical metrics in dq_profile_results.json.
- Rule Justification: For each rule, add a comment (#) on the line above explaining the justification, as you outlined in your plan.
- Output Purity: The final output must only be the raw YAML code block, perfectly formatted and ready for immediate deployment.
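Before you deploy the generated file, a quick programmatic sanity check is a cheap extra safeguard. The following sketch assumes the spec's top-level rules list and only verifies basic shape, not full schema compliance:
# validate_rules_sketch.py: lightweight sanity check for dq_rules.yaml.
import sys
import yaml  # pip install pyyaml

# Expectation keys from the Dataplex DataQualityRule specification.
EXPECTATION_KEYS = {
    "nonNullExpectation", "setExpectation", "rangeExpectation",
    "regexExpectation", "uniquenessExpectation", "statisticRangeExpectation",
    "rowConditionExpectation", "tableConditionExpectation", "sqlAssertion",
}
TABLE_LEVEL = {"rowConditionExpectation", "tableConditionExpectation", "sqlAssertion"}

with open("dq_rules.yaml") as f:
    spec = yaml.safe_load(f)

rules = spec.get("rules", [])
if not rules:
    sys.exit("No rules found; check the top-level 'rules' key.")

for i, rule in enumerate(rules):
    expectations = EXPECTATION_KEYS & rule.keys()
    if len(expectations) != 1:
        print(f"Rule {i}: expected exactly one expectation, found {sorted(expectations)}")
    elif "column" not in rule and not expectations & TABLE_LEVEL:
        print(f"Rule {i}: missing 'column'")
print(f"Checked {len(rules)} rules.")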
Create and run a data quality scan
You now have an agent-generated set of data quality rules that you can register and deploy as a scan.
Exit the Gemini CLI by entering /quit or pressing Ctrl+C twice. Then, create a data scan in Knowledge Catalog:
export DQ_SCAN="dq-scan"
gcloud dataplex datascans create data-quality $DQ_SCAN \
  --project=$PROJECT_ID \
  --location=$LOCATION \
  --data-quality-spec-file=dq_rules.yaml \
  --data-source-resource="//bigquery.googleapis.com/projects/$PROJECT_ID/datasets/$DATASET_ID/tables/mv_ga4_user_session_flat"
Finally, run the scan:
gcloud dataplex datascans run $DQ_SCAN --location=$LOCATION --project=$PROJECT_ID
These commands create a data quality scan named dq-scan and trigger an on-demand run. Check the scan's progress in the Knowledge Catalog section of the Google Cloud console.
- In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
- Find dq-scan in the list. When the scan completes, click it to see the results.
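You can also inspect the outcome programmatically. The following sketch fetches the scan with its latest result and prints per-rule pass/fail status (it assumes at least one run has completed):
# check_dq_results_sketch.py: print pass/fail per rule for the latest run.
import os
from google.cloud import dataplex_v1

project = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]

client = dataplex_v1.DataScanServiceClient()

# view=FULL attaches the latest job's data_quality_result to the scan.
scan = client.get_data_scan(
    request=dataplex_v1.GetDataScanRequest(
        name=f"projects/{project}/locations/{location}/dataScans/dq-scan",
        view=dataplex_v1.GetDataScanRequest.DataScanView.FULL,
    )
)

result = scan.data_quality_result
print(f"Overall passed: {result.passed}")
for rule_result in result.rules:
    rule = rule_result.rule
    target = rule.column or "(table-level)"
    status = "PASS" if rule_result.passed else "FAIL"
    print(f"- {rule.dimension} rule on {target}: {status}")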
Clean up
To avoid ongoing charges for the resources you created in this tutorial, delete them.
Delete the Knowledge Catalog scans
Delete your profile and quality scans using the specific scan names from this codelab:
# Delete the Data Quality Scan
gcloud dataplex datascans delete dq-scan \
--location=us-central1 \
--project=$PROJECT_ID --quiet
# Delete the Data Profile Scans
gcloud dataplex datascans delete profile-scan-mv-ga4-user-session-flat \
--location=us-central1 \
--project=$PROJECT_ID --quiet
gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-transactions \
--location=us-central1 \
--project=$PROJECT_ID --quiet
gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-items \
--location=us-central1 \
--project=$PROJECT_ID --quiet
Delete the sample dataset
Delete your temporary BigQuery dataset and its tables.
bq rm -r -f --dataset $PROJECT_ID:dataplex_dq_codelab
Delete local files
Deactivate the Python virtual environment and remove the cloned repository and its contents:
deactivate
cd ../../..
rm -rf devrel-demos
Conclusion
Congratulations, you just built an end-to-end, programmatic data governance workflow.
By pairing Gemini with Knowledge Catalog, you've built a foundation for AI-assisted governance. This approach doesn't replace the governance loop, but accelerates the process of rule creation so that you can focus on validating and refining rules based on your business logic.
What's next
- Read more about the philosophy behind this architecture in AI-Assisted Governance: Accelerating Data Quality with Human Oversight.
- Manage data quality as code by creating a CI/CD pipeline.
- Explore using custom SQL rules to enforce business-specific logic.
- Optimize your scans with filters and sampling to reduce costs.
- Automate your infrastructure by provisioning Knowledge Catalog resources with Terraform to manage your data governance at scale.
- Learn more about the Gemini CLI open-source AI agent.
- Try other Knowledge Catalog use cases.