Using a Human-in-the-Loop approach, where AI drafts the initial rules and you review, refine, and validate them, you can quickly translate profile statistics into a data quality framework.
Objectives
- Flatten nested BigQuery data with materialized views to enable Knowledge Catalog profiling.
- Run Knowledge Catalog profile scans using the Python client library.
- Use the Gemini CLI to generate data quality rules based on profile statistics.
- Validate and deploy AI-generated rules as Knowledge Catalog quality scans using a Human-in-the-Loop review process.
Before you begin
Before you begin, make sure you have a Google Cloud project with billing enabled.
Prepare your environment
The following steps use Cloud Shell, a command-line environment running in the cloud.
In the Google Cloud console, click Activate Cloud Shell in the top right toolbar. The environment takes a few moments to provision and connect.
In Cloud Shell, set up your project ID and environment variables:
export PROJECT_ID=$(gcloud config get-value project)
gcloud config set project $PROJECT_ID
export LOCATION="us-central1"
export BQ_LOCATION="us"
export DATASET_ID="dataplex_dq_codelab"
export TABLE_ID="ga4_transactions"
Use us (multi-region) as the BigQuery location because the public sample data is also stored in the us multi-region. For BigQuery queries, the source data and the destination table must be in the same location.
Enable the required services:
gcloud services enable dataplex.googleapis.com \
  bigquery.googleapis.com \
  serviceusage.googleapis.com
Create a BigQuery dataset to store sample data and results:
bq --location=us mk --dataset $PROJECT_ID:$DATASET_ID
Prepare the sample data, which comes from a public ecommerce dataset for the Google Merchandise Store.
The following bq command creates a new table, ga4_transactions, in your dataplex_dq_codelab dataset. To ensure scans run quickly, it copies data from only one day (2021-01-31).
bq query \
  --use_legacy_sql=false \
  --destination_table=$PROJECT_ID:$DATASET_ID.$TABLE_ID \
  --replace=true \
  'SELECT * FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`'
Clone the GitHub repository that contains the folder structure and supporting files for this tutorial:
# Perform a shallow clone to get only the latest repository structure without the full history
git clone --depth 1 --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/devrel-demos.git
cd devrel-demos
# Specify and download only the folder we need for this lab
git sparse-checkout set data-analytics/programmatic-dq
cd data-analytics/programmatic-dq
This directory is your active working area.
Profile nested data
With data profiling, Knowledge Catalog computes statistics for top-level columns, such as null percentages, uniqueness, and value distributions, to help you understand your data.
To get statistics for nested fields, you can flatten the data using a set of materialized views. This turns each nested field into a top-level column that Knowledge Catalog can profile.
Get the nested schema
Get the full schema of your source table, including all nested structures, and save the output as a JSON file:
bq show --schema --format=json $PROJECT_ID:$DATASET_ID.$TABLE_ID > bq_schema.json
View the schema:
jq . bq_schema.json
The bq_schema.json file reveals complex structures: many GA4 fields, such as event_params, device, geo, and items, are nested RECORD columns, and some of them are REPEATED.
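If you want a quick inventory of the nested fields, a small helper script (hypothetical, not part of the cloned repo) can walk bq_schema.json and print every RECORD path:
# list_nested_fields.py: hypothetical helper, not part of the repo.
# Walks the schema exported by `bq show --schema` and prints nested fields.
import json

def walk(fields, prefix=""):
    # Each field is a dict with name, type, mode, and (for RECORDs) fields.
    for field in fields:
        path = prefix + field["name"]
        if field.get("type") == "RECORD":
            print(f"{path} ({field.get('mode', 'NULLABLE')} RECORD)")
            walk(field.get("fields", []), prefix=path + ".")

with open("bq_schema.json") as f:
    walk(json.load(f))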
Flatten data with a materialized view
When you flatten nested data, it's important not to unnest multiple independent arrays in the same view. Doing so performs an implicit cross join (Cartesian product) between the arrays, which multiplies rows incorrectly and corrupts your data.
It's best to create multiple views instead, each built for a specific purpose. Each view should keep a single, clear level of detail. In this step, you create the following materialized views:
- Session flat view (mv_ga4_user_session_flat.sql): one row per event.
- Transactions view (mv_ga4_ecommerce_transactions.sql): one row per transaction.
- Items view (mv_ga4_ecommerce_items.sql): one row per item.
The project repository provides three SQL files in the devrel-demos/data-analytics/programmatic-dq directory that define these views.
Run these files from Cloud Shell using the following commands. The envsubst command substitutes your environment variables (such as $PROJECT_ID and $DATASET_ID) into each SQL template before piping it to bq query.
envsubst < mv_ga4_user_session_flat.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_transactions.sql | bq query --use_legacy_sql=false
envsubst < mv_ga4_ecommerce_items.sql | bq query --use_legacy_sql=false
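For context, each file contains a CREATE MATERIALIZED VIEW statement that unnests exactly one array. The following sketch shows the same idea driven from the google-cloud-bigquery Python client instead of the shell; the column list is illustrative, and the repo's SQL files remain the source of truth:
# flatten_items_sketch.py: illustrative only; the repo's SQL files are authoritative.
import os
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
project = os.environ["PROJECT_ID"]
dataset = os.environ["DATASET_ID"]

# Unnest a single array (items) so each nested field becomes a top-level
# column. Keeping one UNNEST per view avoids the implicit cross join.
ddl = f"""
CREATE MATERIALIZED VIEW IF NOT EXISTS
  `{project}.{dataset}.mv_ga4_ecommerce_items_sketch` AS
SELECT
  event_timestamp,
  user_pseudo_id,
  item.item_id,
  item.item_name,
  item.price
FROM `{project}.{dataset}.ga4_transactions`,
  UNNEST(items) AS item
"""
client.query(ddl).result()  # wait for the DDL job to complete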
Run profile scans with the Python client
You can now create and run Knowledge Catalog data profile scans for each materialized view. The following Python script uses the google-cloud-dataplex client library to automate this process.
Before you run the script, create an isolated Python virtual environment in your project directory.
# Create the virtual environment
python3 -m venv dq_venv
# Activate the environment
source dq_venv/bin/activate
Install the Knowledge Catalog client library inside the virtual environment.
# Install the Dataplex client library
pip install google-cloud-dataplex
Now that you've set up the environment and installed the library, you're ready to use the 1_run_dataplex_scans.py script. This script profiles your three materialized views by creating and running a scan for each one. When it finishes, it outputs a rich statistical summary that you use in the next step to generate AI-powered data quality rules.
Run the script from your Cloud Shell terminal.
python3 1_run_dataplex_scans.py
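For reference, the core of the script boils down to a handful of dataplex_v1 calls per view. The following simplified sketch shows one view; the script itself loops over all three and is the source of truth:
# profile_scan_sketch.py: simplified sketch of what 1_run_dataplex_scans.py does.
import os
from google.cloud import dataplex_v1

project = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]
dataset = os.environ["DATASET_ID"]
view = "mv_ga4_user_session_flat"

client = dataplex_v1.DataScanServiceClient()

scan = dataplex_v1.DataScan(
    data=dataplex_v1.DataSource(
        resource=f"//bigquery.googleapis.com/projects/{project}"
                 f"/datasets/{dataset}/tables/{view}"
    ),
    # An empty DataProfileSpec profiles every supported column.
    data_profile_spec=dataplex_v1.DataProfileSpec(),
)

# create_data_scan returns a long-running operation; block until it's done.
operation = client.create_data_scan(
    parent=f"projects/{project}/locations/{location}",
    data_scan=scan,
    data_scan_id=f"profile-scan-{view.replace('_', '-')}",
)
created = operation.result()

# Trigger an on-demand run of the new scan.
response = client.run_data_scan(name=created.name)
print(f"Started profile job: {response.job.name}")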
Check your profile scans
You can check out the new profile scans in the Google Cloud console.
- In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
- Find your three profile scans listed, along with their latest job status. Click a scan to explore its detailed results.
Export profile results to JSON
For the Gemini CLI to read your profile scan results, you need to export them to a local file.
Use the 2_dq_profile_save.py script to find the latest successful scan for the mv_ga4_user_session_flat view, download the profile data, and save it to a file named dq_profile_results.json.
python3 2_dq_profile_save.py
When the script finishes, it creates a dq_profile_results.json file in the directory. This file holds the detailed statistical metadata you need to generate data quality rules. Take a look at its contents by running the following command:
cat dq_profile_results.json
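Under the hood, the export amounts to fetching the scan with its latest result attached and serializing it to JSON, roughly like the following sketch (the script in the repo is the source of truth):
# export_profile_sketch.py: simplified sketch of what 2_dq_profile_save.py does.
import os
from google.cloud import dataplex_v1

project = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]
scan_name = (
    f"projects/{project}/locations/{location}"
    "/dataScans/profile-scan-mv-ga4-user-session-flat"
)

client = dataplex_v1.DataScanServiceClient()

# view=FULL attaches the latest job's data_profile_result to the scan.
scan = client.get_data_scan(
    request=dataplex_v1.GetDataScanRequest(
        name=scan_name,
        view=dataplex_v1.GetDataScanRequest.DataScanView.FULL,
    )
)

# proto-plus messages serialize to JSON directly.
with open("dq_profile_results.json", "w") as f:
    f.write(dataplex_v1.DataScan.to_json(scan))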
Generate data quality rules with the Gemini CLI
Now you can use the Gemini CLI to read the local profile scan results.
Manually writing data quality rules for complex datasets is time-consuming and error-prone. Generative AI accelerates this workflow by generating a comprehensive initial data quality configuration in seconds. This helps you pivot from manual task execution to high-level oversight.
To start the Gemini CLI, use the following command:
gemini
Now you're ready to generate quality rules. Because the CLI can read files in your current directory, it can use your new profile scan data directly.
Prompt Gemini to create a plan
Ask Gemini to act as an expert analyst and propose a plan for creating your data quality rules. Tell Gemini not to write the YAML file yet so that it focuses on analysis. Gemini analyzes the JSON file and returns a structured plan.
You are an expert Google Cloud Dataplex engineer.
Your first task is to create a plan. I have a file in the current directory named ./dq_profile_results.json.
Based on the statistical data within that file, propose a step-by-step plan to create a Dataplex data quality rules file.
Your plan should identify which specific columns are good candidates for rules like nonNullExpectation, setExpectation, or rangeExpectation, and explain why based on the metrics (for example, "Plan to create a nonNullExpectation for column X because its null percentage is 0%").
Do not write the YAML file yet. Just provide the plan.
Generate data quality rules
Gemini's plan relies entirely on statistical patterns and lacks your specific business knowledge.
Review the plan and ask yourself the following questions:
- Does it align with your business goals and context?
- Are any statistically sound rules actually impractical (like a strict rowCount for a growing table)?
Refine the plan with Gemini, or approve it as-is. The following example prompt starts with feedback (pick or adapt one of the openers below), then instructs Gemini to generate the dq_rules.yaml file in your working directory and to conform to the DataQualityRule specification, because Knowledge Catalog requires a precise YAML structure. This helps prevent syntax errors and the use of outdated schema versions.
- "The plan looks good. Please proceed."
- "The rowCount rule is not necessary, as the table size changes daily. The rest of the plan is approved. Please proceed."
- "For the setExpectation on the geo_continent column, please also include 'Antarctica'."
Once you have incorporated my feedback, please generate the `dq_rules.yaml` file.
You must adhere to the following strict requirements:
- Schema Compliance: The YAML structure must strictly conform to the DataQualityRule specification. For a definitive source of truth, you must refer to the sample_rule.yaml file in the current directory and the DataQualityRule class definition. Search for the `data_quality.py` file inside the `./dq_venv/lib/` directory to read this class definition.
- Data-Driven Values: All rule parameters, such as thresholds or expected values, must be derived directly from the statistical metrics in dq_profile_results.json.
- Rule Justification: For each rule, add a comment (#) on the line above explaining the justification, as you outlined in your plan.
- Output Purity: The final output must only be the raw YAML code block, perfectly formatted and ready for immediate deployment.
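Before you deploy the generated file, a quick programmatic sanity check is a cheap extra safeguard. The following sketch assumes the spec's top-level rules list and only verifies basic shape, not full schema compliance:
# validate_rules_sketch.py: lightweight sanity check for dq_rules.yaml.
import sys
import yaml  # pip install pyyaml

# Expectation keys from the Dataplex DataQualityRule specification.
EXPECTATION_KEYS = {
    "nonNullExpectation", "setExpectation", "rangeExpectation",
    "regexExpectation", "uniquenessExpectation", "statisticRangeExpectation",
    "rowConditionExpectation", "tableConditionExpectation", "sqlAssertion",
}
TABLE_LEVEL = {"rowConditionExpectation", "tableConditionExpectation", "sqlAssertion"}

with open("dq_rules.yaml") as f:
    spec = yaml.safe_load(f)

rules = spec.get("rules", [])
if not rules:
    sys.exit("No rules found; check the top-level 'rules' key.")

for i, rule in enumerate(rules):
    expectations = EXPECTATION_KEYS & rule.keys()
    if len(expectations) != 1:
        print(f"Rule {i}: expected exactly one expectation, found {sorted(expectations)}")
    elif "column" not in rule and not expectations & TABLE_LEVEL:
        print(f"Rule {i}: missing 'column'")
print(f"Checked {len(rules)} rules.")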
Create and run a data quality scan
You now have an agent-generated set of data quality rules that you can register and deploy as a scan.
Exit the Gemini CLI by entering /quit or pressing Ctrl+C twice. Then, create a data scan in Knowledge Catalog:
export DQ_SCAN="dq-scan"
gcloud dataplex datascans create data-quality $DQ_SCAN \
  --project=$PROJECT_ID \
  --location=$LOCATION \
  --data-quality-spec-file=dq_rules.yaml \
  --data-source-resource="//bigquery.googleapis.com/projects/$PROJECT_ID/datasets/$DATASET_ID/tables/mv_ga4_user_session_flat"
Finally, run the scan:
gcloud dataplex datascans run $DQ_SCAN --location=$LOCATION --project=$PROJECT_ID
These commands create a data quality scan named dq-scan and trigger an on-demand run. Check the scan's progress in the Knowledge Catalog section of the Google Cloud console.
- In the navigation menu, go to Knowledge Catalog and select Data profiling & quality in the Govern section.
- Find dq-scan in the list. When the scan completes, click it to see the results.
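You can also inspect the outcome programmatically. The following sketch fetches the scan with its latest result and prints per-rule pass/fail status (it assumes at least one run has completed):
# check_dq_results_sketch.py: print pass/fail per rule for the latest run.
import os
from google.cloud import dataplex_v1

project = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]

client = dataplex_v1.DataScanServiceClient()

# view=FULL attaches the latest job's data_quality_result to the scan.
scan = client.get_data_scan(
    request=dataplex_v1.GetDataScanRequest(
        name=f"projects/{project}/locations/{location}/dataScans/dq-scan",
        view=dataplex_v1.GetDataScanRequest.DataScanView.FULL,
    )
)

result = scan.data_quality_result
print(f"Overall passed: {result.passed}")
for rule_result in result.rules:
    rule = rule_result.rule
    target = rule.column or "(table-level)"
    status = "PASS" if rule_result.passed else "FAIL"
    print(f"- {rule.dimension} rule on {target}: {status}")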
Clean up
To avoid ongoing charges for the resources you created in this tutorial, delete them.
Delete the Knowledge Catalog scans
Delete your profile and quality scans using the specific scan names from this codelab:
# Delete the Data Quality Scan
gcloud dataplex datascans delete dq-scan \
--location=us-central1 \
--project=$PROJECT_ID --quiet
# Delete the Data Profile Scans
gcloud dataplex datascans delete profile-scan-mv-ga4-user-session-flat \
--location=us-central1 \
--project=$PROJECT_ID --quiet
gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-transactions \
--location=us-central1 \
--project=$PROJECT_ID --quiet
gcloud dataplex datascans delete profile-scan-mv-ga4-ecommerce-items \
--location=us-central1 \
--project=$PROJECT_ID --quiet
Delete the sample dataset
Delete your temporary BigQuery dataset and its tables.
bq rm -r -f --dataset $PROJECT_ID:dataplex_dq_codelab
Delete local files
Deactivate the Python virtual environment and remove the cloned repository and its contents:
deactivate
cd ../../..
rm -rf devrel-demos
Conclusion
Congratulations, you just built an end-to-end, programmatic data governance workflow.
By pairing Gemini with Knowledge Catalog, you've built a foundation for AI-assisted governance. This approach doesn't replace the governance loop, but accelerates the process of rule creation so that you can focus on validating and refining rules based on your business logic.
What's next
- Read more about the philosophy behind this architecture in AI-Assisted Governance: Accelerating Data Quality with Human Oversight.
- Manage data quality as code by creating a CI/CD pipeline.
- Explore using custom SQL rules to enforce business-specific logic.
- Optimize your scans with filters and sampling to reduce costs.
- Automate your infrastructure by provisioning Knowledge Catalog resources with Terraform to manage your data governance at scale.
- Learn more about the Gemini CLI open-source AI agent.
- Try other Knowledge Catalog use cases.