AI agents can reason, but they start with zero knowledge about your specific company. Imagine asking an agent, "What is our Q1 revenue?" Without guidance, the agent might pick from dozens of tables named "revenue" in your databases, ranging from official reports to messy test data. If the agent picks the table with the closest-sounding name, it could return convincingly wrong answers based on unverified sources.
Metadata enrichment is the solution to this context problem. In this tutorial, you set up aspects that provide this context, and then use the Gemini CLI to test that an agent can accurately ground its answers in trusted, certified data.
Objectives
- Create a realistic data lake for testing.
- Use Knowledge Catalog aspects to label your "gold" data and distinguish it from test data.
- Test your data context locally with the Gemini CLI.
Before you begin
Make sure you do the following:
- Pick a Google Cloud project for this tutorial.
- Confirm that billing is enabled for your project.
To complete this tutorial, you should also have a basic understanding of BigQuery, Knowledge Catalog, and Terraform.
Prepare your environment
This tutorial uses Google Cloud Shell, a command-line environment that runs in the cloud.
From the Google Cloud console, click Activate Cloud Shell in the top right toolbar. It takes a few moments to provision and connect to the environment.
In Cloud Shell, set your PROJECT_ID and REGION variables so that all future commands target your specific Google Cloud project.

export PROJECT_ID=$(gcloud config get-value project)
gcloud config set project $PROJECT_ID
export REGION="us-central1"

Enable the necessary Google Cloud services.

gcloud services enable \
  artifactregistry.googleapis.com \
  bigqueryunified.googleapis.com \
  cloudaicompanion.googleapis.com \
  cloudbuild.googleapis.com \
  cloudresourcemanager.googleapis.com \
  datacatalog.googleapis.com \
  run.googleapis.com

Clone the Google Cloud DevRel Demos repository.
Download the infrastructure code and scripts from GitHub. Use a sparse checkout to pull only the specific folder you need for this tutorial.
# Perform a shallow clone to get only the latest repository structure without the full history
git clone --depth 1 --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/devrel-demos.git
cd devrel-demos

# Specify and download only the folder you need for this tutorial
git sparse-checkout set data-analytics/governance-context
cd data-analytics/governance-context
Build the data lake
To make things realistic, you need a mix of official data and untrusted, messy data. Use the preconfigured Terraform configuration files from the tutorial repository to set this up quickly.
The Terraform configuration handles two tasks:
- Sets up the Knowledge Catalog aspect types (metadata templates), the BigQuery datasets, and tables, including finance_mart.fin_monthly_closing_internal and analyst_sandbox.tmp_data_dump_v2_final_real.
- Loads sample data into the tables.
Open the terraform directory and initialize it.

cd terraform
terraform init

Apply the configuration. This might take up to a minute.
terraform apply -var="project_id=${PROJECT_ID}" -var="region=${REGION}" -auto-approve
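To confirm that the apply completed cleanly, you can optionally run terraform plan with the same variables from the same directory. If everything was created as expected, the plan reports that no changes are needed.

terraform plan -var="project_id=${PROJECT_ID}" -var="region=${REGION}"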
You now have an ungoverned data lake! To an AI agent, the tables in your data lake look exactly the same since they're just objects with columns. To fix this, you need to apply governance in the next step.
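If you want to see for yourself how indistinguishable the tables look at this point, you can list them with the bq command-line tool. The dataset names below are taken from the table names mentioned earlier; adjust them if your setup differs.

bq ls "${PROJECT_ID}:finance_mart"
bq ls "${PROJECT_ID}:analyst_sandbox"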
Apply governance
This is the most important part of the setup. Right now, your two tables look identical to an AI agent. To tell them apart, you apply aspects, which are like certified metadata labels that give the agent the context it needs. In this section, you use two scripts: one to generate the metadata and another to apply it to your tables.
Generate governance payloads
Terraform already set up the aspect types, so now you need to generate the data to fill them.
The ./generate_payloads.sh script creates an aspect_payloads/ directory that contains four YAML files. Each file defines a different governance scenario, which you apply in the next step.
Navigate back to the root of the tutorial directory and run the ./generate_payloads.sh script:
cd ..
chmod +x ./generate_payloads.sh
./generate_payloads.sh
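If you want to review all four scenarios at once before applying them, a quick shell loop like the following prints each generated file. This loop is only a convenience sketch; it isn't part of the repository scripts.

# List the generated payload files and print each one for review
ls aspect_payloads/
for f in aspect_payloads/*.yaml; do
  echo "=== ${f} ==="
  cat "${f}"
done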
Apply aspects
Before you run the apply_governance.sh script, take a look at the data that will be attached to your tables. Run the following command to see the metadata defined for your internal finance data:

cat aspect_payloads/fin_internal.yaml

The YAML file defines the business context for the table:

your-project-id.us-central1.official-data-product-spec:
  data:
    product_tier: GOLD_CRITICAL
    data_domain: FINANCE
    usage_scope: INTERNAL_ONLY
    update_frequency: DAILY_BATCH
    is_certified: true

Notice how this explicitly marks the data as is_certified: true and assigns it the GOLD_CRITICAL tier. This provides the AI agent with clear, structured rules to follow.

Run the apply_governance.sh script. This script iterates through your BigQuery tables and uses the gcloud CLI to "stamp" the metadata from your YAML payloads onto each table.

chmod +x ./apply_governance.sh
./apply_governance.sh
Verify the metadata
Before you move on, check that the script applied the aspects correctly.
- Open the Knowledge Catalog page in the Google Cloud console. You can use the top search bar to find it.
- Search for fin_monthly_closing_internal. Select the table name in the results to open its details page.
- In the Optional tags and aspects section, find the official-data-product-spec aspect. Confirm that the values match the "Gold Internal" scenario you applied.
You've now successfully used metadata to tell these tables apart, and you've given your AI agent a way to do the same.
Test your data context with the Gemini CLI
Before you build a full web application, you can test your governance logic locally using a Model Context Protocol (MCP) environment. In this setup, the Gemini CLI acts as the client (the interface you talk to) and the Knowledge Catalog extension acts as the local server.
Install the Knowledge Catalog extension
In Cloud Shell, install the Knowledge Catalog Extension.
export DATAPLEX_PROJECT="${PROJECT_ID}"
gemini extensions install https://github.com/gemini-cli-extensions/dataplex
Define rules
The GEMINI.md configuration file contains rules that turn natural-language requests like "I need safe data" into precise searches that return only tables with the correct governance labels.
Right now, the configuration file is just a template. You need to add your specific Google Cloud project ID to the rules so that when you run the CLI, it correctly targets your governed data.
Add your PROJECT_ID to the configuration file.

envsubst < GEMINI.md > GEMINI.md.tmp && mv GEMINI.md.tmp GEMINI.md

To check your changes and understand how the data context works, take a quick look at the GEMINI.md file:

cat GEMINI.md

Notice that the file splits rules into Phase 1 and Phase 2. This enforces a strict order of operations. The agent must first look for the correct governance labels (Phase 1) before it's allowed to touch the data itself (Phase 2). This "search-first" logic prevents the agent from guessing table names or hallucinating answers from unverified sources.
Make sure Phase 2 contains your actual Google Cloud project ID. If this isn't correct, the agent won't know where to look for your data.
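A quick way to verify this is to search the file for your project ID. This check assumes the PROJECT_ID variable you exported earlier is still set in your shell.

grep -n "${PROJECT_ID}" GEMINI.md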
Start the Gemini CLI and test scenarios
Start a new Gemini session. Since you're in the project folder, the CLI automatically detects and loads the local GEMINI.md as the system context.
gemini
Verify installation
Confirm that the Knowledge Catalog Extension is active. dataplex should appear in the list of configured MCP server tools.
/mcp desc
Try it out
Now it's time to see your data context in action. Paste these prompts into the CLI one by one.
Scenario 1: Find "Gold" standard data
See if the Gemini CLI can find the most trustworthy data for a high-stakes board meeting.
We are preparing the deck for an internal Board of Directors meeting next week. I need the numbers to be absolutely finalized, trustworthy, and kept strictly confidential. Which table is safe to use?
The CLI should skip over the raw data and find fin_monthly_closing_internal. It does this by matching your request for "finalized" and "confidential" data against the GOLD_CRITICAL and INTERNAL_ONLY tags you applied earlier.
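If you want to double-check the agent's pick outside the Gemini session (for example, in a separate Cloud Shell tab or after you exit), you can inspect the table directly with the bq tool. The fully qualified name below assumes the finance_mart dataset created by Terraform.

bq show --format=prettyjson "${PROJECT_ID}:finance_mart.fin_monthly_closing_internal"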
Scenario 2: Public disclosure
Pretend you want to share data externally. You want to make sure the CLI doesn't let any internal secrets slip out.
I need to share our quarterly financial summary with an external consulting firm. It is critical that we do not leak any raw or internal metrics. Which dataset is officially scrubbed and explicitly approved for external sharing?
Even though the internal table has the most detail, the CLI must bypass it. It should point you to fin_quarterly_public_report because it's the only table tagged as EXTERNAL_READY.
Scenario 3: Real-time operational needs
Data scientists often need the absolute latest info. See if the Gemini CLI understands the difference between a daily batch and a livestream.
My dashboard needs to show what's happening right now with our ad spend. I can't wait for the overnight load. What do you recommend?
The CLI should find mkt_realtime_campaign_performance. It identifies the REALTIME_STREAMING update frequency in the metadata.
Scenario 4: Sandbox exploration
Sometimes "good enough" is better than "perfect." See if the Gemini CLI can find the raw sandbox data for some experimental ML work.
I'm just playing around with some new ML models and need a lot of raw data. It doesn't need to be perfect, just a sandbox environment.
The CLI should find tmp_data_dump_v2_final_real. It knows this is the right choice because it matches the BRONZE_ADHOC tier and is explicitly marked with is_certified: false.
When you finish testing, you can exit the CLI session:
/quit
Clean up
Follow these steps to avoid recurring charges:
Destroy the Terraform resources.
cd ~/devrel-demos/data-analytics/governance-context/terraform
terraform destroy -var="project_id=${PROJECT_ID}" -var="region=${REGION}" -auto-approve

Uninstall the Knowledge Catalog extension and remove your local demo files.

gemini extensions uninstall dataplex
cd ~
rm -rf ~/devrel-demos
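To confirm that the BigQuery resources are gone, you can optionally list the datasets that remain in the project. Tutorial datasets such as finance_mart and analyst_sandbox should no longer appear.

bq ls --project_id="${PROJECT_ID}"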
Conclusion
You've built a solid data foundation, applied strict context using metadata, and verified that it all works locally using the Gemini CLI.