Use the Gemini CLI agent to test data context

AI agents can reason, but they start with zero knowledge about your specific company. Imagine asking an agent, "What is our Q1 revenue?" Without guidance, the agent might pick from dozens of tables named "revenue" in your databases, ranging from official reports to messy test data. If the agent picks the table with the closest-sounding name, it could return convincingly wrong answers based on unverified sources.

Metadata enrichment is the solution to this context problem. In this tutorial, you set up aspects that provide this context, and use the Gemini CLI to test data context and verify that an agent can accurately ground its answers in trusted, certified data.

Objectives

  • Create a realistic data lake for testing.
  • Use Knowledge Catalog aspects to label your "gold" data and distinguish it from test data.
  • Test your data context locally with the Gemini CLI.

Before you begin

Before you begin, make sure that you have a Google Cloud project with billing enabled, and permissions to enable APIs and create BigQuery resources in that project.

To complete this tutorial, you should also have a basic understanding of BigQuery, Knowledge Catalog, and Terraform.

Prepare your environment

This tutorial uses Google Cloud Shell, a command-line environment that runs in the cloud.

  1. From the Google Cloud console, click Activate Cloud Shell in the top right toolbar. It takes a few moments to provision and connect to the environment.

  2. In Cloud Shell, set your PROJECT_ID and REGION variables so that all future commands target your specific Google Cloud project.

    export PROJECT_ID=$(gcloud config get-value project)
    gcloud config set project $PROJECT_ID
    export REGION="us-central1"
    
  3. Enable the necessary Google Cloud services.

    gcloud services enable \
      artifactregistry.googleapis.com \
      bigqueryunified.googleapis.com \
      cloudaicompanion.googleapis.com \
      cloudbuild.googleapis.com \
      cloudresourcemanager.googleapis.com \
      datacatalog.googleapis.com \
      run.googleapis.com
    
  4. Clone the Google Cloud DevRel Demos repository.

    Download the infrastructure code and scripts from GitHub. Use a sparse checkout to pull only the specific folder you need for this tutorial.

    # Perform a shallow clone to get only the latest repository structure without the full history
    git clone --depth 1 --filter=blob:none --sparse https://github.com/GoogleCloudPlatform/devrel-demos.git
    cd devrel-demos
    
    # Specify and download only the folder you need for this tutorial
    git sparse-checkout set data-analytics/governance-context
    cd data-analytics/governance-context
    

Build the data lake

To make the scenario realistic, you need a mix of official data and untrusted, messy data. Use the preconfigured Terraform files from the tutorial repository to set this up quickly.

The Terraform configuration handles two tasks:

  • Sets up the Knowledge Catalog aspect types (metadata templates) and the BigQuery datasets and tables, including finance_mart.fin_monthly_closing_internal and analyst_sandbox.tmp_data_dump_v2_final_real.
  • Loads sample data into the tables.

  1. Open the terraform directory and initialize it.

    cd terraform
    terraform init
    
  2. Apply the configuration. This might take up to a minute.

    terraform apply -var="project_id=${PROJECT_ID}" -var="region=${REGION}" -auto-approve
    

You now have an ungoverned data lake! To an AI agent, the tables in your data lake look exactly the same because they're just objects with columns. In the next section, you apply governance to fix this.

Apply governance

This is the most important part of the setup. Right now, your two tables look identical to an AI agent. To tell them apart, you apply aspects, which are like certified metadata labels that give the agent the context it needs. In this section, you use two scripts: one to generate the metadata and another to apply it to your tables.

Generate governance payloads

Terraform already set up the aspect types, so now you need to generate the data to fill them.

The generate_payloads.sh script creates an aspect_payloads/ directory containing four YAML files. Each file defines a different governance scenario, which you apply in the next step.

Navigate back to the root of the tutorial directory and run the script:

cd ..
chmod +x ./generate_payloads.sh
./generate_payloads.sh

Apply aspects

  1. Before you run the apply_governance.sh script, take a look at the data that will be attached to your tables. Run the following command to see the metadata defined for your internal finance data:

    cat aspect_payloads/fin_internal.yaml
    

    The YAML file defines the business context for the table:

    your-project-id.us-central1.official-data-product-spec:
      data:
        product_tier: GOLD_CRITICAL
        data_domain: FINANCE
        usage_scope: INTERNAL_ONLY
        update_frequency: DAILY_BATCH
        is_certified: true
    

    Notice how this explicitly marks the data as is_certified: true and assigns it the GOLD_CRITICAL tier. This provides the AI agent with clear, structured rules to follow.

  2. Run the apply_governance.sh script. This script iterates through your BigQuery tables and uses the gcloud CLI to "stamp" the metadata from your YAML payloads onto each table.

    chmod +x ./apply_governance.sh
    ./apply_governance.sh
    
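    Conceptually, the script's loop pairs each payload file with a target table and stamps the metadata on. The dry-run sketch below shows that shape; the payload-to-table mapping is an assumption for illustration (fin_internal.yaml and the two table names come from this tutorial, while the sandbox payload filename is hypothetical), and the real script's gcloud call is replaced with an echo.

    ```shell
    # Dry-run sketch (hypothetical) of the apply_governance.sh loop.
    # Requires bash for the associative array.
    declare -A targets=(
      ["fin_internal.yaml"]="finance_mart.fin_monthly_closing_internal"
      ["sandbox.yaml"]="analyst_sandbox.tmp_data_dump_v2_final_real"
    )

    for payload in "${!targets[@]}"; do
      # The real script calls gcloud here; this version only echoes the intent.
      echo "Would apply aspect_payloads/${payload} to ${targets[$payload]}"
    done
    ```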

Verify the metadata

Before you move on, check that the script applied the aspects correctly.

  1. Open the Knowledge Catalog page in the Google Cloud console. You can use the top search bar to find it.
  2. Search for fin_monthly_closing_internal. Select the table name in the results to open its details page.
  3. In the Optional tags and aspects section, find the official-data-product-spec aspect. Confirm that the values match the "Gold Internal" scenario you applied.

You've now successfully used metadata to tell these tables apart, and you've given your AI agent a way to do the same.

Test your data context with the Gemini CLI

Before you build a full web application, you can test your governance logic locally using a Model Context Protocol (MCP) environment. In this setup, the Gemini CLI acts as the client (the interface you talk to) and the Knowledge Catalog extension acts as the local server.
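For orientation, a Gemini CLI extension is described by a gemini-extension.json manifest that tells the CLI which MCP server process to launch. The fragment below is an illustrative sketch, not the actual manifest of the Knowledge Catalog extension; the name, command, and package values are assumptions.

```json
{
  "name": "dataplex",
  "version": "1.0.0",
  "mcpServers": {
    "dataplex": {
      "command": "npx",
      "args": ["-y", "@example/dataplex-mcp-server"]
    }
  }
}
```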

Install the Knowledge Catalog extension

In Cloud Shell, install the Knowledge Catalog extension:

export DATAPLEX_PROJECT="${PROJECT_ID}"

gemini extensions install https://github.com/gemini-cli-extensions/dataplex

Define rules

The GEMINI.md configuration file contains rules that turn natural-language requests like "I need safe data" into precise searches that return only tables with the correct governance labels.

Right now, the configuration file is just a template. You need to add your specific Google Cloud project ID to the rules so that when you run the CLI, it correctly targets your governed data.

  1. Add your PROJECT_ID to the configuration file.

    envsubst < GEMINI.md > GEMINI.md.tmp && mv GEMINI.md.tmp GEMINI.md
    
  2. To check your changes and understand how the data context works, take a quick look at the GEMINI.md file:

    cat GEMINI.md
    

    Notice that the file splits rules into Phase 1 and Phase 2. This enforces a strict order of operations. The agent must first look for the correct governance labels (Phase 1) before it's allowed to touch the data itself (Phase 2). This "search-first" logic prevents the agent from guessing table names or hallucinating answers from unverified sources.

    Make sure Phase 2 contains your actual Google Cloud project ID. If this isn't correct, the agent won't know where to look for your data.

Start the Gemini CLI and test scenarios

Start a new Gemini session. Since you're in the project folder, the CLI automatically detects and loads the local GEMINI.md as the system context.

gemini

Verify installation

Confirm that the Knowledge Catalog extension is active: dataplex should appear in the list of configured MCP server tools.

/mcp desc

Try it out

Now it's time to see your data context in action. Paste these prompts into the CLI one by one.

Scenario 1: Find "Gold" standard data

See if the Gemini CLI can find the most trustworthy data for a high-stakes board meeting.

We are preparing the deck for an internal Board of Directors meeting next week. I need the numbers to be absolutely finalized, trustworthy, and kept strictly confidential. Which table is safe to use?

The CLI should skip over the raw data and find fin_monthly_closing_internal. It does this by matching your request for "finalized" and "confidential" data against the GOLD_CRITICAL and INTERNAL_ONLY tags you applied earlier.

Scenario 2: Public disclosure

Pretend you want to share data externally. You want to make sure the CLI doesn't let any internal secrets slip out.

I need to share our quarterly financial summary with an external consulting firm. It is critical that we do not leak any raw or internal metrics. Which dataset is officially scrubbed and explicitly approved for external sharing?

Even though the internal table has the most detail, the CLI should bypass it and point you to fin_quarterly_public_report, because it's the only table tagged as EXTERNAL_READY.

Scenario 3: Real-time operational needs

Data scientists often need the absolute latest info. See if the Gemini CLI understands the difference between a daily batch and a livestream.

My dashboard needs to show what's happening right now with our ad spend. I can't wait for the overnight load. What do you recommend?

The CLI should find mkt_realtime_campaign_performance. It identifies the REALTIME_STREAMING update frequency in the metadata.

Scenario 4: Sandbox exploration

Sometimes "good enough" is better than "perfect." See if the Gemini CLI can find the raw sandbox data for some experimental ML work.

I'm just playing around with some new ML models and need a lot of raw data. It doesn't need to be perfect, just a sandbox environment.

The CLI should find tmp_data_dump_v2_final_real. It knows this is the right choice because it matches the BRONZE_ADHOC tier and is explicitly marked with is_certified: false.
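The four scenarios all reduce to the same mechanism: the agent maps intent keywords in your request to aspect values before it ever names a table. The toy sketch below makes that mapping concrete; the function name and keyword choices are illustrative (the table names and governance tags come from this tutorial).

```shell
# Toy intent-to-table mapping mirroring the governance tags:
# GOLD_CRITICAL/INTERNAL_ONLY, EXTERNAL_READY, REALTIME_STREAMING, BRONZE_ADHOC.
recommend_table() {
  case "$1" in
    board|confidential) echo "fin_monthly_closing_internal" ;;
    external|public)    echo "fin_quarterly_public_report" ;;
    realtime|live)      echo "mkt_realtime_campaign_performance" ;;
    sandbox|raw)        echo "tmp_data_dump_v2_final_real" ;;
    *)                  echo "no governed match" ;;
  esac
}

recommend_table confidential   # fin_monthly_closing_internal
recommend_table realtime       # mkt_realtime_campaign_performance
```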

When you finish testing, you can exit the CLI session:

/quit

Clean up

Follow these steps to avoid recurring charges:

  1. Destroy the Terraform resources.

    cd ~/devrel-demos/data-analytics/governance-context/terraform
    terraform destroy -var="project_id=${PROJECT_ID}" -var="region=${REGION}" -auto-approve
    
  2. Uninstall the Knowledge Catalog extension and remove your local demo files.

    gemini extensions uninstall dataplex
    cd ~
    rm -rf ~/devrel-demos
    

Conclusion

You've built a solid data foundation, applied strict context using metadata, and verified that it all works locally using the Gemini CLI.