Filter with metadata search

This page shows how to use the schema-based metadata search feature in Vertex AI RAG Engine. You can define a metadata schema for a corpus, attach metadata to files within that corpus, and use this metadata to filter contexts during retrieval.

Before you begin

To use metadata search, you must have an existing RAG corpus and RAG files. To learn how to create a corpus and upload files, see Manage your RAG knowledge base.

When you create a corpus or upload a file, the API returns a numeric RAG_CORPUS_ID and RAG_FILE_ID (for example, 123456789). You must have these IDs on hand to manage your metadata resources.

Metadata resources

The metadata search feature uses the following resource hierarchy:

  • RagCorpus: The top-level resource that contains your RAG files and their configurations.
    • RagDataSchema: A child resource of RagCorpus. It defines the structure for a single metadata field used within the corpus. Each schema specifies a key (for example, year) and a data type (INTEGER, FLOAT, STRING, DATETIME, BOOLEAN, or LIST). You must define a RagDataSchema for each metadata key that you intend to use for filtering.
    • RagFile: A child resource of RagCorpus representing a single ingested document.
      • RagMetadata: A child resource of RagFile. It represents a single key-value metadata pair attached to that specific file (for example, year=2024). The key used in a RagMetadata resource must correspond to a key defined in a RagDataSchema for the parent corpus. Multiple RagMetadata resources can be attached to a single file.

Manage metadata schemas

After you create a RAG corpus, define the metadata structure using the RagDataSchema resource at the corpus level. You can also view and delete the schemas you define.

Define metadata schemas

When defining a schema, you specify a key and a data type. The valid types a key can be are:

  • INTEGER: Passed in metadata values as an int_value.
  • FLOAT: Passed in metadata values as a float_value.
  • STRING: Passed in metadata values as a str_value.
  • DATETIME: Passed in metadata values as a datetime_value (RFC 3339 formatted string, for example, "2024-01-01T00:00:00Z").
  • BOOLEAN: Passed in metadata values as a bool_value.
  • LIST: Passed in metadata values as a list_value.

The following REST sample shows how to batch create metadata schemas that define the year, and month fields for a corpus.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragDataSchemas:batchCreate \
-d '{
  "parent": "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'",
  "requests": [
    {
      "parent": "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'",
      "rag_data_schema": {
        "key": "year",
        "schema_details": {"type": "INTEGER"}
      }
    },
    {
      "parent": "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'",
      "rag_data_schema": {
        "key": "month",
        "schema_details": {"type": "STRING"}
      }
    }
  ]
}'

List metadata schemas

The following REST sample shows how to list the metadata schemas defined for a corpus.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragDataSchemas

Delete metadata schemas

The following REST sample shows how to batch delete metadata schemas from a corpus.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
  • RAG_DATA_SCHEMA_KEY_1: The key name of the first metadata schema to delete.
  • RAG_DATA_SCHEMA_KEY_2: The key name of the second metadata schema to delete.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID
RAG_DATA_SCHEMA_KEY_1=RAG_DATA_SCHEMA_KEY_1
RAG_DATA_SCHEMA_KEY_2=RAG_DATA_SCHEMA_KEY_2

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragDataSchemas:batchDelete \
-d '{
  "names": [
    "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'/ragDataSchemas/'"${RAG_DATA_SCHEMA_KEY_1}"'",
    "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'/ragDataSchemas/'"${RAG_DATA_SCHEMA_KEY_2}"'"
  ]
}'

Manage file metadata

After you upload or import files to your corpus, you can attach, update, view, and delete metadata associated with those files.

Attach metadata to a file

The attached metadata must conform to the schema defined for the corpus. The following REST sample shows how to attach multiple metadata values to a file.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
  • RAG_FILE_ID: The ID of the RAG file.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID
RAG_FILE_ID=RAG_FILE_ID

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles/${RAG_FILE_ID}/ragMetadata:batchCreate \
-d '{
  "parent": "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'/ragFiles/'"${RAG_FILE_ID}"'",
  "requests": [
    {
      "parent": "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'/ragFiles/'"${RAG_FILE_ID}"'",
      "rag_metadata": {
        "user_specified_metadata": { "key": "year", "value": { "int_value": 2024 } }
      }
    },
    {
      "parent": "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'/ragFiles/'"${RAG_FILE_ID}"'",
      "rag_metadata": {
        "user_specified_metadata": { "key": "month", "value": { "str_value": "May" } }
      }
    }
  ]
}'

List metadata for a file

The following REST sample shows how to list the metadata attached to a specific file.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
  • RAG_FILE_ID: The ID of the RAG file.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID
RAG_FILE_ID=RAG_FILE_ID

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles/${RAG_FILE_ID}/ragMetadata

Update metadata

You can update the metadata values attached to a file. The following REST sample shows how to update the metadata value associated with the year key. The RAG_METADATA_KEY is the string used when you created the metadata.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
  • RAG_FILE_ID: The ID of the RAG file.
  • RAG_METADATA_KEY: The key name of the metadata to update.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID
RAG_FILE_ID=RAG_FILE_ID
RAG_METADATA_KEY=RAG_METADATA_KEY

curl -X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles/${RAG_FILE_ID}/ragMetadata/${RAG_METADATA_KEY} \
-d '{
  "user_specified_metadata": { "key": "'"${RAG_METADATA_KEY}"'", "value": { "int_value": 2025 } }
}'

Delete metadata from a file

The following REST sample shows how to batch delete metadata values from a file.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
  • RAG_FILE_ID: The ID of the RAG file.
  • RAG_METADATA_KEY_1: The key name of the first metadata value to delete.
  • RAG_METADATA_KEY_2: The key name of the second metadata value to delete.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID
RAG_FILE_ID=RAG_FILE_ID
RAG_METADATA_KEY_1=RAG_METADATA_KEY_1
RAG_METADATA_KEY_2=RAG_METADATA_KEY_2

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}/ragFiles/${RAG_FILE_ID}/ragMetadata:batchDelete \
-d '{
  "names": [
    "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'/ragFiles/'"${RAG_FILE_ID}"'/ragMetadata/'"${RAG_METADATA_KEY_1}"'",
    "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'/ragFiles/'"${RAG_FILE_ID}"'/ragMetadata/'"${RAG_METADATA_KEY_2}"'"
  ]
}'

Filter contexts with metadata

To improve the relevance of your results, use file metadata to narrow down the search space during context retrieval. Add a metadata_filter expression to your RetrieveContexts request using CEL (Common Expression Language) (for example, year == 2024 && month == "May"). Only files with metadata that matches the filter expression are considered for retrieval.

The following REST sample shows how to retrieve contexts using a metadata filter.

REST

Replace the following variables:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus.
PROJECT_ID=PROJECT_ID
LOCATION=LOCATION
RAG_CORPUS_ID=RAG_CORPUS_ID
QUERY="except share amounts"

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}:retrieveContexts -d '{
    "vertex_rag_store": {
       "rag_resources": [
         {
           "rag_corpus": "projects/'"${PROJECT_ID}"'/locations/'"${LOCATION}"'/ragCorpora/'"${RAG_CORPUS_ID}"'"
         }
       ]
    },    
    "query": {
      "text": "'"${QUERY}"'", 
      "rag_retrieval_config": {
         "top_k": 10,
         "filter": {
            "vector_distance_threshold": 0.5,
            "metadata_filter": "year == 2024 && month == \"May\""
         }
      }
    }
}'

What's next