Data stores

Data store tools use data stores to find answers to end-user questions in your data. A data store is a collection of websites, documents, or data in third-party systems, each of which references your data.

When an end-user asks the agent a question, the agent searches the given source content for an answer and summarizes the findings into a coherent agent response. It also provides supporting links to the sources of the response so that the end-user can learn more. The agent can provide up to five answer snippets for a given question.

Data store sources

You can use different sources for your data, as described in the following sections.

Restricted access data store sources

Google offers many additional first- and third-party data store sources as a restricted access feature. To see available sources and request access, see additional data store sources.

Website content

When adding website content as a source, you can add and exclude multiple sites. When you specify a site, you can use individual pages or * as a wildcard for a pattern. All HTML and PDF content will be processed.

You must verify your domain when using website content as a source.

Limitations:

  • Files from public URLs must have been crawled by the Google Search indexer to exist in the search index. You can check this with the Google Search Console.
  • Up to 200,000 pages are indexed. If the data store contains more pages, indexing fails at that point, but any content already indexed remains.

Import data

You can import data from BigQuery or Cloud Storage. This data can be in FAQ form or unstructured, and it can be provided with or without metadata.

The following Data Import Options are available:

  • Add/Update Data: Adds the provided documents to the data store. If a new document has the same ID as an existing document, the new document replaces the old one.
  • Override Existing Data: Deletes all existing data and uploads new data. This action is irreversible.
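The difference between the two import options can be illustrated with a small in-memory model. This is only a sketch of the semantics described above, not the import service itself; the dictionary-based store and both function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Document = Tuple[str, str]  # (document ID, content); a simplification for illustration


def add_or_update(store: Dict[str, str], new_docs: Iterable[Document]) -> Dict[str, str]:
    """Model of Add/Update Data: keep existing documents, replace matching IDs."""
    merged = dict(store)
    for doc_id, content in new_docs:
        merged[doc_id] = content  # same ID: the new document replaces the old one
    return merged


def override_existing(store: Dict[str, str], new_docs: Iterable[Document]) -> Dict[str, str]:
    """Model of Override Existing Data: everything previously stored is discarded."""
    return {doc_id: content for doc_id, content in new_docs}


existing = {"d001": "old first", "d002": "old second"}
incoming = [("d002", "new second"), ("d003", "new third")]

updated = add_or_update(existing, incoming)
overridden = override_existing(existing, incoming)

print(sorted(updated))     # ['d001', 'd002', 'd003']
print(sorted(overridden))  # ['d002', 'd003']
```

In this model, d001 survives an Add/Update import but is lost after an Override import, which is why the second option is described as irreversible.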

FAQ data store

Data stores can hold answers to frequently asked questions. When user questions are matched with high confidence to an uploaded question, the agent returns the answer to that question without modification. You can provide a title and a URL for each question and answer pair that the agent displays.

Upload data to the data store in CSV format. Each file must include a header row that describes the columns.

For example:

"question","answer","title","url"
"Why is the sky blue?","The sky is blue because of Rayleigh scattering.","Rayleigh scattering","https://en.wikipedia.org/wiki/Rayleigh_scattering"
"What is the meaning of life?","42","",""

You can omit the title and url columns:

"answer","question"
"42","What is the meaning of life?"

During the upload process, you can select a folder. Each file in the folder is processed as a CSV file, regardless of its file extension.

Limitations:

  • An extra space character after a comma (,) delimiter causes an error.
  • Blank lines (even at the end of the file) cause an error.
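One way to avoid both limitations is to generate the file programmatically instead of editing it by hand. The following is a minimal sketch using Python's standard csv module; the function name faq_to_csv is hypothetical, and dropping the final newline is a conservative assumption rather than a documented requirement.

```python
import csv
import io
from typing import Iterable, Mapping

COLUMNS = ["question", "answer", "title", "url"]


def faq_to_csv(rows: Iterable[Mapping[str, str]]) -> str:
    """Serialize question and answer pairs into FAQ CSV text."""
    buffer = io.StringIO()
    # QUOTE_ALL mirrors the quoting used in the examples above; the csv module
    # never emits a space after the comma delimiter.
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # title and url are optional, so missing keys become empty strings.
        writer.writerow({column: row.get(column, "") for column in COLUMNS})
    # Assumption: drop the final newline so that nothing follows the last row.
    return buffer.getvalue().rstrip("\n")


faq_csv = faq_to_csv([
    {
        "question": "Why is the sky blue?",
        "answer": "The sky is blue because of Rayleigh scattering.",
        "title": "Rayleigh scattering",
        "url": "https://en.wikipedia.org/wiki/Rayleigh_scattering",
    },
    {"question": "What is the meaning of life?", "answer": "42"},
])
print(faq_csv)
```

The resulting text can be written to a .csv file and uploaded to the Cloud Storage folder used for the import.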

Unstructured data store

Unstructured data stores can contain content in the following formats:

  • HTML
  • PDF
  • TXT
  • CSV

You can import files from another project's Cloud Storage bucket. To do so, grant explicit access to the import process. Follow the instructions in the error message, which will contain the name of the user that needs read access to the bucket to perform the import.

Limitations:

  • The maximum file size is 2.5 MB for text-based formats and 100 MB for other formats.
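A pre-upload check can catch oversized files before an import fails. The sketch below makes two assumptions that the text above does not confirm: that HTML, TXT, and CSV count as text-based formats while PDF falls under the larger limit, and that 1 MB means 1,000,000 bytes. The function name check_upload is hypothetical.

```python
from pathlib import Path

# Assumption: 1 MB is taken as 1,000,000 bytes.
TEXT_LIMIT_BYTES = 2_500_000
OTHER_LIMIT_BYTES = 100_000_000

# Assumption: these extensions are the "text-based" formats.
TEXT_EXTENSIONS = {".html", ".txt", ".csv"}
SUPPORTED_EXTENSIONS = TEXT_EXTENSIONS | {".pdf"}


def check_upload(file_name: str, size_bytes: int) -> str:
    """Return 'ok', 'too large', or 'unsupported format' for a candidate file."""
    extension = Path(file_name).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        return "unsupported format"
    limit = TEXT_LIMIT_BYTES if extension in TEXT_EXTENSIONS else OTHER_LIMIT_BYTES
    return "ok" if size_bytes <= limit else "too large"


print(check_upload("guide.pdf", 50_000_000))  # ok
print(check_upload("notes.txt", 3_000_000))   # too large
```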

Data store with metadata

You can provide a title and URL as metadata. During a conversation, the agent can provide this information to help users quickly link to internal web pages that are not accessible by the Google Search indexer.

To import content with metadata, you must provide one or more JSON Lines files. Each line of this file describes one document. You do not directly upload the actual documents; URIs that link to the Cloud Storage paths are provided in the JSON Lines file.

To provide your JSON Lines files, provide a Cloud Storage folder that contains these files. Do not put any other files in this folder.

Field descriptions:

Field             Type    Description
id                string  Unique identifier of the document.
content.mimeType  string  MIME type of the document. "application/pdf" and "text/html" are supported.
content.uri       string  URI for the document in Cloud Storage.
structData        string  Single line JSON object with optional title and url fields.

For example:

{ "id": "d001", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/first_doc.pdf"}, "structData": {"title": "First Document", "url": "https://internal.example.com/documents/first_doc.pdf"} }
{ "id": "d002", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/second_doc.pdf"}, "structData": {"title": "Second Document", "url": "https://internal.example.com/documents/second_doc.pdf"} }
{ "id": "d003", "content": {"mimeType": "text/html", "uri": "gs://example-import/unstructured/mypage.html"}, "structData": {"title": "My Page", "url": "https://internal.example.com/mypage.html"} }
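Lines in this format can be generated rather than written by hand. The following is a minimal sketch using Python's standard json module; the helper name metadata_line is hypothetical, and it follows the example above in writing structData as a nested object.

```python
import json
from typing import Optional

SUPPORTED_MIME_TYPES = {"application/pdf", "text/html"}


def metadata_line(
    doc_id: str,
    mime_type: str,
    uri: str,
    title: Optional[str] = None,
    url: Optional[str] = None,
) -> str:
    """Build one JSON Lines record that describes a document in Cloud Storage."""
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    struct_data = {}
    if title:
        struct_data["title"] = title
    if url:
        struct_data["url"] = url
    record = {
        "id": doc_id,
        "content": {"mimeType": mime_type, "uri": uri},
        "structData": struct_data,
    }
    # Without an indent argument, json.dumps returns a single line.
    return json.dumps(record)


line = metadata_line(
    "d001",
    "application/pdf",
    "gs://example-import/unstructured/first_doc.pdf",
    title="First Document",
    url="https://internal.example.com/documents/first_doc.pdf",
)
print(line)
```

Writing one such line per document, separated by newline characters, produces a file that can be placed in the Cloud Storage folder for the import.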

Data store without metadata

This type of content has no metadata. Instead, you provide URI links to the individual documents. The content type is determined by the file extension.

Parse and chunk configuration

Depending on the data source, you can configure parse and chunk settings as defined by Agent Search.

Use Cloud Storage for a data store document

If your content is not public, storing it in Cloud Storage is the recommended option. When you create data store documents, you provide the URLs for your Cloud Storage objects in the form: gs://bucket-name/folder-name. Each document within the folder is added to the data store.
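Before creating documents, it can help to confirm that a path has the gs://bucket-name/folder-name form. The sketch below uses only Python's standard library; the function name parse_gs_uri is hypothetical, and the check covers the URI shape only, not whether the bucket or folder exists.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_gs_uri(uri: str) -> Tuple[str, str]:
    """Split a Cloud Storage URI into (bucket name, folder or object path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Cloud Storage URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, folder = parse_gs_uri("gs://bucket-name/folder-name")
print(bucket, folder)  # bucket-name folder-name
```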

To create the Cloud Storage bucket, follow the Cloud Storage quickstart to create a bucket and upload files.

Languages

For supported languages, see the data store column in the language reference.

For best performance, create data stores in a single language.

After creating a data store, you can optionally specify the data store language. If you set the data store language, you can connect the data store to an agent that is configured for a different language. For example, you can create a French data store that is connected to an English agent.

Supported regions

For information about supported regions, see the region reference.

(Restricted access) Additional data store sources

Additional data store types are listed in the following table. They are available as restricted access features. You can fill out the access request form to request access. Once approved, you will be able to see these options when you create a data store in Vertex AI Agent Builder.

Third-party data store sources

Data store source   Description
Box                 Import data from your organization's Box site.
Confluence Cloud    Import data from your Confluence Cloud workspace.
Dropbox             Import data from your Dropbox storage.
EntraID             Import data from your organization's EntraID system.
Jira Cloud          Import data from your Jira task management system.
OneDrive            Import data from your organization's OneDrive storage.
Microsoft Outlook   Import data from Microsoft Outlook.
Salesforce          Import data from Salesforce.
ServiceNow          Import data from ServiceNow.
SharePoint          Import data from your organization's SharePoint system.
Slack               Import data from Slack.
Microsoft Teams     Import data from Microsoft Teams.

Set up a third-party data store using a connector

This section outlines the process of setting up a data store using third-party data. For instructions specific to each third-party data source, see the Generative AI App Builder documentation.

Identity providers

Identity providers let you manage users, groups, and authentication. When you set up a third-party data store, you can use either a Google identity provider or a third-party identity provider.

Google identity provider:

  • Users of the agent sign in using their Google credentials. These can be any @gmail.com email address or any account that uses Google as the identity provider (for example, Google Workspace). This step is skipped if users talk to the agent using Google Cloud directly, because Google identity is automatically built into the system.
  • You can assign access to Google accounts using Identity and Access Management (IAM).

Third-party identity provider:

  • Users of the agent sign in using non-Google credentials, for example a Microsoft email address.
  • You must create a Workforce Pool using Google Cloud containing the non-Google identity providers. You can then use IAM to grant access to either the entire pool or individual users within that pool.
  • This method can't be used with any Google Cloud projects set up under the @google.com organization.

Connectors

Third-party data stores are implemented using a connector. Each connector can contain multiple data stores, which are stored as entities in the Dialogflow CX system.

  • Before you create a data store, you must set up each region with a single identity provider in Google Cloud > Agent Builder > Settings. All data stores in that region use the same identity provider. You can choose either a Google identity or a third-party identity in a workforce pool. The same Google credential is considered a different identity if it's in a workforce pool. For example, test@gmail.com is considered a different identity than workforcePools/test-pool/subject/test@gmail.com.
    • Create a workforce pool (if needed).
    • Go to Agent Builder Settings and select either Google Identity or 3rd Party Identity. Click Save to save the identity to the region.
    • You can now create a data store in the region.
  • Each data store saves Access Control List (ACL) data with each document. This record tracks which users or groups have read access to which entities. At runtime, a user or group member receives responses from the agent only if the responses are sourced from entities that they have read access to. If a user has no read access to any entities in the data store, the agent returns an empty response.
  • Because the data in the data store is a copy of the third-party instance, it needs to be periodically refreshed. You can configure the refresh intervals on a time scale of either hours or days.
  • After you configure your data store and click Create, it can take up to an hour for the data store to appear in your data stores list.
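The ACL behavior described in the list above can be pictured with a small model. This is a conceptual sketch only, not how the connector is implemented; the Entity class, the readable_entities function, and the group:all-staff identifier format are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Entity:
    """A stored document plus the ACL copied from the third-party system."""
    name: str
    readers: FrozenSet[str]  # users or groups with read access


def readable_entities(entities: Iterable[Entity], principals: FrozenSet[str]) -> List[Entity]:
    """Keep only the entities that at least one of the caller's principals can read."""
    return [entity for entity in entities if entity.readers & principals]


entities = [
    Entity("handbook", frozenset({"group:all-staff"})),
    Entity("payroll", frozenset({"workforcePools/test-pool/subject/test@gmail.com"})),
]

# The same address is a different identity inside and outside a workforce pool.
google_user = frozenset({"test@gmail.com"})
pool_user = frozenset({"workforcePools/test-pool/subject/test@gmail.com", "group:all-staff"})

print([e.name for e in readable_entities(entities, google_user)])  # []
print([e.name for e in readable_entities(entities, pool_user)])    # ['handbook', 'payroll']
```

In this model, the first user matches no ACL entry, which corresponds to the empty response case described above.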

Data store tracing

This feature includes two parts:

  1. Display of the data store's internal execution traces and step latencies in the agent simulator.
  2. Export of the data store's internal execution traces and step latencies to Cloud Logging and BigQuery.

View data in the simulator

To display tracing and execution data in the agent simulator, expand the details of a conversation turn by clicking the expander arrow to the right of the agent's response.

The execution tab displays the internal data store execution traces, including:

  • The original user input.
  • The query as rewritten by the data store engine.
  • Quality signals from execution steps, such as security check status, stability check status, grounding check result, and safety check status.
  • Search snippets from the data store search.
  • The list of supporting documents for the snippets.

The latency tab displays a time graph for various data store execution steps. The list of steps varies depending on how the data store is configured and the execution flow. Displayed data can include the following:

  • FAQ match: Performs an FAQ matching step.
  • Query rewriting: Rewrites the original user query.
  • Search: Performs snippet searching.
  • Summarization: Summarizes the response.
  • Safety checks: Performs safety checking steps.

View tracing data in other locations

  • If you configure the conversational agent with conversation history logging, you can view data store traces in Conversation History.
  • If you configure the conversational agent with Cloud Logging, you can view traces and latencies in the Logs Explorer.
  • If you configure the conversational agent with BigQuery export, you can view traces and latencies in an exported BigQuery table.

What's next

To learn how to create and use a data store with an agent, see the data store tools documentation.