Retrieval-augmented generation (RAG) is a technique that retrieves relevant information and provides it to LLMs so they can generate verifiable responses. The information can include fresh information, topic-specific context, or ground truth.
This page shows you how to use Vertex AI RAG Engine with the Gemini Live API, which lets you specify and retrieve information from the RAG corpus.
Prerequisites
Complete the following prerequisites before you use Vertex AI RAG Engine with the Gemini Live API:
- Enable the RAG API in Vertex AI.
- To upload files to your RAG corpus, see the Import RAG files example API. A minimal import sketch follows this list.
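If you haven't imported files yet, the following is a minimal sketch of importing documents into an existing RAG corpus with the Vertex AI SDK. The project ID, location, corpus ID, and the Cloud Storage path are placeholders that you replace with your own values.

# Minimal sketch: import files into an existing RAG corpus.
# Assumes the Vertex AI SDK is installed and the corpus already exists;
# the IDs and the Cloud Storage path below are placeholders.
import vertexai
from vertexai import rag

vertexai.init(project="YOUR_PROJECT_ID", location="YOUR_LOCATION")

rag.import_files(
    corpus_name="projects/YOUR_PROJECT_ID/locations/YOUR_LOCATION/ragCorpora/YOUR_CORPUS_ID",
    paths=["gs://your-bucket/your-docs/"],  # Cloud Storage or Google Drive sources
)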
Set up
You can use Vertex AI RAG Engine with the Live API by specifying it as a retrieval tool. The following code sample demonstrates how:
Replace the following variables:
- YOUR_PROJECT_ID: The ID of your Google Cloud project.
- YOUR_CORPUS_ID: The ID of your corpus.
- YOUR_LOCATION: The region to process the request.
PROJECT_ID = "YOUR_PROJECT_ID"
RAG_CORPUS_ID = "YOUR_CORPUS_ID"
LOCATION = "YOUR_LOCATION"
TOOLS = {
"retrieval": {
"vertex_rag_store": {
"rag_resources": {
"rag_corpus": "projects/${PROJECT_ID}/locations/${LOCATION}/ragCorpora/${RAG_CORPUS_ID}"
}
}
}
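If you prefer typed classes, the same retrieval tool can also be expressed with the google-genai SDK types that the SDK samples later on this page use. This is an optional, equivalent sketch of the dictionary above; it assumes the google-genai SDK is installed.

# Optional alternative: the same retrieval tool expressed with google-genai SDK types.
from google.genai import types

rag_tool = types.Tool(
    retrieval=types.Retrieval(
        vertex_rag_store=types.VertexRagStore(
            rag_resources=[
                types.VertexRagStoreRagResource(
                    rag_corpus=f"projects/{PROJECT_ID}/locations/{LOCATION}/ragCorpora/{RAG_CORPUS_ID}"
                )
            ]
        )
    )
)

When you connect with the SDK, you would pass this as tools=[rag_tool] in LiveConnectConfig.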
Use WebSocket for real-time communication
To enable real-time communication between a client and a server, you must use a WebSocket. The following code samples demonstrate how to use a WebSocket with the Python API and the Python SDK.
Python API
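The sample below reads an access token from bearer_token[0] but doesn't show how to obtain it. The following is a minimal sketch, assuming Application Default Credentials are configured (for example, with gcloud auth application-default login) and the google-auth library is installed.

# Minimal sketch: obtain the access token that the sample expects in bearer_token[0].
# Assumes Application Default Credentials are configured for your project.
import google.auth
import google.auth.transport.requests

creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(google.auth.transport.requests.Request())
bearer_token = [creds.token]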
CONFIG = {"response_modalities": ["TEXT"], "speech_config": { "language_code": "en-US" }}
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {bearer_token[0]}",
}
HOST= "${LOCATION}-aiplatform.googleapis.com"
SERVICE_URL = f"wss://{HOST}/ws/google.cloud.aiplatform.v1beta1.LlmBidiService/BidiGenerateContent"
MODEL="gemini-2.0-flash-exp"
# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
# Setup the session
await ws.send(
json.dumps(
{
"setup": {
"model": MODEL,
"generation_config": CONFIG,
# Setup RAG as a retrieval tool
"tools": TOOLS,
}
}
)
)
# Receive setup response
raw_response = await ws.recv(decode=False)
setup_response = json.loads(raw_response.decode("ascii"))
# Send text message
text_input = "What are popular LLMs?"
display(Markdown(f"**Input:** {text_input}"))
msg = {
"client_content": {
"turns": [{"role": "user", "parts": [{"text": text_input}]}],
"turn_complete": True,
}
}
await ws.send(json.dumps(msg))
responses = []
# Receive chunks of server response
async for raw_response in ws:
response = json.loads(raw_response.decode())
server_content = response.pop("serverContent", None)
if server_content is None:
break
model_turn = server_content.pop("modelTurn", None)
if model_turn is not None:
parts = model_turn.pop("parts", None)
if parts is not None:
display(Markdown(f"**parts >** {parts}"))
responses.append(parts[0]["text"])
# End of turn
turn_complete = server_content.pop("turnComplete", None)
if turn_complete:
grounding_metadata = server_content.pop("groundingMetadata", None)
if grounding_metadata is not None:
grounding_chunks = grounding_metadata.pop("groundingChunks", None)
if grounding_chunks is not None:
for chunk in grounding_chunks:
display(Markdown(f"**grounding_chunk >** {chunk}"))
break
# Print the server response
display(Markdown(f"**Response >** {''.join(responses)}"))
Python SDK
To learn how to install the generative AI SDK, see Install a library:
from google import genai
from google.genai.types import Content, LiveConnectConfig, Modality, Part
from IPython import display

MODEL = "gemini-2.0-flash-exp"

client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION,
)

async with client.aio.live.connect(
    model=MODEL,
    config=LiveConnectConfig(
        response_modalities=[Modality.TEXT],
        tools=TOOLS,
    ),
) as session:
    text_input = "What are core LLM techniques?"
    print("> ", text_input, "\n")

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    async for message in session.receive():
        if message.text:
            display.display(display.Markdown(message.text))
            continue
Use Vertex AI RAG Engine as the context store
You can use Vertex AI RAG Engine as the context store for the Gemini Live API. It stores your session context, forms memories from it, and retrieves past contexts related to your conversation to enrich the current context for model generation. You can also use this feature to share contexts across your different Live API sessions.
Vertex AI RAG Engine supports storing and indexing the following forms of data from session contexts:
- Text
- Audio speech
Create a MemoryCorpus type corpus
To store and index conversation texts from the session context, you must create a RAG corpus of the MemoryCorpus type. You must also specify an LLM parser in your memory corpus configuration; the parser parses session contexts stored from the Live API to build memories for indexing.
The following code sample demonstrates how to create a memory corpus. Before you run it, replace the variables with your values.
import vertexai
from vertexai import rag

# Initialize the Vertex AI SDK before creating the corpus
vertexai.init(project=PROJECT_ID, location=LOCATION)

# Currently supports Google first-party embedding models
EMBEDDING_MODEL = "YOUR_EMBEDDING_MODEL"  # Such as "publishers/google/models/text-embedding-005"
MEMORY_CORPUS_DISPLAY_NAME = "YOUR_MEMORY_CORPUS_DISPLAY_NAME"
LLM_PARSER_MODEL_NAME = "YOUR_LLM_PARSER_MODEL_NAME"  # Such as "projects/{project_id}/locations/{location}/publishers/google/models/gemini-2.5-pro-preview-05-06"

memory_corpus = rag.create_corpus(
    display_name=MEMORY_CORPUS_DISPLAY_NAME,
    corpus_type_config=rag.RagCorpusTypeConfig(
        corpus_type_config=rag.MemoryCorpus(
            llm_parser=rag.LlmParserConfig(
                model_name=LLM_PARSER_MODEL_NAME,
            )
        )
    ),
    backend_config=rag.RagVectorDbConfig(
        rag_embedding_model_config=rag.RagEmbeddingModelConfig(
            vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
                publisher_model=EMBEDDING_MODEL
            )
        )
    ),
)
Specify your memory corpus to store contexts
When using your memory corpus with the Live API, specify the memory corpus as a retrieval tool, and then set store_context to True to allow the Live API to store the session contexts.
The following code sample demonstrates how to specify your memory corpus to store contexts. Before you run it, replace the variables with your values.
from google import genai
from google.genai import types
from google.genai.types import Content, LiveConnectConfig, Modality, Part
from IPython import display

PROJECT_ID = "YOUR_PROJECT_ID"
LOCATION = "YOUR_LOCATION"
TEXT_INPUT = "YOUR_TEXT_INPUT"
MODEL_NAME = "YOUR_MODEL_NAME"  # Such as "gemini-2.0-flash-exp"

client = genai.Client(
    vertexai=True,
    project=PROJECT_ID,
    location=LOCATION,
)

# Use the memory corpus created in the previous step as a retrieval tool,
# and set store_context=True so the Live API stores session contexts in it.
memory_store = types.VertexRagStore(
    rag_resources=[
        types.VertexRagStoreRagResource(
            rag_corpus=memory_corpus.name
        )
    ],
    store_context=True,
)

async with client.aio.live.connect(
    model=MODEL_NAME,
    config=LiveConnectConfig(
        response_modalities=[Modality.TEXT],
        tools=[types.Tool(retrieval=types.Retrieval(vertex_rag_store=memory_store))],
    ),
) as session:
    text_input = TEXT_INPUT

    await session.send_client_content(
        turns=Content(role="user", parts=[Part(text=text_input)])
    )

    async for message in session.receive():
        if message.text:
            display.display(display.Markdown(message.text))
            continue
What's next
- To learn more about Vertex AI RAG Engine, see Vertex AI RAG Engine overview.
- To learn more about the RAG API, see Vertex AI RAG Engine API.
- To manage your RAG corpora, see Corpus management.
- To manage your RAG files, see File management.
- To learn how to use the Vertex AI SDK to run Vertex AI RAG Engine tasks, see RAG quickstart for Python.