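Interactive conversations with the Live API

The Live API enables low-latency, two-way voice and video interactions with Gemini.

This guide explains how to set up a two-way interactive conversation, adjust audio settings, manage sessions, and more.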

Supported models

You can use the Live API with the following models:

Model version	Availability level
`gemini-live-2.5-flash`	Private GA^*
`gemini-live-2.5-flash-preview-native-audio`	Public preview
`gemini-live-2.5-flash-preview-native-audio-09-2025`	Public preview

^* To request access, contact your Google account team representative.

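Start a conversation

Console

Open [Vertex AI Studio] > [Stream realtime].
Click [Start session] to start the conversation.

To end the session, click [Stop session].

Python

To enable real-time, interactive conversations, run the following sample on a local computer with microphone and speaker access (not in a Colab notebook). The sample sets up a conversation with the API so that you can send audio prompts and receive audio responses.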

"""
# Installation
# on Linux
sudo apt-get install portaudio19-dev

# on macOS
brew install portaudio

python3 -m venv env
source env/bin/activate
pip install google-genai pyaudio
"""

import asyncio
import os

import pyaudio
from google import genai
from google.genai import types

CHUNK = 4200  # frames read from the microphone per buffer
FORMAT = pyaudio.paInt16  # 16-bit PCM
CHANNELS = 1  # mono
MODEL = "gemini-live-2.5-flash"
INPUT_RATE = 16000  # sample rate (Hz) of the microphone audio
OUTPUT_RATE = 24000  # sample rate (Hz) of the audio the model returns

# Assumption: the project ID and location are read from environment variables.
GOOGLE_CLOUD_PROJECT = os.environ["GOOGLE_CLOUD_PROJECT"]
GOOGLE_CLOUD_LOCATION = os.environ["GOOGLE_CLOUD_LOCATION"]

client = genai.Client(
    vertexai=True,
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_LOCATION,
)
config = {
    "response_modalities": ["AUDIO"],
    "input_audio_transcription": {},  # Configure input transcription
    "output_audio_transcription": {},  # Configure output transcription
}


async def main():
    print(MODEL)
    p = pyaudio.PyAudio()
    async with client.aio.live.connect(model=MODEL, config=config) as session:

        async def send():
            # Stream microphone audio to the API.
            stream = p.open(
                format=FORMAT, channels=CHANNELS, rate=INPUT_RATE, input=True, frames_per_buffer=CHUNK)
            while True:
                # Read in a worker thread so the blocking call does not stall the event loop.
                frame = await asyncio.to_thread(stream.read, CHUNK, exception_on_overflow=False)
                await session.send_realtime_input(
                    media=types.Blob(data=frame, mime_type=f"audio/pcm;rate={INPUT_RATE}")
                )

        async def receive():
            # Print transcriptions and play the audio that the model returns.
            output_stream = p.open(
                format=FORMAT, channels=CHANNELS, rate=OUTPUT_RATE, output=True, frames_per_buffer=CHUNK)
            # session.receive() stops iterating after each completed turn, so loop to keep listening.
            while True:
                async for message in session.receive():
                    content = message.server_content
                    if content is None:
                        continue
                    if content.input_transcription or content.output_transcription:
                        print(content.model_dump(mode="json", exclude_none=True))
                    if content.model_turn:
                        for part in content.model_turn.parts or []:
                            if part.inline_data and part.inline_data.data:
                                # Write in a worker thread; playback is a blocking call.
                                await asyncio.to_thread(output_stream.write, part.inline_data.data)

        await asyncio.gather(send(), receive())


asyncio.run(main())
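
To make the buffer constants in the sample above concrete, the following sketch works out how much audio one CHUNK holds. It assumes 16-bit mono PCM, as set by FORMAT and CHANNELS in the sample; the helper name chunk_stats is hypothetical and is not part of any SDK.

```python
# Constants copied from the sample above.
CHUNK = 4200        # frames per buffer
CHANNELS = 1        # mono
SAMPLE_WIDTH = 2    # bytes per sample for 16-bit PCM (pyaudio.paInt16)
INPUT_RATE = 16000  # frames per second


def chunk_stats(chunk, channels, sample_width, rate):
    """Return (bytes per chunk, milliseconds of audio per chunk)."""
    size_bytes = chunk * channels * sample_width
    duration_ms = chunk / rate * 1000
    return size_bytes, duration_ms


size_bytes, duration_ms = chunk_stats(CHUNK, CHANNELS, SAMPLE_WIDTH, INPUT_RATE)
print(f"{size_bytes} bytes and {duration_ms:.1f} ms of audio per chunk")
```

A smaller CHUNK shortens the delay before each piece of audio is sent, at the cost of sending more messages.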

Python

Set up a conversation with the API in which you send text prompts and receive audio responses.

import base64
import json

import numpy as np
from IPython.display import Audio, Markdown, display
from websockets.asyncio.client import connect

# This sample is meant to run in a notebook. It assumes that SERVICE_URL (the
# Live API WebSocket endpoint) and bearer_token (a list whose first item is an
# access token) are already defined.

# Set model generation_config
CONFIG = {"response_modalities": ["AUDIO"]}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

async def main() -> None:
    # Connect to the server
    async with connect(SERVICE_URL, additional_headers=headers) as ws:

        # Setup the session
        async def setup() -> None:
            await ws.send(
                json.dumps(
                    {
                        "setup": {
                            "model": "gemini-live-2.5-flash",
                            "generation_config": CONFIG,
                        }
                    }
                )
            )

            # Receive setup response
            raw_response = await ws.recv(decode=False)
            setup_response = json.loads(raw_response.decode("ascii"))
            print(f"Connected: {setup_response}")

        # Send text message
        async def send() -> bool:
            text_input = input("Input > ")
            if text_input.lower() in ("q", "quit", "exit"):
                return False

            msg = {
                "client_content": {
                    "turns": [{"role": "user", "parts": [{"text": text_input}]}],
                    "turn_complete": True,
                }
            }

            await ws.send(json.dumps(msg))
            return True

        # Receive server response
        async def receive() -> None:
            responses = []

            # Receive chunks of server response
            async for raw_response in ws:
                response = json.loads(raw_response.decode())
                server_content = response.pop("serverContent", None)
                if server_content is None:
                    break

                model_turn = server_content.pop("modelTurn", None)
                if model_turn is not None:
                    parts = model_turn.pop("parts", None)
                    if parts is not None:
                        for part in parts:
                            # Audio arrives as base64-encoded 16-bit PCM.
                            pcm_data = base64.b64decode(part["inlineData"]["data"])
                            responses.append(np.frombuffer(pcm_data, dtype=np.int16))

                # End of turn
                turn_complete = server_content.pop("turnComplete", None)
                if turn_complete:
                    break

            # Play the returned audio message (skipped if no audio arrived)
            if responses:
                display(Markdown("**Response >**"))
                display(Audio(np.concatenate(responses), rate=24000, autoplay=True))

        await setup()

        while True:
            if not await send():
                break
            await receive()

Start the conversation, then enter a prompt, or enter q, quit, or exit to end it.

await main()
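
The WebSocket sample plays the response with the notebook Audio widget. As a sketch for other environments, the following wraps raw PCM in a WAV container with the standard-library wave module so that it can be saved or played elsewhere. It assumes 16-bit mono PCM at 24,000 Hz, matching the dtype and rate used in the sample; pcm_to_wav is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm_data: bytes, rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm_data)
    return buffer.getvalue()


# Example input: one second of silence (24,000 zero-valued samples).
wav_bytes = pcm_to_wav(b"\x00\x00" * 24000)
```

The returned bytes can be written to a file with a .wav extension.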

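Change language and voice settings

The Live API uses Chirp 3 to support synthesized speech responses in a variety of HD voices and languages. For the full list of voices and a demo of each, see Chirp 3: HD voices.

To set the response voice and language:

Console

Open [Vertex AI Studio] > [Stream realtime].
In the [Outputs] expander, select a voice from the [Voice] drop-down.
In the same expander, select a language from the [Language] drop-down.
Click [Start session] to start the session.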

Python

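from google.genai.types import (
    LiveConnectConfig,
    PrebuiltVoiceConfig,
    SpeechConfig,
    VoiceConfig,
)

# Example value only; use any voice from the Chirp 3: HD voices list.
voice_name = "Aoede"

config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
        language_code="en-US",
    ),
)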
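
The first sample in this guide passes its configuration as a plain dictionary rather than as typed objects. The following sketch builds the voice and language settings in that form. It assumes that the dictionary keys mirror the field names of the typed classes; build_speech_config is a hypothetical helper, and "Aoede" is only an example voice name.

```python
def build_speech_config(voice_name: str, language_code: str = "en-US") -> dict:
    """Return a dictionary config that selects a response voice and language."""
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {"voice_name": voice_name},
            },
            "language_code": language_code,
        },
    }


# Example voice name; pick any voice from the Chirp 3: HD voices list.
config = build_speech_config("Aoede")
```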

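Change voice activity detection settings

Voice activity detection (VAD) lets the model recognize when a person is speaking. This is essential for natural conversations because it lets the user interrupt the model at any time.

The model automatically performs VAD on a continuous audio input stream. You can configure the VAD settings with the realtimeInputConfig.automaticActivityDetection field of the setup message. When VAD detects an interruption, the ongoing generation is canceled and discarded. Only the information already sent to the client is retained in the session history. The server then sends a message to report the interruption.

If the audio stream pauses for more than one second (for example, because the user turned off the microphone), the client should send an audioStreamEnd event to flush any cached audio. The client can resume sending audio data at any time.

You can also turn off automatic VAD by setting realtimeInputConfig.automaticActivityDetection.disabled to true in the setup message. In this configuration, the client is responsible for detecting user speech and for sending activityStart and activityEnd messages at the appropriate times. No audioStreamEnd is sent. Instead, an interruption of the stream is marked by an activityEnd message.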

Python

from pathlib import Path

from IPython.display import Markdown, display
from google.genai.types import (
    AutomaticActivityDetection,
    Blob,
    EndSensitivity,
    LiveConnectConfig,
    RealtimeInputConfig,
    StartSensitivity,
)

# Assumes that client and MODEL_ID are already defined, and that the code runs
# in a notebook, where a top-level "async with" is allowed.

config = LiveConnectConfig(
    response_modalities=["TEXT"],
    realtime_input_config=RealtimeInputConfig(
        automatic_activity_detection=AutomaticActivityDetection(
            disabled=False,  # default
            start_of_speech_sensitivity=StartSensitivity.START_SENSITIVITY_LOW, # Either START_SENSITIVITY_LOW or START_SENSITIVITY_HIGH
            end_of_speech_sensitivity=EndSensitivity.END_SENSITIVITY_LOW, # Either END_SENSITIVITY_LOW or END_SENSITIVITY_HIGH
            prefix_padding_ms=20,
            silence_duration_ms=100,
        )
    ),
)

async with client.aio.live.connect(
    model=MODEL_ID,
    config=config,
) as session:
    audio_bytes = Path("sample.pcm").read_bytes()

    await session.send_realtime_input(
        media=Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
    )

    # if stream gets paused, send:
    # await session.send_realtime_input(audio_stream_end=True)

    response = []
    async for message in session.receive():
        # server_content is absent on some message types, so check it first.
        if message.server_content and message.server_content.interrupted:
            # The model generation was interrupted
            response.append("The session was interrupted")

        if message.text:
            response.append(message.text)

    display(Markdown(f"**Response >** {''.join(response)}"))
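
When automatic VAD is turned off, the client has to decide when speech starts and ends. The following is a deliberately simple, energy-based sketch of that decision, not the detection method used by the Live API. It assumes 16-bit PCM frames in the machine's native byte order; the threshold value and the function names are hypothetical. Each yielded event corresponds to a message the client would send: activityStart, an audio chunk, or activityEnd.

```python
import array
import math


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit PCM samples."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def detect_activity(frames, threshold=500.0):
    """Yield ("activity_start" | "audio" | "activity_end", frame) events."""
    speaking = False
    for frame in frames:
        loud = rms(frame) >= threshold
        if loud and not speaking:
            speaking = True
            yield ("activity_start", None)
        elif speaking and not loud:
            # Simplification: a single quiet frame ends the activity.
            speaking = False
            yield ("activity_end", None)
        if speaking:
            yield ("audio", frame)
    if speaking:
        # Close an activity that is still open when the input ends.
        yield ("activity_end", None)


silence = array.array("h", [0] * 160).tobytes()
speech = array.array("h", [2000] * 160).tobytes()
events = [name for name, _ in detect_activity([silence, speech, speech, silence])]
print(events)
```

A practical detector would usually wait for several quiet frames before ending an activity, similar in spirit to the silence_duration_ms setting of automatic VAD.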

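Extend a session

The default maximum length of a conversation session is 10 minutes. A goAway notification (BidiGenerateContentServerMessage.goAway) is sent to the client 60 seconds before the session ends.

With the Gen AI SDK, you can extend the session length in 10-minute increments. There is no limit on the number of times you can extend a session. For an example, see Enable and disable session resumption.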
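
As a sketch of how a raw WebSocket client might notice the goAway notification, the following classifies an incoming JSON message by its top-level key. It assumes that the notification appears under the key goAway, in the same camelCase style as the serverContent key read by the WebSocket sample in this guide; the function name is hypothetical, and the contents of the goAway object are not inspected.

```python
import json


def classify_server_message(raw: str) -> str:
    """Return "go_away", "server_content", or "other" for a raw JSON message."""
    message = json.loads(raw)
    if "goAway" in message:
        # The session is about to end; this is the cue to reconnect or resume.
        return "go_away"
    if "serverContent" in message:
        return "server_content"
    return "other"


print(classify_server_message('{"goAway": {}}'))
```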

Context window

The Live API context window stores data streamed in real time (25 tokens per second (TPS) for audio and 258 TPS for video) along with other content, such as text inputs and model outputs.
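
To give a feel for these rates, the following sketch estimates how long a stream can run before it reaches a given token count. It ignores text inputs and model outputs, and it assumes that the audio and video rates simply add up when both streams are active; the function names are hypothetical.

```python
AUDIO_TPS = 25   # tokens per second of streamed audio
VIDEO_TPS = 258  # tokens per second of streamed video


def stream_rate(audio: bool = True, video: bool = False) -> int:
    """Tokens added to the context window per second of streaming."""
    # Assumption: when both streams are active, their rates add up.
    return (AUDIO_TPS if audio else 0) + (VIDEO_TPS if video else 0)


def seconds_until_full(limit_tokens: int, audio: bool = True, video: bool = False) -> float:
    """Seconds of streaming before limit_tokens is reached."""
    rate = stream_rate(audio, video)
    if rate == 0:
        raise ValueError("at least one stream must be active")
    return limit_tokens / rate


print(seconds_until_full(32_000))              # audio only
print(seconds_until_full(32_000, video=True))  # audio and video
```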

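If the context window exceeds its maximum length (set with the [Max context size] slider in Vertex AI Studio, or with trigger_tokens in the API), context window compression is triggered: the oldest turns are truncated so that the session doesn't end abruptly. Compression deletes the oldest parts of the conversation until the total token count is back down to the target size (set with the [Target context size] slider in Vertex AI Studio, or with target_tokens in the API).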

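For example, if the maximum context length is set to 32,000 tokens and the target size is set to 16,000 tokens:

Turn 1: The conversation starts. In this example, the request uses 12,000 tokens.
- Total context size: 12,000 tokens
Turn 2: You make another request, which uses another 12,000 tokens.
- Total context size: 24,000 tokens
Turn 3: You make another request, which uses 14,000 tokens.
- Total context size: 38,000 tokens

Because the total context size now exceeds the 32,000-token maximum, context window compression is triggered. The system goes back to the start of the conversation and removes old turns until the total falls below the 16,000-token target.

Turn 1 (12,000 tokens) is removed. The total is now 26,000 tokens, which is still above the 16,000-token target.
Turn 2 (12,000 tokens) is removed. The total is now 14,000 tokens.

In the end, only Turn 3 remains in active memory, and the conversation continues from that point.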
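
The walkthrough above can be reproduced with a small simulation. This is only a sketch of the behavior as described in this guide, not the service's implementation: it assumes that whole turns are removed, oldest first, and that removal stops as soon as the total no longer exceeds the target. The function name compress is hypothetical.

```python
def compress(turns, trigger_tokens, target_tokens):
    """Drop the oldest turns once the total exceeds trigger_tokens."""
    turns = list(turns)
    if sum(turns) <= trigger_tokens:
        return turns  # under the maximum, so nothing is removed
    while turns and sum(turns) > target_tokens:
        turns.pop(0)  # remove the oldest turn
    return turns


history = []
for tokens in (12_000, 12_000, 14_000):
    history.append(tokens)
    history = compress(history, trigger_tokens=32_000, target_tokens=16_000)
    print(history, sum(history))
```

After the third turn, only the 14,000-token turn remains, which matches the walkthrough.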

The minimum and maximum values for the context length and target size are:

Setting (API flag)	Minimum value	Maximum value
Maximum context length (`trigger_tokens`)	5,000	128,000
Target context size (`target_tokens`)	0	128,000
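
A client can check its settings against these ranges before it opens a session. The following is a minimal sketch with a hypothetical function name. The first two checks come straight from the table; the last check, that the target should not exceed the trigger, is an assumption that this guide does not state.

```python
TRIGGER_RANGE = (5_000, 128_000)  # allowed values for trigger_tokens
TARGET_RANGE = (0, 128_000)       # allowed values for target_tokens


def validate_context_settings(trigger_tokens: int, target_tokens: int) -> list:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if not TRIGGER_RANGE[0] <= trigger_tokens <= TRIGGER_RANGE[1]:
        problems.append(f"trigger_tokens must be between {TRIGGER_RANGE[0]} and {TRIGGER_RANGE[1]}")
    if not TARGET_RANGE[0] <= target_tokens <= TARGET_RANGE[1]:
        problems.append(f"target_tokens must be between {TARGET_RANGE[0]} and {TARGET_RANGE[1]}")
    # Assumption (not stated in this guide): the target should not exceed the trigger.
    if target_tokens > trigger_tokens:
        problems.append("target_tokens should not exceed trigger_tokens")
    return problems


print(validate_context_settings(32_000, 16_000))
print(validate_context_settings(1_000, 10))
```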

To set the context window:

Console

Open [Vertex AI Studio] > [Stream realtime].
Click to open the [Advanced] menu.
In the [Session context] section, use the [Max context size] slider to set the context size to a value between 5,000 and 128,000.
(Optional) In the same section, use the [Target context size] slider to set the target size to a value between 0 and 128,000.

Python

Set the context_window_compression.trigger_tokens and context_window_compression.sliding_window.target_tokens fields in the setup message.

from google.genai import types

config = types.LiveConnectConfig(
    temperature=0.7,
    response_modalities=["TEXT"],
    system_instruction=types.Content(
        parts=[types.Part(text="test instruction")], role="user"
    ),
    context_window_compression=types.ContextWindowCompressionConfig(
        # Both values must fall within the documented ranges
        # (trigger_tokens: 5,000 to 128,000; target_tokens: 0 to 128,000).
        trigger_tokens=32000,
        sliding_window=types.SlidingWindow(target_tokens=16000),
    ),
)

Concurrent sessions

You can have up to 1,000 concurrent sessions per project.
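
An application that opens many sessions can cap its own concurrency so that it stays under this limit. The following sketch uses asyncio.Semaphore for that purpose. It limits sessions within a single process only, so several processes that share a project would need to coordinate separately. The names run_limited and fake_session are hypothetical, and the demo uses a limit of 2 so that the effect is easy to see.

```python
import asyncio

MAX_CONCURRENT_SESSIONS = 1000  # per-project limit stated above


async def run_limited(jobs, limit=MAX_CONCURRENT_SESSIONS):
    """Run zero-argument coroutine functions with at most `limit` active at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    active = 0
    peak = 0

    async def fake_session():
        # Stand-in for opening and using a Live API session.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return "done"

    results = await run_limited([fake_session] * 5, limit=2)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```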

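Update system instructions during a session

With the Live API, you can update the system instructions during an active session. Use this to adapt the model's responses, for example to change the response language or tone.

To update the system instructions mid-session, send text content with the system role. The updated system instructions remain in effect for the rest of the session.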

Python

# send_client_content is a coroutine, so it must be awaited.
await session.send_client_content(
    turns=types.Content(
        role="system", parts=[types.Part(text="new system instruction")]
    ),
    turn_complete=False,
)
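
For clients that talk to the API over a raw WebSocket, the same update can be expressed as a JSON message. The following sketch reuses the client_content message shape from the WebSocket sample earlier in this guide and only changes the role to system. The function name is hypothetical, and the assumption is that the service accepts the system role in this message in the same way that it does through the SDK.

```python
import json


def system_instruction_message(text: str) -> str:
    """Build a client_content message that updates the system instructions."""
    message = {
        "client_content": {
            "turns": [{"role": "system", "parts": [{"text": text}]}],
            # The update is not a user turn, so the turn is left open.
            "turn_complete": False,
        }
    }
    return json.dumps(message)


payload = system_instruction_message("Respond only in French.")
print(payload)
```

The resulting string would be passed to ws.send in the same way as the user message in the WebSocket sample.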

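Enable and disable session resumption

Session resumption lets you reconnect to a previous session within 24 hours. It works by storing cached data, including text, video, and audio prompts and model outputs. Project-level privacy is enforced for this cached data.

By default, session resumption is disabled.

To enable session resumption, set the sessionResumption field of the BidiGenerateContentSetup message. When it's enabled, the server periodically takes a snapshot of the current cached session context and stores it in internal storage.

When a snapshot is taken successfully, a resumptionUpdate is returned that contains a handle ID. Record this handle ID so that you can use it later to resume the session from the snapshot.

The following example enables session resumption and retrieves the handle ID: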

Python

import asyncio
import os

from google import genai
from google.genai import types

# Assumption: the project ID and location are read from environment variables.
GOOGLE_CLOUD_PROJECT = os.environ["GOOGLE_CLOUD_PROJECT"]
GOOGLE_CLOUD_LOCATION = os.environ["GOOGLE_CLOUD_LOCATION"]

client = genai.Client(
    vertexai=True,
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_LOCATION,
)
model = "gemini-live-2.5-flash"

# Handle saved from an earlier session, or None to start a new session.
previous_session_handle = None

async def main():
    print(f"Connecting to the service with handle {previous_session_handle}...")
    async with client.aio.live.connect(
        model=model,
        config=types.LiveConnectConfig(
            response_modalities=["AUDIO"],
            session_resumption=types.SessionResumptionConfig(
                # The handle of the session to resume is passed here,
                # or else None to start a new session.
                handle=previous_session_handle
            ),
        ),
    ) as session:
        while True:
            # For the purposes of this example, placeholder input is continually fed
            # to the model. In non-sample code, the model inputs would come from
            # the user.
            await session.send_client_content(
                turns=types.Content(
                    role="user", parts=[types.Part(text="Hello world!")]
                )
            )
            async for message in session.receive():
                # Periodically, the server will send update messages that may
                # contain a handle for the current state of the session.
                if message.session_resumption_update:
                    update = message.session_resumption_update
                    if update.resumable and update.new_handle:
                        # The handle should be retained and linked to the session.
                        return update.new_handle

                # Stop reading once the model finishes its turn, then send the next input.
                if message.server_content and message.server_content.turn_complete:
                    break

if __name__ == "__main__":
    new_handle = asyncio.run(main())
    print(f"New session handle: {new_handle}")
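
Because a session can be resumed only within 24 hours, a client that stores the handle may also want to store when it was received. The following sketch keeps both in a small JSON file and treats the handle as expired after 24 hours. It assumes that the window is counted from the moment the handle was saved, which this guide does not specify; the file layout and function names are hypothetical.

```python
import json
import os
import tempfile
import time
from pathlib import Path

RESUMPTION_WINDOW_SECONDS = 24 * 60 * 60


def save_handle(path, handle, now=None):
    """Store the handle together with the time it was saved."""
    saved_at = time.time() if now is None else now
    Path(path).write_text(json.dumps({"handle": handle, "saved_at": saved_at}))


def load_handle(path, now=None):
    """Return the stored handle, or None if it is missing or older than 24 hours."""
    file = Path(path)
    if not file.exists():
        return None
    record = json.loads(file.read_text())
    now = time.time() if now is None else now
    if now - record["saved_at"] >= RESUMPTION_WINDOW_SECONDS:
        return None
    return record["handle"]


# Demo with explicit timestamps so that the result does not depend on the clock.
demo_path = os.path.join(tempfile.mkdtemp(), "handle.json")
save_handle(demo_path, "example-handle", now=0)
fresh = load_handle(demo_path, now=3600)
stale = load_handle(demo_path, now=RESUMPTION_WINDOW_SECONDS + 1)
print(fresh, stale)
```

The value returned by load_handle can be used as previous_session_handle in the example above, since None starts a new session.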

For seamless session resumption, enable transparent mode:

Python

types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    session_resumption=types.SessionResumptionConfig(
        transparent=True,
    ),
)

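When transparent mode is enabled, the index of the client message that corresponds to the context snapshot is returned explicitly. This helps you identify which client messages you need to send again when you resume the session from the resumption handle.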
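
One way a client could use that index is to keep its sent messages in order and, on resumption, send again everything after the indexed message. The following sketch makes two assumptions that this guide does not state: the index is zero-based, and it identifies the last client message included in the snapshot. The function name is hypothetical.

```python
def messages_to_resend(sent_messages, last_consumed_index):
    """Return the client messages that were sent after the snapshot."""
    # Assumption: zero-based index of the last message covered by the snapshot.
    return sent_messages[last_consumed_index + 1:]


sent = ["msg-0", "msg-1", "msg-2", "msg-3"]
print(messages_to_resend(sent, 1))
```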

More information

For more information about using the Live API, see the following.
