This document describes how to configure synthesized speech responses and voice activity detection in the Live API. You can choose from a variety of HD voices and response languages, and adjust voice activity detection (VAD) settings so that users can interrupt the model.
Set the language and voice
To set the response language and voice, configure as follows:
Console
- Open Vertex AI Studio > Stream realtime.
- In the Outputs expander, select a voice from the Voice drop-down.
- In the same expander, select a language from the Language drop-down.
- Click Start session to start the session.
Python
```python
from google.genai.types import (
    LiveConnectConfig,
    PrebuiltVoiceConfig,
    SpeechConfig,
    VoiceConfig,
)

# voice_name must be one of the supported voices listed below, for example "Puck".
config = LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(
                voice_name=voice_name,
            )
        ),
        language_code="en-US",
    ),
)
```
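For context, here is a minimal sketch of opening a live session with this config. The model ID, project settings, and greeting text are illustrative placeholders, not part of the configuration API:

```python
import asyncio

from google import genai

# Assumes Vertex AI credentials and project are configured; values are placeholders.
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

MODEL_ID = "gemini-live-2.5-flash"  # placeholder; use a Live API model available to you

async def main():
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        # Send a text turn; the model replies with synthesized audio
        # in the configured voice and language.
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Hello!"}]}
        )
        async for message in session.receive():
            if message.data:
                ...  # handle the audio bytes (for example, queue them for playback)

asyncio.run(main())
```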
Voices supported
The Live API supports the following 30 voice options in the `voice_name` field:

| Voice | Style | Voice | Style | Voice | Style |
|---|---|---|---|---|---|
| Zephyr | Bright | Puck | Upbeat | Charon | Informative |
| Kore | Firm | Fenrir | Excitable | Leda | Youthful |
| Orus | Firm | Aoede | Breezy | Callirrhoe | Easy-going |
| Autonoe | Bright | Enceladus | Breathy | Iapetus | Clear |
| Umbriel | Easy-going | Algieba | Smooth | Despina | Smooth |
| Erinome | Clear | Algenib | Gravelly | Rasalgethi | Informative |
| Laomedeia | Upbeat | Achernar | Soft | Alnilam | Firm |
| Schedar | Even | Gacrux | Mature | Pulcherrima | Forward |
| Achird | Friendly | Zubenelgenubi | Casual | Vindemiatrix | Gentle |
| Sadachbia | Lively | Sadaltager | Knowledgeable | Sulafat | Warm |
Languages supported
The Live API supports the following 24 languages:
| Language | BCP-47 Code | Language | BCP-47 Code |
|---|---|---|---|
| Arabic (Egyptian) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian (Indonesia) | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian (Romania) | ro-RO |
| Ukrainian (Ukraine) | uk-UA | Bengali (Bangladesh) | bn-BD |
| English (India) | en-IN & hi-IN bundle | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
Configure voice activity detection
Voice activity detection (VAD) allows the model to recognize when a person is speaking. This is essential for creating natural conversations, because it allows a user to interrupt the model at any time.
When VAD detects an interruption, the ongoing generation is canceled and discarded. Only the information already sent to the client is retained in the session history. The server then sends a BidiGenerateContentServerContent message to report the interruption. In addition, the server discards any pending function calls and sends a BidiGenerateContentServerContent message with the IDs of the canceled calls.
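As a minimal sketch of how a client might react to this message (assuming an open `session`, as in the connect example earlier; the playback handling is a placeholder):

```python
async for response in session.receive():
    if response.server_content and response.server_content.interrupted:
        # The turn was interrupted by user speech: stop local playback
        # and discard any buffered, not-yet-played audio.
        ...
```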
Python
```python
from google.genai import types

config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {
            "disabled": False,  # default: automatic VAD is enabled
            # How readily the start of speech is detected.
            "start_of_speech_sensitivity": types.StartSensitivity.START_SENSITIVITY_LOW,
            # How readily the end of speech is detected.
            "end_of_speech_sensitivity": types.EndSensitivity.END_SENSITIVITY_LOW,
            # Audio (in ms) included before the detected start of speech.
            "prefix_padding_ms": 20,
            # Silence (in ms) required before the end of speech is committed.
            "silence_duration_ms": 100,
        }
    },
}
```
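Conversely, if you set `"disabled": True`, automatic VAD is turned off and the client must mark speech boundaries itself. A minimal sketch, assuming an open `session`, the `types` import from above, and `audio_bytes` as a placeholder for your captured PCM audio:

```python
# Manual turn control when automatic activity detection is disabled.
await session.send_realtime_input(activity_start=types.ActivityStart())
await session.send_realtime_input(
    audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
)
await session.send_realtime_input(activity_end=types.ActivityEnd())
```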
What's next
- Start and manage live sessions
- Send audio and video streams
- Use speech-to-speech translation
- Best practices with the Live API