The Gemini Live API enables low-latency, real-time voice and video interactions with Gemini. It processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses. This creates a natural conversational experience for your users.
## Key features
The Gemini Live API offers a comprehensive set of features for building robust voice and video agents:
- High audio quality: The Gemini Live API produces natural, realistic-sounding speech across multiple languages.
- Multilingual support: Converse in 24 supported languages.
- Barge-in: Users can interrupt the model at any time for responsive interactions.
- Affective dialog: Adapts response style and tone to match the user's input expression.
- Proactive audio: The model can decide when (and whether) to respond, staying silent when audio isn't directed at it.
- Tool use: Integrates tools like function calling and Google Search for dynamic interactions.
- Audio transcriptions: Provides text transcripts of both user input and model output.
- Speech-to-speech translation: (Experimental) Optimized for low-latency translation between languages.
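Several of these features are enabled at session setup. Here is a hedged configuration sketch using the google-genai Python SDK; field names such as `enable_affective_dialog` and `ProactivityConfig` reflect recent SDK versions and may differ in the version you have installed, so treat them as assumptions and check your SDK reference:

```
from google import genai
from google.genai import types

# Sketch of a Live session config enabling transcription, affective
# dialog, and proactive audio (native-audio models only).
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
    enable_affective_dialog=True,
    proactivity=types.ProactivityConfig(proactive_audio=True),
)
```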
## Technical specifications
The following table outlines the technical specifications for the Gemini Live API:
| Category | Details |
|---|---|
| Input modalities | Audio (raw 16-bit PCM, 16 kHz, little-endian), images/video (JPEG at 1 frame per second), text |
| Output modalities | Audio (raw 16-bit PCM, 24 kHz, little-endian), text |
| Protocol | Stateful WebSocket connection (WSS) |
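Input audio must arrive as raw 16-bit little-endian PCM sampled at 16 kHz. A minimal, standard-library-only sketch of that conversion, assuming your capture pipeline yields float samples in [-1.0, 1.0] (the function name `floats_to_pcm16le` is illustrative, not part of the API):

```python
import struct

def floats_to_pcm16le(samples):
    """Clamp float samples to [-1.0, 1.0] and pack them as raw
    16-bit little-endian PCM, the input format the Live API expects."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clamp to the valid range
        ints.append(int(round(s * 32767)))  # scale to signed 16-bit
    return struct.pack("<%dh" % len(ints), *ints)  # "<" = little-endian

chunk = floats_to_pcm16le([0.0, 1.0, -1.0])
```

Each float sample becomes two bytes, so a 10 ms frame at 16 kHz (160 samples) packs to 320 bytes.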
## Supported models
The following models support the Gemini Live API. Select the appropriate model based on your interaction requirements.
| Model ID | Availability | Use case | Key features |
|---|---|---|---|
| `gemini-live-2.5-flash-preview-native-audio-09-2025` | Public preview | Cost-efficient real-time voice agents. | Native audio, audio transcriptions, voice activity detection, affective dialog, proactive audio, tool use |
| `gemini-2.5-flash-s2st-exp-11-2025` | Private experimental | Speech-to-speech translation (experimental); optimized for translation tasks. | Native audio, audio transcriptions, tool use, speech-to-speech translation |
## Get started
Select the guide that matches your development environment:
### Gen AI SDK tutorial
Connect to the Gemini Live API using the Gen AI SDK to build a real-time multimodal application with a Python backend.
### WebSocket tutorial
Connect to the Gemini Live API using WebSockets to build a real-time multimodal application with a JavaScript frontend and a Python backend.
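Over the raw WebSocket, the client's first frame is a JSON `setup` message naming the model. A hedged sketch of its shape follows; the exact field names come from the BidiGenerateContent protocol and may vary by API version, and on Vertex AI the model may need to be a fully qualified resource name, so verify against the current protocol reference:

```python
import json

# Model ID taken from the supported-models table above.
MODEL = "gemini-live-2.5-flash-preview-native-audio-09-2025"

# Assumed shape of the initial `setup` frame; field names are
# illustrative and should be checked against the protocol docs.
setup_frame = json.dumps({
    "setup": {
        "model": MODEL,
        "generation_config": {"response_modalities": ["AUDIO"]},
    }
})
```

After the server acknowledges setup, subsequent frames carry streamed audio chunks and receive model responses on the same connection.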
### ADK tutorial
Create an agent and use the Agent Development Kit (ADK) Streaming to enable voice and video communication.
## Partner integrations
If you prefer a simpler development process, you can use one of our partner platforms. These platforms have already integrated the Gemini Live API over the WebRTC protocol to streamline the development of real-time audio and video applications.
