Gemini Live S2S Voice Provider
Gemini Live provides native speech-to-speech via the Google Gemini Live API over WebSocket, enabling bidirectional audio streaming with Gemini’s multimodal models. The provider supports audio and text output, function calling, and server-side voice activity detection.
Choose Gemini Live when you want native speech-to-speech on Google’s Gemini models and prefer to leave voice activity detection to the server, which simplifies the client. For OpenAI’s equivalent, consider OpenAI Realtime. For AWS-native S2S, consider Amazon Nova S2S.
Installation
```go
import _ "github.com/lookatitude/beluga-ai/voice/s2s/providers/gemini"
```

The blank import registers the `"gemini_live"` provider with the S2S registry.
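The registry internals are not shown on this page. As a rough illustration of why a blank import is enough, the sketch below shows the common Go pattern of registering a factory from an `init` function. The `Factory` type, `Register`, and `New` here are hypothetical stand-ins, not the actual `s2s` package API.

```go
package main

import (
	"fmt"
	"sync"
)

// Factory is a hypothetical stand-in for a provider constructor.
// The real s2s factory signature may differ.
type Factory func(cfg map[string]any) (string, error)

var (
	mu        sync.RWMutex
	factories = map[string]Factory{}
)

// Register stores a factory under a provider name.
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[name] = f
}

// New looks up a registered factory by name and invokes it.
func New(name string, cfg map[string]any) (string, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return "", fmt.Errorf("provider %q not registered (missing blank import?)", name)
	}
	return f(cfg)
}

// In a provider package, an init function like this one runs as a side
// effect of the blank import, before main starts.
func init() {
	Register("gemini_live", func(cfg map[string]any) (string, error) {
		return "gemini_live engine", nil
	})
}

func main() {
	engine, err := New("gemini_live", nil)
	fmt.Println(engine, err)

	_, err = New("unknown", nil)
	fmt.Println(err)
}
```

If the blank import is omitted, a lookup by name has nothing to find, which is the failure mode the second call in `main` illustrates.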
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| `Voice` | `string` | — | Prebuilt voice name |
| `Model` | `string` | `"gemini-2.0-flash-exp"` | Gemini model identifier |
| `Instructions` | `string` | — | System instruction for the session |
| `Tools` | `[]schema.ToolDefinition` | — | Tools available to the model |
| `Extra` | — | — | See Extra Fields below |
Extra Fields
| Key | Type | Required | Description |
|---|---|---|---|
| `api_key` | `string` | Yes | Google AI API key |
| `base_url` | `string` | No | Override WebSocket URL |
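Because `Extra` is an untyped map, a missing or mistyped `api_key` only surfaces at runtime. The sketch below shows one way an application might validate these two keys before constructing the engine. The `parseExtra` helper, the `settings` type, and the `defaultURL` placeholder are hypothetical; the provider's own validation and default endpoint are not documented here.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultURL is a placeholder only; the provider's real default
// endpoint is not stated on this page.
const defaultURL = "wss://example.invalid/live"

// settings holds the two documented Extra keys in typed form.
type settings struct {
	APIKey  string
	BaseURL string
}

// parseExtra checks the documented keys: api_key is required,
// base_url is optional and falls back to defaultURL.
func parseExtra(extra map[string]any) (settings, error) {
	key, _ := extra["api_key"].(string)
	if key == "" {
		return settings{}, errors.New(`extra["api_key"] is required and must be a non-empty string`)
	}
	s := settings{APIKey: key, BaseURL: defaultURL}
	if v, present := extra["base_url"]; present {
		u, isString := v.(string)
		if !isString || u == "" {
			return settings{}, errors.New(`extra["base_url"] must be a non-empty string`)
		}
		s.BaseURL = u
	}
	return s, nil
}

func main() {
	s, err := parseExtra(map[string]any{"api_key": "test-key"})
	fmt.Println(s.BaseURL, err)

	_, err = parseExtra(map[string]any{"base_url": "wss://example.com/ws"})
	fmt.Println(err)
}
```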
Basic Usage
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/voice/s2s"
	_ "github.com/lookatitude/beluga-ai/voice/s2s/providers/gemini"
)

func main() {
	ctx := context.Background()

	engine, err := s2s.New("gemini_live", s2s.Config{
		Model: "gemini-2.0-flash-exp",
		Extra: map[string]any{"api_key": os.Getenv("GOOGLE_API_KEY")},
	})
	if err != nil {
		log.Fatal(err)
	}

	session, err := engine.Start(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Send audio to the session. audioChunk is a placeholder for PCM
	// audio captured by your application.
	if err := session.SendAudio(ctx, audioChunk); err != nil {
		log.Fatal(err)
	}

	// Receive events. playAudio is a placeholder for your playback routine.
	for event := range session.Recv() {
		switch event.Type {
		case s2s.EventAudioOutput:
			playAudio(event.Audio)
		case s2s.EventTextOutput:
			fmt.Printf("Agent: %s", event.Text)
		case s2s.EventToolCall:
			fmt.Printf("Tool call: %s(%s)\n", event.ToolCall.Name, event.ToolCall.Arguments)
		case s2s.EventTurnEnd:
			fmt.Println("--- turn complete ---")
		case s2s.EventError:
			log.Printf("error: %v", event.Error)
		}
	}
}
```

Direct Construction
```go
import (
	"os"

	"github.com/lookatitude/beluga-ai/voice/s2s"
	"github.com/lookatitude/beluga-ai/voice/s2s/providers/gemini"
)

engine, err := gemini.New(s2s.Config{
	Model:        "gemini-2.0-flash-exp",
	Instructions: "You are a helpful voice assistant.",
	Extra:        map[string]any{"api_key": os.Getenv("GOOGLE_API_KEY")},
})
```

Session Lifecycle
- Start: Opens a WebSocket connection with the API key as a query parameter and sends a setup message configuring the model, generation config (audio modalities, voice), system instructions, and tools.
- Audio I/O: Audio is sent via `SendAudio` as base64-encoded PCM (16kHz) in `realtimeInput.mediaChunks`. Output audio arrives as base64-encoded `inlineData` in `serverContent.modelTurn.parts`.
- Text: Send text via `SendText` using `clientContent.turns` with `turnComplete: true`.
- Tool Calls: The model emits function calls via `toolCall.functionCalls`. Send results back via `SendToolResult` using the `toolResponse` message format.
- Interrupt: Gemini Live handles interruptions via server-side VAD. The `Interrupt` method is a no-op since the server detects user speech automatically.
- Close: Terminates the WebSocket connection.
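To make the Audio I/O step concrete, the sketch below builds a `realtimeInput.mediaChunks` message and extracts audio from a `serverContent.modelTurn.parts` message using only the standard library. The provider does this for you, so this is illustration only. The outer field paths come from the list above; the `mimeType` and `data` key names inside each chunk are assumptions based on the public Gemini Live API, not on this provider's source.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// blob is the assumed shape of one audio chunk on the wire.
type blob struct {
	MimeType string `json:"mimeType"`
	Data     string `json:"data"` // base64-encoded PCM
}

// audioMessage mirrors realtimeInput.mediaChunks.
type audioMessage struct {
	RealtimeInput struct {
		MediaChunks []blob `json:"mediaChunks"`
	} `json:"realtimeInput"`
}

// part is one entry of serverContent.modelTurn.parts; only inlineData is read.
type part struct {
	InlineData *blob `json:"inlineData"`
}

// serverMessage mirrors serverContent.modelTurn.parts.
type serverMessage struct {
	ServerContent struct {
		ModelTurn struct {
			Parts []part `json:"parts"`
		} `json:"modelTurn"`
	} `json:"serverContent"`
}

// encodeAudio wraps raw PCM bytes in a realtimeInput message.
func encodeAudio(pcm []byte) ([]byte, error) {
	var msg audioMessage
	msg.RealtimeInput.MediaChunks = []blob{{
		MimeType: "audio/pcm;rate=16000",
		Data:     base64.StdEncoding.EncodeToString(pcm),
	}}
	return json.Marshal(msg)
}

// decodeAudio returns the decoded bytes of every inlineData part,
// skipping parts that carry no inline data.
func decodeAudio(raw []byte) ([][]byte, error) {
	var msg serverMessage
	if err := json.Unmarshal(raw, &msg); err != nil {
		return nil, err
	}
	var out [][]byte
	for _, p := range msg.ServerContent.ModelTurn.Parts {
		if p.InlineData == nil {
			continue
		}
		b, err := base64.StdEncoding.DecodeString(p.InlineData.Data)
		if err != nil {
			return nil, err
		}
		out = append(out, b)
	}
	return out, nil
}

func main() {
	out, err := encodeAudio([]byte{0, 1, 2, 3})
	fmt.Println(string(out), err)
}
```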
Tool Calling
```go
session, err := engine.Start(ctx,
	s2s.WithInstructions("You are a helpful assistant."),
	s2s.WithTools([]schema.ToolDefinition{
		{
			Name:        "search",
			Description: "Search the web for information",
			InputSchema: map[string]any{
				"type": "object",
				"properties": map[string]any{
					"query": map[string]any{"type": "string"},
				},
				"required": []string{"query"},
			},
		},
	}),
)

for event := range session.Recv() {
	if event.Type == s2s.EventToolCall {
		// executeSearch is a placeholder for your own tool implementation.
		result := executeSearch(event.ToolCall.Arguments)
		if err := session.SendToolResult(ctx, schema.ToolResult{
			CallID:  event.ToolCall.ID,
			Content: []schema.ContentPart{schema.TextPart{Text: result}},
		}); err != nil {
			log.Printf("send tool result: %v", err)
		}
	}
}
```

FrameProcessor Integration
```go
processor := s2s.AsFrameProcessor(engine,
	s2s.WithModel("gemini-2.0-flash-exp"),
	s2s.WithInstructions("You are a helpful voice assistant."),
)
pipeline := voice.Chain(vadProcessor, processor)
```

Advanced Features
Voice Configuration
The voice is set in the setup message under `generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName`:
```go
session, err := engine.Start(ctx,
	s2s.WithVoice("Charon"),
	s2s.WithModel("gemini-2.0-flash-exp"),
)
```

Audio Format
Audio is exchanged in PCM format at 16kHz. The provider handles base64 encoding (sending) and decoding (receiving) transparently. The MIME type is `audio/pcm;rate=16000`.
Server-Side VAD
Gemini Live uses server-side voice activity detection. The server detects when the user starts and stops speaking, so user interruptions are handled automatically when new speech input arrives and the `Interrupt` method is a no-op.
Per-Session Options
```go
session, err := engine.Start(ctx,
	s2s.WithVoice("Kore"),
	s2s.WithModel("gemini-2.0-flash-exp"),
	s2s.WithInstructions("You are a concise assistant. Keep responses under 30 seconds."),
)
```

Custom Endpoint
```go
engine, err := s2s.New("gemini_live", s2s.Config{
	Model: "gemini-2.0-flash-exp",
	Extra: map[string]any{
		"api_key":  os.Getenv("GOOGLE_API_KEY"),
		"base_url": "wss://custom-gemini-endpoint.example.com/ws",
	},
})
```
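Since the session opens its WebSocket connection with the API key as a query parameter, a custom `base_url` ends up combined with the key at connect time. The sketch below shows how such a URL could be assembled and checked with `net/url`. The query parameter name `key` and the `sessionURL` helper are assumptions for illustration; the provider's actual URL construction is not shown on this page.

```go
package main

import (
	"fmt"
	"net/url"
)

// sessionURL appends the API key to baseURL as a query parameter.
// The parameter name "key" is an assumption, not confirmed by this page.
func sessionURL(baseURL, apiKey string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	if u.Scheme != "ws" && u.Scheme != "wss" {
		return "", fmt.Errorf("base_url must use ws or wss, got %q", u.Scheme)
	}
	q := u.Query() // preserves any query parameters already present
	q.Set("key", apiKey)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	full, err := sessionURL("wss://custom-gemini-endpoint.example.com/ws", "test-key")
	fmt.Println(full, err)

	_, err = sessionURL("https://custom-gemini-endpoint.example.com/ws", "test-key")
	fmt.Println(err)
}
```

Because the key travels in the URL, it is worth keeping the full connection URL out of logs and error messages.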