
Gemini Live S2S Voice Provider

Gemini Live provides native speech-to-speech via the Google Gemini Live API over WebSocket, enabling bidirectional audio streaming with Gemini’s multimodal models. The provider supports audio and text output, function calling, and server-side voice activity detection.

Choose Gemini Live when you want native speech-to-speech with Google’s multimodal Gemini models. Because voice activity detection runs server-side, the client does not need its own VAD or interruption logic. For OpenAI’s equivalent, consider OpenAI Realtime; for AWS-native S2S, consider Amazon Nova S2S.

import _ "github.com/lookatitude/beluga-ai/voice/s2s/providers/gemini"

The blank import registers the "gemini_live" provider with the S2S registry.
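
To illustrate why a blank import is enough, here is a minimal sketch of the general registry pattern in Go. It is not the library's actual code: the Engine interface, Factory type, and Register and New functions below are hypothetical stand-ins, and the only detail taken from this page is the provider name "gemini_live".

```go
package main

import (
	"fmt"
	"sync"
)

// Engine is a hypothetical stand-in for an S2S engine interface.
type Engine interface {
	Name() string
}

// Factory builds an Engine from an opaque config map (hypothetical shape).
type Factory func(cfg map[string]any) (Engine, error)

var (
	mu        sync.RWMutex
	factories = map[string]Factory{}
)

// Register stores a factory under a provider name. Provider packages
// usually call this from init, which runs as soon as the package is imported.
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[name] = f
}

// New looks up a registered factory by name and invokes it.
func New(name string, cfg map[string]any) (Engine, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("s2s: unknown provider %q", name)
	}
	return f(cfg)
}

type geminiEngine struct{}

func (geminiEngine) Name() string { return "gemini_live" }

// In a real provider package this init would sit next to the provider code,
// so importing the package for its side effects triggers registration.
func init() {
	Register("gemini_live", func(cfg map[string]any) (Engine, error) {
		return geminiEngine{}, nil
	})
}

func main() {
	e, err := New("gemini_live", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Name())

	_, err = New("missing", nil)
	fmt.Println(err)
}
```

Forgetting the blank import in this pattern surfaces as an "unknown provider" error at construction time rather than as a compile error.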

| Field | Type | Default | Description |
|---|---|---|---|
| Voice | string | | Prebuilt voice name |
| Model | string | "gemini-2.0-flash-exp" | Gemini model identifier |
| Instructions | string | | System instruction for the session |
| Tools | []schema.ToolDefinition | | Tools available to the model |
| Extra | | | See below |

| Key | Type | Required | Description |
|---|---|---|---|
| api_key | string | Yes | Google AI API key |
| base_url | string | No | Override WebSocket URL |

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/voice/s2s"
	_ "github.com/lookatitude/beluga-ai/voice/s2s/providers/gemini"
)

// audioChunk and playAudio are supplied by the application: a buffer of
// PCM audio to send and a helper that plays received audio.

func main() {
	ctx := context.Background()

	// Create the engine through the registry using the provider name.
	engine, err := s2s.New("gemini_live", s2s.Config{
		Model: "gemini-2.0-flash-exp",
		Extra: map[string]any{"api_key": os.Getenv("GOOGLE_API_KEY")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Open the session and make sure it is closed on exit.
	session, err := engine.Start(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Send audio to the session
	if err := session.SendAudio(ctx, audioChunk); err != nil {
		log.Fatal(err)
	}

	// Receive events
	for event := range session.Recv() {
		switch event.Type {
		case s2s.EventAudioOutput:
			playAudio(event.Audio)
		case s2s.EventTextOutput:
			fmt.Printf("Agent: %s", event.Text)
		case s2s.EventToolCall:
			fmt.Printf("Tool call: %s(%s)\n", event.ToolCall.Name, event.ToolCall.Arguments)
		case s2s.EventTurnEnd:
			fmt.Println("--- turn complete ---")
		case s2s.EventError:
			log.Printf("error: %v", event.Error)
		}
	}
}

import "github.com/lookatitude/beluga-ai/voice/s2s/providers/gemini"

// Construct the engine directly instead of going through the registry.
engine, err := gemini.New(s2s.Config{
	Model:        "gemini-2.0-flash-exp",
	Instructions: "You are a helpful voice assistant.",
	Extra:        map[string]any{"api_key": os.Getenv("GOOGLE_API_KEY")},
})

  1. Start: Opens a WebSocket connection with the API key as a query parameter and sends a setup message configuring the model, generation config (audio modalities, voice), system instructions, and tools.
  2. Audio I/O: SendAudio sends audio as base64-encoded PCM (16 kHz) in realtimeInput.mediaChunks. Output audio arrives as base64-encoded inlineData in serverContent.modelTurn.parts.
  3. Text: Send text via SendText using clientContent.turns with turnComplete: true.
  4. Tool Calls: The model emits function calls via toolCall.functionCalls. Send results back via SendToolResult using the toolResponse message format.
  5. Interrupt: Gemini Live handles interruptions via server-side VAD. The Interrupt method is a no-op since the server detects user speech automatically.
  6. Close: Terminates the WebSocket connection.
// Register a tool when starting the session.
session, err := engine.Start(ctx,
	s2s.WithInstructions("You are a helpful assistant."),
	s2s.WithTools([]schema.ToolDefinition{
		{
			Name:        "search",
			Description: "Search the web for information",
			InputSchema: map[string]any{
				"type": "object",
				"properties": map[string]any{
					"query": map[string]any{"type": "string"},
				},
				"required": []string{"query"},
			},
		},
	}),
)

// Answer tool calls as they arrive. executeSearch is supplied by the application.
for event := range session.Recv() {
	if event.Type == s2s.EventToolCall {
		result := executeSearch(event.ToolCall.Arguments)
		if err := session.SendToolResult(ctx, schema.ToolResult{
			CallID:  event.ToolCall.ID,
			Content: []schema.ContentPart{schema.TextPart{Text: result}},
		}); err != nil {
			log.Printf("send tool result: %v", err)
		}
	}
}

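
A tool handler such as executeSearch usually needs to decode the arguments before acting on them. The sketch below assumes the arguments arrive as a JSON-encoded string matching the tool's input schema (one required "query" string); that encoding is an assumption, and searchArgs and parseSearchArgs are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// searchArgs mirrors the "search" tool's input schema: one required string.
type searchArgs struct {
	Query string `json:"query"`
}

// parseSearchArgs decodes a JSON argument string and enforces the
// required "query" field, since the model's output is not guaranteed valid.
func parseSearchArgs(raw string) (searchArgs, error) {
	var args searchArgs
	if err := json.Unmarshal([]byte(raw), &args); err != nil {
		return searchArgs{}, fmt.Errorf("decode search arguments: %w", err)
	}
	if args.Query == "" {
		return searchArgs{}, errors.New("search: missing required field \"query\"")
	}
	return args, nil
}

func main() {
	args, err := parseSearchArgs(`{"query":"weather in Lisbon"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(args.Query)

	if _, err := parseSearchArgs(`{}`); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

When validation fails, returning the error text as the tool result (rather than dropping the call) gives the model a chance to retry with corrected arguments.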
processor := s2s.AsFrameProcessor(engine,
	s2s.WithModel("gemini-2.0-flash-exp"),
	s2s.WithInstructions("You are a helpful voice assistant."),
)
pipeline := voice.Chain(vadProcessor, processor)

The voice is set in the setup message under generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName:

session, err := engine.Start(ctx,
	s2s.WithVoice("Charon"),
	s2s.WithModel("gemini-2.0-flash-exp"),
)
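
For reference, the nesting of the voice setting inside the setup message can be sketched as follows. The path generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName is taken from the text above. The outer "setup" envelope, the "model" field, and the "responseModalities" field are assumptions about the wire format, and setupMessage is a hypothetical helper rather than the provider's own code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type prebuiltVoiceConfig struct {
	VoiceName string `json:"voiceName"`
}

type voiceConfig struct {
	PrebuiltVoiceConfig prebuiltVoiceConfig `json:"prebuiltVoiceConfig"`
}

type speechConfig struct {
	VoiceConfig voiceConfig `json:"voiceConfig"`
}

type generationConfig struct {
	// responseModalities is an assumed field name for the audio modality setting.
	ResponseModalities []string     `json:"responseModalities"`
	SpeechConfig       speechConfig `json:"speechConfig"`
}

type setup struct {
	Model            string           `json:"model"`
	GenerationConfig generationConfig `json:"generationConfig"`
}

// setupMessage builds a setup message that selects a prebuilt voice.
func setupMessage(model, voice string) ([]byte, error) {
	return json.Marshal(struct {
		Setup setup `json:"setup"`
	}{setup{
		Model: model,
		GenerationConfig: generationConfig{
			ResponseModalities: []string{"AUDIO"},
			SpeechConfig: speechConfig{
				VoiceConfig: voiceConfig{
					PrebuiltVoiceConfig: prebuiltVoiceConfig{VoiceName: voice},
				},
			},
		},
	}})
}

func main() {
	msg, err := setupMessage("gemini-2.0-flash-exp", "Charon")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(msg))
}
```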

Audio is exchanged as PCM at 16 kHz. The provider base64-encodes outgoing audio and decodes incoming audio transparently, so callers only handle raw bytes. The MIME type is audio/pcm;rate=16000.
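
The decoding half of that work might look like the sketch below. It assumes a server message carries audio under serverContent.modelTurn.parts[].inlineData with "mimeType" and "data" fields; the exact JSON shape is an assumption, and extractAudio is a hypothetical helper, not the provider's implementation.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// inlineData carries one base64-encoded audio payload (field names assumed).
type inlineData struct {
	MimeType string `json:"mimeType"`
	Data     string `json:"data"`
}

type serverPart struct {
	Text       string      `json:"text,omitempty"`
	InlineData *inlineData `json:"inlineData,omitempty"`
}

type serverMessage struct {
	ServerContent struct {
		ModelTurn struct {
			Parts []serverPart `json:"parts"`
		} `json:"modelTurn"`
	} `json:"serverContent"`
}

// extractAudio decodes every inlineData part of a server message and
// returns the concatenated raw audio bytes. Parts without audio are skipped.
func extractAudio(raw []byte) ([]byte, error) {
	var msg serverMessage
	if err := json.Unmarshal(raw, &msg); err != nil {
		return nil, fmt.Errorf("decode server message: %w", err)
	}
	var pcm []byte
	for _, p := range msg.ServerContent.ModelTurn.Parts {
		if p.InlineData == nil {
			continue
		}
		chunk, err := base64.StdEncoding.DecodeString(p.InlineData.Data)
		if err != nil {
			return nil, fmt.Errorf("decode inlineData: %w", err)
		}
		pcm = append(pcm, chunk...)
	}
	return pcm, nil
}

func main() {
	raw := []byte(`{"serverContent":{"modelTurn":{"parts":[
		{"text":"hello"},
		{"inlineData":{"mimeType":"audio/pcm","data":"AQI="}}
	]}}}`)
	pcm, err := extractAudio(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(pcm), "bytes of audio")
}
```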

Gemini Live uses server-side voice activity detection. The server detects when the user stops speaking and also when new speech begins during a response, so the client never has to signal an interruption and the Interrupt method is a no-op.

session, err := engine.Start(ctx,
	s2s.WithVoice("Kore"),
	s2s.WithModel("gemini-2.0-flash-exp"),
	s2s.WithInstructions("You are a concise assistant. Keep responses under 30 seconds."),
)

engine, err := s2s.New("gemini_live", s2s.Config{
	Model: "gemini-2.0-flash-exp",
	Extra: map[string]any{
		"api_key":  os.Getenv("GOOGLE_API_KEY"),
		"base_url": "wss://custom-gemini-endpoint.example.com/ws",
	},
})
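
Since the API key travels as a query parameter on the WebSocket URL, a custom base_url and the key have to be combined before dialing. The sketch below shows one way to do that with net/url. The parameter name "key" is an assumption (it is the usual convention for Google API keys), and sessionURL is a hypothetical helper, not the provider's own function.

```go
package main

import (
	"fmt"
	"net/url"
)

// sessionURL appends the API key to a WebSocket base URL as a query
// parameter, preserving any query parameters already present.
func sessionURL(baseURL, apiKey string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", fmt.Errorf("parse base_url: %w", err)
	}
	if u.Scheme != "ws" && u.Scheme != "wss" {
		return "", fmt.Errorf("base_url must use ws or wss, got %q", u.Scheme)
	}
	q := u.Query()
	q.Set("key", apiKey) // assumed parameter name
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, err := sessionURL("wss://custom-gemini-endpoint.example.com/ws", "abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Because the key ends up in the URL, it is worth keeping full connection URLs out of logs and error messages.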