OpenAI Realtime S2S Voice Provider
OpenAI Realtime provides native speech-to-speech via the OpenAI Realtime API over WebSocket, bypassing the traditional STT/LLM/TTS cascade for lower end-to-end latency. The provider supports bidirectional audio streaming, server-side voice activity detection (VAD), tool calling, and text transcripts of both user and agent speech.
Choose OpenAI Realtime when you need native speech-to-speech with GPT-4o’s reasoning capabilities. By skipping the STT/LLM/TTS cascade, this provider delivers lower end-to-end latency and more natural conversations, with built-in tool calling support. For Google’s equivalent S2S approach, consider Gemini Live. For AWS-native deployments, consider Amazon Nova S2S.
Installation
Section titled “Installation”import _ "github.com/lookatitude/beluga-ai/voice/s2s/providers/openai"The blank import registers the "openai_realtime" provider with the S2S registry.
Configuration
Section titled “Configuration”| Field | Type | Default | Description |
|---|---|---|---|
Voice | string | "alloy" | Voice (alloy, echo, shimmer, etc.) |
Model | string | "gpt-4o-realtime-preview" | Realtime model identifier |
Instructions | string | — | System prompt for the session |
Tools | []schema.ToolDefinition | — | Tools available to the model |
Extra | — | — | See below |
Extra Fields
Section titled “Extra Fields”| Key | Type | Required | Description |
|---|---|---|---|
api_key | string | Yes | OpenAI API key (sk-...) |
base_url | string | No | Override WebSocket URL |
Basic Usage
Section titled “Basic Usage”package main
import ( "context" "fmt" "log" "os"
"github.com/lookatitude/beluga-ai/voice/s2s" _ "github.com/lookatitude/beluga-ai/voice/s2s/providers/openai")
func main() { ctx := context.Background()
engine, err := s2s.New("openai_realtime", s2s.Config{ Voice: "alloy", Model: "gpt-4o-realtime-preview", Extra: map[string]any{"api_key": os.Getenv("OPENAI_API_KEY")}, }) if err != nil { log.Fatal(err) }
session, err := engine.Start(ctx) if err != nil { log.Fatal(err) } defer session.Close()
// Send audio to the session if err := session.SendAudio(ctx, audioChunk); err != nil { log.Fatal(err) }
// Receive events for event := range session.Recv() { switch event.Type { case s2s.EventAudioOutput: playAudio(event.Audio) case s2s.EventTextOutput: fmt.Printf("Agent: %s", event.Text) case s2s.EventTranscript: fmt.Printf("User said: %s\n", event.Text) case s2s.EventToolCall: fmt.Printf("Tool call: %s(%s)\n", event.ToolCall.Name, event.ToolCall.Arguments) case s2s.EventTurnEnd: fmt.Println("--- turn complete ---") case s2s.EventError: log.Printf("error: %v", event.Error) } }}Direct Construction
Section titled “Direct Construction”import openai_realtime "github.com/lookatitude/beluga-ai/voice/s2s/providers/openai"
engine, err := openai_realtime.New(s2s.Config{ Voice: "alloy", Model: "gpt-4o-realtime-preview", Instructions: "You are a helpful voice assistant.", Extra: map[string]any{"api_key": os.Getenv("OPENAI_API_KEY")},})Session Lifecycle
Section titled “Session Lifecycle”- Start: Opens a WebSocket connection and sends a
session.updatemessage with voice, modalities, audio format, and VAD configuration. - Audio I/O: Audio is sent via
SendAudio(base64-encoded PCM16) and received asEventAudioOutputevents (decoded from base64). - Text: Send text messages via
SendText, which creates a conversation item and triggers a response. - Tool Calls: The model emits
EventToolCallevents. Send results back viaSendToolResult. - Interrupt: Call
Interruptto cancel the current model response (sendsresponse.cancel). - Close: Terminates the WebSocket connection.
Tool Calling
Section titled “Tool Calling”The OpenAI Realtime API supports tool calling within S2S sessions:
session, err := engine.Start(ctx, s2s.WithInstructions("You are a helpful assistant with access to tools."), s2s.WithTools([]schema.ToolDefinition{ { Name: "get_weather", Description: "Get current weather for a location", InputSchema: map[string]any{ "type": "object", "properties": map[string]any{ "location": map[string]any{"type": "string"}, }, "required": []string{"location"}, }, }, }),)
for event := range session.Recv() { if event.Type == s2s.EventToolCall { result := executeWeatherTool(event.ToolCall.Arguments) if err := session.SendToolResult(ctx, schema.ToolResult{ CallID: event.ToolCall.ID, Content: []schema.ContentPart{schema.TextPart{Text: result}}, }); err != nil { log.Printf("send tool result: %v", err) } }}FrameProcessor Integration
Section titled “FrameProcessor Integration”Wrap the S2S engine as a FrameProcessor for use in a voice pipeline. This is the recommended pattern for the hybrid pipeline mode:
processor := s2s.AsFrameProcessor(engine, s2s.WithVoice("alloy"), s2s.WithInstructions("You are a helpful voice assistant."),)pipeline := voice.Chain(vadProcessor, processor)The AsFrameProcessor wrapper handles:
- Forwarding audio frames to
SendAudio - Forwarding text frames to
SendText - Converting control frames (interrupt signal) to
Interruptcalls - Emitting audio output, text output, and end-of-utterance control frames
Advanced Features
Section titled “Advanced Features”Server-Side VAD
Section titled “Server-Side VAD”The provider configures server-side voice activity detection by default ("turn_detection": {"type": "server_vad"}). The server detects when the user stops speaking and automatically triggers a response.
Audio Format
Section titled “Audio Format”Audio is exchanged in PCM16 format (16-bit linear PCM). The provider handles base64 encoding/decoding transparently.
Per-Session Options
Section titled “Per-Session Options”session, err := engine.Start(ctx, s2s.WithVoice("echo"), s2s.WithModel("gpt-4o-realtime-preview"), s2s.WithInstructions("Respond concisely."),)Custom Endpoint
Section titled “Custom Endpoint”engine, err := s2s.New("openai_realtime", s2s.Config{ Voice: "alloy", Extra: map[string]any{ "api_key": os.Getenv("OPENAI_API_KEY"), "base_url": "wss://custom-realtime.openai.azure.com/v1/realtime", },})