Voice Providers — STT, TTS & S2S
Beluga AI provides a unified voice pipeline with three provider categories: Speech-to-Text (STT), Text-to-Speech (TTS), and Speech-to-Speech (S2S). Every provider registers itself via init(), so a blank import is sufficient to make it available through the registry.
Architecture
The voice pipeline uses a frame-based processing model. Atomic Frame values (audio chunks, text fragments, control signals) flow through linked FrameProcessor goroutines via Go channels. Every voice provider can be used standalone or wrapped as a FrameProcessor for integration into a pipeline.
```
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│Transport│───>│   VAD   │───>│   STT   │───>│   LLM   │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
                                                  │
┌─────────┐    ┌─────────┐                        │
│Transport│<───│   TTS   │<───────────────────────┘
└─────────┘    └─────────┘
```

For S2S providers, the STT/LLM/TTS cascade is replaced by a single provider:

```
┌─────────┐    ┌─────────┐    ┌─────────────────┐    ┌─────────┐
│Transport│───>│   VAD   │───>│  S2S Provider   │───>│Transport│
└─────────┘    └─────────┘    └─────────────────┘    └─────────┘
```

STT Interface
All STT providers implement:
```go
type STT interface {
	Transcribe(ctx context.Context, audio []byte, opts ...Option) (string, error)
	TranscribeStream(ctx context.Context, audioStream iter.Seq2[[]byte, error], opts ...Option) iter.Seq2[TranscriptEvent, error]
}
```

Instantiate via the registry or direct construction:
```go
import (
	"os"

	"github.com/lookatitude/beluga-ai/voice/stt"
	_ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"
)

engine, err := stt.New("deepgram", stt.Config{
	Language: "en",
	Model:    "nova-2",
	Extra:    map[string]any{"api_key": os.Getenv("DEEPGRAM_API_KEY")},
})
```
TTS Interface

All TTS providers implement:
```go
type TTS interface {
	Synthesize(ctx context.Context, text string, opts ...Option) ([]byte, error)
	SynthesizeStream(ctx context.Context, textStream iter.Seq2[string, error], opts ...Option) iter.Seq2[[]byte, error]
}
```

Instantiate via the registry or direct construction:
```go
import (
	"os"

	"github.com/lookatitude/beluga-ai/voice/tts"
	_ "github.com/lookatitude/beluga-ai/voice/tts/providers/elevenlabs"
)

engine, err := tts.New("elevenlabs", tts.Config{
	Voice: "rachel",
	Extra: map[string]any{"api_key": os.Getenv("ELEVENLABS_API_KEY")},
})
```
S2S Interface

S2S providers handle bidirectional audio streaming natively:
```go
type S2S interface {
	Start(ctx context.Context, opts ...Option) (Session, error)
}

type Session interface {
	SendAudio(ctx context.Context, audio []byte) error
	SendText(ctx context.Context, text string) error
	SendToolResult(ctx context.Context, result schema.ToolResult) error
	Recv() <-chan SessionEvent
	Interrupt(ctx context.Context) error
	Close() error
}
```

Instantiate via the registry:
```go
import (
	"os"

	"github.com/lookatitude/beluga-ai/voice/s2s"
	_ "github.com/lookatitude/beluga-ai/voice/s2s/providers/openai"
)

engine, err := s2s.New("openai_realtime", s2s.Config{
	Voice: "alloy",
	Model: "gpt-4o-realtime-preview",
	Extra: map[string]any{"api_key": os.Getenv("OPENAI_API_KEY")},
})
session, err := engine.Start(ctx)
defer session.Close()
```
Configuration

STT Config
| Field | Type | Description |
|---|---|---|
| Language | string | BCP-47 language code (e.g., "en-US", "es") |
| Model | string | Provider-specific model name |
| Punctuation | bool | Enable automatic punctuation insertion |
| Diarization | bool | Enable speaker diarization |
| SampleRate | int | Audio sample rate in Hz |
| Encoding | string | Audio encoding format ("linear16", "opus") |
| Extra | map[string]any | Provider-specific configuration (e.g., api_key) |
TTS Config
| Field | Type | Description |
|---|---|---|
| Voice | string | Voice identifier (provider-specific) |
| Model | string | Provider-specific model name |
| SampleRate | int | Output sample rate in Hz |
| Format | AudioFormat | Output format: pcm, opus, mp3, wav |
| Speed | float64 | Speech rate multiplier (1.0 = normal) |
| Pitch | float64 | Voice pitch adjustment (-20.0 to 20.0) |
| Extra | map[string]any | Provider-specific configuration (e.g., api_key) |
S2S Config
| Field | Type | Description |
|---|---|---|
| Voice | string | Voice identifier (provider-specific) |
| Model | string | Provider-specific model name |
| Instructions | string | System prompt for the session |
| Tools | []schema.ToolDefinition | Tools available to the S2S session |
| SampleRate | int | Audio sample rate in Hz |
| Extra | map[string]any | Provider-specific configuration |
FrameProcessor Integration
Every voice provider can be wrapped as a FrameProcessor for pipeline use:
```go
// STT as FrameProcessor
sttProcessor := stt.AsFrameProcessor(sttEngine, stt.WithLanguage("en"))

// TTS as FrameProcessor
ttsProcessor := tts.AsFrameProcessor(ttsEngine, 24000, tts.WithVoice("rachel"))

// S2S as FrameProcessor
s2sProcessor := s2s.AsFrameProcessor(s2sEngine, s2s.WithVoice("alloy"))

// Chain processors into a pipeline
pipeline := voice.Chain(sttProcessor, llmProcessor, ttsProcessor)
```
STT Providers

| Provider | Registry Name | Streaming | Description |
|---|---|---|---|
| Deepgram | deepgram | Native WebSocket | Real-time STT with Nova-2 models |
| AssemblyAI | assemblyai | Native WebSocket | Real-time and async transcription |
| OpenAI Whisper | whisper | Chunked batch | Whisper models via OpenAI API |
| Gladia | gladia | Native WebSocket | Real-time STT with language detection |
| ElevenLabs STT | elevenlabs | Chunked batch | Scribe transcription engine |
| Groq Whisper | groq | Buffered batch | Ultra-fast Whisper inference on LPU |
TTS Providers
| Provider | Registry Name | Description |
|---|---|---|
| ElevenLabs | elevenlabs | Premium voice cloning and synthesis |
| Cartesia | cartesia | Low-latency Sonic voice engine |
| PlayHT | playht | AI voice generation platform |
| LMNT | lmnt | Ultra-low-latency voice synthesis |
| Fish Audio | fish | Open-source voice synthesis |
| Smallest AI | smallest | Lightning-fast TTS engine |
| Groq TTS | groq | Fast TTS via Groq API |
S2S Providers
| Provider | Registry Name | Description |
|---|---|---|
| OpenAI Realtime | openai_realtime | Bidirectional audio via WebSocket |
| Gemini Live | gemini_live | Google’s live multimodal API |
| Amazon Nova S2S | nova | Nova Sonic via AWS Bedrock |
Each provider category supports lifecycle hooks:
```go
// STT hooks
sttHooks := stt.Hooks{
	OnTranscript: func(ctx context.Context, event stt.TranscriptEvent) {
		log.Printf("transcript: %s (final=%v)", event.Text, event.IsFinal)
	},
	OnUtterance: func(ctx context.Context, text string) {
		log.Printf("utterance complete: %s", text)
	},
}
```
```go
// TTS hooks
ttsHooks := tts.Hooks{
	BeforeSynthesize: func(ctx context.Context, text string) {
		log.Printf("synthesizing: %s", text)
	},
	OnAudioChunk: func(ctx context.Context, chunk []byte) {
		log.Printf("audio chunk: %d bytes", len(chunk))
	},
}
```
```go
// S2S hooks
s2sHooks := s2s.Hooks{
	OnTurn: func(ctx context.Context, userText, agentText string) {
		log.Printf("turn: user=%q agent=%q", userText, agentText)
	},
	OnInterrupt: func(ctx context.Context) {
		log.Println("user interrupted")
	},
	OnToolCall: func(ctx context.Context, call schema.ToolCall) {
		log.Printf("tool call: %s", call.Name)
	},
}
```
```go
// Compose multiple hooks
combined := stt.ComposeHooks(loggingHooks, metricsHooks)
```