Voice Providers — STT, TTS & S2S

Beluga AI provides a unified voice pipeline with three provider categories: Speech-to-Text (STT), Text-to-Speech (TTS), and Speech-to-Speech (S2S). Every provider registers itself via init(), so a blank import is sufficient to make it available through the registry.
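
The self-registration pattern described above can be sketched with nothing but the standard library. This is a minimal illustration of the general technique, not Beluga's actual registry code: `Provider`, `Factory`, `Register`, and the `demo` provider are hypothetical names, and the `init()` that would normally live in a separate provider package is placed in the same file so the example runs on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// Config mirrors a few of the fields this page documents for stt.Config.
type Config struct {
	Language string
	Model    string
	Extra    map[string]any
}

// Provider is a hypothetical stand-in for a provider interface such as stt.STT.
type Provider interface {
	Name() string
}

// Factory builds a Provider from a Config.
type Factory func(cfg Config) (Provider, error)

var (
	mu        sync.RWMutex
	factories = make(map[string]Factory)
)

// Register stores a factory under a name. A provider package would call it from init().
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[name] = f
}

// New looks up the named factory and uses it to build a provider.
func New(name string, cfg Config) (Provider, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown provider %q (is its package imported?)", name)
	}
	return f(cfg)
}

// demoProvider is a fake provider used only for this sketch.
type demoProvider struct{ model string }

func (p demoProvider) Name() string { return "demo:" + p.model }

// In a real provider package, this init() would run as a side effect of the blank import.
func init() {
	Register("demo", func(cfg Config) (Provider, error) {
		return demoProvider{model: cfg.Model}, nil
	})
}

func main() {
	engine, err := New("demo", Config{Model: "small"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(engine.Name())

	// An unregistered name fails at lookup time rather than at compile time.
	_, err = New("missing", Config{})
	fmt.Println(err)
}
```

The practical consequence of this design is that forgetting the blank import surfaces as a runtime "unknown provider" error, not a build failure.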

The voice pipeline uses a frame-based processing model. Atomic Frame values (audio chunks, text fragments, control signals) flow through linked FrameProcessor goroutines via Go channels. Every voice provider can be used standalone or wrapped as a FrameProcessor for integration into a pipeline.

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│Transport│───>│   VAD   │───>│   STT   │───>│   LLM   │
└─────────┘    └─────────┘    └─────────┘    └────┬────┘
                                                  │
┌─────────┐    ┌─────────┐                        │
│Transport│<───│   TTS   │<───────────────────────┘
└─────────┘    └─────────┘
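
The frame-based model described above can be sketched as goroutines connected by channels. This is an illustration of the idea only: `Frame`, `FrameProcessor`, `runChain`, and the three fake stages are hypothetical and do not reproduce Beluga's real types or the signature of `voice.Chain`.

```go
package main

import (
	"fmt"
	"strings"
)

// FrameType distinguishes the kinds of frames flowing through the pipeline.
type FrameType int

const (
	AudioFrame FrameType = iota
	TextFrame
)

// Frame is one atomic unit of data (hypothetical shape).
type Frame struct {
	Type FrameType
	Data []byte
}

// FrameProcessor reads frames from in and writes results to out.
// It must return once in is closed and drained.
type FrameProcessor func(in <-chan Frame, out chan<- Frame)

// runChain starts each processor in its own goroutine and links them with
// unbuffered channels. Each stage's output channel is closed when it returns.
func runChain(in <-chan Frame, procs ...FrameProcessor) <-chan Frame {
	cur := in
	for _, p := range procs {
		out := make(chan Frame)
		go func(p FrameProcessor, in <-chan Frame, out chan<- Frame) {
			defer close(out)
			p(in, out)
		}(p, cur, out)
		cur = out
	}
	return cur
}

// mapFrames builds a processor that transforms frames of one type and
// passes every other frame through unchanged.
func mapFrames(match FrameType, fn func(Frame) Frame) FrameProcessor {
	return func(in <-chan Frame, out chan<- Frame) {
		for f := range in {
			if f.Type == match {
				f = fn(f)
			}
			out <- f
		}
	}
}

var (
	// fakeSTT pretends the audio bytes are already the transcript.
	fakeSTT = mapFrames(AudioFrame, func(f Frame) Frame {
		return Frame{Type: TextFrame, Data: f.Data}
	})
	// fakeLLM "responds" by upper-casing the text.
	fakeLLM = mapFrames(TextFrame, func(f Frame) Frame {
		return Frame{Type: TextFrame, Data: []byte(strings.ToUpper(string(f.Data)))}
	})
	// fakeTTS relabels text as audio.
	fakeTTS = mapFrames(TextFrame, func(f Frame) Frame {
		return Frame{Type: AudioFrame, Data: f.Data}
	})
)

// runPipeline feeds each input as an audio frame and collects the output payloads.
func runPipeline(inputs ...string) []string {
	in := make(chan Frame)
	go func() {
		defer close(in)
		for _, s := range inputs {
			in <- Frame{Type: AudioFrame, Data: []byte(s)}
		}
	}()
	var got []string
	for f := range runChain(in, fakeSTT, fakeLLM, fakeTTS) {
		got = append(got, string(f.Data))
	}
	return got
}

func main() {
	fmt.Println(runPipeline("hello", "world"))
}
```

Because every stage processes frames sequentially and closes its output when its input ends, frame order is preserved and shutdown propagates from the source to the sink.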

For S2S providers, the STT/LLM/TTS cascade is replaced by a single provider:

┌─────────┐    ┌─────────┐    ┌─────────────────┐    ┌─────────┐
│Transport│───>│   VAD   │───>│  S2S Provider   │───>│Transport│
└─────────┘    └─────────┘    └─────────────────┘    └─────────┘

All STT providers implement:

type STT interface {
	Transcribe(ctx context.Context, audio []byte, opts ...Option) (string, error)
	TranscribeStream(ctx context.Context, audioStream iter.Seq2[[]byte, error], opts ...Option) iter.Seq2[TranscriptEvent, error]
}

Instantiate via the registry or direct construction:

import (
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/voice/stt"
	_ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram" // registers "deepgram" via init()
)

engine, err := stt.New("deepgram", stt.Config{
	Language: "en",
	Model:    "nova-2",
	Extra:    map[string]any{"api_key": os.Getenv("DEEPGRAM_API_KEY")},
})
if err != nil {
	log.Fatalf("create STT engine: %v", err)
}

All TTS providers implement:

type TTS interface {
	Synthesize(ctx context.Context, text string, opts ...Option) ([]byte, error)
	SynthesizeStream(ctx context.Context, textStream iter.Seq2[string, error], opts ...Option) iter.Seq2[[]byte, error]
}

Instantiate via the registry or direct construction:

import (
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/voice/tts"
	_ "github.com/lookatitude/beluga-ai/voice/tts/providers/elevenlabs" // registers "elevenlabs" via init()
)

engine, err := tts.New("elevenlabs", tts.Config{
	Voice: "rachel",
	Extra: map[string]any{"api_key": os.Getenv("ELEVENLABS_API_KEY")},
})
if err != nil {
	log.Fatalf("create TTS engine: %v", err)
}

S2S providers handle bidirectional audio streaming natively:

type S2S interface {
	Start(ctx context.Context, opts ...Option) (Session, error)
}

type Session interface {
	SendAudio(ctx context.Context, audio []byte) error
	SendText(ctx context.Context, text string) error
	SendToolResult(ctx context.Context, result schema.ToolResult) error
	Recv() <-chan SessionEvent
	Interrupt(ctx context.Context) error
	Close() error
}
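
Because `Recv()` returns a channel, a typical consumer is a `select` loop that also watches the context. The sketch below uses a fake session to show that loop. The fields of `SessionEvent` are not documented on this page, so the `Kind`, `Audio`, and `Text` fields here are hypothetical, as is the assumption that the channel is closed when the session ends.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// EventKind and SessionEvent are hypothetical; the real SessionEvent may differ.
type EventKind int

const (
	EventAudio EventKind = iota
	EventText
)

type SessionEvent struct {
	Kind  EventKind
	Audio []byte
	Text  string
}

// receiver is the only part of the Session interface this sketch needs.
type receiver interface {
	Recv() <-chan SessionEvent
}

// fakeSession replays a fixed list of events and then closes its channel.
type fakeSession struct{ events chan SessionEvent }

func newFakeSession(events ...SessionEvent) *fakeSession {
	ch := make(chan SessionEvent, len(events))
	for _, ev := range events {
		ch <- ev
	}
	close(ch)
	return &fakeSession{events: ch}
}

func (s *fakeSession) Recv() <-chan SessionEvent { return s.events }

// drain reads events until the channel closes or ctx is cancelled. It
// returns the number of audio bytes received and the concatenated text.
func drain(ctx context.Context, s receiver) (audioBytes int, text string, err error) {
	var sb strings.Builder
	events := s.Recv()
	for {
		select {
		case <-ctx.Done():
			return audioBytes, sb.String(), ctx.Err()
		case ev, ok := <-events:
			if !ok {
				return audioBytes, sb.String(), nil
			}
			switch ev.Kind {
			case EventAudio:
				audioBytes += len(ev.Audio) // a real client would play or forward this
			case EventText:
				sb.WriteString(ev.Text)
			}
		}
	}
}

// demo drains a session that interleaves audio and text events.
func demo() (int, string, error) {
	s := newFakeSession(
		SessionEvent{Kind: EventAudio, Audio: make([]byte, 320)},
		SessionEvent{Kind: EventText, Text: "Hi, "},
		SessionEvent{Kind: EventAudio, Audio: make([]byte, 160)},
		SessionEvent{Kind: EventText, Text: "how can I help?"},
	)
	return drain(context.Background(), s)
}

func main() {
	n, text, err := demo()
	fmt.Println(n, text, err)
}
```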

Instantiate via the registry:

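import (
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/voice/s2s"
	_ "github.com/lookatitude/beluga-ai/voice/s2s/providers/openai" // registers "openai_realtime" via init()
)

engine, err := s2s.New("openai_realtime", s2s.Config{
	Voice: "alloy",
	Model: "gpt-4o-realtime-preview",
	Extra: map[string]any{"api_key": os.Getenv("OPENAI_API_KEY")},
})
if err != nil {
	log.Fatalf("create S2S engine: %v", err)
}
session, err := engine.Start(ctx) // ctx is the caller's context.Context
if err != nil {
	log.Fatalf("start S2S session: %v", err)
}
defer session.Close() // deferred only after Start has succeeded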

stt.Config fields:

| Field | Type | Description |
|-------|------|-------------|
| Language | string | BCP-47 language code (e.g., "en-US", "es") |
| Model | string | Provider-specific model name |
| Punctuation | bool | Enable automatic punctuation insertion |
| Diarization | bool | Enable speaker diarization |
| SampleRate | int | Audio sample rate in Hz |
| Encoding | string | Audio encoding format ("linear16", "opus") |
| Extra | map[string]any | Provider-specific configuration (e.g., api_key) |

tts.Config fields:

| Field | Type | Description |
|-------|------|-------------|
| Voice | string | Voice identifier (provider-specific) |
| Model | string | Provider-specific model name |
| SampleRate | int | Output sample rate in Hz |
| Format | AudioFormat | Output format: pcm, opus, mp3, wav |
| Speed | float64 | Speech rate multiplier (1.0 = normal) |
| Pitch | float64 | Voice pitch adjustment (-20.0 to 20.0) |
| Extra | map[string]any | Provider-specific configuration (e.g., api_key) |

s2s.Config fields:

| Field | Type | Description |
|-------|------|-------------|
| Voice | string | Voice identifier (provider-specific) |
| Model | string | Provider-specific model name |
| Instructions | string | System prompt for the session |
| Tools | []schema.ToolDefinition | Tools available to the S2S session |
| SampleRate | int | Audio sample rate in Hz |
| Extra | map[string]any | Provider-specific configuration |

Every voice provider can be wrapped as a FrameProcessor for pipeline use:

// STT as FrameProcessor
sttProcessor := stt.AsFrameProcessor(sttEngine, stt.WithLanguage("en"))

// TTS as FrameProcessor
ttsProcessor := tts.AsFrameProcessor(ttsEngine, 24000, tts.WithVoice("rachel"))

// S2S as FrameProcessor
s2sProcessor := s2s.AsFrameProcessor(s2sEngine, s2s.WithVoice("alloy"))

// Chain processors into a pipeline.
// llmProcessor is assumed to be a FrameProcessor created elsewhere.
pipeline := voice.Chain(sttProcessor, llmProcessor, ttsProcessor)

STT providers:

| Provider | Registry Name | Streaming | Description |
|----------|---------------|-----------|-------------|
| Deepgram | deepgram | Native WebSocket | Real-time STT with Nova-2 models |
| AssemblyAI | assemblyai | Native WebSocket | Real-time and async transcription |
| OpenAI Whisper | whisper | Chunked batch | Whisper models via OpenAI API |
| Gladia | gladia | Native WebSocket | Real-time STT with language detection |
| ElevenLabs STT | elevenlabs | Chunked batch | Scribe transcription engine |
| Groq Whisper | groq | Buffered batch | Ultra-fast Whisper inference on LPU |

TTS providers:

| Provider | Registry Name | Description |
|----------|---------------|-------------|
| ElevenLabs | elevenlabs | Premium voice cloning and synthesis |
| Cartesia | cartesia | Low-latency Sonic voice engine |
| PlayHT | playht | AI voice generation platform |
| LMNT | lmnt | Ultra-low-latency voice synthesis |
| Fish Audio | fish | Open-source voice synthesis |
| Smallest AI | smallest | Lightning-fast TTS engine |
| Groq TTS | groq | Fast TTS via Groq API |

S2S providers:

| Provider | Registry Name | Description |
|----------|---------------|-------------|
| OpenAI Realtime | openai_realtime | Bidirectional audio via WebSocket |
| Gemini Live | gemini_live | Google’s live multimodal API |
| Amazon Nova S2S | nova | Nova Sonic via AWS Bedrock |

Each provider category supports lifecycle hooks:

// STT hooks
sttHooks := stt.Hooks{
	OnTranscript: func(ctx context.Context, event stt.TranscriptEvent) {
		log.Printf("transcript: %s (final=%v)", event.Text, event.IsFinal)
	},
	OnUtterance: func(ctx context.Context, text string) {
		log.Printf("utterance complete: %s", text)
	},
}

// TTS hooks
ttsHooks := tts.Hooks{
	BeforeSynthesize: func(ctx context.Context, text string) {
		log.Printf("synthesizing: %s", text)
	},
	OnAudioChunk: func(ctx context.Context, chunk []byte) {
		log.Printf("audio chunk: %d bytes", len(chunk))
	},
}

// S2S hooks
s2sHooks := s2s.Hooks{
	OnTurn: func(ctx context.Context, userText, agentText string) {
		log.Printf("turn: user=%q agent=%q", userText, agentText)
	},
	OnInterrupt: func(ctx context.Context) {
		log.Println("user interrupted")
	},
	OnToolCall: func(ctx context.Context, call schema.ToolCall) {
		log.Printf("tool call: %s", call.Name)
	},
}

// Compose multiple hooks.
// loggingHooks and metricsHooks are assumed to be stt.Hooks values defined elsewhere.
combined := stt.ComposeHooks(loggingHooks, metricsHooks)
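
For readers wondering what composing hooks amounts to, here is a minimal sketch of one plausible implementation: the composed value calls each non-nil callback in the order the hook sets were passed. It is an assumption about the behavior, not the library's source; `composeHooks` and the local `Hooks` and `TranscriptEvent` types are simplified stand-ins that reuse only the field names shown above.

```go
package main

import (
	"context"
	"fmt"
)

// TranscriptEvent and Hooks are simplified stand-ins for the stt types.
type TranscriptEvent struct {
	Text    string
	IsFinal bool
}

type Hooks struct {
	OnTranscript func(ctx context.Context, event TranscriptEvent)
	OnUtterance  func(ctx context.Context, text string)
}

// composeHooks returns a Hooks value whose callbacks fan out to every
// non-nil callback in hooks, in argument order (hypothetical behavior).
func composeHooks(hooks ...Hooks) Hooks {
	return Hooks{
		OnTranscript: func(ctx context.Context, event TranscriptEvent) {
			for _, h := range hooks {
				if h.OnTranscript != nil {
					h.OnTranscript(ctx, event)
				}
			}
		},
		OnUtterance: func(ctx context.Context, text string) {
			for _, h := range hooks {
				if h.OnUtterance != nil {
					h.OnUtterance(ctx, text)
				}
			}
		},
	}
}

// demo combines a logging hook and a metrics hook and fires two events.
func demo() (logged []string, finals int) {
	loggingHooks := Hooks{
		OnTranscript: func(_ context.Context, ev TranscriptEvent) {
			logged = append(logged, ev.Text)
		},
	}
	metricsHooks := Hooks{
		OnTranscript: func(_ context.Context, ev TranscriptEvent) {
			if ev.IsFinal {
				finals++
			}
		},
	}

	combined := composeHooks(loggingHooks, metricsHooks)
	ctx := context.Background()
	combined.OnTranscript(ctx, TranscriptEvent{Text: "hel"})
	combined.OnTranscript(ctx, TranscriptEvent{Text: "hello", IsFinal: true})
	combined.OnUtterance(ctx, "hello") // no hook set defines OnUtterance, so this is a no-op
	return logged, finals
}

func main() {
	logged, finals := demo()
	fmt.Println(logged, finals)
}
```

The nil checks let each hook set define only the callbacks it cares about, which is what makes small single-purpose hook sets (logging, metrics) easy to combine.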