Voice Pipeline
Frame-based STT→LLM→TTS processing with sub-800ms end-to-end latency. Speech-to-speech modes, Silero VAD, semantic turn detection, and WebRTC/LiveKit transports.
Overview
The Beluga AI Voice Pipeline provides a frame-based architecture for real-time voice interactions.
Rather than treating voice as a monolithic stream, the pipeline decomposes audio into discrete
Frames — atomic units of audio, text, images, or control signals — that flow through
a chain of FrameProcessor nodes. This design enables fine-grained composition: swap any
STT, TTS, or LLM provider without rewriting your pipeline.
The voice system supports two primary modes. The cascading pipeline (STT → LLM → TTS) transcribes speech, processes it through any LLM, and synthesizes a response. For lower latency, speech-to-speech (S2S) mode bypasses text entirely, using models like OpenAI Realtime or Gemini Live to process audio natively. Both modes integrate seamlessly with the agent runtime, memory, and tool systems.
End-to-end latency is managed through a strict latency budget: transport under 50ms, VAD under 1ms, STT under 200ms, LLM TTFT under 300ms, and TTS TTFB under 200ms — targeting sub-800ms total round-trip. Transports include WebSocket, WebRTC, LiveKit rooms, and Daily.co for production deployment.
Capabilities
Frame-Based Architecture
Every stage in the pipeline implements the FrameProcessor interface, processing atomic
Frames that carry audio chunks, transcription text, images, or control signals. This
composable design allows processors to be added, removed, or reordered without affecting the rest of
the chain.
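As a minimal self-contained sketch (not the actual Beluga interfaces — the real Frame and FrameProcessor types are richer), the composable chain idea looks like this:

```go
package main

import (
	"fmt"
	"strings"
)

// Frame is an atomic unit flowing through the pipeline. Real frames
// carry audio chunks, transcription text, images, or control signals;
// this sketch uses text only.
type Frame struct {
	Kind string
	Text string
}

// FrameProcessor is one stage in the chain. Because every stage shares
// this interface, processors can be added, removed, or reordered freely.
type FrameProcessor interface {
	Process(f Frame) Frame
}

// Two toy processors standing in for real STT/LLM/TTS stages.
type trimmer struct{}

func (trimmer) Process(f Frame) Frame { f.Text = strings.TrimSpace(f.Text); return f }

type upcaser struct{}

func (upcaser) Process(f Frame) Frame { f.Text = strings.ToUpper(f.Text); return f }

// run pushes a frame through an ordered chain of processors.
func run(chain []FrameProcessor, f Frame) Frame {
	for _, p := range chain {
		f = p.Process(f)
	}
	return f
}

func main() {
	out := run([]FrameProcessor{trimmer{}, upcaser{}}, Frame{Kind: "text", Text: "  hello  "})
	fmt.Println(out.Text)
}
```

Swapping a provider then amounts to substituting one FrameProcessor for another in the chain, with no changes to the surrounding stages.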
Cascading Pipeline (STT → LLM → TTS)
The three-stage cascading pipeline is the most common voice pattern. Speech is transcribed by any supported STT provider, the text is processed through an LLM agent (with full access to tools and memory), and the response is synthesized back to audio. Mix and match providers freely — use Deepgram for STT, Claude for reasoning, and ElevenLabs for TTS in the same pipeline.
Speech-to-Speech (S2S)
For the lowest possible latency, S2S mode bypasses the text intermediary entirely. Supported providers include OpenAI Realtime API and Gemini Live, which process audio input and produce audio output natively. S2S mode can fall back to cascading mode when the S2S provider is unavailable.
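The fallback behavior can be sketched as follows — a self-contained illustration of the concept, not Beluga's actual API, where both paths satisfy one responder role and the cascading path takes over when the S2S provider errors:

```go
package main

import (
	"errors"
	"fmt"
)

// Responder turns user audio into reply audio. Both the S2S path and
// the cascading STT → LLM → TTS path can fill this role.
type Responder interface {
	Respond(audio []byte) ([]byte, error)
}

// s2s stands in for a native audio-in/audio-out provider such as
// OpenAI Realtime; here it simply echoes when available.
type s2s struct{ available bool }

func (s s2s) Respond(audio []byte) ([]byte, error) {
	if !s.available {
		return nil, errors.New("s2s provider unavailable")
	}
	return audio, nil
}

// cascading stands in for the three-stage pipeline; the transcribe →
// reason → synthesize steps collapse to an echo in this sketch.
type cascading struct{}

func (cascading) Respond(audio []byte) ([]byte, error) {
	return audio, nil
}

// withFallback tries the primary (S2S) responder first and falls back
// to the cascading pipeline when it fails.
func withFallback(primary, fallback Responder, audio []byte) ([]byte, error) {
	if out, err := primary.Respond(audio); err == nil {
		return out, nil
	}
	return fallback.Respond(audio)
}

func main() {
	out, err := withFallback(s2s{available: false}, cascading{}, []byte("pcm"))
	fmt.Println(string(out), err)
}
```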
STT Providers
Six STT providers are supported out of the box: Deepgram Nova-3 (streaming-first with low latency), ElevenLabs Scribe (high accuracy multilingual), OpenAI Whisper (broad language support), AssemblyAI Slam-1 (real-time streaming), Groq (ultra-fast inference), and Gladia (enterprise multilingual). All providers implement the same interface, making swaps seamless.
TTS Providers
Seven TTS providers cover a range of voice quality and latency profiles: ElevenLabs (premium voice cloning), Cartesia Sonic (ultra-low latency streaming), PlayHT (natural conversational voices), Groq (fast inference), Fish Audio (multilingual), LMNT (real-time optimized), and Smallest.ai (lightweight edge deployment).
Voice Activity Detection
Silero VAD runs in under 1ms per 30ms audio chunk, providing reliable speech/silence detection. Beyond silence-based endpointing, Beluga also supports semantic turn detection, which uses LLM context to determine when a speaker has finished their thought, reducing false interruptions in conversational scenarios.
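To make the per-chunk classification concrete, here is a self-contained stand-in: Silero itself is a small neural network, but a mean-energy gate over the same 30ms chunks shows the shape of the speech/silence decision:

```go
package main

import "fmt"

// isSpeech classifies one 30ms PCM chunk as speech or silence.
// Silero VAD uses a neural model for this decision; this sketch
// substitutes a simple mean-energy threshold for illustration.
func isSpeech(chunk []int16, threshold float64) bool {
	if len(chunk) == 0 {
		return false
	}
	var energy float64
	for _, s := range chunk {
		energy += float64(s) * float64(s)
	}
	return energy/float64(len(chunk)) > threshold
}

func main() {
	silence := make([]int16, 480) // 30ms at 16kHz = 480 samples
	loud := make([]int16, 480)
	for i := range loud {
		loud[i] = 8000
	}
	fmt.Println(isSpeech(silence, 1e6), isSpeech(loud, 1e6))
}
```

Semantic turn detection layers on top of this: even after the gate reports silence, the pipeline can consult LLM context to decide whether the utterance is actually complete.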
Transports
Four transport options are available: WebSocket for simple bidirectional streaming, WebRTC for peer-to-peer low-latency audio, LiveKit rooms for scalable multi-participant sessions, and Daily.co for managed infrastructure. All transports produce and consume the same Frame types.
Latency Budget
The voice pipeline enforces a latency budget to meet the sub-800ms target. Each stage has a budget allocation: Transport <50ms, VAD <1ms, STT <200ms, LLM TTFT <300ms, and TTS TTFB <200ms. Observability hooks report per-stage timing, enabling you to identify and resolve bottlenecks.
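The arithmetic works out: 50 + 1 + 200 + 300 + 200 = 751ms, leaving headroom under the 800ms target. A self-contained sketch of the budget check (illustrative names, not Beluga's real API) shows how per-stage timings can be compared against their allocations:

```go
package main

import "fmt"

// Budget allocates the sub-800ms round-trip across pipeline stages,
// matching the per-stage targets described above. The stage names and
// this map-based shape are illustrative, not the library's actual types.
type Budget map[string]int // stage name -> budget in milliseconds

func DefaultBudget() Budget {
	return Budget{
		"transport": 50,
		"vad":       1,
		"stt":       200,
		"llm_ttft":  300,
		"tts_ttfb":  200,
	}
}

// Overruns compares measured per-stage timings against the budget and
// reports every stage that exceeded its allocation — the same check an
// observability hook would drive.
func (b Budget) Overruns(measured map[string]int) []string {
	var over []string
	for stage, limit := range b {
		if ms, ok := measured[stage]; ok && ms > limit {
			over = append(over, fmt.Sprintf("%s: %dms > %dms", stage, ms, limit))
		}
	}
	return over
}

func main() {
	b := DefaultBudget()
	total := 0
	for _, ms := range b {
		total += ms
	}
	fmt.Println(total) // 751ms, under the 800ms target
	fmt.Println(b.Overruns(map[string]int{"stt": 250, "vad": 1}))
}
```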
Architecture
Providers & Implementations
STT Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| Deepgram Nova-3 | P0 | Streaming-first, lowest latency, excellent accuracy |
| ElevenLabs Scribe | P0 | High accuracy multilingual transcription |
| OpenAI Whisper | P0 | Broadest language support, well-known baseline |
| AssemblyAI Slam-1 | P1 | Real-time streaming with speaker diarization |
| Groq | P1 | Ultra-fast inference on Whisper models |
| Gladia | P2 | Enterprise multilingual with custom vocabulary |
TTS Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| ElevenLabs | P0 | Premium voice cloning, highest naturalness |
| Cartesia Sonic | P0 | Ultra-low latency streaming, word-level timestamps |
| PlayHT | P1 | Natural conversational voices, emotion control |
| Groq | P1 | Fast inference, competitive voice quality |
| Fish Audio | P1 | Multilingual support, open-source models |
| LMNT | P2 | Real-time optimized, custom voice creation |
| Smallest.ai | P2 | Lightweight models for edge deployment |
Speech-to-Speech (S2S) Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| OpenAI Realtime | P0 | Native audio-in/audio-out, function calling support |
| Gemini Live | P0 | Multimodal native audio, long context |
| Ultravox (Fixie) | P1 | Open-weight, self-hostable S2S model |
VAD Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| Silero VAD | P0 | Sub-1ms latency per 30ms chunk, high accuracy |
| Semantic Turn Detection | P1 | LLM-aware turn boundary detection |
Full Example
A complete voice pipeline using Deepgram STT, an LLM agent, and ElevenLabs TTS over WebSocket transport:
```go
package main

import (
	"context"
	"log"

	"github.com/lookatitude/beluga-ai/agent"
	"github.com/lookatitude/beluga-ai/llm"
	_ "github.com/lookatitude/beluga-ai/llm/providers/openai"
	"github.com/lookatitude/beluga-ai/voice"
	"github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"
	"github.com/lookatitude/beluga-ai/voice/transport"
	"github.com/lookatitude/beluga-ai/voice/tts/providers/elevenlabs"
)

func main() {
	ctx := context.Background()

	// Create an LLM-backed agent for the voice pipeline.
	model, err := llm.New("openai", llm.ProviderConfig{
		Model: "gpt-4o",
	})
	if err != nil {
		log.Fatal(err)
	}
	voiceAgent := agent.New("voice-assistant",
		agent.WithModel(model),
		agent.WithSystemPrompt("You are a helpful voice assistant. Keep responses concise."),
	)

	// Configure STT (Deepgram Nova-3).
	sttProcessor := deepgram.New(
		deepgram.WithModel("nova-3"),
		deepgram.WithLanguage("en"),
		deepgram.WithInterimResults(true),
	)

	// Configure TTS (ElevenLabs).
	ttsProcessor := elevenlabs.New(
		elevenlabs.WithVoiceID("rachel"),
		elevenlabs.WithModel("eleven_turbo_v2_5"),
		elevenlabs.WithOutputFormat("pcm_24000"),
	)

	// Build the voice pipeline.
	pipeline := voice.NewPipeline(
		voice.WithSTT(sttProcessor),
		voice.WithAgent(voiceAgent),
		voice.WithTTS(ttsProcessor),
		voice.WithVAD(voice.SileroVAD()),
		voice.WithLatencyBudget(voice.DefaultLatencyBudget()),
	)

	// Create the WebSocket transport and start serving.
	ws := transport.NewWebSocket(
		transport.WithAddr(":8080"),
		transport.WithPath("/voice"),
	)
	log.Println("Voice pipeline listening on ws://localhost:8080/voice")
	if err := pipeline.Serve(ctx, ws); err != nil {
		log.Fatal(err)
	}
}
```