Capability Layer

Voice Pipeline

Frame-based STT → LLM → TTS processing with a sub-800ms end-to-end latency target. Speech-to-speech modes, Silero VAD, semantic turn detection, and WebRTC/LiveKit transports.

6+ STT · 7+ TTS · S2S · <800ms E2E Target · Frame-Based

Overview

The Beluga AI Voice Pipeline provides a frame-based architecture for real-time voice interactions. Rather than treating voice as a monolithic stream, the pipeline decomposes audio into discrete Frames — atomic units of audio, text, images, or control signals — that flow through a chain of FrameProcessor nodes. This design enables fine-grained composition: swap any STT, TTS, or LLM provider without rewriting your pipeline.

The voice system supports two primary modes. The cascading pipeline (STT → LLM → TTS) transcribes speech, processes it through any LLM, and synthesizes a response. For lower latency, speech-to-speech (S2S) mode bypasses text entirely, using models like OpenAI Realtime or Gemini Live to process audio natively. Both modes integrate seamlessly with the agent runtime, memory, and tool systems.

End-to-end latency is managed through a strict latency budget: transport under 50ms, VAD under 1ms, STT under 200ms, LLM TTFT under 300ms, and TTS TTFB under 200ms — targeting sub-800ms total round-trip. Transports include WebSocket, WebRTC, LiveKit rooms, and Daily.co for production deployment.

Capabilities

Frame-Based Architecture

Every stage in the pipeline implements the FrameProcessor interface, processing atomic Frames that carry audio chunks, transcription text, images, or control signals. This composable design allows processors to be added, removed, or reordered without affecting the rest of the chain.

Cascading Pipeline (STT → LLM → TTS)

The three-stage cascading pipeline is the most common voice pattern. Speech is transcribed by any supported STT provider, the text is processed through an LLM agent (with full access to tools and memory), and the response is synthesized back to audio. Mix and match providers freely — use Deepgram for STT, Claude for reasoning, and ElevenLabs for TTS in the same pipeline.

Speech-to-Speech (S2S)

For the lowest possible latency, S2S mode bypasses the text intermediary entirely. Supported providers include OpenAI Realtime API and Gemini Live, which process audio input and produce audio output natively. S2S mode can fall back to cascading mode when the S2S provider is unavailable.

STT Providers

Six STT providers are supported out of the box: Deepgram Nova-3 (streaming-first with low latency), ElevenLabs Scribe (high-accuracy multilingual), OpenAI Whisper (broad language support), AssemblyAI Slam-1 (real-time streaming), Groq (ultra-fast inference), and Gladia (enterprise multilingual). All providers implement the same interface, making swaps seamless.

TTS Providers

Seven TTS providers cover a range of voice quality and latency profiles: ElevenLabs (premium voice cloning), Cartesia Sonic (ultra-low latency streaming), PlayHT (natural conversational voices), Groq (fast inference), Fish Audio (multilingual), LMNT (real-time optimized), and Smallest.ai (lightweight edge deployment).

Voice Activity Detection

Silero VAD runs in under 1ms per 30ms audio chunk, providing reliable speech/silence detection. Beyond simple energy-based detection, Beluga also supports semantic turn detection that uses LLM context to determine when a speaker has finished their thought, reducing false interruptions in conversational scenarios.

Transports

Four transport options are available: WebSocket for simple bidirectional streaming, WebRTC for peer-to-peer low-latency audio, LiveKit rooms for scalable multi-participant sessions, and Daily.co for managed infrastructure. All transports produce and consume the same Frame types.

Latency Budget

The voice pipeline enforces a latency budget to meet the sub-800ms target. Each stage has a budget allocation: Transport <50ms, VAD <1ms, STT <200ms, LLM TTFT <300ms, and TTS TTFB <200ms. Observability hooks report per-stage timing, enabling you to identify and resolve bottlenecks.

Architecture

Cascading Pipeline (STT → LLM → TTS):
Audio Input → Transport → VAD → STT → LLM Agent → TTS → Audio Output

Speech-to-Speech (S2S):
Audio Input → Transport → S2S Model → Audio Output

Latency Budget:
Transport <50ms + VAD <1ms + STT <200ms + LLM TTFT <300ms + TTS TTFB <200ms = <800ms E2E

Providers & Implementations

STT Providers

Name              | Priority | Key Differentiator
Deepgram Nova-3   | P0       | Streaming-first, lowest latency, excellent accuracy
ElevenLabs Scribe | P0       | High-accuracy multilingual transcription
OpenAI Whisper    | P0       | Broadest language support, well-known baseline
AssemblyAI Slam-1 | P1       | Real-time streaming with speaker diarization
Groq              | P1       | Ultra-fast inference on Whisper models
Gladia            | P2       | Enterprise multilingual with custom vocabulary

TTS Providers

Name           | Priority | Key Differentiator
ElevenLabs     | P0       | Premium voice cloning, highest naturalness
Cartesia Sonic | P0       | Ultra-low latency streaming, word-level timestamps
PlayHT         | P1       | Natural conversational voices, emotion control
Groq           | P1       | Fast inference, competitive voice quality
Fish Audio     | P1       | Multilingual support, open-source models
LMNT           | P2       | Real-time optimized, custom voice creation
Smallest.ai    | P2       | Lightweight models for edge deployment

Speech-to-Speech (S2S) Providers

Name            | Priority | Key Differentiator
OpenAI Realtime | P0       | Native audio-in/audio-out, function calling support
Gemini Live     | P0       | Multimodal native audio, long context
Ultravox (Fixie)| P1       | Open-weight, self-hostable S2S model

VAD Providers

Name                    | Priority | Key Differentiator
Silero VAD              | P0       | Sub-1ms latency per 30ms chunk, high accuracy
Semantic Turn Detection | P1       | LLM-aware turn boundary detection

Full Example

A complete voice pipeline using Deepgram STT, an LLM agent, and ElevenLabs TTS over WebSocket transport:

package main

import (
    "context"
    "log"

    "github.com/lookatitude/beluga-ai/voice"
    "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"
    "github.com/lookatitude/beluga-ai/voice/tts/providers/elevenlabs"
    "github.com/lookatitude/beluga-ai/voice/transport"
    "github.com/lookatitude/beluga-ai/agent"
    "github.com/lookatitude/beluga-ai/llm"
    _ "github.com/lookatitude/beluga-ai/llm/providers/openai"
)

func main() {
    ctx := context.Background()

    // Create an LLM-backed agent for the voice pipeline
    model, err := llm.New("openai", llm.ProviderConfig{
        Model: "gpt-4o",
    })
    if err != nil {
        log.Fatal(err)
    }

    voiceAgent := agent.New("voice-assistant",
        agent.WithModel(model),
        agent.WithSystemPrompt("You are a helpful voice assistant. Keep responses concise."),
    )

    // Configure STT (Deepgram Nova-3)
    sttProcessor := deepgram.New(
        deepgram.WithModel("nova-3"),
        deepgram.WithLanguage("en"),
        deepgram.WithInterimResults(true),
    )

    // Configure TTS (ElevenLabs)
    ttsProcessor := elevenlabs.New(
        elevenlabs.WithVoiceID("rachel"),
        elevenlabs.WithModel("eleven_turbo_v2_5"),
        elevenlabs.WithOutputFormat("pcm_24000"),
    )

    // Build the voice pipeline
    pipeline := voice.NewPipeline(
        voice.WithSTT(sttProcessor),
        voice.WithAgent(voiceAgent),
        voice.WithTTS(ttsProcessor),
        voice.WithVAD(voice.SileroVAD()),
        voice.WithLatencyBudget(voice.DefaultLatencyBudget()),
    )

    // Create WebSocket transport and start serving
    ws := transport.NewWebSocket(
        transport.WithAddr(":8080"),
        transport.WithPath("/voice"),
    )

    log.Println("Voice pipeline listening on ws://localhost:8080/voice")
    if err := pipeline.Serve(ctx, ws); err != nil {
        log.Fatal(err)
    }
}

Related Features