Voice Pipeline
Frame-based STT→LLM→TTS processing with sub-800ms end-to-end latency. Speech-to-speech modes, Silero VAD, semantic turn detection, and WebRTC/LiveKit transports.
Overview
The Beluga AI Voice Pipeline provides a frame-based architecture for real-time voice interactions.
Rather than treating voice as a monolithic stream, the pipeline decomposes audio into discrete
Frames — atomic units of audio, text, images, or control signals — that flow through
a chain of FrameProcessor nodes. This design enables fine-grained composition: swap any
STT, TTS, or LLM provider without rewriting your pipeline.
The voice system supports two primary modes. The cascading pipeline (STT → LLM → TTS) transcribes speech, processes it through any LLM, and synthesizes a response. For lower latency, speech-to-speech (S2S) mode bypasses text entirely, using models like OpenAI Realtime or Gemini Live to process audio natively. Both modes integrate seamlessly with the agent runtime, memory, and tool systems.
End-to-end latency is managed through a strict latency budget: transport under 50ms, VAD under 1ms, STT under 200ms, LLM TTFT under 300ms, and TTS TTFB under 200ms — targeting sub-800ms total round-trip. Transports include WebSocket, WebRTC, LiveKit rooms, and Daily.co for production deployment.
Capabilities
Frame-Based Architecture
Every stage in the pipeline implements the FrameProcessor interface, processing atomic
Frames that carry audio chunks, transcription text, images, or control signals. This
composable design allows processors to be added, removed, or reordered without affecting the rest of
the chain.
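As a minimal self-contained sketch (not the actual Beluga interfaces — the real Frame and FrameProcessor types are richer), the composable chain idea looks like this:

```go
package main

import (
	"fmt"
	"strings"
)

// Frame is an atomic unit flowing through the pipeline. Real frames
// carry audio chunks, transcription text, images, or control signals;
// this sketch uses text only.
type Frame struct {
	Kind string
	Text string
}

// FrameProcessor is one stage in the chain. Because every stage shares
// this interface, processors can be added, removed, or reordered freely.
type FrameProcessor interface {
	Process(f Frame) Frame
}

// Two toy processors standing in for real STT/LLM/TTS stages.
type trimmer struct{}

func (trimmer) Process(f Frame) Frame { f.Text = strings.TrimSpace(f.Text); return f }

type upcaser struct{}

func (upcaser) Process(f Frame) Frame { f.Text = strings.ToUpper(f.Text); return f }

// run pushes a frame through an ordered chain of processors.
func run(chain []FrameProcessor, f Frame) Frame {
	for _, p := range chain {
		f = p.Process(f)
	}
	return f
}

func main() {
	out := run([]FrameProcessor{trimmer{}, upcaser{}}, Frame{Kind: "text", Text: "  hello  "})
	fmt.Println(out.Text)
}
```

Swapping a provider then amounts to substituting one FrameProcessor for another in the chain, with no changes to the surrounding stages.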
Cascading Pipeline (STT → LLM → TTS)
The three-stage cascading pipeline is the most common voice pattern. Speech is transcribed by any supported STT provider, the text is processed through an LLM agent (with full access to tools and memory), and the response is synthesized back to audio. Mix and match providers freely — use Deepgram for STT, Claude for reasoning, and ElevenLabs for TTS in the same pipeline.
Speech-to-Speech (S2S)
For the lowest possible latency, S2S mode bypasses the text intermediary entirely. Supported providers include OpenAI Realtime API and Gemini Live, which process audio input and produce audio output natively. S2S mode can fall back to cascading mode when the S2S provider is unavailable.
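The fallback behavior can be sketched as follows — a self-contained illustration of the concept, not Beluga's actual API, where both paths satisfy one responder role and the cascading path takes over when the S2S provider errors:

```go
package main

import (
	"errors"
	"fmt"
)

// Responder turns user audio into reply audio. Both the S2S path and
// the cascading STT → LLM → TTS path can fill this role.
type Responder interface {
	Respond(audio []byte) ([]byte, error)
}

// s2s stands in for a native audio-in/audio-out provider such as
// OpenAI Realtime; here it simply echoes when available.
type s2s struct{ available bool }

func (s s2s) Respond(audio []byte) ([]byte, error) {
	if !s.available {
		return nil, errors.New("s2s provider unavailable")
	}
	return audio, nil
}

// cascading stands in for the three-stage pipeline; the transcribe →
// reason → synthesize steps collapse to an echo in this sketch.
type cascading struct{}

func (cascading) Respond(audio []byte) ([]byte, error) {
	return audio, nil
}

// withFallback tries the primary (S2S) responder first and falls back
// to the cascading pipeline when it fails.
func withFallback(primary, fallback Responder, audio []byte) ([]byte, error) {
	if out, err := primary.Respond(audio); err == nil {
		return out, nil
	}
	return fallback.Respond(audio)
}

func main() {
	out, err := withFallback(s2s{available: false}, cascading{}, []byte("pcm"))
	fmt.Println(string(out), err)
}
```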
STT Providers
Six STT providers are supported out of the box: Deepgram Nova-3 (streaming-first with low latency), ElevenLabs Scribe (high accuracy multilingual), OpenAI Whisper (broad language support), AssemblyAI Slam-1 (real-time streaming), Groq (ultra-fast inference), and Gladia (enterprise multilingual). All providers implement the same interface, making swaps seamless.
TTS Providers
Seven TTS providers cover a range of voice quality and latency profiles: ElevenLabs (premium voice cloning), Cartesia Sonic (ultra-low latency streaming), PlayHT (natural conversational voices), Groq (fast inference), Fish Audio (multilingual), LMNT (real-time optimized), and Smallest.ai (lightweight edge deployment).
Voice Activity Detection
Silero VAD runs in under 1ms per 30ms audio chunk, providing reliable speech/silence detection. Beyond silence-based endpointing, Beluga also supports semantic turn detection, which uses LLM context to determine when a speaker has finished their thought, reducing false interruptions in conversational scenarios.
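To make the per-chunk classification concrete, here is a self-contained stand-in: Silero itself is a small neural network, but a mean-energy gate over the same 30ms chunks shows the shape of the speech/silence decision:

```go
package main

import "fmt"

// isSpeech classifies one 30ms PCM chunk as speech or silence.
// Silero VAD uses a neural model for this decision; this sketch
// substitutes a simple mean-energy threshold for illustration.
func isSpeech(chunk []int16, threshold float64) bool {
	if len(chunk) == 0 {
		return false
	}
	var energy float64
	for _, s := range chunk {
		energy += float64(s) * float64(s)
	}
	return energy/float64(len(chunk)) > threshold
}

func main() {
	silence := make([]int16, 480) // 30ms at 16kHz = 480 samples
	loud := make([]int16, 480)
	for i := range loud {
		loud[i] = 8000
	}
	fmt.Println(isSpeech(silence, 1e6), isSpeech(loud, 1e6))
}
```

Semantic turn detection layers on top of this: even after the gate reports silence, the pipeline can consult LLM context to decide whether the utterance is actually complete.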
Transports
Four transport options are available: WebSocket for simple bidirectional streaming, WebRTC for peer-to-peer low-latency audio, LiveKit rooms for scalable multi-participant sessions, and Daily.co for managed infrastructure. All transports produce and consume the same Frame types.
Latency Budget
The voice pipeline enforces a latency budget to meet the sub-800ms target. Each stage has a budget allocation: Transport <50ms, VAD <1ms, STT <200ms, LLM TTFT <300ms, and TTS TTFB <200ms. Observability hooks report per-stage timing, enabling you to identify and resolve bottlenecks.
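The arithmetic works out: 50 + 1 + 200 + 300 + 200 = 751ms, leaving headroom under the 800ms target. A self-contained sketch of the budget check (illustrative names, not Beluga's real API) shows how per-stage timings can be compared against their allocations:

```go
package main

import "fmt"

// Budget allocates the sub-800ms round-trip across pipeline stages,
// matching the per-stage targets described above. The stage names and
// this map-based shape are illustrative, not the library's actual types.
type Budget map[string]int // stage name -> budget in milliseconds

func DefaultBudget() Budget {
	return Budget{
		"transport": 50,
		"vad":       1,
		"stt":       200,
		"llm_ttft":  300,
		"tts_ttfb":  200,
	}
}

// Overruns compares measured per-stage timings against the budget and
// reports every stage that exceeded its allocation — the same check an
// observability hook would drive.
func (b Budget) Overruns(measured map[string]int) []string {
	var over []string
	for stage, limit := range b {
		if ms, ok := measured[stage]; ok && ms > limit {
			over = append(over, fmt.Sprintf("%s: %dms > %dms", stage, ms, limit))
		}
	}
	return over
}

func main() {
	b := DefaultBudget()
	total := 0
	for _, ms := range b {
		total += ms
	}
	fmt.Println(total) // 751ms, under the 800ms target
	fmt.Println(b.Overruns(map[string]int{"stt": 250, "vad": 1}))
}
```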
Architecture
Providers & Implementations
STT Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| Deepgram Nova-3 | P0 | Streaming-first, lowest latency, excellent accuracy |
| ElevenLabs Scribe | P0 | High accuracy multilingual transcription |
| OpenAI Whisper | P0 | Broadest language support, well-known baseline |
| AssemblyAI Slam-1 | P1 | Real-time streaming with speaker diarization |
| Groq | P1 | Ultra-fast inference on Whisper models |
| Gladia | P2 | Enterprise multilingual with custom vocabulary |
TTS Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| ElevenLabs | P0 | Premium voice cloning, highest naturalness |
| Cartesia Sonic | P0 | Ultra-low latency streaming, word-level timestamps |
| PlayHT | P1 | Natural conversational voices, emotion control |
| Groq | P1 | Fast inference, competitive voice quality |
| Fish Audio | P1 | Multilingual support, open-source models |
| LMNT | P2 | Real-time optimized, custom voice creation |
| Smallest.ai | P2 | Lightweight models for edge deployment |
Speech-to-Speech (S2S) Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| OpenAI Realtime | P0 | Native audio-in/audio-out, function calling support |
| Gemini Live | P0 | Multimodal native audio, long context |
| Ultravox (Fixie) | P1 | Open-weight, self-hostable S2S model |
VAD Providers
| Name | Priority | Key Differentiator |
|---|---|---|
| Silero VAD | P0 | Sub-1ms latency per 30ms chunk, high accuracy |
| Semantic Turn Detection | P1 | LLM-aware turn boundary detection |
Full Example
A complete voice pipeline using Deepgram STT, an LLM agent, and ElevenLabs TTS over WebSocket transport:
```go
package main

import (
	"context"
	"log"

	"github.com/lookatitude/beluga-ai/agent"
	"github.com/lookatitude/beluga-ai/llm"
	_ "github.com/lookatitude/beluga-ai/llm/providers/openai"
	"github.com/lookatitude/beluga-ai/voice"
	"github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"
	"github.com/lookatitude/beluga-ai/voice/transport"
	"github.com/lookatitude/beluga-ai/voice/tts/providers/elevenlabs"
)

func main() {
	ctx := context.Background()

	// Create an LLM-backed agent for the voice pipeline.
	model, err := llm.New("openai", llm.ProviderConfig{
		Model: "gpt-4o",
	})
	if err != nil {
		log.Fatal(err)
	}
	voiceAgent := agent.New("voice-assistant",
		agent.WithModel(model),
		agent.WithSystemPrompt("You are a helpful voice assistant. Keep responses concise."),
	)

	// Configure STT (Deepgram Nova-3).
	sttProcessor := deepgram.New(
		deepgram.WithModel("nova-3"),
		deepgram.WithLanguage("en"),
		deepgram.WithInterimResults(true),
	)

	// Configure TTS (ElevenLabs).
	ttsProcessor := elevenlabs.New(
		elevenlabs.WithVoiceID("rachel"),
		elevenlabs.WithModel("eleven_turbo_v2_5"),
		elevenlabs.WithOutputFormat("pcm_24000"),
	)

	// Build the voice pipeline.
	pipeline := voice.NewPipeline(
		voice.WithSTT(sttProcessor),
		voice.WithAgent(voiceAgent),
		voice.WithTTS(ttsProcessor),
		voice.WithVAD(voice.SileroVAD()),
		voice.WithLatencyBudget(voice.DefaultLatencyBudget()),
	)

	// Create the WebSocket transport and start serving.
	ws := transport.NewWebSocket(
		transport.WithAddr(":8080"),
		transport.WithPath("/voice"),
	)
	log.Println("Voice pipeline listening on ws://localhost:8080/voice")
	if err := pipeline.Serve(ctx, ws); err != nil {
		log.Fatal(err)
	}
}
```