Voice API — Frame Pipeline, VAD, S2S

import "github.com/lookatitude/beluga-ai/voice"

Package voice provides the voice and multimodal pipeline for the Beluga AI framework. It implements a frame-based processing model, inspired by Pipecat, in which atomic Frames (audio chunks, text fragments, images, control signals) flow through linked FrameProcessors via Go channels.

The fundamental data unit is the Frame, which carries typed data:

  • [FrameAudio] — raw audio data (PCM, Opus, etc.)
  • [FrameText] — text fragments (transcripts, LLM output)
  • [FrameControl] — control signals (start, stop, interrupt, end-of-utterance)
  • [FrameImage] — image/video frames for multimodal pipelines

Convenience constructors are provided: NewAudioFrame, NewTextFrame, NewControlFrame, and NewImageFrame.
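To make the frame taxonomy concrete, here is a minimal self-contained sketch of the pattern. The type and constructor signatures below are stand-ins for illustration (the real package's Frame and NewTextFrame may differ):

```go
package main

import "fmt"

// FrameKind mirrors the frame types listed above (stand-in definitions
// for illustration; the real package's types may differ).
type FrameKind int

const (
	FrameAudio FrameKind = iota
	FrameText
	FrameControl
	FrameImage
)

// Frame is a minimal stand-in for the package's atomic data unit.
type Frame struct {
	Kind FrameKind
	Data []byte // raw payload: PCM bytes, UTF-8 text, etc.
}

// NewTextFrame sketches a convenience constructor (assumed signature).
func NewTextFrame(s string) Frame {
	return Frame{Kind: FrameText, Data: []byte(s)}
}

func main() {
	f := NewTextFrame("hello")
	fmt.Println(f.Kind == FrameText, string(f.Data))
}
```

Keeping the payload as raw bytes plus a kind tag lets every processor in the pipeline accept the same channel element type and switch on Kind.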

The core abstraction is the FrameProcessor interface. Each processor reads frames from an input channel, processes them, and writes results to an output channel. Processors run as goroutines and must close the output channel when done.

type FrameProcessor interface {
    Process(ctx context.Context, in <-chan Frame, out chan<- Frame) error
}

Use FrameProcessorFunc to adapt plain functions as FrameProcessors. Use Chain to connect multiple processors in series.

Three composable pipeline modes are supported:

  • Cascading: STT → LLM → TTS (each a FrameProcessor goroutine)
  • S2S: Native audio-in/audio-out (OpenAI Realtime, Gemini Live)
  • Hybrid: S2S default, fallback to cascade for complex tool use

The VoicePipeline implements the cascading mode:

pipe := voice.NewPipeline(
    voice.WithTransport(transport),
    voice.WithVAD(vad),
    voice.WithSTT(stt),
    voice.WithLLM(model),
    voice.WithTTS(tts),
)
err := pipe.Run(ctx)

The HybridPipeline combines S2S and cascade modes, switching based on a configurable SwitchPolicy:

hybrid := voice.NewHybridPipeline(
    voice.WithS2S(s2sEngine),
    voice.WithCascade(cascadePipeline),
    voice.WithSwitchPolicy(voice.OnToolOverload),
)
err := hybrid.Run(ctx)
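A switch policy is just a decision function over turn context. The sketch below is an assumption about its shape, not the package's actual SwitchPolicy type: here it is modeled as a predicate on the number of pending tool calls, in the spirit of OnToolOverload:

```go
package main

import "fmt"

// SwitchPolicy is a stand-in: given the number of tool calls a turn
// needs, decide whether to fall back from S2S to the cascade
// (illustrative shape; the real type may carry more context).
type SwitchPolicy func(pendingToolCalls int) bool

// onToolOverload sketches an OnToolOverload-style policy: switch to
// the cascade once a turn needs more than maxTools tool invocations.
func onToolOverload(maxTools int) SwitchPolicy {
	return func(pending int) bool { return pending > maxTools }
}

func main() {
	policy := onToolOverload(2)
	fmt.Println(policy(1), policy(3)) // false true
}
```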

The VAD interface detects speech in audio data. A built-in EnergyVAD uses RMS energy thresholds, and providers in voice/vad/providers/ offer Silero and WebRTC-based detection. The VAD registry follows the standard RegisterVAD/NewVAD/ListVAD pattern.
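The energy-threshold idea behind EnergyVAD is simple enough to sketch: compute the RMS energy of a window of PCM samples and compare it to a threshold. Function names and the fixed threshold below are illustrative; the real EnergyVAD may smooth or adapt its threshold:

```go
package main

import (
	"fmt"
	"math"
)

// energyRMS computes the root-mean-square energy of 16-bit PCM
// samples, normalized to [0, 1].
func energyRMS(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		v := float64(s) / 32768.0 // normalize sample to [-1, 1)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// isSpeech sketches the EnergyVAD decision: energy above a fixed
// threshold counts as speech.
func isSpeech(samples []int16, threshold float64) bool {
	return energyRMS(samples) > threshold
}

func main() {
	silence := make([]int16, 160) // 10ms of silence at 16kHz
	loud := make([]int16, 160)
	for i := range loud {
		loud[i] = 16000 // constant high amplitude (~0.49 RMS)
	}
	fmt.Println(isSpeech(silence, 0.05), isSpeech(loud, 0.05))
}
```

Model-based detectors like Silero trade this O(n) arithmetic for far better robustness to noise, which is why they are offered as providers.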

The VoiceSession tracks conversational state (idle, listening, speaking) and Turn history. It is safe for concurrent use.
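Concurrency-safe state tracking of this kind is typically a mutex-guarded state field plus an append-only turn history. The following is a stand-in sketch of that pattern, not the real VoiceSession type:

```go
package main

import (
	"fmt"
	"sync"
)

// SessionState mirrors the states named above.
type SessionState int

const (
	StateIdle SessionState = iota
	StateListening
	StateSpeaking
)

// session is a minimal stand-in for VoiceSession: a mutex guards both
// the current state and the turn history so callbacks running on
// different goroutines can update it safely.
type session struct {
	mu    sync.Mutex
	state SessionState
	turns []string
}

func (s *session) SetState(st SessionState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = st
}

func (s *session) State() SessionState {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.state
}

func (s *session) AddTurn(t string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.turns = append(s.turns, t)
}

func main() {
	s := &session{}
	s.SetState(StateListening)
	s.AddTurn("user: hello")
	s.SetState(StateSpeaking)
	fmt.Println(s.State() == StateSpeaking, len(s.turns))
}
```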

The Hooks struct provides optional callbacks for pipeline events: OnSpeechStart, OnSpeechEnd, OnTranscript, OnResponse, and OnError. Use ComposeHooks to merge multiple hooks.
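ComposeHooks's behavior can be sketched as fan-out over non-nil callbacks. The field signatures below are assumptions (the source names the callbacks but not their parameters), and only two of the five callbacks are shown to keep the example short:

```go
package main

import "fmt"

// Hooks is a stand-in with two of the callbacks named above
// (signatures are assumptions for illustration).
type Hooks struct {
	OnTranscript func(text string)
	OnError      func(err error)
}

// ComposeHooks merges hooks so that every non-nil callback fires, in
// the order the hooks were passed.
func ComposeHooks(hs ...Hooks) Hooks {
	return Hooks{
		OnTranscript: func(text string) {
			for _, h := range hs {
				if h.OnTranscript != nil {
					h.OnTranscript(text)
				}
			}
		},
		OnError: func(err error) {
			for _, h := range hs {
				if h.OnError != nil {
					h.OnError(err)
				}
			}
		},
	}
}

func main() {
	var log []string
	a := Hooks{OnTranscript: func(t string) { log = append(log, "a:"+t) }}
	b := Hooks{OnTranscript: func(t string) { log = append(log, "b:"+t) }}
	merged := ComposeHooks(a, b)
	merged.OnTranscript("hi")
	fmt.Println(log) // both hooks fired, in order
}
```

This lets, say, a metrics hook and a logging hook be attached to the same pipeline without either knowing about the other.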

Target end-to-end latency budget:

  • Transport: <50ms
  • VAD: <1ms
  • STT: <200ms
  • LLM TTFT: <300ms
  • TTS TTFB: <200ms
  • Return path: <50ms

for a total of roughly 800ms end-to-end.