Voice API — Frame Pipeline, VAD, S2S
    import "github.com/lookatitude/beluga-ai/voice"

Package voice provides the voice and multimodal pipeline for the Beluga AI framework. It implements a frame-based processing model inspired by Pipecat, in which atomic Frames (audio chunks, text fragments, images, control signals) flow through linked FrameProcessors via Go channels.
Frame Types
The fundamental data unit is the Frame, which carries typed data:
- [FrameAudio] — raw audio data (PCM, Opus, etc.)
- [FrameText] — text fragments (transcripts, LLM output)
- [FrameControl] — control signals (start, stop, interrupt, end-of-utterance)
- [FrameImage] — image/video frames for multimodal pipelines
Convenience constructors are provided: NewAudioFrame, NewTextFrame,
NewControlFrame, and NewImageFrame.
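
As a rough illustration of the typed-frame idea, the sketch below defines local stand-ins for the frame types and three of the constructors. The struct layout, the string values of the type tags, the metadata keys, and the constructor parameters are all assumptions made for this example; the real definitions in the voice package may differ.

```go
package main

import "fmt"

// FrameType tags the payload a Frame carries. The names mirror the
// documented frame types; the underlying string values are assumptions.
type FrameType string

const (
	FrameAudio   FrameType = "audio"
	FrameText    FrameType = "text"
	FrameControl FrameType = "control"
	FrameImage   FrameType = "image"
)

// Frame is a hypothetical layout: a type tag, a raw payload, and
// free-form metadata.
type Frame struct {
	Type     FrameType
	Data     []byte
	Metadata map[string]any
}

// NewAudioFrame wraps raw audio bytes and records the sample rate
// (the sampleRate parameter is an assumption of this sketch).
func NewAudioFrame(data []byte, sampleRate int) Frame {
	return Frame{
		Type:     FrameAudio,
		Data:     data,
		Metadata: map[string]any{"sample_rate": sampleRate},
	}
}

// NewTextFrame wraps a text fragment such as a partial transcript.
func NewTextFrame(text string) Frame {
	return Frame{Type: FrameText, Data: []byte(text)}
}

// NewControlFrame carries a named signal and no payload.
func NewControlFrame(signal string) Frame {
	return Frame{
		Type:     FrameControl,
		Metadata: map[string]any{"signal": signal},
	}
}

func main() {
	frames := []Frame{
		NewAudioFrame(make([]byte, 640), 16000), // 20ms of 16kHz 16-bit mono
		NewTextFrame("hello"),
		NewControlFrame("interrupt"),
	}
	for _, f := range frames {
		fmt.Printf("%s frame, %d payload bytes\n", f.Type, len(f.Data))
	}
}
```

Keeping one Frame type with a tag, rather than a separate Go type per payload, lets every processor share the same channel element type.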
FrameProcessor Interface
The core abstraction is the FrameProcessor interface. Each processor reads
frames from an input channel, processes them, and writes results to an output
channel. Processors run as goroutines and must close the output channel when
done.
    type FrameProcessor interface {
        Process(ctx context.Context, in <-chan Frame, out chan<- Frame) error
    }

Use FrameProcessorFunc to adapt plain functions as FrameProcessors. Use
Chain to connect multiple processors in series.
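
The sketch below shows one way a function adapter and a series combinator could work over this interface. It is self-contained and does not import the voice package: the simplified Frame, the MapText and Collect helpers, and the error-handling policy in Chain (first error wins, then the shared context is cancelled) are assumptions for illustration, not the package's actual implementation.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Frame is a simplified stand-in for the package's Frame type.
type Frame struct{ Text string }

// FrameProcessor matches the interface shown above.
type FrameProcessor interface {
	Process(ctx context.Context, in <-chan Frame, out chan<- Frame) error
}

// FrameProcessorFunc adapts a plain function to FrameProcessor.
type FrameProcessorFunc func(ctx context.Context, in <-chan Frame, out chan<- Frame) error

func (f FrameProcessorFunc) Process(ctx context.Context, in <-chan Frame, out chan<- Frame) error {
	return f(ctx, in, out)
}

// MapText is a hypothetical helper: it applies fn to each frame's text and
// closes out once the input is drained or the context is cancelled.
func MapText(fn func(string) string) FrameProcessor {
	return FrameProcessorFunc(func(ctx context.Context, in <-chan Frame, out chan<- Frame) error {
		defer close(out)
		for f := range in {
			f.Text = fn(f.Text)
			select {
			case out <- f:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		return nil
	})
}

// Chain runs each processor in its own goroutine and wires them together
// with unbuffered channels. It returns the first error any stage reports.
func Chain(procs ...FrameProcessor) FrameProcessor {
	if len(procs) == 0 {
		return MapText(func(s string) string { return s }) // pass-through
	}
	return FrameProcessorFunc(func(ctx context.Context, in <-chan Frame, out chan<- Frame) error {
		ctx, cancel := context.WithCancel(ctx)
		defer cancel()
		errs := make(chan error, len(procs))
		src := in
		for i, p := range procs {
			dst := out // the last stage writes to (and closes) the caller's channel
			var next <-chan Frame
			if i < len(procs)-1 {
				mid := make(chan Frame)
				dst, next = mid, mid
			}
			go func(p FrameProcessor, src <-chan Frame, dst chan<- Frame) {
				errs <- p.Process(ctx, src, dst)
			}(p, src, dst)
			src = next
		}
		var first error
		for range procs {
			if err := <-errs; err != nil && first == nil {
				first = err
				cancel() // unblock the remaining stages
			}
		}
		return first
	})
}

// Collect feeds inputs through p and gathers the output text in order.
func Collect(p FrameProcessor, inputs ...string) ([]string, error) {
	in, out := make(chan Frame), make(chan Frame)
	done := make(chan error, 1)
	go func() { done <- p.Process(context.Background(), in, out) }()
	go func() {
		defer close(in)
		for _, s := range inputs {
			in <- Frame{Text: s}
		}
	}()
	var got []string
	for f := range out {
		got = append(got, f.Text)
	}
	return got, <-done
}

func main() {
	chain := Chain(MapText(strings.TrimSpace), MapText(strings.ToUpper))
	got, err := Collect(chain, "  hello ", "world")
	fmt.Println(got, err)
}
```

Because every stage closes its own output, closing the first input channel shuts the whole chain down stage by stage, which is why the interface contract requires processors to close the output channel when done.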
Pipeline Modes
Three composable pipeline modes are supported:
- Cascading: STT → LLM → TTS (each a FrameProcessor goroutine)
- S2S: Native audio-in/audio-out (OpenAI Realtime, Gemini Live)
- Hybrid: S2S default, fallback to cascade for complex tool use
Cascading Pipeline
The VoicePipeline implements the cascading mode:
    pipe := voice.NewPipeline(
        voice.WithTransport(transport),
        voice.WithVAD(vad),
        voice.WithSTT(stt),
        voice.WithLLM(model),
        voice.WithTTS(tts),
    )
    err := pipe.Run(ctx)

Hybrid Pipeline
The HybridPipeline combines S2S and cascade modes, switching based on a
configurable SwitchPolicy:
    hybrid := voice.NewHybridPipeline(
        voice.WithS2S(s2sEngine),
        voice.WithCascade(cascadePipeline),
        voice.WithSwitchPolicy(voice.OnToolOverload),
    )
    err := hybrid.Run(ctx)

Voice Activity Detection
The VAD interface detects speech in audio data. A built-in EnergyVAD
uses RMS energy thresholds, and providers in voice/vad/providers/ offer
Silero and WebRTC-based detection. The VAD registry follows the standard
RegisterVAD/NewVAD/ListVAD pattern.
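
To make the RMS-threshold idea concrete, here is a minimal sketch of an energy-based detector over 16-bit little-endian mono PCM. The local EnergyVAD type, its Threshold field, the IsSpeech method, the 0.02 threshold, and the SquareWave test helper are assumptions for this example; the package's own EnergyVAD may use a different audio format, threshold scale, or smoothing.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// RMS returns the root-mean-square amplitude of 16-bit little-endian mono
// PCM, normalized to the range [0, 1]. A trailing odd byte is ignored.
func RMS(pcm []byte) float64 {
	n := len(pcm) / 2
	if n == 0 {
		return 0
	}
	var sum float64
	for i := 0; i < n; i++ {
		sample := int16(binary.LittleEndian.Uint16(pcm[2*i:]))
		v := float64(sample) / 32768.0
		sum += v * v
	}
	return math.Sqrt(sum / float64(n))
}

// EnergyVAD is a local stand-in: a chunk counts as speech when its RMS
// energy reaches Threshold (an assumed, normalized value).
type EnergyVAD struct{ Threshold float64 }

// IsSpeech reports whether the chunk's energy meets the threshold.
func (v EnergyVAD) IsSpeech(pcm []byte) bool { return RMS(pcm) >= v.Threshold }

// SquareWave builds test audio that alternates between +amp and -amp,
// so its normalized RMS is exactly amp/32768.
func SquareWave(samples int, amp int16) []byte {
	buf := make([]byte, 2*samples)
	for i := 0; i < samples; i++ {
		s := amp
		if i%2 == 1 {
			s = -amp
		}
		binary.LittleEndian.PutUint16(buf[2*i:], uint16(s))
	}
	return buf
}

func main() {
	vad := EnergyVAD{Threshold: 0.02}
	silence := make([]byte, 640)  // 20ms of 16kHz mono silence
	loud := SquareWave(320, 8192) // normalized RMS of 0.25
	fmt.Println(vad.IsSpeech(silence), vad.IsSpeech(loud))
}
```

A fixed energy threshold is cheap but treats any loud noise as speech, which is one reason model-based detectors such as Silero exist as alternative providers.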
Session Management
The VoiceSession tracks conversational state (idle, listening, speaking)
and Turn history. It is safe for concurrent use.
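
One common way to get this kind of concurrency safety is a mutex-guarded struct, sketched below. The Session, State, and Turn types, their fields, and the method names are hypothetical stand-ins for this example and are not taken from the package.

```go
package main

import (
	"fmt"
	"sync"
)

// State names the conversational states listed above.
type State string

const (
	StateIdle      State = "idle"
	StateListening State = "listening"
	StateSpeaking  State = "speaking"
)

// Turn is a hypothetical record of one utterance.
type Turn struct {
	Role string // for example "user" or "assistant"
	Text string
}

// Session guards its state and turn history with a read-write mutex.
type Session struct {
	mu    sync.RWMutex
	state State
	turns []Turn
}

// NewSession starts in the idle state with no history.
func NewSession() *Session { return &Session{state: StateIdle} }

// SetState replaces the current state.
func (s *Session) SetState(st State) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.state = st
}

// State returns the current state.
func (s *Session) State() State {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.state
}

// AddTurn appends one turn to the history.
func (s *Session) AddTurn(t Turn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.turns = append(s.turns, t)
}

// Turns returns a copy so callers cannot mutate the session's history.
func (s *Session) Turns() []Turn {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]Turn(nil), s.turns...)
}

func main() {
	s := NewSession()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.SetState(StateListening)
			s.AddTurn(Turn{Role: "user", Text: fmt.Sprintf("utterance %d", i)})
		}(i)
	}
	wg.Wait()
	fmt.Println(s.State(), len(s.Turns()))
}
```

Returning a copy from Turns trades an allocation for the guarantee that readers never observe a slice another goroutine is appending to.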
The Hooks struct provides optional callbacks for pipeline events:
OnSpeechStart, OnSpeechEnd, OnTranscript, OnResponse, and OnError.
Use ComposeHooks to merge multiple hooks.
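
Merging hooks can be pictured as fanning each event out to every non-nil callback in order. The sketch below uses a reduced, local Hooks struct with simplified callback signatures; the real struct has more fields (OnSpeechEnd, OnResponse) and its signatures, as well as the exact behavior of ComposeHooks, may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Hooks is a reduced stand-in with assumed callback signatures.
// Any field may be left nil.
type Hooks struct {
	OnSpeechStart func()
	OnTranscript  func(text string)
	OnError       func(err error)
}

// ComposeHooks returns a Hooks value whose callbacks invoke the matching
// callback of each input, in argument order, skipping nil entries.
func ComposeHooks(hooks ...Hooks) Hooks {
	return Hooks{
		OnSpeechStart: func() {
			for _, h := range hooks {
				if h.OnSpeechStart != nil {
					h.OnSpeechStart()
				}
			}
		},
		OnTranscript: func(text string) {
			for _, h := range hooks {
				if h.OnTranscript != nil {
					h.OnTranscript(text)
				}
			}
		},
		OnError: func(err error) {
			for _, h := range hooks {
				if h.OnError != nil {
					h.OnError(err)
				}
			}
		},
	}
}

func main() {
	logging := Hooks{
		OnTranscript: func(text string) { fmt.Println("transcript:", text) },
		OnError:      func(err error) { fmt.Println("error:", err) },
	}
	transcripts := 0
	metrics := Hooks{
		OnTranscript: func(string) { transcripts++ },
	}

	h := ComposeHooks(logging, metrics)
	h.OnSpeechStart() // no hook set this callback, so nothing happens
	h.OnTranscript("turn the lights on")
	h.OnError(errors.New("stt timeout"))
	fmt.Println("transcripts seen:", transcripts)
}
```

Because the composed callbacks are never nil, the pipeline can call them unconditionally even when only some hooks define a given event.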
Latency Budget
Target end-to-end latency is under 800ms, budgeted per stage: transport <50ms, VAD <1ms, STT <200ms, LLM TTFT <300ms, TTS TTFB <200ms, return <50ms.