VAD API — Silero, WebRTC Detection

import "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"

Package silero provides the Silero VAD (Voice Activity Detection) provider for the Beluga AI voice pipeline. It detects speech in 16-bit PCM audio using the Silero VAD ONNX model, or an energy-based approximation of that model when the ONNX file is not available.

This package requires CGO and is only compiled when the cgo build tag is set.

This package registers itself as “silero” with the voice VAD registry. Import it with a blank identifier to enable:

import _ "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"

vad, err := voice.NewVAD("silero", map[string]any{
	"threshold":  0.5,
	"model_path": "/path/to/silero_vad.onnx",
})
result, err := vad.DetectActivity(ctx, audioPCM)

Configuration is passed as map[string]any:

  • threshold — Speech probability threshold, 0.0 to 1.0 (default: 0.5)
  • sample_rate — Audio sample rate, 8000 or 16000 (default: 16000)
  • model_path — Path to Silero VAD ONNX model file (optional, falls back to energy-based detection)
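
The documented keys, defaults, and ranges can be illustrated with a small standalone parser. This is only a sketch: `sileroConfig` and `parseSileroConfig` are hypothetical names invented for illustration, and the sketch assumes `threshold` arrives as a `float64` and `sample_rate` as an `int`. The package's actual Config type and its handling of the map may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// sileroConfig is a hypothetical struct mirroring the documented keys.
// It is not the package's real Config type.
type sileroConfig struct {
	Threshold  float64
	SampleRate int
	ModelPath  string
}

// parseSileroConfig starts from the documented defaults (threshold 0.5,
// sample rate 16000, no model path) and overrides them with any values
// present in the map, rejecting values outside the documented ranges.
func parseSileroConfig(m map[string]any) (sileroConfig, error) {
	cfg := sileroConfig{Threshold: 0.5, SampleRate: 16000}
	if v, ok := m["threshold"]; ok {
		f, ok := v.(float64)
		if !ok || f < 0 || f > 1 {
			return cfg, errors.New("threshold must be a float64 in [0.0, 1.0]")
		}
		cfg.Threshold = f
	}
	if v, ok := m["sample_rate"]; ok {
		n, ok := v.(int)
		if !ok || (n != 8000 && n != 16000) {
			return cfg, errors.New("sample_rate must be 8000 or 16000")
		}
		cfg.SampleRate = n
	}
	if v, ok := m["model_path"]; ok {
		s, ok := v.(string)
		if !ok {
			return cfg, errors.New("model_path must be a string")
		}
		cfg.ModelPath = s
	}
	return cfg, nil
}

func main() {
	cfg, err := parseSileroConfig(map[string]any{"threshold": 0.6})
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Because `model_path` is optional, an empty `ModelPath` in this sketch corresponds to the case where the provider falls back to energy-based detection.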

When the ONNX model is not available, the provider uses an energy-based fallback calibrated to approximate Silero’s behavior.
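
To give a feel for what an energy-based fallback can look like, the sketch below maps the RMS energy of a 16-bit PCM frame to a pseudo-probability in the range 0.0 to 1.0, which can then be compared against `threshold`. The linear mapping, the `referenceRMS` constant, the little-endian byte layout, and all function names are assumptions made for illustration; the provider's actual calibration is not described here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// referenceRMS is a hypothetical calibration constant: the RMS level at
// which this sketch reports probability 1.0. The real value is unknown.
const referenceRMS = 3000.0

// decodePCM16 converts little-endian 16-bit PCM bytes into samples.
// A trailing odd byte, if any, is ignored.
func decodePCM16(pcm []byte) []int16 {
	samples := make([]int16, len(pcm)/2)
	for i := range samples {
		samples[i] = int16(binary.LittleEndian.Uint16(pcm[2*i:]))
	}
	return samples
}

// rms returns the root-mean-square amplitude, or 0 for an empty frame.
func rms(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// energyProbability scales RMS linearly into [0, 1] (assumed mapping).
func energyProbability(samples []int16) float64 {
	return math.Min(1.0, rms(samples)/referenceRMS)
}

func main() {
	silence := make([]byte, 640) // 20 ms of zeros at 16 kHz
	loud := make([]byte, 640)
	for i := 0; i < len(loud); i += 2 {
		v := int16(6000)
		if (i/2)%2 == 1 {
			v = -v
		}
		binary.LittleEndian.PutUint16(loud[i:], uint16(v))
	}
	const threshold = 0.5 // documented default
	for name, frame := range map[string][]byte{"silence": silence, "loud": loud} {
		p := energyProbability(decodePCM16(frame))
		fmt.Printf("%s: probability=%.2f speech=%v\n", name, p, p >= threshold)
	}
}
```

An energy-only score like this cannot tell speech from loud non-speech sound, which is why supplying `model_path` is preferable when accuracy matters.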

Exported identifiers:

  • [VAD] — implements voice.VAD using Silero
  • [Config] — configuration struct
  • [New] — constructor accepting Config

import "github.com/lookatitude/beluga-ai/voice/vad/providers/webrtc"

Package webrtc provides a pure Go WebRTC-style VAD (Voice Activity Detection) provider for the Beluga AI voice pipeline. It uses energy and zero-crossing rate (ZCR) analysis on 16-bit PCM audio to detect speech, distinguishing voiced content from noise.

This package registers itself as “webrtc” with the voice VAD registry. Import it with a blank identifier to enable:

import _ "github.com/lookatitude/beluga-ai/voice/vad/providers/webrtc"

vad, err := voice.NewVAD("webrtc", map[string]any{"threshold": 1500.0})
result, err := vad.DetectActivity(ctx, audioPCM)

Speech is detected when both conditions are met:

  • RMS energy exceeds the energy threshold (filters out silence)
  • Zero-crossing rate is below the ZCR threshold (filters out noise)

Requiring both conditions discriminates between speech and noise better than energy-only detection: loud broadband noise passes the energy test but fails the ZCR test.
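
The two conditions above can be sketched as a small standalone program. This is an illustration, not the package's implementation: the function names are hypothetical, and the sketch assumes the zero-crossing rate is expressed as the fraction of adjacent sample pairs whose signs differ, which is one common definition.

```go
package main

import (
	"fmt"
	"math"
)

// rmsEnergy returns the root-mean-square amplitude of a frame.
func rmsEnergy(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// zeroCrossingRate returns the fraction of adjacent sample pairs whose
// signs differ (assumed definition; zero counts as non-negative).
func zeroCrossingRate(samples []int16) float64 {
	if len(samples) < 2 {
		return 0
	}
	crossings := 0
	for i := 1; i < len(samples); i++ {
		if (samples[i-1] >= 0) != (samples[i] >= 0) {
			crossings++
		}
	}
	return float64(crossings) / float64(len(samples)-1)
}

// isSpeech applies both documented conditions: energy above the energy
// threshold and zero-crossing rate below the ZCR threshold.
func isSpeech(samples []int16, energyThreshold, zcrThreshold float64) bool {
	return rmsEnergy(samples) > energyThreshold &&
		zeroCrossingRate(samples) < zcrThreshold
}

func main() {
	// A 200 Hz sine sampled at 16 kHz: loud, with few zero crossings.
	voiced := make([]int16, 320)
	for i := range voiced {
		voiced[i] = int16(8000 * math.Sin(2*math.Pi*200*float64(i)/16000))
	}
	// Samples that flip sign on every step: loud, but noise-like.
	noisy := make([]int16, 320)
	for i := range noisy {
		noisy[i] = 8000
		if i%2 == 1 {
			noisy[i] = -8000
		}
	}
	// Uses the documented defaults: threshold 1000.0, zcr_threshold 0.1.
	fmt.Println(isSpeech(voiced, 1000.0, 0.1), isSpeech(noisy, 1000.0, 0.1))
	// prints: true false
}
```

Both test frames exceed the energy threshold, so an energy-only detector would accept both; only the ZCR check rejects the rapidly alternating one.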

Configuration is passed as map[string]any:

  • threshold — RMS energy threshold (default: 1000.0)
  • zcr_threshold — Zero-crossing rate threshold (default: 0.1)

Exported identifiers:

  • [VAD] — implements voice.VAD using energy + ZCR analysis
  • [New] — constructor accepting energy and ZCR thresholds