Silero VAD Provider
The Silero provider implements voice activity detection using the Silero VAD ONNX model. When the ONNX runtime is unavailable, it falls back to an energy-based detector calibrated to approximate Silero’s sensitivity. This provider requires CGO.
Choose Silero when you need high-accuracy, neural network-based voice activity detection. The ONNX model classifies speech and silence more precisely than energy-based methods, especially in noisy environments. Because it requires CGO, use the WebRTC provider if you need a pure Go alternative.
Installation
```shell
go get github.com/lookatitude/beluga-ai/voice/vad/providers/silero
```

This package uses the cgo build tag. Ensure CGO is enabled in your build environment.
Configuration
| Field | Required | Default | Description |
|---|---|---|---|
| `Threshold` | No | 0.5 | Speech probability threshold (0.0-1.0) |
| `SampleRate` | No | 16000 | Audio sample rate in Hz (8000 or 16000) |
| `ModelPath` | No | (none) | Path to Silero VAD ONNX model file |
Registry configuration keys:
| Key | Type | Maps to |
|---|---|---|
| `threshold` | float64 | `Threshold` |
| `sample_rate` | int | `SampleRate` |
| `model_path` | string | `ModelPath` |
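
The mapping in the table above can be pictured with a small sketch. The `Config` type below is a local stand-in for the documented fields, and `configFromMap` is a hypothetical helper, not the provider's actual code. It assumes the documented defaults are applied when a key is absent and that out-of-range values are rejected; the real provider may behave differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a local stand-in that mirrors the documented fields.
// It is not the provider's silero.Config type.
type Config struct {
	Threshold  float64
	SampleRate int
	ModelPath  string
}

// configFromMap is a hypothetical helper: it maps the registry keys onto
// Config, starting from the documented defaults (0.5 and 16000).
func configFromMap(m map[string]any) (Config, error) {
	cfg := Config{Threshold: 0.5, SampleRate: 16000}

	if v, ok := m["threshold"]; ok {
		f, ok := v.(float64)
		if !ok {
			return Config{}, fmt.Errorf("threshold: want float64, got %T", v)
		}
		cfg.Threshold = f
	}
	if v, ok := m["sample_rate"]; ok {
		n, ok := v.(int)
		if !ok {
			return Config{}, fmt.Errorf("sample_rate: want int, got %T", v)
		}
		cfg.SampleRate = n
	}
	if v, ok := m["model_path"]; ok {
		s, ok := v.(string)
		if !ok {
			return Config{}, fmt.Errorf("model_path: want string, got %T", v)
		}
		cfg.ModelPath = s
	}

	// Assumed validation, based on the ranges listed in the field table.
	if cfg.Threshold < 0 || cfg.Threshold > 1 {
		return Config{}, errors.New("threshold must be within 0.0-1.0")
	}
	if cfg.SampleRate != 8000 && cfg.SampleRate != 16000 {
		return Config{}, errors.New("sample_rate must be 8000 or 16000")
	}
	return cfg, nil
}

func main() {
	cfg, err := configFromMap(map[string]any{"threshold": 0.7})
	fmt.Println(cfg, err)
}
```

Note that an untyped literal such as `16000` in a `map[string]any` is stored as `int`, while `0.5` is stored as `float64`, which matches the types listed for the registry keys.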
Basic Usage
```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/lookatitude/beluga-ai/voice"
	_ "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"
)

func main() {
	vad, err := voice.NewVAD("silero", map[string]any{
		"threshold":   0.5,
		"sample_rate": 16000,
	})
	if err != nil {
		log.Fatal(err)
	}

	// audioPCM is 16-bit little-endian PCM audio data
	var audioPCM []byte // ... obtained from audio source

	result, err := vad.DetectActivity(context.Background(), audioPCM)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Speech: %v, Event: %s, Confidence: %.2f\n",
		result.IsSpeech, result.EventType, result.Confidence)
}
```

Advanced Features
ONNX Model Loading

When a ModelPath is provided, the Silero provider loads the ONNX model for neural network-based speech detection. Without a model path, it uses an energy-based fallback calibrated to approximate Silero’s output:

```go
vad, err := voice.NewVAD("silero", map[string]any{
	"threshold":  0.5,
	"model_path": "/models/silero_vad.onnx",
})
```

Continuous Detection with State Tracking
The provider tracks speech state internally and emits transition events:

```go
for _, chunk := range audioChunks {
	result, err := vad.DetectActivity(ctx, chunk)
	if err != nil {
		log.Fatal(err)
	}

	switch result.EventType {
	case voice.VADSpeechStart:
		fmt.Println("Speech started")
	case voice.VADSpeechEnd:
		fmt.Println("Speech ended")
	case voice.VADSilence:
		// Ongoing silence
	}
}
```

Threshold Tuning
Lower thresholds increase sensitivity (more false positives), while higher thresholds reduce sensitivity (more missed speech):

- `0.3`: High sensitivity, suitable for quiet environments
- `0.5`: Balanced default for general use
- `0.7`: Low sensitivity, rejects ambient noise
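
To make the trade-off concrete, here is a minimal sketch of how a threshold could turn per-frame speech probabilities into start and end transitions. The `tracker` type, the event names, and the `>=` comparison are all assumptions made for illustration; they are not the provider's implementation, and the probabilities are made up.

```go
package main

import "fmt"

// Event names are illustrative stand-ins, not the voice package's constants.
type Event string

const (
	SpeechStart Event = "speech_start"
	SpeechEnd   Event = "speech_end"
	Speech      Event = "speech"
	Silence     Event = "silence"
)

// tracker is a hypothetical state machine: it compares each frame's speech
// probability with the threshold and reports state transitions.
type tracker struct {
	threshold float64
	speaking  bool
}

// next consumes one probability and returns the resulting event.
func (t *tracker) next(prob float64) Event {
	isSpeech := prob >= t.threshold // assumed comparison
	switch {
	case isSpeech && !t.speaking:
		t.speaking = true
		return SpeechStart
	case !isSpeech && t.speaking:
		t.speaking = false
		return SpeechEnd
	case isSpeech:
		return Speech
	default:
		return Silence
	}
}

// countStarts reports how many speech segments a threshold yields.
func countStarts(threshold float64, probs []float64) int {
	t := tracker{threshold: threshold}
	n := 0
	for _, p := range probs {
		if t.next(p) == SpeechStart {
			n++
		}
	}
	return n
}

func main() {
	// Made-up probabilities: a brief noise blip, then a clear utterance.
	probs := []float64{0.1, 0.4, 0.1, 0.8, 0.9, 0.6, 0.2}
	for _, th := range []float64{0.3, 0.5, 0.7} {
		fmt.Printf("threshold %.1f: %d segment(s)\n", th, countStarts(th, probs))
	}
}
```

With this input, a threshold of 0.3 treats the 0.4 blip as a separate speech segment, while 0.5 and 0.7 ignore it. At 0.7 the utterance also ends one frame earlier, because the 0.6 frame falls below the threshold.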
Direct Construction
For compile-time type safety, construct the provider directly:

```go
import "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"

vad, err := silero.New(silero.Config{
	Threshold:  0.5,
	SampleRate: 16000,
	ModelPath:  "/models/silero_vad.onnx",
})
```

Error Handling
```go
result, err := vad.DetectActivity(ctx, audioPCM)
if err != nil {
	log.Printf("VAD detection failed: %v", err)
	// Handle the error; the audio frame may be malformed
}
```

Audio data must be 16-bit little-endian PCM. Frames shorter than 2 bytes return a silence result without error.
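
For readers unfamiliar with the input format, the sketch below decodes 16-bit little-endian PCM into samples and computes a normalized RMS energy, the kind of measure an energy-based detector might use. `decodePCM16LE` and `rmsEnergy` are hypothetical helpers written for this illustration; the provider's actual fallback and its calibration are not shown in this document and may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodePCM16LE converts 16-bit little-endian PCM bytes into samples.
// A trailing odd byte is ignored, so a 1-byte frame yields no samples.
func decodePCM16LE(pcm []byte) []int16 {
	samples := make([]int16, len(pcm)/2)
	for i := range samples {
		samples[i] = int16(binary.LittleEndian.Uint16(pcm[2*i:]))
	}
	return samples
}

// rmsEnergy returns the root-mean-square amplitude normalized to [0, 1].
// An empty frame has zero energy.
func rmsEnergy(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		v := float64(s) / 32768.0
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(samples)))
}

func main() {
	silence := make([]byte, 320) // 10 ms at 16 kHz, all zeros

	loud := make([]byte, 320)
	for i := 0; i < len(loud); i += 2 {
		binary.LittleEndian.PutUint16(loud[i:], 16384) // half of full scale
	}

	fmt.Printf("silence: %.2f\n", rmsEnergy(decodePCM16LE(silence)))
	fmt.Printf("loud:    %.2f\n", rmsEnergy(decodePCM16LE(loud)))
	fmt.Printf("1 byte:  %.2f\n", rmsEnergy(decodePCM16LE([]byte{0x01})))
}
```

In this sketch a frame shorter than 2 bytes decodes to zero samples and therefore zero energy, which is consistent with the documented behavior of returning a silence result rather than an error.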