Skip to content
Docs

Silero VAD Provider

The Silero provider implements voice activity detection using the Silero VAD ONNX model. When the ONNX runtime is unavailable, it falls back to an energy-based detector calibrated to approximate Silero’s sensitivity. This provider requires CGO.

Choose Silero when you need high-accuracy, neural network-based voice activity detection. The ONNX model provides more precise speech/silence classification than energy-based methods, especially in noisy environments. Note that it requires CGO. For a pure Go alternative with no CGO dependency, use WebRTC.

Terminal window
go get github.com/lookatitude/beluga-ai/voice/vad/providers/silero

This package uses the cgo build tag. Ensure CGO is enabled in your build environment.

FieldRequiredDefaultDescription
ThresholdNo0.5Speech probability threshold (0.0-1.0)
SampleRateNo16000Audio sample rate in Hz (8000 or 16000)
ModelPathNoPath to Silero VAD ONNX model file

Registry configuration keys:

KeyTypeMaps to
thresholdfloat64Threshold
sample_rateintSampleRate
model_pathstringModelPath
package main
import (
"context"
"fmt"
"log"
"github.com/lookatitude/beluga-ai/voice"
_ "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"
)
func main() {
vad, err := voice.NewVAD("silero", map[string]any{
"threshold": 0.5,
"sample_rate": 16000,
})
if err != nil {
log.Fatal(err)
}
// audioPCM is 16-bit little-endian PCM audio data
var audioPCM []byte // ... obtained from audio source
result, err := vad.DetectActivity(context.Background(), audioPCM)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Speech: %v, Event: %s, Confidence: %.2f\n",
result.IsSpeech, result.EventType, result.Confidence)
}

When a ModelPath is provided, the Silero provider loads the ONNX model for neural network-based speech detection. Without a model path, it uses an energy-based fallback calibrated to approximate Silero’s output:

vad, err := voice.NewVAD("silero", map[string]any{
"threshold": 0.5,
"model_path": "/models/silero_vad.onnx",
})

The provider tracks speech state internally and emits transition events:

for _, chunk := range audioChunks {
result, err := vad.DetectActivity(ctx, chunk)
if err != nil {
log.Fatal(err)
}
switch result.EventType {
case voice.VADSpeechStart:
fmt.Println("Speech started")
case voice.VADSpeechEnd:
fmt.Println("Speech ended")
case voice.VADSilence:
// Ongoing silence
}
}

Lower thresholds increase sensitivity (more false positives), while higher thresholds reduce sensitivity (more missed speech):

  • 0.3 — High sensitivity, suitable for quiet environments
  • 0.5 — Balanced default for general use
  • 0.7 — Low sensitivity, rejects ambient noise

For compile-time type safety, construct the provider directly:

import "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"
vad, err := silero.New(silero.Config{
Threshold: 0.5,
SampleRate: 16000,
ModelPath: "/models/silero_vad.onnx",
})
result, err := vad.DetectActivity(ctx, audioPCM)
if err != nil {
log.Printf("VAD detection failed: %v", err)
// Handle error — audio frame may be malformed
}

Audio data must be 16-bit little-endian PCM. Frames shorter than 2 bytes return a silence result without error.