# Custom VAD with Silero Models
Silero VAD is a neural-network-based voice activity detector that runs locally using ONNX Runtime. Unlike energy-based VAD methods, which rely on volume thresholds and therefore trigger on any loud noise such as keyboard typing or a closing door, neural VAD analyzes spectral features of the audio to distinguish human speech from non-speech sounds. This makes it significantly more accurate in real-world environments. This tutorial demonstrates how to configure Silero VAD with custom model paths, tune thresholds, and integrate it with streaming audio pipelines.
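To see why volume thresholds fall short, here is a minimal sketch of the energy-based approach the neural model improves on. The `energyVAD` function and its threshold value are illustrative, not part of any library: it computes the RMS energy of a 16-bit PCM frame and fires on anything loud, speech or not.

```go
package main

import (
	"fmt"
	"math"
)

// energyVAD is the naive volume-threshold approach: it computes the RMS
// energy of one frame of 16-bit PCM samples and compares it to a fixed
// threshold. Any sufficiently loud sound, speech or a slamming door,
// exceeds the threshold, which is the failure mode neural VAD avoids.
func energyVAD(samples []int16, threshold float64) bool {
	if len(samples) == 0 {
		return false
	}
	var sum float64
	for _, s := range samples {
		sum += float64(s) * float64(s)
	}
	rms := math.Sqrt(sum / float64(len(samples)))
	return rms > threshold
}

func main() {
	silence := make([]int16, 512) // all zeros
	loud := make([]int16, 512)
	for i := range loud {
		loud[i] = 8000 // a loud burst, regardless of whether it is speech
	}
	fmt.Println(energyVAD(silence, 500.0)) // false
	fmt.Println(energyVAD(loud, 500.0))    // true: fires on any loud noise
}
```

A neural model classifies both bursts by their spectral shape instead, which is why the rest of this tutorial uses Silero rather than an energy gate.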
## What You Will Build

A voice activity detection system using Silero's ONNX model that classifies audio frames as speech or non-speech, with real-time streaming support and configurable sensitivity parameters.
## Prerequisites

- Go 1.23+
- Silero VAD ONNX model file (downloadable from the Silero VAD repository)
- Completion of Sensitivity Tuning is recommended
## Step 1: Create a Silero VAD Provider

The VAD provider follows Beluga's standard registry pattern. The `FrameSize` option controls how many audio samples are processed per inference call. A frame size of 512 samples at 16 kHz represents 32 ms of audio, which is the standard frame duration for real-time speech processing. Smaller frames provide faster detection at the cost of slightly lower accuracy per frame.
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	"github.com/lookatitude/beluga-ai/voice/vad"
	_ "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"
)

func main() {
	ctx := context.Background()

	modelPath := os.Getenv("SILERO_VAD_MODEL_PATH")
	if modelPath == "" {
		modelPath = filepath.Join(os.TempDir(), "silero_vad.onnx")
	}

	provider, err := vad.NewProvider(ctx, "silero",
		vad.DefaultConfig(),
		vad.WithModelPath(modelPath),
		vad.WithThreshold(0.5),
		vad.WithSampleRate(16000),
		vad.WithFrameSize(512),
	)
	if err != nil {
		log.Fatalf("create VAD provider: %v", err)
	}

	// Process a single audio frame (512 samples of 16-bit PCM = 1024 bytes)
	audio := make([]byte, 1024)
	hasSpeech, err := provider.Process(ctx, audio)
	if err != nil {
		log.Fatalf("process: %v", err)
	}

	fmt.Printf("Speech detected: %v\n", hasSpeech)
}
```
### Configuration Options

| Option | Default | Description |
|---|---|---|
| `ModelPath` | (built-in) | Path to the Silero VAD ONNX model file |
| `Threshold` | 0.5 | Speech confidence threshold (0.0-1.0) |
| `SampleRate` | 16000 | Audio sample rate in Hz |
| `FrameSize` | 512 | Number of samples per audio frame |
| `MinSpeechDuration` | 250ms | Minimum consecutive speech duration to confirm speech |
| `MaxSilenceDuration` | 500ms | Maximum silence within speech before marking end |
| `EnablePreprocessing` | false | Apply preprocessing (normalization) before inference |
The `MinSpeechDuration` parameter prevents single-frame false positives from triggering speech detection. By requiring multiple consecutive frames of detected speech, the provider filters out transient noise spikes that happen to exceed the confidence threshold.
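The consecutive-frame logic can be sketched as a small state machine. The `SpeechDebouncer` type below is a hypothetical illustration of what `MinSpeechDuration` does internally, not the Beluga implementation: at 32 ms per frame, a 250 ms minimum corresponds to roughly 8 consecutive speech frames.

```go
package main

import "fmt"

// SpeechDebouncer illustrates MinSpeechDuration-style filtering:
// per-frame detections only become a confirmed speech state once enough
// consecutive speech frames have accumulated. (Hypothetical type, not
// part of the Beluga API.)
type SpeechDebouncer struct {
	minFrames int // e.g. 250ms / 32ms per frame ~= 8 frames
	streak    int
}

// Update feeds one per-frame result and reports whether speech is confirmed.
func (d *SpeechDebouncer) Update(frameHasSpeech bool) bool {
	if frameHasSpeech {
		d.streak++
	} else {
		d.streak = 0 // a single non-speech frame resets the streak
	}
	return d.streak >= d.minFrames
}

func main() {
	d := &SpeechDebouncer{minFrames: 8}

	// A 3-frame noise spike never confirms speech.
	spike := false
	for i := 0; i < 3; i++ {
		spike = d.Update(true)
	}
	d.Update(false) // silence resets the streak
	fmt.Println(spike) // false

	// Sustained speech confirms after 8 consecutive frames.
	confirmed := false
	for i := 0; i < 8; i++ {
		confirmed = d.Update(true)
	}
	fmt.Println(confirmed) // true
}
```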
## Step 2: Real-Time Streaming Detection

For continuous audio processing, use `ProcessStream` to feed audio frames through a channel and receive VAD results in real time. The channel-based API is used here instead of `iter.Seq2` because VAD processing is inherently bidirectional: your application sends audio frames concurrently with receiving results, and both directions operate at their own pace.
```go
audioCh := make(chan []byte, 8)
resultCh, err := provider.ProcessStream(ctx, audioCh)
if err != nil {
	log.Fatalf("start process stream: %v", err)
}

// Consume results in a separate goroutine
go func() {
	for result := range resultCh {
		if result.Error != nil {
			log.Printf("VAD error: %v", result.Error)
			continue
		}
		if result.HasVoice {
			fmt.Printf("Speech detected (confidence: %.2f)\n", result.Confidence)
		}
	}
}()

// Feed audio frames from your source
for frame := range audioSource {
	audioCh <- frame
}
close(audioCh) // Signal end of audio
```

The `VADResult` struct provides both the binary detection result and the underlying confidence score, allowing your application to implement custom thresholding logic if needed:
| Field | Type | Description |
|---|---|---|
| `HasVoice` | bool | Whether speech was detected |
| `Confidence` | float64 | Model confidence score (0.0-1.0) |
| `Error` | error | Non-nil if processing failed |
## Step 3: Threshold Tuning by Environment

The threshold determines how confidently the model must classify a frame as speech before reporting it. The right threshold depends on your deployment environment because background noise characteristics vary significantly. A quiet office has occasional low-frequency HVAC noise, while a call center has continuous speech from adjacent operators.
```go
// Quiet environment: lower threshold captures softer speech
quietVAD, err := vad.NewProvider(ctx, "silero",
	vad.DefaultConfig(),
	vad.WithModelPath(modelPath),
	vad.WithThreshold(0.4),
	vad.WithMinSpeechDuration(200*time.Millisecond),
)

// Noisy environment: higher threshold reduces false positives
noisyVAD, err := vad.NewProvider(ctx, "silero",
	vad.DefaultConfig(),
	vad.WithModelPath(modelPath),
	vad.WithThreshold(0.7),
	vad.WithMinSpeechDuration(350*time.Millisecond),
	vad.WithMaxSilenceDuration(600*time.Millisecond),
)
```
### Tuning Guidelines

| Environment | Threshold | MinSpeechDuration | MaxSilenceDuration |
|---|---|---|---|
| Quiet office | 0.3-0.4 | 200ms | 400ms |
| Standard room | 0.5 | 250ms | 500ms |
| Open office | 0.6 | 300ms | 500ms |
| Call center | 0.7-0.8 | 350ms | 600ms |
## Step 4: Combine with Noise Cancellation

For the best results in noisy environments, apply noise cancellation before VAD processing. This two-stage approach (clean the audio, then detect speech) is more effective than raising the VAD threshold alone, because noise cancellation removes the noise signal entirely rather than requiring the model to classify through it.
import ( "github.com/lookatitude/beluga-ai/voice/noise" _ "github.com/lookatitude/beluga-ai/voice/noise/providers/rnnoise")
func processWithNoiseCancellation(ctx context.Context, audio []byte) (bool, error) { // First, reduce noise noiseCanceller, err := noise.NewProvider(ctx, "rnnoise", noise.DefaultConfig()) if err != nil { return false, fmt.Errorf("create noise canceller: %w", err) }
cleanAudio, err := noiseCanceller.Process(ctx, audio) if err != nil { return false, fmt.Errorf("noise cancellation: %w", err) }
// Then, detect voice activity on clean audio vadProvider, err := vad.NewProvider(ctx, "silero", vad.DefaultConfig(), vad.WithThreshold(0.5), ) if err != nil { return false, fmt.Errorf("create VAD: %w", err) }
return vadProvider.Process(ctx, cleanAudio)}Verification
## Verification

- Set `SILERO_VAD_MODEL_PATH` to your ONNX model path (or omit it for the built-in model).
- Run the application and verify `Process` returns `false` for silence.
- Feed speech audio and confirm `Process` returns `true`.
- Test `ProcessStream` with continuous audio and verify real-time results.
- Compare detection accuracy at different threshold levels.
## Next Steps

- Sensitivity Tuning — Tune VAD alongside turn detection
- ML-Based Turn Prediction — Neural turn-end detection
- Voice Session Interruptions — Use VAD for barge-in detection