ONNX Edge VAD (Silero)

Running VAD on the edge eliminates the network round-trip for speech detection, which reduces latency and keeps audio data local for privacy-sensitive deployments. Silero VAD with ONNX Runtime enables voice activity detection on resource-constrained devices such as Raspberry Pi, embedded gateways, and kiosk hardware. Choose this approach when you need offline-capable VAD or when sending raw audio to a cloud service is not acceptable. This guide covers configuring the Silero VAD provider for low-resource environments.

The voice/vad package’s Silero provider loads an ONNX model to detect speech in audio frames. By tuning Threshold, FrameSize, and SampleRate, you can balance detection accuracy against CPU and memory usage on edge hardware.
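
The relationship between FrameSize and SampleRate can be made concrete with a little arithmetic. The sketch below is not part of the voice/vad package; it assumes mono 16-bit PCM input (2 bytes per sample), which this guide does not state explicitly, and the helper names are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// frameDuration returns how much audio a single VAD frame covers.
func frameDuration(frameSize, sampleRate int) time.Duration {
	return time.Duration(frameSize) * time.Second / time.Duration(sampleRate)
}

// frameBytes returns the buffer size for one frame of mono PCM audio,
// given the width of each sample in bytes (2 for 16-bit PCM, an assumption).
func frameBytes(frameSize, bytesPerSample int) int {
	return frameSize * bytesPerSample
}

// framesPerSecond returns how many frames must be processed each second
// to keep up with a real-time stream.
func framesPerSecond(frameSize, sampleRate int) float64 {
	return float64(sampleRate) / float64(frameSize)
}

func main() {
	fmt.Println(frameDuration(512, 16000))   // 32ms
	fmt.Println(frameBytes(512, 2))          // 1024
	fmt.Println(framesPerSecond(512, 16000)) // 31.25
}
```

With the values used in this guide, each frame covers 32 ms of audio, so a real-time stream needs roughly 31 detection calls per second. That call rate is the main driver of CPU cost on small devices.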

  • Go 1.23 or later
  • Silero VAD ONNX model file (silero_vad.onnx)
  • ONNX Runtime libraries available on target platform
go get github.com/lookatitude/beluga-ai

Ensure the ONNX model is available on the edge device (bundled in the binary, downloaded at startup, or placed in read-only storage).

Create a Silero VAD provider optimized for edge deployment:

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	"github.com/lookatitude/beluga-ai/voice/vad"
	_ "github.com/lookatitude/beluga-ai/voice/vad/providers/silero"
)

func main() {
	ctx := context.Background()

	// Take the model path from the environment, falling back to a fixed absolute path.
	modelPath := os.Getenv("SILERO_VAD_MODEL_PATH")
	if modelPath == "" {
		modelPath = filepath.Join("/opt/vad", "silero_vad.onnx")
	}

	cfg := vad.DefaultConfig()
	provider, err := vad.NewProvider(ctx, "silero", cfg,
		vad.WithModelPath(modelPath),
		vad.WithThreshold(0.5),
		vad.WithSampleRate(16000),
		vad.WithFrameSize(512),
		vad.WithMinSpeechDuration(200*time.Millisecond),
		vad.WithMaxSilenceDuration(500*time.Millisecond),
	)
	if err != nil {
		log.Fatalf("Failed to create VAD provider: %v", err)
	}

	// One frame of silence: 1024 bytes holds 512 samples if the input is 16-bit PCM.
	audio := make([]byte, 1024)
	speech, err := provider.Process(ctx, audio)
	if err != nil {
		log.Fatalf("Processing failed: %v", err)
	}
	fmt.Printf("Speech detected: %v\n", speech)
}

Use ProcessStream when processing a live microphone stream on the edge. Feed audio chunks from a capture loop and consume VADResult values for downstream logic:

audioCh := make(chan []byte, 16)
resultCh, err := provider.ProcessStream(ctx, audioCh)
if err != nil {
	log.Fatalf("Stream setup failed: %v", err)
}

go func() {
	for result := range resultCh {
		if result.Speech {
			// Forward speech frames to STT or processing pipeline
		}
	}
}()

// Feed audioCh from microphone capture loop
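
The capture loop that feeds audioCh is left open above. One possible shape for it is sketched below using only the standard library: it reads raw PCM from any io.Reader and sends fixed-size frames on a channel. The names feedFrames and countFrames are hypothetical, the byte slice stands in for a real microphone source, and dropping a trailing partial frame is a design choice made here, not documented provider behavior.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// feedFrames reads raw PCM from r and sends frames of frameBytes bytes on out.
// It closes out when the source ends. A trailing partial frame is dropped.
func feedFrames(r io.Reader, frameBytes int, out chan<- []byte) error {
	defer close(out)
	if frameBytes <= 0 {
		return errors.New("frameBytes must be positive")
	}
	for {
		// Allocate a fresh buffer per frame because the receiver may keep it.
		frame := make([]byte, frameBytes)
		_, err := io.ReadFull(r, frame)
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // end of stream; any partial frame is discarded
		}
		if err != nil {
			return err
		}
		out <- frame // blocks if the consumer falls behind
	}
}

// countFrames runs feedFrames over an in-memory buffer and counts the frames
// that arrive, standing in for a real consumer such as a VAD stream.
func countFrames(pcm []byte, frameBytes int) int {
	audioCh := make(chan []byte, 16)
	go feedFrames(bytes.NewReader(pcm), frameBytes, audioCh)
	n := 0
	for range audioCh {
		n++
	}
	return n
}

func main() {
	// 2500 bytes of silence: two full 1024-byte frames plus a partial one.
	fmt.Println("frames:", countFrames(make([]byte, 2500), 1024)) // frames: 2
}
```

In a real deployment the reader would wrap the audio capture device, and the channel passed to feedFrames would be the same audioCh handed to ProcessStream. Because the send blocks when the buffer is full, a slow consumer applies back-pressure to the capture loop rather than growing memory without bound.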

The Silero provider loads the ONNX model on first use (lazy loading). To avoid first-request latency, warm up with a dummy Process call during initialization:

// Warm up model during init
_, _ = provider.Process(ctx, make([]byte, 1024))
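
| Option             | Description          | Default | Edge Notes                                      |
|--------------------|----------------------|---------|-------------------------------------------------|
| ModelPath          | Path to ONNX model   | -       | Use local storage or /tmp                       |
| Threshold          | Detection threshold  | 0.5     | 0.5-0.6 typical                                 |
| FrameSize          | Frame size (samples) | 512     | Larger = fewer inference calls, coarser timing  |
| SampleRate         | Sample rate (Hz)     | 16000   | Match input audio                               |
| MinSpeechDuration  | Min speech duration  | 250 ms  | Tune to reduce false triggers                   |
| MaxSilenceDuration | Max silence duration | 500 ms  | Tune for turn-taking behavior                   |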
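
To build intuition for MinSpeechDuration and MaxSilenceDuration, the sketch below shows one common way such durations are applied to per-frame decisions. It is an illustrative state machine and not the Silero provider's actual logic, which this guide does not describe. The assumed semantics are that speech is confirmed only after it has lasted at least the minimum speech duration, and that it ends only after silence has lasted at least the maximum silence duration.

```go
package main

import (
	"fmt"
	"time"
)

// smoother turns raw per-frame decisions into a stable speaking state.
type smoother struct {
	frameDur   time.Duration // audio covered by one frame
	minSpeech  time.Duration // speech must persist this long to start
	maxSilence time.Duration // silence must persist this long to stop
	speechRun  time.Duration // consecutive speech seen while idle
	silenceRun time.Duration // consecutive silence seen while speaking
	speaking   bool
}

// newSmoother builds a smoother from durations given in milliseconds.
func newSmoother(frameMs, minSpeechMs, maxSilenceMs int) *smoother {
	return &smoother{
		frameDur:   time.Duration(frameMs) * time.Millisecond,
		minSpeech:  time.Duration(minSpeechMs) * time.Millisecond,
		maxSilence: time.Duration(maxSilenceMs) * time.Millisecond,
	}
}

// update consumes one raw frame decision and returns the smoothed state.
func (s *smoother) update(rawSpeech bool) bool {
	if s.speaking {
		if rawSpeech {
			s.silenceRun = 0
			return true
		}
		s.silenceRun += s.frameDur
		if s.silenceRun >= s.maxSilence {
			s.speaking = false
			s.silenceRun = 0
		}
		return s.speaking
	}
	if !rawSpeech {
		s.speechRun = 0 // a short blip is forgotten
		return false
	}
	s.speechRun += s.frameDur
	if s.speechRun >= s.minSpeech {
		s.speaking = true
		s.speechRun = 0
	}
	return s.speaking
}

func main() {
	s := newSmoother(32, 250, 500) // 32 ms frames, table defaults
	for i := 1; i <= 8; i++ {
		fmt.Printf("speech frame %d -> speaking=%v\n", i, s.update(true))
	}
}
```

With 32 ms frames and the defaults from the table, the eighth consecutive speech frame (256 ms) is the first to report speaking, and it takes sixteen consecutive silent frames (512 ms) to end the segment. Raising MinSpeechDuration filters out more short noises at the cost of a later start; raising MaxSilenceDuration tolerates longer pauses before a turn is considered finished.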

If the model fails to load, verify the path and file permissions and confirm that the model file exists on the device. Use absolute paths. Also check available disk space and memory.
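
The file-level checks can be automated with a preflight function that runs before the provider is created. The sketch below uses only the standard library; checkModelFile is a hypothetical helper, not part of the voice/vad package, and it does not cover disk space or memory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkModelFile reports the first file-level problem that would be likely
// to stop the model from loading.
func checkModelFile(path string) error {
	if !filepath.IsAbs(path) {
		return fmt.Errorf("model path %q is not absolute", path)
	}
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("model file not accessible: %w", err)
	}
	if !info.Mode().IsRegular() {
		return fmt.Errorf("model path %q is not a regular file", path)
	}
	if info.Size() == 0 {
		return fmt.Errorf("model file %q is empty", path)
	}
	// Opening the file confirms that the process has read permission.
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("model file not readable: %w", err)
	}
	return f.Close()
}

func main() {
	path := os.Getenv("SILERO_VAD_MODEL_PATH")
	if path == "" {
		path = "/opt/vad/silero_vad.onnx"
	}
	if err := checkModelFile(path); err != nil {
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("model file looks usable:", path)
}
```

Running a check like this at startup turns an opaque load failure into a specific message, which helps on headless devices where logs are the only diagnostic channel.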

If CPU usage is too high, reduce the effective frame rate (for example, process only every second frame), increase FrameSize slightly, or use a smaller Silero variant if one is available. Profile with pprof to identify bottlenecks.
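
Processing only every second frame can be done with a thin wrapper around the detection call. The sketch below is a hypothetical helper, not part of the voice/vad package: it runs the detector on every nth frame and reuses the previous decision in between. The detector is passed in as a plain function so the example stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// frameSkipper runs detect on every nth frame and repeats the last decision
// for the frames it skips.
type frameSkipper struct {
	n      int // run detection on every nth frame
	count  int // frames seen so far
	last   bool
	detect func([]byte) (bool, error)
}

// newFrameSkipper returns a skipper; n must be at least 1.
func newFrameSkipper(n int, detect func([]byte) (bool, error)) (*frameSkipper, error) {
	if n < 1 || detect == nil {
		return nil, errors.New("n must be >= 1 and detect must be non-nil")
	}
	return &frameSkipper{n: n, detect: detect}, nil
}

// process returns a speech decision for the frame, calling detect only when due.
func (f *frameSkipper) process(frame []byte) (bool, error) {
	due := f.count%f.n == 0
	f.count++
	if !due {
		return f.last, nil
	}
	speech, err := f.detect(frame)
	if err != nil {
		return false, err
	}
	f.last = speech
	return speech, nil
}

func main() {
	calls := 0
	// Stand-in detector that counts calls; a real one would wrap the provider.
	fs, err := newFrameSkipper(2, func([]byte) (bool, error) {
		calls++
		return true, nil
	})
	if err != nil {
		panic(err)
	}
	for i := 0; i < 10; i++ {
		if _, err := fs.process(make([]byte, 1024)); err != nil {
			panic(err)
		}
	}
	fmt.Println("detector calls:", calls) // detector calls: 5
}
```

To use this with the provider, the detect function would call provider.Process with a captured context, assuming Process returns a boolean and an error as the earlier example suggests. Skipping frames roughly halves inference cost at n = 2 but delays detection by up to one frame, and it may affect accuracy if the model keeps internal state between frames, so measure the result on representative audio.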

If the ONNX Runtime libraries cannot be found at build time or at startup, ensure they are installed or bundled for your target OS and architecture, and build with the appropriate CGO tags.

  • Warm up the model during initialization to avoid first-request latency
  • Monitor model load time, Process latency, and memory usage
  • Plan for model updates (file replacement + restart) without breaking active sessions
  • Use absolute model paths to avoid working directory issues on embedded systems
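
As a starting point for the monitoring item above, the sketch below records call latency and Go heap usage with the standard library only. The type and function names are hypothetical. Note that memory allocated by native libraries through cgo is not visible in Go's runtime statistics, so process-level memory (for example, resident set size reported by the operating system) should be tracked separately.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// latencyStats tracks the count, total, and worst observed call duration.
type latencyStats struct {
	mu    sync.Mutex
	count int64
	total time.Duration
	worst time.Duration
}

// timed runs fn, records how long it took, and returns fn's error.
func (l *latencyStats) timed(fn func() error) error {
	start := time.Now()
	err := fn()
	elapsed := time.Since(start)

	l.mu.Lock()
	defer l.mu.Unlock()
	l.count++
	l.total += elapsed
	if elapsed > l.worst {
		l.worst = elapsed
	}
	return err
}

// snapshot returns the number of calls, the mean latency, and the worst latency.
func (l *latencyStats) snapshot() (count int64, mean, worst time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.count > 0 {
		mean = l.total / time.Duration(l.count)
	}
	return l.count, mean, l.worst
}

// goHeapBytes returns the bytes of live Go heap objects.
// Native (cgo) allocations are not included.
func goHeapBytes() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	var stats latencyStats
	for i := 0; i < 100; i++ {
		// Stand-in for a detection call; a real one would wrap the provider.
		_ = stats.timed(func() error {
			time.Sleep(100 * time.Microsecond)
			return nil
		})
	}
	n, mean, worst := stats.snapshot()
	fmt.Printf("calls=%d mean=%v worst=%v goHeap=%dB\n", n, mean, worst, goHeapBytes())
}
```

Wrapping the warm-up call in the same timer gives a rough measure of model load time, since the first call includes the lazy load. For real-time audio, the figure to watch is whether the worst-case latency stays below the duration of one frame.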