Noise-Resistant Voice Activity Detection
Field-service applications, factory floors, and vehicle environments require voice activity detection (VAD) that stays accurate despite high background noise (85-100+ dB). Standard energy-based VAD — which simply checks if the audio signal exceeds an amplitude threshold — cannot distinguish speech from mechanical noise, fan hum, or engine vibration at these levels, producing 22% false positive rates. Every false trigger wastes downstream STT compute and can cause incorrect command activations in safety-critical environments.
The fundamental limitation of energy-based VAD is that it treats all sound energy equally. Model-based VAD (Silero, RNNoise) learns spectral patterns that distinguish human speech from environmental noise, maintaining accuracy where energy thresholds fail. Using Beluga AI’s VAD with Silero and RNNoise model-based providers, false triggers drop to 6% while maintaining sub-60ms latency.
Solution Architecture
Section titled “Solution Architecture”graph TB
Mic[Microphone] --> Pre[Preprocess]
Pre --> VAD[VAD Provider]
VAD --> Silero[Silero / RNNoise]
Silero --> Decision[Speech?]
Decision --> App[App Logic]
VAD --> Metrics[OTel Metrics]
Audio is optionally preprocessed (normalization), then passed to the VAD provider. Silero or RNNoise runs on each audio frame, and threshold/duration parameters filter spurious activations. Results feed application logic (wake word, push-to-talk, command trigger). OpenTelemetry records decisions and errors for tuning dashboards.
The preprocessing step is optional but important in noisy environments: normalizing audio amplitude ensures the VAD model receives input within its expected range, regardless of microphone gain or distance. Without normalization, the same voice at different distances can produce dramatically different model outputs.
Implementation
Section titled “Implementation”Silero VAD Setup
Section titled “Silero VAD Setup”The Silero VAD is configured with a higher threshold (0.55 vs. the default 0.5) and longer MinSpeechDuration (200ms) for noisy environments. The higher threshold reduces false triggers from noise bursts, while the longer minimum speech duration filters out brief non-speech sounds that the model might score above threshold. These parameters are deliberately exposed as configuration rather than hardcoded, since optimal values vary by deployment environment.
package main
import ( "context" "time"
"github.com/lookatitude/beluga-ai/voice")
func setupNoiseResistantVAD(ctx context.Context) (voice.FrameProcessor, error) { vad := voice.NewSileroVAD(voice.VADConfig{ Threshold: 0.55, // Tuned for noise environments MinSpeechDuration: 200 * time.Millisecond, MaxSilenceDuration: 600 * time.Millisecond, SampleRate: 16000, EnablePreprocessing: true, // Normalize input for models })
return vad, nil}Real-Time Processing
Section titled “Real-Time Processing”func processAudioStream(ctx context.Context, vad voice.FrameProcessor, audioStream <-chan []byte) error { in := make(chan voice.Frame) out := make(chan voice.Frame)
go func() { defer close(in) for audio := range audioStream { in <- voice.NewAudioFrame(audio, 16000) } }()
go func() { defer close(out) if err := vad.Process(ctx, in, out); err != nil { return } }()
for frame := range out { if isSpeechFrame(frame) { // Trigger application logic: command recognition, recording start, etc. handleSpeechDetected(ctx, frame) } }
return nil}Provider Selection
Section titled “Provider Selection”Choose the VAD provider based on your deployment constraints:
| Provider | Accuracy in Noise | Resource Usage | Dependency |
|---|---|---|---|
| Silero | High | Medium (ONNX runtime) | ONNX model file |
| RNNoise | Good | Low | No external model |
| WebRTC | Moderate | Very Low | None |
- Silero: Best accuracy in high-noise environments; requires ONNX model distribution
- RNNoise: Good balance of accuracy and resource usage; no external model needed
- WebRTC: Lightweight fallback for low-noise environments
Tuning Guide
Section titled “Tuning Guide”| Parameter | Range | Effect |
|---|---|---|
| Threshold | 0.4-0.7 | Higher = fewer false positives, more missed speech |
| MinSpeechDuration | 100-300ms | Higher = filters more noise bursts |
| MaxSilenceDuration | 300-800ms | Higher = merges closely spaced utterances |
| EnablePreprocessing | true/false | Normalizes audio input for better model performance |
Recommended starting point for industrial environments: Threshold=0.55, MinSpeechDuration=200ms, EnablePreprocessing=true.
Deployment Considerations
Section titled “Deployment Considerations”- Per-site profiles: Capture threshold and duration settings per deployment (factory vs. vehicle vs. outdoor)
- Calibration phase: Run a short calibration on first deployment to adapt to ambient noise levels
- Model distribution: Ensure reliable distribution of ONNX model files to edge and server targets
- A/B testing: Compare Silero vs. RNNoise performance in your specific environments using metrics
- Observability: Track false trigger rate, missed speech rate, and decision latency with OpenTelemetry
Results
Section titled “Results”| Metric | Before | After | Improvement |
|---|---|---|---|
| False trigger rate | 22% | 6% | 73% reduction |
| Missed speech rate | 15% | 4% | 73% reduction |
| P95 latency | 80ms | 55ms | 31% reduction |
Lessons Learned
Section titled “Lessons Learned”- Model-based VAD in noise: Silero provided clear improvement over energy-only detection in high-noise environments
- Threshold range: 0.5-0.6 worked well across most environments; document per-site profiles
- Metrics-driven tuning: Observability data made it straightforward to validate tuning and spot regressions
Related Resources
Section titled “Related Resources”- Multi-Speaker Segmentation for multi-speaker scenarios
- Barge-In Detection for VAD in barge-in detection
- Industrial Control for voice commands in noisy environments