
Noise-Resistant Voice Activity Detection

Field-service applications, factory floors, and vehicle environments require voice activity detection (VAD) that stays accurate despite high background noise (85-100+ dB). Standard energy-based VAD, which simply checks whether the audio signal exceeds an amplitude threshold, cannot distinguish speech from mechanical noise, fan hum, or engine vibration at these levels, producing a 22% false positive rate. Every false trigger wastes downstream STT compute and can cause incorrect command activations in safety-critical environments.

The fundamental limitation of energy-based VAD is that it treats all sound energy equally. Model-based VAD (Silero, RNNoise) learns spectral patterns that distinguish human speech from environmental noise, maintaining accuracy where energy thresholds fail. Using Beluga AI’s VAD with Silero and RNNoise model-based providers, false triggers drop to 6% while maintaining sub-60ms latency.

```mermaid
graph TB
    Mic[Microphone] --> Pre[Preprocess]
    Pre --> VAD[VAD Provider]
    VAD --> Silero[Silero / RNNoise]
    Silero --> Decision[Speech?]
    Decision --> App[App Logic]
    VAD --> Metrics[OTel Metrics]
```

Audio is optionally preprocessed (normalization), then passed to the VAD provider. Silero or RNNoise runs on each audio frame, and threshold/duration parameters filter spurious activations. Results feed application logic (wake word, push-to-talk, command trigger). OpenTelemetry records decisions and errors for tuning dashboards.

The preprocessing step is optional but important in noisy environments: normalizing audio amplitude ensures the VAD model receives input within its expected range, regardless of microphone gain or distance. Without normalization, the same voice at different distances can produce dramatically different model outputs.

The Silero VAD is configured with a higher threshold (0.55 vs. the default 0.5) and longer MinSpeechDuration (200ms) for noisy environments. The higher threshold reduces false triggers from noise bursts, while the longer minimum speech duration filters out brief non-speech sounds that the model might score above threshold. These parameters are deliberately exposed as configuration rather than hardcoded, since optimal values vary by deployment environment.

```go
package main

import (
	"context"
	"time"

	"github.com/lookatitude/beluga-ai/voice"
)

func setupNoiseResistantVAD(ctx context.Context) (voice.FrameProcessor, error) {
	vad := voice.NewSileroVAD(voice.VADConfig{
		Threshold:           0.55, // Tuned for noisy environments
		MinSpeechDuration:   200 * time.Millisecond,
		MaxSilenceDuration:  600 * time.Millisecond,
		SampleRate:          16000,
		EnablePreprocessing: true, // Normalize input for the model
	})
	return vad, nil
}

func processAudioStream(ctx context.Context, vad voice.FrameProcessor, audioStream <-chan []byte) error {
	in := make(chan voice.Frame)
	out := make(chan voice.Frame)
	errCh := make(chan error, 1)

	// Feed raw audio into the VAD input channel.
	go func() {
		defer close(in)
		for audio := range audioStream {
			in <- voice.NewAudioFrame(audio, 16000)
		}
	}()

	// Run the VAD; surface its error instead of discarding it.
	go func() {
		defer close(out)
		errCh <- vad.Process(ctx, in, out)
	}()

	for frame := range out {
		if isSpeechFrame(frame) {
			// Trigger application logic: command recognition, recording start, etc.
			handleSpeechDetected(ctx, frame)
		}
	}
	return <-errCh
}
```

Choose the VAD provider based on your deployment constraints:

| Provider | Accuracy in Noise | Resource Usage | Dependency |
|---|---|---|---|
| Silero | High | Medium (ONNX runtime) | ONNX model file |
| RNNoise | Good | Low | No external model |
| WebRTC | Moderate | Very Low | None |
  • Silero: Best accuracy in high-noise environments; requires ONNX model distribution
  • RNNoise: Good balance of accuracy and resource usage; no external model needed
  • WebRTC: Lightweight fallback for low-noise environments
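As a rough illustration of that decision, a selection helper could key off measured ambient noise and model availability; the dB cutoffs below are assumptions for this sketch, not library guidance:

```go
package main

import "fmt"

// chooseVADProvider maps deployment constraints to a provider name,
// following the trade-offs in the table above. The dB cutoffs are
// illustrative assumptions only.
func chooseVADProvider(ambientDB float64, hasONNXModel bool) string {
	switch {
	case ambientDB >= 85 && hasONNXModel:
		return "silero" // best accuracy in high noise; needs the ONNX model file
	case ambientDB >= 70:
		return "rnnoise" // good accuracy with no external model to distribute
	default:
		return "webrtc" // lightweight fallback for quieter environments
	}
}

func main() {
	fmt.Println(chooseVADProvider(95, true))  // factory floor, model deployed
	fmt.Println(chooseVADProvider(95, false)) // model file unavailable
	fmt.Println(chooseVADProvider(55, true))  // quiet office
}
```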
| Parameter | Range | Effect |
|---|---|---|
| Threshold | 0.4-0.7 | Higher = fewer false positives, more missed speech |
| MinSpeechDuration | 100-300 ms | Higher = filters more noise bursts |
| MaxSilenceDuration | 300-800 ms | Higher = merges closely spaced utterances |
| EnablePreprocessing | true/false | Normalizes audio input for better model performance |

Recommended starting point for industrial environments: Threshold=0.55, MinSpeechDuration=200ms, EnablePreprocessing=true.

  • Per-site profiles: Capture threshold and duration settings per deployment (factory vs. vehicle vs. outdoor)
  • Calibration phase: Run a short calibration on first deployment to adapt to ambient noise levels
  • Model distribution: Ensure reliable distribution of ONNX model files to edge and server targets
  • A/B testing: Compare Silero vs. RNNoise performance in your specific environments using metrics
  • Observability: Track false trigger rate, missed speech rate, and decision latency with OpenTelemetry
Results from replacing energy-based detection with the tuned Silero configuration:

| Metric | Before | After | Improvement |
|---|---|---|---|
| False trigger rate | 22% | 6% | 73% reduction |
| Missed speech rate | 15% | 4% | 73% reduction |
| P95 latency | 80ms | 55ms | 31% reduction |
  • Model-based VAD in noise: Silero provided clear improvement over energy-only detection in high-noise environments
  • Threshold range: 0.5-0.6 worked well across most environments; document per-site profiles
  • Metrics-driven tuning: Observability data made it straightforward to validate tuning and spot regressions