WebRTC VAD Provider

The WebRTC provider implements voice activity detection using a dual-metric approach: RMS energy analysis combined with zero-crossing rate (ZCR). This pure Go implementation requires no external dependencies or CGO and effectively distinguishes voiced speech from background noise.

Choose the WebRTC provider when you need a lightweight voice activity detector for controlled environments: it needs no CGO, no model files, and no external dependencies. Because it requires a low zero-crossing rate in addition to high energy, it rejects loud noise that would fool a pure energy detector. For higher accuracy in noisy environments, consider Silero.

    go get github.com/lookatitude/beluga-ai/voice/vad/providers/webrtc

| Field           | Required | Default | Description                     |
|-----------------|----------|---------|---------------------------------|
| energyThreshold | No       | 1000.0  | RMS energy threshold for speech |
| zcrThreshold    | No       | 0.1     | Zero-crossing rate threshold    |

Registry configuration keys:

| Key           | Type    | Maps to         |
|---------------|---------|-----------------|
| threshold     | float64 | energyThreshold |
| zcr_threshold | float64 | zcrThreshold    |

A minimal program that creates the provider through the registry and classifies one audio buffer:

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/lookatitude/beluga-ai/voice"
        _ "github.com/lookatitude/beluga-ai/voice/vad/providers/webrtc"
    )

    func main() {
        vad, err := voice.NewVAD("webrtc", map[string]any{
            "threshold":     1000.0,
            "zcr_threshold": 0.1,
        })
        if err != nil {
            log.Fatal(err)
        }

        // audioPCM is 16-bit little-endian PCM audio data
        var audioPCM []byte // ... obtained from audio source

        result, err := vad.DetectActivity(context.Background(), audioPCM)
        if err != nil {
            log.Fatal(err)
        }

        fmt.Printf("Speech: %v, Event: %s, Confidence: %.2f\n",
            result.IsSpeech, result.EventType, result.Confidence)
    }

The WebRTC provider uses two complementary metrics to classify audio:

  1. RMS Energy: measures signal amplitude. High energy suggests speech or loud sounds.
  2. Zero-Crossing Rate (ZCR): measures how often the signal crosses zero. Noise tends to have high ZCR, while voiced speech has lower ZCR.

Speech is detected when energy exceeds the threshold and the zero-crossing rate is below the ZCR threshold. This combination rejects high-energy noise (fans, traffic) that would fool a pure energy detector.
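
The decision rule above can be sketched in a few lines of Go. This is a minimal illustration, not the provider's source: the helper names (`rmsEnergy`, `zeroCrossingRate`, `isSpeech`) are hypothetical, and normalizing ZCR as sign changes per adjacent sample pair is an assumption about how the rate is scaled.

```go
package main

import (
	"fmt"
	"math"
)

// rmsEnergy returns the root-mean-square amplitude of the frame.
func rmsEnergy(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// zeroCrossingRate returns the fraction of adjacent sample pairs whose
// signs differ, a value in [0, 1]. (Assumed normalization.)
func zeroCrossingRate(samples []int16) float64 {
	if len(samples) < 2 {
		return 0
	}
	crossings := 0
	for i := 1; i < len(samples); i++ {
		if (samples[i-1] >= 0) != (samples[i] >= 0) {
			crossings++
		}
	}
	return float64(crossings) / float64(len(samples)-1)
}

// isSpeech applies the rule from the text: energy above its threshold
// and zero-crossing rate below its threshold.
func isSpeech(samples []int16, energyThreshold, zcrThreshold float64) bool {
	return rmsEnergy(samples) > energyThreshold &&
		zeroCrossingRate(samples) < zcrThreshold
}

func main() {
	slow := make([]int16, 160) // loud, sign flips every 40 samples
	fast := make([]int16, 160) // loud, sign flips on every sample
	for i := range slow {
		slow[i], fast[i] = 3000, 3000
		if (i/40)%2 == 1 {
			slow[i] = -3000
		}
		if i%2 == 1 {
			fast[i] = -3000
		}
	}
	fmt.Println(isSpeech(slow, 1000.0, 0.1)) // true
	fmt.Println(isSpeech(fast, 1000.0, 0.1)) // false
}
```

Both synthetic frames have the same RMS energy (3000), so a pure energy detector could not tell them apart. Only the zero-crossing rate separates them.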

The confidence score reflects both metrics:

  • Base confidence is the ratio of RMS energy to twice the energy threshold, clamped to [0, 1]
  • When ZCR exceeds the threshold (suggesting noise), confidence is halved

    result, err := vad.DetectActivity(ctx, audioPCM)
    if err != nil {
        log.Fatal(err)
    }

    if result.Confidence > 0.8 {
        fmt.Println("High-confidence speech detection")
    }
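
To make the confidence rule concrete, here is a small sketch of the arithmetic described in the two bullets. The function name `confidence` is hypothetical, and applying the clamp before the halving is an assumed reading of the description, not something confirmed from the provider's source.

```go
package main

import "fmt"

// confidence follows the description in the text:
// base = rms / (2 * energyThreshold), clamped to [0, 1],
// then halved when the zero-crossing rate exceeds its threshold.
// The order (clamp first, then halve) is an assumption.
func confidence(rms, zcr, energyThreshold, zcrThreshold float64) float64 {
	if energyThreshold <= 0 {
		return 0
	}
	c := rms / (2 * energyThreshold)
	if c < 0 {
		c = 0
	}
	if c > 1 {
		c = 1
	}
	if zcr > zcrThreshold {
		c /= 2
	}
	return c
}

func main() {
	fmt.Printf("%.2f\n", confidence(1600, 0.05, 1000, 0.1)) // 0.80
	fmt.Printf("%.2f\n", confidence(1600, 0.30, 1000, 0.1)) // 0.40
	fmt.Printf("%.2f\n", confidence(5000, 0.05, 1000, 0.1)) // 1.00
}
```

Under this reading, a frame whose RMS energy equals the energy threshold scores 0.5, and a check such as `Confidence > 0.8` passes only when RMS exceeds 1.6 times the threshold while ZCR stays at or below its threshold.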

Energy and ZCR thresholds can be adjusted independently:

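| Scenario          | Energy Threshold | ZCR Threshold | Effect                               |
|-------------------|------------------|---------------|--------------------------------------|
| Quiet room        | 500.0            | 0.1           | More sensitive to soft speech        |
| Noisy environment | 2000.0           | 0.05          | Strict filtering of background noise |
| Default           | 1000.0           | 0.1           | Balanced for general use             |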
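
One way to keep these presets in code is a small lookup that produces the registry configuration map. The helper `vadConfig` and the scenario names are hypothetical; only the `threshold` and `zcr_threshold` keys and the values come from the tables above.

```go
package main

import "fmt"

// vadConfig returns registry options for a named scenario.
// Hypothetical helper; unknown names fall back to the defaults.
func vadConfig(scenario string) map[string]any {
	switch scenario {
	case "quiet":
		return map[string]any{"threshold": 500.0, "zcr_threshold": 0.1}
	case "noisy":
		return map[string]any{"threshold": 2000.0, "zcr_threshold": 0.05}
	default:
		return map[string]any{"threshold": 1000.0, "zcr_threshold": 0.1}
	}
}

func main() {
	cfg := vadConfig("noisy")
	fmt.Println(cfg["threshold"], cfg["zcr_threshold"]) // 2000 0.05
}
```

The resulting map has the same shape as the one passed to `voice.NewVAD("webrtc", ...)` in the usage example.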

The provider tracks state transitions between speech and silence:

    for _, chunk := range audioChunks {
        result, err := vad.DetectActivity(ctx, chunk)
        if err != nil {
            log.Fatal(err)
        }

        switch result.EventType {
        case voice.VADSpeechStart:
            fmt.Println("Speech started")
        case voice.VADSpeechEnd:
            fmt.Println("Speech ended")
        case voice.VADSilence:
            // Ongoing silence
        }
    }
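
For intuition, transition events like these can be derived from per-frame speech decisions with one bit of remembered state. The sketch below is self-contained and does not use the library's types: the `event` type, its constant names, and the `tracker` struct are hypothetical, and the event emitted for ongoing speech is an assumption.

```go
package main

import "fmt"

type event string

const (
	silence     event = "silence"
	speechStart event = "speech_start"
	speech      event = "speech" // ongoing speech; hypothetical name
	speechEnd   event = "speech_end"
)

// tracker remembers whether the previous frame was speech.
type tracker struct{ inSpeech bool }

// next maps the current frame's decision onto a transition event.
func (t *tracker) next(isSpeech bool) event {
	prev := t.inSpeech
	t.inSpeech = isSpeech
	switch {
	case isSpeech && !prev:
		return speechStart
	case isSpeech && prev:
		return speech
	case !isSpeech && prev:
		return speechEnd
	default:
		return silence
	}
}

func main() {
	var t tracker
	for _, frame := range []bool{false, true, true, false, false} {
		fmt.Println(t.next(frame))
	}
	// Prints: silence, speech_start, speech, speech_end, silence
}
```

Because the state lives in the tracker, each audio stream needs its own instance. Sharing one across streams would mix their transitions.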

For compile-time type safety, construct the provider directly:

    import "github.com/lookatitude/beluga-ai/voice/vad/providers/webrtc"

    vad := webrtc.New(1500.0, 0.08)

The constructor takes energyThreshold and zcrThreshold as positional arguments, in that order. Passing zero for either argument selects its default (1000.0 and 0.1, respectively).

Check the error returned by DetectActivity before using the result:

    result, err := vad.DetectActivity(ctx, audioPCM)
    if err != nil {
        log.Printf("VAD detection failed: %v", err)
    }

Audio data must be 16-bit little-endian PCM. Frames shorter than 4 bytes return a silence result without error.
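
To illustrate the input format, the sketch below decodes 16-bit little-endian PCM bytes into samples using the standard `encoding/binary` package. The function `decodePCM16LE` and the constant `minFrameBytes` are hypothetical names. The 4-byte minimum mirrors the rule stated above, and ignoring a trailing odd byte is an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// minFrameBytes mirrors the text: shorter frames are treated as silence.
const minFrameBytes = 4

// decodePCM16LE converts 16-bit little-endian PCM bytes into samples.
// It reports ok=false for frames that are too short to analyze.
// A trailing odd byte, if any, is ignored (assumed behavior).
func decodePCM16LE(pcm []byte) (samples []int16, ok bool) {
	if len(pcm) < minFrameBytes {
		return nil, false
	}
	samples = make([]int16, len(pcm)/2)
	for i := range samples {
		samples[i] = int16(binary.LittleEndian.Uint16(pcm[2*i : 2*i+2]))
	}
	return samples, true
}

func main() {
	// Low byte first: 0x0001 = 1, 0xFFFF = -1, 0x7FFF = 32767.
	pcm := []byte{0x01, 0x00, 0xFF, 0xFF, 0xFF, 0x7F}
	samples, ok := decodePCM16LE(pcm)
	fmt.Println(samples, ok) // [1 -1 32767] true

	_, ok = decodePCM16LE([]byte{0x01, 0x00})
	fmt.Println(ok) // false
}
```

Audio captured in another format, such as big-endian or 32-bit float samples, would need converting to this layout before being passed to DetectActivity.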