ML-Based Barge-In Detection
Problem
You need to support barge-in (user interrupts the agent) using both VAD and turn detection, with the option to use an ONNX model for more accurate turn-end classification in noisy or varied environments. Simple VAD-only barge-in systems trigger on any speech onset, which causes false positives when the user naturally pauses mid-utterance or speaks briefly at turn boundaries. This results in the agent prematurely stopping playback and starting transcription, leading to awkward interruptions and degraded user experience.
The challenge is distinguishing genuine interruptions (barge-in) from natural turn-taking cues (end-of-turn). Voice activity detection alone cannot make this distinction because it operates on acoustic features (energy, pitch) without understanding conversational context. Turn detection adds context by analyzing speech patterns and silence durations to predict when a speaker has finished their turn. Combining both signals allows accurate barge-in detection: speech onset (VAD) confirms the user is speaking, while turn detection confirms they are not simply finishing their previous utterance.
Solution
Combine VAD (speech onset) with turn detection (turn context). Use the onnx provider when you have a turn-detection model; otherwise use heuristic. On speech onset during playback, call DetectTurnWithSilence to distinguish barge-in from end-of-turn, then stop TTS and switch to listening when barge-in is detected.
The reasoning behind this dual-signal approach is defense against false positives. VAD provides low-latency speech onset detection (10-50ms), which is critical for responsive barge-in. However, VAD is context-free and triggers on any speech, including the user’s own turn-ending phrases (“that’s all”, “thank you”). Turn detection provides context-aware classification, using features like silence duration, pitch trajectory, and learned patterns from training data. The ONNX provider uses a pre-trained neural model for higher accuracy, while the heuristic provider uses rule-based logic (silence thresholds, sentence boundaries) for simpler deployments.
This pattern follows Beluga’s FrameProcessor interface for composability. VAD and turn detection are separate processors that can be swapped, tuned, or tested independently. The application combines their outputs in a decision function (bargeIn = speaking && !done), which can be customized per use case (e.g., require multiple consecutive frames of barge-in to reduce flapping).
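The "multiple consecutive frames" customization mentioned above can be sketched as a small debouncer. This is an illustrative helper, not part of Beluga: `bargeInDebouncer` and its `required` field are hypothetical names, and the right streak length depends on your frame size and latency budget.

```go
package main

import "fmt"

// bargeInDebouncer reports barge-in only after `required` consecutive
// frames satisfied speaking && !done. Hypothetical helper, not a Beluga type.
type bargeInDebouncer struct {
	required int // consecutive positive frames needed
	streak   int // current run of positive frames
}

// Observe feeds one frame's signals and returns true once the current
// streak has reached the required length.
func (d *bargeInDebouncer) Observe(speaking, done bool) bool {
	if speaking && !done {
		d.streak++
	} else {
		d.streak = 0 // any negative frame resets the run
	}
	return d.streak >= d.required
}

func main() {
	d := &bargeInDebouncer{required: 3}
	// Each pair is (speaking, done) for one frame.
	frames := [][2]bool{
		{true, false}, {true, false}, {false, false}, // short blip, then reset
		{true, false}, {true, false}, {true, false}, // sustained speech
	}
	for i, f := range frames {
		fmt.Printf("frame %d: bargeIn=%v\n", i, d.Observe(f[0], f[1]))
	}
	// Only frame 5 prints bargeIn=true.
}
```

With 20ms frames, requiring three frames adds roughly 60ms before playback stops, so the streak length trades flapping against responsiveness.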
Code Example
```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"

	"github.com/lookatitude/beluga-ai/voice/turndetection"
	turndetectioniface "github.com/lookatitude/beluga-ai/voice/turndetection/iface"
	"github.com/lookatitude/beluga-ai/voice/vad"
	vadiface "github.com/lookatitude/beluga-ai/voice/vad/iface"
)

var tracer = otel.Tracer("beluga.voice.bargein.recipe")

func main() {
	ctx := context.Background()
	ctx, span := tracer.Start(ctx, "ml_bargein_setup")
	defer span.End()

	// VAD provider: low-latency speech onset detection.
	vadCfg := vad.DefaultConfig()
	vadProv, err := vad.NewProvider(ctx, "webrtc", vadCfg)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		log.Fatalf("vad: %v", err)
	}

	// Turn detector: read the ONNX model path from TURN_MODEL_PATH,
	// falling back to a default location in the temp directory.
	modelPath := os.Getenv("TURN_MODEL_PATH")
	if modelPath == "" {
		modelPath = filepath.Join(os.TempDir(), "turn_detection.onnx")
	}
	tdCfg := turndetection.DefaultConfig()
	td, err := turndetection.NewProvider(ctx, "onnx", tdCfg,
		turndetection.WithModelPath(modelPath),
		turndetection.WithThreshold(0.5),
		turndetection.WithMinSilenceDuration(250*time.Millisecond),
	)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		log.Fatalf("turn detector: %v", err)
	}

	// Barge-in check: VAD says speech, turn detector says not end-of-turn.
	audio := make([]byte, 1024)
	speaking, err := vadProv.Process(ctx, audio)
	if err != nil {
		log.Fatalf("vad process: %v", err)
	}
	silence := 100 * time.Millisecond
	done, err := td.DetectTurnWithSilence(ctx, audio, silence)
	if err != nil {
		log.Fatalf("turn detection: %v", err)
	}

	bargeIn := speaking && !done
	span.SetAttributes(
		attribute.Bool("vad.speaking", speaking),
		attribute.Bool("turn.done", done),
		attribute.Bool("barge_in", bargeIn),
	)
	fmt.Printf("Barge-in: %v (speaking=%v, turn_done=%v)\n", bargeIn, speaking, done)
}

// runBargeInLoop evaluates a single audio frame and reports whether it is a
// barge-in. It skips turn detection entirely when VAD hears no speech.
func runBargeInLoop(ctx context.Context, vadProv vadiface.VADProvider, td turndetectioniface.TurnDetector, audio []byte, silence time.Duration) (bool, error) {
	speaking, err := vadProv.Process(ctx, audio)
	if err != nil {
		return false, err
	}
	if !speaking {
		return false, nil
	}
	done, err := td.DetectTurnWithSilence(ctx, audio, silence)
	if err != nil {
		return false, err
	}
	return !done, nil
}
```

Explanation
- VAD for onset: vadProv.Process returns whether the user is speaking. Use it during TTS playback to detect interruption with low latency. VAD operates on short frames (10-30ms), providing near-instant feedback when speech energy rises above threshold. This low latency is critical for responsive barge-in: delays longer than 200ms feel unnatural to users. However, VAD alone is insufficient because it cannot distinguish between “user started new speech” (barge-in) and “user is finishing previous speech” (turn continuation). This is where turn detection provides context.
- Turn detection for context: DetectTurnWithSilence indicates end-of-turn. If the user is speaking (speaking == true) but the turn is not done (done == false), treat it as barge-in: the user is interrupting mid-response. Turn detection analyzes temporal patterns like silence duration and pitch contours to classify whether the speaker intends to yield the floor. The silence parameter tunes sensitivity: shorter values (100-200ms) detect barge-in faster but risk false positives; longer values (300-500ms) reduce false positives but increase latency.
- ONNX vs heuristic: Use onnx when you have a model and want better accuracy; use heuristic for simpler, model-free setups. Both support DetectTurnWithSilence. The ONNX provider loads a pre-trained neural network (e.g., CNN or Transformer-based) trained on conversational datasets to predict turn boundaries. This provides higher accuracy in noisy or multi-speaker environments where heuristics struggle. The heuristic provider uses rule-based logic: if silence exceeds MinSilenceDuration and the last audio frame meets certain criteria (low pitch variance, decaying energy), predict end-of-turn. Heuristics work well for controlled environments (quiet rooms, single speaker) and avoid the complexity of model deployment.
Key insight: Barge-in = “user started speaking” (VAD) and “we are not at end-of-turn” (turn detection). Turn detection keeps an end-of-turn from being treated as barge-in. This two-signal approach is a common design in voice assistants because it balances responsiveness (VAD) with accuracy (turn detection). The example relies on Beluga’s provider registry to swap turn detection providers: change "onnx" to "heuristic" in NewProvider and the rest of the code remains unchanged. This makes it easy to A/B test different providers or fall back to heuristics when the ONNX model is unavailable.
Variations
Heuristic-Only Barge-In
Skip ONNX and use heuristic turn detection for simpler deployments:

```go
td, err := turndetection.NewProvider(ctx, "heuristic", tdCfg,
	turndetection.WithMinSilenceDuration(200*time.Millisecond),
)
```

Stricter Barge-In
Raise ONNX Threshold or heuristic MinSilenceDuration so only clearer interruptions count as barge-in, reducing false positives.
Related Recipes
- Sentence-Boundary Turns: Heuristic turn detection with sentence boundaries
- Handling Speech Interruption: Interruption handling in S2S
- VAD Sensitivity Profiles: Tune VAD for different environments