Skip to content
Docs

Low-Latency Turn Prediction

Voice AI vendors face a critical challenge: the delay between when a user stops speaking and when the system decides to respond. This “endpointing” decision directly impacts perceived responsiveness — every 100ms of unnecessary waiting adds to the user’s sense of lag. Fixed silence windows (800ms) feel sluggish, while shorter windows cause premature interruptions that cut off users mid-thought. The fundamental tension is between responsiveness (respond quickly) and accuracy (wait long enough to be sure the user is done).

Low-latency turn prediction uses Beluga AI’s turn detection with tuned heuristic and ONNX providers to achieve sub-100ms turn decisions with less than 5% false turn-end rate. The system offers two detection strategies — heuristic for rapid deployment and ONNX model-based for production accuracy — both accessed through the same FrameProcessor interface.

graph TB
    Audio[Audio Stream] --> VAD[VAD]
    VAD --> Silence[Silence Duration]
    Audio --> Buffer[Audio Buffer]
    Buffer --> TD[Turn Detector]
    Silence --> TD
    TD --> Decision[Turn End Decision]
    Decision --> Session[Voice Session]
    Session --> STT[STT]
    Session --> LLM[LLM]
    Session --> TTS[TTS]

The voice pipeline provides audio chunks and ongoing silence duration (from VAD). The turn detector uses this combined signal to decide when the user has finished speaking. On turn-end, the session triggers the STT, LLM, and TTS pipeline.

The turn detector consumes VAD output rather than raw audio because silence duration is the primary signal for turn-end detection. By placing the turn detector after VAD in the frame pipeline, it receives pre-computed speech/silence boundaries without duplicating that work. This layered approach also means the turn detector can be swapped (heuristic to ONNX) without changing the upstream VAD configuration.

Start with heuristic detection for rapid iteration without model dependencies. The heuristic approach combines silence duration, word count, and punctuation signals — all available without a dedicated ML model. This makes it deployable immediately and tunable through configuration rather than retraining.

package main
import (
"context"
"time"
"github.com/lookatitude/beluga-ai/voice"
)
func setupHeuristicTurnDetector(ctx context.Context) voice.FrameProcessor {
return voice.NewTurnDetector(voice.TurnDetectorConfig{
MinSilenceDuration: 280 * time.Millisecond, // Starting point; tune per corpus
MinTurnLength: 8, // Minimum words before turn-end
MaxTurnLength: 4000, // Guard against very long turns
SentenceEndMarkers: ".!?", // Punctuation-based heuristic
})
}

Upgrade to ONNX model-based detection for better accuracy in varied environments. The ONNX provider runs a trained turn-prediction model that captures patterns the heuristic cannot (prosody, speech rate changes, semantic cues). Both providers implement the same FrameProcessor interface, so switching from heuristic to ONNX is a configuration change, not a code change.

func setupONNXTurnDetector(ctx context.Context, modelPath string) voice.FrameProcessor {
return voice.NewTurnDetector(voice.TurnDetectorConfig{
Provider: "onnx",
ModelPath: modelPath,
Threshold: 0.45,
MinSilenceDuration: 250 * time.Millisecond,
})
}
func buildPipelineWithTurnDetection(ctx context.Context) (voice.FrameProcessor, error) {
vad := voice.NewSileroVAD(voice.VADConfig{
Threshold: 0.5,
})
turnDetector := setupHeuristicTurnDetector(ctx)
// Turn detector receives both audio frames and silence duration from VAD
// On turn-end decision, downstream processors (STT, LLM, TTS) are triggered
pipeline := voice.Chain(vad, turnDetector, sttProc, agentProc, ttsProc)
return pipeline, nil
}
ParameterDefaultEffectGuidance
MinSilenceDuration280msHow long silence before turn-endLower = faster but more false positives
Threshold (ONNX)0.45Model confidence for turn-endHigher = fewer false positives, more latency
MinTurnLength8Minimum words before turn-end possiblePrevents very short spurious turns
MaxTurnLength4000Maximum turn lengthGuards against never-ending turns
SentenceEndMarkers”.!?”Punctuation signals turn-endLanguage-specific; add per-language defaults
  • Start with heuristic: Deploy heuristic first; add ONNX once you have logs and latency targets
  • A/B testing: Run heuristic vs ONNX in shadow mode to validate models before switching
  • Per-language tuning: Add language-specific defaults for sentence-end markers and thresholds
  • Environment-specific: Treat MinSilenceDuration and Threshold as per-environment configuration (quiet vs noisy)
  • Observability: Instrument turn-detection latency and false positive/negative rates from day one
MetricBeforeAfterImprovement
Turn decision latency (p50)350ms85ms76% reduction
False turn-end rate12%4%67% reduction
Agent think time (p95)480ms180ms63% reduction
  • DetectTurnWithSilence: Using VAD-derived silence avoided extra buffering and sped up decisions
  • Config-driven tuning: MinSilenceDuration, Threshold, and turn-length limits allowed quick per-environment iteration
  • ONNX validation: Run shadow-mode comparisons before committing to model-based detection