Multi-Turn Voice Forms
Healthcare providers, insurance companies, and government agencies need to collect structured intake data over the phone. The data itself is well-defined (name, date of birth, policy number, symptoms), but the collection process is difficult: callers give fragmented answers, change their mind, provide information out of order, and go off-topic. Traditional touch-tone forms force callers into rigid sequences that do not accommodate these natural behaviors, resulting in 38% abandonment rates on forms longer than 5 fields.
A voice-based form system using Beluga AI’s voice pipeline manages form state across turns, validates answers in real time, and supports corrections and resumption. The system uses separate STT and TTS components (not S2S) because the form orchestrator needs to inspect the transcribed text for validation before generating the next prompt — this validation step requires text as an intermediate representation.
Solution Architecture
Section titled “Solution Architecture”graph TB
User[User] --> Mic[Microphone]
Mic --> Pipeline[Voice Pipeline]
Pipeline --> STT[STT]
Pipeline --> TTS[TTS]
Pipeline --> VAD[VAD]
STT --> Orchestrator[Form Orchestrator]
Orchestrator --> State[Form State]
State --> Valid[Validate Answer]
Valid --> Next[Next Question or Reprompt]
Next --> TTS
TTS --> Speaker[Speaker]
Speaker --> User
The voice pipeline handles audio processing (STT, TTS, VAD, turn detection). The form orchestrator manages form state separately from the session layer, advancing through questions, validating answers, and persisting progress for resumption.
Separating form state from session state is a key architectural decision. Sessions are transient (tied to a WebSocket connection), while form state must survive disconnections so callers can resume where they left off. By decoupling them, the form orchestrator can be tested independently of the voice pipeline, and form state can be persisted to any storage backend.
Implementation
Section titled “Implementation”Voice Form System
Section titled “Voice Form System”The form is defined as a sequence of FormField structs, each with a prompt, validation function, and required flag. The Run method drives a one-question-per-turn loop: synthesize the prompt via TTS, listen for a response via streaming STT, validate the answer, and either advance or reprompt. This sequential, field-by-field approach was chosen over free-form collection because it produces higher completion rates — callers know exactly what information is needed at each step.
package main
import ( "context" "fmt" "iter"
"github.com/lookatitude/beluga-ai/voice/stt" "github.com/lookatitude/beluga-ai/voice/tts"
_ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram" _ "github.com/lookatitude/beluga-ai/voice/tts/providers/elevenlabs")
// FormField defines a single question in the form.type FormField struct { Name string Prompt string Validate func(string) error Required bool}
// VoiceForm collects structured data through voice conversation.type VoiceForm struct { fields []FormField current int answers map[string]string stt stt.STT tts tts.TTS}
func (f *VoiceForm) Run(ctx context.Context, audioIn iter.Seq2[[]byte, error]) (map[string]string, error) { // Ask the first question question := f.fields[f.current].Prompt audio, err := f.tts.Synthesize(ctx, question, tts.WithVoice("aria"), tts.WithSpeed(1.0), ) if err != nil { return nil, fmt.Errorf("synthesize: %w", err) } sendAudio(audio)
// Process answers transcripts := f.stt.TranscribeStream(ctx, audioIn, stt.WithLanguage("en"), stt.WithPunctuation(true), )
for event, err := range transcripts { if err != nil { return nil, fmt.Errorf("transcribe: %w", err) }
if !event.IsFinal { continue }
field := f.fields[f.current]
// Validate the answer if err := field.Validate(event.Text); err != nil { reprompt := fmt.Sprintf("I didn't quite get that. %s", field.Prompt) audio, err := f.tts.Synthesize(ctx, reprompt, tts.WithVoice("aria")) if err != nil { return nil, fmt.Errorf("synthesize reprompt: %w", err) } sendAudio(audio) continue }
// Save and advance f.answers[field.Name] = event.Text f.current++
if f.current >= len(f.fields) { confirm := "Thank you. I have all the information I need." audio, err := f.tts.Synthesize(ctx, confirm, tts.WithVoice("aria")) if err != nil { return nil, fmt.Errorf("synthesize confirm: %w", err) } sendAudio(audio) return f.answers, nil }
// Ask next question next := f.fields[f.current].Prompt audio, err := f.tts.Synthesize(ctx, next, tts.WithVoice("aria")) if err != nil { return nil, fmt.Errorf("synthesize next: %w", err) } sendAudio(audio) }
return f.answers, nil}Form State Persistence
Section titled “Form State Persistence”Store form state (current question index, collected answers, session ID) for resumption after dropped calls:
type FormState struct { SessionID string Current int Answers map[string]string CreatedAt time.Time UpdatedAt time.Time}
// Persist by session ID; restore on reconnectfunc (f *VoiceForm) SaveState(sessionID string) FormState { return FormState{ SessionID: sessionID, Current: f.current, Answers: f.answers, }}Deployment Considerations
Section titled “Deployment Considerations”- Turn detection: Use VAD and turn detection to avoid treating partial utterances as final answers
- Interruptions: Enable interruption support so users can correct themselves mid-prompt
- State persistence: Persist form state by session ID from day one for resumption after dropped calls
- Navigation: Add “repeat”, “go back”, and “skip” intents for user control
- Confirmation turns: For critical fields, add optional “Did you say X?” confirmation turns
- PII handling: Encrypt form state at rest when collecting sensitive data
Results
Section titled “Results”| Metric | Before | After | Improvement |
|---|---|---|---|
| Form completion rate | 62% | 88% | +42% |
| User correction rate | N/A | 12% | Under 15% target |
Lessons Learned
Section titled “Lessons Learned”- One question per turn: Clear structure prevents confusion, though longer forms need more turns
- Turn detection matters: Avoiding partial-utterance answers was critical for form accuracy
- Separate form state from session: Makes testing, persistence, and resumption straightforward
Related Resources
Section titled “Related Resources”- Voice Sessions Overview for session and transport patterns
- Voice AI Applications for voice pipeline architecture
- Conversational AI Assistant for broader conversational patterns