Real-Time STT Streaming Tutorial
Real-time STT streaming enables voice applications to process audio incrementally, delivering interim and final transcripts with minimal delay. Unlike batch transcription, which waits for the complete audio before producing output, streaming STT processes audio as it arrives and returns partial results in real time. This is essential for voice agents because the agent can begin understanding user intent before the user finishes speaking, enabling features like preemptive generation and turn-taking. This tutorial demonstrates how to configure a streaming STT provider, open a session, and handle transcripts as they arrive.
What You Will Build
Section titled “What You Will Build”A streaming speech-to-text pipeline using the Deepgram provider that processes audio chunks in real time and delivers both interim (partial) and final transcripts over a WebSocket connection.
Prerequisites
Section titled “Prerequisites”- Deepgram API key
- Go 1.23+
- Basic understanding of audio buffers and PCM encoding
Step 1: Initialize the STT Provider
Section titled “Step 1: Initialize the STT Provider”Use the registry pattern to create a Deepgram STT provider with streaming enabled. The blank import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram" triggers the provider’s init() function, which registers the "deepgram" factory with the STT registry. This is the same registration pattern used across all Beluga extensible packages.
The WithEnableStreaming(true) option tells the provider to establish a persistent WebSocket connection for continuous audio processing, rather than using the batch HTTP API.
package main
import ( "context" "fmt" "log" "os"
"github.com/lookatitude/beluga-ai/voice/stt" _ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram")
func main() { ctx := context.Background()
// Create a Deepgram STT provider with streaming and interim results provider, err := stt.NewProvider(ctx, "deepgram", stt.DefaultConfig(), stt.WithAPIKey(os.Getenv("DEEPGRAM_API_KEY")), stt.WithModel("nova-2"), stt.WithLanguage("en-US"), stt.WithSampleRate(16000), stt.WithEnableStreaming(true), ) if err != nil { log.Fatalf("create STT provider: %v", err) }
_ = provider}Step 2: Start a Streaming Session
Section titled “Step 2: Start a Streaming Session”A streaming session establishes a persistent WebSocket connection for continuous audio input and transcript output. The session remains open for the duration of the conversation, avoiding the overhead of establishing a new connection for each utterance.
// Open a streaming session session, err := provider.StartStreaming(ctx) if err != nil { log.Fatalf("start streaming: %v", err) } defer session.Close()The StartStreaming method returns a StreamingSession that supports bidirectional communication: you send audio chunks in, and receive TranscriptResult values out. This bidirectional pattern is why Beluga uses Go channels for STT streaming rather than iter.Seq2 — both the send and receive directions operate concurrently.
Step 3: Handle Transcripts
Section titled “Step 3: Handle Transcripts”Transcripts arrive on a Go channel. Each result indicates whether it is an interim (partial) or final transcript. Interim transcripts update continuously as the STT model refines its prediction; final transcripts represent the model’s confirmed output for a segment of speech. Processing these in a goroutine ensures the receive loop does not block the audio send path.
// Process transcripts in a separate goroutine go func() { for result := range session.ReceiveTranscript() { if result.Error != nil { fmt.Printf("transcript error: %v\n", result.Error) continue }
if result.IsFinal { fmt.Printf("\n[Final] %s (confidence: %.2f)\n", result.Text, result.Confidence) } else { fmt.Printf("\r[Interim] %s", result.Text) } } }()The TranscriptResult struct contains both the transcript text and metadata about the recognition:
| Field | Type | Description |
|---|---|---|
Text | string | Transcribed text |
IsFinal | bool | Whether this is a final or interim result |
Confidence | float64 | Model confidence score (0.0 to 1.0) |
Language | string | Detected language code |
Error | error | Non-nil if an error occurred |
Step 4: Send Audio Data
Section titled “Step 4: Send Audio Data”Stream audio chunks to the session. In production, these come from a microphone, WebRTC track, or transport layer. The frame size determines the tradeoff between latency and efficiency: smaller frames reduce latency (the model receives data sooner) but increase overhead (more WebSocket messages).
// Send audio chunks (e.g., 20ms frames of 16kHz mono PCM) // Each frame: 16000 samples/sec * 0.020 sec * 2 bytes/sample = 640 bytes audioFrame := make([]byte, 640)
err = session.SendAudio(ctx, audioFrame) if err != nil { log.Fatalf("send audio: %v", err) }For continuous streaming from a microphone, wrap the send loop:
// Continuous streaming from an audio source for frame := range audioSource { if err := session.SendAudio(ctx, frame); err != nil { log.Printf("send error: %v", err) break } }Step 5: Close the Session
Section titled “Step 5: Close the Session”Always close the session to flush pending transcripts and release the WebSocket connection. The Close method signals the server that no more audio will be sent, which triggers the server to finalize any pending transcriptions and return remaining results before terminating the connection.
// Close flushes any remaining audio and shuts down the connection if err := session.Close(); err != nil { log.Printf("close session: %v", err) }Architecture
Section titled “Architecture”The streaming STT pipeline follows this flow:
Audio Source ──▶ SendAudio() ──▶ [WebSocket] ──▶ Deepgram API │ ▼Application ◀── ReceiveTranscript() ◀── [WebSocket] ◀─┘Beluga’s STT interface uses Go channels for streaming rather than iter.Seq2, because audio processing requires true bidirectional communication where the sender and receiver operate concurrently. The iter.Seq2 pattern is used for unidirectional streaming (like LLM token output) where the consumer pulls values from a producer.
Verification
Section titled “Verification”- Set the
DEEPGRAM_API_KEYenvironment variable. - Run the application and pipe audio from a microphone or WAV file.
- Confirm that interim transcripts update in place and final transcripts appear on new lines.
- Verify that closing the session produces any remaining final transcripts.
Next Steps
Section titled “Next Steps”- Fine-tuning Whisper for Industry Terms — Improve accuracy for specialized vocabulary
- Voice Session Interruptions — Combine STT with full session management
- Custom Silero VAD — Add voice activity detection to filter silence