STT API — Speech-to-Text Providers
import "github.com/lookatitude/beluga-ai/voice/stt"Package stt provides the speech-to-text (STT) interface and provider registry
for the Beluga AI voice pipeline. Providers implement the STT interface and
register themselves via init() for discovery.
Core Interface
Section titled “Core Interface”The STT interface supports both batch and streaming transcription:
type STT interface { Transcribe(ctx context.Context, audio []byte, opts ...Option) (string, error) TranscribeStream(ctx context.Context, audioStream iter.Seq2[[]byte, error], opts ...Option) iter.Seq2[TranscriptEvent, error]}Transcript Events
Section titled “Transcript Events”Streaming transcription produces TranscriptEvent values containing the
transcribed text, finality flag, confidence score, timestamp, detected
language, and optional word-level timing via Word.
Registry Pattern
Section titled “Registry Pattern”Providers register via Register in their init() function and are created
with New. Use List to discover available providers.
import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"
engine, err := stt.New("deepgram", stt.Config{Language: "en"})text, err := engine.Transcribe(ctx, audioBytes)
// Streaming:for event, err := range engine.TranscribeStream(ctx, audioStream) { if err != nil { break } fmt.Printf("[%v] %s (final=%v)\n", event.Timestamp, event.Text, event.IsFinal)}Frame Processor Integration
Section titled “Frame Processor Integration”Use AsFrameProcessor to wrap an STT engine as a voice.FrameProcessor
for integration with the cascading pipeline.
processor := stt.AsFrameProcessor(engine)Configuration
Section titled “Configuration”The Config struct supports language, model, punctuation, diarization,
sample rate, encoding, and provider-specific extras. Use functional options
like WithLanguage, WithModel, and WithPunctuation to configure
individual operations.
The Hooks struct provides callbacks: OnTranscript (each event),
OnUtterance (finalized text), and OnError. Use ComposeHooks to merge.
Available Providers
Section titled “Available Providers”- deepgram — Deepgram Nova-2 (voice/stt/providers/deepgram)
- assemblyai — AssemblyAI (voice/stt/providers/assemblyai)
- whisper — OpenAI Whisper (voice/stt/providers/whisper)
- groq — Groq Whisper (voice/stt/providers/groq)
- elevenlabs — ElevenLabs Scribe (voice/stt/providers/elevenlabs)
- gladia — Gladia (voice/stt/providers/gladia)
assemblyai
Section titled “assemblyai”import "github.com/lookatitude/beluga-ai/voice/stt/providers/assemblyai"Package assemblyai provides the AssemblyAI STT provider for the Beluga AI voice pipeline. It uses the AssemblyAI Transcription API for batch transcription and WebSocket API for real-time streaming.
Registration
Section titled “Registration”This package registers itself as “assemblyai” with the stt registry. Import it with a blank identifier to enable:
import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/assemblyai"engine, err := stt.New("assemblyai", stt.Config{ Language: "en", Extra: map[string]any{"api_key": "..."},})text, err := engine.Transcribe(ctx, audioBytes)Batch transcription uploads audio, creates a transcript, and polls for completion. Streaming uses the real-time WebSocket endpoint for low-latency partial and final transcripts with word-level timing.
Configuration
Section titled “Configuration”Required configuration in Config.Extra:
- api_key — AssemblyAI API key (required)
- base_url — Custom REST API base URL (optional)
- ws_url — Custom WebSocket URL (optional)
Exported Types
Section titled “Exported Types”- [Engine] — implements stt.STT using AssemblyAI
- [New] — constructor accepting stt.Config
deepgram
Section titled “deepgram”import "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"Package deepgram provides the Deepgram STT provider for the Beluga AI voice pipeline. It uses the Deepgram HTTP API for batch transcription and WebSocket API for real-time streaming.
Registration
Section titled “Registration”This package registers itself as “deepgram” with the stt registry. Import it with a blank identifier to enable:
import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"engine, err := stt.New("deepgram", stt.Config{Language: "en", Model: "nova-2"})text, err := engine.Transcribe(ctx, audioBytes)Streaming transcription uses the Deepgram WebSocket API for low-latency partial and final transcripts with word-level timing.
Configuration
Section titled “Configuration”Required configuration in Config.Extra:
- api_key — Deepgram API key (required)
- base_url — Custom REST API base URL (optional)
- ws_url — Custom WebSocket URL (optional)
The default model is “nova-2”. Language, punctuation, diarization, encoding, and sample rate are all supported through [stt.Config].
Exported Types
Section titled “Exported Types”- [Engine] — implements stt.STT using Deepgram
- [New] — constructor accepting stt.Config
elevenlabs
Section titled “elevenlabs”import "github.com/lookatitude/beluga-ai/voice/stt/providers/elevenlabs"Package elevenlabs provides the ElevenLabs Scribe STT provider for the Beluga AI voice pipeline. It uses the ElevenLabs Speech-to-Text API for batch transcription via multipart form upload.
Registration
Section titled “Registration”This package registers itself as “elevenlabs” with the stt registry. Import it with a blank identifier to enable:
import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/elevenlabs"engine, err := stt.New("elevenlabs", stt.Config{ Extra: map[string]any{"api_key": "xi-..."},})text, err := engine.Transcribe(ctx, audioBytes)ElevenLabs Scribe does not support native streaming. TranscribeStream falls back to transcribing each audio chunk independently.
Configuration
Section titled “Configuration”Required configuration in Config.Extra:
- api_key — ElevenLabs API key (required)
- base_url — Custom API base URL (optional)
The default model is “scribe_v1”.
Exported Types
Section titled “Exported Types”- [Engine] — implements stt.STT using ElevenLabs Scribe
- [New] — constructor accepting stt.Config
gladia
Section titled “gladia”import "github.com/lookatitude/beluga-ai/voice/stt/providers/gladia"Package gladia provides the Gladia STT provider for the Beluga AI voice pipeline. It uses the Gladia API for batch transcription and WebSocket streaming for real-time transcription.
Registration
Section titled “Registration”This package registers itself as “gladia” with the stt registry. Import it with a blank identifier to enable:
import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/gladia"engine, err := stt.New("gladia", stt.Config{ Language: "en", Extra: map[string]any{"api_key": "..."},})text, err := engine.Transcribe(ctx, audioBytes)Batch transcription uploads audio via multipart form, creates a transcription job, and polls for the result. Streaming uses Gladia’s live WebSocket endpoint.
Configuration
Section titled “Configuration”Required configuration in Config.Extra:
- api_key — Gladia API key (required)
- base_url — Custom API base URL (optional)
Exported Types
Section titled “Exported Types”- [Engine] — implements stt.STT using Gladia
- [New] — constructor accepting stt.Config
import "github.com/lookatitude/beluga-ai/voice/stt/providers/groq"Package groq provides the Groq STT provider for the Beluga AI voice pipeline. It uses the Groq Whisper endpoint (OpenAI-compatible API) for transcription.
Registration
Section titled “Registration”This package registers itself as “groq” with the stt registry. Import it with a blank identifier to enable:
import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/groq"engine, err := stt.New("groq", stt.Config{ Model: "whisper-large-v3", Extra: map[string]any{"api_key": "gsk-..."},})text, err := engine.Transcribe(ctx, audioBytes)Groq Whisper does not support native streaming. TranscribeStream buffers all audio chunks and transcribes them in a single batch request.
Configuration
Section titled “Configuration”Required configuration in Config.Extra:
- api_key — Groq API key (required)
- base_url — Custom API base URL (optional)
The default model is “whisper-large-v3”.
Exported Types
Section titled “Exported Types”- [Engine] — implements stt.STT using Groq Whisper
- [New] — constructor accepting stt.Config
whisper
Section titled “whisper”import "github.com/lookatitude/beluga-ai/voice/stt/providers/whisper"Package whisper provides the OpenAI Whisper STT provider for the Beluga AI voice pipeline. It uses the OpenAI Audio Transcriptions API for batch transcription via multipart form upload.
Registration
Section titled “Registration”This package registers itself as “whisper” with the stt registry. Import it with a blank identifier to enable:
import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/whisper"engine, err := stt.New("whisper", stt.Config{ Model: "whisper-1", Extra: map[string]any{"api_key": "sk-..."},})text, err := engine.Transcribe(ctx, audioBytes)Whisper does not support native streaming. TranscribeStream transcribes each audio chunk independently as a batch request.
Configuration
Section titled “Configuration”Required configuration in Config.Extra:
- api_key — OpenAI API key (required)
- base_url — Custom API base URL (optional)
The default model is “whisper-1”.
Exported Types
Section titled “Exported Types”- [Engine] — implements stt.STT using OpenAI Whisper
- [New] — constructor accepting stt.Config