Voice Backends Configuration

You are building a voice agent and need to choose the right voice backend, configure providers with proper settings, switch between providers without code changes, and handle fallback when a provider is unavailable. Voice systems are complex multi-provider stacks: you might use Deepgram for STT, OpenAI for TTS, and LiveKit for transport. Each provider has unique configuration parameters, API authentication schemes, and operational characteristics. Hardcoding provider choices into application logic makes it difficult to test alternatives, respond to outages, or optimize costs by switching providers based on usage patterns.

The challenge is decoupling provider selection from application logic. You want to configure providers via environment variables or config files, swap providers without code changes, and implement fallback chains when a primary provider fails. This requires a consistent abstraction layer across heterogeneous providers, where the application code interacts with interfaces rather than concrete provider types.

Beluga AI’s voice system uses a consistent registry pattern across all voice backends. You can switch providers through configuration, implement automatic fallback, and test with mocks without changing application code. The registry pattern works by having each provider package register itself in an init() function, making it available for lookup by name. The application calls NewProvider(name, config) with a provider name string, and the registry returns the appropriate implementation. This design follows Go’s database/sql driver pattern and Beluga’s core extensibility model.

The reasoning behind this approach is operational flexibility. By standardizing on Register + New + List, you gain several benefits: provider selection becomes a runtime configuration parameter (not a compile-time dependency), testing becomes easier (register a mock provider in test init()), and fallback chains can iterate through the registry without hardcoding provider names. The functional options pattern (WithAPIKey, WithModel) provides type-safe configuration that adapts to each provider’s unique parameters while maintaining a common interface.

Set up Deepgram for real-time speech-to-text. Deepgram is chosen here for its low-latency WebSocket API and high accuracy on conversational speech. The configuration specifies EnableStreaming because real-time applications require incremental transcription rather than batch processing. The nova-2 model balances accuracy and speed for general-purpose use.

```go
package main

import (
	"context"
	"os"

	"github.com/lookatitude/beluga-ai/voice/stt"
	_ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram" // Register provider
)

func setupSTT(ctx context.Context) (stt.Provider, error) {
	config := stt.DefaultConfig()
	config.Provider = "deepgram"
	config.APIKey = os.Getenv("DEEPGRAM_API_KEY")
	config.Model = "nova-2"
	config.Language = "en-US"
	config.SampleRate = 16000
	config.Channels = 1
	config.EnableStreaming = true
	return stt.NewProvider(ctx, "deepgram", config)
}
```

Set up OpenAI for text-to-speech. OpenAI’s TTS API provides good voice quality with straightforward pricing and no complex voice licensing. The tts-1-hd model offers higher fidelity than tts-1, which matters for applications where users listen for extended periods. The Speed parameter allows tuning playback rate for different use cases (faster for summaries, slower for instructions).

```go
import (
	"context"
	"os"

	"github.com/lookatitude/beluga-ai/voice/tts"
	_ "github.com/lookatitude/beluga-ai/voice/tts/providers/openai" // Register provider
)

func setupTTS(ctx context.Context) (tts.Provider, error) {
	config := tts.DefaultConfig()
	config.Provider = "openai"
	config.APIKey = os.Getenv("OPENAI_API_KEY")
	config.Model = "tts-1-hd"
	config.Voice = "alloy"
	config.Speed = 1.0
	config.SampleRate = 24000
	return tts.NewProvider(ctx, "openai", config)
}
```

Set up Amazon Nova for end-to-end speech conversations. Speech-to-speech (S2S) providers eliminate the STT → LLM → TTS pipeline by handling the entire conversation flow internally, reducing latency and preserving prosody. Nova’s LatencyTarget: "low" configuration prioritizes responsiveness over maximum quality, suitable for interactive agents. The ReasoningMode: "built-in" means Nova uses its internal conversation model rather than requiring external LLM integration.

```go
import (
	"context"

	"github.com/lookatitude/beluga-ai/voice/s2s"
	_ "github.com/lookatitude/beluga-ai/voice/s2s/providers/amazon_nova" // Register provider
)

func setupS2S(ctx context.Context) (s2s.Provider, error) {
	config := s2s.DefaultConfig()
	config.Provider = "amazon_nova"
	config.LatencyTarget = "low"
	config.ReasoningMode = "built-in"
	config.ProviderSpecific = map[string]any{
		"region":        "us-east-1",
		"model":         "nova-2-sonic",
		"voice_id":      "Ruth",
		"language_code": "en-US",
	}
	return s2s.NewProvider(ctx, "amazon_nova", config)
}
```

Recipe 4: Switch Providers via Configuration


Switch providers based on environment variables without code changes. This pattern externalizes provider selection, making it easy to compare providers during development or switch providers in production without redeployment. The VoiceConfig struct uses struct tags (yaml, env) to support both configuration files and environment variable overrides, following twelve-factor app principles.

```go
import (
	"context"

	"github.com/lookatitude/beluga-ai/voice/stt"
	_ "github.com/lookatitude/beluga-ai/voice/stt/providers/azure"
	_ "github.com/lookatitude/beluga-ai/voice/stt/providers/deepgram"
	_ "github.com/lookatitude/beluga-ai/voice/stt/providers/google"
)

type VoiceConfig struct {
	STTProvider string `yaml:"stt_provider" env:"VOICE_STT_PROVIDER"`
	STTAPIKey   string `yaml:"stt_api_key" env:"VOICE_STT_API_KEY"`
	STTModel    string `yaml:"stt_model" env:"VOICE_STT_MODEL"`
}

func setupSTTFromConfig(ctx context.Context, cfg VoiceConfig) (stt.Provider, error) {
	config := stt.DefaultConfig()
	config.Provider = cfg.STTProvider
	config.APIKey = cfg.STTAPIKey
	config.Model = cfg.STTModel
	return stt.NewProvider(ctx, cfg.STTProvider, config)
}
```

Fall back automatically when the primary provider fails. This pattern implements resilience through redundancy: when the primary provider fails (due to API outages, quota limits, or network issues), the fallback chain tries alternative providers in order. The warning log aids post-incident analysis by recording which providers failed and when. This approach trades occasional extra calls (only when a provider fails) for higher availability.

```go
import (
	"context"
	"fmt"
	"log/slog"

	"github.com/lookatitude/beluga-ai/voice/stt"
)

type FallbackSTT struct {
	providers []stt.Provider
}

func NewFallbackSTT(providers ...stt.Provider) *FallbackSTT {
	return &FallbackSTT{providers: providers}
}

func (f *FallbackSTT) Transcribe(ctx context.Context, audio []byte) (string, error) {
	var lastErr error
	for _, provider := range f.providers {
		text, err := provider.Transcribe(ctx, audio)
		if err == nil {
			return text, nil
		}
		lastErr = err
		slog.Warn("STT provider failed, trying next",
			"provider", provider.GetName(),
			"error", err,
		)
	}
	return "", fmt.Errorf("all providers failed, last error: %w", lastErr)
}
```
Speech-to-text providers:

| Provider | Streaming | Languages | Best For |
| --- | --- | --- | --- |
| Deepgram | WebSocket | 30+ | Real-time, accuracy |
| Google | gRPC | 125+ | Language coverage |
| Azure | WebSocket | 100+ | Enterprise integration |
| OpenAI (Whisper) | REST only | 99 | Batch processing |

Text-to-speech providers:

| Provider | Streaming | Voices | Best For |
| --- | --- | --- | --- |
| OpenAI | Yes | 6 | Simple, fast |
| ElevenLabs | Yes | 100+ | Voice cloning, quality |
| Google | Yes | 220+ | Language coverage |
| Azure | Yes | 400+ | Enterprise, SSML |

Speech-to-speech providers:

| Provider | Streaming | Latency | Best For |
| --- | --- | --- | --- |
| Amazon Nova | Bidirectional | Low | Real-time conversations |
| OpenAI Realtime | Bidirectional | Low | GPT-powered agents |