Cartesia Voice Provider
Cartesia provides ultra-low-latency text-to-speech synthesis through the Sonic voice engine. The Beluga AI provider uses Cartesia’s HTTP API with the httpclient infrastructure for built-in retry support, producing raw PCM audio output suitable for real-time voice pipelines.
Choose Cartesia when latency is critical, for example in conversational voice agents where every millisecond of TTS delay affects the user experience. Cartesia’s Sonic engine is optimized for speed-first synthesis with direct PCM output, avoiding the overhead of compressed audio decoding. If voice quality and voice variety matter more than latency, consider ElevenLabs instead.
Installation
```go
import _ "github.com/lookatitude/beluga-ai/voice/tts/providers/cartesia"
```

The blank import registers the "cartesia" provider with the TTS registry.
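Registration through a blank import relies on Go running a package's `init` function when the package is loaded. The sketch below illustrates that general pattern with a hypothetical in-process registry. The names `Engine`, `Factory`, `Register`, and `New` are assumptions made for illustration and are not Beluga's actual internals.

```go
package main

import (
	"fmt"
	"sync"
)

// Engine is a hypothetical stand-in for a TTS engine interface.
type Engine interface {
	Name() string
}

// Factory builds an Engine. A real registry would likely accept a config too.
type Factory func() (Engine, error)

var (
	mu        sync.RWMutex
	factories = map[string]Factory{}
)

// Register stores a factory under a provider name.
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	factories[name] = f
}

// New looks up a registered factory by name and invokes it.
func New(name string) (Engine, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("provider %q is not registered", name)
	}
	return f()
}

type demoEngine struct{}

func (demoEngine) Name() string { return "demo" }

// init runs automatically at program start. In a provider package, this is
// the code a blank import triggers.
func init() {
	Register("demo", func() (Engine, error) { return demoEngine{}, nil })
}

func main() {
	e, err := New("demo")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(e.Name())

	// An unregistered name fails, which is what a missing blank import looks like.
	_, err = New("missing")
	fmt.Println(err != nil)
}
```

In this pattern, forgetting the blank import surfaces as a lookup error when the engine is created, not as a compile error.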
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| Voice | string | (none) | Cartesia voice UUID |
| Model | string | "sonic-2" | Cartesia model (sonic-2, sonic-english) |
| SampleRate | int | 24000 | Output sample rate in Hz |
| Extra | map[string]any | (none) | Provider-specific settings; see Extra Fields below |
Extra Fields
| Key | Type | Required | Description |
|---|---|---|---|
| api_key | string | Yes | Cartesia API key |
| base_url | string | No | Override base URL |
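Because `Extra` is an untyped map, values such as `api_key` have to be type-asserted when they are read. The following sketch shows one way to validate the two keys from the table before constructing an engine. `parseExtra` and `cartesiaSettings` are hypothetical names used only for illustration, and the provider's own validation may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// cartesiaSettings is a hypothetical typed view of the Extra map.
type cartesiaSettings struct {
	APIKey  string
	BaseURL string // empty means "use the provider default"
}

// parseExtra checks the required api_key and the optional base_url.
func parseExtra(extra map[string]any) (cartesiaSettings, error) {
	var s cartesiaSettings

	// A missing key and a non-string value both fail the assertion.
	key, ok := extra["api_key"].(string)
	if !ok || key == "" {
		return s, errors.New(`extra["api_key"] is required and must be a non-empty string`)
	}
	s.APIKey = key

	// base_url is optional, but if present it must be a string.
	if v, present := extra["base_url"]; present {
		u, ok := v.(string)
		if !ok {
			return s, fmt.Errorf(`extra["base_url"] must be a string, got %T`, v)
		}
		s.BaseURL = u
	}
	return s, nil
}

func main() {
	s, err := parseExtra(map[string]any{"api_key": "example-key"})
	fmt.Println(s.APIKey, s.BaseURL == "", err)

	_, err = parseExtra(map[string]any{})
	fmt.Println(err != nil)
}
```

Checking the map up front turns a missing or mistyped key into an explicit configuration error instead of a failed request later on.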
Basic Usage
```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/voice/tts"
	_ "github.com/lookatitude/beluga-ai/voice/tts/providers/cartesia"
)

func main() {
	ctx := context.Background()

	// Create the engine through the registry; the blank import above registers "cartesia".
	engine, err := tts.New("cartesia", tts.Config{
		Voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
		Extra: map[string]any{"api_key": os.Getenv("CARTESIA_API_KEY")},
	})
	if err != nil {
		log.Fatal(err)
	}

	audio, err := engine.Synthesize(ctx, "Hello, welcome to Beluga AI.")
	if err != nil {
		log.Fatal(err)
	}

	// The file holds raw PCM with no header; playback needs the sample rate (24000 Hz by default).
	if err := os.WriteFile("output.pcm", audio, 0644); err != nil {
		log.Fatal(err)
	}
}
```

Direct Construction
```go
import (
	"os"

	"github.com/lookatitude/beluga-ai/voice/tts"
	"github.com/lookatitude/beluga-ai/voice/tts/providers/cartesia"
)

engine, err := cartesia.New(tts.Config{
	Voice:      "a0e99841-438c-4a64-b679-ae501e7d6091",
	Model:      "sonic-2",
	SampleRate: 24000,
	Extra:      map[string]any{"api_key": os.Getenv("CARTESIA_API_KEY")},
})
```

Streaming
The streaming interface synthesizes each text chunk from the input stream independently:

```go
for chunk, err := range engine.SynthesizeStream(ctx, textStream) {
	if err != nil {
		log.Printf("error: %v", err)
		break
	}
	transport.Send(chunk)
}
```

FrameProcessor Integration
```go
processor := tts.AsFrameProcessor(engine, 24000, tts.WithVoice("a0e99841-438c-4a64-b679-ae501e7d6091"))
pipeline := voice.Chain(sttProcessor, llmProcessor, processor)
```

Advanced Features
Output Format
Cartesia outputs raw PCM audio (16-bit little-endian, pcm_s16le) by default. The output format is configured in the request body and matches the sample rate specified in the config:

```go
engine, err := tts.New("cartesia", tts.Config{
	Voice:      "a0e99841-438c-4a64-b679-ae501e7d6091",
	SampleRate: 44100, // Override default 24000 Hz
	Extra:      map[string]any{"api_key": os.Getenv("CARTESIA_API_KEY")},
})
```

Built-in Retry
The Cartesia provider uses Beluga’s httpclient infrastructure, which automatically retries transient failures with exponential backoff (up to 2 retries by default).
Per-Request Options
```go
audio, err := engine.Synthesize(ctx, "Hello!",
	tts.WithVoice("different-voice-uuid"),
	tts.WithSampleRate(16000),
)
```

Custom Endpoint

```go
engine, err := tts.New("cartesia", tts.Config{
	Voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
	Extra: map[string]any{
		"api_key":  os.Getenv("CARTESIA_API_KEY"),
		"base_url": "https://cartesia.internal.corp",
	},
})
```
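When the override comes from an environment variable or a config file, it can help to check it before passing it as `base_url`. The sketch below is one possible pre-check using only the standard library. `normalizeBaseURL` is a hypothetical helper, and whether the provider tolerates a trailing slash or a missing scheme on its own is not documented here, so treat the rules as assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeBaseURL verifies that raw is an absolute http(s) URL with a host
// and returns it without trailing slashes.
func normalizeBaseURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("invalid base URL: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("base URL %q must use http or https", raw)
	}
	if u.Host == "" {
		return "", fmt.Errorf("base URL %q has no host", raw)
	}
	return strings.TrimRight(u.String(), "/"), nil
}

func main() {
	got, err := normalizeBaseURL("https://cartesia.internal.corp/")
	fmt.Println(got, err)

	// A bare hostname has no scheme and is rejected.
	_, err = normalizeBaseURL("cartesia.internal.corp")
	fmt.Println(err != nil)
}
```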