Cartesia Voice Provider

Cartesia provides ultra-low-latency text-to-speech synthesis through the Sonic voice engine. The Beluga AI provider uses Cartesia’s HTTP API with the httpclient infrastructure for built-in retry support, producing raw PCM audio output suitable for real-time voice pipelines.

Choose Cartesia when latency is critical, for example in conversational voice agents where TTS delay directly affects the user experience. Cartesia’s Sonic engine is optimized for speed-first synthesis with direct PCM output, which avoids the overhead of decoding compressed audio. If voice quality and voice variety matter more than latency, consider ElevenLabs.

import _ "github.com/lookatitude/beluga-ai/voice/tts/providers/cartesia"

The blank import registers the "cartesia" provider with the TTS registry.

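| Field | Type | Default | Description |
|---|---|---|---|
| Voice | string | | Cartesia voice UUID |
| Model | string | "sonic-2" | Cartesia model (sonic-2, sonic-english) |
| SampleRate | int | 24000 | Output sample rate in Hz |
| Extra | map[string]any | | See below |

| Key | Type | Required | Description |
|---|---|---|---|
| api_key | string | Yes | Cartesia API key |
| base_url | string | No | Override base URL |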

A complete program that synthesizes one sentence and writes the raw PCM to a file:

package main

import (
	"context"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/voice/tts"
	_ "github.com/lookatitude/beluga-ai/voice/tts/providers/cartesia"
)

func main() {
	ctx := context.Background()

	// Create the engine through the registry; the API key is passed via Extra.
	engine, err := tts.New("cartesia", tts.Config{
		Voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
		Extra: map[string]any{"api_key": os.Getenv("CARTESIA_API_KEY")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Synthesize returns the audio as raw PCM bytes.
	audio, err := engine.Synthesize(ctx, "Hello, welcome to Beluga AI.")
	if err != nil {
		log.Fatal(err)
	}

	if err := os.WriteFile("output.pcm", audio, 0644); err != nil {
		log.Fatal(err)
	}
}
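
To bypass the registry, import the package directly and call its constructor:

import (
	"os"

	"github.com/lookatitude/beluga-ai/voice/tts"
	"github.com/lookatitude/beluga-ai/voice/tts/providers/cartesia"
)

engine, err := cartesia.New(tts.Config{
	Voice:      "a0e99841-438c-4a64-b679-ae501e7d6091",
	Model:      "sonic-2",
	SampleRate: 24000,
	Extra:      map[string]any{"api_key": os.Getenv("CARTESIA_API_KEY")},
})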

The streaming interface synthesizes each text chunk from the input stream independently:

for chunk, err := range engine.SynthesizeStream(ctx, textStream) {
	if err != nil {
		log.Printf("error: %v", err)
		break
	}
	transport.Send(chunk)
}

To use the engine in a voice pipeline, wrap it as a frame processor:

processor := tts.AsFrameProcessor(engine, 24000, tts.WithVoice("a0e99841-438c-4a64-b679-ae501e7d6091"))
pipeline := voice.Chain(sttProcessor, llmProcessor, processor)

Cartesia outputs raw PCM audio (16-bit little-endian, pcm_s16le) by default. The output format is configured in the request body and matches the sample rate specified in the config:

engine, err := tts.New("cartesia", tts.Config{
	Voice:      "a0e99841-438c-4a64-b679-ae501e7d6091",
	SampleRate: 44100, // Override default 24000 Hz
	Extra:      map[string]any{"api_key": os.Getenv("CARTESIA_API_KEY")},
})
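
Since the output is headerless PCM, most audio players will not open it directly. One common workaround is to prepend a standard 44-byte WAV header. The sketch below uses only the Go standard library; `wavHeader` is a hypothetical helper, not part of Beluga AI, and it assumes mono 16-bit little-endian samples at the configured sample rate. A second of silence stands in for real engine output.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// wavHeader builds a canonical 44-byte WAV header for 16-bit PCM audio.
// dataLen is the length of the raw PCM payload in bytes.
func wavHeader(sampleRate, channels, dataLen int) []byte {
	const bitsPerSample = 16
	blockAlign := channels * bitsPerSample / 8

	h := make([]byte, 44)
	copy(h[0:4], "RIFF")
	binary.LittleEndian.PutUint32(h[4:8], uint32(36+dataLen)) // file size minus 8
	copy(h[8:12], "WAVE")
	copy(h[12:16], "fmt ")
	binary.LittleEndian.PutUint32(h[16:20], 16) // size of the fmt chunk
	binary.LittleEndian.PutUint16(h[20:22], 1)  // audio format 1 = PCM
	binary.LittleEndian.PutUint16(h[22:24], uint16(channels))
	binary.LittleEndian.PutUint32(h[24:28], uint32(sampleRate))
	binary.LittleEndian.PutUint32(h[28:32], uint32(sampleRate*blockAlign)) // byte rate
	binary.LittleEndian.PutUint16(h[32:34], uint16(blockAlign))
	binary.LittleEndian.PutUint16(h[34:36], bitsPerSample)
	copy(h[36:40], "data")
	binary.LittleEndian.PutUint32(h[40:44], uint32(dataLen))
	return h
}

func main() {
	const sampleRate = 24000

	// Stand-in for the bytes returned by the engine: one second of mono silence.
	pcm := make([]byte, sampleRate*2)

	wav := append(wavHeader(sampleRate, 1, len(pcm)), pcm...)
	out := filepath.Join(os.TempDir(), "output.wav")
	if err := os.WriteFile(out, wav, 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Println("wrote", out)
}
```

The sample rate passed to `wavHeader` has to match the rate the audio was synthesized at, otherwise playback will sound too fast or too slow.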

The Cartesia provider uses Beluga’s httpclient infrastructure, which provides automatic retry with exponential backoff (up to 2 retries by default) for transient failures.
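
For readers unfamiliar with the pattern, retry with exponential backoff can be sketched as follows. This is a generic illustration, not Beluga's httpclient code: `retry`, `simulate`, and `errTransient` are hypothetical names, and the base delay and the rule for what counts as transient are assumptions. With a limit of 2 retries, an operation is attempted at most three times.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient marks a failure that is worth retrying (hypothetical).
var errTransient = errors.New("transient failure")

// retry runs fn, retrying up to maxRetries times on transient errors and
// doubling the wait between attempts. It gives up early if ctx is cancelled.
func retry(ctx context.Context, maxRetries int, base time.Duration, fn func() error) error {
	delay := base
	for attempt := 0; ; attempt++ {
		err := fn()
		if err == nil || !errors.Is(err, errTransient) || attempt == maxRetries {
			return err
		}
		select {
		case <-time.After(delay):
			delay *= 2 // exponential backoff
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// simulate runs an operation that fails `failures` times before succeeding
// and reports how many times it was called and the final error.
func simulate(failures, maxRetries int) (calls int, err error) {
	err = retry(context.Background(), maxRetries, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errTransient
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulate(2, 2)
	fmt.Println("two failures:", calls, "calls, err =", err)

	calls, err = simulate(5, 2)
	fmt.Println("five failures:", calls, "calls, err =", err)
}
```

In the first case the third attempt succeeds; in the second the retry budget runs out after three attempts and the last error is returned to the caller.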

Per-request options override the engine configuration for a single call:

audio, err := engine.Synthesize(ctx, "Hello!",
	tts.WithVoice("different-voice-uuid"),
	tts.WithSampleRate(16000),
)

To send requests to a different endpoint, set base_url in Extra:

engine, err := tts.New("cartesia", tts.Config{
	Voice: "a0e99841-438c-4a64-b679-ae501e7d6091",
	Extra: map[string]any{
		"api_key":  os.Getenv("CARTESIA_API_KEY"),
		"base_url": "https://cartesia.internal.corp",
	},
})