Integrating with LiveKit and Vapi
Production voice applications require infrastructure for WebRTC networking, STUN/TURN servers, jitter buffering, and room management. Building this infrastructure from scratch is a multi-month effort that is tangential to the core value of your voice agent. Platforms like LiveKit and Vapi handle this infrastructure, allowing Beluga AI to focus on the intelligence layer — STT, LLM reasoning, TTS, and conversation management. This tutorial demonstrates how to connect Beluga’s voice backend to these platforms.
What You Will Build
Section titled “What You Will Build”A voice agent that connects to a LiveKit room or Vapi endpoint, processes participant audio through Beluga’s STT/TTS or S2S pipeline, and publishes agent audio back to the room.
Prerequisites
Section titled “Prerequisites”- LiveKit API credentials (URL, API key, API secret) or Vapi API key
- Completion of Native S2S with Amazon Nova or a working STT/TTS configuration
Why Third-Party Backends?
Section titled “Why Third-Party Backends?”While Beluga provides STT, TTS, and S2S capabilities, production voice applications need additional infrastructure that is outside the scope of an AI framework:
- WebRTC Networking — STUN/TURN servers, ICE candidates, jitter buffers. These handle NAT traversal, firewall bypass, and packet reordering that are required for real-time audio over the internet.
- Room Management — Multiple participants, track subscriptions, media routing. Managing who hears whom and how audio is mixed requires a dedicated media server.
- Global Edge Networks — Low-latency audio delivery across regions. Edge servers reduce the round-trip time between the user and the nearest processing node.
- Recording and Monitoring — Session recording, quality metrics, debugging tools. Operational visibility is essential for diagnosing audio quality issues in production.
Beluga’s backend package integrates with these platforms so your agent logic remains the same regardless of the transport layer. You can switch from LiveKit to Vapi by changing the provider name without rewriting your agent callback or pipeline configuration.
Step 1: Configure the LiveKit Backend
Section titled “Step 1: Configure the LiveKit Backend”Create a voice backend using the LiveKit provider. The backend uses Beluga’s standard registry pattern — the blank import registers the "livekit" factory, and backend.NewBackend creates a configured instance. The vbiface.Config struct centralizes all backend settings including provider credentials, pipeline type, concurrency limits, and observability flags.
package main
import ( "context" "log" "os" "time"
"github.com/lookatitude/beluga-ai/voice/backend" vbiface "github.com/lookatitude/beluga-ai/voice/backend/iface" _ "github.com/lookatitude/beluga-ai/voice/backend/providers/livekit")
func main() { ctx := context.Background()
cfg := &vbiface.Config{ Provider: "livekit", PipelineType: vbiface.PipelineTypeS2S, S2SProvider: "openai_realtime", ProviderConfig: map[string]any{ "url": os.Getenv("LIVEKIT_URL"), "api_key": os.Getenv("LIVEKIT_API_KEY"), "api_secret": os.Getenv("LIVEKIT_API_SECRET"), }, MaxConcurrentSessions: 100, LatencyTarget: 500 * time.Millisecond, Timeout: 30 * time.Second, EnableTracing: true, EnableMetrics: true, }
be, err := backend.NewBackend(ctx, "livekit", cfg) if err != nil { log.Fatalf("create backend: %v", err) } defer be.Stop(ctx)
if err := be.Start(ctx); err != nil { log.Fatalf("start backend: %v", err) }
log.Println("Backend started")}Step 2: Create a Session
Section titled “Step 2: Create a Session”Each voice session represents one participant interaction in the room. The AgentCallback is the integration point between the backend infrastructure and your application logic — it receives the transcribed text and returns the agent’s response. This separation means your agent logic is transport-agnostic: the same callback works with LiveKit, Vapi, or any other backend provider.
sessionCfg := &vbiface.SessionConfig{ UserID: "user-001", Transport: "webrtc", ConnectionURL: "wss://your-app.example.com/voice", PipelineType: vbiface.PipelineTypeS2S, AgentCallback: func(ctx context.Context, transcript string) (string, error) { // Process transcript and return response return "You said: " + transcript, nil }, }
session, err := be.CreateSession(ctx, sessionCfg) if err != nil { log.Fatalf("create session: %v", err) }
if err := session.Start(ctx); err != nil { log.Fatalf("start session: %v", err) } defer session.Stop(ctx)For S2S pipelines, the callback is optional since the S2S model handles the full audio-to-audio flow. The callback is primarily used with STT/TTS pipelines where the application needs to process the transcribed text before generating a response.
Step 3: Handle Audio Events
Section titled “Step 3: Handle Audio Events”The session provides methods to receive participant audio and send agent audio back. The backend orchestrator handles the full pipeline flow: incoming audio goes through noise cancellation, VAD, and STT; the transcript is passed to the agent callback; and the response is synthesized and sent back. You only need to bridge the session’s audio channel with your transport layer.
// Process incoming audio from the participant go func() { for audio := range session.ReceiveAudio() { if err := session.ProcessAudio(ctx, audio); err != nil { log.Printf("process audio error: %v", err) } } }()Step 4: Vapi Integration
Section titled “Step 4: Vapi Integration”Vapi provides a managed voice pipeline where Beluga acts as the “brain” (LLM) behind a Vapi call. In this model, Vapi handles the telephony, STT, and TTS; your application provides only the response logic. This architecture is useful when you want telephony support (PSTN, SIP) without managing telephony infrastructure, or when Vapi’s built-in STT/TTS selection meets your needs and you want to minimize the components you operate.
package main
import ( "context" "log" "os" "time"
"github.com/lookatitude/beluga-ai/voice/backend" vbiface "github.com/lookatitude/beluga-ai/voice/backend/iface" _ "github.com/lookatitude/beluga-ai/voice/backend/providers/vapi")
func main() { ctx := context.Background()
cfg := &vbiface.Config{ Provider: "vapi", PipelineType: vbiface.PipelineTypeSTTTTS, STTProvider: "deepgram", TTSProvider: "elevenlabs", ProviderConfig: map[string]any{ "api_key": os.Getenv("VAPI_API_KEY"), }, Timeout: 30 * time.Second, EnableTracing: true, }
be, err := backend.NewBackend(ctx, "vapi", cfg) if err != nil { log.Fatalf("create Vapi backend: %v", err) } defer be.Stop(ctx)
log.Println("Vapi backend ready for incoming calls")}Architecture
Section titled “Architecture”┌───────────────────────────────────────────────────┐│ LiveKit Room ││ ││ User ◀──WebRTC──▶ LiveKit Server ◀──▶ Beluga AI ││ ││ Audio In ──▶ [Pipeline] ──▶ Audio Out ││ STT/TTS or S2S │└───────────────────────────────────────────────────┘Verification
Section titled “Verification”- Start a LiveKit room via the LiveKit dashboard or CLI.
- Run the Beluga application and verify the backend starts without errors.
- Join the room as a participant and speak.
- Confirm the agent responds with synthesized audio.
- Check
GetActiveSessionCount()reflects the active sessions.
Next Steps
Section titled “Next Steps”- Sensitivity Tuning — Optimize VAD and turn detection for your environment
- Scalable Voice Backend — Scale to hundreds of concurrent sessions
- Real-time STT Streaming — Use custom STT with backend providers