Native S2S with Amazon Nova
Speech-to-Speech (S2S) models process audio input and produce audio output directly, bypassing the traditional STT-LLM-TTS pipeline. This architectural shift eliminates the latency overhead of intermediate text conversion and preserves tonal and emotional cues that are lost when speech is transcribed to text and back. This tutorial demonstrates how to configure Amazon Nova’s S2S provider, manage bidirectional audio streams, and handle the session lifecycle for real-time voice conversations.
What You Will Build
Section titled “What You Will Build”A real-time voice conversation system using Amazon Nova’s S2S model that processes audio-to-audio with sub-second latency, preserving emotional nuance that is lost in traditional text-based pipelines.
Prerequisites
Section titled “Prerequisites”- AWS credentials with Bedrock/Nova access
- Go 1.23+
- Basic understanding of bidirectional audio streams
S2S vs. Traditional Pipeline
Section titled “S2S vs. Traditional Pipeline”Traditional voice agents use a three-stage pipeline where each stage adds latency and loses information:
Audio ──▶ STT ──▶ Text ──▶ LLM ──▶ Text ──▶ TTS ──▶ AudioEach transition introduces delay: STT must wait for enough audio to produce a transcript, the LLM processes text, and TTS synthesizes audio from scratch. Emotional cues like hesitation, excitement, or sarcasm are lost during the STT step and cannot be recovered by TTS.
Native S2S processes audio end-to-end in a single step:
Audio ──▶ S2S Model ──▶ AudioThe S2S approach eliminates latency from intermediate text conversion and preserves tonal and emotional cues from the speaker’s voice. The model can also respond with appropriate intonation because it processes the audio signal directly rather than a text approximation.
Step 1: Initialize the S2S Provider
Section titled “Step 1: Initialize the S2S Provider”Beluga uses the registry pattern for S2S providers, the same pattern used across all extensible packages. The blank import triggers the Amazon Nova provider’s init() function, which registers its factory with the S2S registry. The s2s.NewProvider call then looks up the "amazon_nova" factory and creates a configured provider instance.
package main
import ( "context" "fmt" "log" "os"
"github.com/lookatitude/beluga-ai/voice/s2s" _ "github.com/lookatitude/beluga-ai/voice/s2s/providers/amazon_nova")
func main() { ctx := context.Background()
provider, err := s2s.NewProvider(ctx, "amazon_nova", s2s.DefaultConfig(), s2s.WithAPIKey(os.Getenv("AWS_ACCESS_KEY_ID")), s2s.WithLanguage("en-US"), s2s.WithLatencyTarget("low"), ) if err != nil { log.Fatalf("create S2S provider: %v", err) }
_ = provider}The LatencyTarget option controls the tradeoff between response speed and output quality. S2S models use internal reasoning steps that affect both the depth of the response and the time before audio output begins. Choosing the right target depends on your application’s interaction style:
| Target | Behavior |
|---|---|
"low" | Fastest response, suitable for concierge-style agents |
"medium" | Balanced speed and reasoning depth |
"high" | Deepest reasoning, suitable for tutoring or complex tasks |
Step 2: Start a Streaming Session
Section titled “Step 2: Start a Streaming Session”S2S requires a bidirectional streaming session because the model processes input and generates output concurrently. Unlike request-response APIs, the session remains open for the duration of the conversation, allowing audio to flow in both directions simultaneously.
The StreamingS2SProvider interface extends the base provider with session-based interaction. This type assertion pattern is used because not all S2S providers support streaming — some may only offer batch processing for specific use cases.
// Cast to streaming provider for session-based interaction streamingProvider, ok := provider.(s2s.StreamingS2SProvider) if !ok { log.Fatal("provider does not support streaming") }
conversationCtx := &s2s.ConversationContext{ SessionID: "session-001", ConversationID: "conv-001", UserID: "user-001", }
session, err := streamingProvider.StartStreaming(ctx, conversationCtx, s2s.WithVoiceID("Brian"), s2s.WithEnableStreaming(true), ) if err != nil { log.Fatalf("start streaming: %v", err) } defer session.Close()Step 3: Handle Incoming Audio
Section titled “Step 3: Handle Incoming Audio”The session produces audio output chunks on a Go channel. Processing these concurrently in a goroutine ensures that the receive loop does not block the send path, which is essential for maintaining bidirectional flow. Blocking either direction would create backpressure that increases latency or causes dropped frames.
// Handle incoming model audio in a separate goroutine go func() { for chunk := range session.ReceiveAudio() { if chunk.Error != nil { fmt.Printf("receive error: %v\n", chunk.Error) continue }
if chunk.IsFinal { fmt.Println("[S2S] Response complete") }
// Forward audio to the user's speaker or transport layer playAudio(chunk.Audio) } }()The AudioOutputChunk contains both the audio data and metadata needed for playback synchronization:
| Field | Type | Description |
|---|---|---|
Audio | []byte | Raw audio data in the configured format |
Timestamp | int64 | Server-side timestamp |
IsFinal | bool | Whether this is the last chunk in a turn |
Error | error | Non-nil if an error occurred |
Step 4: Send User Audio
Section titled “Step 4: Send User Audio”Pipe audio from the user’s microphone (or any audio source) into the session. The send loop runs concurrently with the receive goroutine, creating the full-duplex audio channel that S2S models require.
// Stream user audio to the model for audioChunk := range microphoneStream(ctx) { if err := session.SendAudio(ctx, audioChunk); err != nil { log.Printf("send audio error: %v", err) break } }For optimal performance, send audio in small frames (20ms at 16kHz = 640 bytes per frame). Smaller frames reduce latency because the model begins processing sooner, but increase packet overhead. The 20ms frame size is the standard tradeoff used by most real-time audio systems, including WebRTC.
Step 5: Interruption Handling
Section titled “Step 5: Interruption Handling”S2S models handle interruptions naturally because they continuously monitor the input audio stream. When the user begins speaking while the model is generating a response, the model detects the interruption and adjusts its output. This is a fundamental advantage of S2S over traditional pipelines, where the application must explicitly coordinate STT silence detection with TTS playback cancellation — a complex orchestration problem that often results in clipped responses or delayed interruption recognition.
Architecture
Section titled “Architecture”User Microphone User Speaker │ ▲ ▼ │ SendAudio(ctx, audio) ReceiveAudio() │ │ ▼ │┌─────────────────── S2S Session ──────────────────┐│ ││ Audio In ──▶ Amazon Nova Model ──▶ Audio Out ││ │└───────────────────────────────────────────────────┘Verification
Section titled “Verification”- Set AWS credentials in the environment.
- Run the application and say “Hello, who are you?”
- Verify the model responds with voice within 500ms.
- Try interrupting the model while it is speaking and confirm it adapts.
Next Steps
Section titled “Next Steps”- Configuring S2S Reasoning Modes — Optimize for speed or quality
- Voice Session Interruptions — Session-level interruption management
- Scalable Voice Backend — Production-grade backend with LiveKit