Skip to content
Docs

OpenAI Whisper Voice Provider

OpenAI Whisper provides highly accurate batch speech-to-text transcription through the OpenAI Audio Transcriptions API. The Beluga AI provider uploads audio as multipart form data and returns the transcribed text. Whisper does not support native streaming; the streaming interface transcribes each audio chunk independently as a batch request.

Choose Whisper when you already use the OpenAI API and need reliable batch transcription without adding another vendor. Whisper excels at accuracy across many languages but does not provide real-time interim results. For real-time streaming with partial transcripts, use Deepgram or AssemblyAI.

import _ "github.com/lookatitude/beluga-ai/voice/stt/providers/whisper"

The blank import registers the "whisper" provider with the STT registry.

FieldTypeDefaultDescription
LanguagestringISO-639 language code (e.g., "en")
Modelstring"whisper-1"Whisper model identifier
ExtraSee below
KeyTypeRequiredDescription
api_keystringYesOpenAI API key
base_urlstringNoOverride API base URL
package main
import (
"context"
"fmt"
"log"
"os"
"github.com/lookatitude/beluga-ai/voice/stt"
_ "github.com/lookatitude/beluga-ai/voice/stt/providers/whisper"
)
func main() {
ctx := context.Background()
engine, err := stt.New("whisper", stt.Config{
Model: "whisper-1",
Extra: map[string]any{"api_key": os.Getenv("OPENAI_API_KEY")},
})
if err != nil {
log.Fatal(err)
}
audio, err := os.ReadFile("recording.wav")
if err != nil {
log.Fatal(err)
}
text, err := engine.Transcribe(ctx, audio)
if err != nil {
log.Fatal(err)
}
fmt.Println("Transcript:", text)
}
import "github.com/lookatitude/beluga-ai/voice/stt/providers/whisper"
engine, err := whisper.New(stt.Config{
Model: "whisper-1",
Language: "en",
Extra: map[string]any{"api_key": os.Getenv("OPENAI_API_KEY")},
})

Whisper does not support native real-time streaming. The TranscribeStream method transcribes each audio chunk independently as a separate batch request. Each chunk produces a final transcript event:

for event, err := range engine.TranscribeStream(ctx, audioStream) {
if err != nil {
log.Printf("error: %v", err)
break
}
// All events from Whisper are final (no partial results)
fmt.Printf("[FINAL] %s\n", event.Text)
}

For real-time transcription with interim results, consider Deepgram or AssemblyAI which support native WebSocket streaming.

processor := stt.AsFrameProcessor(engine, stt.WithLanguage("en"))
pipeline := voice.Chain(vadProcessor, processor, llmProcessor, ttsProcessor)
text, err := engine.Transcribe(ctx, audio,
stt.WithLanguage("fr"),
stt.WithModel("whisper-1"),
)

Use an alternative OpenAI-compatible endpoint (e.g., Azure OpenAI):

engine, err := stt.New("whisper", stt.Config{
Model: "whisper-1",
Extra: map[string]any{
"api_key": os.Getenv("OPENAI_API_KEY"),
"base_url": "https://my-instance.openai.azure.com/openai/deployments/whisper-1",
},
})