LLM Providers & Abstraction
A unified ChatModel interface across 22+ providers with intelligent routing, structured output parsing, context window management, and composable middleware — two methods that work with everything.
Overview
The LLM package is the foundation that every other Beluga AI capability builds on. It provides a unified ChatModel interface with exactly two methods — Generate and Stream — that work identically across 22+ providers. Whether you are calling OpenAI, Anthropic, a local Ollama instance, or AWS Bedrock, your application code stays the same. Switch providers by changing a string, not rewriting your application.
Beyond basic abstraction, the LLM package includes production-critical features: an intelligent router that distributes requests across providers based on cost, latency, or custom strategies; structured output parsing that extracts typed Go structs from LLM responses with automatic retry; and context window management with six strategies to keep your prompts within model limits without losing critical information.
Everything is composable via middleware. Wrap any ChatModel with retry logic, rate limiting, caching, guardrails, or cost tracking — each decorator follows the func(ChatModel) ChatModel pattern and can be stacked in any order. Five hook points give you fine-grained control over the request lifecycle without modifying provider implementations.
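The func(ChatModel) ChatModel pattern is plain Go, so it is easy to see how stacking works in isolation. The sketch below uses toy Message and ChatModel types, not the real beluga-ai definitions, to show how decorators compose and why the first middleware listed ends up outermost:

```go
package main

import "fmt"

// Toy stand-ins for illustration; the real beluga-ai types differ.
type Message struct{ Role, Content string }

type ChatModel interface {
	Generate(msgs []Message) (string, error)
}

type Middleware func(ChatModel) ChatModel

// modelFunc lets a plain function satisfy ChatModel.
type modelFunc func(msgs []Message) (string, error)

func (f modelFunc) Generate(msgs []Message) (string, error) { return f(msgs) }

// withLogging decorates any ChatModel with request logging.
func withLogging(next ChatModel) ChatModel {
	return modelFunc(func(msgs []Message) (string, error) {
		fmt.Printf("calling model with %d message(s)\n", len(msgs))
		return next.Generate(msgs)
	})
}

// apply stacks middleware so the first listed wraps outermost.
func apply(m ChatModel, mws ...Middleware) ChatModel {
	for i := len(mws) - 1; i >= 0; i-- {
		m = mws[i](m)
	}
	return m
}

func main() {
	base := modelFunc(func(msgs []Message) (string, error) {
		return "echo: " + msgs[len(msgs)-1].Content, nil
	})
	model := apply(base, withLogging)
	out, _ := model.Generate([]Message{{Role: "user", Content: "hi"}})
	fmt.Println(out) // prints "echo: hi" after the log line
}
```

Because every decorator returns another ChatModel, the wrapped model can be passed anywhere a bare provider is expected.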
Capabilities
ChatModel Interface
The core abstraction: two methods that work with every provider. Generate returns a complete response; Stream returns an iter.Seq2[Event, error] for real-time token streaming. All provider-specific behavior (auth, API formats, error mapping) is handled internally.
// Create any provider with the same API
model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})

// Generate a complete response
response, err := model.Generate(ctx, messages)

// Or stream tokens in real time
for event, err := range model.Stream(ctx, messages) {
    fmt.Print(event.Text())
}

LLM Router
Distribute requests across multiple LLM backends with pluggable routing strategies. Round-robin for load distribution, cost-optimized for budget control, latency-optimized for speed-critical paths, or learned routing that adapts based on historical performance. The router implements ChatModel, so it is a transparent drop-in replacement.
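To make the weighted strategy concrete, here is a self-contained sketch of smooth weighted round-robin (the nginx-style algorithm, one plausible way to honor weights deterministically; the names target and next are illustrative, not the beluga-ai API):

```go
package main

import "fmt"

// target is a simplified route target; the real RouteTarget carries a ChatModel.
type target struct {
	name    string
	weight  int
	current int
}

// next implements smooth weighted round-robin: each call bumps every
// target's score by its weight, picks the highest scorer, then subtracts
// the total weight from the winner so picks interleave smoothly.
func next(ts []*target) *target {
	var best *target
	total := 0
	for _, t := range ts {
		t.current += t.weight
		total += t.weight
		if best == nil || t.current > best.current {
			best = t
		}
	}
	best.current -= total
	return best
}

func main() {
	ts := []*target{
		{name: "gpt-4o", weight: 6},
		{name: "claude", weight: 3},
		{name: "gemini", weight: 1},
	}
	// Over 10 picks the counts match the weights: 6, 3, and 1.
	for i := 0; i < 10; i++ {
		fmt.Print(next(ts).name, " ")
	}
	fmt.Println()
}
```

Cost- and latency-optimized strategies replace the weight with a score derived from per-token pricing or observed response times.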
router := llm.NewRouter(
    llm.RouteTarget{Model: gpt4o, Weight: 0.6},
    llm.RouteTarget{Model: claude, Weight: 0.3},
    llm.RouteTarget{Model: gemini, Weight: 0.1},
    llm.WithStrategy(llm.CostOptimized),
    llm.WithFallback(ollamaLocal),
)

Structured Output
Parse LLM responses directly into typed Go structs via JSON Schema. StructuredOutput[T] wraps any ChatModel, injects the schema into the prompt, validates the response, and automatically retries on parse failure. No more manual JSON extraction or regex parsing.
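The parse-and-retry loop at the heart of such a wrapper is simple to sketch in isolation. Here generate is a stand-in for a ChatModel call (its first reply is deliberately malformed to exercise the retry path), and parseWithRetry mirrors what the wrapper does with the real model:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type Analysis struct {
	Sentiment  string  `json:"sentiment"`
	Confidence float64 `json:"confidence"`
}

// generate is a stand-in for a ChatModel call; the first reply is
// malformed JSON so the retry branch actually runs.
func generate(attempt int) string {
	if attempt == 0 {
		return "Sure! Here you go: {sentiment: positive" // not valid JSON
	}
	return `{"sentiment":"positive","confidence":0.92}`
}

// parseWithRetry re-calls the model until the reply unmarshals into the
// target struct, up to maxAttempts times.
func parseWithRetry(maxAttempts int) (Analysis, error) {
	var out Analysis
	for i := 0; i < maxAttempts; i++ {
		if err := json.Unmarshal([]byte(generate(i)), &out); err == nil {
			return out, nil
		}
	}
	return out, errors.New("no valid JSON after retries")
}

func main() {
	a, err := parseWithRetry(3)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s (%.0f%%)\n", a.Sentiment, a.Confidence*100) // positive (92%)
}
```

In the real wrapper the retry also feeds the parse error back to the model as a correction prompt, which raises the success rate on the second attempt.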
type Analysis struct {
    Sentiment  string   `json:"sentiment"`
    Confidence float64  `json:"confidence"`
    KeyTopics  []string `json:"key_topics"`
}

structured := llm.Structured[Analysis](model)
result, err := structured.Generate(ctx, messages)
// result.Sentiment, result.Confidence, and result.KeyTopics are typed

Context Window Management
Six strategies to keep prompts within model token limits without losing critical information. Choose based on your use case: Truncation for simple cutoff, Sliding Window for recent history, Summarization for long conversations, Semantic Selection for relevance-based filtering, Adaptive for dynamic adjustment, or Hybrid combining multiple approaches.
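The simplest of these, Sliding Window, can be sketched in a few lines. This is an illustrative stand-alone version (toy Message type, not the beluga-ai implementation): keep the system prompt if present, then keep only the last n messages.

```go
package main

import "fmt"

type Message struct{ Role, Content string }

// slidingWindow keeps the leading system prompt (if any) plus the last n
// messages, the basic shape of the Sliding Window strategy.
func slidingWindow(msgs []Message, n int) []Message {
	if len(msgs) == 0 {
		return msgs
	}
	var head []Message
	rest := msgs
	if msgs[0].Role == "system" {
		head, rest = msgs[:1], msgs[1:]
	}
	if len(rest) > n {
		rest = rest[len(rest)-n:]
	}
	out := make([]Message, 0, len(head)+len(rest))
	out = append(out, head...)
	return append(out, rest...)
}

func main() {
	msgs := []Message{{"system", "You are helpful."}}
	for i := 1; i <= 5; i++ {
		msgs = append(msgs, Message{"user", fmt.Sprintf("turn %d", i)})
	}
	// Keeps the system prompt plus turns 4 and 5.
	for _, m := range slidingWindow(msgs, 2) {
		fmt.Println(m.Role, m.Content)
	}
}
```

The other strategies vary only in how they choose what to drop: by token budget, by summarizing the evicted span, or by semantic relevance to the current query.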
model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"},
    llm.WithContextManager(llm.SlidingWindow(20)),     // Keep last 20 messages
    llm.WithContextManager(llm.Summarize(summarizer)), // Summarize overflow
)

Tokenizer
Accurate token counting and encoding/decoding across providers. Supports tiktoken (OpenAI models) and SentencePiece (open-source models). Essential for context management, cost estimation, and rate limit awareness.
tok, _ := tokenizer.New("gpt-4o")
count := tok.Count("How many tokens is this?") // 6
tokens := tok.Encode("Hello world")            // []int{...}
text := tok.Decode(tokens)                     // "Hello world"

Provider-Aware Rate Limiting
Built-in rate limiting that understands provider-specific constraints: requests per minute (RPM), tokens per minute (TPM), and concurrent request limits. Automatic cooldown and backoff prevent 429 errors without manual retry logic.
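The RPM half of this is a token bucket. A minimal self-contained sketch (a simulated clock stands in for time.Now so the behavior is deterministic; the real limiter also tracks TPM and concurrent requests):

```go
package main

import "fmt"

// bucket is a minimal token bucket: capacity requests may burst, and
// tokens refill continuously at rate per second.
type bucket struct {
	capacity float64
	tokens   float64
	rate     float64 // tokens added per second
	last     float64 // time of last refill, in seconds
}

// allow refills the bucket for elapsed time, then spends one token if
// available; a false result means the caller should wait or back off.
func (b *bucket) allow(now float64) bool {
	b.tokens += (now - b.last) * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// 2 requests per second with a burst of 2; the third rapid call is
	// rejected, and a later call succeeds after refill.
	b := &bucket{capacity: 2, tokens: 2, rate: 2}
	for _, t := range []float64{0, 0.1, 0.2, 1.2} {
		fmt.Printf("t=%.1fs allowed=%v\n", t, b.allow(t))
	}
}
```

TPM limiting uses the same bucket with the request's token count (from the tokenizer) spent instead of a flat 1, which is why accurate token counting matters for rate limiting.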
model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"},
    llm.WithProviderLimits(llm.Limits{
        RPM:        500,
        TPM:        150000,
        Concurrent: 50,
    }),
)

Prompt Cache Optimization
Automatically orders messages to maximize cache hits for providers that support prompt caching (Anthropic, Google). Static content — system prompts, tool definitions, few-shot examples — is placed first so it falls within the cacheable prefix. This reduces costs and latency for repeated interactions without changing your code.
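The reordering itself is a stable partition: static parts first, dynamic parts after, with relative order preserved within each group. An illustrative stand-alone sketch (the part type and orderForCache are hypothetical names, not the beluga-ai API):

```go
package main

import "fmt"

// part is a simplified prompt segment tagged by whether its content is
// static (cacheable) or changes per request.
type part struct {
	kind   string // e.g. "system", "tools", "examples", "history"
	static bool
}

// orderForCache stably moves static parts ahead of dynamic ones so they
// fall inside the provider's cacheable prefix.
func orderForCache(parts []part) []part {
	out := make([]part, 0, len(parts))
	for _, p := range parts {
		if p.static {
			out = append(out, p)
		}
	}
	for _, p := range parts {
		if !p.static {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	prompt := []part{{"history", false}, {"system", true}, {"tools", true}}
	for _, p := range orderForCache(prompt) {
		fmt.Print(p.kind, " ")
	}
	fmt.Println() // system tools history
}
```

Stability matters: providers cache a byte-identical prefix, so the static parts must appear in the same order on every request for the cache to hit.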
// PromptBuilder automatically orders for cache optimization
builder := prompt.NewBuilder(
    prompt.WithSystemPrompt(systemPrompt), // Static — cached
    prompt.WithTools(tools...),            // Static — cached
    prompt.WithExamples(examples...),      // Static — cached
    prompt.WithMessages(history...),       // Dynamic — after cache prefix
)

Middleware and Hooks
Composable decorators follow the func(ChatModel) ChatModel pattern. Stack retry, rate limiting, caching, logging, guardrails, and cost tracking in any order. Five hook points — BeforeGenerate, AfterGenerate, OnStream, OnToolCall, OnError — give fine-grained lifecycle control.
model = llm.ApplyMiddleware(model,
    llm.WithRetry(3, time.Second),
    llm.WithCache(cache),
    llm.WithCostTracker(tracker),
    llm.WithHooks(llm.Hooks{
        BeforeGenerate: func(ctx context.Context, msgs []schema.Message) error {
            slog.Info("generating", "messages", len(msgs))
            return nil
        },
    }),
)

Architecture
Providers & Implementations
| Provider | Tier | Key Differentiator |
|---|---|---|
| OpenAI | Core | GPT-4o, o1/o3 reasoning, function calling, streaming |
| Anthropic | Core | Claude 3.5/4, extended thinking, prompt caching |
| Google Gemini | Core | Gemini 2.x, 1M+ context, multimodal, grounding |
| AWS Bedrock | Core | Multi-model gateway, enterprise IAM, VPC endpoints |
| Ollama | Core | Local inference, privacy-first, no API key needed |
| Groq | Core | LPU inference, lowest latency, Llama/Mixtral |
| Mistral | Extended | Mistral Large/Medium, function calling, EU-hosted |
| DeepSeek | Extended | DeepSeek-V3/R1, strong reasoning, cost-efficient |
| xAI Grok | Extended | Grok-2, real-time information, humor-aware |
| Cohere | Extended | Command R+, RAG-optimized, enterprise search |
| Together AI | Extended | Open-source model hosting, fine-tuning, fast inference |
| Fireworks AI | Extended | Optimized open-source inference, function calling |
| Azure OpenAI | Extended | Enterprise compliance, data residency, AAD auth |
| Perplexity | Extended | Search-augmented generation, real-time web access |
| SambaNova | Extended | Custom silicon, high throughput for enterprise |
| Cerebras | Extended | Wafer-scale inference, extreme speed |
| OpenRouter | Extended | Multi-provider gateway, unified API, model discovery |
| Hugging Face | Community | Inference API, open-source model access |
| Voyage AI | Community | Embedding-focused, high-quality retrieval models |
| Jina AI | Community | Embeddings and reranking, multilingual |
Full Example
A complete example showing an LLM router with multiple providers, structured output, and streaming with middleware:
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/schema"
)

type SentimentResult struct {
    Sentiment  string   `json:"sentiment"`
    Confidence float64  `json:"confidence"`
    Reasons    []string `json:"reasons"`
}

func main() {
    ctx := context.Background()

    // Create providers
    gpt4o, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})
    claude, _ := llm.New("anthropic", llm.ProviderConfig{Model: "claude-sonnet-4-20250514"})
    gemini, _ := llm.New("google", llm.ProviderConfig{Model: "gemini-2.0-flash"})

    // Build a cost-optimized router
    router := llm.NewRouter(
        llm.RouteTarget{Model: gpt4o, Weight: 0.5},
        llm.RouteTarget{Model: claude, Weight: 0.3},
        llm.RouteTarget{Model: gemini, Weight: 0.2},
        llm.WithStrategy(llm.CostOptimized),
    )

    // Add middleware: retry and rate limiting
    model := llm.ApplyMiddleware(router,
        llm.WithRetry(3, time.Second),
        llm.WithRateLimit(100, time.Minute),
    )

    // Structured output: parse the LLM response into a typed struct
    structured := llm.Structured[SentimentResult](model)
    result, _ := structured.Generate(ctx, []schema.Message{
        {Role: "user", Content: "Analyze the sentiment: 'Beluga AI makes Go fun again'"},
    })
    fmt.Printf("Sentiment: %s (%.0f%% confidence)\n", result.Sentiment, result.Confidence*100)

    // Streaming: real-time token output
    for event, err := range model.Stream(ctx, []schema.Message{
        {Role: "user", Content: "Explain why Go is great for AI agents"},
    }) {
        if err != nil {
            break
        }
        fmt.Print(event.Text())
    }
}