Capability Layer

LLM Providers & Abstraction

A unified ChatModel interface across 22+ providers with intelligent routing, structured output parsing, context window management, and composable middleware — two methods that work with everything.

22+ Providers · Router · Structured Output · 6 Context Strategies

Overview

The LLM package is the foundation that every other Beluga AI capability builds on. It provides a unified ChatModel interface with exactly two methods — Generate and Stream — that work identically across 22+ providers. Whether you are calling OpenAI, Anthropic, a local Ollama instance, or AWS Bedrock, your application code stays the same. Switch providers by changing a string, not rewriting your application.

Beyond basic abstraction, the LLM package includes production-critical features: an intelligent router that distributes requests across providers based on cost, latency, or custom strategies; structured output parsing that extracts typed Go structs from LLM responses with automatic retry; and context window management with six strategies to keep your prompts within model limits without losing critical information.

Everything is composable via middleware. Wrap any ChatModel with retry logic, rate limiting, caching, guardrails, or cost tracking — each decorator follows the func(ChatModel) ChatModel pattern and can be stacked in any order. Five hook points give you fine-grained control over the request lifecycle without modifying provider implementations.

Capabilities

ChatModel Interface

The core abstraction: two methods that work with every provider. Generate returns a complete response; Stream returns an iter.Seq2[Event, error] for real-time token streaming. All provider-specific behavior (auth, API formats, error mapping) is handled internally.

// Create any provider with the same API
model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})

// Generate a complete response
response, err := model.Generate(ctx, messages)

// Or stream tokens in real time
for event, err := range model.Stream(ctx, messages) {
    fmt.Print(event.Text())
}

LLM Router

Distribute requests across multiple LLM backends with pluggable routing strategies. Round-robin for load distribution, cost-optimized for budget control, latency-optimized for speed-critical paths, or learned routing that adapts based on historical performance. The router implements ChatModel, so it is a transparent drop-in replacement.

router := llm.NewRouter(
    llm.RouteTarget{Model: gpt4o, Weight: 0.6},
    llm.RouteTarget{Model: claude, Weight: 0.3},
    llm.RouteTarget{Model: gemini, Weight: 0.1},
    llm.WithStrategy(llm.CostOptimized),
    llm.WithFallback(ollamaLocal),
)
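One way a weighted router can choose a target — a sketch of the general technique, not necessarily Beluga AI's actual selection algorithm — is cumulative-weight sampling: draw a random number, walk the targets, and return the first one whose weight bucket contains it. The target type and pick function below are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// target pairs a model name with a routing weight (illustrative).
type target struct {
	name   string
	weight float64
}

// pick selects a target with probability proportional to its weight.
// r must be in [0, 1); passing it in makes the function testable.
func pick(targets []target, r float64) string {
	total := 0.0
	for _, t := range targets {
		total += t.weight
	}
	x := r * total
	for _, t := range targets {
		if x < t.weight {
			return t.name
		}
		x -= t.weight // move past this target's bucket
	}
	return targets[len(targets)-1].name // guard against float rounding
}

func main() {
	targets := []target{{"gpt-4o", 0.6}, {"claude", 0.3}, {"gemini", 0.1}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(targets, rand.Float64())]++
	}
	// Heavier weights receive proportionally more traffic.
	fmt.Println(counts["gpt-4o"] > counts["claude"] && counts["claude"] > counts["gemini"])
}
```

Cost- and latency-optimized strategies replace the static weights with scores computed from per-provider pricing or observed response times, but the selection loop stays the same shape.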

Structured Output

Parse LLM responses directly into typed Go structs via JSON Schema. StructuredOutput[T] wraps any ChatModel, injects the schema into the prompt, validates the response, and automatically retries on parse failure. No more manual JSON extraction or regex parsing.

type Analysis struct {
    Sentiment  string   `json:"sentiment"`
    Confidence float64  `json:"confidence"`
    KeyTopics  []string `json:"key_topics"`
}
structured := llm.Structured[Analysis](model)
result, err := structured.Generate(ctx, messages)
// result.Sentiment, result.Confidence, result.KeyTopics are typed
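The mechanics behind a wrapper like this can be sketched generically: append a JSON-only instruction, unmarshal the reply into T, and retry when parsing fails. The structuredGenerate helper below is an illustrative reconstruction, not the Beluga AI implementation (which also injects a full JSON Schema and validates against it).

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// generateFn stands in for a ChatModel call, reduced to one function.
type generateFn func(prompt string) (string, error)

// structuredGenerate asks for JSON, parses it into T, retrying on failure.
func structuredGenerate[T any](gen generateFn, prompt string, retries int) (T, error) {
	var out T
	instr := prompt + "\nRespond with JSON only."
	var lastErr error
	for i := 0; i <= retries; i++ {
		raw, err := gen(instr)
		if err != nil {
			lastErr = err
			continue
		}
		if err := json.Unmarshal([]byte(raw), &out); err != nil {
			lastErr = err
			continue // malformed JSON — ask again
		}
		return out, nil
	}
	return out, errors.Join(errors.New("structured parse failed"), lastErr)
}

type analysis struct {
	Sentiment  string  `json:"sentiment"`
	Confidence float64 `json:"confidence"`
}

func main() {
	calls := 0
	// Fake model: chatters on the first call, returns valid JSON on the second.
	gen := func(prompt string) (string, error) {
		calls++
		if calls == 1 {
			return "sure, here is your JSON:", nil
		}
		return `{"sentiment":"positive","confidence":0.9}`, nil
	}
	res, err := structuredGenerate[analysis](gen, "Analyze: 'great'", 2)
	fmt.Println(res.Sentiment, res.Confidence, err) // positive 0.9 <nil>
}
```

The retry-on-parse-failure loop is what removes the manual JSON extraction: a malformed first reply costs one extra round trip instead of a runtime error.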

Context Window Management

Six strategies to keep prompts within model token limits without losing critical information. Choose based on your use case: Truncation for simple cutoff, Sliding Window for recent history, Summarization for long conversations, Semantic Selection for relevance-based filtering, Adaptive for dynamic adjustment, or Hybrid combining multiple approaches.

model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"},
    llm.WithContextManager(llm.SlidingWindow(20)),   // Keep last 20 messages
    llm.WithContextManager(llm.Summarize(summarizer)), // Summarize overflow
)

Tokenizer

Accurate token counting and encoding/decoding across providers. Supports tiktoken (OpenAI models) and SentencePiece (open-source models). Essential for context management, cost estimation, and rate limit awareness.

tok, _ := tokenizer.New("gpt-4o")
count := tok.Count("How many tokens is this?")  // 6
tokens := tok.Encode("Hello world")              // []int{...}
text := tok.Decode(tokens)                       // "Hello world"

Provider-Aware Rate Limiting

Built-in rate limiting that understands provider-specific constraints: requests per minute (RPM), tokens per minute (TPM), and concurrent request limits. Automatic cooldown and backoff prevent 429 errors without manual retry logic.

model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"},
    llm.WithProviderLimits(llm.Limits{
        RPM:        500,
        TPM:        150000,
        Concurrent: 50,
    }),
)

Prompt Cache Optimization

Automatically orders messages to maximize cache hits for providers that support prompt caching (Anthropic, Google). Static content — system prompts, tool definitions, few-shot examples — is placed first so it falls within the cacheable prefix. This reduces costs and latency for repeated interactions without changing your code.

// PromptBuilder automatically orders for cache optimization
builder := prompt.NewBuilder(
    prompt.WithSystemPrompt(systemPrompt),  // Static — cached
    prompt.WithTools(tools...),             // Static — cached
    prompt.WithExamples(examples...),       // Static — cached
    prompt.WithMessages(history...),        // Dynamic — after cache prefix
)
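The ordering rule itself is a stable partition: every static segment moves ahead of every dynamic one, with relative order preserved inside each group so the cacheable prefix is byte-identical across requests. A minimal sketch of that rule, with hypothetical segment and orderForCache names:

```go
package main

import "fmt"

// segment is a prompt part tagged as static (cacheable) or dynamic.
type segment struct {
	text   string
	static bool
}

// orderForCache places all static segments before dynamic ones,
// preserving relative order within each group (a stable partition).
func orderForCache(segs []segment) []segment {
	out := make([]segment, 0, len(segs))
	for _, s := range segs {
		if s.static {
			out = append(out, s)
		}
	}
	for _, s := range segs {
		if !s.static {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	segs := []segment{
		{"user: hello", false},     // dynamic — changes every request
		{"system prompt", true},    // static — identical across requests
		{"tool definitions", true}, // static
	}
	for _, s := range orderForCache(segs) {
		fmt.Println(s.text)
	}
	// system prompt
	// tool definitions
	// user: hello
}
```

Stability matters: if static segments shuffled between requests, the prefix bytes would differ and the provider's cache would miss even though the content was unchanged.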

Middleware and Hooks

Composable decorators follow the func(ChatModel) ChatModel pattern. Stack retry, rate limiting, caching, logging, guardrails, and cost tracking in any order. Five hook points — BeforeGenerate, AfterGenerate, OnStream, OnToolCall, OnError — give fine-grained lifecycle control.

model = llm.ApplyMiddleware(model,
    llm.WithRetry(3, time.Second),
    llm.WithCache(cache),
    llm.WithCostTracker(tracker),
    llm.WithHooks(llm.Hooks{
        BeforeGenerate: func(ctx context.Context, msgs []schema.Message) error {
            slog.Info("generating", "messages", len(msgs))
            return nil
        },
    }),
)

Architecture

Your Application
  model.Generate(ctx, messages) / model.Stream(ctx, messages)
          |
          v
Middleware Stack
  Retry | Cache | Rate Limit | Cost Tracking | Hooks
          |
          v
LLM Router
  Round-Robin | Cost-Optimized | Latency-Optimized | Learned
          |
          v
OpenAI | Anthropic | Gemini | Bedrock | Ollama | Groq | +16 more

Providers & Implementations

Provider      | Priority  | Key Differentiator
OpenAI        | Core      | GPT-4o, o1/o3 reasoning, function calling, streaming
Anthropic     | Core      | Claude 3.5/4, extended thinking, prompt caching
Google Gemini | Core      | Gemini 2.x, 1M+ context, multimodal, grounding
AWS Bedrock   | Core      | Multi-model gateway, enterprise IAM, VPC endpoints
Ollama        | Core      | Local inference, privacy-first, no API key needed
Groq          | Core      | LPU inference, lowest latency, Llama/Mixtral
Mistral       | Extended  | Mistral Large/Medium, function calling, EU-hosted
DeepSeek      | Extended  | DeepSeek-V3/R1, strong reasoning, cost-efficient
xAI Grok      | Extended  | Grok-2, real-time information, humor-aware
Cohere        | Extended  | Command R+, RAG-optimized, enterprise search
Together AI   | Extended  | Open-source model hosting, fine-tuning, fast inference
Fireworks AI  | Extended  | Optimized open-source inference, function calling
Azure OpenAI  | Extended  | Enterprise compliance, data residency, AAD auth
Perplexity    | Extended  | Search-augmented generation, real-time web access
SambaNova     | Extended  | Custom silicon, high throughput for enterprise
Cerebras      | Extended  | Wafer-scale inference, extreme speed
OpenRouter    | Extended  | Multi-provider gateway, unified API, model discovery
Hugging Face  | Community | Inference API, open-source model access
Voyage AI     | Community | Embedding-focused, high-quality retrieval models
Jina AI       | Community | Embeddings and reranking, multilingual

Full Example

A complete example showing an LLM router with multiple providers, structured output, and streaming with middleware:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/schema"
)

type SentimentResult struct {
    Sentiment  string   `json:"sentiment"`
    Confidence float64  `json:"confidence"`
    Reasons    []string `json:"reasons"`
}

func main() {
    ctx := context.Background()

    // Create providers
    gpt4o, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})
    claude, _ := llm.New("anthropic", llm.ProviderConfig{Model: "claude-sonnet-4-20250514"})
    gemini, _ := llm.New("google", llm.ProviderConfig{Model: "gemini-2.0-flash"})

    // Build a cost-optimized router with fallback
    router := llm.NewRouter(
        llm.RouteTarget{Model: gpt4o, Weight: 0.5},
        llm.RouteTarget{Model: claude, Weight: 0.3},
        llm.RouteTarget{Model: gemini, Weight: 0.2},
        llm.WithStrategy(llm.CostOptimized),
    )

    // Add middleware: retry, rate limiting, cost tracking
    model := llm.ApplyMiddleware(router,
        llm.WithRetry(3, time.Second),
        llm.WithRateLimit(100, time.Minute),
    )

    // Structured output: parse LLM response into a typed struct
    structured := llm.Structured[SentimentResult](model)
    result, _ := structured.Generate(ctx, []schema.Message{
        {Role: "user", Content: "Analyze the sentiment: 'Beluga AI makes Go fun again'"},
    })
    fmt.Printf("Sentiment: %s (%.0f%% confidence)\n", result.Sentiment, result.Confidence*100)

    // Streaming: real-time token output
    for event, err := range model.Stream(ctx, []schema.Message{
        {Role: "user", Content: "Explain why Go is great for AI agents"},
    }) {
        if err != nil {
            break
        }
        fmt.Print(event.Text())
    }
}

Related Features