Capability Layer

LLM Providers & Abstraction

A unified ChatModel interface across 22+ providers with intelligent routing, structured output parsing, context window management, and composable middleware — two methods that work with everything.

22+ Providers · Router · Structured Output · 6 Context Strategies

Overview

The LLM package is the foundation that every other Beluga AI capability builds on. It provides a unified ChatModel interface with exactly two methods — Generate and Stream — that work identically across 22+ providers. Whether you are calling OpenAI, Anthropic, a local Ollama instance, or AWS Bedrock, your application code stays the same. Switch providers by changing a string, not rewriting your application.

Beyond basic abstraction, the LLM package includes production-critical features: an intelligent router that distributes requests across providers based on cost, latency, or custom strategies; structured output parsing that extracts typed Go structs from LLM responses with automatic retry; and context window management with six strategies to keep your prompts within model limits without losing critical information.

Everything is composable via middleware. Wrap any ChatModel with retry logic, rate limiting, caching, guardrails, or cost tracking — each decorator follows the func(ChatModel) ChatModel pattern and can be stacked in any order. Five hook points give you fine-grained control over the request lifecycle without modifying provider implementations.

Capabilities

ChatModel Interface

The core abstraction: two methods that work with every provider. Generate returns a complete response; Stream returns an iter.Seq2[Event, error] for real-time token streaming. All provider-specific behavior (auth, API formats, error mapping) is handled internally.

// Create any provider with the same API
model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})

// Generate a complete response
response, err := model.Generate(ctx, messages)

// Or stream tokens in real time
for event, err := range model.Stream(ctx, messages) {
    fmt.Print(event.Text())
}

LLM Router

Distribute requests across multiple LLM backends with pluggable routing strategies. Round-robin for load distribution, cost-optimized for budget control, latency-optimized for speed-critical paths, or learned routing that adapts based on historical performance. The router implements ChatModel, so it is a transparent drop-in replacement.

router := llm.NewRouter(
    llm.RouteTarget{Model: gpt4o, Weight: 0.6},
    llm.RouteTarget{Model: claude, Weight: 0.3},
    llm.RouteTarget{Model: gemini, Weight: 0.1},
    llm.WithStrategy(llm.CostOptimized),
    llm.WithFallback(ollamaLocal),
)
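One way a weighted router can choose a target — a sketch of the general technique, not necessarily Beluga AI's actual selection algorithm — is cumulative-weight sampling: draw a random number, walk the targets, and return the first one whose weight bucket contains it. The target type and pick function below are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// target pairs a model name with a routing weight (illustrative).
type target struct {
	name   string
	weight float64
}

// pick selects a target with probability proportional to its weight.
// r must be in [0, 1); passing it in makes the function testable.
func pick(targets []target, r float64) string {
	total := 0.0
	for _, t := range targets {
		total += t.weight
	}
	x := r * total
	for _, t := range targets {
		if x < t.weight {
			return t.name
		}
		x -= t.weight // move past this target's bucket
	}
	return targets[len(targets)-1].name // guard against float rounding
}

func main() {
	targets := []target{{"gpt-4o", 0.6}, {"claude", 0.3}, {"gemini", 0.1}}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(targets, rand.Float64())]++
	}
	// Heavier weights receive proportionally more traffic.
	fmt.Println(counts["gpt-4o"] > counts["claude"] && counts["claude"] > counts["gemini"])
}
```

Cost- and latency-optimized strategies replace the static weights with scores computed from per-provider pricing or observed response times, but the selection loop stays the same shape.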

Structured Output

Parse LLM responses directly into typed Go structs via JSON Schema. StructuredOutput[T] wraps any ChatModel, injects the schema into the prompt, validates the response, and automatically retries on parse failure. No more manual JSON extraction or regex parsing.

type Analysis struct {
    Sentiment  string   `json:"sentiment"`
    Confidence float64  `json:"confidence"`
    KeyTopics  []string `json:"key_topics"`
}
structured := llm.Structured[Analysis](model)
result, err := structured.Generate(ctx, messages)
// result.Sentiment, result.Confidence, result.KeyTopics are typed
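The mechanics behind a wrapper like this can be sketched generically: append a JSON-only instruction, unmarshal the reply into T, and retry when parsing fails. The structuredGenerate helper below is an illustrative reconstruction, not the Beluga AI implementation (which also injects a full JSON Schema and validates against it).

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// generateFn stands in for a ChatModel call, reduced to one function.
type generateFn func(prompt string) (string, error)

// structuredGenerate asks for JSON, parses it into T, retrying on failure.
func structuredGenerate[T any](gen generateFn, prompt string, retries int) (T, error) {
	var out T
	instr := prompt + "\nRespond with JSON only."
	var lastErr error
	for i := 0; i <= retries; i++ {
		raw, err := gen(instr)
		if err != nil {
			lastErr = err
			continue
		}
		if err := json.Unmarshal([]byte(raw), &out); err != nil {
			lastErr = err
			continue // malformed JSON — ask again
		}
		return out, nil
	}
	return out, errors.Join(errors.New("structured parse failed"), lastErr)
}

type analysis struct {
	Sentiment  string  `json:"sentiment"`
	Confidence float64 `json:"confidence"`
}

func main() {
	calls := 0
	// Fake model: chatters on the first call, returns valid JSON on the second.
	gen := func(prompt string) (string, error) {
		calls++
		if calls == 1 {
			return "sure, here is your JSON:", nil
		}
		return `{"sentiment":"positive","confidence":0.9}`, nil
	}
	res, err := structuredGenerate[analysis](gen, "Analyze: 'great'", 2)
	fmt.Println(res.Sentiment, res.Confidence, err) // positive 0.9 <nil>
}
```

The retry-on-parse-failure loop is what removes the manual JSON extraction: a malformed first reply costs one extra round trip instead of a runtime error.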

Context Window Management

Six strategies to keep prompts within model token limits without losing critical information. Choose based on your use case: Truncation for simple cutoff, Sliding Window for recent history, Summarization for long conversations, Semantic Selection for relevance-based filtering, Adaptive for dynamic adjustment, or Hybrid combining multiple approaches.

model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"},
    llm.WithContextManager(llm.SlidingWindow(20)),   // Keep last 20 messages
    llm.WithContextManager(llm.Summarize(summarizer)), // Summarize overflow
)

Tokenizer

Accurate token counting and encoding/decoding across providers. Supports tiktoken (OpenAI models) and SentencePiece (open-source models). Essential for context management, cost estimation, and rate limit awareness.

tok, _ := tokenizer.New("gpt-4o")
count := tok.Count("How many tokens is this?")  // 6
tokens := tok.Encode("Hello world")              // []int{...}
text := tok.Decode(tokens)                       // "Hello world"

Provider-Aware Rate Limiting

Built-in rate limiting that understands provider-specific constraints: requests per minute (RPM), tokens per minute (TPM), and concurrent request limits. Automatic cooldown and backoff prevent 429 errors without manual retry logic.

model, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"},
    llm.WithProviderLimits(llm.Limits{
        RPM:        500,
        TPM:        150000,
        Concurrent: 50,
    }),
)

Prompt Cache Optimization

Automatically orders messages to maximize cache hits for providers that support prompt caching (Anthropic, Google). Static content — system prompts, tool definitions, few-shot examples — is placed first so it falls within the cacheable prefix. This reduces costs and latency for repeated interactions without changing your code.

// PromptBuilder automatically orders for cache optimization
builder := prompt.NewBuilder(
    prompt.WithSystemPrompt(systemPrompt),  // Static — cached
    prompt.WithTools(tools...),             // Static — cached
    prompt.WithExamples(examples...),       // Static — cached
    prompt.WithMessages(history...),        // Dynamic — after cache prefix
)
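The ordering rule itself is a stable partition: every static segment moves ahead of every dynamic one, with relative order preserved inside each group so the cacheable prefix is byte-identical across requests. A minimal sketch of that rule, with hypothetical segment and orderForCache names:

```go
package main

import "fmt"

// segment is a prompt part tagged as static (cacheable) or dynamic.
type segment struct {
	text   string
	static bool
}

// orderForCache places all static segments before dynamic ones,
// preserving relative order within each group (a stable partition).
func orderForCache(segs []segment) []segment {
	out := make([]segment, 0, len(segs))
	for _, s := range segs {
		if s.static {
			out = append(out, s)
		}
	}
	for _, s := range segs {
		if !s.static {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	segs := []segment{
		{"user: hello", false},     // dynamic — changes every request
		{"system prompt", true},    // static — identical across requests
		{"tool definitions", true}, // static
	}
	for _, s := range orderForCache(segs) {
		fmt.Println(s.text)
	}
	// system prompt
	// tool definitions
	// user: hello
}
```

Stability matters: if static segments shuffled between requests, the prefix bytes would differ and the provider's cache would miss even though the content was unchanged.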

Middleware and Hooks

Composable decorators follow the func(ChatModel) ChatModel pattern. Stack retry, rate limiting, caching, logging, guardrails, and cost tracking in any order. Five hook points — BeforeGenerate, AfterGenerate, OnStream, OnToolCall, OnError — give fine-grained lifecycle control.

model = llm.ApplyMiddleware(model,
    llm.WithRetry(3, time.Second),
    llm.WithCache(cache),
    llm.WithCostTracker(tracker),
    llm.WithHooks(llm.Hooks{
        BeforeGenerate: func(ctx context.Context, msgs []schema.Message) error {
            slog.Info("generating", "messages", len(msgs))
            return nil
        },
    }),
)

Architecture

Your Application
  model.Generate(ctx, messages) / model.Stream(ctx, messages)
          |
          v
Middleware Stack
  Retry | Cache | Rate Limit | Cost Tracking | Hooks
          |
          v
LLM Router
  Round-Robin | Cost-Optimized | Latency-Optimized | Learned
          |
          v
OpenAI | Anthropic | Gemini | Bedrock | Ollama | Groq | +16 more

Providers & Implementations

Provider      | Priority  | Key Differentiator
OpenAI        | Core      | GPT-4o, o1/o3 reasoning, function calling, streaming
Anthropic     | Core      | Claude 3.5/4, extended thinking, prompt caching
Google Gemini | Core      | Gemini 2.x, 1M+ context, multimodal, grounding
AWS Bedrock   | Core      | Multi-model gateway, enterprise IAM, VPC endpoints
Ollama        | Core      | Local inference, privacy-first, no API key needed
Groq          | Core      | LPU inference, lowest latency, Llama/Mixtral
Mistral       | Extended  | Mistral Large/Medium, function calling, EU-hosted
DeepSeek      | Extended  | DeepSeek-V3/R1, strong reasoning, cost-efficient
xAI Grok      | Extended  | Grok-2, real-time information, humor-aware
Cohere        | Extended  | Command R+, RAG-optimized, enterprise search
Together AI   | Extended  | Open-source model hosting, fine-tuning, fast inference
Fireworks AI  | Extended  | Optimized open-source inference, function calling
Azure OpenAI  | Extended  | Enterprise compliance, data residency, AAD auth
Perplexity    | Extended  | Search-augmented generation, real-time web access
SambaNova     | Extended  | Custom silicon, high throughput for enterprise
Cerebras      | Extended  | Wafer-scale inference, extreme speed
OpenRouter    | Extended  | Multi-provider gateway, unified API, model discovery
Hugging Face  | Community | Inference API, open-source model access
Voyage AI     | Community | Embedding-focused, high-quality retrieval models
Jina AI       | Community | Embeddings and reranking, multilingual

Full Example

A complete example showing an LLM router with multiple providers, structured output, and streaming with middleware:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/schema"
)

type SentimentResult struct {
    Sentiment  string   `json:"sentiment"`
    Confidence float64  `json:"confidence"`
    Reasons    []string `json:"reasons"`
}

func main() {
    ctx := context.Background()

    // Create providers
    gpt4o, _ := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})
    claude, _ := llm.New("anthropic", llm.ProviderConfig{Model: "claude-sonnet-4-20250514"})
    gemini, _ := llm.New("google", llm.ProviderConfig{Model: "gemini-2.0-flash"})

    // Build a cost-optimized router with fallback
    router := llm.NewRouter(
        llm.RouteTarget{Model: gpt4o, Weight: 0.5},
        llm.RouteTarget{Model: claude, Weight: 0.3},
        llm.RouteTarget{Model: gemini, Weight: 0.2},
        llm.WithStrategy(llm.CostOptimized),
    )

    // Add middleware: retry, rate limiting, cost tracking
    model := llm.ApplyMiddleware(router,
        llm.WithRetry(3, time.Second),
        llm.WithRateLimit(100, time.Minute),
    )

    // Structured output: parse LLM response into a typed struct
    structured := llm.Structured[SentimentResult](model)
    result, _ := structured.Generate(ctx, []schema.Message{
        {Role: "user", Content: "Analyze the sentiment: 'Beluga AI makes Go fun again'"},
    })
    fmt.Printf("Sentiment: %s (%.0f%% confidence)\n", result.Sentiment, result.Confidence*100)

    // Streaming: real-time token output
    for event, err := range model.Stream(ctx, []schema.Message{
        {Role: "user", Content: "Explain why Go is great for AI agents"},
    }) {
        if err != nil {
            break
        }
        fmt.Print(event.Text())
    }
}

Related Features