LLM API — ChatModel, Router, Middleware
import "github.com/lookatitude/beluga-ai/llm"Package llm provides the LLM abstraction layer for the Beluga AI framework.
It defines the ChatModel interface that all LLM providers implement,
a provider registry for dynamic instantiation, composable middleware,
lifecycle hooks, structured output parsing, context window management,
tokenization, rate limiting, and multi-backend routing.
ChatModel Interface
The core abstraction is ChatModel, which every provider implements:
- Generate sends messages and returns a complete [schema.AIMessage].
- Stream sends messages and returns an [iter.Seq2] of [schema.StreamChunk] values.
- BindTools returns a new ChatModel with the given tool definitions included in every request.
- ModelID returns the underlying model identifier (e.g. “gpt-4o”).
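Put together, the interface could look roughly like the sketch below. It only restates the four methods described above; the exact parameter and return types (the message slice type, the options variadic, the tool definition type) are assumptions, not the published API:

```go
// Rough sketch of ChatModel as it might appear inside package llm.
// Signatures beyond the documented method names are assumptions.
type ChatModel interface {
	// Generate sends messages and returns the complete response.
	Generate(ctx context.Context, msgs []schema.Message, opts ...GenerateOption) (schema.AIMessage, error)
	// Stream sends messages and yields chunks as they arrive (Go 1.23+ iterators).
	Stream(ctx context.Context, msgs []schema.Message, opts ...GenerateOption) iter.Seq2[schema.StreamChunk, error]
	// BindTools returns a new ChatModel with the tool definitions attached to every request.
	BindTools(tools ...schema.ToolDefinition) ChatModel
	// ModelID reports the underlying model identifier, e.g. "gpt-4o".
	ModelID() string
}
```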
Provider Registry
Providers register themselves via init() so that importing a provider package is sufficient to make it available through the registry:
import _ "github.com/lookatitude/beluga-ai/llm/providers/openai"
model, err := llm.New("openai", cfg)Use Register to add a provider factory, New to create a ChatModel by
name, and List to discover all registered providers.
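A custom provider can be wired in the same way. This is only a sketch of the Register/List flow described above: the factory signature, the llm.Config parameter, and the newAcmeModel constructor are assumptions.

```go
// Hypothetical custom provider registration; the factory signature
// (config type, return values) is an assumption.
llm.Register("acme", func(cfg llm.Config) (llm.ChatModel, error) {
	return newAcmeModel(cfg) // placeholder constructor returning (llm.ChatModel, error)
})

// Discover everything that has been registered.
for _, name := range llm.List() {
	fmt.Println("available provider:", name)
}
```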
Middleware
Middleware wraps a ChatModel to add cross-cutting concerns. Built-in
middleware includes logging, fallback, hooks, and rate limiting:
```go
model = llm.ApplyMiddleware(model,
	llm.WithLogging(logger),
	llm.WithFallback(backup),
	llm.WithHooks(hooks),
	llm.WithProviderLimits(limits),
)
```

Middleware is applied right-to-left: the first middleware in the list becomes the outermost wrapper and executes first.
Hooks provides optional callbacks invoked during LLM operations:
BeforeGenerate, AfterGenerate, OnStream, OnToolCall, and OnError.
All fields are optional; nil hooks are skipped. Use ComposeHooks
to merge multiple Hooks into one.
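As an illustration, two partial Hooks values might be composed as below. The field names come from the list above, but the callback signatures are assumptions:

```go
// Merge two partial Hooks values; nil fields are skipped, and the
// callback signatures shown here are assumptions.
logging := llm.Hooks{
	BeforeGenerate: func(ctx context.Context, msgs []schema.Message) {
		log.Printf("generate: %d messages", len(msgs))
	},
	OnError: func(ctx context.Context, err error) {
		log.Printf("llm error: %v", err)
	},
}
metrics := llm.Hooks{
	AfterGenerate: func(ctx context.Context, msg schema.AIMessage) {
		log.Printf("response received")
	},
}
model = llm.ApplyMiddleware(model, llm.WithHooks(llm.ComposeHooks(logging, metrics)))
```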
Structured Output
StructuredOutput wraps a ChatModel to produce typed Go values.
It generates a JSON Schema from the type parameter, instructs the model
to respond in JSON, parses the response, and retries on parse failures:
```go
type Sentiment struct {
	Label string  `json:"label"`
	Score float64 `json:"score"`
}

so := llm.NewStructured[Sentiment](model)
result, err := so.Generate(ctx, msgs)
```
Context Management
ContextManager fits a message sequence within a token budget.
Two strategies are provided: “truncate” (drops oldest non-system messages)
and “sliding” (keeps the most recent messages that fit). Use
NewContextManager with options to configure:
```go
cm := llm.NewContextManager(
	llm.WithContextStrategy("sliding"),
	llm.WithTokenizer(tokenizer),
	llm.WithKeepSystemMessages(true),
)
fitted, err := cm.Fit(ctx, msgs, 4096)
```

Tokenizer
Tokenizer provides token counting and encoding/decoding.
SimpleTokenizer is a built-in word-based approximation (1 token per
4 characters) suitable for budget estimation when a model-specific
tokenizer is unavailable.
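A quick budget estimate might look like the following. Only the SimpleTokenizer type and its roughly one-token-per-four-characters heuristic are stated above; the constructor form and the CountTokens method name and signature are assumptions.

```go
// Approximate token counting with the built-in SimpleTokenizer;
// the CountTokens method name and signature are assumptions.
tok := llm.SimpleTokenizer{}
n, err := tok.CountTokens(ctx, "Summarize the quarterly report in three bullet points.")
if err == nil {
	fmt.Println("estimated tokens:", n) // roughly len(text)/4
}
```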
Routing
Router implements ChatModel by delegating to one of several backend
models chosen by a pluggable ModelSelector. Built-in strategies include
RoundRobin and FailoverChain. For automatic retry across models, use
FailoverRouter:
```go
r := llm.NewRouter(
	llm.WithModels(modelA, modelB),
	llm.WithStrategy(&llm.RoundRobin{}),
)
```
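For the automatic-retry path, a sketch might look like this. Only the FailoverRouter name appears above; the NewFailoverRouter constructor and its arguments are assumptions.

```go
// Hypothetical failover setup: try the primary model, retry the backup on error.
// The NewFailoverRouter constructor and its signature are assumptions.
fr := llm.NewFailoverRouter(primary, backup)
resp, err := fr.Generate(ctx, msgs) // router types implement ChatModel
```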
Rate Limiting
WithProviderLimits returns middleware that enforces requests-per-minute,
tokens-per-minute, and concurrency limits per provider.
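Configuring those limits might look like the sketch below. Only WithProviderLimits itself appears above; the ProviderLimits type and its field names are assumptions.

```go
// Hypothetical per-provider limit configuration; the type and field
// names are assumptions based on the limits described above.
limits := llm.ProviderLimits{
	RequestsPerMinute: 60,
	TokensPerMinute:   90_000,
	MaxConcurrency:    4,
}
model = llm.ApplyMiddleware(model, llm.WithProviderLimits(limits))
```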
Generate Options
GenerateOption functional options configure individual Generate/Stream
calls: temperature, max tokens, top-p, stop sequences, response format,
tool choice, and provider-specific metadata.
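A per-call sketch follows. The tunable parameters are listed above, but the option constructor names shown here are assumptions.

```go
// Hypothetical per-call options; constructor names such as WithTemperature
// are assumptions based on the parameters listed above.
resp, err := model.Generate(ctx, msgs,
	llm.WithTemperature(0.2),
	llm.WithMaxTokens(512),
	llm.WithStopSequences("\n\n"),
)
```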
Streaming
Streaming uses iter.Seq2 (Go 1.23+):
```go
for chunk, err := range model.Stream(ctx, msgs) {
	if err != nil {
		break
	}
	fmt.Print(chunk.Delta)
}
```
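When the caller needs the complete text rather than incremental printing, the same loop can accumulate the deltas. This assumes chunk.Delta is the incremental text, as the example above suggests.

```go
// Accumulate streamed deltas into the full response text.
var sb strings.Builder
for chunk, err := range model.Stream(ctx, msgs) {
	if err != nil {
		log.Printf("stream error: %v", err)
		break
	}
	sb.WriteString(chunk.Delta)
}
fmt.Println(sb.String())
```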