LLM API — ChatModel, Router, Middleware
import "github.com/lookatitude/beluga-ai/llm"Package llm provides the LLM abstraction layer for the Beluga AI framework.
It defines the ChatModel interface that all LLM providers implement,
a provider registry for dynamic instantiation, composable middleware,
lifecycle hooks, structured output parsing, context window management,
tokenization, rate limiting, and multi-backend routing.
ChatModel Interface
Section titled “ChatModel Interface”type ChatModel interface { Generate(ctx context.Context, msgs []schema.Message, opts ...GenerateOption) (*schema.AIMessage, error) Stream(ctx context.Context, msgs []schema.Message, opts ...GenerateOption) iter.Seq2[schema.StreamChunk, error] BindTools(tools []schema.ToolDefinition) ChatModel ModelID() string}Generatesends messages and returns a complete*schema.AIMessage.Streamsends messages and returns aniter.Seq2[schema.StreamChunk, error]iterator.BindToolsreturns a newChatModelwith the given tool definitions included in every request. The original model is not modified.ModelIDreturns the underlying model identifier (e.g."gpt-4o").
Provider Registry
Section titled “Provider Registry”Providers register themselves via init() so that importing a provider
package is sufficient to make it available through the registry. The registry
accepts a config.ProviderConfig to configure the provider:
package main
import ( "context" "fmt" "log"
"github.com/lookatitude/beluga-ai/config" "github.com/lookatitude/beluga-ai/llm" "github.com/lookatitude/beluga-ai/schema"
_ "github.com/lookatitude/beluga-ai/llm/providers/openai")
func main() { ctx := context.Background()
model, err := llm.New("openai", config.ProviderConfig{ APIKey: "sk-...", Model: "gpt-4o", }) if err != nil { log.Fatal(err) }
msgs := []schema.Message{schema.NewHumanMessage("What is 2+2?")} resp, err := model.Generate(ctx, msgs) if err != nil { log.Fatal(err) } fmt.Println(resp.Text())}Use Register to add a custom provider factory, New to create a ChatModel
by name, and List to discover all registered providers:
// Register a custom provider (call from init())llm.Register("my-provider", func(cfg config.ProviderConfig) (llm.ChatModel, error) { return NewMyProvider(cfg)})
providers := llm.List() // sorted provider namesStreaming
Section titled “Streaming”Streaming uses iter.Seq2 (Go 1.23+):
for chunk, err := range model.Stream(ctx, msgs) { if err != nil { log.Fatal(err) } fmt.Print(chunk.Delta)}Generate Options
Section titled “Generate Options”GenerateOption functional options configure individual Generate and Stream
calls:
| Option | Type | Description |
|---|---|---|
WithTemperature(t float64) | GenerateOption | Sampling temperature (0.0–2.0). |
WithMaxTokens(n int) | GenerateOption | Maximum tokens to generate. |
WithTopP(p float64) | GenerateOption | Nucleus sampling (0.0–1.0). |
WithStopSequences(seqs ...string) | GenerateOption | Stop generation on these strings. |
WithResponseFormat(format ResponseFormat) | GenerateOption | Output format (text, json_object, json_schema). |
WithToolChoice(choice ToolChoice) | GenerateOption | ToolChoiceAuto, ToolChoiceNone, ToolChoiceRequired. |
WithSpecificTool(name string) | GenerateOption | Force the model to call the named tool. |
WithMetadata(kv map[string]any) | GenerateOption | Provider-specific options. |
WithReasoning(cfg ReasoningConfig) | GenerateOption | Full reasoning configuration. |
WithReasoningEffort(effort ReasoningEffort) | GenerateOption | Reasoning effort level. |
WithReasoningBudget(tokens int) | GenerateOption | Reasoning token budget. |
Reasoning Models
Section titled “Reasoning Models”Beluga AI supports reasoning/chain-of-thought models such as OpenAI o-series
and Claude with extended thinking. Use ReasoningConfig and the associated
functional options to control reasoning behaviour:
type ReasoningEffort string
const ( ReasoningEffortLow ReasoningEffort = "low" ReasoningEffortMedium ReasoningEffort = "medium" ReasoningEffortHigh ReasoningEffort = "high")
type ReasoningConfig struct { Effort ReasoningEffort BudgetTokens int}| Option | Type | Description |
|---|---|---|
WithReasoning(cfg ReasoningConfig) | GenerateOption | Set the full reasoning configuration. |
WithReasoningEffort(effort ReasoningEffort) | GenerateOption | Set reasoning effort level (creates config if nil). |
WithReasoningBudget(tokens int) | GenerateOption | Set reasoning token budget (creates config if nil). |
Example:
resp, err := model.Generate(ctx, msgs, llm.WithReasoningEffort(llm.ReasoningEffortHigh), llm.WithReasoningBudget(10000),)Reasoning tokens are tracked in schema.Usage.ReasoningTokens, and reasoning
content appears as schema.ThinkingPart in the response’s content parts.
During streaming, reasoning deltas arrive in schema.StreamChunk.ReasoningDelta.
Use the OnReasoning hook to observe reasoning deltas as they stream:
hooks := llm.Hooks{ OnReasoning: func(ctx context.Context, delta string) { fmt.Print(delta) // stream reasoning to console },}Middleware
Section titled “Middleware”Middleware has the signature func(ChatModel) ChatModel. Built-in middleware:
WithHooks(hooks Hooks) Middleware— invokes lifecycle callbacks around Generate and Stream.WithLogging(logger *slog.Logger) Middleware— logs Generate and Stream calls.WithFallback(fallback ChatModel) Middleware— falls back to an alternative model on retryable errors.WithProviderLimits(limits ProviderLimits) Middleware— enforces RPM, TPM, and concurrency limits.
Apply middleware with ApplyMiddleware. The first middleware in the list
becomes the outermost wrapper and executes first:
package main
import ( "log/slog" "os"
"github.com/lookatitude/beluga-ai/llm")
func applyMiddleware(model llm.ChatModel, backup llm.ChatModel) llm.ChatModel { logger := slog.New(slog.NewTextHandler(os.Stdout, nil)) return llm.ApplyMiddleware(model, llm.WithLogging(logger), llm.WithFallback(backup), llm.WithProviderLimits(llm.ProviderLimits{ RPM: 60, MaxConcurrent: 5, }), )}Hooks provides optional callbacks invoked during LLM operations. All fields
are optional; nil hooks are skipped. Use ComposeHooks to merge multiple
Hooks values:
type Hooks struct { BeforeGenerate func(ctx context.Context, msgs []schema.Message) error AfterGenerate func(ctx context.Context, resp *schema.AIMessage, err error) OnStream func(ctx context.Context, chunk schema.StreamChunk) OnToolCall func(ctx context.Context, call schema.ToolCall) OnReasoning func(ctx context.Context, delta string) OnError func(ctx context.Context, err error) error}BeforeGenerate can abort the call by returning an error. OnError can
suppress an error by returning nil. Example:
package main
import ( "context" "log"
"github.com/lookatitude/beluga-ai/llm" "github.com/lookatitude/beluga-ai/schema")
func loggingHooks() llm.Hooks { return llm.Hooks{ BeforeGenerate: func(ctx context.Context, msgs []schema.Message) error { log.Printf("generating with %d messages", len(msgs)) return nil }, OnError: func(ctx context.Context, err error) error { log.Printf("llm error: %v", err) return err }, }}Structured Output
Section titled “Structured Output”StructuredOutput[T] wraps a ChatModel to produce typed Go values. It
generates a JSON Schema from T, instructs the model to respond in JSON,
parses the response, and retries on parse failures (default: 2 retries):
package main
import ( "context" "fmt" "log"
"github.com/lookatitude/beluga-ai/llm" "github.com/lookatitude/beluga-ai/schema")
type Sentiment struct { Label string `json:"label"` Score float64 `json:"score"`}
func analyzeSentiment(ctx context.Context, model llm.ChatModel, text string) (Sentiment, error) { so := llm.NewStructured[Sentiment](model, llm.WithMaxRetries(3)) msgs := []schema.Message{ schema.NewHumanMessage("Analyze the sentiment of: " + text), } return so.Generate(ctx, msgs)}
func main() { ctx := context.Background() // model := ... (create a ChatModel) result, err := analyzeSentiment(ctx, model, "I love this product!") if err != nil { log.Fatal(err) } fmt.Printf("Label: %s, Score: %.2f\n", result.Label, result.Score)}NewStructured[T] accepts optional StructuredOption values. The only
built-in option is WithMaxRetries(n int).
Context Management
Section titled “Context Management”ContextManager fits a message sequence within a token budget. Two strategies
are built in: "truncate" (drops oldest non-system messages) and "sliding"
(keeps the most recent messages that fit):
package main
import ( "context" "log"
"github.com/lookatitude/beluga-ai/llm" "github.com/lookatitude/beluga-ai/schema")
func fitMessages(ctx context.Context, msgs []schema.Message) ([]schema.Message, error) { cm := llm.NewContextManager( llm.WithContextStrategy("sliding"), llm.WithKeepSystemMessages(true), ) fitted, err := cm.Fit(ctx, msgs, 4096) if err != nil { return nil, err } return fitted, nil}NewContextManager options:
| Option | Default | Description |
|---|---|---|
WithContextStrategy(name string) | "truncate" | Strategy: "truncate" or "sliding". |
WithTokenizer(t Tokenizer) | SimpleTokenizer | Tokenizer for counting tokens. |
WithKeepSystemMessages(keep bool) | true | Never remove system messages. |
ContextManager is an interface with a single method:
type ContextManager interface { Fit(ctx context.Context, msgs []schema.Message, budget int) ([]schema.Message, error)}Tokenizer
Section titled “Tokenizer”Tokenizer provides token counting. SimpleTokenizer is a built-in
approximation (1 token per 4 characters) suitable when a model-specific
tokenizer is unavailable.
Routing
Section titled “Routing”Router implements ChatModel by delegating to one of several backend models
chosen by a pluggable ModelSelector. Built-in strategies:
RoundRobin— selects models in round-robin order.FailoverChain— always returns the first model (useFailoverRouterfor actual failover).
FailoverRouter retries across models on retryable errors:
package main
import ( "log"
"github.com/lookatitude/beluga-ai/llm")
func makeRouter(modelA, modelB llm.ChatModel) llm.ChatModel { return llm.NewRouter( llm.WithModels(modelA, modelB), llm.WithStrategy(&llm.RoundRobin{}), )}
func makeFailover(primary, backup llm.ChatModel) llm.ChatModel { return llm.NewFailoverRouter(primary, backup)}NewRouter returns a *Router. If no strategy is set, RoundRobin is used.
NewFailoverRouter returns a *FailoverRouter that tries models in order,
falling back on retryable errors.
Rate Limiting
Section titled “Rate Limiting”WithProviderLimits returns middleware that enforces per-provider limits:
type ProviderLimits struct { RPM int // requests per minute TPM int // tokens per minute MaxConcurrent int // max concurrent requests CooldownOnRetry time.Duration // wait before retry after hitting limit}