Multi-Model LLM Gateway

When an application depends on a single LLM provider, every rate limit hit, timeout, or outage becomes a user-facing failure. Multi-provider setups solve availability but introduce new problems: each provider has different APIs, error formats, rate limit semantics, and pricing models. Teams end up writing provider-specific client code scattered across the codebase, with no unified view of cost, latency, or error rates across providers.

A gateway layer centralizes these concerns. All application code talks to one interface, and the gateway handles provider selection, rate limit enforcement, failover, and observability. This separation means adding a new provider or changing routing strategy requires zero application code changes.

Beluga AI’s LLM package provides this gateway pattern through its registry and ChatModel interface — all providers implement the same interface, so the gateway composes them with standard routing, resilience, and monitoring.

The gateway sits between application code and LLM providers. A pluggable router selects a provider for each request based on a configurable strategy (round-robin for load distribution, cost-based for budget optimization, health-based for reliability). Per-provider rate limiting prevents quota exhaustion before requests reach the provider, and OpenTelemetry integration using the GenAI semantic conventions (gen_ai.* attributes) provides unified observability across all providers.

graph TB
    A[Client Apps] --> B[LLM Gateway]
    B --> C[Router<br>Round Robin / Cost / Health]
    C --> D[Rate Limiter]
    D --> E[OpenAI]
    D --> F[Anthropic]
    D --> G[Bedrock]
    D --> H[Ollama]
    I[Metrics/Tracing] --> B
    J[Circuit Breaker] --> C

The gateway uses Beluga AI’s registry pattern (llm.New()) to create providers by name, keeping initialization uniform. Each provider is stored by name in a map, enabling dynamic addition and removal at runtime. The blank imports (_ "github.com/lookatitude/beluga-ai/llm/providers/...") register providers via init() so they’re available through the registry without explicit wiring.

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "sync"
    "time"

    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/schema"

    _ "github.com/lookatitude/beluga-ai/llm/providers/openai"
    _ "github.com/lookatitude/beluga-ai/llm/providers/anthropic"
    _ "github.com/lookatitude/beluga-ai/llm/providers/bedrock"
)
type LLMGateway struct {
    providers   map[string]llm.ChatModel
    router      ProviderRouter
    rateLimiter *RateLimiter
    mu          sync.RWMutex
}

func NewLLMGateway(ctx context.Context) (*LLMGateway, error) {
    gateway := &LLMGateway{
        providers:   make(map[string]llm.ChatModel),
        router:      NewRoundRobinRouter(),
        rateLimiter: NewRateLimiter(),
    }

    // Initialize OpenAI
    openaiModel, err := llm.New("openai", llm.ProviderConfig{
        APIKey: os.Getenv("OPENAI_API_KEY"),
        Model:  "gpt-4o",
    })
    if err != nil {
        log.Printf("Failed to create OpenAI provider: %v", err)
    } else {
        gateway.providers["openai"] = openaiModel
    }

    // Initialize Anthropic
    anthropicModel, err := llm.New("anthropic", llm.ProviderConfig{
        APIKey: os.Getenv("ANTHROPIC_API_KEY"),
        Model:  "claude-3-5-sonnet-20241022",
    })
    if err != nil {
        log.Printf("Failed to create Anthropic provider: %v", err)
    } else {
        gateway.providers["anthropic"] = anthropicModel
    }

    // Initialize Bedrock
    bedrockModel, err := llm.New("bedrock", llm.ProviderConfig{
        Region: "us-east-1",
        Model:  "anthropic.claude-v2",
    })
    if err != nil {
        log.Printf("Failed to create Bedrock provider: %v", err)
    } else {
        gateway.providers["bedrock"] = bedrockModel
    }

    if len(gateway.providers) == 0 {
        return nil, fmt.Errorf("no providers available")
    }
    return gateway, nil
}
func (g *LLMGateway) Generate(ctx context.Context, msgs []schema.Message) (*schema.AIMessage, error) {
    // Select provider
    providerName, err := g.router.SelectProvider(ctx, g.getAvailableProviders())
    if err != nil {
        return nil, err
    }

    // Check rate limit
    if !g.rateLimiter.Allow(providerName) {
        return nil, fmt.Errorf("rate limit exceeded for provider %s", providerName)
    }

    // Get provider
    g.mu.RLock()
    provider, ok := g.providers[providerName]
    g.mu.RUnlock()
    if !ok {
        return nil, fmt.Errorf("provider not found: %s", providerName)
    }

    // Generate response
    start := time.Now()
    resp, err := provider.Generate(ctx, msgs)
    duration := time.Since(start)
    log.Printf("Provider %s took %v", providerName, duration)
    if err != nil {
        // Retry with a fallback provider
        log.Printf("Provider %s failed: %v; trying fallback", providerName, err)
        return g.retryWithFallback(ctx, msgs, providerName)
    }
    return resp, nil
}

func (g *LLMGateway) getAvailableProviders() []string {
    g.mu.RLock()
    defer g.mu.RUnlock()
    names := make([]string, 0, len(g.providers))
    for name := range g.providers {
        names = append(names, name)
    }
    return names
}

func (g *LLMGateway) retryWithFallback(ctx context.Context, msgs []schema.Message, failedProvider string) (*schema.AIMessage, error) {
    alternatives := g.getAlternativeProviders(failedProvider)
    for _, providerName := range alternatives {
        g.mu.RLock()
        provider, ok := g.providers[providerName]
        g.mu.RUnlock()
        if !ok || !g.rateLimiter.Allow(providerName) {
            continue
        }
        log.Printf("Falling back to provider %s", providerName)
        resp, err := provider.Generate(ctx, msgs)
        if err == nil {
            return resp, nil
        }
        log.Printf("Fallback provider %s failed: %v", providerName, err)
    }
    return nil, fmt.Errorf("all providers failed")
}

func (g *LLMGateway) getAlternativeProviders(exclude string) []string {
    all := g.getAvailableProviders()
    alternatives := []string{}
    for _, name := range all {
        if name != exclude {
            alternatives = append(alternatives, name)
        }
    }
    return alternatives
}
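
A minimal usage sketch from application code; note that schema.NewHumanMessage and GetContent are assumptions about the schema package, so substitute the actual message constructors and accessors in your version:

func main() {
    ctx := context.Background()
    gateway, err := NewLLMGateway(ctx)
    if err != nil {
        log.Fatalf("gateway init failed: %v", err)
    }

    // Application code never names a provider; routing is the gateway's job.
    msgs := []schema.Message{
        schema.NewHumanMessage("Summarize yesterday's error budget report."),
    }
    resp, err := gateway.Generate(ctx, msgs)
    if err != nil {
        log.Fatalf("generation failed: %v", err)
    }
    fmt.Println(resp.GetContent())
}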

The router is defined as an interface (ProviderRouter) so strategies are swappable without changing the gateway. This follows the same interface-first pattern used throughout Beluga AI — define the contract, then provide implementations. Three common strategies cover most production needs:

type ProviderRouter interface {
    SelectProvider(ctx context.Context, providers []string) (string, error)
}

// Round-robin routing
type RoundRobinRouter struct {
    current int
    mu      sync.Mutex
}

func NewRoundRobinRouter() *RoundRobinRouter {
    return &RoundRobinRouter{}
}

func (r *RoundRobinRouter) SelectProvider(ctx context.Context, providers []string) (string, error) {
    if len(providers) == 0 {
        return "", fmt.Errorf("no providers available")
    }
    r.mu.Lock()
    defer r.mu.Unlock()
    // Wrap the index in case the provider list shrank since the last call.
    provider := providers[r.current%len(providers)]
    r.current = (r.current + 1) % len(providers)
    return provider, nil
}
// Cost-based routing
type CostBasedRouter struct {
    costs map[string]float64
}

func NewCostBasedRouter() *CostBasedRouter {
    return &CostBasedRouter{
        costs: map[string]float64{
            "openai":    0.03,  // $0.03 per 1K tokens
            "anthropic": 0.015, // $0.015 per 1K tokens
            "bedrock":   0.008, // $0.008 per 1K tokens
        },
    }
}

func (r *CostBasedRouter) SelectProvider(ctx context.Context, providers []string) (string, error) {
    if len(providers) == 0 {
        return "", fmt.Errorf("no providers available")
    }
    cheapest := providers[0]
    minCost := r.costs[cheapest]
    for _, p := range providers[1:] {
        if cost, ok := r.costs[p]; ok && cost < minCost {
            cheapest = p
            minCost = cost
        }
    }
    return cheapest, nil
}

// Health-based routing with failover
type HealthBasedRouter struct {
    healthChecker *HealthChecker
}

func NewHealthBasedRouter(checker *HealthChecker) *HealthBasedRouter {
    return &HealthBasedRouter{healthChecker: checker}
}

func (r *HealthBasedRouter) SelectProvider(ctx context.Context, providers []string) (string, error) {
    healthy := []string{}
    for _, p := range providers {
        if r.healthChecker.IsHealthy(p) {
            healthy = append(healthy, p)
        }
    }
    if len(healthy) == 0 {
        return "", fmt.Errorf("no healthy providers")
    }
    return healthy[0], nil
}
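
The HealthChecker used by HealthBasedRouter is not shown above. A minimal sketch, assuming health status is updated externally (for example by periodic probes or by the gateway recording failures):

// Concurrency-safe map of provider health, updated by whatever probe or
// failure-tracking logic the deployment uses.
type HealthChecker struct {
    mu     sync.RWMutex
    status map[string]bool
}

func NewHealthChecker() *HealthChecker {
    return &HealthChecker{status: make(map[string]bool)}
}

// SetHealthy records the latest probe result for a provider.
func (h *HealthChecker) SetHealthy(provider string, healthy bool) {
    h.mu.Lock()
    defer h.mu.Unlock()
    h.status[provider] = healthy
}

// IsHealthy treats unknown providers as healthy so a newly added provider is
// not excluded before its first probe.
func (h *HealthChecker) IsHealthy(provider string) bool {
    h.mu.RLock()
    defer h.mu.RUnlock()
    healthy, ok := h.status[provider]
    return !ok || healthy
}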

LLM providers enforce rate limits (requests per minute, tokens per minute) and return HTTP 429 errors when they are exceeded. Hitting those limits wastes a round trip and delays the response. Client-side rate limiting with a token bucket rejects or defers requests locally once the budget is spent, so they never incur the latency penalty of a server-side rejection.

import "golang.org/x/time/rate"
type RateLimiter struct {
limiters map[string]*rate.Limiter
mu sync.RWMutex
}
func NewRateLimiter() *RateLimiter {
return &RateLimiter{
limiters: map[string]*rate.Limiter{
"openai": rate.NewLimiter(rate.Every(time.Minute), 60), // 60 req/min
"anthropic": rate.NewLimiter(rate.Every(time.Minute), 50), // 50 req/min
"bedrock": rate.NewLimiter(rate.Every(time.Minute), 100), // 100 req/min
},
}
}
func (r *RateLimiter) Allow(provider string) bool {
r.mu.RLock()
limiter, ok := r.limiters[provider]
r.mu.RUnlock()
if !ok {
return true // No limit configured
}
return limiter.Allow()
}
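
Token-per-minute budgets (for example the 90,000 tokens/min configured for OpenAI in the YAML further below) can be enforced the same way with AllowN, provided a rough token estimate for the prompt is available before the call. A hypothetical sketch:

// Hypothetical token-budget limiter; pairs with the request limiter above.
type TokenRateLimiter struct {
    limiters map[string]*rate.Limiter
    mu       sync.RWMutex
}

func NewTokenRateLimiter() *TokenRateLimiter {
    return &TokenRateLimiter{
        limiters: map[string]*rate.Limiter{
            // 90,000 tokens/min refilled continuously, bursting up to a full minute's budget.
            "openai": rate.NewLimiter(rate.Limit(90000.0/60.0), 90000),
        },
    }
}

// AllowTokens reports whether an estimated prompt size of n tokens fits the budget.
func (t *TokenRateLimiter) AllowTokens(provider string, n int) bool {
    t.mu.RLock()
    limiter, ok := t.limiters[provider]
    t.mu.RUnlock()
    if !ok {
        return true
    }
    return limiter.AllowN(time.Now(), n)
}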

Gateway-level observability answers questions that per-provider metrics cannot: which provider is cheapest for a given workload, how does failover affect latency, and which routing strategy produces the best cost-quality tradeoff. The gateway records request counts, latency histograms, error rates, and token usage per provider using OTel GenAI semantic conventions, making all metrics comparable across providers in a single dashboard.

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
    "go.opentelemetry.io/otel/trace"
)

type ObservableGateway struct {
    *LLMGateway
    meter  metric.Meter
    tracer trace.Tracer
}
func NewObservableGateway(ctx context.Context) (*ObservableGateway, error) {
    gateway, err := NewLLMGateway(ctx)
    if err != nil {
        return nil, err
    }
    return &ObservableGateway{
        LLMGateway: gateway,
        meter:      otel.Meter("llm-gateway"),
        tracer:     otel.Tracer("llm-gateway"),
    }, nil
}
func (g *ObservableGateway) Generate(ctx context.Context, msgs []schema.Message) (*schema.AIMessage, error) {
    ctx, span := g.tracer.Start(ctx, "gateway.generate")
    defer span.End()

    // Note: the embedded Generate performs its own provider selection, so with a
    // stateful router the provider recorded here may differ from the one used.
    providerName, err := g.router.SelectProvider(ctx, g.getAvailableProviders())
    if err != nil {
        span.RecordError(err)
        return nil, err
    }
    span.SetAttributes(
        attribute.String("gen_ai.system", providerName),
        attribute.Int("gen_ai.message_count", len(msgs)),
    )

    start := time.Now()
    resp, err := g.LLMGateway.Generate(ctx, msgs)
    duration := time.Since(start)

    // Record metrics
    counter, _ := g.meter.Int64Counter("llm.requests.total")
    counter.Add(ctx, 1, metric.WithAttributes(
        attribute.String("provider", providerName),
        attribute.String("status", status(err)),
    ))
    histogram, _ := g.meter.Float64Histogram("llm.request.duration")
    histogram.Record(ctx, duration.Seconds(), metric.WithAttributes(
        attribute.String("provider", providerName),
    ))

    if err != nil {
        span.RecordError(err)
        errorCounter, _ := g.meter.Int64Counter("llm.errors.total")
        errorCounter.Add(ctx, 1, metric.WithAttributes(
            attribute.String("provider", providerName),
        ))
        return nil, err
    }

    // Record token usage (guard against providers that omit usage data)
    if resp.Usage != nil {
        tokenCounter, _ := g.meter.Int64Counter("llm.tokens.total")
        tokenCounter.Add(ctx, int64(resp.Usage.TotalTokens), metric.WithAttributes(
            attribute.String("provider", providerName),
        ))
        span.SetAttributes(
            attribute.Int("gen_ai.usage.prompt_tokens", resp.Usage.PromptTokens),
            attribute.Int("gen_ai.usage.completion_tokens", resp.Usage.CompletionTokens),
        )
    }
    return resp, nil
}

func status(err error) string {
    if err == nil {
        return "success"
    }
    return "error"
}
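
The meter and tracer returned by otel.Meter and otel.Tracer are no-ops until global providers are registered. A sketch of that wiring, assuming the standard OpenTelemetry Go SDK with OTLP/gRPC exporters pointed at the collector endpoint from the configuration below:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func setupTelemetry(ctx context.Context) error {
    // Traces: batch spans and ship them over OTLP/gRPC.
    traceExp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return err
    }
    otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithBatcher(traceExp)))

    // Metrics: export on a periodic reader to the same collector.
    metricExp, err := otlpmetricgrpc.New(ctx,
        otlpmetricgrpc.WithEndpoint("localhost:4317"),
        otlpmetricgrpc.WithInsecure(),
    )
    if err != nil {
        return err
    }
    otel.SetMeterProvider(sdkmetric.NewMeterProvider(
        sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExp)),
    ))
    return nil
}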

To stop sending traffic to a provider that keeps failing, wrap generation in the resilience package's circuit breaker:

import "github.com/lookatitude/beluga-ai/resilience"
func (g *LLMGateway) GenerateWithCircuitBreaker(ctx context.Context, msgs []schema.Message) (*schema.AIMessage, error) {
cb := resilience.NewCircuitBreaker(resilience.CircuitBreakerConfig{
FailureThreshold: 5,
SuccessThreshold: 2,
Timeout: 60 * time.Second,
})
return resilience.ExecuteWithCircuitBreaker(ctx, cb, func(ctx context.Context) (*schema.AIMessage, error) {
return g.Generate(ctx, msgs)
})
}
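
The routing strategy, per-provider rate limits, and observability settings can also be driven from a declarative configuration, for example: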
gateway:
  routing_strategy: "health_based" # round_robin, cost_based, health_based
  enable_failover: true
  max_retries: 3
  timeout: 60s

providers:
  openai:
    enabled: true
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4o"
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 90000
  anthropic:
    enabled: true
    api_key: "${ANTHROPIC_API_KEY}"
    model: "claude-3-5-sonnet-20241022"
    rate_limit:
      requests_per_minute: 50
  bedrock:
    enabled: true
    region: "us-east-1"
    model: "anthropic.claude-v2"
    rate_limit:
      requests_per_minute: 100

observability:
  otel:
    endpoint: "localhost:4317"
  metrics:
    enabled: true
  tracing:
    enabled: true
    sample_rate: 1.0
  • Horizontal scaling: Deploy multiple gateway instances behind a load balancer. Each instance is stateless.
  • Provider management: Use health checks to automatically remove failing providers from rotation.
  • Caching: Implement prompt caching for frequently repeated requests to reduce API calls (see the sketch after this list).
  • Cost tracking: Monitor token usage per provider to optimize routing decisions.
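
A minimal in-process sketch of the caching item above, keyed on a hash of the message contents. The GetContent accessor on schema.Message is an assumption; a shared store such as Redis would be the usual production choice:

import (
    "crypto/sha256"
    "encoding/hex"
)

// PromptCache returns a previously generated response for an identical
// message sequence, skipping the provider call entirely.
type PromptCache struct {
    mu      sync.RWMutex
    entries map[string]*schema.AIMessage
}

func NewPromptCache() *PromptCache {
    return &PromptCache{entries: make(map[string]*schema.AIMessage)}
}

// cacheKey hashes the concatenated message contents.
func cacheKey(msgs []schema.Message) string {
    h := sha256.New()
    for _, m := range msgs {
        h.Write([]byte(m.GetContent()))
    }
    return hex.EncodeToString(h.Sum(nil))
}

func (c *PromptCache) Get(msgs []schema.Message) (*schema.AIMessage, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    resp, ok := c.entries[cacheKey(msgs)]
    return resp, ok
}

func (c *PromptCache) Put(msgs []schema.Message, resp *schema.AIMessage) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[cacheKey(msgs)] = resp
}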