Resilience

Resilience in Beluga is implemented as middleware on the same func(T) T signature that wraps every extensible interface. Circuit breakers, retry, rate limiting, and timeouts compose by function application — there is no special infrastructure to deploy.

The pattern

Every Layer 3 capability package (llm, tool, memory, rag/*, voice/*, guard, workflow, server, cache, auth, state) ships an ApplyMiddleware() helper that wraps a base instance:

import (
    "time" // needed for the time.Second durations below

    "github.com/lookatitude/beluga-ai/llm"
    _ "github.com/lookatitude/beluga-ai/llm/providers/anthropic" // side-effect import for the "anthropic" provider
)

base, err := llm.New("anthropic", llm.Config{Model: "claude-sonnet-4-6"})
if err != nil {
    panic(err)
}

resilient := llm.ApplyMiddleware(base,
    llm.WithRateLimit(60, 150_000), // 60 req/min, 150k tok/min
    llm.WithRetry(3),               // honors core.IsRetryable()
    llm.WithCircuitBreaker(llm.CircuitBreakerConfig{
        FailureThreshold: 5,
        ResetTimeout:     30 * time.Second,
    }),
    llm.WithTimeout(20*time.Second),
)

How retry decides what to retry

graph LR
  C[Call] --> A1[Attempt 1]
  A1 -->|success| Done
  A1 -->|non-retryable error| Fail
  A1 -->|retryable error| Wait1[Backoff 100ms ± jitter]
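  Wait1 --> A2[Attempt 2]
  A2 -->|success| Done
  A2 -->|non-retryable error| Fail
  A2 -->|retryable error| Wait2[Backoff 200ms ± jitter]
  Wait2 --> A3[Attempt 3]
  A3 -->|success| Done
  A3 -->|non-retryable error| Fail
  A3 -->|retryable error| Exhausted[Max attempts exceeded]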

Beluga errors carry a typed ErrorCode. The retry middleware calls core.IsRetryable(err) before re-running a call. Retryable codes include:

core.ErrCodeRateLimit — provider rate-limited the request
core.ErrCodeUnavailable — transient upstream failure
core.ErrCodeTimeout — the call exceeded its deadline
core.ErrCodeNetwork — network-layer failure

Non-retryable errors (ErrCodeInvalid, ErrCodePermissionDenied, ErrCodeNotFound) propagate immediately. See Errors for the full model.

Rate limiting

graph TD
  Req[Request] --> RPM{RPM bucket full?}
  RPM -->|no| TPM{TPM bucket has capacity?}
  RPM -->|yes| Wait[Wait or reject]
  TPM -->|yes| Conc{Max concurrent?}
  TPM -->|no| Wait
  Conc -->|ok| Go[Proceed]
  Conc -->|full| Wait

Three buckets per provider: RPM (requests per minute), TPM (tokens per minute), MaxConcurrent (simultaneous in-flight calls). Different providers gate on different things — the limiter takes the tightest constraint. Beluga’s rate limiter is token-aware where the upstream API exposes token budgets.

Circuit breakers

stateDiagram-v2
  [*] --> Closed
  Closed --> Open: N failures in window
  Open --> HalfOpen: after cooldown
  HalfOpen --> Closed: success
  HalfOpen --> Open: failure

Three states: Closed (normal, calls pass through), Open (fail fast without calling the underlying service), Half-Open (cooldown elapsed; one probe call decides whether to close or re-open). The breaker tracks consecutive failures across a window and prevents a degraded provider from cascading slow failures upstream.

Circuit breakers operate per-instance, not per-process. Wrapping the same base provider in two different middleware chains gives two independent breakers, which is useful when one chain has a lower latency budget than another.

Hedged requests

If the primary is still running once hedge_delay has elapsed, fire a parallel fallback request. Whichever finishes first wins; the other is cancelled. This cuts P99 latency, at up to 2× the base cost for calls that exceed the delay.

sequenceDiagram
  participant C as Caller
  participant P as Primary
  participant F as Fallback
  C->>P: request
  Note over C: wait hedge_delay (e.g. 500ms)
  C->>F: fallback request
  alt Primary responds first
    P-->>C: result
    C->>F: cancel
  else Fallback responds first
    F-->>C: result
    C->>P: cancel
  end

Best for search and retrieval where median latency is low but the tail is long.

Composing with other middleware

Resilience composes with observability, guardrails, and cost tracking. Read outside-in:

model := llm.ApplyMiddleware(base,
    llm.WithGuardrails(guardPipeline),
    llm.WithTracing(),               // gen_ai.* OTel spans
    llm.WithCostTracking(costCenter),
    llm.WithRateLimit(60, 150_000),
    llm.WithRetry(3),
)

The first middleware in the slice wraps the outside. Calls flow inward through guardrails → tracing → cost → rate limit → retry → base.

Related

Observability — wire OTel exporters
Errors — the typed error model
Extensibility — the four-ring composition model
