Skip to content
Docs

LLM Error Recovery Service

LLM providers are external services subject to rate limits, transient network failures, and periodic outages. For an enterprise AI platform handling thousands of requests per hour, even a 3-5% failure rate translates to hundreds of user-facing errors daily, each requiring manual investigation and retry. The cost compounds: support tickets pile up, SLA commitments are missed, and engineering time is diverted from feature work to firefighting.

The fundamental challenge is that LLM failures are not uniform. Rate limit errors (HTTP 429) signal temporary throttling and resolve with backoff. Timeout errors suggest provider load and benefit from retry. Authentication errors are permanent and should fail immediately. A naive retry-everything approach wastes resources and delays permanent failures, while no retries at all exposes users to every transient glitch.

An error recovery service with intelligent retries, circuit breakers, and exponential backoff achieves 99.9% success rate even during provider issues by classifying errors and applying the right recovery strategy for each.

Beluga AI’s core/ package provides typed errors with IsRetryable() checks that distinguish transient from permanent failures. The error recovery service wraps LLM calls with retry logic, circuit breakers, and fallback mechanisms. Error analysis determines retryability, exponential backoff prevents thundering herd effects on recovering providers, and circuit breakers short-circuit requests when a provider is consistently failing — redirecting traffic to fallback providers instead of accumulating timeouts.

┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ LLM Request │───▶│ Circuit │───▶│ Primary │
│ │ │ Breaker │ │ Provider │
└──────────────┘ └──────┬───────┘ └──────┬───────┘
│ │
│ Open │ Failure
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Fallback │◀───│ Retry │
│ Provider │ │ Manager │
└──────────────┘ └──────────────┘

The recovery service wraps LLM providers with error handling and automatic retries. It implements llm.ChatModel, so it can be used as a drop-in replacement anywhere a model is expected. This composability is intentional — resilience becomes a transparent layer rather than requiring callers to change their code.

package main
import (
"context"
"errors"
"fmt"
"strings"
"time"
"github.com/lookatitude/beluga-ai/core"
"github.com/lookatitude/beluga-ai/llm"
"github.com/lookatitude/beluga-ai/schema"
)
// ErrorRecoveryService wraps LLM operations with error recovery.
type ErrorRecoveryService struct {
primary llm.ChatModel
fallback llm.ChatModel
circuitBreaker *CircuitBreaker
retryManager *RetryManager
}
// NewErrorRecoveryService creates a new error recovery service.
func NewErrorRecoveryService(
primary llm.ChatModel,
fallback llm.ChatModel,
) *ErrorRecoveryService {
return &ErrorRecoveryService{
primary: primary,
fallback: fallback,
circuitBreaker: NewCircuitBreaker(5, 30*time.Second),
retryManager: NewRetryManager(3, 1*time.Second),
}
}
// Generate implements llm.ChatModel with error recovery.
func (e *ErrorRecoveryService) Generate(ctx context.Context, msgs []schema.Message, opts ...llm.GenerateOption) (*schema.AIMessage, error) {
// Check circuit breaker
if !e.circuitBreaker.Allow() {
return e.fallback.Generate(ctx, msgs, opts...)
}
// Execute with retry
return e.executeWithRetry(ctx, e.primary, msgs, opts...)
}

The retry manager implements exponential backoff with jitter. Fixed-interval retries cause synchronized retry storms — when many clients fail at the same time, they all retry at the same time, amplifying the load spike that caused the original failure. Exponential backoff spreads retries over increasing intervals, and jitter randomizes the exact timing so clients naturally desynchronize.

type RetryManager struct {
maxRetries int
initialBackoff time.Duration
}
func NewRetryManager(maxRetries int, initialBackoff time.Duration) *RetryManager {
return &RetryManager{
maxRetries: maxRetries,
initialBackoff: initialBackoff,
}
}
func (r *RetryManager) Backoff(attempt int) time.Duration {
// Exponential backoff: 1s, 2s, 4s, 8s...
backoff := r.initialBackoff * time.Duration(1<<uint(attempt))
// Add jitter (0-50% of backoff)
jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
return backoff + jitter
}
func (e *ErrorRecoveryService) executeWithRetry(
ctx context.Context,
provider llm.ChatModel,
msgs []schema.Message,
opts ...llm.GenerateOption,
) (*schema.AIMessage, error) {
var lastErr error
for attempt := 0; attempt <= e.retryManager.maxRetries; attempt++ {
if attempt > 0 {
backoff := e.retryManager.Backoff(attempt)
select {
case <-time.After(backoff):
case <-ctx.Done():
return nil, ctx.Err()
}
}
result, err := provider.Generate(ctx, msgs, opts...)
if err == nil {
e.circuitBreaker.RecordSuccess()
return result, nil
}
lastErr = err
// Check if error is retryable
if !e.isRetryable(err) {
e.circuitBreaker.RecordFailure()
return nil, err
}
// Rate limit errors get longer backoff
if e.isRateLimitError(err) {
backoff := time.Duration(attempt+1) * 5 * time.Second
select {
case <-time.After(backoff):
case <-ctx.Done():
return nil, ctx.Err()
}
}
e.circuitBreaker.RecordFailure()
}
// All retries exhausted, try fallback
if e.fallback != nil {
return e.fallback.Generate(ctx, msgs, opts...)
}
return nil, fmt.Errorf("all retries exhausted: %w", lastErr)
}

Retrying every error wastes time and resources. Authentication failures, invalid request formats, and content policy violations will never succeed on retry. The error analyzer classifies errors into retryable (rate limits, timeouts, network issues, server errors) and permanent categories, so the service fails fast on unrecoverable errors while persisting through transient ones.

func (e *ErrorRecoveryService) isRetryable(err error) bool {
// Rate limit errors are retryable
if e.isRateLimitError(err) {
return true
}
// Timeout errors are retryable
if errors.Is(err, context.DeadlineExceeded) {
return true
}
// Network errors are retryable
if strings.Contains(err.Error(), "network") ||
strings.Contains(err.Error(), "connection") {
return true
}
// Provider errors (5xx) are retryable
if strings.Contains(err.Error(), "500") ||
strings.Contains(err.Error(), "503") {
return true
}
return false
}
func (e *ErrorRecoveryService) isRateLimitError(err error) bool {
errStr := err.Error()
return strings.Contains(errStr, "rate limit") ||
strings.Contains(errStr, "429") ||
strings.Contains(errStr, "quota")
}

When a provider is down, retrying every request accumulates timeouts and delays all callers. The circuit breaker pattern prevents this cascading failure: after a threshold of consecutive failures, the breaker “opens” and immediately routes requests to the fallback provider without waiting for the primary to time out. After a cooldown period, the breaker enters “half-open” state and probes the primary with a single request — if it succeeds, the breaker closes and normal traffic resumes.

type CircuitBreaker struct {
maxFailures int
timeout time.Duration
failures int
lastFailTime time.Time
state string // "closed", "open", "half-open"
mu sync.RWMutex
}
func NewCircuitBreaker(maxFailures int, timeout time.Duration) *CircuitBreaker {
return &CircuitBreaker{
maxFailures: maxFailures,
timeout: timeout,
state: "closed",
}
}
func (cb *CircuitBreaker) Allow() bool {
cb.mu.Lock()
defer cb.mu.Unlock()
if cb.state == "closed" {
return true
}
// Check if timeout has passed
if time.Since(cb.lastFailTime) > cb.timeout {
cb.state = "half-open"
return true
}
return false
}
func (cb *CircuitBreaker) RecordSuccess() {
cb.mu.Lock()
defer cb.mu.Unlock()
cb.failures = 0
cb.state = "closed"
}
func (cb *CircuitBreaker) RecordFailure() {
cb.mu.Lock()
defer cb.mu.Unlock()
cb.failures++
cb.lastFailTime = time.Now()
if cb.failures >= cb.maxFailures {
cb.state = "open"
}
}
func (cb *CircuitBreaker) State() string {
cb.mu.RLock()
defer cb.mu.RUnlock()
return cb.state
}

Streaming introduces a complication: partial responses may have already been yielded to the caller before an error occurs mid-stream. The recovery service handles this by retrying from the beginning on retryable errors — the caller receives a fresh stream from the retry attempt. For non-retryable errors, the error is yielded immediately. This follows Beluga AI’s iter.Seq2[T, error] streaming pattern where errors are part of the iteration sequence.

func (e *ErrorRecoveryService) Stream(
ctx context.Context,
msgs []schema.Message,
opts ...llm.GenerateOption,
) iter.Seq2[schema.StreamChunk, error] {
if !e.circuitBreaker.Allow() {
return e.fallback.Stream(ctx, msgs, opts...)
}
return func(yield func(schema.StreamChunk, error) bool) {
for attempt := 0; attempt <= e.retryManager.maxRetries; attempt++ {
if attempt > 0 {
backoff := e.retryManager.Backoff(attempt)
select {
case <-time.After(backoff):
case <-ctx.Done():
yield(schema.StreamChunk{}, ctx.Err())
return
}
}
stream := e.primary.Stream(ctx, msgs, opts...)
succeeded := true
for chunk, err := range stream {
if err != nil {
if !e.isRetryable(err) {
yield(chunk, err)
return
}
succeeded = false
break
}
if !yield(chunk, nil) {
return
}
}
if succeeded {
e.circuitBreaker.RecordSuccess()
return
}
e.circuitBreaker.RecordFailure()
}
// Fallback for streaming
if e.fallback != nil {
for chunk, err := range e.fallback.Stream(ctx, msgs, opts...) {
if !yield(chunk, err) {
return
}
}
}
}
}

Track recovery metrics with OpenTelemetry:

import (
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"
)
func (e *ErrorRecoveryService) recordMetrics(ctx context.Context, attempt int, err error) {
meter := otel.Meter("error-recovery")
if err != nil {
counter, _ := meter.Int64Counter("llm_errors_total")
counter.Add(ctx, 1,
metric.WithAttributes(
attribute.String("error_type", classifyError(err)),
attribute.Int("attempt", attempt),
),
)
} else {
counter, _ := meter.Int64Counter("llm_success_total")
counter.Add(ctx, 1,
metric.WithAttributes(
attribute.Int("attempts", attempt+1),
),
)
}
}
func classifyError(err error) string {
switch {
case strings.Contains(err.Error(), "rate limit"):
return "rate_limit"
case strings.Contains(err.Error(), "timeout"):
return "timeout"
case strings.Contains(err.Error(), "500"), strings.Contains(err.Error(), "503"):
return "provider_error"
default:
return "unknown"
}
}

Circuit breaker thresholds should be tuned based on observed failure patterns. Start conservative (5 failures, 30 second timeout) and adjust based on metrics. Monitor circuit breaker state changes to identify when providers are degraded.

Choose fallback providers with different failure modes. If the primary provider experiences rate limiting, the fallback should be a different provider or a local model that won’t have the same constraints.

MetricBeforeAfterImprovement
Success Rate95-97%99.92%3-5% improvement
Rate Limit Errors2-3%0.08%96-97% reduction
Timeout Errors1-2%0.05%97-98% reduction
Manual Interventions/Month120100% reduction
Average Recovery Time (seconds)3003.299% reduction