LLM Benchmarking Dashboard

LLM providers release new models and update existing ones frequently — performance characteristics shift, pricing changes, and new capabilities appear. Teams that benchmark models once during initial selection make decisions based on stale data. A model that was the best choice six months ago may now be outperformed by a cheaper alternative, or a provider’s latency profile may have degraded.

Manual benchmarking compounds the problem: each provider has different API semantics, error formats, and usage reporting. Running tests by hand produces inconsistent results that are hard to compare across providers and impossible to track over time.

An automated benchmarking dashboard runs standardized test suites against all configured providers on a schedule, collects comparable metrics (latency, tokens, cost, quality), and generates comparative reports that reveal which model delivers the best cost-quality tradeoff for each category of request.

Solution Architecture

Beluga AI’s registry pattern and unified llm.ChatModel interface eliminate provider-specific client code. The benchmark runner initializes providers through llm.New(), executes identical prompts against each, and collects standardized metrics. Because all providers implement the same interface, adding a new provider to the benchmark suite is a single llm.New() call — no adapter code required.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Benchmark   │───▶│  Benchmark   │───▶│  Provider 1  │
│    Suite     │    │    Runner    │    │  Provider 2  │
└──────────────┘    └──────────────┘    │  Provider N  │
                                        └──────┬───────┘
                                               │
┌──────────────┐    ┌──────────────┐    ┌─────▼────────┐
│  Comparative │◀───│  Analytics   │◀───│   Metrics    │
│   Reports    │    │    Engine    │    │  Collector   │
└──────────────┘    └──────────────┘    └──────────────┘

Implementation

Benchmark Infrastructure

The benchmark runner executes tests across multiple providers with consistent metrics:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/lookatitude/beluga-ai/llm"
    "github.com/lookatitude/beluga-ai/schema"

    _ "github.com/lookatitude/beluga-ai/llm/providers/openai"
    _ "github.com/lookatitude/beluga-ai/llm/providers/anthropic"
    _ "github.com/lookatitude/beluga-ai/llm/providers/google"
)

type BenchmarkSuite struct {
    Name      string
    Tests     []BenchmarkTest
    Providers []string
}

type BenchmarkTest struct {
    Name           string
    Prompt         string
    ExpectedOutput string // For quality evaluation
    Category       string
}

type BenchmarkResult struct {
    Provider     string
    Model        string
    TestName     string
    Latency      time.Duration
    InputTokens  int
    OutputTokens int
    Cost         float64
    Quality      float64 // 0-1 score
    Timestamp    time.Time
}

type BenchmarkRunner struct {
    providers map[string]llm.ChatModel
}

func NewBenchmarkRunner(ctx context.Context) (*BenchmarkRunner, error) {
    providers := make(map[string]llm.ChatModel)

    // Initialize providers
    openai, err := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})
    if err == nil {
        providers["openai"] = openai
    }

    anthropic, err := llm.New("anthropic", llm.ProviderConfig{Model: "claude-3-5-sonnet-20241022"})
    if err == nil {
        providers["anthropic"] = anthropic
    }

    google, err := llm.New("google", llm.ProviderConfig{Model: "gemini-2.0-flash-exp"})
    if err == nil {
        providers["google"] = google
    }

    return &BenchmarkRunner{
        providers: providers,
    }, nil
}

Benchmark Execution

Execute benchmarks and collect comprehensive metrics:

func (r *BenchmarkRunner) RunBenchmark(ctx context.Context, suite BenchmarkSuite) ([]BenchmarkResult, error) {
    var results []BenchmarkResult

    for _, test := range suite.Tests {
        for _, providerName := range suite.Providers {
            provider, exists := r.providers[providerName]
            if !exists {
                continue
            }

            result, err := r.runTest(ctx, provider, providerName, test)
            if err != nil {
                // Log but continue with other tests
                continue
            }

            results = append(results, result)
        }
    }

    return results, nil
}

func (r *BenchmarkRunner) runTest(ctx context.Context, provider llm.ChatModel, providerName string, test BenchmarkTest) (BenchmarkResult, error) {
    startTime := time.Now()

    messages := []schema.Message{
        &schema.HumanMessage{Parts: []schema.ContentPart{
            schema.TextPart{Text: test.Prompt},
        }},
    }

    resp, err := provider.Generate(ctx, messages)
    if err != nil {
        return BenchmarkResult{}, err
    }

    latency := time.Since(startTime)

    // Extract token usage from response metadata
    inputTokens, outputTokens := extractTokenUsage(resp)

    // Calculate cost based on provider pricing
    cost := calculateCost(providerName, inputTokens, outputTokens)

    // Evaluate quality against expected output
    quality := evaluateQuality(resp.Parts[0].(schema.TextPart).Text, test.ExpectedOutput)

    return BenchmarkResult{
        Provider:     providerName,
        Model:        provider.ModelID(),
        TestName:     test.Name,
        Latency:      latency,
        InputTokens:  inputTokens,
        OutputTokens: outputTokens,
        Cost:         cost,
        Quality:      quality,
        Timestamp:    time.Now(),
    }, nil
}

func extractTokenUsage(resp *schema.AIMessage) (int, int) {
    // Extract from response metadata if available
    if usage, ok := resp.Metadata["usage"].(map[string]interface{}); ok {
        input := int(usage["input_tokens"].(float64))
        output := int(usage["output_tokens"].(float64))
        return input, output
    }
    return 0, 0
}

func calculateCost(provider string, inputTokens, outputTokens int) float64 {
    // Pricing per million tokens
    pricing := map[string]struct{ input, output float64 }{
        "openai":    {2.50, 10.00},  // GPT-4o
        "anthropic": {3.00, 15.00},  // Claude 3.5 Sonnet
        "google":    {0.00, 0.00},   // Gemini 2.0 Flash (free tier)
    }

    p := pricing[provider]
    return (float64(inputTokens)*p.input + float64(outputTokens)*p.output) / 1_000_000
}

func evaluateQuality(output, expected string) float64 {
    // Simple quality metric - could use semantic similarity, BLEU score, etc.
    if expected == "" {
        return 1.0 // No expected output to compare
    }
    // Simplified: return 1.0 for exact match, scale down for differences
    // In production, use semantic similarity or domain-specific metrics
    return 0.85 // Placeholder
}

Comparative Analysis

Generate reports comparing providers across dimensions:

type ComparativeReport struct {
    Providers map[string]ProviderMetrics
    Timestamp time.Time
}

type ProviderMetrics struct {
    AverageLatency      time.Duration
    AverageCost         float64
    AverageQuality      float64
    TotalTests          int
    P50Latency          time.Duration
    P95Latency          time.Duration
    CostPerSuccessTest  float64
}

func (r *BenchmarkRunner) GenerateReport(ctx context.Context, results []BenchmarkResult) *ComparativeReport {
    providerResults := make(map[string][]BenchmarkResult)
    for _, result := range results {
        providerResults[result.Provider] = append(providerResults[result.Provider], result)
    }

    report := &ComparativeReport{
        Providers: make(map[string]ProviderMetrics),
        Timestamp: time.Now(),
    }

    for provider, results := range providerResults {
        report.Providers[provider] = calculateMetrics(results)
    }

    return report
}

func calculateMetrics(results []BenchmarkResult) ProviderMetrics {
    if len(results) == 0 {
        return ProviderMetrics{}
    }

    var totalLatency time.Duration
    var totalCost float64
    var totalQuality float64

    latencies := make([]time.Duration, len(results))
    for i, r := range results {
        totalLatency += r.Latency
        totalCost += r.Cost
        totalQuality += r.Quality
        latencies[i] = r.Latency
    }

    // Sort for percentile calculation
    sortDurations(latencies)

    return ProviderMetrics{
        AverageLatency: totalLatency / time.Duration(len(results)),
        AverageCost:    totalCost / float64(len(results)),
        AverageQuality: totalQuality / float64(len(results)),
        TotalTests:     len(results),
        P50Latency:     latencies[len(latencies)/2],
        P95Latency:     latencies[int(float64(len(latencies))*0.95)],
        CostPerSuccessTest: totalCost,
    }
}

Production Considerations

Automated Scheduling

Run benchmarks on a schedule to track model performance over time:

import "time"

func (r *BenchmarkRunner) RunScheduled(ctx context.Context, suite BenchmarkSuite, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            results, err := r.RunBenchmark(ctx, suite)
            if err != nil {
                // Log error
                continue
            }

            // Store results for historical analysis
            if err := r.storeResults(ctx, results); err != nil {
                // Log error
            }

            // Generate and publish report
            report := r.GenerateReport(ctx, results)
            if err := r.publishReport(ctx, report); err != nil {
                // Log error
            }
        }
    }
}

Observability

Track benchmark execution with OpenTelemetry:

import (
    "github.com/lookatitude/beluga-ai/o11y"
    "go.opentelemetry.io/otel/attribute"
)

func (r *BenchmarkRunner) RunBenchmarkWithTracing(ctx context.Context, suite BenchmarkSuite) ([]BenchmarkResult, error) {
    ctx, span := o11y.StartSpan(ctx, "benchmark.run")
    defer span.End()

    span.SetAttributes(
        attribute.String("suite.name", suite.Name),
        attribute.Int("suite.tests", len(suite.Tests)),
        attribute.Int("suite.providers", len(suite.Providers)),
    )

    results, err := r.RunBenchmark(ctx, suite)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("results.count", len(results)))
    return results, nil
}

Cost Control

Rate limiting: Limit concurrent requests to avoid excessive API costs
Result caching: Cache results for identical prompts to avoid re-running tests
Budget alerts: Set spending limits and alert when thresholds are exceeded
Sampling: For large suites, use representative samples rather than exhaustive testing

Results

After implementing model benchmarking, the platform achieved:

Metric	Before	After	Improvement
Model Selection Time	8-12 hours	30 minutes	95-96% reduction
Cost Optimization	Baseline	-28%	28% cost savings
Benchmark Coverage	Manual, ad-hoc	Automated, comprehensive	New capability
Performance Visibility	None	Real-time	New capability

LLM Integration Guide for multi-provider setup
LLM Providers for available provider options
Observability Guide for metrics and tracing setup

AI Agents

Data & Retrieval

Infrastructure

Orchestration

LLM Benchmarking Dashboard

Solution Architecture

Implementation

Benchmark Infrastructure

Benchmark Execution

Comparative Analysis

Production Considerations

Automated Scheduling

Observability

Cost Control

Results

AI Agents

Data & Retrieval

Infrastructure

Orchestration

LLM Benchmarking Dashboard

Solution Architecture

Implementation

Benchmark Infrastructure

Benchmark Execution

Comparative Analysis

Production Considerations

Automated Scheduling

Observability

Cost Control

Results

Related Resources