LLM Benchmarking Dashboard
LLM providers release new models and update existing ones frequently — performance characteristics shift, pricing changes, and new capabilities appear. Teams that benchmark models once during initial selection make decisions based on stale data. A model that was the best choice six months ago may now be outperformed by a cheaper alternative, or a provider’s latency profile may have degraded.
Manual benchmarking compounds the problem: each provider has different API semantics, error formats, and usage reporting. Running tests by hand produces inconsistent results that are hard to compare across providers and impossible to track over time.
An automated benchmarking dashboard runs standardized test suites against all configured providers on a schedule, collects comparable metrics (latency, tokens, cost, quality), and generates comparative reports that reveal which model delivers the best cost-quality tradeoff for each category of request.
Solution Architecture
Section titled “Solution Architecture”Beluga AI’s registry pattern and unified llm.ChatModel interface eliminate provider-specific client code. The benchmark runner initializes providers through llm.New(), executes identical prompts against each, and collects standardized metrics. Because all providers implement the same interface, adding a new provider to the benchmark suite is a single llm.New() call — no adapter code required.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐│ Benchmark │───▶│ Benchmark │───▶│ Provider 1 ││ Suite │ │ Runner │ │ Provider 2 │└──────────────┘ └──────────────┘ │ Provider N │ └──────┬───────┘ │┌──────────────┐ ┌──────────────┐ ┌─────▼────────┐│ Comparative │◀───│ Analytics │◀───│ Metrics ││ Reports │ │ Engine │ │ Collector │└──────────────┘ └──────────────┘ └──────────────┘Implementation
Section titled “Implementation”Benchmark Infrastructure
Section titled “Benchmark Infrastructure”The benchmark runner executes tests across multiple providers with consistent metrics:
package main
import ( "context" "fmt" "time"
"github.com/lookatitude/beluga-ai/llm" "github.com/lookatitude/beluga-ai/schema"
_ "github.com/lookatitude/beluga-ai/llm/providers/openai" _ "github.com/lookatitude/beluga-ai/llm/providers/anthropic" _ "github.com/lookatitude/beluga-ai/llm/providers/google")
type BenchmarkSuite struct { Name string Tests []BenchmarkTest Providers []string}
type BenchmarkTest struct { Name string Prompt string ExpectedOutput string // For quality evaluation Category string}
type BenchmarkResult struct { Provider string Model string TestName string Latency time.Duration InputTokens int OutputTokens int Cost float64 Quality float64 // 0-1 score Timestamp time.Time}
type BenchmarkRunner struct { providers map[string]llm.ChatModel}
func NewBenchmarkRunner(ctx context.Context) (*BenchmarkRunner, error) { providers := make(map[string]llm.ChatModel)
// Initialize providers openai, err := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"}) if err == nil { providers["openai"] = openai }
anthropic, err := llm.New("anthropic", llm.ProviderConfig{Model: "claude-3-5-sonnet-20241022"}) if err == nil { providers["anthropic"] = anthropic }
google, err := llm.New("google", llm.ProviderConfig{Model: "gemini-2.0-flash-exp"}) if err == nil { providers["google"] = google }
return &BenchmarkRunner{ providers: providers, }, nil}Benchmark Execution
Section titled “Benchmark Execution”Execute benchmarks and collect comprehensive metrics:
func (r *BenchmarkRunner) RunBenchmark(ctx context.Context, suite BenchmarkSuite) ([]BenchmarkResult, error) { var results []BenchmarkResult
for _, test := range suite.Tests { for _, providerName := range suite.Providers { provider, exists := r.providers[providerName] if !exists { continue }
result, err := r.runTest(ctx, provider, providerName, test) if err != nil { // Log but continue with other tests continue }
results = append(results, result) } }
return results, nil}
func (r *BenchmarkRunner) runTest(ctx context.Context, provider llm.ChatModel, providerName string, test BenchmarkTest) (BenchmarkResult, error) { startTime := time.Now()
messages := []schema.Message{ &schema.HumanMessage{Parts: []schema.ContentPart{ schema.TextPart{Text: test.Prompt}, }}, }
resp, err := provider.Generate(ctx, messages) if err != nil { return BenchmarkResult{}, err }
latency := time.Since(startTime)
// Extract token usage from response metadata inputTokens, outputTokens := extractTokenUsage(resp)
// Calculate cost based on provider pricing cost := calculateCost(providerName, inputTokens, outputTokens)
// Evaluate quality against expected output quality := evaluateQuality(resp.Parts[0].(schema.TextPart).Text, test.ExpectedOutput)
return BenchmarkResult{ Provider: providerName, Model: provider.ModelID(), TestName: test.Name, Latency: latency, InputTokens: inputTokens, OutputTokens: outputTokens, Cost: cost, Quality: quality, Timestamp: time.Now(), }, nil}
func extractTokenUsage(resp *schema.AIMessage) (int, int) { // Extract from response metadata if available if usage, ok := resp.Metadata["usage"].(map[string]interface{}); ok { input := int(usage["input_tokens"].(float64)) output := int(usage["output_tokens"].(float64)) return input, output } return 0, 0}
func calculateCost(provider string, inputTokens, outputTokens int) float64 { // Pricing per million tokens pricing := map[string]struct{ input, output float64 }{ "openai": {2.50, 10.00}, // GPT-4o "anthropic": {3.00, 15.00}, // Claude 3.5 Sonnet "google": {0.00, 0.00}, // Gemini 2.0 Flash (free tier) }
p := pricing[provider] return (float64(inputTokens)*p.input + float64(outputTokens)*p.output) / 1_000_000}
func evaluateQuality(output, expected string) float64 { // Simple quality metric - could use semantic similarity, BLEU score, etc. if expected == "" { return 1.0 // No expected output to compare } // Simplified: return 1.0 for exact match, scale down for differences // In production, use semantic similarity or domain-specific metrics return 0.85 // Placeholder}Comparative Analysis
Section titled “Comparative Analysis”Generate reports comparing providers across dimensions:
type ComparativeReport struct { Providers map[string]ProviderMetrics Timestamp time.Time}
type ProviderMetrics struct { AverageLatency time.Duration AverageCost float64 AverageQuality float64 TotalTests int P50Latency time.Duration P95Latency time.Duration CostPerSuccessTest float64}
func (r *BenchmarkRunner) GenerateReport(ctx context.Context, results []BenchmarkResult) *ComparativeReport { providerResults := make(map[string][]BenchmarkResult) for _, result := range results { providerResults[result.Provider] = append(providerResults[result.Provider], result) }
report := &ComparativeReport{ Providers: make(map[string]ProviderMetrics), Timestamp: time.Now(), }
for provider, results := range providerResults { report.Providers[provider] = calculateMetrics(results) }
return report}
func calculateMetrics(results []BenchmarkResult) ProviderMetrics { if len(results) == 0 { return ProviderMetrics{} }
var totalLatency time.Duration var totalCost float64 var totalQuality float64
latencies := make([]time.Duration, len(results)) for i, r := range results { totalLatency += r.Latency totalCost += r.Cost totalQuality += r.Quality latencies[i] = r.Latency }
// Sort for percentile calculation sortDurations(latencies)
return ProviderMetrics{ AverageLatency: totalLatency / time.Duration(len(results)), AverageCost: totalCost / float64(len(results)), AverageQuality: totalQuality / float64(len(results)), TotalTests: len(results), P50Latency: latencies[len(latencies)/2], P95Latency: latencies[int(float64(len(latencies))*0.95)], CostPerSuccessTest: totalCost, }}Production Considerations
Section titled “Production Considerations”Automated Scheduling
Section titled “Automated Scheduling”Run benchmarks on a schedule to track model performance over time:
import "time"
func (r *BenchmarkRunner) RunScheduled(ctx context.Context, suite BenchmarkSuite, interval time.Duration) { ticker := time.NewTicker(interval) defer ticker.Stop()
for { select { case <-ctx.Done(): return case <-ticker.C: results, err := r.RunBenchmark(ctx, suite) if err != nil { // Log error continue }
// Store results for historical analysis if err := r.storeResults(ctx, results); err != nil { // Log error }
// Generate and publish report report := r.GenerateReport(ctx, results) if err := r.publishReport(ctx, report); err != nil { // Log error } } }}Observability
Section titled “Observability”Track benchmark execution with OpenTelemetry:
import ( "github.com/lookatitude/beluga-ai/o11y" "go.opentelemetry.io/otel/attribute")
func (r *BenchmarkRunner) RunBenchmarkWithTracing(ctx context.Context, suite BenchmarkSuite) ([]BenchmarkResult, error) { ctx, span := o11y.StartSpan(ctx, "benchmark.run") defer span.End()
span.SetAttributes( attribute.String("suite.name", suite.Name), attribute.Int("suite.tests", len(suite.Tests)), attribute.Int("suite.providers", len(suite.Providers)), )
results, err := r.RunBenchmark(ctx, suite) if err != nil { span.RecordError(err) return nil, err }
span.SetAttributes(attribute.Int("results.count", len(results))) return results, nil}Cost Control
Section titled “Cost Control”- Rate limiting: Limit concurrent requests to avoid excessive API costs
- Result caching: Cache results for identical prompts to avoid re-running tests
- Budget alerts: Set spending limits and alert when thresholds are exceeded
- Sampling: For large suites, use representative samples rather than exhaustive testing
Results
Section titled “Results”After implementing model benchmarking, the platform achieved:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Model Selection Time | 8-12 hours | 30 minutes | 95-96% reduction |
| Cost Optimization | Baseline | -28% | 28% cost savings |
| Benchmark Coverage | Manual, ad-hoc | Automated, comprehensive | New capability |
| Performance Visibility | None | Real-time | New capability |
Related Resources
Section titled “Related Resources”- LLM Integration Guide for multi-provider setup
- LLM Providers for available provider options
- Observability Guide for metrics and tracing setup