Skip to content
Docs

LLM Benchmarking Dashboard

LLM providers release new models and update existing ones frequently — performance characteristics shift, pricing changes, and new capabilities appear. Teams that benchmark models once during initial selection make decisions based on stale data. A model that was the best choice six months ago may now be outperformed by a cheaper alternative, or a provider’s latency profile may have degraded.

Manual benchmarking compounds the problem: each provider has different API semantics, error formats, and usage reporting. Running tests by hand produces inconsistent results that are hard to compare across providers and impossible to track over time.

An automated benchmarking dashboard runs standardized test suites against all configured providers on a schedule, collects comparable metrics (latency, tokens, cost, quality), and generates comparative reports that reveal which model delivers the best cost-quality tradeoff for each category of request.

Beluga AI’s registry pattern and unified llm.ChatModel interface eliminate provider-specific client code. The benchmark runner initializes providers through llm.New(), executes identical prompts against each, and collects standardized metrics. Because all providers implement the same interface, adding a new provider to the benchmark suite is a single llm.New() call — no adapter code required.

┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Benchmark │───▶│ Benchmark │───▶│ Provider 1 │
│ Suite │ │ Runner │ │ Provider 2 │
└──────────────┘ └──────────────┘ │ Provider N │
└──────┬───────┘
┌──────────────┐ ┌──────────────┐ ┌─────▼────────┐
│ Comparative │◀───│ Analytics │◀───│ Metrics │
│ Reports │ │ Engine │ │ Collector │
└──────────────┘ └──────────────┘ └──────────────┘

The benchmark runner executes tests across multiple providers with consistent metrics:

package main
import (
"context"
"fmt"
"time"
"github.com/lookatitude/beluga-ai/llm"
"github.com/lookatitude/beluga-ai/schema"
_ "github.com/lookatitude/beluga-ai/llm/providers/openai"
_ "github.com/lookatitude/beluga-ai/llm/providers/anthropic"
_ "github.com/lookatitude/beluga-ai/llm/providers/google"
)
type BenchmarkSuite struct {
Name string
Tests []BenchmarkTest
Providers []string
}
type BenchmarkTest struct {
Name string
Prompt string
ExpectedOutput string // For quality evaluation
Category string
}
type BenchmarkResult struct {
Provider string
Model string
TestName string
Latency time.Duration
InputTokens int
OutputTokens int
Cost float64
Quality float64 // 0-1 score
Timestamp time.Time
}
type BenchmarkRunner struct {
providers map[string]llm.ChatModel
}
func NewBenchmarkRunner(ctx context.Context) (*BenchmarkRunner, error) {
providers := make(map[string]llm.ChatModel)
// Initialize providers
openai, err := llm.New("openai", llm.ProviderConfig{Model: "gpt-4o"})
if err == nil {
providers["openai"] = openai
}
anthropic, err := llm.New("anthropic", llm.ProviderConfig{Model: "claude-3-5-sonnet-20241022"})
if err == nil {
providers["anthropic"] = anthropic
}
google, err := llm.New("google", llm.ProviderConfig{Model: "gemini-2.0-flash-exp"})
if err == nil {
providers["google"] = google
}
return &BenchmarkRunner{
providers: providers,
}, nil
}

Execute benchmarks and collect comprehensive metrics:

func (r *BenchmarkRunner) RunBenchmark(ctx context.Context, suite BenchmarkSuite) ([]BenchmarkResult, error) {
var results []BenchmarkResult
for _, test := range suite.Tests {
for _, providerName := range suite.Providers {
provider, exists := r.providers[providerName]
if !exists {
continue
}
result, err := r.runTest(ctx, provider, providerName, test)
if err != nil {
// Log but continue with other tests
continue
}
results = append(results, result)
}
}
return results, nil
}
func (r *BenchmarkRunner) runTest(ctx context.Context, provider llm.ChatModel, providerName string, test BenchmarkTest) (BenchmarkResult, error) {
startTime := time.Now()
messages := []schema.Message{
&schema.HumanMessage{Parts: []schema.ContentPart{
schema.TextPart{Text: test.Prompt},
}},
}
resp, err := provider.Generate(ctx, messages)
if err != nil {
return BenchmarkResult{}, err
}
latency := time.Since(startTime)
// Extract token usage from response metadata
inputTokens, outputTokens := extractTokenUsage(resp)
// Calculate cost based on provider pricing
cost := calculateCost(providerName, inputTokens, outputTokens)
// Evaluate quality against expected output
quality := evaluateQuality(resp.Parts[0].(schema.TextPart).Text, test.ExpectedOutput)
return BenchmarkResult{
Provider: providerName,
Model: provider.ModelID(),
TestName: test.Name,
Latency: latency,
InputTokens: inputTokens,
OutputTokens: outputTokens,
Cost: cost,
Quality: quality,
Timestamp: time.Now(),
}, nil
}
func extractTokenUsage(resp *schema.AIMessage) (int, int) {
// Extract from response metadata if available
if usage, ok := resp.Metadata["usage"].(map[string]interface{}); ok {
input := int(usage["input_tokens"].(float64))
output := int(usage["output_tokens"].(float64))
return input, output
}
return 0, 0
}
func calculateCost(provider string, inputTokens, outputTokens int) float64 {
// Pricing per million tokens
pricing := map[string]struct{ input, output float64 }{
"openai": {2.50, 10.00}, // GPT-4o
"anthropic": {3.00, 15.00}, // Claude 3.5 Sonnet
"google": {0.00, 0.00}, // Gemini 2.0 Flash (free tier)
}
p := pricing[provider]
return (float64(inputTokens)*p.input + float64(outputTokens)*p.output) / 1_000_000
}
func evaluateQuality(output, expected string) float64 {
// Simple quality metric - could use semantic similarity, BLEU score, etc.
if expected == "" {
return 1.0 // No expected output to compare
}
// Simplified: return 1.0 for exact match, scale down for differences
// In production, use semantic similarity or domain-specific metrics
return 0.85 // Placeholder
}

Generate reports comparing providers across dimensions:

type ComparativeReport struct {
Providers map[string]ProviderMetrics
Timestamp time.Time
}
type ProviderMetrics struct {
AverageLatency time.Duration
AverageCost float64
AverageQuality float64
TotalTests int
P50Latency time.Duration
P95Latency time.Duration
CostPerSuccessTest float64
}
func (r *BenchmarkRunner) GenerateReport(ctx context.Context, results []BenchmarkResult) *ComparativeReport {
providerResults := make(map[string][]BenchmarkResult)
for _, result := range results {
providerResults[result.Provider] = append(providerResults[result.Provider], result)
}
report := &ComparativeReport{
Providers: make(map[string]ProviderMetrics),
Timestamp: time.Now(),
}
for provider, results := range providerResults {
report.Providers[provider] = calculateMetrics(results)
}
return report
}
func calculateMetrics(results []BenchmarkResult) ProviderMetrics {
if len(results) == 0 {
return ProviderMetrics{}
}
var totalLatency time.Duration
var totalCost float64
var totalQuality float64
latencies := make([]time.Duration, len(results))
for i, r := range results {
totalLatency += r.Latency
totalCost += r.Cost
totalQuality += r.Quality
latencies[i] = r.Latency
}
// Sort for percentile calculation
sortDurations(latencies)
return ProviderMetrics{
AverageLatency: totalLatency / time.Duration(len(results)),
AverageCost: totalCost / float64(len(results)),
AverageQuality: totalQuality / float64(len(results)),
TotalTests: len(results),
P50Latency: latencies[len(latencies)/2],
P95Latency: latencies[int(float64(len(latencies))*0.95)],
CostPerSuccessTest: totalCost,
}
}

Run benchmarks on a schedule to track model performance over time:

import "time"
func (r *BenchmarkRunner) RunScheduled(ctx context.Context, suite BenchmarkSuite, interval time.Duration) {
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
results, err := r.RunBenchmark(ctx, suite)
if err != nil {
// Log error
continue
}
// Store results for historical analysis
if err := r.storeResults(ctx, results); err != nil {
// Log error
}
// Generate and publish report
report := r.GenerateReport(ctx, results)
if err := r.publishReport(ctx, report); err != nil {
// Log error
}
}
}
}

Track benchmark execution with OpenTelemetry:

import (
"github.com/lookatitude/beluga-ai/o11y"
"go.opentelemetry.io/otel/attribute"
)
func (r *BenchmarkRunner) RunBenchmarkWithTracing(ctx context.Context, suite BenchmarkSuite) ([]BenchmarkResult, error) {
ctx, span := o11y.StartSpan(ctx, "benchmark.run")
defer span.End()
span.SetAttributes(
attribute.String("suite.name", suite.Name),
attribute.Int("suite.tests", len(suite.Tests)),
attribute.Int("suite.providers", len(suite.Providers)),
)
results, err := r.RunBenchmark(ctx, suite)
if err != nil {
span.RecordError(err)
return nil, err
}
span.SetAttributes(attribute.Int("results.count", len(results)))
return results, nil
}
  • Rate limiting: Limit concurrent requests to avoid excessive API costs
  • Result caching: Cache results for identical prompts to avoid re-running tests
  • Budget alerts: Set spending limits and alert when thresholds are exceeded
  • Sampling: For large suites, use representative samples rather than exhaustive testing

After implementing model benchmarking, the platform achieved:

MetricBeforeAfterImprovement
Model Selection Time8-12 hours30 minutes95-96% reduction
Cost OptimizationBaseline-28%28% cost savings
Benchmark CoverageManual, ad-hocAutomated, comprehensiveNew capability
Performance VisibilityNoneReal-timeNew capability