Evaluation Providers

Beluga AI v2 includes a built-in evaluation framework for measuring the quality of LLM outputs, RAG pipelines, and agent behavior. The framework defines a Metric interface that external evaluation platforms implement, and an EvalRunner that orchestrates parallel evaluation across datasets.

All evaluation providers implement the Metric interface:

```go
type Metric interface {
	Name() string
	Score(ctx context.Context, sample EvalSample) (float64, error)
}
```

Each Score call returns a value in the range [0.0, 1.0], where higher values indicate better quality.

The EvalSample struct carries all the data needed for evaluation:

```go
type EvalSample struct {
	Input          string            // Original question or prompt
	Output         string            // AI-generated response
	ExpectedOutput string            // Ground-truth reference answer
	RetrievedDocs  []schema.Document // Context documents used for generation
	Metadata       map[string]any    // Metric-specific data (latency, tokens, model)
}
```

The EvalRunner orchestrates evaluation across a dataset of samples with configurable parallelism, timeouts, and lifecycle hooks:

```go
import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/lookatitude/beluga-ai/eval"
	"github.com/lookatitude/beluga-ai/eval/providers/ragas"
)

func main() {
	metric, err := ragas.New(
		ragas.WithMetricName("faithfulness"),
		ragas.WithBaseURL("http://localhost:8080"),
	)
	if err != nil {
		log.Fatal(err)
	}

	// samples is a []eval.EvalSample prepared elsewhere.
	runner := eval.NewRunner(
		eval.WithMetrics(metric),
		eval.WithDataset(samples),
		eval.WithParallel(4),
		eval.WithTimeout(5*time.Minute),
	)

	report, err := runner.Run(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	for name, score := range report.Metrics {
		fmt.Printf("%s: %.3f\n", name, score)
	}
}
```
| Option | Type | Default | Description |
|---|---|---|---|
| `WithMetrics(metrics ...Metric)` | `[]Metric` | (required) | Metrics to evaluate |
| `WithDataset(samples []EvalSample)` | `[]EvalSample` | (required) | Evaluation dataset |
| `WithParallel(n int)` | `int` | `1` | Concurrent sample evaluation |
| `WithTimeout(d time.Duration)` | `time.Duration` | `0` (none) | Maximum duration for entire run |
| `WithStopOnError(stop bool)` | `bool` | `false` | Stop on first metric error |
| `WithHooks(hooks Hooks)` | `Hooks` | | Lifecycle callbacks |

The evaluation runner supports lifecycle hooks for logging, progress tracking, and integration with CI systems:

```go
hooks := eval.Hooks{
	BeforeRun: func(ctx context.Context, samples []eval.EvalSample) error {
		log.Printf("Starting evaluation with %d samples", len(samples))
		return nil
	},
	AfterRun: func(ctx context.Context, report *eval.EvalReport) {
		log.Printf("Evaluation complete: %d samples in %s", len(report.Samples), report.Duration)
	},
	BeforeSample: func(ctx context.Context, sample eval.EvalSample) error {
		preview := sample.Input
		if len(preview) > 50 { // avoid slicing past the end of short inputs
			preview = preview[:50]
		}
		log.Printf("Evaluating: %s", preview)
		return nil
	},
	AfterSample: func(ctx context.Context, result eval.SampleResult) {
		for name, score := range result.Scores {
			log.Printf("  %s: %.3f", name, score)
		}
	},
}

runner := eval.NewRunner(
	eval.WithMetrics(metric),
	eval.WithDataset(samples),
	eval.WithHooks(hooks),
)
```

The EvalReport aggregates results across all samples:

```go
type EvalReport struct {
	Samples  []SampleResult     // Per-sample results
	Metrics  map[string]float64 // Average score per metric
	Duration time.Duration      // Total evaluation time
	Errors   []error            // Collected errors
}

type SampleResult struct {
	Sample EvalSample
	Scores map[string]float64 // Metric name to score
	Error  error
}
```

Load and save evaluation datasets as JSON:

```go
dataset, err := eval.LoadDataset("testdata/qa_pairs.json")
if err != nil {
	log.Fatal(err)
}

runner := eval.NewRunner(
	eval.WithMetrics(metric),
	eval.WithDataset(dataset.Samples),
)

// Save results for later analysis
if err := dataset.Save("testdata/results.json"); err != nil {
	log.Fatal(err)
}
```
| Provider | Prefix | Default Metric | Description |
|---|---|---|---|
| Braintrust | `braintrust_` | `factuality` | Cloud-hosted evaluation via Braintrust API |
| DeepEval | `deepeval_` | `faithfulness` | Evaluation via DeepEval server |
| RAGAS | `ragas_` | `faithfulness` | RAG-focused evaluation via RAGAS server |

Combine metrics from different providers in a single evaluation run:

```go
btMetric, err := braintrust.New(
	braintrust.WithAPIKey(os.Getenv("BRAINTRUST_API_KEY")),
	braintrust.WithMetricName("factuality"),
)
if err != nil {
	log.Fatal(err)
}

ragasMetric, err := ragas.New(
	ragas.WithMetricName("answer_relevancy"),
	ragas.WithBaseURL("http://localhost:8080"),
)
if err != nil {
	log.Fatal(err)
}

runner := eval.NewRunner(
	eval.WithMetrics(btMetric, ragasMetric),
	eval.WithDataset(samples),
	eval.WithParallel(4),
)

report, err := runner.Run(ctx)
if err != nil {
	log.Fatal(err)
}
// report.Metrics contains both "braintrust_factuality" and "ragas_answer_relevancy"
```