Evaluation Providers
Beluga AI v2 includes a built-in evaluation framework for measuring the quality of LLM outputs, RAG pipelines, and agent behavior. The framework defines a Metric interface that external evaluation platforms implement, and an EvalRunner that orchestrates parallel evaluation across datasets.
Core Interface
All evaluation providers implement the Metric interface:
```go
type Metric interface {
    Name() string
    Score(ctx context.Context, sample EvalSample) (float64, error)
}
```

Each Score call returns a value in the range [0.0, 1.0], where higher values indicate better quality.
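Any type with these two methods can be passed to the runner alongside the built-in providers. The following is a minimal sketch of a hypothetical local exact-match metric; the `ExactMatch` type is not part of Beluga AI and only illustrates the interface contract:

```go
import (
    "context"
    "strings"

    "github.com/lookatitude/beluga-ai/eval"
)

// ExactMatch is a hypothetical local metric used only to illustrate the
// Metric contract; it is not shipped with Beluga AI.
type ExactMatch struct{}

func (ExactMatch) Name() string { return "exact_match" }

// Score returns 1.0 when the output matches the reference answer exactly
// (ignoring surrounding whitespace), otherwise 0.0.
func (ExactMatch) Score(ctx context.Context, sample eval.EvalSample) (float64, error) {
    if strings.TrimSpace(sample.Output) == strings.TrimSpace(sample.ExpectedOutput) {
        return 1.0, nil
    }
    return 0.0, nil
}
```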
EvalSample
The EvalSample struct carries all the data needed for evaluation:
```go
type EvalSample struct {
    Input          string            // Original question or prompt
    Output         string            // AI-generated response
    ExpectedOutput string            // Ground-truth reference answer
    RetrievedDocs  []schema.Document // Context documents used for generation
    Metadata       map[string]any    // Metric-specific data (latency, tokens, model)
}
```
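For instance, a ground-truth QA pair can be expressed as a sample like the sketch below; `pipelineAnswer` and `retrievedDocs` stand in for values produced by your own pipeline, and the literal strings are placeholders:

```go
sample := eval.EvalSample{
    Input:          "What is the capital of France?",
    Output:         pipelineAnswer, // response produced by your pipeline
    ExpectedOutput: "Paris",
    RetrievedDocs:  retrievedDocs,  // []schema.Document used as generation context
    Metadata:       map[string]any{"model": "gpt-4o"},
}
```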
EvalRunner

The EvalRunner orchestrates evaluation across a dataset of samples with configurable parallelism, timeouts, and lifecycle hooks:
import ( "context" "fmt" "log"
"github.com/lookatitude/beluga-ai/eval" _ "github.com/lookatitude/beluga-ai/eval/providers/ragas")
func main() { metric, err := ragas.New( ragas.WithMetricName("faithfulness"), ragas.WithBaseURL("http://localhost:8080"), ) if err != nil { log.Fatal(err) }
runner := eval.NewRunner( eval.WithMetrics(metric), eval.WithDataset(samples), eval.WithParallel(4), eval.WithTimeout(5 * time.Minute), )
report, err := runner.Run(context.Background()) if err != nil { log.Fatal(err) }
for name, score := range report.Metrics { fmt.Printf("%s: %.3f\n", name, score) }}Runner Options
| Option | Type | Default | Description |
|---|---|---|---|
| WithMetrics(metrics ...Metric) | []Metric | (required) | Metrics to evaluate |
| WithDataset(samples []EvalSample) | []EvalSample | (required) | Evaluation dataset |
| WithParallel(n int) | int | 1 | Concurrent sample evaluation |
| WithTimeout(d time.Duration) | time.Duration | 0 (none) | Maximum duration for the entire run |
| WithStopOnError(stop bool) | bool | false | Stop on the first metric error |
| WithHooks(hooks Hooks) | Hooks | — | Lifecycle callbacks |
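Putting several of these options together, a stricter CI-oriented configuration might look like the following sketch, assuming `metric` and `samples` are prepared as in the earlier example:

```go
runner := eval.NewRunner(
    eval.WithMetrics(metric),
    eval.WithDataset(samples),
    eval.WithParallel(8),             // evaluate up to 8 samples concurrently
    eval.WithTimeout(10*time.Minute), // abort the whole run after 10 minutes
    eval.WithStopOnError(true),       // fail fast on the first metric error
)
```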
The evaluation runner supports lifecycle hooks for logging, progress tracking, and integration with CI systems:
```go
hooks := eval.Hooks{
    BeforeRun: func(ctx context.Context, samples []eval.EvalSample) error {
        log.Printf("Starting evaluation with %d samples", len(samples))
        return nil
    },
    AfterRun: func(ctx context.Context, report *eval.EvalReport) {
        log.Printf("Evaluation complete: %d samples in %s", len(report.Samples), report.Duration)
    },
    BeforeSample: func(ctx context.Context, sample eval.EvalSample) error {
        log.Printf("Evaluating: %.50s", sample.Input) // %.50s truncates safely for short inputs
        return nil
    },
    AfterSample: func(ctx context.Context, result eval.SampleResult) {
        for name, score := range result.Scores {
            log.Printf("  %s: %.3f", name, score)
        }
    },
}
```
```go
runner := eval.NewRunner(
    eval.WithMetrics(metric),
    eval.WithDataset(samples),
    eval.WithHooks(hooks),
)
```

EvalReport
The EvalReport aggregates results across all samples:
```go
type EvalReport struct {
    Samples  []SampleResult     // Per-sample results
    Metrics  map[string]float64 // Average score per metric
    Duration time.Duration      // Total evaluation time
    Errors   []error            // Collected errors
}
```
```go
type SampleResult struct {
    Sample EvalSample
    Scores map[string]float64 // Metric name to score
    Error  error
}
```
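The per-sample results make it easy to surface failures after a run. A minimal sketch, assuming `report` comes from `runner.Run` as above and using a hypothetical quality threshold of 0.7:

```go
for _, result := range report.Samples {
    if result.Error != nil {
        log.Printf("sample failed: %v (input: %.60s)", result.Error, result.Sample.Input)
        continue
    }
    for name, score := range result.Scores {
        if score < 0.7 { // flag low-scoring samples for manual review
            log.Printf("low %s score %.3f for input: %.60s", name, score, result.Sample.Input)
        }
    }
}
```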
Datasets

Load and save evaluation datasets as JSON:
```go
dataset, err := eval.LoadDataset("testdata/qa_pairs.json")
if err != nil {
    log.Fatal(err)
}

runner := eval.NewRunner(
    eval.WithMetrics(metric),
    eval.WithDataset(dataset.Samples),
)

// Save results for later analysis
err = dataset.Save("testdata/results.json")
```

Available Providers
| Provider | Prefix | Default Metric | Description |
|---|---|---|---|
| Braintrust | braintrust_ | factuality | Cloud-hosted evaluation via Braintrust API |
| DeepEval | deepeval_ | faithfulness | Evaluation via DeepEval server |
| RAGAS | ragas_ | faithfulness | RAG-focused evaluation via RAGAS server |
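The RAGAS and Braintrust constructors are shown elsewhere on this page. As a rough sketch only, a DeepEval metric is assumed here to follow the same constructor pattern as the other server-based provider; the import path, option names, and port below are assumptions, not confirmed API, so check the deepeval provider package before use:

```go
// Assumed constructor shape, mirroring the ragas provider shown above.
deMetric, err := deepeval.New(
    deepeval.WithMetricName("faithfulness"), // provider default per the table above
    deepeval.WithBaseURL("http://localhost:8000"),
)
if err != nil {
    log.Fatal(err)
}
// Based on the prefix column, this metric's average should appear in a
// report under a key like "deepeval_faithfulness".
```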
Multiple Metrics
Combine metrics from different providers in a single evaluation run:
```go
btMetric, err := braintrust.New(
    braintrust.WithAPIKey(os.Getenv("BRAINTRUST_API_KEY")),
    braintrust.WithMetricName("factuality"),
)
if err != nil {
    log.Fatal(err)
}

ragasMetric, err := ragas.New(
    ragas.WithMetricName("answer_relevancy"),
    ragas.WithBaseURL("http://localhost:8080"),
)
if err != nil {
    log.Fatal(err)
}

runner := eval.NewRunner(
    eval.WithMetrics(btMetric, ragasMetric),
    eval.WithDataset(samples),
    eval.WithParallel(4),
)

report, err := runner.Run(ctx)
// report.Metrics contains both "braintrust_factuality" and "ragas_answer_relevancy"
```