Braintrust Evaluation Provider
The Braintrust provider connects Beluga AI’s evaluation framework to the Braintrust cloud platform. It implements the eval.Metric interface by sending samples to the Braintrust scoring API and returning normalized scores.
Choose Braintrust when you want a managed evaluation platform with a dashboard for tracking scores over time. Braintrust supports factuality, relevancy, and other LLM-specific metrics with built-in project organization. For self-hosted evaluation with Python-native metrics, consider DeepEval. For RAG-specific metrics, consider RAGAS.
Installation

    go get github.com/lookatitude/beluga-ai/eval/providers/braintrust

Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| `WithAPIKey(key)` | `string` | (required) | Braintrust API key |
| `WithMetricName(name)` | `string` | `"factuality"` | Metric to evaluate |
| `WithProjectName(name)` | `string` | `"default"` | Braintrust project name |
| `WithBaseURL(url)` | `string` | `https://api.braintrust.dev` | API endpoint |
| `WithTimeout(d)` | `time.Duration` | `30s` | HTTP request timeout |
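
The `With...` names in the table suggest Go's functional-options pattern. The sketch below is a standalone illustration of how defaults like those in the table could be applied and then overridden. It is not the provider's source: the `config` struct, the `newConfig` function, and the local `Option` type are hypothetical names introduced here, and only two of the five options are reproduced.

```go
package main

import (
	"fmt"
	"time"
)

// config is a hypothetical stand-in for the provider's internal settings.
type config struct {
	metricName  string
	projectName string
	baseURL     string
	timeout     time.Duration
}

// Option mutates a config; each With... function returns one.
type Option func(*config)

// WithMetricName overrides the default metric (local illustration only).
func WithMetricName(name string) Option {
	return func(c *config) { c.metricName = name }
}

// WithTimeout overrides the default HTTP timeout (local illustration only).
func WithTimeout(d time.Duration) Option {
	return func(c *config) { c.timeout = d }
}

// newConfig starts from the defaults listed in the table, then applies
// the caller's options in order, so later options win.
func newConfig(opts ...Option) config {
	c := config{
		metricName:  "factuality",
		projectName: "default",
		baseURL:     "https://api.braintrust.dev",
		timeout:     30 * time.Second,
	}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	c := newConfig(WithMetricName("relevancy"), WithTimeout(10*time.Second))
	fmt.Printf("metric=%s project=%s timeout=%s\n", c.metricName, c.projectName, c.timeout)
}
```

Options that are not passed keep their defaults, which is why only `WithAPIKey` is marked as required.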
Environment variables:
| Variable | Maps to |
|---|---|
| `BRAINTRUST_API_KEY` | `WithAPIKey` |
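
Because `os.Getenv` returns an empty string for an unset variable, it can help to validate the key at startup instead of waiting for the first scoring call to fail. The following is a minimal sketch of such a check. `apiKeyFrom` is a hypothetical helper, not part of the provider; it takes the lookup function as a parameter so it can be exercised without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// envAPIKey is the variable name from the table above.
const envAPIKey = "BRAINTRUST_API_KEY"

// apiKeyFrom is a hypothetical helper: it reads the key through lookup
// (os.LookupEnv in production) and rejects a missing or blank value.
func apiKeyFrom(lookup func(string) (string, bool)) (string, error) {
	key, ok := lookup(envAPIKey)
	key = strings.TrimSpace(key)
	if !ok || key == "" {
		return "", errors.New(envAPIKey + " is not set")
	}
	return key, nil
}

func main() {
	key, err := apiKeyFrom(os.LookupEnv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	// The validated key would then be passed to braintrust.WithAPIKey(key).
	fmt.Printf("loaded API key (%d characters)\n", len(key))
}
```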
Basic Usage

    package main

    import (
        "context"
        "fmt"
        "log"
        "os"

        "github.com/lookatitude/beluga-ai/eval"
        "github.com/lookatitude/beluga-ai/eval/providers/braintrust"
    )

    func main() {
        metric, err := braintrust.New(
            braintrust.WithAPIKey(os.Getenv("BRAINTRUST_API_KEY")),
            braintrust.WithMetricName("factuality"),
            braintrust.WithProjectName("my-project"),
        )
        if err != nil {
            log.Fatal(err)
        }

        sample := eval.EvalSample{
            Input:          "What is the capital of France?",
            Output:         "The capital of France is Paris.",
            ExpectedOutput: "Paris",
        }

        score, err := metric.Score(context.Background(), sample)
        if err != nil {
            log.Fatal(err)
        }

        fmt.Printf("%s: %.3f\n", metric.Name(), score)
        // Output: braintrust_factuality: 0.950
    }

With EvalRunner
Use Braintrust metrics with the evaluation runner for batch evaluation:

    metric, err := braintrust.New(
        braintrust.WithAPIKey(os.Getenv("BRAINTRUST_API_KEY")),
        braintrust.WithMetricName("factuality"),
    )
    if err != nil {
        log.Fatal(err)
    }

    runner := eval.NewRunner(
        eval.WithMetrics(metric),
        eval.WithDataset(samples),
        eval.WithParallel(4),
        eval.WithTimeout(5 * time.Minute),
    )

    report, err := runner.Run(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Average factuality: %.3f\n", report.Metrics["braintrust_factuality"])

RAG Evaluation
When evaluating RAG pipelines, include retrieved documents as context:

    sample := eval.EvalSample{
        Input:          "How does photosynthesis work?",
        Output:         "Photosynthesis converts CO2 and water into glucose using sunlight.",
        ExpectedOutput: "Plants use light energy to convert carbon dioxide and water into glucose and oxygen.",
        RetrievedDocs: []schema.Document{
            {Content: "Photosynthesis is the process by which plants convert light energy..."},
            {Content: "The process occurs in chloroplasts and involves two stages..."},
        },
    }

    score, err := metric.Score(ctx, sample)

The provider extracts content from RetrievedDocs and sends it as context to the Braintrust API alongside the input, output, and expected output.
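
To make the extraction step concrete, here is a rough sketch of how document content might be collected before it is sent. It rests on assumptions: `Document` is a local stand-in for `schema.Document` reduced to its `Content` field, `contextFromDocs` is a hypothetical helper, and skipping blank entries is a choice made for this example. The exact payload the provider builds is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a local stand-in for schema.Document, reduced to the one
// field used in the sample above.
type Document struct {
	Content string
}

// contextFromDocs is a hypothetical helper: it collects the trimmed,
// non-empty Content of each document, preserving retrieval order.
func contextFromDocs(docs []Document) []string {
	passages := make([]string, 0, len(docs))
	for _, d := range docs {
		if c := strings.TrimSpace(d.Content); c != "" {
			passages = append(passages, c)
		}
	}
	return passages
}

func main() {
	docs := []Document{
		{Content: "Photosynthesis is the process by which plants convert light energy..."},
		{Content: "   "}, // blank content is skipped in this sketch
		{Content: "The process occurs in chloroplasts and involves two stages..."},
	}
	for i, p := range contextFromDocs(docs) {
		fmt.Printf("context[%d]: %s\n", i, p)
	}
}
```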
Metric Naming
Braintrust metrics are prefixed with braintrust_ to distinguish them from metrics from other providers. For example, a metric configured with WithMetricName("factuality") reports its name as braintrust_factuality.
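
When reading results from a report map, the prefixed name is the lookup key. The sketch below shows one way to build that key instead of hard-coding it. `reportKey` is a hypothetical helper, and the map values are made-up sample numbers, not real results.

```go
package main

import "fmt"

// metricPrefix is the prefix described above.
const metricPrefix = "braintrust_"

// reportKey is a hypothetical helper that derives the reported metric
// name from the name passed to WithMetricName.
func reportKey(metricName string) string {
	return metricPrefix + metricName
}

func main() {
	// Stand-in for a report's metric map; the values are illustrative only.
	averages := map[string]float64{
		"braintrust_factuality": 0.91,
		"other_provider_metric": 0.78,
	}

	key := reportKey("factuality")
	if avg, ok := averages[key]; ok {
		fmt.Printf("%s = %.2f\n", key, avg)
	} else {
		fmt.Printf("no result recorded for %s\n", key)
	}
}
```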
Error Handling

    score, err := metric.Score(ctx, sample)
    if err != nil {
        // Errors include HTTP failures, invalid API responses, and authentication issues
        log.Printf("Braintrust scoring failed: %v", err)
    }

Scores are clamped to the [0.0, 1.0] range: if the API returns a value outside this range, the provider clamps it to the nearest bound.
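
The clamping behaviour can be sketched in a few lines. `clampScore` is a hypothetical function written for illustration, not the provider's code, and mapping NaN to 0 is an assumption of this sketch, since NaN handling is not described above.

```go
package main

import (
	"fmt"
	"math"
)

// clampScore limits a raw score to the [0.0, 1.0] range.
// Assumption: NaN is mapped to 0, because math.Min and math.Max
// would otherwise propagate it.
func clampScore(raw float64) float64 {
	if math.IsNaN(raw) {
		return 0
	}
	return math.Min(1, math.Max(0, raw))
}

func main() {
	for _, raw := range []float64{-0.3, 0.42, 1.2} {
		fmt.Printf("%.2f -> %.2f\n", raw, clampScore(raw))
	}
}
```

Clamping keeps scores from different metrics comparable in a report, at the cost of hiding how far outside the range the raw value was.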