Braintrust Evaluation Provider

The Braintrust provider connects Beluga AI’s evaluation framework to the Braintrust cloud platform. It implements the eval.Metric interface by sending samples to the Braintrust scoring API and returning normalized scores.
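
The exact interface is defined in the eval package; the sketch below shows the shape the provider satisfies, with method names taken from the examples on this page (other details are assumptions):

// Sketch of the eval.Metric interface implemented by this provider.
// Method names follow the examples below; consult the eval package for
// the authoritative definition.
type Metric interface {
	// Name returns the metric identifier, e.g. "braintrust_factuality".
	Name() string
	// Score evaluates one sample and returns a normalized score in [0.0, 1.0].
	Score(ctx context.Context, sample EvalSample) (float64, error)
}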

Choose Braintrust when you want a managed evaluation platform with a dashboard for tracking scores over time. Braintrust supports factuality, relevancy, and other LLM-specific metrics with built-in project organization. For self-hosted evaluation with Python-native metrics, consider DeepEval. For RAG-specific metrics, consider RAGAS.

Install the provider package:
go get github.com/lookatitude/beluga-ai/eval/providers/braintrust
Configuration options:

Option                | Type          | Default                     | Description
WithAPIKey(key)       | string        | (required)                  | Braintrust API key
WithMetricName(name)  | string        | "factuality"                | Metric to evaluate
WithProjectName(name) | string        | "default"                   | Braintrust project name
WithBaseURL(url)      | string        | https://api.braintrust.dev  | API endpoint
WithTimeout(d)        | time.Duration | 30s                         | HTTP request timeout
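
All options can be combined in a single New call; the following sketch shows a fully configured metric (the base URL and timeout values are illustrative, not recommendations):

metric, err := braintrust.New(
	braintrust.WithAPIKey(os.Getenv("BRAINTRUST_API_KEY")),
	braintrust.WithMetricName("relevancy"),
	braintrust.WithProjectName("my-project"),
	braintrust.WithBaseURL("https://api.braintrust.dev"),
	braintrust.WithTimeout(60 * time.Second),
)
if err != nil {
	log.Fatal(err)
}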

Environment variables:

Variable           | Maps to
BRAINTRUST_API_KEY | WithAPIKey
Basic usage:

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/lookatitude/beluga-ai/eval"
	"github.com/lookatitude/beluga-ai/eval/providers/braintrust"
)

func main() {
	metric, err := braintrust.New(
		braintrust.WithAPIKey(os.Getenv("BRAINTRUST_API_KEY")),
		braintrust.WithMetricName("factuality"),
		braintrust.WithProjectName("my-project"),
	)
	if err != nil {
		log.Fatal(err)
	}

	sample := eval.EvalSample{
		Input:          "What is the capital of France?",
		Output:         "The capital of France is Paris.",
		ExpectedOutput: "Paris",
	}

	score, err := metric.Score(context.Background(), sample)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%s: %.3f\n", metric.Name(), score)
	// Output: braintrust_factuality: 0.950
}

Use Braintrust metrics with the evaluation runner for batch evaluation:

metric, err := braintrust.New(
	braintrust.WithAPIKey(os.Getenv("BRAINTRUST_API_KEY")),
	braintrust.WithMetricName("factuality"),
)
if err != nil {
	log.Fatal(err)
}

runner := eval.NewRunner(
	eval.WithMetrics(metric),
	eval.WithDataset(samples),
	eval.WithParallel(4),
	eval.WithTimeout(5 * time.Minute),
)

report, err := runner.Run(context.Background())
if err != nil {
	log.Fatal(err)
}

fmt.Printf("Average factuality: %.3f\n", report.Metrics["braintrust_factuality"])

When evaluating RAG pipelines, include retrieved documents as context:

sample := eval.EvalSample{
	Input:          "How does photosynthesis work?",
	Output:         "Photosynthesis converts CO2 and water into glucose using sunlight.",
	ExpectedOutput: "Plants use light energy to convert carbon dioxide and water into glucose and oxygen.",
	RetrievedDocs: []schema.Document{
		{Content: "Photosynthesis is the process by which plants convert light energy..."},
		{Content: "The process occurs in chloroplasts and involves two stages..."},
	},
}

score, err := metric.Score(ctx, sample)

The provider extracts content from RetrievedDocs and sends it as context to the Braintrust API alongside the input, output, and expected output.
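
How the context is assembled is internal to the provider, but conceptually it amounts to collecting the document contents; the snippet below illustrates that idea and is not the provider's actual code:

// Illustration only: gather retrieved document text to send as context.
contextParts := make([]string, 0, len(sample.RetrievedDocs))
for _, doc := range sample.RetrievedDocs {
	contextParts = append(contextParts, doc.Content)
}
contextText := strings.Join(contextParts, "\n\n")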

Braintrust metric names are prefixed with braintrust_ to distinguish them from metrics produced by other providers. For example, a metric configured with WithMetricName("factuality") reports its name as braintrust_factuality.

Score returns an error if the request fails:

score, err := metric.Score(ctx, sample)
if err != nil {
	// Errors include HTTP failures, invalid API responses, and authentication issues.
	log.Printf("Braintrust scoring failed: %v", err)
}

Scores are clamped to the [0.0, 1.0] range. If the API returns a score outside this range, it is automatically normalized.
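
The normalization is equivalent to a simple clamp; a sketch of the behavior (not necessarily the provider's exact implementation):

// clamp restricts a raw API score to the [0.0, 1.0] range.
func clamp(score float64) float64 {
	if score < 0.0 {
		return 0.0
	}
	if score > 1.0 {
		return 1.0
	}
	return score
}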