Guardrails & Safety
Three-stage guard pipeline with prompt injection detection, PII filtering, content moderation, and capability-based sandboxing with default-deny policies.
Overview
Production AI systems require safety guarantees that go beyond prompt engineering. Beluga AI provides a three-stage guard pipeline that inspects every interaction at the input, output, and tool execution boundaries. Each stage operates independently, so you can mix and match guards from different providers or write your own.
The guard system includes built-in detectors for common threats: prompt injection attacks (heuristic, classifier-based, and Spotlighting strategies), PII leakage (with configurable redaction), and content moderation (toxicity, hate speech, inappropriate content). Guards can block, modify, or flag content depending on your policy configuration.
For tool execution, Beluga enforces capability-based sandboxing with a default-deny policy. Every tool
must declare the capabilities it requires -- such as CapToolExec, CapMemoryRead,
or CapNetworkAccess -- and the runtime grants only the minimum privileges needed. This
prevents unauthorized data access and limits the blast radius of any single tool failure.
Capabilities
Three-Stage Pipeline
Guards execute at three distinct boundaries: Input guards validate user messages before they reach the LLM, Output guards check model responses before they reach the user, and Tool guards verify tool calls and results. Each stage is independent and composable.
Prompt Injection Detection
Multiple detection strategies work together to catch injection attempts. Heuristic rules detect known patterns. Classifier-based detection uses trained models for nuanced attacks. Spotlighting transforms input to make injected instructions visible to the detector. Strategies are composable and can run in parallel.
PII Filtering
Detect and redact personally identifiable information before it reaches LLMs or appears in logs. Supports names, emails, phone numbers, addresses, SSNs, credit cards, and custom patterns. Redaction is configurable: mask, hash, or replace with entity tags.
Content Moderation
Filter toxic, hateful, or inappropriate content from both inputs and outputs. Configurable severity thresholds let you tune sensitivity per use case. Supports custom category definitions for domain-specific moderation rules.
Capability-Based Sandboxing
Every tool operates under a default-deny security model. Tools must declare required capabilities:
CapToolExec, CapMemoryRead, CapMemoryWrite,
CapNetworkAccess, CapFileRead, CapFileWrite.
The runtime grants only what is explicitly permitted by policy.
Safety Provider Adapters
Integrate with dedicated safety platforms via provider adapters. Each adapter normalizes the provider's API into the Beluga guard interface, so you can swap providers without changing application code.
Architecture
Providers & Implementations
| Name | Priority | Key Differentiator |
|---|---|---|
| Built-in Guards | P0 | Zero-dependency heuristic detectors for injection, PII, and moderation |
| NVIDIA NeMo Guardrails | P1 | Programmable guardrails with Colang dialog rules and topical control |
| Guardrails AI | P1 | Validator-based approach with a large hub of community validators |
| LLM Guard | P2 | Open-source scanner suite for prompt injection, toxicity, and bias |
| Lakera Guard | P2 | Real-time API for prompt injection with continuously updated threat models |
| Azure AI Content Safety | P2 | Microsoft-hosted moderation with severity scoring and custom categories |
Full Example
Set up a three-stage guard pipeline with input, output, and tool guards:
package main
import (
"context"
"fmt"
"log"
"github.com/lookatitude/beluga-ai/agent"
"github.com/lookatitude/beluga-ai/guard"
"github.com/lookatitude/beluga-ai/llm"
)
func main() {
ctx := context.Background()
// Create guard pipeline with three stages
pipeline := guard.NewPipeline(
// Stage 1: Input guards
guard.WithInputGuards(
guard.NewPromptInjectionDetector(
guard.WithStrategy(guard.StrategyHeuristic),
guard.WithStrategy(guard.StrategyClassifier),
guard.WithStrategy(guard.StrategySpotlighting),
),
guard.NewPIIFilter(
guard.WithEntities("email", "phone", "ssn", "credit_card"),
guard.WithRedactionMode(guard.RedactMask),
),
guard.NewContentModerator(
guard.WithCategories("toxicity", "hate_speech"),
guard.WithThreshold(0.8),
),
),
// Stage 2: Output guards
guard.WithOutputGuards(
guard.NewPIIFilter(
guard.WithEntities("email", "phone", "ssn"),
guard.WithRedactionMode(guard.RedactReplace),
),
guard.NewContentModerator(
guard.WithThreshold(0.7),
),
),
// Stage 3: Tool guards with capability sandboxing
guard.WithToolGuards(
guard.NewCapabilitySandbox(
guard.WithDefaultDeny(),
guard.WithAllowedCapabilities(
guard.CapToolExec,
guard.CapMemoryRead,
// CapNetworkAccess deliberately omitted
),
),
),
)
// Create model and agent with guard pipeline
model, _ := llm.New("openai", llm.WithModel("gpt-4o"))
myAgent, _ := agent.New("safe-agent",
agent.WithModel(model),
agent.WithGuardPipeline(pipeline),
agent.WithGuardHooks(guard.Hooks{
OnBlock: func(ctx context.Context, stage string, reason string) {
fmt.Printf("[BLOCKED] Stage: %s, Reason: %s\n", stage, reason)
},
OnRedact: func(ctx context.Context, stage string, count int) {
fmt.Printf("[REDACTED] Stage: %s, Items: %d\n", stage, count)
},
}),
)
result, err := myAgent.Run(ctx, "Analyze this document for key insights")
if err != nil {
log.Fatal(err)
}
fmt.Println(result)
}