Safety and Guard Pipeline

AI applications face safety challenges at every stage of the request lifecycle. Malicious users may attempt prompt injection to override system instructions. Models may produce responses containing personally identifiable information (PII) or harmful content. Tool calls may target sensitive operations that require human oversight. The guard and hitl packages provide a comprehensive safety layer that addresses each of these concerns at the appropriate pipeline stage.

The guard system uses a defense-in-depth strategy with three validation stages, each positioned at a critical boundary in the agent lifecycle. This three-stage approach ensures that unsafe content is caught as early as possible while still providing defense at later stages for content that passes through.

  • Input guards run before the LLM call, catching prompt injection attempts and invalid input before they consume tokens or influence model behavior.
  • Output guards run after the LLM responds, validating and sanitizing the response before it reaches the user. This is where PII redaction and content filtering happen.
  • Tool guards run before tool execution, validating arguments and confirming that the requested operation is safe to perform.

graph LR
  A[User Input] --> B["Input Guards"] --> C[LLM] --> D["Output Guards"] --> E[Response]
  C --> F[Tool Call] --> G["Tool Guards"] --> H[Execute]

Stage  | Validates        | Purpose
-------|------------------|--------
Input  | User messages    | Block prompt injection, validate format, enforce length limits
Output | Model responses  | Redact PII, filter harmful content, enforce compliance rules
Tool   | Tool arguments   | Validate parameters, check permissions, confirm destructive operations

Every guard implements the same interface, which makes guards composable and interchangeable. You can combine built-in guards with custom domain-specific guards in the same pipeline, and they all participate in the same validation flow.

type Guard interface {
    Name() string
    Validate(ctx context.Context, input GuardInput) (GuardResult, error)
}

GuardResult indicates whether content is allowed, optionally provides a modified version (for redaction), and explains blocking reasons for auditability:

type GuardResult struct {
    Allowed   bool   // True if content passes validation
    Reason    string // Why it was blocked or modified
    Modified  string // Optional sanitized version
    GuardName string // Which guard produced this result
}

The pipeline constructor accepts guards grouped by stage. Guards within each stage execute in order, and the first guard that blocks content short-circuits the remaining guards in that stage. This ordering matters: place the cheapest, most common checks first to avoid unnecessary work.

import "github.com/lookatitude/beluga-ai/guard"
pipeline := guard.NewPipeline(
guard.Input(
guard.NewPromptInjectionDetector(),
guard.NewContentFilter(),
),
guard.Output(
guard.NewPIIRedactor(guard.DefaultPIIPatterns...),
),
guard.Tool(
guard.NewToolValidator(),
),
)
// Validate user input
result, err := pipeline.ValidateInput(ctx, "Tell me about security")
if err != nil {
log.Fatal(err)
}
if !result.Allowed {
fmt.Printf("Blocked: %s (by %s)\n", result.Reason, result.GuardName)
return
}
// Validate model output before sending to user
result, err = pipeline.ValidateOutput(ctx, modelResponse)
if result.Modified != "" {
modelResponse = result.Modified // Use sanitized version
}
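
Conceptually, each stage is an ordered walk over its guards that stops at the first block. The loop below is an illustrative sketch of that behavior built on the Guard interface above, not the pipeline's actual implementation; in particular, whether a Modified value is fed into later guards is an assumption.

// runStage sketches the in-order, short-circuiting evaluation described above.
// Illustration only: the real guard.Pipeline may differ in details.
func runStage(ctx context.Context, guards []guard.Guard, content string) (guard.GuardResult, error) {
    current := content
    for _, g := range guards {
        result, err := g.Validate(ctx, guard.GuardInput{Content: current})
        if err != nil || !result.Allowed {
            return result, err // the first blocking guard (or error) short-circuits the rest
        }
        if result.Modified != "" {
            current = result.Modified // assumption: sanitized content flows to later guards
        }
    }
    return guard.GuardResult{Allowed: true, Modified: current}, nil
}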

Prompt injection is the most common attack vector against LLM applications. Attackers embed instructions in user input that attempt to override the system prompt, causing the model to ignore its original instructions. The injection detector analyzes input for known patterns and returns a confidence score. Content above the threshold is blocked.

injectionGuard := guard.NewPromptInjectionDetector(
    guard.WithInjectionThreshold(0.7),
)

result, err := injectionGuard.Validate(ctx, guard.GuardInput{
    Content: userMessage,
    Role:    "input",
})
if err != nil {
    log.Fatal(err)
}
if !result.Allowed {
    // Prompt injection detected
}

Models may inadvertently include PII in their responses, either by echoing user-provided data or generating plausible-looking personal information. The PII redactor scans output for patterns matching common PII types and replaces them with redaction markers.

Pattern ordering matters in PII detection. Credit card patterns are evaluated before phone number patterns because a 16-digit credit card number could partially match a phone number regex, leading to incomplete redaction. By checking credit cards first, the more specific pattern matches before the broader one has a chance to produce a false partial match.

piiGuard := guard.NewPIIRedactor(
    guard.PIIPatternEmail,
    guard.PIIPatternCreditCard, // credit card before phone: the more specific pattern is checked first
    guard.PIIPatternPhone,
    guard.PIIPatternSSN,
)

result, err := piiGuard.Validate(ctx, guard.GuardInput{
    Content: "Contact john@example.com at 555-123-4567",
    Role:    "output",
})
if err != nil {
    log.Fatal(err)
}
fmt.Println(result.Modified)
// Output: "Contact [EMAIL REDACTED] at [PHONE REDACTED]"

The content filter checks for harmful or inappropriate content across configurable categories. It uses a scoring model to evaluate content and blocks responses that exceed the threshold for any category.

contentGuard := guard.NewContentFilter(
    guard.WithCategories("hate", "violence", "self-harm"),
    guard.WithThreshold(0.8),
)

Indirect prompt injection occurs when untrusted external content (retrieved documents, tool outputs, user-provided files) contains embedded instructions. Spotlighting wraps untrusted content in explicit delimiters so the model can distinguish between system instructions and external data, reducing the effectiveness of injection attempts embedded in retrieved content.

spotlight := guard.NewSpotlighter(guard.SpotlightConfig{
    Delimiter: "<<<UNTRUSTED>>>",
})
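
How the spotlighter is invoked is not shown on this page. The sketch below assumes it implements the same Guard interface and returns the delimiter-wrapped text in result.Modified; both of those details, and the retrievedDoc placeholder, are assumptions for illustration only.

// Sketch only: assumes Spotlighter satisfies the Guard interface.
result, err := spotlight.Validate(ctx, guard.GuardInput{
    Content: retrievedDoc, // untrusted text from a retriever or a tool result (placeholder)
    Role:    "input",
})
if err != nil {
    log.Fatal(err)
}
fmt.Println(result.Modified) // retrievedDoc bracketed by <<<UNTRUSTED>>> markers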

Guards follow the same interface as every other extensible component in Beluga: implement the interface, and the pipeline accepts it alongside built-in guards. This makes it straightforward to add domain-specific validation rules, compliance checks, or business logic without modifying the framework.

type ComplianceGuard struct {
    bannedTopics []string
}

func (g *ComplianceGuard) Name() string { return "compliance" }

func (g *ComplianceGuard) Validate(ctx context.Context, input guard.GuardInput) (guard.GuardResult, error) {
    for _, topic := range g.bannedTopics {
        if strings.Contains(strings.ToLower(input.Content), topic) {
            return guard.GuardResult{
                Allowed:   false,
                Reason:    fmt.Sprintf("Content discusses banned topic: %s", topic),
                GuardName: g.Name(),
            }, nil
        }
    }
    return guard.GuardResult{Allowed: true, GuardName: g.Name()}, nil
}

// Use in pipeline
pipeline := guard.NewPipeline(
    guard.Output(&ComplianceGuard{
        bannedTopics: []string{"competitor pricing", "internal roadmap"},
    }),
)

Not every safety decision can be automated. The hitl package manages approval workflows for actions where human judgment is required, such as deleting production data, sending external communications, or executing financial transactions. The system evaluates each action against configurable policies to determine whether it can be auto-approved or requires explicit human confirmation.

The HITL manager coordinates approval requests, manages timeouts for pending decisions, and notifies reviewers through configurable channels.

import "github.com/lookatitude/beluga-ai/hitl"
mgr := hitl.NewManager(
hitl.WithTimeout(30 * time.Second),
hitl.WithNotifier(hitl.NewLogNotifier(slog.Default())),
)

Policies determine which actions need human approval based on the tool being called, the model’s confidence score, and the risk level of the operation. Policies are evaluated in order, and the first matching policy wins. This means you should order policies from most specific to most general, placing your auto-approve rules for safe operations before the catch-all rules for dangerous ones.

// Auto-approve read-only operations with high confidence
mgr.AddPolicy(hitl.ApprovalPolicy{
Name: "read-only-auto",
ToolPattern: "get_*",
MinConfidence: 0.5,
MaxRiskLevel: hitl.RiskReadOnly,
})
// Auto-approve data modifications with very high confidence
mgr.AddPolicy(hitl.ApprovalPolicy{
Name: "write-auto",
ToolPattern: "update_*",
MinConfidence: 0.9,
MaxRiskLevel: hitl.RiskDataModification,
})
// Always require approval for destructive operations
mgr.AddPolicy(hitl.ApprovalPolicy{
Name: "delete-manual",
ToolPattern: "delete_*",
RequireExplicit: true,
})

Risk levels form a hierarchy that maps to increasingly cautious approval thresholds. Lower-risk operations can be auto-approved with lower confidence, while higher-risk operations demand either very high confidence or explicit human approval.

Level                | Value               | Description
---------------------|---------------------|------------
RiskReadOnly         | "read_only"         | Read-only operations, minimal risk
RiskDataModification | "data_modification" | Modifies data, moderate risk
RiskIrreversible     | "irreversible"      | Cannot be undone, highest risk

Before requesting human review, the manager checks whether an action matches any auto-approve policy. This avoids unnecessary human interruptions for safe, high-confidence operations.

autoApproved, err := mgr.ShouldApprove(ctx,
    "get_user_profile", // Tool name
    0.95,               // Model confidence
    hitl.RiskReadOnly,  // Risk level
)
if err != nil {
    log.Fatal(err)
}
if autoApproved {
    // Execute directly
} else {
    // Request human approval
}

When auto-approval is not granted, the manager creates an interaction request that is delivered to a human reviewer. The reviewer can approve, reject, or modify the proposed action. A timeout ensures that pending requests do not block the system indefinitely.

resp, err := mgr.RequestInteraction(ctx, hitl.InteractionRequest{
    Type:        hitl.TypeApproval,
    ToolName:    "delete_account",
    Description: "Delete user account #12345",
    Input:       map[string]any{"user_id": "12345"},
    RiskLevel:   hitl.RiskIrreversible,
    Confidence:  0.85,
    Timeout:     60 * time.Second,
})
if err != nil {
    log.Fatal(err)
}

switch resp.Decision {
case hitl.DecisionApprove:
    // Proceed with deletion
case hitl.DecisionReject:
    // Cancel the operation
    fmt.Printf("Rejected: %s\n", resp.Feedback)
case hitl.DecisionModify:
    // Use modified inputs
    fmt.Printf("Modified: %v\n", resp.Modified)
}

The HITL system supports multiple interaction types beyond simple approval gates, enabling workflows where the human reviewer provides feedback, additional information, or data annotations.

Type           | Purpose
---------------|--------
TypeApproval   | Yes/no/modify decision on an action
TypeFeedback   | Request feedback on a result
TypeInput      | Request additional information
TypeAnnotation | Request data annotation
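
For instance, an agent that is missing a required detail could pause and ask the reviewer for it using the same RequestInteraction call shown above. The sketch below is hedged: this page only documents Feedback on rejections, so reading the reviewer's answer from that field is an assumption.

resp, err := mgr.RequestInteraction(ctx, hitl.InteractionRequest{
    Type:        hitl.TypeInput,
    Description: "Which billing region should the quote use?",
    Timeout:     5 * time.Minute,
})
if err != nil {
    log.Fatal(err)
}
// Assumption: the reviewer's answer is surfaced on the response via Feedback;
// the actual field in the hitl package may differ.
fmt.Println("Reviewer provided:", resp.Feedback)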

Guards integrate with agents through the hooks system. By attaching guard validation to the OnStart hook, every user message is validated before it reaches the LLM. This approach follows the middleware composition pattern used throughout Beluga: guards are applied as wrappers around existing behavior, not as modifications to the agent itself.

// Create a guarded agent pipeline
pipeline := guard.NewPipeline(
guard.Input(guard.NewPromptInjectionDetector()),
guard.Output(guard.NewPIIRedactor(guard.DefaultPIIPatterns...)),
)
a := agent.New("safe-assistant",
agent.WithLLM(model),
agent.WithHooks(agent.Hooks{
OnStart: func(ctx context.Context, input any) error {
result, err := pipeline.ValidateInput(ctx, input.(string))
if err != nil {
return err
}
if !result.Allowed {
return fmt.Errorf("input blocked: %s", result.Reason)
}
return nil
},
}),
)

The HITL system integrates naturally with the tool system through wrapper tools. The ApprovedTool pattern wraps any existing tool with an approval check, transparently adding human oversight without modifying the original tool implementation. This keeps safety concerns separated from business logic.

type ApprovedTool struct {
    inner   tool.Tool
    manager hitl.Manager
}

// Only Execute is shown here; the remaining tool.Tool methods would delegate to t.inner.
func (t *ApprovedTool) Execute(ctx context.Context, input map[string]any) (*tool.Result, error) {
    autoApproved, err := t.manager.ShouldApprove(ctx, t.inner.Name(), 0.8, hitl.RiskDataModification)
    if err != nil {
        return nil, err
    }
    if !autoApproved {
        resp, err := t.manager.RequestInteraction(ctx, hitl.InteractionRequest{
            Type:        hitl.TypeApproval,
            ToolName:    t.inner.Name(),
            Description: fmt.Sprintf("Execute %s", t.inner.Name()),
            Input:       input,
            RiskLevel:   hitl.RiskDataModification,
        })
        if err != nil {
            return nil, err
        }
        if resp.Decision != hitl.DecisionApprove {
            return tool.ErrorResult(fmt.Errorf("action rejected: %s", resp.Feedback)), nil
        }
    }
    return t.inner.Execute(ctx, input)
}
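
Using the wrapper is then the same as using the original tool. In the sketch below, deleteAccountTool is an illustrative stand-in for any existing tool.Tool value:

// Illustrative usage: wrap an existing tool and call it as before.
guarded := &ApprovedTool{inner: deleteAccountTool, manager: mgr}

// The approval check happens transparently inside Execute.
res, err := guarded.Execute(ctx, map[string]any{"user_id": "12345"})
if err != nil {
    log.Fatal(err)
}
fmt.Println(res)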