Skip to content
Docs

Design Concepts — Beluga AI

Beluga AI v2 is a Go-native framework for building production agentic AI systems. It targets teams building enterprise applications that require extensibility, observability, type safety, and performance — without sacrificing developer ergonomics.

The framework draws from production patterns in Google ADK, OpenAI Agents SDK, LangGraph, ByteDance Eino, and LiveKit, unifying them into a single coherent Go library with streaming-first design, protocol interoperability (MCP + A2A), and pluggable everything. Each of these frameworks solves a subset of the agentic AI problem well; Beluga’s contribution is combining their best ideas into a single system where the patterns are consistent and the boundaries are clean.

These principles are not aspirational guidelines — they are enforced constraints that shape every API in the framework. Each one addresses a specific failure mode observed in production agentic systems.

Every component produces results as iter.Seq2[T, error] — Go 1.23+ push-based iterators. Synchronous Invoke() is a convenience wrapper that collects the stream. This ensures low time-to-first-token for LLM responses, real-time event propagation through agent pipelines, and natural backpressure without goroutine overhead. The key insight is that request/response is a degenerate case of streaming (collect all events and return the last one), while the reverse is not true — retrofitting streaming onto a request/response API requires fundamental architectural changes.

Every extensible component is defined by a Go interface with 1-4 methods. Implementations are concrete structs returned by factory functions. This enables compile-time verification, easy mocking for tests, and ad-hoc polymorphism through type assertions for optional capabilities. The 1-4 method constraint is the key discipline: it forces interface designers to identify the minimal contract that captures the abstraction, rather than the maximal set of operations a component might support. When an interface needs growth, the answer is type assertions for optional capabilities, not expanding the core contract that every provider must implement.

Every extensible package (LLM, embedding, vectorstore, STT, TTS, etc.) uses the same pattern: a global registry with Register(), New(), and List(). Providers self-register in init(). Users import providers with blank identifiers. This is the standard Go pattern used by database/sql, image, and Terraform. The consistency across all 19 registries means that understanding one teaches you the pattern for all of them — the same three functions, the same import mechanism, the same discovery API.

Go has no inheritance. Beluga uses struct embedding for code reuse (BaseAgent provides defaults) and interface composition for capability extension. Middleware (func(T) T) wraps interfaces for cross-cutting concerns. Hooks provide lifecycle interception without wrapping.

Every public function takes context.Context as its first parameter. Cancellation propagates through the entire call chain — from HTTP handler to LLM streaming to tool execution. Timeouts, tracing, and tenant isolation all flow through context.

All errors are core.Error with an operation name, error code, human-readable message, and wrapped cause. Error codes (rate_limit, timeout, provider_unavailable) carry retry semantics via IsRetryable(). Provider-specific errors are mapped to framework error codes at the boundary.

7. Zero External Dependencies in Foundation

Section titled “7. Zero External Dependencies in Foundation”

core/ and schema/ depend only on the Go standard library and OpenTelemetry. This guarantees stability, fast compilation, and no transitive dependency conflicts for users who only need the core types. This constraint is especially important in a framework with 100+ providers: if the foundation types pulled in any provider SDK, every user would inherit those dependencies. By keeping the foundation clean, provider dependencies stay isolated in their own packages.

LLMs, embedders, vector stores, memory stores, voice providers, workflow engines, document loaders, guardrails, observability exporters — all are swappable providers behind interfaces. Adding a new provider means implementing an interface and calling Register() in init().

OpenTelemetry is integrated at the framework level using gen_ai.* semantic conventions. Tracing, metrics, and structured logging (via slog) are built in, not bolted on. Every LLM call, tool execution, and agent event emits spans and metrics automatically.

Agents can expose and consume capabilities via MCP (Model Context Protocol) and A2A (Agent-to-Agent protocol). MCP uses Streamable HTTP transport. A2A uses protobuf-generated types with JSON-RPC and gRPC bindings. Agents are not locked into a single communication pattern.

The architecture is organized into four layers with strict dependency rules. The layering ensures that foundation types are stable and dependency-free, capability packages add AI primitives on top, infrastructure provides cross-cutting concerns, and protocol packages handle external communication. Dependencies flow strictly downward — no package ever imports from a layer above it.

graph TB
    subgraph foundation [Foundation Layer]
        core["core/<br/>Stream, Runnable, Lifecycle, Errors"]
        schemaP["schema/<br/>Message, ContentPart, Event, Document"]
        configP["config/<br/>Load, Validate, Watch"]
        o11yP["o11y/<br/>Tracer, Meter, Logger, Health"]
    end

    subgraph capability [Capability Layer]
        llmP["llm/<br/>ChatModel, Router, StructuredOutput"]
        toolP["tool/<br/>Tool, FuncTool, MCP Client"]
        memoryP["memory/<br/>Core, Recall, Archival, Graph"]
        ragP["rag/<br/>Embedding, VectorStore, Retriever"]
        agentP["agent/<br/>Agent, Planner, Executor, Handoffs"]
        voiceP["voice/<br/>STT, TTS, S2S, VAD, Transport"]
    end

    subgraph infra [Infrastructure Layer]
        guardP["guard/<br/>Input, Output, Tool Guards"]
        resilienceP["resilience/<br/>Retry, CircuitBreaker, Hedge"]
        cacheP["cache/<br/>Exact, Semantic"]
        authP["auth/<br/>RBAC, ABAC, Capabilities"]
        hitlP["hitl/<br/>Approval, Feedback"]
        workflowP["workflow/<br/>DurableExecutor, Activities"]
        evalP["eval/<br/>Metrics, Runner, Datasets"]
        stateP["state/<br/>Shared State, Watch"]
        promptP["prompt/<br/>Templates, Builder, Versioning"]
    end

    subgraph protocol [Protocol Layer]
        mcpP["protocol/mcp/<br/>Server, Client"]
        a2aP["protocol/a2a/<br/>Server, Client, AgentCard"]
        restP["protocol/rest/<br/>REST, SSE"]
        serverP["server/<br/>Gin, Chi, Echo, Fiber, gRPC"]
    end

    foundation --> capability
    capability --> infra
    capability --> protocol
    infra --> protocol

Dependency flow is strictly downward: Foundation has zero knowledge of upper layers. Capability packages import foundation. Infrastructure and protocol packages import both.

Each decision below was made after evaluating alternatives and studying how other frameworks handle the same problem. The “Rationale” sections explain not just what was chosen, but why the alternatives were rejected.

Decision 1: iter.Seq2[T, error] for Streaming

Section titled “Decision 1: iter.Seq2[T, error] for Streaming”

Choice: Go 1.23+ range-over-func iterators, not channels.

Rationale:

  • No goroutine per stream — zero scheduling overhead
  • Natural backpressure — yield returning false stops the producer immediately
  • Composable — MapStream, FilterStream, MergeStreams, FanOut compose without allocations
  • Standard Go — uses range for consumption, not custom .Next()/.Err() patterns
  • iter.Pull2() available when pull semantics are needed (rare)

Where channels are OK: Internal goroutine communication (e.g., voice frame processors between pipeline stages). Public API boundaries always use iter.Seq2.

graph LR
    producer["Producer<br/>iter.Seq2[T, error]"] -->|yield| consumer["Consumer<br/>for val, err := range ..."]
    consumer -->|false| producer
    note["yield returns false = stop"]

Choice: Agent-to-agent transfers are auto-generated transfer_to_{id} tools.

Rationale:

  • The LLM decides when to hand off — it sees handoffs in its tool list
  • No special routing logic needed — the executor handles tool calls uniformly
  • Dynamic availability — IsEnabled can disable handoffs at runtime
  • Input filtering — InputFilter controls what context passes to the target agent
  • Validated by OpenAI Agents SDK and Google ADK patterns
sequenceDiagram
    participant User
    participant AgentA as Agent A
    participant LLM
    participant AgentB as Agent B

    User->>AgentA: "Book a flight"
    AgentA->>LLM: Generate(msgs + tools including transfer_to_agent_b)
    LLM-->>AgentA: ToolCall: transfer_to_agent_b("Book flight to NYC")
    AgentA->>AgentB: Invoke("Book flight to NYC")
    AgentB-->>AgentA: "Flight booked: AA123"
    AgentA->>LLM: Generate(msgs + tool_result)
    LLM-->>AgentA: "Your flight AA123 is booked."
    AgentA-->>User: "Your flight AA123 is booked."

Choice: Separate Planner interface from Executor. The executor loop is planner-agnostic.

Rationale:

  • Different reasoning strategies (ReAct, Reflexion, Plan-and-Execute, GoT) have different strengths
  • The executor loop (receive actions, execute, observe, replan) is universal
  • Planners register via RegisterPlanner() — same registry pattern as providers
  • Custom planners can be added without modifying the executor

Built-in planners: ReAct (default), Reflexion, Plan-and-Execute, Structured, Conversational.

Decision 4: Three-Tier Memory (MemGPT Model)

Section titled “Decision 4: Three-Tier Memory (MemGPT Model)”

Choice: Separate memory into Core (always in context), Recall (searchable history), and Archival (vector + graph).

Rationale:

  • Core memory provides persistent persona and user context without retrieval latency
  • Recall memory enables efficient search over conversation history
  • Archival memory supports large-scale knowledge bases via vector similarity and graph traversal
  • CompositeMemory combines all tiers transparently

Choice: Guards run at three stages: input, output, and tool execution.

Rationale:

  • Input guards catch prompt injection, PII, and policy violations before LLM processing
  • Output guards validate LLM responses for toxicity, hallucination, and compliance
  • Tool guards enforce authorization and validation before side effects
  • Validated by OpenAI Agents SDK guardrail architecture
graph LR
    input["User Input"] --> inputGuards["Input Guards<br/>PII, Injection, Policy"]
    inputGuards --> llm["LLM Generate"]
    llm --> outputGuards["Output Guards<br/>Toxicity, Compliance"]
    outputGuards --> toolGuards["Tool Guards<br/>Auth, Validation"]
    toolGuards --> toolExec["Tool Execute"]
    toolExec --> response["Response"]

Choice: Beluga provides its own workflow engine. Temporal is a provider option, not the default.

Rationale:

  • Not all deployments need Temporal’s infrastructure overhead
  • The default in-process engine works for development and simple production use
  • Teams using Temporal, NATS, Kafka, or Dapr can plug in their preferred engine
  • All workflow engines implement the same DurableExecutor interface

Choice: FrameProcessor interface with goroutine-connected stages, not a monolithic pipeline.

Rationale:

  • Each processor (VAD, STT, LLM, TTS) is an independent unit
  • Cascading (STT→LLM→TTS), S2S (native audio), and Hybrid modes compose from the same processors
  • LiveKit/Daily/WebSocket are transports, not framework dependencies
  • Target: <800ms end-to-end latency

Decision 8: Hybrid Search as Default Retrieval

Section titled “Decision 8: Hybrid Search as Default Retrieval”

Choice: Vector + BM25 + RRF (Reciprocal Rank Fusion) as the default retriever.

Rationale:

  • Pure vector search misses keyword-specific queries
  • Pure BM25 misses semantic similarity
  • RRF fusion combines rankings without needing score normalization
  • Default pipeline: BM25 ~200 → Dense retrieval ~100 → RRF fusion (k=60) → Cross-encoder reranker → Top 10

Choice: Every extensible package uses Register() + New() + List() with the same implementation.

Rationale:

  • Consistency — learn the pattern once, apply everywhere
  • Discovery — List() shows what’s available at runtime
  • Testing — register mock providers in tests
  • 19 registries across the framework, all following the same contract

Decision 10: Middleware + Hooks Dual System

Section titled “Decision 10: Middleware + Hooks Dual System”

Choice: Both middleware (func(T) T) and hooks (struct with optional callbacks) in every extensible package.

Rationale:

  • Middleware wraps the entire interface — good for retry, rate-limit, cache, logging, tracing
  • Hooks fire at specific lifecycle points — good for audit, cost tracking, modification, validation
  • Middleware applies outermost (first in chain), hooks fire within execution
  • Both are composable: ApplyMiddleware() and ComposeHooks()
  • 6 packages implement middleware, 11 packages implement hooks

These diagrams show how packages collaborate at runtime. Each example traces a request from user input through the framework layers and back, illustrating how the abstract interfaces connect in practice.

sequenceDiagram
    participant User
    participant Server as REST/SSE Server
    participant Agent
    participant Planner
    participant LLM as ChatModel
    participant Tools as Tool Registry
    participant Memory

    User->>Server: POST /chat
    Server->>Agent: Stream(ctx, input)
    Agent->>Memory: Load(ctx, input)
    Memory-->>Agent: history messages
    Agent->>Planner: Plan(ctx, state)
    Planner->>LLM: Generate(ctx, msgs)
    LLM-->>Planner: AIMessage with ToolCalls
    Planner-->>Agent: [Action{Tool}, Action{Tool}]
    Agent->>Tools: Execute(ctx, input)
    Tools-->>Agent: Result
    Agent->>Planner: Replan(ctx, state + observations)
    Planner->>LLM: Generate(ctx, msgs + tool_results)
    LLM-->>Planner: AIMessage{finish}
    Planner-->>Agent: [Action{Finish}]
    Agent->>Memory: Save(ctx, input, output)
    Agent-->>Server: iter.Seq2[Event, error]
    Server-->>User: SSE stream
sequenceDiagram
    participant User
    participant Router as Router Agent
    participant Sales as Sales Agent
    participant Support as Support Agent
    participant LLM

    User->>Router: "I need help with billing"
    Router->>LLM: Generate(msgs + [transfer_to_sales, transfer_to_support])
    LLM-->>Router: ToolCall: transfer_to_support("billing issue")
    Router->>Support: Invoke("billing issue")
    Support->>LLM: Generate(msgs + support_tools)
    LLM-->>Support: "Let me look up your account..."
    Support-->>Router: result
    Router-->>User: "Let me look up your account..."
graph LR
    transport["Transport<br/>WebSocket/LiveKit"] --> vad["VAD<br/>Silero/Semantic"]
    vad --> stt["STT<br/>Deepgram/Whisper"]
    stt --> llm["LLM<br/>GPT-4/Claude"]
    llm --> tts["TTS<br/>ElevenLabs/Cartesia"]
    tts --> transport

This comparison highlights where Beluga differs from other agentic frameworks. The key differentiators are Go-native streaming with iter.Seq2, the pluggable planner architecture, built-in voice pipeline support, and dual protocol support (MCP + A2A).

FeatureBeluga AI v2Google ADKOpenAI SDKLangGraphEino
LanguageGoPython/Go/JavaPythonPythonGo
Streamingiter.Seq2Events/yieldrun_streamedStateGraphStreamReader
Agent transferHandoffs as toolsAgentToolHandoffsGraph edgesGraph edges
ReasoningPluggable PlannerBuilt-inBuilt-inCustom nodesBuilt-in
VoiceFrame-based pipelineNoNoNoNo
ProtocolsMCP + A2AMCPMCPNoNo
OrchestrationChain/Graph/WorkflowSequential/Parallel/LoopSequentialStateGraphChain/Graph/Workflow
Guardrails3-stage pipelineBuilt-in3-stageCustomAspects
Memory3-tier (MemGPT)Session/StateNoCheckpointsNo
DurabilityOwn engine + providersNoNoCheckpointsNo