Workflow Durability

Long-running agents in production must survive process restarts. Beluga’s workflow/ package provides durable execution by recording every step to an event log so a redeploy, pod restart, or machine failure does not lose progress.

```mermaid
graph TD
  subgraph Workflow[Workflow: deterministic orchestration]
    WF[Plan → Act → Observe loop]
  end
  subgraph Activities[Activities: non-deterministic]
    L[LLM call]
    T[Tool.Execute]
    R[Retrieval]
  end
  Workflow --> Activities
  Activities -.results.-> Workflow
```

A workflow is a deterministic function whose every observable side effect — LLM call, tool invocation, sleep, signal — is recorded to the workflow store. On replay, completed steps return their cached result and execution resumes at the first step that had not yet completed.

This is the same model used by Temporal, Cadence, Inngest, and Dapr Workflows. Beluga abstracts over them so you write the workflow once and choose the backend per deployment.
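The record-then-replay primitive behind this model can be sketched in a few lines. This is a toy illustration, not Beluga's executor: `step` and `eventLog` are hypothetical names, and a plain map stands in for the durable store.

```go
package main

import "fmt"

// eventLog stands in for the durable workflow store; in production this
// would be Temporal, Kafka, etc.
type eventLog map[string]string

// step returns the recorded result if the step already completed;
// otherwise it executes fn and appends the result before returning.
func step(log eventLog, id string, fn func() string) string {
	if result, done := log[id]; done {
		return result // replay: the side effect is skipped entirely
	}
	result := fn()
	log[id] = result
	return result
}

func main() {
	log := eventLog{}
	calls := 0
	llm := func() string { calls++; return "plan: search the docs" }

	step(log, "plan-1", llm) // first execution: the LLM is called
	step(log, "plan-1", llm) // replay: served from the log
	fmt.Println(calls)       // 1 — the expensive call ran exactly once
}
```

Because every observable side effect goes through a step like this, re-running the workflow function from the top is safe: already-recorded steps are pure lookups.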

```mermaid
sequenceDiagram
  participant W as Worker
  participant L as Event log
  participant E as External service
  W->>L: start workflow
  W->>E: activity 1
  E-->>W: result
  W->>L: append result
  W->>E: activity 2
  E-->>W: result
  W->>L: append result
  Note over W: crash (OOM, pod kill, SIGTERM)
  W->>L: replay from log
  L-->>W: activities 1 and 2 results
  W->>E: activity 3 (first time)
```

On recovery, the worker replays the workflow from the start. For each already-completed activity, it reads the result from the event log instead of re-calling the service. When it reaches the first unfinished activity, real execution resumes from there. This is how a 10-hour agent workflow survives a mid-run crash.
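The crash-and-recover sequence above can be simulated end to end. Everything here is illustrative (a map stands in for the durable log, and `activity` and `runWorkflow` are hypothetical names); the point is that each activity reaches the external service exactly once across both runs.

```go
package main

import "fmt"

// externalCalls records which activity IDs actually reached the
// external service, across "crashes".
var externalCalls []string

// activity replays a completed step from the log, or performs it and
// records the result before returning.
func activity(log map[string]string, id string) string {
	if r, done := log[id]; done {
		return r // replayed: the service is not called again
	}
	externalCalls = append(externalCalls, id) // real external call
	log[id] = "result-" + id
	return log[id]
}

// runWorkflow replays the workflow function from the start.
func runWorkflow(log map[string]string, steps []string) {
	for _, id := range steps {
		activity(log, id)
	}
}

func main() {
	log := map[string]string{} // the durable store: survives the "crash"

	// First run: activities 1 and 2 complete, then the worker dies
	// before reaching activity 3.
	runWorkflow(log, []string{"activity-1", "activity-2"})

	// Recovery: replay from the start. Activities 1 and 2 come from
	// the log; only activity 3 is new work.
	runWorkflow(log, []string{"activity-1", "activity-2", "activity-3"})

	fmt.Println(externalCalls) // [activity-1 activity-2 activity-3]
}
```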

| Backend | When to use |
| --- | --- |
| `temporal` | Production at scale; you already run Temporal or want its operational tooling |
| `inngest` | You want a managed event-driven runtime with no infrastructure |
| `dapr` | You run Dapr in your cluster and want to share its state store |
| `nats` | NATS JetStream is your existing message bus |
| `kafka` | Kafka is your existing event log; you need at-least-once semantics |
| `inmemory` | Tests and local development |
Resuming a durable workflow after a process restart, using the Temporal backend:

```go
import (
	"github.com/lookatitude/beluga-ai/workflow"
	_ "github.com/lookatitude/beluga-ai/workflow/providers/temporal"
)

store, err := workflow.New("temporal", workflow.Config{
	Endpoint:  "temporal:7233",
	Namespace: "agents",
})
if err != nil {
	panic(err)
}

// runID is the deterministic identifier for the workflow instance.
// On a fresh process, this picks up from the last durable checkpoint.
wf, err := store.Resume(ctx, runID)
if err != nil {
	panic(err)
}

for event, err := range wf.Events(ctx) {
	if err != nil {
		panic(err)
	}
	handle(event)
}
```

Each step of the Plan → Act → Observe loop is a durable activity. Crashes between steps are transparent — the loop resumes from the last recorded activity.

```mermaid
graph LR
  subgraph WF[Durable Workflow]
    P[Plan activity] --> A[Action activity]
    A --> O[Observe activity]
    O --> R{Continue?}
    R -->|yes| P
    R -->|no| End[Finish]
  end
```
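A toy version of this loop, with each turn's steps memoized against a map-backed log. The iteration number is part of every step ID so each turn records its own entries; `durable` and `runLoop` are illustrative names, not Beluga's API.

```go
package main

import "fmt"

var executed int // how many activities actually ran (vs. replayed)

// durable memoizes one activity against the event log.
func durable(log map[string]string, id string, fn func() string) string {
	if r, ok := log[id]; ok {
		return r
	}
	executed++
	log[id] = fn()
	return log[id]
}

// runLoop is a toy Plan → Act → Observe loop replayed from the start.
func runLoop(log map[string]string, turns int) {
	for i := 0; i < turns; i++ {
		plan := durable(log, fmt.Sprintf("plan-%d", i), func() string { return "plan" })
		act := durable(log, fmt.Sprintf("act-%d", i), func() string { return plan + "/act" })
		durable(log, fmt.Sprintf("observe-%d", i), func() string { return act + "/observe" })
	}
}

func main() {
	log := map[string]string{}
	runLoop(log, 2)       // two full turns complete, then the process dies
	fmt.Println(executed) // 6

	runLoop(log, 3)       // recovery: turns 0 and 1 replay, turn 2 is new
	fmt.Println(executed) // 9
}
```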

You do not need to implement application-level checkpointing. You do not need to design idempotent step handlers — Beluga’s executor records the result of each step before returning to the workflow function. You do not need to choose between durable and non-durable agents at design time — the same Agent interface works in both modes.

Workflow functions must be deterministic. That means:

  • No reads of time.Now(), rand, or other non-deterministic system calls inside the workflow body. Use workflow.Now(ctx) and workflow.Random(ctx) instead.
  • No direct network or file I/O. Wrap external calls as workflow activities so they record to the event log.
  • No goroutines spawned by your workflow code. Use workflow.Go(ctx, fn) for parallel branches.

These rules are enforced at runtime by the executor — violations panic on the first replay.