Workflow Durability
Workflow Durability
Section titled “Workflow Durability”Long-running agents in production must survive process restarts. Beluga’s workflow/ package provides durable execution by recording every step to an event log so a redeploy, pod restart, or machine failure does not lose progress.
The model
Section titled “The model”graph TD
subgraph Workflow[Workflow: deterministic orchestration]
WF[Plan → Act → Observe loop]
end
subgraph Activities[Activities: non-deterministic]
L[LLM call]
T[Tool.Execute]
R[Retrieval]
end
Workflow --> Activities
Activities -.results.-> Workflow
A workflow is a deterministic function whose every observable side effect — LLM call, tool invocation, sleep, signal — is recorded to the workflow store. On replay, completed steps return their cached result and execution resumes at the first step that had not yet completed.
This is the same model used by Temporal, Cadence, Inngest, and Dapr Workflows. Beluga abstracts over them so you write the workflow once and choose the backend per deployment.
Crash recovery
Section titled “Crash recovery”sequenceDiagram participant W as Worker participant L as Event log participant E as External service W->>L: start workflow W->>E: activity 1 E-->>W: result W->>L: append result W->>E: activity 2 E-->>W: result W->>L: append result Note over W: crash (OOM, pod kill, SIGTERM) W->>L: replay from log L-->>W: activities 1 and 2 results W->>E: activity 3 (first time)
On recovery, the worker replays the workflow from the start. For each already-completed activity, it reads the result from the event log instead of re-calling the service. When it reaches the first unfinished activity, it resumes from there. This is how an agent can survive a 10-hour workflow with a mid-run crash.
Backends
Section titled “Backends”| Backend | When to use |
|---|---|
temporal | Production at scale; you already run Temporal or want its operational tooling |
inngest | You want a managed event-driven runtime with no infrastructure |
dapr | You run Dapr in your cluster and want to share its state store |
nats | NATS JetStream is your existing message bus |
kafka | Kafka is your existing event log; you need at-least-once semantics |
inmemory | Tests and local development |
Wiring a workflow store
Section titled “Wiring a workflow store”import ( "github.com/lookatitude/beluga-ai/workflow" _ "github.com/lookatitude/beluga-ai/workflow/providers/temporal")
store, err := workflow.New("temporal", workflow.Config{ Endpoint: "temporal:7233", Namespace: "agents",})if err != nil { panic(err)}Resuming an interrupted run
Section titled “Resuming an interrupted run”// runID is the deterministic identifier for the workflow instance.// On a fresh process, this picks up from the last durable checkpoint.wf, err := store.Resume(ctx, runID)if err != nil { panic(err)}
for event, err := range wf.Events(ctx) { if err != nil { panic(err) } handle(event)}Agent loop as activities
Section titled “Agent loop as activities”Each step of the Plan → Act → Observe loop is a durable activity. Crashes between steps are transparent — the loop resumes from the last recorded activity.
graph LR
subgraph WF[Durable Workflow]
P[Plan activity] --> A[Action activity]
A --> O[Observe activity]
O --> R{Continue?}
R -->|yes| P
R -->|no| End[Finish]
end
What you do not need to do
Section titled “What you do not need to do”You do not need to implement application-level checkpointing. You do not need to design idempotent step handlers — Beluga’s executor records the result of each step before returning to the workflow function. You do not need to choose between durable and non-durable agents at design time — the same Agent interface works in both modes.
Determinism rules
Section titled “Determinism rules”Workflow functions must be deterministic. That means:
- No reads of
time.Now(),rand, or other non-deterministic system calls inside the workflow body. Useworkflow.Now(ctx)andworkflow.Random(ctx)instead. - No direct network or file I/O. Wrap external calls as workflow activities so they record to the event log.
- No goroutines spawned by your workflow code. Use
workflow.Go(ctx, fn)for parallel branches.
These rules are enforced at runtime by the executor — violations panic on the first replay.
Related
Section titled “Related”- Resilience — middleware that complements durability
- Observability — workflow runs emit
gen_ai.workflow.*spans - Architecture · 16 — Durable Workflows — the design rationale