Workflow Durability

Long-running agents in production must survive process restarts. Beluga’s workflow/ package provides durable execution by recording every step to an event log so a redeploy, pod restart, or machine failure does not lose progress.

The model

graph TD
  subgraph Workflow[Workflow: deterministic orchestration]
    WF[Plan → Act → Observe loop]
  end
  subgraph Activities[Activities: non-deterministic]
    L[LLM call]
    T[Tool.Execute]
    R[Retrieval]
  end
  Workflow --> Activities
  Activities -.results.-> Workflow

A workflow is a deterministic function whose every observable side effect — LLM call, tool invocation, sleep, signal — is recorded to the workflow store. On replay, completed steps return their cached result and execution resumes at the first step that had not yet completed.

This is the same model used by Temporal, Cadence, Inngest, and Dapr Workflows. Beluga abstracts over them so you write the workflow once and choose the backend per deployment.

Crash recovery

sequenceDiagram
  participant W as Worker
  participant L as Event log
  participant E as External service
  W->>L: start workflow
  W->>E: activity 1
  E-->>W: result
  W->>L: append result
  W->>E: activity 2
  E-->>W: result
  W->>L: append result
  Note over W: crash (OOM, pod kill, SIGTERM)
  W->>L: replay from log
  L-->>W: activities 1 and 2 results
  W->>E: activity 3 (first time)

On recovery, the worker replays the workflow from the start. For each already-completed activity, it reads the result from the event log instead of re-calling the service. When it reaches the first unfinished activity, it resumes from there. This is how an agent can survive a 10-hour workflow with a mid-run crash.

Backends

Backend	When to use
`temporal`	Production at scale; you already run Temporal or want its operational tooling
`inngest`	You want a managed event-driven runtime with no infrastructure
`dapr`	You run Dapr in your cluster and want to share its state store
`nats`	NATS JetStream is your existing message bus
`kafka`	Kafka is your existing event log; you need at-least-once semantics
`inmemory`	Tests and local development

Wiring a workflow store

import (
    "github.com/lookatitude/beluga-ai/workflow"
    _ "github.com/lookatitude/beluga-ai/workflow/providers/temporal"
)

store, err := workflow.New("temporal", workflow.Config{
    Endpoint:  "temporal:7233",
    Namespace: "agents",
})
if err != nil {
    panic(err)
}

Resuming an interrupted run

// runID is the deterministic identifier for the workflow instance.
// On a fresh process, this picks up from the last durable checkpoint.
wf, err := store.Resume(ctx, runID)
if err != nil {
    panic(err)
}

for event, err := range wf.Events(ctx) {
    if err != nil {
        panic(err)
    }
    handle(event)
}

Agent loop as activities

Each step of the Plan → Act → Observe loop is a durable activity. Crashes between steps are transparent — the loop resumes from the last recorded activity.

graph LR
  subgraph WF[Durable Workflow]
    P[Plan activity] --> A[Action activity]
    A --> O[Observe activity]
    O --> R{Continue?}
    R -->|yes| P
    R -->|no| End[Finish]
  end

What you do not need to do

You do not need to implement application-level checkpointing. You do not need to design idempotent step handlers — Beluga’s executor records the result of each step before returning to the workflow function. You do not need to choose between durable and non-durable agents at design time — the same Agent interface works in both modes.

Determinism rules

Workflow functions must be deterministic. That means:

No reads of time.Now(), rand, or other non-deterministic system calls inside the workflow body. Use workflow.Now(ctx) and workflow.Random(ctx) instead.
No direct network or file I/O. Wrap external calls as workflow activities so they record to the event log.
No goroutines spawned by your workflow code. Use workflow.Go(ctx, fn) for parallel branches.

These rules are enforced at runtime by the executor — violations panic on the first replay.

Resilience — middleware that complements durability
Observability — workflow runs emit gen_ai.workflow.* spans
Architecture · 16 — Durable Workflows — the design rationale

Workflow Durability

Workflow Durability

The model

Crash recovery

Backends

Wiring a workflow store

Resuming an interrupted run

Agent loop as activities

What you do not need to do

Determinism rules

Related