ENGINEERING · APR 28, 2026 · 6 MIN · Engineering team

Durable workflows are not retries.

A reasonable mental model for when you actually need a durable orchestrator vs. just better backoff.

Every team that ships an agentic system eventually asks the same question, usually around month three: "do we need Temporal / Inngest / Restate, or can we just add retries?"

The answer is almost always "you need the durable orchestrator, but for a reason that isn't 'retries.'"

What retries actually solve

Retries are the right tool when a transient failure is observed by the same process that issued the call. Network blip, rate-limit response, a flaky downstream — the caller has the failed result in hand, knows what it was trying to do, and can simply try again.

For this you need three things: a retry policy with exponential backoff, a cap on max attempts, and idempotency on the receiver so a repeated request doesn't apply its effect twice. That's a library, not a system. p-retry (JavaScript), tenacity (Python), tokio-retry (Rust): pick one and move on.
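To show how small that library really is, here is a minimal sketch of the pattern in plain Python. The names `retry` and `TransientError` are hypothetical and exist only for this illustration; in practice one of the libraries above would stand in for them.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, ...)."""


def retry(fn, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: surface the failure to the caller
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Note that all of the state (the attempt counter, the closure being retried) lives in the memory of the calling process. That is exactly why this approach stops helping once the process itself goes away.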

What durable execution actually solves

Durable execution is the right tool when the process itself may die between steps. Crash, deploy, k8s evicting the pod, a 30-minute LLM call that exceeds your request timeout — the question isn't "can we try again" but "where were we when the lights went out, and how do we resume without redoing the parts that already succeeded?"

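This is a fundamentally different problem. At minimum, it needs state that outlives the process: a durable record of which steps have completed and what they returned, and a way for a fresh process to pick up from that record.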

You can write this from scratch. People do. It's six months of debugging race conditions before you realize you've built a worse version of Temporal.
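To make "where were we when the lights went out" concrete, here is a toy sketch of the core idea: record each step's result durably before moving on, so a restarted process reads recorded results instead of re-running the step. The `Journal` class and its one-JSON-file-per-run layout are assumptions made for illustration. This is not how any particular orchestrator is implemented, and it ignores concurrency, timers, and versioning, which is where the six months go.

```python
import json
import os
import tempfile


class Journal:
    """Toy durable step log: one JSON file per workflow run (an assumption)."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def step(self, name, fn):
        # Already recorded by an earlier process: return the stored result.
        if name in self.results:
            return self.results[name]
        result = fn()
        self.results[name] = result
        self._flush()
        return result

    def _flush(self):
        # Write to a temp file, then rename, so a crash never leaves a torn journal.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.results, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
```

Even in this sketch, a crash after `fn()` returns but before the journal is flushed means the step runs again on restart. Steps are at-least-once, so idempotency on the receiver still matters.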

The actual decision

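If your agentic system has multi-step workflows that can outlive the process running them (long LLM calls, tool invocations, human approvals), you need a durable orchestrator. Not "you might want one for nice-to-haves." You need one, and adding it later will be much more expensive than starting with it.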

If your system is short, synchronous, and stateless (one prompt, one response, done in 5 seconds), retries are fine. Don't over-engineer.

What we ship

In practice, our default is Temporal or Restate (depending on the language stack), with workflow code structured so each LLM call, tool invocation, and approval is a separate step. The cost is some learning curve and a runtime to operate. The benefit is that "what happens when this process crashes mid-workflow" stops being a question we have to answer for each new feature — the orchestrator answers it once.
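As a rough picture of what "each LLM call, tool invocation, and approval is a separate step" looks like, here is a framework-agnostic sketch. `Context`, `Suspended`, and `refund_workflow` are hypothetical names, and a plain dict stands in for the orchestrator's durable state. Temporal and Restate each have their own APIs for this, and this code is not either one.

```python
class Suspended(Exception):
    """Raised when the workflow must wait for an external event (hypothetical)."""


class Context:
    def __init__(self, store):
        self.store = store  # dict standing in for the orchestrator's durable state

    def step(self, name, fn):
        # Run fn once per workflow run; later executions reuse the stored result.
        if name not in self.store:
            self.store[name] = fn()
        return self.store[name]

    def wait_for(self, name):
        # The approval arrives out of band and is written into the store.
        if name not in self.store:
            raise Suspended(name)
        return self.store[name]


def refund_workflow(ctx, ticket, llm, issue_refund):
    plan = ctx.step("draft_plan", lambda: llm(ticket))           # LLM call
    approved = ctx.wait_for("approval")                          # human approval
    if not approved:
        return "rejected"
    return ctx.step("issue_refund", lambda: issue_refund(plan))  # tool invocation
```

The workflow function can be re-executed from the top after a crash or after an approval lands days later. Completed steps return their stored results, so the LLM is not called a second time.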

Retries solve "this call failed." Durable execution solves "this workflow needs to survive failures of the process running it." They're not alternatives. Most production systems need both, and it's the second one that people underestimate.
