Every team that ships an agentic system eventually asks the same question, usually around month three: "do we need Temporal / Inngest / Restate, or can we just add retries?"
The answer is almost always "you need a durable orchestrator, but not because of retries."
What retries actually solve
Retries are the right tool when a transient failure is observed by the same process that issued the call. Network blip, rate-limit response, a flaky downstream — the caller has the failed result in hand, knows what it was trying to do, and can simply try again.
For this you need: a retry policy with exponential backoff, a max attempts cap, and idempotency on the receiver. That's a library, not a system. p-retry, tenacity, tokio-retry — pick one and move on.
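To make the shape of that library concrete, here is a minimal sketch of the caller side in Python: exponential backoff with jitter and a cap on attempts. The names `retry` and `TransientError` and the default delays are illustrative assumptions, not taken from any of the libraries above, and the receiver still has to be idempotent for any of this to be safe.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429s, 503s)."""


def retry(fn, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: surface the last failure to the caller
            # Exponential backoff with full jitter: wait a random time in
            # [0, min(max_delay, base_delay * 2^(attempt - 1))].
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Note that everything here lives in the memory of one process. If that process dies during a `sleep`, the attempt counter and the knowledge of what was being retried die with it.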
What durable execution actually solves
Durable execution is the right tool when the process itself may die between steps: a crash, a deploy, Kubernetes evicting the pod, or a 30-minute LLM call that outlives your request timeout. The question is no longer "can we try again?" but "where were we when the lights went out, and how do we resume without redoing the parts that already succeeded?"
This is a fundamentally different problem. It needs:
- Persistent step state: a log of which steps completed, with their results.
- Replayable history: when the process resumes, it walks the log, returns the recorded results for finished steps instead of re-running them, and continues from the first unfinished one.
- Deterministic step boundaries: code structured so the orchestrator can match each step to its log entry on every replay, with nondeterministic work (LLM calls, clocks, random values) kept inside the steps.
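The three requirements above can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not how Temporal, Inngest, or Restate work internally: `Journal`, `workflow`, `call_llm`, and `run_tool` are hypothetical names, step results are assumed to be JSON-serializable, and the sketch ignores torn writes, concurrent workers, and timeouts.

```python
import json
import os


class Journal:
    """Append-only log of completed steps, persisted as JSON lines."""

    def __init__(self, path):
        self.path = path
        self.done = {}
        # Replay: load every step that finished before the last shutdown.
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.done[entry["step"]] = entry["result"]

    def step(self, name, fn):
        """Run fn once per step name; on replay, return the recorded result."""
        if name in self.done:
            return self.done[name]
        result = fn()
        # Persist the result before moving on to the next step.
        with open(self.path, "a") as f:
            f.write(json.dumps({"step": name, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.done[name] = result
        return result


def workflow(journal, call_llm, run_tool):
    # Stable step names are the deterministic boundaries: the same code
    # path must produce the same names, in the same order, on every replay.
    plan = journal.step("plan", lambda: call_llm("draft a plan"))
    data = journal.step("fetch", lambda: run_tool(plan))
    return journal.step("summarize", lambda: call_llm("summarize: " + data))
```

Running `workflow(Journal(path), ...)` again after a crash reuses the recorded `plan` result and resumes at the first step with no log entry. A crash after `fn()` returns but before the log write still re-runs that step, which is why steps need to be idempotent even with a journal in place.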
You can write this from scratch. People do. It's six months of debugging race conditions before you realize you've built a worse version of Temporal.
The actual decision
If your agentic system has any of the following:
- Tool calls that take more than a few seconds individually,
- Workflows with more than ~3 sequential steps,
- A handoff or human-in-the-loop somewhere in the middle,
- SLAs that can't tolerate a deploy restart causing a user-visible failure,
…you need a durable orchestrator. Not "you might want one for nice-to-haves." You need one, and adding it later will be much more expensive than starting with it.
If your system is short, synchronous, and stateless (one prompt, one response, done in 5 seconds), retries are fine. Don't over-engineer.
What we ship
In practice, our default is Temporal or Restate, depending on the language stack, with workflow code structured so that each LLM call, tool invocation, and approval is a separate step. The cost is a learning curve and a runtime to operate. The benefit is that "what happens when this process crashes mid-workflow?" stops being a question we have to answer for each new feature: the orchestrator answers it once.
Retries solve "this call failed." Durable execution solves "this workflow needs to survive across failures of the thing running it." They're not alternatives — most production systems need both, and it's the second one that people underestimate.