PROCESS · MAR 20, 2026 · 7 MIN · Operations team

The runbook that ships with every agent.

Failure modes, escalations, rollback conditions. Boring on purpose.

Every agentic system we ship has a runbook. Six pages, no diagrams beyond a flowchart, written in the tone of a 2010-era SRE doc. It's boring on purpose.

We've come to see the runbook as a primary deliverable, not paperwork. If a runbook can't be written for the agent we just built, the agent isn't ready for production. Here's why, and what's actually in it.

Why the runbook is the spec

A runbook forces you to answer six questions you'd otherwise leave open:

  1. What does "working correctly" look like? Not vibes — a concrete check. "p95 latency under X, eval pass rate over Y, cost per request between A and B."
  2. What does "broken" look like? Specific symptoms with thresholds, not just "users complain."
  3. Who is paged when broken? A name, an escalation chain, a phone number.
  4. What's the playbook for each known failure? Step-by-step, including "do nothing for 5 minutes" if that's the answer.
  5. When do we roll back? Not "if it gets bad." A specific condition that triggers rollback without requiring a meeting.
  6. What state survives, and what doesn't? If we roll back the agent, what happens to in-flight workflows, partial completions, queued requests?

If the team can't answer all six in writing, the agent isn't production-grade yet. The runbook isn't documentation of what's already true — writing it is what makes it true.
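As a rough illustration, the six questions can be treated as a readiness gate: the agent is not production-ready until every answer exists in writing. This is a minimal sketch; the class name, field names, and sample answers are assumptions made for the example, not part of any real tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class RunbookAnswers:
    """One field per question; an empty string means 'not answered in writing yet'."""
    working_check: str = ""       # 1. concrete definition of "working correctly"
    broken_symptoms: str = ""     # 2. specific symptoms with thresholds
    pager_chain: str = ""         # 3. who is paged, and the escalation chain
    failure_playbooks: str = ""   # 4. step-by-step playbook per known failure
    rollback_condition: str = ""  # 5. condition that triggers rollback
    state_on_rollback: str = ""   # 6. what state survives a rollback


def unanswered(answers: RunbookAnswers) -> list[str]:
    """Return the names of the questions that still have no written answer."""
    return [f.name for f in fields(answers) if not getattr(answers, f.name).strip()]


def production_ready(answers: RunbookAnswers) -> bool:
    """Ready only when all six questions are answered."""
    return not unanswered(answers)


# Hypothetical draft with two of the six questions answered.
draft = RunbookAnswers(
    working_check="p95 latency under X, eval pass rate over Y",
    pager_chain="primary on-call, then secondary, then engineering lead",
)
print(unanswered(draft))
print(production_ready(draft))
```

The point of the gate is the list of missing answers, not the boolean: it tells the team which section of the runbook to write next.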

The structure

Every runbook we ship has the same six sections:

1. Service summary

One paragraph. What this agent does, what it depends on, what depends on it. Updated whenever the agent's responsibilities change.

2. Health

The dashboard URL. The four or five metrics that define "healthy." Thresholds for green / yellow / red. No more than five — if you have ten, no one watches them.
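A sketch of how green / yellow / red thresholds might be encoded, assuming each metric has a single yellow and a single red boundary. The metric names and numbers are placeholders, and treating the service as being as healthy as its worst metric is an assumption of this example. A metric with both a lower and an upper bound, such as cost per request, would need a second threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Yellow and red boundaries for one metric (values are placeholders)."""
    yellow: float
    red: float
    higher_is_worse: bool = True  # False for metrics like eval pass rate


def classify(value: float, t: Threshold) -> str:
    """Map a metric reading to 'green', 'yellow', or 'red'."""
    if t.higher_is_worse:
        yellow, red = t.yellow, t.red
    else:
        # Flip signs so the comparisons below always read "bigger is worse".
        value, yellow, red = -value, -t.yellow, -t.red
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"


# Hypothetical thresholds; the real numbers belong in the Health section.
THRESHOLDS = {
    "p95_latency_s": Threshold(yellow=4.0, red=8.0),
    "eval_pass_rate": Threshold(yellow=0.95, red=0.90, higher_is_worse=False),
    "cost_per_request_usd": Threshold(yellow=0.05, red=0.10),
}
assert len(THRESHOLDS) <= 5  # more than five and no one watches them


def overall(readings: dict[str, float]) -> str:
    """Assumption: the service is as healthy as its worst metric."""
    order = ["green", "yellow", "red"]
    states = [classify(readings[name], t) for name, t in THRESHOLDS.items()]
    return max(states, key=order.index)


print(overall({"p95_latency_s": 3.1, "eval_pass_rate": 0.93, "cost_per_request_usd": 0.02}))
```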

3. Common failures

Each failure mode gets its own section: the symptoms and their thresholds, then the step-by-step playbook.

We aim to document 6–10 failure modes. Most operators will only need three of them, but the long tail is what saves you at 3am.
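One possible shape for a failure-mode entry, sketched as a small data structure that renders to plain text. The field names and the example entry are hypothetical, chosen to match the questions a runbook has to answer (symptom with a threshold, ordered steps, when to escalate).

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    """One 'Common failures' entry. Field names are illustrative assumptions."""
    name: str
    symptom: str  # what the operator sees, with a threshold
    steps: list[str] = field(default_factory=list)  # ordered; waiting is a valid step
    escalate_if: str = ""  # when the playbook is exhausted


def render(mode: FailureMode) -> str:
    """Format an entry as plain text for the runbook."""
    lines = [mode.name, f"  Symptom: {mode.symptom}", "  Steps:"]
    lines += [f"    {i}. {step}" for i, step in enumerate(mode.steps, start=1)]
    if mode.escalate_if:
        lines.append(f"  Escalate if: {mode.escalate_if}")
    return "\n".join(lines)


# Hypothetical example entry; the numbers are placeholders.
timeout = FailureMode(
    name="Upstream model timeouts",
    symptom="more than 5% of requests time out over 10 minutes",
    steps=[
        "Do nothing for 5 minutes; transient spikes usually clear.",
        "Check the provider status page.",
        "If still failing, follow the Rollback section.",
    ],
    escalate_if="timeouts persist after rollback",
)
print(render(timeout))
```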

4. Escalation

Who to contact, in what order, for what kinds of issues. Phone numbers, not Slack handles. Include the path for "this is beyond on-call's scope — escalate to engineering."

5. Rollback

The specific conditions under which we roll back, and the procedure. Crucially: who is authorized to decide to roll back without a meeting. Usually the on-call.

The rollback procedure itself should be one command. If it's three steps and a coordination call, it's not a rollback procedure — it's a project plan.
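A rollback condition that needs no meeting is one a machine could evaluate. The sketch below assumes one metric sample per minute and a rule of the form "metric above a limit for N consecutive minutes"; the rule, the metric name, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackRule:
    """A rollback trigger: metric, limit, and how long it must be breached."""
    metric: str
    limit: float
    sustained_minutes: int


def should_roll_back(rule: RollbackRule, samples: list[float]) -> bool:
    """True when the last `sustained_minutes` per-minute samples all exceed the limit."""
    window = samples[-rule.sustained_minutes:]
    return len(window) == rule.sustained_minutes and all(v > rule.limit for v in window)


# Hypothetical rule: error rate above 10% for 5 consecutive minutes.
RULE = RollbackRule(metric="error_rate", limit=0.10, sustained_minutes=5)

print(should_roll_back(RULE, [0.02, 0.12, 0.15, 0.14, 0.13, 0.16]))  # breach sustained
print(should_roll_back(RULE, [0.12, 0.15, 0.03, 0.13, 0.16]))        # one sample recovered
```

Writing the condition this way leaves nothing to argue about during an incident: either the window is breached or it is not.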

6. Change log

Every change since the last runbook review: prompt version, model version, retrieval index version, dependency upgrades. A timeline that lets the on-call see "what changed in the last 24 hours" when something starts misbehaving.
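The "what changed in the last 24 hours" view can be sketched as a filter over a structured change log. The `Change` record, the kind labels, and the entries below are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Change:
    at: datetime  # when the change went live (UTC)
    kind: str     # "prompt", "model", "retrieval_index", or "dependency"
    detail: str


def recent_changes(log: list[Change], now: datetime, hours: int = 24) -> list[Change]:
    """Changes within the last `hours`, newest first."""
    cutoff = now - timedelta(hours=hours)
    in_window = (c for c in log if cutoff <= c.at <= now)
    return sorted(in_window, key=lambda c: c.at, reverse=True)


# Hypothetical entries; versions and timestamps are invented for the example.
now = datetime(2026, 3, 20, 3, 0, tzinfo=timezone.utc)
log = [
    Change(now - timedelta(days=3), "dependency", "HTTP client upgraded"),
    Change(now - timedelta(hours=20), "prompt", "prompt v14 -> v15"),
    Change(now - timedelta(hours=2), "retrieval_index", "index rebuilt"),
]
for change in recent_changes(log, now):
    print(change.at.isoformat(), change.kind, change.detail)
```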

What's not in it

The runbook is not for explaining how the agent works. There's an architecture doc for that. The runbook is for operating it when it stops working.

It's also not for new-hire onboarding. The runbook assumes you know the system; it tells you what to do when it's broken. Don't dilute it with explanatory prose.

What changes when there's a runbook

The team's relationship to the agent changes. With no runbook, every incident is an unknown — engineers improvise, and the post-mortem is mostly invention.

With a runbook, incidents become exceptions to a documented expectation. The post-mortem becomes a runbook update: "this failure wasn't in section 3; here's the new entry." Each incident teaches the runbook.

In our experience, after six months a well-maintained runbook covers roughly 90% of the failures the system actually encounters. The remaining 10% or so are the interesting incidents: the ones that need real engineering, not playbook execution.

The handoff signal

The runbook is also our handoff document. When we leave an engagement, the client's team owns a system and a runbook that lets them operate it without us. We've seen too many handovers go bad because the departing team kept the operating knowledge in their heads.

Boring on purpose. The runbook is what makes the agent boring to operate. And boring is what production-grade looks like.
