Why your eval suite belongs in CI.

There's a pattern we see in almost every agentic system that's been running for more than six months: a folder called evals/ with 40 scenarios written enthusiastically in week two, and a last_run.json from before the most recent prompt change.

This is not an eval suite. This is a screenshot from a meeting in March.

The two-jobs problem

When an eval suite lives outside CI, it ends up serving two jobs that pull in opposite directions:

Convincing yourself the system works (run once, demo, archive)
Catching regressions before users do (run on every change, gate releases)

The first job rewards thoroughness — write 200 cases covering every edge. The second rewards speed — run in under five minutes or no one will wait.

Outside of CI, the suite tends to drift toward the first job, because the second job has no enforcement. The result is a suite that's impressive to demo and useless in practice.

What "in CI" actually means

Three concrete things, in order of cost:

A subset runs on every PR. Pick 20–40 fast, high-coverage cases. They should finish in under 90 seconds and block the merge if they fail.
The full suite runs nightly against the main branch. Failures page someone — not the on-call rotation, the prompt's owner.
Score deltas are tracked over time. Not just pass/fail. A 4% regression in citation accuracy that doesn't fail a hard gate is still a regression.

The middle one — failures page an owner — is the part teams skip and then wonder why drift compounds.

What we ship with every system

A pnpm eval (or equivalent) target that runs the fast subset locally in under two minutes. A GitHub Action that runs the same thing on PR. A nightly that runs the full set against prod prompts. And one dashboard with two charts: pass rate by tag, and pass rate over time.

That's it. Nothing in this list is novel. The novelty is that it's wired up on day one and stays wired up after we leave.

Why this works

Engineers will not write evals for systems that don't enforce them. The eval suite is not a moral artifact, it's a load-bearing piece of the test pyramid. Put it where the rest of the tests live.

The two-jobs problem

What "in CI" actually means

What we ship with every system

Why this works

Ship the first system.