EVALS · APR 02, 2026 · 10 MIN · Research team

How to build a retrieval eval set in a week.

A practical recipe. 200 questions, your real corpus, baselines, and a "do not optimize this" set.

If your RAG system has no eval set, every prompt change is a vibes-based release. The good news: you can build a useful eval set in a week. Not a perfect one — a useful one. Perfect is the trap that keeps teams from having any evals at all.

Here's the recipe we use on engagements where the client has 0–5 evals on day one and needs 200 by end of week.

The shape

You want three things:

  1. A frozen corpus — the actual documents your system retrieves over, snapshotted at one point in time.
  2. 200 questions that an actual user might ask, with ground-truth answers and citations.
  3. A held-out set — 30 of those 200, locked away, never used to tune anything.

The held-out set is the most important and the most often skipped. We'll come back to it.
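One way to carve out the held-out 30 is a deterministic split, so nobody hand-picks which questions get locked away. The sketch below is an illustration only: the function name `split_eval_set`, the `salt` value, and the `q000`-style IDs are assumptions, not part of the recipe.

```python
import hashlib

def split_eval_set(question_ids, held_out_size=30, salt="eval-v1"):
    """Deterministically split question IDs into (visible, held_out)."""
    # Rank IDs by a salted hash so the split is stable across runs and
    # does not depend on the order in which questions were collected.
    def rank(qid):
        return hashlib.sha256(f"{salt}:{qid}".encode("utf-8")).hexdigest()

    ordered = sorted(set(question_ids), key=rank)
    held_out = sorted(ordered[:held_out_size])
    visible = sorted(ordered[held_out_size:])
    return visible, held_out

# Hypothetical IDs for 200 questions.
ids = [f"q{i:03d}" for i in range(200)]
visible, held_out = split_eval_set(ids)
print(len(visible), len(held_out))  # 170 30
```

Run the split once, store the held-out IDs somewhere the tuning loop cannot read, and keep the salt fixed so rerunning the script never reshuffles the sets.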

Day 1–2: Mine real questions

Skip "what would users ask?" workshops. They produce questions no user has ever actually asked. Go to your sources:

Aim for 250 raw questions. Dedupe, normalize phrasing, throw out the obviously-out-of-scope ones. You'll land at ~200.
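A first dedupe pass can be automated. The sketch below assumes the raw questions are plain strings and only catches duplicates that become identical after simple normalization (case, whitespace, trailing punctuation). Paraphrases still need a human pass. The helper names and sample questions are hypothetical.

```python
import re

def normalize(question):
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    text = re.sub(r"\s+", " ", question.strip().lower())
    return text.rstrip("?!. ")

def dedupe(questions):
    """Keep the first occurrence of each normalized question, in order."""
    seen = set()
    unique = []
    for q in questions:
        key = normalize(q)
        if key and key not in seen:
            seen.add(key)
            unique.append(q.strip())
    return unique

# Made-up sample questions for illustration.
raw = [
    "How do I reset my password?",
    "how do I reset my  password",
    "What is the refund window?",
]
print(dedupe(raw))
```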

Day 3: Tag for difficulty and topic

Bucket each question:

You want a mix. A suite that's 90% trivial questions will tell you your system is great, and you'll ship a regression anyway because nothing in the suite flagged that the medium and hard cases broke.
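A quick check on the mix might look like the sketch below. It assumes each question carries one difficulty tag. The bucket names and the 60% ceiling are illustrative assumptions, so pick a threshold that fits your own suite.

```python
from collections import Counter

def difficulty_mix(tags, max_share=0.6):
    """Return each bucket's share and the buckets that exceed max_share."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}, []
    shares = {bucket: n / total for bucket, n in counts.items()}
    dominant = [b for b, s in shares.items() if s > max_share]
    return shares, dominant

# A lopsided suite: 90% trivial questions.
tags = ["trivial"] * 180 + ["medium"] * 15 + ["hard"] * 5
shares, dominant = difficulty_mix(tags)
print(shares["trivial"], dominant)  # 0.9 ['trivial']
```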

Day 4: Ground truth

For each question, you need:

This is the slow part. Plan ~3 minutes per question, which comes to ~10 hours for 200 questions. Split between two people working in parallel, that is about 5 hours each. Use a domain expert.

Worth it. This is the part that turns "the LLM said something plausible" into "the LLM cited the right source and got the right answer."
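It helps to pin down a record shape before the expert starts writing. The sketch below is one possible shape, assuming each item stores the ground-truth answer plus the IDs of the corpus documents that support it. The field names and the `missing_ground_truth` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalItem:
    question_id: str
    question: str
    answer: str = ""        # ground-truth answer written by the expert
    citations: tuple = ()   # IDs of supporting documents in the frozen corpus

def missing_ground_truth(items):
    """Return IDs of items that lack an answer or have no citation."""
    return [
        item.question_id
        for item in items
        if not item.answer.strip() or not item.citations
    ]

# Made-up items for illustration.
items = [
    EvalItem("q001", "What is the refund window?", "30 days", ("doc-17",)),
    EvalItem("q002", "How do I reset my password?"),
]
print(missing_ground_truth(items))  # ['q002']
```

Running the completeness check at the end of Day 4 gives a short list of questions still waiting on an answer or a citation.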

Day 5: Baselines + metrics

Run your current system (or a stub if you don't have one) against the 170 visible questions. Record:

You now have baselines. Every future prompt or retrieval change shows up as a delta from these.
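Reporting a change as a delta can be as simple as the sketch below. The metric names `answer_correct` and `citation_hit` and all the numbers are made up for illustration. Substitute whatever you actually recorded in the baseline run.

```python
def deltas(baseline, candidate):
    """Per-metric change of the candidate run relative to the baseline."""
    return {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }

# Hypothetical scores on the 170 visible questions.
baseline = {"answer_correct": 0.62, "citation_hit": 0.55}
candidate = {"answer_correct": 0.66, "citation_hit": 0.51}

for metric, change in deltas(baseline, candidate).items():
    print(f"{metric}: {change:+.2f}")
```

In this made-up run the answer score rises while the citation score falls, which is the kind of trade-off a single headline number would hide.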

The "do not optimize this" set

The 30 held-out questions exist for one reason: to catch yourself.

Every team that builds an eval set eventually starts tuning the system against the eval set. New prompt rev, run evals, score went up, ship. Three months in, your in-suite score is 95% and your user satisfaction is unchanged because you've been overfitting to the suite.

The held-out 30 questions are run only during quarterly health checks. If they regress while the rest of the suite improves, that's the smoke alarm for overfitting. They're a check on the check.

Don't look at them. Don't add to them. Don't substitute equivalents. Lock them in a vault.
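The smoke-alarm rule is easy to write down. A minimal sketch, assuming you have one aggregate score per set from the previous and current health checks (the function name and `tolerance` parameter are hypothetical):

```python
def overfitting_alarm(visible_prev, visible_now, held_prev, held_now, tolerance=0.0):
    """True when the visible suite improved but the held-out set regressed."""
    visible_improved = visible_now > visible_prev
    held_out_regressed = held_now < held_prev - tolerance
    return visible_improved and held_out_regressed

# Made-up scores: visible suite climbs while the held-out 30 drop.
print(overfitting_alarm(0.80, 0.95, 0.70, 0.60))  # True
```

With only 30 questions, a single flipped answer moves the held-out score by about 3.3 points, so a nonzero `tolerance` may be worth setting to avoid alarms from noise.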

What this gets you

Build the v1 in a week. Iterate the questions monthly. Replace the held-out set once a year.

You'll never have time to build the perfect eval set. You can absolutely have time to build the useful one.
