If your RAG system has no eval set, every prompt change is a vibes-based release. The good news: you can build a useful eval set in a week. Not a perfect one — a useful one. Perfect is the trap that keeps teams from having any evals at all.
Here's the recipe we use on engagements where the client has 0–5 evals on day one and needs 200 by end of week.
The shape
You want three things:
- A frozen corpus — the actual documents your system retrieves over, snapshotted at one point in time.
- 200 questions that an actual user might ask, with ground-truth answers and citations.
- A held-out set — 30 of those 200, locked away, never used to tune anything.
The held-out set is the most important and the most often skipped. We'll come back to it.
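To make the shape concrete, here is a minimal sketch of one eval record and a reproducible 170/30 split. The `EvalItem` fields, the `split_held_out` helper, and the seed value are illustrative assumptions, not a prescribed schema; the sketch also assumes every question has a unique ID.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    """One eval question with its ground truth (field names are illustrative)."""
    qid: str
    question: str
    expected_answer: str = ""
    supporting_passages: tuple = ()  # e.g. chunk IDs in the frozen corpus


def split_held_out(items, n_held_out=30, seed=20240101):
    """Pick n_held_out items with a fixed seed so the split is reproducible."""
    if n_held_out > len(items):
        raise ValueError("held-out size exceeds number of items")
    # Sort by qid first so the result does not depend on input order.
    ordered = sorted(items, key=lambda item: item.qid)
    picked = random.Random(seed).sample(ordered, n_held_out)
    held_ids = {item.qid for item in picked}
    visible = [item for item in ordered if item.qid not in held_ids]
    held_out = [item for item in ordered if item.qid in held_ids]
    return visible, held_out


# Placeholder questions stand in for the real mined ones.
items = [EvalItem(qid=f"q{i:03d}", question=f"placeholder question {i}") for i in range(200)]
visible, held_out = split_held_out(items)
print(len(visible), len(held_out))  # 170 30
```

Do the split once, write the held-out IDs somewhere with restricted access, and never re-run it with a different seed.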
Day 1–2: Mine real questions
Skip "what would users ask?" workshops. They produce questions no user has ever actually asked. Go to your sources:
- Support tickets that mention any topic the RAG covers.
- Search logs from your existing search bar, if you have one.
- Slack channels where people ask questions internally.
- The chat logs from your current LLM system, if you have one running.
Aim for 250 raw questions. Dedupe, normalize phrasing, throw out the obviously-out-of-scope ones. You'll land at ~200.
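The dedupe-and-normalize pass can be partly automated. The sketch below is one possible approach, assuming questions arrive as plain strings; `normalize` and `dedupe` are hypothetical helper names. It only catches duplicates that become identical after normalization, so paraphrases and out-of-scope questions still need a human pass.

```python
import re
import string


def normalize(question: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    lowered = question.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()


def dedupe(questions):
    """Keep the first occurrence of each normalized question, preserving order."""
    seen = set()
    unique = []
    for q in questions:
        key = normalize(q)
        if key and key not in seen:
            seen.add(key)
            unique.append(q.strip())
    return unique


raw = [
    "How do I reset my password?",
    "how do i reset my password",
    "  How do I reset   my password?? ",
    "Where can I download my invoice?",
]
print(dedupe(raw))
# ['How do I reset my password?', 'Where can I download my invoice?']
```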
Day 3: Tag for difficulty and topic
Bucket each question:
- Difficulty: trivial (answer is in one paragraph), medium (answer requires synthesizing 2–3 docs), hard (requires understanding a procedure or rule).
- Topic area: usually 5–15 buckets, depending on your domain.
- Adversarial flag: questions designed to trigger known failure modes (out-of-scope, contradictory info, recently-updated info).
You want a mix. A suite that's 90% trivial questions will tell you your system is great, and you'll ship a regression anyway because nothing in the suite noticed that the medium and hard cases broke.
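A quick way to check the mix is to count the tags. This is a sketch under assumptions: each tagged question is a dict with a `difficulty` key, and the 60% ceiling in `skew_warnings` is an arbitrary illustrative threshold, not a recommendation from this recipe.

```python
from collections import Counter

DIFFICULTIES = ("trivial", "medium", "hard")


def difficulty_mix(tags):
    """Return the share of questions in each difficulty bucket."""
    counts = Counter(tag["difficulty"] for tag in tags)
    unknown = set(counts) - set(DIFFICULTIES)
    if unknown:
        raise ValueError(f"unknown difficulty labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: counts[d] / total for d in DIFFICULTIES}


def skew_warnings(mix, max_share=0.6):
    """List buckets whose share exceeds max_share (threshold is an assumption)."""
    return [bucket for bucket, share in mix.items() if share > max_share]


# A lopsided suite like the 90%-trivial case described above.
tags = (
    [{"difficulty": "trivial"}] * 180
    + [{"difficulty": "medium"}] * 15
    + [{"difficulty": "hard"}] * 5
)
mix = difficulty_mix(tags)
print(mix)
print(skew_warnings(mix))  # ['trivial']
```

The same counting works for topic buckets and the adversarial flag; only the key changes.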
Day 4: Ground truth
For each question, you need:
- The expected answer — written, 1–3 sentences, the answer a domain expert would give.
- The supporting passages — pointers (doc IDs, line ranges, chunk IDs) to the parts of the corpus that should be retrieved to answer the question.
This is the slow part. Plan ~3 minutes per question, or ~10 hours for 200 questions. Split between two people working in parallel, that's about five hours each, so it fits in a day. Use a domain expert.
Worth it. This is the part that turns "the LLM said something plausible" into "the LLM cited the right source and got the right answer."
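Hand-written ground truth drifts, so it helps to lint it before anyone relies on it. The sketch below assumes each item is a dict with `expected_answer` and `supporting_passages` keys and that passages are referenced by chunk ID; those names, the ID format, and the rough sentence counter are all illustrative.

```python
import re


def count_sentences(text: str) -> int:
    """Rough count: split on ., !, or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])


def validate_item(item: dict, corpus_chunk_ids: set) -> list:
    """Return a list of problems with one ground-truth record (empty if fine)."""
    problems = []
    n = count_sentences(item.get("expected_answer", ""))
    if not 1 <= n <= 3:
        problems.append(f"expected_answer has {n} sentences, want 1-3")
    passages = item.get("supporting_passages", [])
    if not passages:
        problems.append("no supporting passages")
    # Every pointer must resolve to a chunk in the frozen corpus snapshot.
    missing = [p for p in passages if p not in corpus_chunk_ids]
    if missing:
        problems.append(f"passages not in frozen corpus: {missing}")
    return problems


corpus = {"doc1#c3", "doc2#c1"}  # hypothetical chunk IDs
good = {"expected_answer": "Refunds take 5 business days.",
        "supporting_passages": ["doc1#c3"]}
bad = {"expected_answer": "", "supporting_passages": ["doc9#c0"]}
print(validate_item(good, corpus))  # []
print(validate_item(bad, corpus))
```

A check like this catches pointers into documents that are not in the snapshot, which is the most common way ground truth goes stale.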
Day 5: Baselines + metrics
Run your current system (or a stub if you don't have one) against the 170 visible questions. Record:
- Retrieval@k: did the top-k retrieved chunks include the expected supporting passages?
- Answer correctness: rated by an LLM judge against the expected answer. The expected answer is shown only to the judge, never to the system under test. Calibrate the judge against ~30 human ratings to make sure it's not just rubber-stamping.
- Citation precision: of the citations returned, what fraction actually support the claim?
You now have baselines. Every future prompt or retrieval change shows up as a delta from these.
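The three numbers above are simple to compute once the inputs exist. The following is a sketch, not a reference implementation: the function names are made up, retrieval is scored as the fraction of expected passages found in the top k (treating a "hit" as 1.0 is one reading of Retrieval@k), and judge calibration is reduced to a plain agreement rate with the human labels.

```python
def retrieval_recall_at_k(retrieved, expected, k=5):
    """Fraction of expected passage IDs that appear in the top-k retrieved IDs."""
    expected_set = set(expected)
    if not expected_set:
        return None  # undefined when no supporting passages were recorded
    return len(expected_set & set(retrieved[:k])) / len(expected_set)


def citation_precision(supported_flags):
    """Fraction of returned citations that actually support the claim.

    supported_flags holds one boolean per citation the system returned.
    """
    flags = list(supported_flags)
    if not flags:
        return None  # undefined when the system cited nothing
    return sum(1 for f in flags if f) / len(flags)


def judge_agreement(judge_labels, human_labels):
    """Share of items where the LLM judge and the human rater gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        return None
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(judge_labels)


print(retrieval_recall_at_k(["c7", "c2", "c9", "c4"], ["c2", "c4"], k=3))  # 0.5
print(citation_precision([True, True, False, True]))  # 0.75
print(judge_agreement([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.75
```

A judge that marks nearly everything correct can still agree with humans often when most answers really are correct, so it is worth also looking at the cases where the two disagree.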
The "do not optimize this" set
The 30 held-out questions exist for one reason: to catch yourself.
Every team that builds an eval set eventually starts tuning the system against the eval set. New prompt rev, run evals, score went up, ship. Three months in, your in-suite score is 95% and your user satisfaction is unchanged because you've been overfitting.
The held-out 30 questions are run only during quarterly health checks. If they regress while the rest of the suite improves, that's the smoke alarm for overfitting. They're a check on the check.
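The smoke alarm can be written down as a comparison of two deltas. This is a minimal sketch with hypothetical names; scores are fractions between 0 and 1, and `tolerance` is an assumed knob for noise. With 30 questions, one flipped answer moves the held-out score by about 3.3 points, so small dips deserve a second look before anyone panics.

```python
def overfitting_alarm(visible_before, visible_after,
                      held_before, held_after, tolerance=0.0):
    """True when the visible suite improved while the held-out set regressed.

    tolerance is how far the held-out score may drop before it counts
    as a regression (an assumed parameter, not part of the recipe).
    """
    visible_delta = visible_after - visible_before
    held_delta = held_after - held_before
    return visible_delta > 0 and held_delta < -tolerance


# Visible suite climbs from 80% to 95% while held-out slips from 77% to 70%.
print(overfitting_alarm(0.80, 0.95, 0.77, 0.70))  # True
# Both sets improve together: no alarm.
print(overfitting_alarm(0.80, 0.95, 0.77, 0.80))  # False
```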
Don't look at them. Don't add to them. Don't substitute equivalents. Lock them in a vault.
What this gets you
- A reproducible signal that gates prompt and retrieval changes.
- A baseline you can compare against when evaluating new LLM versions.
- A defensible answer to "how do you know your RAG is any good?"
Build the v1 in a week. Iterate the questions monthly. Replace the held-out set once a year.
You'll never have time to build the perfect eval set. You absolutely have time to build a useful one.