RAG Evaluation
A reliable RAG system needs measurable quality. Start with a golden set of 50–200 questions and track answer correctness, citation faithfulness, retrieval recall/precision, latency, and cost.
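For example, each golden-set entry can be captured as a small record; the field names below (question, expected_answer, expected_sources) are illustrative, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    """One golden-set entry (illustrative fields, not a required schema)."""
    question: str                 # representative user question
    expected_answer: str          # ground-truth answer to compare against
    expected_sources: list[str] = field(default_factory=list)  # chunk/doc IDs that support the answer

# Hypothetical example entry
example = GoldenExample(
    question="What is the refund window for annual plans?",
    expected_answer="30 days from the purchase date.",
    expected_sources=["billing_policy.md#refunds"],
)
```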
Answer correctness
Does the generated answer match the ground-truth answer in the golden set?
Citation faithfulness
Is every claim supported by the retrieved sources, with no fabrications beyond the provided context?
Recall & precision
Do the retrieved chunks cover the facts needed to answer (recall), and are they mostly relevant (precision)?
Latency & cost
Are p95 latency and token spend acceptable for your SLA and budget?
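A minimal scoring sketch for these metrics, assuming each expected citation and each retrieved chunk carries a stable source ID; the string-containment check for correctness is a crude placeholder for an exact-match rubric or an LLM judge.

```python
def retrieval_scores(retrieved_ids: list[str], expected_ids: list[str]) -> tuple[float, float]:
    """Recall: share of expected sources that were retrieved.
    Precision: share of retrieved chunks that were expected."""
    retrieved, expected = set(retrieved_ids), set(expected_ids)
    hits = retrieved & expected
    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

def naive_correctness(answer: str, expected_answer: str) -> bool:
    """Crude containment check; replace with an LLM judge or stricter rubric in practice."""
    return expected_answer.lower() in answer.lower()
```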
Evaluation process
- Collect 50–200 representative questions with expected answers and source citations.
- Run your pipeline and record outputs, retrieved sources, tokens, and latency (see the harness sketch after this list).
- Score correctness and faithfulness; track recall/precision for retrieval.
- Iterate on chunking, k, re-ranking, and prompts; re-test and compare results.
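The loop above could be wired into a small harness like the sketch below, reusing the helpers from the earlier sketches; `rag_pipeline` is a hypothetical stand-in assumed to return the answer, retrieved source IDs, and token count for one question.

```python
import time
import statistics

def evaluate(golden_set: list[GoldenExample], rag_pipeline) -> dict:
    """Run the pipeline over the golden set and aggregate the metrics tracked above.

    `rag_pipeline(question)` is assumed to return (answer, retrieved_ids, tokens_used).
    """
    rows, latencies, tokens = [], [], []
    for ex in golden_set:
        start = time.perf_counter()
        answer, retrieved_ids, tokens_used = rag_pipeline(ex.question)
        latency = time.perf_counter() - start

        recall, precision = retrieval_scores(retrieved_ids, ex.expected_sources)
        rows.append({
            "question": ex.question,
            "correct": naive_correctness(answer, ex.expected_answer),
            "recall": recall,
            "precision": precision,
            "latency_s": latency,
            "tokens": tokens_used,
        })
        latencies.append(latency)
        tokens.append(tokens_used)

    return {
        "accuracy": sum(r["correct"] for r in rows) / len(rows),
        "mean_recall": statistics.mean(r["recall"] for r in rows),
        "mean_precision": statistics.mean(r["precision"] for r in rows),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1] if len(latencies) >= 2 else latencies[0],
        "total_tokens": sum(tokens),
        "rows": rows,  # keep per-question rows so runs can be compared after each change
    }
```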