RAG Evaluation
A reliable RAG system needs measurable quality. Start with a golden set of 50–200 questions and track answer correctness, citation faithfulness, retrieval recall/precision, latency, and cost.
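For example, each golden-set entry can be captured as a small record; the field names below (question, expected_answer, expected_sources) are illustrative, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    """One golden-set entry (illustrative fields, not a required schema)."""
    question: str                 # representative user question
    expected_answer: str          # ground-truth answer to compare against
    expected_sources: list[str] = field(default_factory=list)  # chunk/doc IDs that support the answer

# Hypothetical example entry
example = GoldenExample(
    question="What is the refund window for annual plans?",
    expected_answer="30 days from the purchase date.",
    expected_sources=["billing_policy.md#refunds"],
)
```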
Answer correctness
Does the generated answer match the ground-truth answer in the golden set?
Citation faithfulness
Is every claim supported by the retrieved sources, with no fabrications beyond the provided context?
Recall & precision
Do the retrieved chunks cover the facts needed to answer (recall), and are they mostly relevant (precision)?
Latency & cost
Are p95 latency and token spend acceptable for your SLA and budget?
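A minimal scoring sketch for these metrics, assuming each expected citation and each retrieved chunk carries a stable source ID; the string-containment check for correctness is a crude placeholder for an exact-match rubric or an LLM judge.

```python
def retrieval_scores(retrieved_ids: list[str], expected_ids: list[str]) -> tuple[float, float]:
    """Recall: share of expected sources that were retrieved.
    Precision: share of retrieved chunks that were expected."""
    retrieved, expected = set(retrieved_ids), set(expected_ids)
    hits = retrieved & expected
    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

def naive_correctness(answer: str, expected_answer: str) -> bool:
    """Crude containment check; replace with an LLM judge or stricter rubric in practice."""
    return expected_answer.lower() in answer.lower()
```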
Evaluation process
- Collect 50–200 representative questions with expected answers and source citations.
- Run your pipeline and record outputs, retrieved sources, tokens, and latency (see the harness sketch after this list).
- Score correctness and faithfulness; track recall/precision for retrieval.
- Iterate on chunking, k, re-ranking, and prompts; re-test and compare results.
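The loop above could be wired into a small harness like the sketch below, reusing the helpers from the earlier sketches; `rag_pipeline` is a hypothetical stand-in assumed to return the answer, retrieved source IDs, and token count for one question.

```python
import time
import statistics

def evaluate(golden_set: list[GoldenExample], rag_pipeline) -> dict:
    """Run the pipeline over the golden set and aggregate the metrics tracked above.

    `rag_pipeline(question)` is assumed to return (answer, retrieved_ids, tokens_used).
    """
    rows, latencies, tokens = [], [], []
    for ex in golden_set:
        start = time.perf_counter()
        answer, retrieved_ids, tokens_used = rag_pipeline(ex.question)
        latency = time.perf_counter() - start

        recall, precision = retrieval_scores(retrieved_ids, ex.expected_sources)
        rows.append({
            "question": ex.question,
            "correct": naive_correctness(answer, ex.expected_answer),
            "recall": recall,
            "precision": precision,
            "latency_s": latency,
            "tokens": tokens_used,
        })
        latencies.append(latency)
        tokens.append(tokens_used)

    return {
        "accuracy": sum(r["correct"] for r in rows) / len(rows),
        "mean_recall": statistics.mean(r["recall"] for r in rows),
        "mean_precision": statistics.mean(r["precision"] for r in rows),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1] if len(latencies) >= 2 else latencies[0],
        "total_tokens": sum(tokens),
        "rows": rows,  # keep per-question rows so runs can be compared after each change
    }
```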