RAG Evaluation
A reliable RAG pipeline needs measurable quality. Start with a golden set of 50–200 questions and track answer accuracy, citation faithfulness, latency, and cost.
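For concreteness, a single golden-set entry only needs a question, the expected answer, and the sources that should support it. The record below is a minimal sketch; the field names are assumptions, not a required schema.

```python
# One illustrative golden-set record (field names are an assumption, not a standard).
golden_example = {
    "id": "q-001",
    "question": "What is the default session timeout?",
    "expected_answer": "30 minutes",
    "expected_sources": ["docs/configuration.md#sessions"],  # chunks that should be retrieved
}
```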
- Answer correctness: Does the answer match the expected ground truth?
- Citation faithfulness: Are claims supported by the retrieved sources, with no fabrications beyond the provided context?
- Recall & precision: Do the retrieved chunks cover the facts needed, and are they mostly relevant? (A small scoring sketch follows this list.)
- Latency & cost: Are p95 latency and token spend acceptable for your SLA and budget?
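As a rough sketch, retrieval recall and precision can be scored per question by comparing the IDs of retrieved chunks against the expected sources in the golden set; the helper and ID scheme below are assumptions, not a standard API.

```python
def retrieval_recall_precision(retrieved_ids, expected_ids):
    """Recall: share of expected sources that were retrieved.
    Precision: share of retrieved chunks that were expected."""
    retrieved, expected = set(retrieved_ids), set(expected_ids)
    if not retrieved or not expected:
        return 0.0, 0.0
    hits = len(retrieved & expected)
    return hits / len(expected), hits / len(retrieved)

# Example: 2 of 3 expected sources retrieved among 5 chunks -> recall ~0.67, precision 0.4
recall, precision = retrieval_recall_precision(
    ["c1", "c2", "c7", "c8", "c9"], ["c1", "c2", "c3"]
)
```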
Evaluation process
- Collect 50–200 representative questions with expected answers and source citations.
- Run your pipeline and record outputs, retrieved sources, tokens, and latency.
- Score correctness and faithfulness; track recall/precision for retrieval (a harness sketch follows this list).
- Iterate on chunking, retrieval top-k, re-ranking, and prompts; re-test and compare results.
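A minimal harness for the run-and-score steps might look like the sketch below. `run_rag`, `judge_correctness`, and `judge_faithfulness` are placeholders for your own pipeline and scoring method (exact match, embedding similarity, or an LLM judge), and it reuses `retrieval_recall_precision` from the earlier sketch.

```python
import statistics
import time

def evaluate(golden_set, run_rag, judge_correctness, judge_faithfulness):
    """Run the pipeline over the golden set and aggregate the metrics above.

    Assumes run_rag(question) returns (answer, retrieved_ids, tokens_used)
    and that the judge functions return scores in [0, 1].
    """
    rows = []
    for item in golden_set:
        start = time.perf_counter()
        answer, retrieved_ids, tokens = run_rag(item["question"])
        latency = time.perf_counter() - start

        recall, precision = retrieval_recall_precision(
            retrieved_ids, item["expected_sources"]
        )
        rows.append({
            "correct": judge_correctness(answer, item["expected_answer"]),
            "faithful": judge_faithfulness(answer, retrieved_ids),
            "recall": recall,
            "precision": precision,
            "latency_s": latency,
            "tokens": tokens,
        })

    latencies = sorted(r["latency_s"] for r in rows)
    return {
        "accuracy": statistics.mean(r["correct"] for r in rows),
        "faithfulness": statistics.mean(r["faithful"] for r in rows),
        "recall": statistics.mean(r["recall"] for r in rows),
        "precision": statistics.mean(r["precision"] for r in rows),
        # Approximate p95 by index; adequate for golden sets of 50-200 questions.
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "total_tokens": sum(r["tokens"] for r in rows),
    }
```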
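To compare iterations, keep each run's summary and diff the aggregates; even a simple delta view shows whether a chunking or re-ranking change actually helped or merely shifted cost.

```python
def diff_runs(baseline, candidate):
    """Metric deltas between two evaluate() summaries (positive = candidate is higher)."""
    return {metric: round(candidate[metric] - baseline[metric], 4) for metric in baseline}

# Hypothetical usage: baseline pipeline vs. a re-ranking variant.
# print(diff_runs(evaluate(golden, rag_v1, judge, faith), evaluate(golden, rag_v2, judge, faith)))
```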