LLM Evaluation
Theory
Classical ML has clear metrics: accuracy, F1, RMSE. LLM outputs are open-ended text, where the same question has many valid answers phrased differently, which is why evaluation is harder.
Three approaches compared (a scoring sketch follows the table):
| Approach | Speed | Cost | Best for |
|---|---|---|---|
| Exact match / BLEU / ROUGE | Fast | Free | Classification, structured output, translation |
| LLM-as-judge (1–5 rubric) | Medium | Low-Med | Open generation, summarization |
| Human eval | Slow | High | Ambiguous tasks, final validation |
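As a concrete illustration of the first row, here is a minimal sketch that scores predictions against references with normalized exact match and ROUGE-L. It assumes the `rouge-score` package (`pip install rouge-score`); the helper names and example data are made up for illustration, not taken from any framework.

```python
# Sketch: reference-based scoring with exact match and ROUGE-L.
# Assumes the `rouge-score` package; the eval data below is illustrative.
import string
from rouge_score import rouge_scorer

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace before comparing."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

eval_pairs = [  # (prediction, reference) pairs -- illustrative only
    ("Paris is the capital of France.", "The capital of France is Paris."),
    ("42", "42"),
]

for pred, ref in eval_pairs:
    em = exact_match(pred, ref)
    rouge_l = scorer.score(ref, pred)["rougeL"].fmeasure
    print(f"EM={em:.0f}  ROUGE-L={rouge_l:.2f}  | {pred!r}")
```

The first pair shows the limitation the table hints at: a valid paraphrase gets EM = 0 but a reasonable ROUGE-L score, which is why open generation usually needs an LLM judge or human eval instead.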
RAG-specific metrics (RAGAS; usage sketch after the table):
| RAGAS metric | What it measures |
|---|---|
| Context Recall | Were the relevant documents actually retrieved? |
| Context Precision | How many retrieved docs were actually relevant? |
| Answer Faithfulness | Is the answer grounded in the retrieved context? |
| Response Relevancy | Does the answer address the original question? |
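A minimal sketch of computing these four metrics with the ragas library, assuming the ~0.1-style API (`ragas.evaluate` over a Hugging Face `Dataset`); newer ragas releases rename some columns and metric objects, and an LLM API key (OpenAI by default) is needed because the metrics are themselves LLM-judged. The eval row is illustrative.

```python
# Sketch: scoring a RAG pipeline with ragas (assumes the ragas ~0.1 API and
# an OpenAI API key in the environment -- the metrics are LLM-judged).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_relevancy,
)

# Illustrative eval row: question, retrieved contexts, generated answer,
# and a reference ("ground truth") answer used by context recall.
rows = {
    "question": ["When was the Eiffel Tower completed?"],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
    "answer": ["It was completed in 1889."],
    "ground_truth": ["The Eiffel Tower was completed in 1889."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[context_recall, context_precision, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores in [0, 1]
```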
Eval set construction: sample real user queries, write reference answers, include hard and edge cases, and keep the set separate from any training or few-shot data (sketch below).
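A small sketch of that last point: sample logged queries into a frozen JSONL eval set with a hard-case flag, stored apart from training data. Paths, field names, and the sampling rule are illustrative assumptions.

```python
# Sketch: freeze a labeled eval set as JSONL, separate from training data.
# Paths, field names, and the query source are illustrative assumptions.
import json
import random

random.seed(7)  # fixed seed so the sampled eval set is reproducible

# Stand-in for real production logs; in practice, load actual user queries.
logged_queries = [
    {"query": "How do I reset my password?", "hard": False},
    {"query": "Why was my refund partially reversed twice?", "hard": True},
    {"query": "Which plans support SSO?", "hard": False},
]

sample = random.sample(logged_queries, k=3)

with open("eval_set.jsonl", "w") as f:
    for row in sample:
        f.write(json.dumps({
            "query": row["query"],
            "reference_answer": "",   # filled in by a human labeler
            "hard_case": row["hard"],
        }) + "\n")
```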
LLM-as-judge is covered in more depth next: rubric design, calibration, and when to trust the judge's scores.