LLM Evaluation
Theory
Classical ML has clear metrics: accuracy, F1, RMSE. LLM outputs are open-ended text, where the same question has many valid answers phrased differently, which is why evaluation is harder.
Three approaches compared (a scoring sketch follows the table):
| Approach | Speed | Cost | Best for |
|---|---|---|---|
| Exact match / BLEU / ROUGE | Fast | Free | Classification, structured output, translation |
| LLM-as-judge (1–5 rubric) | Medium | Low-Med | Open generation, summarization |
| Human eval | Slow | High | Ambiguous tasks, final validation |
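As a concrete illustration of the first row, here is a minimal sketch that scores predictions against references with normalized exact match and ROUGE-L. It assumes the `rouge-score` package (`pip install rouge-score`); the helper names and example data are made up for illustration, not taken from any framework.

```python
# Sketch: reference-based scoring with exact match and ROUGE-L.
# Assumes the `rouge-score` package; the eval data below is illustrative.
import string
from rouge_score import rouge_scorer

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace before comparing."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

eval_pairs = [  # (prediction, reference) pairs -- illustrative only
    ("Paris is the capital of France.", "The capital of France is Paris."),
    ("42", "42"),
]

for pred, ref in eval_pairs:
    em = exact_match(pred, ref)
    rouge_l = scorer.score(ref, pred)["rougeL"].fmeasure
    print(f"EM={em:.0f}  ROUGE-L={rouge_l:.2f}  | {pred!r}")
```

The first pair shows the limitation the table hints at: a valid paraphrase gets EM = 0 but a reasonable ROUGE-L score, which is why open generation usually needs an LLM judge or human eval instead.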
RAG-specific metrics (RAGAS; usage sketch after the table):
| RAGAS metric | What it measures |
|---|---|
| Context Recall | Were the relevant documents actually retrieved? |
| Context Precision | How many retrieved docs were actually relevant? |
| Answer Faithfulness | Is the answer grounded in the retrieved context? |
| Response Relevancy | Does the answer address the original question? |
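A minimal sketch of computing these four metrics with the ragas library, assuming the ~0.1-style API (`ragas.evaluate` over a Hugging Face `Dataset`); newer ragas releases rename some columns and metric objects, and an LLM API key (OpenAI by default) is needed because the metrics are themselves LLM-judged. The eval row is illustrative.

```python
# Sketch: scoring a RAG pipeline with ragas (assumes the ragas ~0.1 API and
# an OpenAI API key in the environment -- the metrics are LLM-judged).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_relevancy,
)

# Illustrative eval row: question, retrieved contexts, generated answer,
# and a reference ("ground truth") answer used by context recall.
rows = {
    "question": ["When was the Eiffel Tower completed?"],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
    "answer": ["It was completed in 1889."],
    "ground_truth": ["The Eiffel Tower was completed in 1889."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[context_recall, context_precision, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores in [0, 1]
```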
Eval set construction: sample real user queries, write reference answers, include hard and edge cases, and keep the set separate from any training or few-shot data (sketch below).
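A small sketch of that last point: sample logged queries into a frozen JSONL eval set with a hard-case flag, stored apart from training data. Paths, field names, and the sampling rule are illustrative assumptions.

```python
# Sketch: freeze a labeled eval set as JSONL, separate from training data.
# Paths, field names, and the query source are illustrative assumptions.
import json
import random

random.seed(7)  # fixed seed so the sampled eval set is reproducible

# Stand-in for real production logs; in practice, load actual user queries.
logged_queries = [
    {"query": "How do I reset my password?", "hard": False},
    {"query": "Why was my refund partially reversed twice?", "hard": True},
    {"query": "Which plans support SSO?", "hard": False},
]

sample = random.sample(logged_queries, k=3)

with open("eval_set.jsonl", "w") as f:
    for row in sample:
        f.write(json.dumps({
            "query": row["query"],
            "reference_answer": "",   # filled in by a human labeler
            "hard_case": row["hard"],
        }) + "\n")
```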
LLM-as-judge is covered in more depth next: rubric design, calibration, and when to trust the judge's scores.