# LLM Evaluation

Act 3 · Theory · ~5 min

Classical ML has clear metrics: accuracy, F1, RMSE. LLM outputs are open-ended text, so the same question admits many valid answers, phrased differently. That is what makes evaluation harder.
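To see the problem concretely, here is a minimal illustration (the strings are made up): exact match scores a correct paraphrase as a complete miss.

```python
# Two semantically equivalent answers score 0 under exact match.
# Illustrative only -- the strings here are invented.

reference = "Paris is the capital of France."
candidate = "The capital of France is Paris."

exact_match = int(candidate.strip().lower() == reference.strip().lower())
print(exact_match)  # 0 -- the answer is correct, the metric says otherwise
```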

Three approaches compared:

| Approach | Speed | Cost | Best for |
|---|---|---|---|
| Exact match / BLEU / ROUGE | Fast | Free | Classification, structured output, translation |
| LLM-as-judge (1–5 rubric) | Medium | Low–Med | Open generation, summarization |
| Human eval | Slow | High | Ambiguous tasks, final validation |
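A minimal LLM-as-judge sketch, assuming an OpenAI-compatible client; the model name and rubric wording are illustrative placeholders, not prescribed here.

```python
# LLM-as-judge sketch. Assumes the openai Python client is installed
# and configured; model name and rubric text are placeholders.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Rate the answer from 1 to 5:
5 = fully correct and complete, 1 = wrong or off-topic.
Reply with a single integer only."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

In practice the judge's integer output should be validated (retry on non-numeric replies) and spot-checked against human labels before you trust it at scale.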

RAG-specific metrics (RAGAS):

| RAGAS metric | What it measures |
|---|---|
| Context Recall | Were the relevant documents actually retrieved? |
| Context Precision | How many retrieved docs were actually relevant? |
| Answer Faithfulness | Is the answer grounded in the retrieved context? |
| Response Relevancy | Does the answer address the original question? |
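A sketch of scoring these four metrics with the ragas library's `evaluate` API; exact metric names and the expected dataset columns vary across ragas versions, and the sample row is invented.

```python
# Scoring the four RAGAS metrics on a one-row dataset.
# Assumes the ragas and datasets packages (ragas 0.1.x-style API);
# the question/answer/context strings are invented examples.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
    faithfulness,
    answer_relevancy,
)

eval_data = Dataset.from_dict({
    "question": ["What is our refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Policy: customers may request a refund within 30 days."]],
    "ground_truth": ["30 days from the purchase date."],
})

result = evaluate(
    eval_data,
    metrics=[context_recall, context_precision, faithfulness, answer_relevancy],
)
print(result)  # per-metric scores in [0, 1]
```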

Eval set construction: sample real user queries, label reference answers, include hard cases, and keep the set strictly separate from training data.
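A sketch of that workflow; the file paths, field names, and the hard-case heuristic are all hypothetical.

```python
# Build a held-out eval set from real traffic. File paths, field
# names, and the "long question = hard case" heuristic are hypothetical.
import json
import random

random.seed(42)  # reproducible sampling

with open("query_log.jsonl") as f:
    queries = [json.loads(line) for line in f]

# Deduplicate so one frequent query cannot dominate the set.
unique = list({q["text"]: q for q in queries}.values())

# Oversample hard cases, e.g. long multi-part questions.
hard = [q for q in unique if len(q["text"].split()) > 30]
easy = [q for q in unique if q not in hard]
sample = random.sample(easy, k=min(80, len(easy))) + hard[:20]

# Reference answers get labeled by hand afterwards; the file is then
# frozen and excluded from any fine-tuning data.
with open("eval_set.jsonl", "w") as f:
    for q in sample:
        f.write(json.dumps({"question": q["text"], "reference": None}) + "\n")
```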

Next, LLM-as-judge in depth: rubric design, calibration, and when to trust the judge.