LLM-as-Judge

Theory

LLM-as-judge uses a capable model as an automated rater. You prompt a judge with the original question and a response; it returns structured scores. Three modes:

| Mode | Description | Typical use |
| --- | --- | --- |
| Pointwise | Score one response on criteria (1–5) | Quality audit at scale |
| Pairwise | Compare two responses, pick winner | A/B model comparison |
| Reference-based | Score against a gold answer | Known-answer benchmarks |
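
A minimal pointwise sketch, assuming the OpenAI Python SDK and a gpt-4o judge; the rubric, model name, and JSON fields are illustrative placeholders, not a prescribed format:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a strict evaluator. Score the response to the question
on a 1-5 scale for correctness, completeness, and clarity.
Return JSON: {{"correctness": int, "completeness": int, "clarity": int, "rationale": str}}

Question: {question}
Response: {response}"""

def judge_pointwise(question: str, response: str, model: str = "gpt-4o") -> dict:
    """Pointwise mode: one response in, structured scores out."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as possible
        response_format={"type": "json_object"},  # force parseable output
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)

scores = judge_pointwise("What is idempotency?",
                         "Idempotency means repeating a call has no extra effect.")
print(scores)  # e.g. {"correctness": 4, "completeness": 3, "clarity": 5, "rationale": "..."}
```

The same pattern extends to the other modes: pairwise passes two responses and asks for a winner; reference-based adds the gold answer to the prompt.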

Strengths: scales to thousands of examples, captures nuanced quality, costs far less than human panels.

Mitigations for known judge biases: swap response order in pairwise runs to counter position bias, ensemble two or more judges, and calibrate against human labels on 100+ examples (target Pearson r above 0.8).
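
A sketch of two of these mitigations, assuming a hypothetical judge_pairwise(question, first, second) callable (built like the pointwise judge above) that returns "first" or "second", and SciPy for the correlation:

```python
from scipy.stats import pearsonr

def pairwise_debiased(question, resp_a, resp_b, judge_pairwise):
    """Run the judge twice with the response order swapped to cancel position bias.

    `judge_pairwise(question, first, second)` is an assumed callable returning
    "first" or "second" for whichever response it prefers.
    """
    verdict_1 = judge_pairwise(question, resp_a, resp_b)  # A shown first
    verdict_2 = judge_pairwise(question, resp_b, resp_a)  # B shown first
    winner_1 = "A" if verdict_1 == "first" else "B"
    winner_2 = "B" if verdict_2 == "first" else "A"
    return winner_1 if winner_1 == winner_2 else "tie"    # disagreement -> no call

def calibration_ok(judge_scores, human_scores, threshold=0.8):
    """Check judge-vs-human agreement on a labelled sample (100+ examples)."""
    r, _ = pearsonr(judge_scores, human_scores)
    return r, r >= threshold
```

Only accepting a verdict that survives the order swap removes position bias from pairwise results; the calibration check flags when the judge drifts away from human labels.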

When not to use alone: medical, legal, and safety-critical decisions require human validation. The judge is a scalable screening layer, not a final authority.