# LLM-as-Judge
## Theory
LLM-as-judge uses a capable model as an automated rater. You prompt the judge with the original question and a candidate response; it returns structured scores. There are three common modes:
| Mode | Description | Typical use |
|---|---|---|
| Pointwise | Score one response on criteria (1–5) | Quality audit at scale |
| Pairwise | Compare two responses, pick winner | A/B model comparison |
| Reference-based | Score against a gold answer | Known-answer benchmarks |
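The pointwise mode can be sketched as prompt construction plus structured-output parsing. This is a minimal illustration, not a prescribed implementation: the rubric text, the JSON schema, and the two criteria names are assumptions, and the actual judge call (an LLM API request) is left out.

```python
import json

RUBRIC = "Rate the response 1-5 for correctness and helpfulness."

def build_pointwise_prompt(question: str, response: str) -> str:
    # Assemble the judge prompt: rubric, original question, candidate
    # response, and a request for machine-parseable JSON output.
    return (
        f"{RUBRIC}\n\n"
        f"Question:\n{question}\n\n"
        f"Response:\n{response}\n\n"
        'Reply with JSON only: {"correctness": <1-5>, "helpfulness": <1-5>}'
    )

def parse_scores(judge_reply: str) -> dict:
    # Judges sometimes wrap JSON in prose; extract the first {...} span.
    start, end = judge_reply.find("{"), judge_reply.rfind("}") + 1
    scores = json.loads(judge_reply[start:end])
    # Reject out-of-range values rather than silently averaging them in.
    for criterion, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} score {value} outside 1-5 scale")
    return scores

# Example with a canned judge reply (no API call made here):
reply = 'Sure. {"correctness": 4, "helpfulness": 5}'
print(parse_scores(reply))  # → {'correctness': 4, 'helpfulness': 5}
```

Strict parsing matters at scale: a lenient parser that guesses at malformed replies will quietly corrupt aggregate scores.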
Strengths: scales to thousands of examples, captures nuanced quality, costs far less than human panels.
Weaknesses: known judge biases include position bias (favoring whichever response appears first in a pair), verbosity bias, and self-preference for outputs resembling the judge's own style. Mitigations: swap response order in pairwise runs, ensemble two or more judges, and calibrate against human labels on 100+ examples (target Pearson r above 0.8).
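Two of the mitigations above can be sketched directly: an order-swapped pairwise run that only accepts a verdict when both orders agree, and a Pearson r calibration check against human labels. The `judge` callable and its "first"/"second" return convention are assumptions for illustration.

```python
def debias_pairwise(judge, question, resp_a, resp_b):
    # Run the judge twice with the response order swapped; accept a
    # verdict only when both orders agree, else call it a tie. This
    # neutralizes position bias at the cost of a second judge call.
    first = judge(question, resp_a, resp_b)    # returns "first" or "second"
    second = judge(question, resp_b, resp_a)
    if first == "first" and second == "second":
        return "A"
    if first == "second" and second == "first":
        return "B"
    return "tie"

def pearson_r(xs, ys):
    # Pearson correlation between judge scores and human labels,
    # for the calibration check (target r > 0.8 on 100+ examples).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A toy judge with maximal position bias always picks the first slot;
# the swap correctly downgrades its verdict to a tie:
always_first = lambda q, a, b: "first"
print(debias_pairwise(always_first, "q", "A-text", "B-text"))  # → tie
```

Treating inconsistent verdicts as ties is one reasonable policy; another is to discard them and report the inconsistency rate as a judge-quality metric.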
When not to use it alone: medical, legal, and safety-critical decisions require human validation. The judge is a scalable screening layer, not a final authority.