# LLM-as-Judge
## Theory
LLM-as-judge uses a capable model as an automated rater. You prompt the judge with the original question and a candidate response; it returns structured scores. There are three common modes:
| Mode | Description | Typical use |
|---|---|---|
| Pointwise | Score one response on criteria (1–5) | Quality audit at scale |
| Pairwise | Compare two responses, pick winner | A/B model comparison |
| Reference-based | Score against a gold answer | Known-answer benchmarks |
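The pointwise mode can be sketched as prompt construction plus structured-output parsing. This is a minimal illustration, not a prescribed implementation: the rubric text, the JSON schema, and the two criteria names are assumptions, and the actual judge call (an LLM API request) is left out.

```python
import json

RUBRIC = "Rate the response 1-5 for correctness and helpfulness."

def build_pointwise_prompt(question: str, response: str) -> str:
    # Assemble the judge prompt: rubric, original question, candidate
    # response, and a request for machine-parseable JSON output.
    return (
        f"{RUBRIC}\n\n"
        f"Question:\n{question}\n\n"
        f"Response:\n{response}\n\n"
        'Reply with JSON only: {"correctness": <1-5>, "helpfulness": <1-5>}'
    )

def parse_scores(judge_reply: str) -> dict:
    # Judges sometimes wrap JSON in prose; extract the first {...} span.
    start, end = judge_reply.find("{"), judge_reply.rfind("}") + 1
    scores = json.loads(judge_reply[start:end])
    # Reject out-of-range values rather than silently averaging them in.
    for criterion, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion} score {value} outside 1-5 scale")
    return scores

# Example with a canned judge reply (no API call made here):
reply = 'Sure. {"correctness": 4, "helpfulness": 5}'
print(parse_scores(reply))  # → {'correctness': 4, 'helpfulness': 5}
```

Strict parsing matters at scale: a lenient parser that guesses at malformed replies will quietly corrupt aggregate scores.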
Strengths: scales to thousands of examples, captures nuanced quality, costs far less than human panels.
Weaknesses: known judge biases include position bias (favoring whichever response appears first in a pair), verbosity bias, and self-preference for outputs resembling the judge's own style. Mitigations: swap response order in pairwise runs, ensemble two or more judges, and calibrate against human labels on 100+ examples (target Pearson r above 0.8).
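Two of the mitigations above can be sketched directly: an order-swapped pairwise run that only accepts a verdict when both orders agree, and a Pearson r calibration check against human labels. The `judge` callable and its "first"/"second" return convention are assumptions for illustration.

```python
def debias_pairwise(judge, question, resp_a, resp_b):
    # Run the judge twice with the response order swapped; accept a
    # verdict only when both orders agree, else call it a tie. This
    # neutralizes position bias at the cost of a second judge call.
    first = judge(question, resp_a, resp_b)    # returns "first" or "second"
    second = judge(question, resp_b, resp_a)
    if first == "first" and second == "second":
        return "A"
    if first == "second" and second == "first":
        return "B"
    return "tie"

def pearson_r(xs, ys):
    # Pearson correlation between judge scores and human labels,
    # for the calibration check (target r > 0.8 on 100+ examples).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A toy judge with maximal position bias always picks the first slot;
# the swap correctly downgrades its verdict to a tie:
always_first = lambda q, a, b: "first"
print(debias_pairwise(always_first, "q", "A-text", "B-text"))  # → tie
```

Treating inconsistent verdicts as ties is one reasonable policy; another is to discard them and report the inconsistency rate as a judge-quality metric.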
When not to use it alone: medical, legal, and safety-critical decisions require human validation. The judge is a scalable screening layer, not a final authority.