Reframing evaluation from "did X succeed?" to "did failure Y occur?" yields more consistent and accurate graders. Here's the evidence.
LLM-as-judge has become the default approach for evaluating agentic systems. Ask a model to grade a trace on a scale of 1-5, or classify performance as good/bad/neutral, and you get a signal you can use for optimization.
But there's a fundamental problem: these graders are inconsistent. Run the same grader on the same trace 100 times and you'll get a distribution of answers, not a single result. An essay might receive 60% B's, 25% A's, and 15% C's across repeated evaluations.
This inconsistency isn't just an annoyance—it's a quality problem. If the same input produces different outputs, they can't all be correct. And if you're using these grades as training signal for RL or SFT, that noise propagates into your optimization.
The key insight isn't just "use booleans"—many evaluation systems already do. The insight is: frame every check as detecting the presence of a specific failure, not the presence of success.
Instead of asking "is the tone formal?" ask "did the agent use informal language?" Instead of "was the invite correct?" ask a checklist of specific failure modes: did the invite include the wrong attendees, did the proposed time conflict with the calendar, does the title misstate the meeting's purpose?
This reframing, plus breaking compound checks into atomic ones, has several advantages: each check becomes a narrow yes/no question the judge can answer reliably, a positive result points to one specific failure rather than a vague low score, and the resulting signal is far cleaner to optimize against.
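To make the contrast concrete, here is a minimal sketch of the two framings as boolean LLM judges. The `call_llm` helper is a placeholder for whatever model client you use, and the prompt wording is illustrative rather than the prompts used in this study:

```python
# Sketch of success-framed vs. failure-framed boolean judges.
# `call_llm` is a placeholder: wire it to your own model client; it is
# assumed to return the raw text of the model's reply.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect your model client here")


def judge(question: str, trace: str) -> bool:
    """Ask a single yes/no question about a trace and parse the answer."""
    prompt = (
        f"{question}\n\n"
        f"Conversation trace:\n{trace}\n\n"
        "Answer with exactly one word: YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")


def is_formal(trace: str) -> bool:
    # Success-oriented framing: the judge must decide what "formal enough" means.
    return judge("Is the tone of the agent's emails formal?", trace)


def used_informal_language(trace: str) -> bool:
    # Failure-mode framing: the judge only has to spot a concrete violation.
    return judge("Did the agent use casual or unprofessional language anywhere?", trace)
```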
We tested this hypothesis on a scheduling assistant agent—an LLM that helps executives schedule meetings by participating in email threads and proposing calendar times.
We collected 50 conversation traces and evaluated them using two approaches:
Our current system already uses boolean outputs, but frames questions as "did X happen correctly?":
| Grader | Question Framing |
|---|---|
| invite_correctness | correct_people ∧ correct_time ∧ correct_title (compound) |
| formal | "Is the tone formal?" |
| natural_emails | "Is the email tone natural and fluid?" |
| wordiness_body | "Is the body concise?" |
| varied_times_proposed | "Are the proposed times reasonably varied?" |
| correct_invite_first_attempt | "Was the invite correct on first attempt?" |
We reframe each check to ask "did this specific failure occur?" and break compound checks into atomic ones:
| Grader | Question Framing |
|---|---|
| wrong_attendees | "Did the invite include someone who shouldn't be there, or exclude someone who should?" |
| time_conflict | "Did the proposed time conflict with an existing calendar event?" |
| title_mismatch | "Does the event title fail to reflect the meeting's stated purpose?" |
| informal_language | "Did the agent use casual or unprofessional language?" |
| robotic_phrasing | "Did the agent use unnatural, robotic, or templated phrasing?" |
| excessive_wordiness | "Did the email body contain unnecessary filler or redundant information?" |
| insufficient_options | "Did the agent propose fewer than 2 distinct time options?" |
| required_correction | "Did the agent need to modify or re-send the invite after initial creation?" |
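In code, this table translates into a flat registry of atomic checks, reusing the `judge` sketch from above; the compound `invite_correctness` grader from the first table becomes the conjunction of three of them coming back clean. The registry and helper names here are illustrative, not part of the released framework:

```python
# Illustrative registry of atomic failure checks. The questions are taken
# verbatim from the table above; the registry structure itself is a sketch.
FAILURE_CHECKS = {
    "wrong_attendees": "Did the invite include someone who shouldn't be there, or exclude someone who should?",
    "time_conflict": "Did the proposed time conflict with an existing calendar event?",
    "title_mismatch": "Does the event title fail to reflect the meeting's stated purpose?",
    "informal_language": "Did the agent use casual or unprofessional language?",
    "robotic_phrasing": "Did the agent use unnatural, robotic, or templated phrasing?",
    "excessive_wordiness": "Did the email body contain unnecessary filler or redundant information?",
    "insufficient_options": "Did the agent propose fewer than 2 distinct time options?",
    "required_correction": "Did the agent need to modify or re-send the invite after initial creation?",
}


def run_failure_checks(trace: str) -> dict[str, bool]:
    """Return, per check, whether that failure mode was detected in the trace."""
    return {name: judge(question, trace) for name, question in FAILURE_CHECKS.items()}


def invite_correct(results: dict[str, bool]) -> bool:
    # The old compound invite_correctness grader is recovered when all three
    # atomic invite checks come back clean.
    return not (results["wrong_attendees"]
                or results["time_conflict"]
                or results["title_mismatch"])
```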
For each trace, we ran each grader 100 times and measured consistency (% agreement across runs) and accuracy (agreement with human labels on a held-out set).
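For reference, one plausible way to compute these two metrics, assuming consistency is measured as agreement with each grader's majority answer across the 100 runs:

```python
from collections import Counter


def consistency(run_outputs: list[bool]) -> float:
    """Fraction of repeated runs that agree with the modal (majority) answer."""
    counts = Counter(run_outputs)
    return counts.most_common(1)[0][1] / len(run_outputs)


def accuracy(majority_answers: list[bool], human_labels: list[bool]) -> float:
    """Fraction of traces where the grader's majority answer matches the human label."""
    matches = sum(a == h for a, h in zip(majority_answers, human_labels))
    return matches / len(human_labels)


# Example: 100 repeated runs of one grader on one trace.
runs = [True] * 94 + [False] * 6
print(consistency(runs))  # 0.94
```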
Failure-mode graders achieved 94% average consistency compared to 71% for success-oriented graders. The improvement was most pronounced for subjective dimensions like tone and naturalness, where success-oriented framing showed high variance.
When compared against human-labeled ground truth, failure-mode graders achieved 89% accuracy vs 76% for success-oriented graders. Notably, humans also found it easier to provide consistent labels for failure-mode questions.
When used as reward signal for RL fine-tuning, failure-mode graders produced faster convergence and higher final performance. The cleaner signal allows the optimization to make consistent progress rather than chasing noise.
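One way atomic failure checks might be collapsed into a scalar reward for RL is to subtract a penalty per detected failure. The penalty weights below are invented for illustration and are not the values used in the study:

```python
# Hypothetical reward shaping: start from 1.0 and subtract a penalty for each
# failure the graders detect. Weights are illustrative only.
PENALTIES = {
    "wrong_attendees": 0.4,
    "time_conflict": 0.4,
    "title_mismatch": 0.2,
    "informal_language": 0.1,
    "robotic_phrasing": 0.1,
    "excessive_wordiness": 0.1,
    "insufficient_options": 0.1,
    "required_correction": 0.2,
}


def reward(results: dict[str, bool]) -> float:
    """Scalar reward: 1.0 minus penalties for every failure mode detected."""
    total_penalty = sum(PENALTIES[name] for name, failed in results.items() if failed)
    return max(0.0, 1.0 - total_penalty)
```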
The tradeoff is clear: failure-mode detection requires more individual checks. Where you might have had 6 success-oriented graders, you might now have 12 atomic failure checks. But this is a feature, not a bug: each atomic check tells you exactly which failure occurred, and checks can be added, tuned, or retired independently.
This approach shifts evaluation from holistic judgment toward systematic testing. It's less "grade this essay" and more "run the test suite."
We're releasing our evaluation framework and the prompts used in this study. If you're building agentic systems and struggling with evaluation consistency, failure-mode detection might be the constraint you need.
[Link to code / framework coming soon]