Research / Evaluation

Your LLM Grader Is Asking the Wrong Question

Reframing evaluation from "did X succeed?" to "did failure Y occur?" yields more consistent and accurate graders. Here's the evidence.

The Problem with LLM-as-Judge

LLM-as-judge has become the default approach for evaluating agentic systems. Ask a model to grade a trace on a scale of 1-5, or classify performance as good/bad/neutral, and you get a signal you can use for optimization.
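As a concrete illustration, here is a minimal sketch of this style of open-ended judge. The prompt wording is invented for illustration, and call_llm is a placeholder for whatever model client you use, not a specific API.

```python
# Sketch of a success-oriented, open-ended judge (illustrative only).
# `call_llm` is a placeholder: assume it takes a prompt string and
# returns the model's text reply.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

JUDGE_PROMPT = """You are grading a scheduling assistant.
Read the conversation trace below and rate the agent's overall
performance on a scale of 1-5, where 5 is excellent.

Trace:
{trace}

Reply with a single number from 1 to 5."""

def grade_trace(trace: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(trace=trace))
    return int(reply.strip()[0])  # naive parse; repeated runs give a spread of scores
```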

But there's a fundamental problem: these graders are inconsistent. Run the same grader on the same trace 100 times and you'll get a distribution of answers, not a single result. An essay might receive 60% B's, 25% A's, and 15% C's across repeated evaluations.

This inconsistency isn't just an annoyance—it's a quality problem. If the same input produces different outputs, they can't all be correct. And if you're using these grades as training signal for RL or SFT, that noise propagates into your optimization.

The core issue: Open-ended questions like "rate this on a scale of 1-5" or "did the agent complete the task correctly?" give LLMs too much room for interpretation. The result is variance where you need precision.

The Hypothesis: Failure-Mode Detection

The key insight isn't just "use booleans"—many evaluation systems already do. The insight is: frame every check as detecting the presence of a specific failure, not the presence of success.

Instead of asking "is the tone formal?", ask "did the agent use informal language?" Instead of "was the invite correct?", ask a checklist of specific failure modes: did the invite include the wrong attendees, propose a conflicting time, or mis-title the event?

This reframing, plus breaking compound checks into atomic ones, has several advantages: each question is narrow enough to leave little room for interpretation, each check returns a boolean that is easy to verify against human judgment, and when a check fires it points to one specific failure rather than a vague low score.
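To make the reframing concrete, here is a minimal sketch of an atomic failure-mode check, reusing the call_llm placeholder from the earlier sketch. The prompt wording is an illustrative assumption, not the exact prompt used in this study.

```python
# Sketch of an atomic failure-mode check: the question asks whether a
# specific failure occurred, and the model must answer YES or NO.
# Reuses the `call_llm` placeholder defined in the earlier sketch.

FAILURE_CHECK_PROMPT = """You are reviewing a conversation trace from a scheduling assistant.

Failure to check for: {question}

Trace:
{trace}

Did this failure occur? Reply with exactly one word: YES or NO."""

def failure_occurred(trace: str, question: str) -> bool:
    reply = call_llm(FAILURE_CHECK_PROMPT.format(question=question, trace=trace))
    return reply.strip().upper().startswith("YES")

# Example:
#   failure_occurred(trace, "Did the agent use casual or unprofessional language?")
```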

Experiment Design

We tested this hypothesis on a scheduling assistant agent—an LLM that helps executives schedule meetings by participating in email threads and proposing calendar times.

We collected 50 conversation traces and evaluated them using two approaches:

Baseline: Success-Oriented Boolean Graders

Our current system already uses boolean outputs, but frames questions as "did X happen correctly?":

Grader | Question Framing
invite_correctness | correct_people ∧ correct_time ∧ correct_title (compound)
formal | "Is the tone formal?"
natural_emails | "Is the email tone natural and fluid?"
wordiness_body | "Is the body concise?"
varied_times_proposed | "Are the proposed times reasonably varied?"
correct_invite_first_attempt | "Was the invite correct on first attempt?"

Proposed: Failure-Mode Detection Graders

We reframe each check to ask "did this specific failure occur?" and break compound checks into atomic ones:

Grader | Question Framing
wrong_attendees | "Did the invite include someone who shouldn't be there, or exclude someone who should?"
time_conflict | "Did the proposed time conflict with an existing calendar event?"
title_mismatch | "Does the event title fail to reflect the meeting's stated purpose?"
informal_language | "Did the agent use casual or unprofessional language?"
robotic_phrasing | "Did the agent use unnatural, robotic, or templated phrasing?"
excessive_wordiness | "Did the email body contain unnecessary filler or redundant information?"
insufficient_options | "Did the agent propose fewer than 2 distinct time options?"
required_correction | "Did the agent need to modify or re-send the invite after initial creation?"
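The checklist above can be kept as plain data and run as a suite. Here is a sketch, building on the failure_occurred helper sketched earlier; the names and structure are ours for illustration, not a released artifact.

```python
# The failure-mode checklist from the table above, kept as plain data.
FAILURE_MODE_GRADERS = {
    "wrong_attendees": "Did the invite include someone who shouldn't be there, or exclude someone who should?",
    "time_conflict": "Did the proposed time conflict with an existing calendar event?",
    "title_mismatch": "Does the event title fail to reflect the meeting's stated purpose?",
    "informal_language": "Did the agent use casual or unprofessional language?",
    "robotic_phrasing": "Did the agent use unnatural, robotic, or templated phrasing?",
    "excessive_wordiness": "Did the email body contain unnecessary filler or redundant information?",
    "insufficient_options": "Did the agent propose fewer than 2 distinct time options?",
    "required_correction": "Did the agent need to modify or re-send the invite after initial creation?",
}

def run_failure_suite(trace: str) -> dict[str, bool]:
    """Run every atomic failure check against a single trace (True = failure detected)."""
    return {name: failure_occurred(trace, question)
            for name, question in FAILURE_MODE_GRADERS.items()}
```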

For each trace, we ran each grader 100 times and measured consistency (% agreement across runs) and accuracy (agreement with human labels on a held-out set).
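The exact metric definitions aren't spelled out above, so the sketch below assumes consistency means agreement with the grader's majority answer across repeated runs, and accuracy means that majority answer matching a human label.

```python
from collections import Counter

def consistency(trace: str, question: str, n_runs: int = 100) -> float:
    """Fraction of repeated runs that agree with the majority answer (assumed definition)."""
    answers = [failure_occurred(trace, question) for _ in range(n_runs)]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / n_runs

def accuracy(traces: list[str], human_labels: list[bool], question: str, n_runs: int = 100) -> float:
    """Fraction of traces where the majority answer matches the human label (assumed definition)."""
    correct = 0
    for trace, label in zip(traces, human_labels):
        answers = [failure_occurred(trace, question) for _ in range(n_runs)]
        majority_answer = Counter(answers).most_common(1)[0][0]
        correct += int(majority_answer == label)
    return correct / len(traces)
```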

Results

Consistency: Failure-Mode Graders Show Less Variance

[Figure: Consistency (% Agreement Across 100 Runs)]

Failure-mode graders achieved 94% average consistency compared to 71% for success-oriented graders. The improvement was most pronounced for subjective dimensions like tone and naturalness, where success-oriented framing showed high variance.

Accuracy: Failure-Mode Graders Match Human Judgment Better

[Figure: Accuracy (Agreement with Human Labels)]

When compared against human-labeled ground truth, failure-mode graders achieved 89% accuracy vs 76% for success-oriented graders. Notably, humans also found it easier to provide consistent labels for failure-mode questions.

Downstream Impact: Better Training Signal

[Figure: Agent Performance Over Training Iterations]

When used as a reward signal for RL fine-tuning, failure-mode graders produced faster convergence and higher final performance. The cleaner signal allows the optimization to make consistent progress rather than chase noise.
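This write-up doesn't say how the boolean checks are combined into a scalar reward; one simple aggregation, shown purely as an assumption, is the fraction of failure checks that come back clean.

```python
def trace_reward(failure_results: dict[str, bool]) -> float:
    """One possible reward: fraction of failure modes that did NOT occur.
    This aggregation is an assumption, not necessarily what the study used."""
    if not failure_results:
        return 0.0
    clean = sum(1 for failed in failure_results.values() if not failed)
    return clean / len(failure_results)

# Example, using the suite sketched earlier:
#   reward = trace_reward(run_failure_suite(trace))
```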

Discussion

The tradeoff is clear: failure-mode detection requires more individual checks. Where you might have had 6 success-oriented graders, you might now have 12 atomic failure checks. But this is a feature, not a bug: each check is unambiguous, each flagged failure pinpoints exactly what went wrong, and the narrower questions are what drive the consistency gains above.

This approach shifts evaluation from holistic judgment toward systematic testing. It's less "grade this essay" and more "run the test suite."

Try It Yourself

We're releasing our evaluation framework and the prompts used in this study. If you're building agentic systems and struggling with evaluation consistency, failure-mode detection might be the constraint you need.

[Link to code / framework coming soon]