Agent reliability
A narrow look at what the last ~90 days of research are teaching us
The field is moving extremely fast right now. New agent stacks, new evals, new post-training tricks - the whole ecosystem shifts weekly.
But if you ship agents, you learn a painful lesson fast:
An agent that succeeds once is not a reliable agent.
Single-run success rates are demo metrics. Production reliability is a different game:
Consistency across runs (the same task, same setup, multiple attempts)
Robustness to “equivalent” user inputs (paraphrases, small spec changes, harmless reorderings)
Grace under tool/API failures (because they will fail - timeouts, rate limits, partial responses, schema drift)
If I had to compress the theme of the last ~90 days into one line, it’s this:
Reliability is a surface, not a score.
This post is intentionally narrow: recent work that treats agent reliability as a first-class object - not a vibe.
The production reality check: humans are still the reliability layer
A paper I keep pointing people to is “Measuring Agents in Production.” It’s one of the rare efforts that asks practitioners what’s actually working (and what’s breaking).
A few findings that stuck with me:
Many production agents are built to be simple and controllable: 68% run at most 10 steps before requiring human intervention.
Most teams lean on prompting off-the-shelf models rather than tuning weights (70%), and rely primarily on human evaluation (74%).
Reliability shows up as the top challenge - especially “ensuring and evaluating correctness.”
That’s the current equilibrium: humans as circuit breakers.
The real question is how we scale beyond that without lying to ourselves about what “reliable” means.
ReliabilityBench: measuring reliability as a surface, not a score
ReliabilityBench is exactly the kind of benchmark we’ve needed.
Instead of asking “did it succeed,” it asks:
Does it succeed again (consistency)
Does it succeed under equivalent variations of the task (robustness)
Does it survive tool/API failures (fault tolerance)
They formalize this across three dimensions:
pass^k for repeated execution
perturbation intensity ε
fault intensity λ
...and propose a unified reliability surface: R(k, ε, λ).
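To make the surface concrete, here's a minimal sketch of how you might estimate it from repeated runs. It uses the common "all k runs succeed" definition of pass^k and its combinatorial estimator; run_task, the grids, and the run count are my placeholders, not the paper's harness.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the probability that k fresh runs of the same task
    all succeed, given c successes observed in n runs (requires k <= n)."""
    if k > n:
        raise ValueError("need at least k runs to estimate pass^k")
    return comb(c, k) / comb(n, k)

def reliability_surface(run_task, ks, epsilons, lambdas, n_runs: int = 10):
    """Sample R(k, eps, lam) on a grid.

    run_task(eps, lam) -> bool is assumed to execute one agent run with
    perturbation intensity eps and injected fault intensity lam, and report
    whether the end state was correct.
    """
    surface = {}
    for eps in epsilons:
        for lam in lambdas:
            c = sum(run_task(eps, lam) for _ in range(n_runs))
            for k in ks:
                surface[(k, eps, lam)] = pass_hat_k(n_runs, c, k)
    return surface
```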
Two ideas here that I think will stick:
Action metamorphic relations: judge correctness by end-state equivalence rather than brittle text matching.
Chaos-style fault injection: simulate timeouts, rate limits, partial responses, schema drift.
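The second idea is easy to prototype. Here's a hedged sketch of chaos-style fault injection around a tool call; the fault menu, the probability splits, and InjectedToolFault are my own placeholders, not the paper's infrastructure.

```python
import random

class InjectedToolFault(Exception):
    """Raised to simulate a tool/API failure the agent has to handle."""

def with_faults(tool_fn, lam: float, seed: int = 0):
    """Wrap tool_fn so each call fails or degrades with probability ~lam."""
    rng = random.Random(seed)

    def wrapped(*args, **kwargs):
        roll = rng.random()
        if roll < lam * 0.4:
            raise InjectedToolFault("timeout")       # simulated timeout
        if roll < lam * 0.7:
            raise InjectedToolFault("rate_limited")  # simulated 429
        result = tool_fn(*args, **kwargs)
        if isinstance(result, dict) and result:
            if roll < lam * 0.9:
                # partial response: silently drop one field
                result = dict(list(result.items())[:-1])
            elif roll < lam:
                # schema drift: rename fields the agent expects
                result = {f"{k}_v2": v for k, v in result.items()}
        return result

    return wrapped
```

Wrapping every tool with something like with_faults(tool, lam) and sweeping lam is already enough to start mapping the fault axis of the surface.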
The reported results are the point:
Perturbations alone reduced success from 96.9% at ε=0 to 88.1% at ε=0.2.
Rate limiting was especially damaging.
This is what “production-like” really means: not one clean run, but performance under stress.
Why we should care:
If you only track single-run pass rates, you end up optimizing for demos.
A reliability surface forces the conversation into repeatability, robustness, and failure modes.
E-valuator: turn “judge scores” into runtime decisions (with guarantees)
Assume you’ve built a verifier (LLM judge, PRM, heuristics). You can score trajectories - but can you trust the score enough to make a runtime decision?
E-valuator reframes this as a sequential hypothesis testing problem: distinguish successful vs unsuccessful trajectories as actions unfold, using a statistically valid test at every step.
They propose converting any black-box verifier score into a decision rule with controlled false-alarm rates, and show it can both improve monitoring and terminate problematic trajectories early to save tokens.
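The paper has the details, but the general shape of an anytime-valid monitor is worth seeing. Below is a rough e-value-style sketch, one standard way to get controlled false-alarm rates from a black-box score; the density estimates, names, and threshold are my assumptions, not necessarily E-valuator's exact construction.

```python
def make_step_e_value(p_success, p_failure):
    """Map a verifier score to a per-step e-value via a likelihood ratio.

    p_success(s) and p_failure(s) are assumed density estimates of the score
    on successful vs failing trajectories (fit on held-out runs). Under the
    null "this trajectory is succeeding", the ratio has expectation 1, which
    is what makes the running product below a valid e-process.
    """
    def e_value(score: float) -> float:
        return p_failure(score) / max(p_success(score), 1e-12)
    return e_value

def monitor(scores, e_value, alpha: float = 0.05):
    """Flag a trajectory as soon as the e-process crosses 1/alpha.

    By Ville's inequality, the probability of ever flagging a genuinely
    successful trajectory is at most alpha (assuming the null model is right
    and per-step scores are modeled conditionally on the past).
    """
    e_process = 1.0
    for t, s in enumerate(scores):
        e_process *= e_value(s)
        if e_process >= 1.0 / alpha:
            return "terminate", t  # stop early, escalate, or roll back
    return "continue", None
```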
Why we should care:
“Judge reliability” is now a core dependency for agent reliability.
This is one path from heuristics to operational control.
LLMdoctor: test-time steering as a reliability tool
Benchmarks and verifiers tell you “it broke.”
But reliability also requires that you can “fix it now.”
That’s why I like test-time alignment approaches that are modular. LLMdoctor has a clean patient-doctor framing: steer a frozen model with a smaller controller trained on token-level preference signals, via token-level flow-guided preference optimization.
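To make the pattern concrete, here's a minimal, framework-agnostic sketch of controller-guided decoding. This is the generic steering idea, not LLMdoctor's flow-guided objective; the arrays and the beta knob are placeholders.

```python
import numpy as np

def steered_next_token(base_logits: np.ndarray,
                       controller_scores: np.ndarray,
                       beta: float = 1.0) -> int:
    """One decoding step of controller-guided steering.

    base_logits: logits over the vocabulary from the frozen "patient" model.
    controller_scores: per-token preference signals from a small "doctor"
    model (added in log-space here; the paper trains these signals with a
    flow-guided objective, which this sketch does not reproduce).
    beta: how strongly the controller may push the base model; beta = 0
    recovers the unmodified base model.
    """
    steered = base_logits + beta * controller_scores
    return int(np.argmax(steered))  # greedy for simplicity; sampling works too
```

Turning the intervention off is just beta = 0, which is what makes it reversible.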
Even if you ignore the specific algorithm, the pattern matters:
You can steer without retraining the foundation.
You can make reliability interventions fast and reversible.
You can version and evaluate the controller like a product.
Why we should care:
Most teams treat reliability fixes as either “change the prompt” or “fine-tune and pray.”
Controller-style steering gives a third option: a scoped, testable intervention layer.
Human-in-the-loop rubrics: reliability is often a “shared standard” problem
The hardest part of agent reliability isn’t always the model.
Sometimes it’s the absence of a shared, auditable definition of “correct.”
A recent paper on patch evaluation proposes a simple but scalable framework:
use an LLM to draft a task-specific rubric,
have a human review/refine it once,
use the rubric-guided LLM judge to evaluate many candidates (sketched below).
They report improved agreement with human consensus (e.g., Cohen’s kappa 0.75 on the subset with unanimous human agreement), plus high recall/precision in that setting.
Even though the domain is program repair, the reliability lesson generalizes:
When humans disagree, it’s often because the rubric is implicit. Make it explicit once - then scale it.
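For completeness, here's a minimal sketch of the rubric-once, judge-many pattern; llm stands in for any chat-completion callable, and the prompts and function names are illustrative rather than taken from the paper.

```python
def draft_rubric(llm, task_description: str) -> str:
    return llm(f"Draft a concrete, checkable grading rubric for this task:\n{task_description}")

def review_rubric(draft: str) -> str:
    # The one place a human spends time: approve or edit the rubric.
    print(draft)
    edited = input("Edit the rubric (or press Enter to accept): ")
    return edited or draft

def judge(llm, rubric: str, candidate: str) -> bool:
    verdict = llm(
        f"Rubric:\n{rubric}\n\nCandidate:\n{candidate}\n\n"
        "Does the candidate satisfy every rubric item? Answer PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")

def evaluate_many(llm, task_description: str, candidates: list[str]) -> list[bool]:
    rubric = review_rubric(draft_rubric(llm, task_description))
    return [judge(llm, rubric, c) for c in candidates]
```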
A narrow reliability loop I’d actually run
If I had to condense the above into a practical loop (without turning it into a platform pitch), it would look like this:
Define correctness in end-states, not text: Use metamorphic relations / end-state equivalence where possible (see the sketch after this list).
Stress-test, don’t just benchmark: Measure a reliability surface across repeated runs (k), perturbations (ε), and tool failures (λ).
Monitor online with calibrated decision rules: Turn verifier scores into stop / continue / escalate decisions you can defend.
Keep humans as reviewers of standards, not full-time graders: Use human time to approve/refine rubrics and resolve disagreements.
Treat steering as a first-class intervention: Controller models (doctor -> patient) are a pragmatic way to improve behavior without turning every fix into a full retrain.
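Here's the sketch promised in step 1: correctness defined by end-state equivalence, with a canonicalization step so run-specific noise and harmless reorderings don't count as failures. The field names and run_agent are placeholders for whatever your agent actually touches.

```python
# canonicalize() and the end-state shape are placeholders (rows written,
# files created, tickets filed, ...).

def canonicalize(end_state: dict) -> dict:
    """Strip run-specific noise (ids, timestamps) and normalize ordering
    so harmless differences don't count as failures."""
    ignored = {"run_id", "timestamp"}
    clean = {k: v for k, v in end_state.items() if k not in ignored}
    return {k: sorted(v, key=repr) if isinstance(v, list) else v
            for k, v in clean.items()}

def equivalent(state_a: dict, state_b: dict) -> bool:
    return canonicalize(state_a) == canonicalize(state_b)

def metamorphic_check(run_agent, task: str, paraphrase: str) -> bool:
    """An 'equivalent' rephrasing of the task should land in an equivalent
    end state; run_agent(task) -> dict is assumed to return that end state."""
    return equivalent(run_agent(task), run_agent(paraphrase))
```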
Agent reliability is not a single feature and not a single metric. It’s a contract:
measured under stress
monitored online
improved with small, controlled interventions
audited through shared standards humans can actually read.
The best recent work is finally treating reliability like an object we can engineer - not a hope we can prompt.
Quick credit: Andy Wong consistently finds great new papers early, and we end up debating the implications together before they show up in writing.