Reliability knobs for agents
Spring 2026 update
The agent conversation is as noisy as it has ever been:
One camp says base models are now good enough, just give them tools.
Another (the camp I am mostly in) says evaluation is the bottleneck, just build better judges.
While I would assign greater weight to the second camp, they are both directionally right and still too coarse.
The durable gains are showing up somewhere less glamorous and much more useful: new reliability knobs. Not one magical agent architecture, but a stack of control surfaces that help systems lose less intent, preserve more capability, remember more of the right state, and fail in ways teams can actually inspect and correct.
Several recent papers speak directly to these problems:
1. PPS: make intent explicit before the model starts guessing
Natural-language prompts have a hidden failure mode: intent transmission loss. The user knows what they mean; the model only sees the compressed, underspecified surface form. The PPS paper attacks that directly. Across 60 tasks, 3 domains, 3 models, and 540 generations, natural-language-rendered PPS outperformed both simple prompts and raw JSON on goal alignment. The gains were strongest in ambiguous business tasks, weaker in technical tasks, and actually reversed in low-ambiguity travel planning. That last part matters. It means structured prompting is not magic; it is most valuable when the user’s objective is fuzzy enough to be misread. The preliminary survey result is also practical: fewer follow-up rounds, from 3.33 to 1.13 on average.
Why we should care: a lot of agent failure is downstream of an upstream ambiguity. If the goal, audience, constraints, tone, and success criteria are hazy at turn one, the rest of the trajectory is just confident error propagation. The other useful lesson here is that raw structure is not enough. In the study, rendered PPS beat raw JSON. So the knob is not “add schema everywhere.” It is “make intent explicit in a form the model can actually use.”
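The knob is small enough to sketch. Below is a minimal renderer that turns a structured intent spec into natural language; the field names (goal, audience, constraints, success criteria) are my guess at PPS-like structure, not the paper's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class IntentSpec:
    # Hypothetical fields; the actual PPS schema in the paper may differ.
    goal: str
    audience: str
    constraints: list = field(default_factory=list)
    success_criteria: list = field(default_factory=list)

def render_intent(spec: IntentSpec) -> str:
    """Render the spec as prose rather than raw JSON, since the study
    found natural-language-rendered PPS beat raw JSON on goal alignment."""
    lines = [f"Goal: {spec.goal}", f"Audience: {spec.audience}"]
    if spec.constraints:
        lines.append("Constraints: " + "; ".join(spec.constraints))
    if spec.success_criteria:
        lines.append("Success looks like: " + "; ".join(spec.success_criteria))
    return "\n".join(lines)

prompt = render_intent(IntentSpec(
    goal="Draft a Q3 pricing memo",
    audience="non-technical executives",
    constraints=["one page", "no jargon"],
    success_criteria=["names a recommended price", "lists two risks"],
))
```

The point is not the renderer; it is that the questions get asked at turn one instead of turn three.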
Paper: https://arxiv.org/abs/2603.18976
2. Prompt repetition: a surprisingly cheap robustness hack
This one sounds silly until you think about what causal attention is doing.
The core idea is simple: when reasoning is not enabled, repeat the prompt. The paper shows that prompt repetition improved accuracy across popular models without increasing output length or materially increasing latency in most settings. On their experiments, it won 47 of 70 benchmark-model combinations with 0 losses when reasoning was disabled. The gains were especially strong when the prompt order was hostile to the model, such as options-first multiple choice, and on custom tasks like NameIndex and MiddleMatch.
Why we should care: not every model call inside an agent should be a long reasoning trace. Some calls are cheap subroutines: route this, extract that, normalize this, draft tool arguments, re-check the user constraint. For those non-reasoning calls, prompt repetition looks like a genuinely useful default knob. It is not a substitute for reasoning, and it is not free on arbitrarily long prompts, but for short operational subcalls it is exactly the kind of low-cost robustness trick teams tend to underrate.
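As a default knob this is nearly trivial to implement. A sketch, with a separator phrase and a length cutoff that are my illustrative choices rather than anything from the paper:

```python
def with_repetition(prompt: str, max_chars: int = 8000) -> str:
    """Repeat the prompt once for short, non-reasoning subcalls.
    Skip repetition on long prompts, where the extra input tokens
    stop being negligible. Separator and cutoff are illustrative."""
    if len(prompt) > max_chars:
        return prompt
    return prompt + "\n\nThe request again, verbatim:\n\n" + prompt
```

Wrap it around the cheap subroutine calls (routing, extraction, argument drafting) and leave the long reasoning calls alone.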
Paper: https://ar5iv.labs.arxiv.org/html/2512.14982v1
3. GLM-5: do not buy agentic RL by deleting earlier skills
Sequential post-training has a nasty habit: each new stage can quietly sand down the thing the previous stage got good at.
GLM-5 is interesting partly because it says that out loud. Their pipeline runs sequential RL stages for reasoning, then agentic behavior, then general helpfulness, and then uses on-policy cross-stage distillation as a final refinement to recover skills from earlier stages. Previous stage checkpoints become teachers; the final pass is meant to stop the classic “more agentic, less sharp” tradeoff from becoming acceptable collateral damage. On their reported benchmarks, GLM-5 posts about a 20% average gain over GLM-4.7 across agentic, reasoning, and coding tasks, including 77.8 on SWE-bench Verified.
The more durable lesson is even lower-level than that. The paper is refreshingly explicit that agentic RL stability lives in systems details as much as in objectives. They switched to a deterministic top-k operator because nondeterministic sparse-attention selection caused sharp RL degradation, froze the indexer during RL for stability, and emphasized token-in-token-out handling so the trainer learns on exactly the same token stream produced by the rollout engine. That is the kind of detail that separates “agent demo” from “agent training system.”
Why we should care: a lot of agent improvement work still acts as if new capabilities can simply be stacked. In practice, they interfere. If post-training adds planning but dulls reasoning, or adds autonomy but destabilizes the learning loop, you have not really improved the agent. You just moved the failure somewhere harder to notice.
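The deterministic top-k point is worth making concrete. A toy version, breaking ties by position so identical scores always yield an identical token set (this is the property that matters; GLM-5's actual kernel is obviously not this):

```python
def deterministic_topk(scores, k):
    """Select top-k indices with ties broken by lowest index, so the same
    scores always produce the same selection. A nondeterministic kernel can
    return different tie winners run to run, which means the trainer and
    rollout engine stop seeing the same token stream."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])

# Three-way tie at 0.9: the two lowest indices win, every time.
selected = deterministic_topk([0.9, 0.5, 0.9, 0.9], 2)
```

Token-in-token-out is the same discipline one level up: whatever the rollout engine emitted is exactly what the trainer learns on.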
Paper: https://ar5iv.labs.arxiv.org/html/2602.15763v2
4. FullStack-Agent: round-trip the artifact, then test the hidden surfaces
I like this one because it attacks a very real agent failure mode: the frontend looks right, the demo works, and the backend is still fake.
FullStack-Agent combines three ideas. First, a multi-agent development workflow with specialized debugging tools for frontend and backend work. Second, FullStack-Bench, which evaluates not just frontend behavior but backend APIs and database state as well. Third, Repository Back-Translation, which converts existing real-world repositories into agent trajectories the model can learn from. The benchmark itself is notable: 101 instructions, 647 frontend tests, 604 backend tests, and 389 database tests. Even better, frontend success is not counted unless the required database interaction is real. That is exactly the kind of hidden-surface check agent evaluation needs more of.
The results are strong, but the more interesting pattern is the training and evaluation shape. FullStack-Dev with a Qwen backbone reached 64.7 frontend, 77.8 backend, and 77.9 database accuracy, while FullStack-Learn improved a 30B model through self-improvement using repository back-translation and augmentation. The debugging tools also mattered a lot: removing the backend debugging tool increased average backend iterations from 74.9 to 115.5. That is not just a model story. It is a workflow design story.
Why we should care: reliable coding agents need falsifiable artifacts. A useful practical extension of this idea is round-tripping: code to spec, spec back to code, compare the two, and inspect the mismatch. That creates a verifier surface instead of treating the codebase as one opaque blob. More broadly, the paper is a reminder that real artifacts and real tests are better teachers than synthetic vibes.
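A hidden-surface check can be very small and still be falsifiable. A sketch of the idea; the field names and response shape are illustrative, not from FullStack-Bench:

```python
def hidden_surface_check(api_response: dict, db_row):
    """Only count a frontend success if the backend and database surfaces
    actually moved too: the API answered, a row was persisted, and the
    persisted data matches what the API claimed."""
    checks = {
        "api_ok": api_response.get("status") == 200,
        "row_persisted": db_row is not None,
        "fields_match": db_row is not None
            and db_row.get("email") == api_response.get("email"),
    }
    return all(checks.values()), checks
```

The per-check breakdown matters as much as the boolean: when the agent fakes the backend, you want to see exactly which surface was hollow.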
Paper: https://arxiv.org/html/2602.03798v1
5. MSA: memory should be part of the model, not a retrieval afterthought
Agents with long histories do not just need more context. They need memory that stays usable when the history becomes absurd.
MSA pushes that idea hard. The paper proposes an end-to-end trainable memory framework with sparse attention, document-wise RoPE, KV compression, and a Memory Parallel inference path. The headline is the kind of number people usually ignore until it becomes operationally relevant: less than 9% degradation while scaling from 16K to 100M tokens, with 100M-token inference on 2xA800 GPUs. The other important piece is Memory Interleave, which alternates retrieval, context expansion, and generation so the model can reason across scattered memory segments instead of just pulling one flat chunk and hoping.
Why we should care: a lot of current agent memory stacks are really retrieval pipelines wearing a memory costume. That works until the task needs long-range consistency, multi-hop evidence integration, or stable persona/state over time. MSA is interesting because it tries to make memory intrinsic and differentiable rather than bolted on. The real caveat is operational: the current setup still relies on offline pre-encoding of the corpus. So it is not a universal replacement for dynamic knowledge systems yet. But as a direction, it is much closer to agent memory than “just add bigger RAG.”
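The interleave pattern itself is easy to state. A control-flow sketch, assuming stand-in callables for the retrieval, expansion, and generation components; in MSA these are trained end to end, not callbacks:

```python
def memory_interleave(query, memory, retrieve, expand, generate, max_rounds=3):
    """Alternate retrieval, context expansion, and generation, instead of
    one flat retrieval pass. Each round can pull segments the previous
    draft revealed were missing."""
    context, draft = [], ""
    for _ in range(max_rounds):
        hits = retrieve(query, draft, memory)   # pull scattered segments
        context = expand(context, hits)         # widen around each hit
        draft = generate(query, context)        # reason over what is gathered
    return draft
```

The contrast with flat RAG is the loop: retrieval gets to react to the partial answer rather than firing once up front.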
Paper: https://arxiv.org/abs/2603.23516
6. Verifier–compiler loops: verification is becoming its own stack
This is the one I keep coming back to.
The core production fact is ugly and simple: long workflows multiply small defects. In the verifier–compiler loop framing, a 1% failure rate across 100 steps leaves only about 36.6% end-to-end success. Even 0.1% per-step failure still leaves only about 90.5%. That is the march-of-nines problem. The implication is that agent reliability is not mainly a prompt problem. It is an error-correction problem. The system needs to observe the episode, judge it against institutional standards, intervene conservatively, replay changes before release, and keep durable evidence of what changed and why. That is also why the distinction between execution knowledge and institutional judgment matters: the agent can know the facts and still fail the organization.
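The arithmetic behind the march of nines is one line, assuming independent per-step failures:

```python
def end_to_end_success(per_step_failure: float, steps: int = 100) -> float:
    """Independent per-step failures compound multiplicatively."""
    return (1 - per_step_failure) ** steps

# 1% per-step failure over 100 steps  -> ~0.366 end-to-end success
# 0.1% per-step failure over 100 steps -> ~0.905
```

Which is exactly why per-step polish plateaus and error correction does not: halving the failure rate buys far less than catching and repairing failures mid-trajectory.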
Recent judge work mostly points in the same direction. JudgeBench shows hard evaluator tasks are genuinely hard, with strong models like GPT-4o only slightly above random on some challenging judge settings. RewardBench 2 makes reward evaluation meaningfully harder than RewardBench 1 and emphasizes correlation with downstream use. DeepSeek’s GRM/SPCT line is also important because it argues that reward modeling itself can scale with more inference compute through principle generation, critique, and voting, not just with bigger training runs.
But the field is also getting more honest about calibration. Evaluative Fingerprints found near-zero inter-judge agreement while also showing that judges are individually stable enough to be fingerprinted from their rubric behavior. In other words: they are not random, they are systematically different. Separate work on LLM-as-a-judge reporting shows that evaluator bias and uncertainty should be corrected statistically, not hand-waved. On user simulation, the news is similarly mixed: SimulatorArena suggests profile-conditioned simulators can track human judgments reasonably well on some tasks, but Lost in Simulation shows simulator choice can move measured success rates by up to 9 points and systematically miscalibrate difficulty.
Why we should care: one judge score is not a control system. High-reliability agents are going to need judge stacks, not judge monocultures: crisp gates for obvious defects, stronger reasoning judges for nuance, replay before release, disagreement review for hard cases, and humans on the highest-risk boundaries. Simulation will help widen coverage, but only if it is continuously calibrated against real traces.
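A judge stack in miniature, with routing labels and the escalate-on-disagreement policy as my illustrative choices:

```python
def judge_stack(trace, cheap_gates, strong_judges):
    """Layered evaluation: crisp gates reject obvious defects cheaply,
    stronger reasoning judges vote on the rest, and disagreement
    escalates to human review instead of being averaged away."""
    for gate in cheap_gates:              # e.g. schema checks, banned actions
        if not gate(trace):
            return "reject"
    votes = [judge(trace) for judge in strong_judges]
    if all(votes):
        return "pass"
    if not any(votes):
        return "reject"
    return "human_review"                 # judges disagree: escalate
```

The near-zero inter-judge agreement result is the argument for that last branch: systematically different judges disagreeing is signal, not noise to be averaged.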
Blog: https://www.equationblog.com/p/the-verifiercompiler-loop-turning
7. IndexCache: systems work is reliability work too
This one is more infrastructure than alignment, but it belongs in the same conversation.
IndexCache starts from a simple observation: in sparse attention, adjacent layers often choose very similar top-k token sets. So instead of recomputing the indexer at every layer, reuse it across layers. On the reported results, that removes up to 75% of indexer computation with negligible quality loss, while reaching 1.82x prefill speedup and 1.48x decode speedup at 200K context. The paper also reports 70–100% top-k overlap across adjacent layers, which is the structural reason the trick works.
Why we should care: efficiency is not separate from reliability. Every unit of inference cost you remove from the serving path can be reinvested into something reliability-shaped: longer context, more retrieval, more search, more verifier passes, more replay budget, or simply lower latency at the same control quality. That is why inference-side engineering keeps mattering more than people think.
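The structural observation is easy to check in miniature. A toy sketch of measuring adjacent-layer top-k overlap and reusing the previous layer's selection when it is high; in a real kernel the reuse decision would avoid computing the indexer at all, whereas here we compare against the actual selection just to illustrate the structure:

```python
def topk_overlap(a, b):
    """Fraction of one layer's top-k token set that the next layer shares."""
    return len(set(a) & set(b)) / len(a)

def reuse_indexer(layer_topk_sets, threshold=0.7):
    """Reuse the previous layer's selection when overlap clears the
    threshold; otherwise recompute. Threshold is illustrative, motivated
    by the reported 70-100% adjacent-layer overlap."""
    shared, recomputed = [layer_topk_sets[0]], 1
    for prev, cur in zip(layer_topk_sets, layer_topk_sets[1:]):
        if topk_overlap(prev, cur) >= threshold:
            shared.append(shared[-1])     # reuse: indexer skipped
        else:
            shared.append(cur)            # recompute for this layer
            recomputed += 1
    return shared, recomputed
```

If 70-100% of the selection is shared, most of those recomputations never happen, which is where the reported up-to-75% indexer savings come from.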
Paper: https://arxiv.org/html/2603.12201v1
The connective tissue
If I had to compress the direction into one line, it is this: reliable agents are becoming layered control systems.
Structured intent reduces loss before the trajectory begins. Prompt repetition stabilizes cheap non-reasoning subcalls. Post-training methods like cross-stage distillation try to make new capabilities additive instead of destructive. Artifact-grounded training and hidden-surface testing make agent outputs more falsifiable. Long-memory work tries to decouple memory capacity from reasoning quality. Judge research is forcing evaluation to become calibrated, replayable, and auditable. Systems work buys the budget to do more of all of it in real time.
Just a growing stack of knobs that make agent behavior narrower, more inspectable, and a little less mysterious week over week.


