Control knobs from recent LLM papers
January 2026 Update
The field is moving extremely fast right now. The half-life of a new idea is measured in weeks (sometimes days), and it is getting harder to tell what will actually stick.
Most weeks, the discourse around language models collapses into one of two modes:
“Look at this new leaderboard bump.”
“Agents are coming, everything changes.”
Both can be true - and still miss the point.
The stuff that actually moves production outcomes tends to look like new control knobs: better ways to steer models, keep them stable, make them faster, and make failures more legible.
Credit where it is due: a bunch of these papers came to me via Andy Wong, who consistently surfaces great work. We usually end up debating the implications together before it shows up here.
So here is my January reading stack: papers that feel unusually primitive-shaped. Each one adds a knob I expect we will keep using.
1. Recursive Language Models: “infinite” context via self-calls (no retraining)
Recursive Language Models (RLMs) are a different answer to long context: do not cram the prompt into the transformer - treat it like part of the environment.
One concrete instantiation: the prompt becomes a variable inside a Python REPL, and the model writes code to inspect the prompt, decompose it, and recursively call sub-instances of itself over slices of the prompt.
They report handling inputs two orders of magnitude beyond typical context windows - and even mention strong performance at the 10M+ token scale.
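A minimal sketch of the shape, to make it concrete. The `llm` helper and the fixed `chunk_size` split are my stand-ins, not the paper's code - in the actual RLM the model writes this decomposition loop itself inside the REPL, which is the whole point:

```python
# Illustrative sketch of the RLM pattern, not the paper's implementation.
# Assumes a hypothetical llm(prompt: str) -> str helper that calls a model.

def rlm_query(question: str, context: str, chunk_size: int = 50_000) -> str:
    """Answer a question over a context too large for a single model call."""
    if len(context) <= chunk_size:
        # Base case: the slice fits, ask the model directly.
        return llm(f"Context:\n{context}\n\nQuestion: {question}")

    # Recursive case: slice the context, query each slice with a sub-call,
    # then ask the model to synthesize the partial answers.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    partials = [rlm_query(question, chunk, chunk_size) for chunk in chunks]
    return llm(
        "Combine these partial answers into one final answer.\n"
        + "\n---\n".join(partials)
        + f"\n\nQuestion: {question}"
    )
```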
Why we should care:
This pushes long context from architecture into systems. Not ‘train longer,’ but ‘reason out-of-core.’
It is an agent-shaped pattern: when the model can write the loop it thinks inside, you get a new class of tool-use + decomposition behaviors.
It reframes the bottleneck: the limit becomes less context length and more how good the model is at building the right indexing + recursion strategy.
2. LLMdoctor: alignment at test-time, token by token
Most alignment work still assumes you either fine-tune the whole model (expensive, slow, often brittle), or you do test-time tricks that are coarse (trajectory-level) and compute-hungry.
LLMdoctor proposes a clean separation: keep a big ‘patient’ model frozen, and steer it with a smaller ‘doctor’ model using token-level signals.
Their claim is that many test-time alignment methods rely on distorted trajectory-level rewards or inefficient sampling that caps performance and harms diversity. The patient-doctor setup extracts token-level preference signals from the patient’s behavioral variations, then trains the doctor via token-level flow-guided preference optimization (TFPO) to preserve diversity while aligning outputs.
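The decode-time half of the idea fits in a few lines. This is the generic shape of token-level guided decoding, not LLMdoctor's TFPO objective (their contribution is how the doctor gets trained, which this skips); `beta` and the `doctor_scores` interface are my stand-ins:

```python
import torch

# Sketch: a frozen "patient" proposes next-token logits, a small "doctor"
# supplies a per-token preference score, and we decode from the blend.

@torch.no_grad()
def steered_next_token(patient_logits: torch.Tensor,
                       doctor_scores: torch.Tensor,
                       beta: float = 1.0) -> int:
    """patient_logits, doctor_scores: shape (vocab_size,)."""
    # The doctor nudges per-token log-probabilities; beta controls how hard we steer.
    blended = torch.log_softmax(patient_logits, dim=-1) + beta * doctor_scores
    return int(torch.argmax(blended))
```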
Why we should care:
Steering becomes modular. Iterate on the doctor without re-baking the patient.
Granularity matters. Token-level intervention is the difference between ‘mostly aligned’ and ‘aligned where it counts.’
Closer to how agents fail. Agents do not fail at the end of the trajectory - they fail mid-trajectory.
3. Entropy-Adaptive Fine-Tuning: a practical take on “don’t forget”
Supervised fine-tuning is still the workhorse for specialization - and “catastrophic” (or I would rather say “annoying”) forgetting is still the bill we pay.
This paper’s framing is crisp: it contrasts SFT with on-policy RL and argues the gap comes from distribution mismatch. In RL, the model’s learning signal is more consistent with its internal beliefs; in SFT, the model is forced to fit external supervision even when that conflicts sharply with what it ‘knows.’
They focus on confident conflicts: cases where the label token is low-probability under the model, while the model’s distribution is low entropy (i.e., it is confidently predicting something else). That is where gradients get destructive.
Their proposal, Entropy-Adaptive Fine-Tuning (EAFT), uses token-level entropy as a gating mechanism: learn aggressively when the model is uncertain; suppress gradients when the model is confident-but-disagreeing.
From my most recent post: I also think EAFT is a genuinely useful alternative to LoRA in the ‘don’t wreck the base model’ sense - rather than constraining where we update (parameter-efficient adapters), EAFT constrains when updates should matter (skip the destructive ones).
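Here is my reading of the gate as a loss mask - the thresholds are made up for illustration, and the paper's actual weighting is smoother than this hard cutoff:

```python
import torch
import torch.nn.functional as F

# Sketch of entropy-gated SFT loss: skip gradients on tokens where the model is
# confident (low entropy) but disagrees with the label (low label probability).

def eaft_loss(logits: torch.Tensor, labels: torch.Tensor,
              entropy_threshold: float = 1.0,
              label_prob_threshold: float = 0.1) -> torch.Tensor:
    """logits: (seq, vocab), labels: (seq,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)                       # (seq,)
    label_prob = probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (seq,)

    # "Confident conflict": low entropy AND low probability on the supervised label.
    conflict = (entropy < entropy_threshold) & (label_prob < label_prob_threshold)
    token_loss = F.cross_entropy(logits, labels, reduction="none")
    return (token_loss * (~conflict).float()).mean()
```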
Why we should care:
This is the kind of idea that turns continuous updates from scary to feasible.
It maps to a real production vibe: most of the time we want to learn; sometimes we want to refuse the lesson.
It is a different safety knob than LoRA - but it is targeting the same anxiety: regressions.
4. From Entropy to Epiplexity: measuring “useful information” for bounded learners
Data quality is still the hidden kingmaker. The hard part is: we are not data-rich, we are signal-poor.
This paper asks a deceptively simple question: can we quantify learnable content in data without tying it to a downstream task?
They argue classic information measures (Shannon entropy, Kolmogorov complexity) do not capture what matters for computationally bounded learners, and they propose a new measure: epiplexity.
The vibe: epiplexity is meant to capture structural content while excluding ‘time-bounded entropy’ (random/unpredictable content), and the authors claim it helps explain why deterministic transformations and data ordering can still create useful learnable structure in practice.
Why we should care:
If inputs are becoming the product, we eventually want a metric for the informational value of inputs.
Epiplexity feels like a step toward data selection as an engineering discipline, not an art project.
5. LLaDA2.0: diffusion language models to 100B
Autoregressive decoding is powerful, but it is fundamentally serial.
LLaDA2.0 pushes discrete diffusion language models to 100B parameters via a conversion process: take a pretrained AR model and convert it to a dLLM using a 3-phase block-level training scheme (warm-up with increasing block size, stable full-sequence diffusion, decay back to compact block diffusion).
They also discuss post-training alignment with SFT and DPO, framing this as a path to frontier-scale efficiency while preserving parallel decoding advantages.
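For intuition on why parallel decoding matters, here is a generic masked-diffusion decode loop - not LLaDA2.0's block scheduler; `mask_id`, `steps`, and the "unmask half the remaining positions" rule are mine:

```python
import torch

# Sketch of dLLM-style decoding: start from all-mask and, at each step,
# commit the most confident predictions in parallel instead of one token at a time.

@torch.no_grad()
def diffusion_decode(model, length: int, mask_id: int, steps: int = 8) -> torch.Tensor:
    tokens = torch.full((1, length), mask_id)
    for _ in range(steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = model(tokens)                      # (1, length, vocab)
        confidence, prediction = logits.softmax(dim=-1).max(dim=-1)
        # Unmask the top half of remaining positions by confidence, all at once.
        k = max(1, int(still_masked.sum()) // 2)
        conf = confidence.masked_fill(~still_masked, -1.0)
        idx = conf.topk(k, dim=-1).indices[0]
        tokens[0, idx] = prediction[0, idx]
    return tokens
```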
Why we should care:
Parallel decoding is not just a speed story. It changes how we can spend compute at inference time.
Faster sampling = more room for verification, search, and self-checking within real latency budgets.
6. PoPE: decoupling the “what” and “where” in positional embeddings
I have a soft spot for papers that say: ‘this popular thing is entangled in a way that quietly hurts you,’ and then fix it cleanly.
PoPE (Polar Coordinate Positional Embeddings) argues RoPE entangles content (what) and position (where), which can impair tasks requiring independent matching on the two. They propose PoPE to remove the confound, show better performance on diagnostics and across sequence modeling domains, and highlight strong zero-shot length extrapolation vs RoPE - and even vs YaRN.
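To see the entanglement they are pointing at, here is RoPE on a single frequency pair (the `freq` value stands in for one of RoPE's per-pair frequencies). The attention score depends on content and relative offset through the same rotated dot product, so "what matches" and "where it sits" cannot be read out independently - which is the confound PoPE removes:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: int, freq: float = 0.01) -> torch.Tensor:
    """x: (2,) one frequency pair of a query/key; rotate by a position-dependent angle."""
    angle = torch.tensor(pos * freq)
    cos, sin = torch.cos(angle), torch.sin(angle)
    return torch.stack([x[0] * cos - x[1] * sin, x[0] * sin + x[1] * cos])

q, k = torch.randn(2), torch.randn(2)
# The score is a joint function of q, k AND the relative offset (7 - 3).
score = rope_rotate(q, pos=7) @ rope_rotate(k, pos=3)
```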
Why we should care:
Long context is now table stakes for serious agent workflows.
‘Works at 8k’ is not the same as ‘behaves at 80k.’
DeepSeek + stuff that ships
Engram: conditional memory as a second sparsity axis
The Whale does it again. MoE gave us conditional computation. But knowledge lookup is still mostly simulated via dense compute.
Engram proposes conditional memory - a complementary axis of sparsity - implemented via an O(1) lookup module modernizing classic N-gram embeddings.
They describe a ‘Sparsity Allocation’ tradeoff between neural computation (MoE) and static memory (Engram), claim a U-shaped scaling law, and report scaling Engram to 27B parameters with gains not just on knowledge tasks but also reasoning, code/math, and long-context retrieval.
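My mental model of "conditional memory" as an O(1) lookup - the hashing scheme, table size, and where the output gets added are my guesses for illustration, not Engram's architecture:

```python
import torch
import torch.nn as nn

# Sketch: hash the last n token ids into a big embedding table and add the result
# to the hidden state, so some associations come from memory lookups instead of
# being re-derived by dense compute every forward pass.

class NgramMemory(nn.Module):
    def __init__(self, table_size: int = 1_000_000, dim: int = 512, n: int = 3):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)
        self.table_size = table_size
        self.n = n

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        """token_ids: (seq,), hidden: (seq, dim)."""
        out = hidden.clone()
        for t in range(self.n - 1, token_ids.shape[0]):
            ngram = tuple(token_ids[t - self.n + 1 : t + 1].tolist())
            slot = hash(ngram) % self.table_size      # O(1) lookup per position
            out[t] = out[t] + self.table.weight[slot]
        return out
```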
mHC: Manifold-Constrained Hyper-Connections
This zooms in on a real training pathology: expanding residual streams/connectivity can improve performance, but it can also break the identity mapping property residual connections rely on - leading to instability and scalability issues.
mHC proposes projecting the residual connection space onto a manifold to restore identity mapping while keeping things efficient.
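A rough sketch of the failure mode and the fix, with the caveat that the per-row softmax below is a crude stand-in for the paper's manifold projection, not their construction:

```python
import torch

# With n expanded residual streams, an unconstrained mixing matrix H can drift
# away from anything resembling an identity mapping as layers stack. Projecting
# H onto a constrained set (here: each row becomes a convex combination of
# streams) is one simple way to keep the mix well-behaved.

def project_mixing(H: torch.Tensor) -> torch.Tensor:
    """H: (n_streams, n_streams) mixing matrix for the expanded residual streams."""
    return torch.softmax(H, dim=-1)

def hyper_residual(streams: torch.Tensor, block_out: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """streams: (n_streams, dim), block_out: (dim,). Mix streams, then add the block output."""
    mixed = project_mixing(H) @ streams
    return mixed + block_out  # block output broadcast onto every stream
```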
DeepSeek-R1: RL-first reasoning + the GRPO refresher
Even if you are excited about the next base model drop, R1’s training recipe is the more durable lesson.
The core claim: reasoning behaviors can emerge via pure RL (with a cold-start SFT phase for readability/stability), and they lean on GRPO - which is worth revisiting if your PPO mental model is rusty.
Quick intuition: GRPO drops the critic and estimates a baseline from grouped samples, which matters a lot for scaling RL in LLM land.
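The core trick fits in a couple of lines: sample a group of completions per prompt, score them, and use the group statistics as the baseline instead of a learned value model.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) scalar rewards for completions of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one prompt, scored by a verifier.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
# Better-than-average answers get positive advantage; the policy update then
# upweights their tokens (with the usual clipped ratio and a KL term to a reference).
```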
Solving LLM repetition in production
This one earns points for being unapologetically real: repetition loops that stall batch tasks.
They identify repetition patterns, frame the root cause via Markov analysis + greedy decoding getting stuck in loops, and evaluate mitigations: beam search with early_stopping=True (universal post-hoc), presence_penalty (case-specific), and DPO fine-tuning (model-level universal).
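For reference, the presence-penalty knob is simple enough to write out by hand: subtract a flat penalty from the logits of any token that has already appeared, which makes greedy decoding less likely to fall back into the same loop. In a serving stack you would normally set this as a sampling parameter rather than roll it yourself:

```python
import torch

def apply_presence_penalty(logits: torch.Tensor,
                           generated_ids: torch.Tensor,
                           penalty: float = 0.5) -> torch.Tensor:
    """logits: (vocab_size,), generated_ids: (seq_len,) tokens emitted so far."""
    penalized = logits.clone()
    # Flat penalty for presence (unlike a frequency penalty, which scales with count).
    penalized[generated_ids.unique()] -= penalty
    return penalized
```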
The connective tissue
If I had to summarize the direction across these papers in one line: the real progress is new control knobs - inference-time recursion for extreme context, token-level steering, entropy-gated learning, explicit memory, better information measures, disentangled position representations, and faster decoding.
The fun part: these knobs compound.


