The Equation

Who Owns Engineering Judgment?

Ruslan Belkin — Sat, 20 Jun 2026 14:15:34 GMT

Price per token is not a cost model. It is an input to a cost model.

The AI engineering conversation is still too focused on the wrong budget.

Tokens, model pricing, and vendor leverage all matter. But the real budget is not tokens.

The real budget is senior human attention: the moments when a model output needs correction, interpretation, architectural judgment, security review, release approval, customer-risk ownership, or institutional context.

That is where the system breaks.

There is an even deeper question underneath it: As models get better, who owns engineering judgment?

One answer is: the model vendors.

That answer is less crazy than people want to admit.

Coding agents are moving beyond autocomplete. The full software development lifecycle is increasingly in scope for AI assistance: planning, design, implementation, testing, review, deployment, and operations. Agents can draft implementations, modify code across many files, fix build errors, write tests, and produce diff-ready change sets.

So yes, more workflow logic will move closer to the model.

That includes planning, review, risk detection, test selection, escalation, human handoff, evidence production, and learned standards of “good engineering.”

The question is not whether this happens. It already has. The question is what companies should still own.

The uncomfortable fork

There are two lazy versions of the future.

The first says:

Model vendors will own everything.

They have the data. They have the compute. They have the distribution. They have the feedback loop. They will absorb all relevant engineering intelligence, including when to involve humans. Your internal workflows will become a thin wrapper around their agents.

The second says:

Models are just utilities.

They are like electricity or cloud compute. Useful, interchangeable, commoditized. The real value stays entirely in business context, process, data, and execution.

Both are partially true.

Both are incomplete.

The more interesting version is this:

Model vendors will own more generic intelligence than most companies expect. Great companies will still own situated intelligence: context, authority, verification, accountability, and the compounding memory of their own decisions.

That is the distinction I care about.

Not “AI versus human.”

Not “frontier model versus cheap model.”

Not “central platform team versus product teams.”

The real boundary is:

Let the model own heuristics. Keep institutional authority outside the model.

The price card has become strategically relevant again

For a while, it was fashionable to say model cost would round to zero.

Maybe someday. Not yet.

As of June 15, 2026, public token prices still vary by orders of magnitude across model families, context tiers, output volume, caching, priority mode, batch mode, and whether you are using a seat plan, credit system, or direct API.

The point is not to memorize today’s prices. They will change. The point is that model choice is now an operating decision.

OpenAI’s public API page currently lists GPT-5.5 at $5.00 per million input tokens, $0.50 per million cached input tokens, and $30.00 per million output tokens under standard processing for shorter context; it lists GPT-5.4 at $2.50 / $0.25 / $15.00 and GPT-5.4 mini at $0.75 / $0.075 / $4.50 for input, cached input, and output respectively.

Anthropic’s Claude API pricing page currently lists Claude Fable 5 at $10 input / $1 cache hit / $50 output per million tokens, Claude Opus 4.8 at $5 / $0.50 / $25, Claude Sonnet 4.6 at $3 / $0.30 / $15, and Claude Haiku 4.5 at $1 / $0.10 / $5. The same page says prompt-cache reads cost 0.1x base input price and batch processing applies a 50% discount to both input and output tokens.

Google’s Gemini API pricing currently lists Gemini 3.1 Pro Preview at $2 input / $12 output per million tokens for prompts up to 200K tokens, and $4 / $18 above 200K. It also lists Gemini 3.5 Flash at $1.50 / $9, Gemini 3.1 Flash-Lite at $0.25 / $1.50, and Gemini 2.5 Flash-Lite at $0.10 / $0.40, with output pricing including thinking tokens.

Public token prices still vary by more than 100x across model tiers; model selection is now an operating discipline.

The first data point executives should internalize: the same million-token workflow can cost cents, dollars, tens of dollars, or more depending on model, context, output, caching, priority, and tool mode.

The second data point matters more: the token bill is often not the expensive part.

The expensive part is the human recovery loop.

Price per token is the wrong denominator

The wrong denominator is: cost per token

The better denominator is: cost per accepted, reviewed, shipped change

A cheaper model that takes three attempts, creates drift, burns an hour of senior review, and leaves uncertainty around a risky migration may be more expensive than a frontier model that completes the work cleanly in one pass.

The token bill is visible.

The senior-engineer recovery cost is usually hidden.

The bad loop looks like this:

Human gives a loose prompt.
Agent starts coding.
Human watches nervously.
Agent drifts.
Human interrupts.
Agent patches around the correction.
Tests fail.
Human re-explains the system.
Final diff is stitched together from partial attempts.

That is not autonomy.

That is interrupt-driven delegation.

The better loop looks like this:

Human approves intent, constraints, risks, and success criteria.
Agent produces a plan before implementation.
Human gates the plan.
Agent implements against the approved plan.
A separate review agent attacks the diff.
Output is packaged as evidence.
Human reviews the evidence, resolves judgment calls, and owns merge/release readiness.
Useful discoveries become future context, tests, runbooks, evals, or skills.

The practical target is not “more AI usage.”

The practical target is:

fewer human corrections per accepted diff
fewer repeated attempts
shorter review time
lower escaped-defect rate
higher useful concurrency per senior engineer
more institutional knowledge captured per completed task

That is the engineering budget.

A simple cost model

Here is the model I would use for budgeting:

monthly model cost ≈ fresh input token cost + cached input token cost + output token cost + tool/runtime costs

More specifically:

Fresh input tokens: repo context, prompts, docs, issues, logs, traces.
Cached input tokens: repeated context, stable repo instructions, runbooks, API docs, prior plans.
Output tokens: code, tests, plans, reviews, explanations, retries.
Tool/runtime costs: web search, containers, managed agent runtime, code execution, priority/fast mode, storage, observability.

The simple version:

monthly cost ≈ fresh input MTok × input price + cached input MTok × cache-read price + output MTok × output price + tools
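The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a billing tool: the per-million-token prices are the GPT-5.4 rates quoted earlier ($2.50 / $0.25 / $15.00), while the token volumes, the tools figure, and the names `PriceCard` and `monthly_cost` are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceCard:
    """Prices in USD per million tokens (MTok)."""
    input: float
    cache_read: float
    output: float


def monthly_cost(fresh_mtok: float, cached_mtok: float, output_mtok: float,
                 prices: PriceCard, tools_usd: float = 0.0) -> float:
    """Apply the planning formula: volume times price per bucket, plus tools."""
    return (fresh_mtok * prices.input
            + cached_mtok * prices.cache_read
            + output_mtok * prices.output
            + tools_usd)


# GPT-5.4 rates as quoted above; the volumes are hypothetical planning inputs.
gpt_5_4 = PriceCard(input=2.50, cache_read=0.25, output=15.00)

# Same 100 MTok of input context, with and without cache reuse.
with_cache = monthly_cost(20, 80, 10, gpt_5_4, tools_usd=25)
no_cache = monthly_cost(100, 0, 10, gpt_5_4, tools_usd=25)
print(f"with caching: ${with_cache:.2f}, without: ${no_cache:.2f}")
```

With these made-up volumes, serving 80 of the 100 input MTok from cache takes the bill from $425 to $245, and output is the largest single line item in both cases.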

The scary part is output and retries.

The saving grace is caching.

Caching turns stable company context into a cost advantage. Anthropic’s pricing page says prompt-cache reads cost 0.1x base input price and that cache reads pay off quickly after reuse; OpenAI’s pricing page also shows cached-input rates materially below fresh-input rates for its flagship models.

So the first-order governance question is not: How do we cap tokens?

It is: How do we make the same context useful repeatedly without reloading the company from scratch every time?

Planning ranges I would use

Do not treat the following as a benchmark.

Treat it as a planning model.

Also: do not plan any serious AI use case below $100 per user per month.

That does not mean every user will spend $100. It means that below that floor, you are usually not planning for real usage, variance, retries, training, tool overhead, or bursts.

For engineering, I would not plan below $200 per engineer per month.

Again, not every engineer will spend it every month. But if the budget starts below $200, the organization is implicitly designing for toy usage, not production behavior.

OpenAI’s Codex rate card says a typical Codex task using GPT-5.5 may consume 5–45 credits, and that average Codex cost is roughly $100–$200 per developer per month, with large variance depending on model, instances, automations, and fast mode. That is a useful public anchor, but I would still set the engineering planning floor at $200 because real production engineering needs room for variance.

Planning envelopes are not entitlements or hard caps; govern through outcomes, not token volume.

Baseline AI access

Expected range: $100–$250 per user per month

Use for:

general productivity
research
writing
summarization
lightweight analysis
occasional technical help

This is the minimum serious floor. Below this, the budget usually becomes administrative theater.

Light engineering assistant

Expected range: $200–$750 per engineer per month

Use for:

code explanations
small scripts
autocomplete
simple unit tests
light PR summaries
documentation
low-risk helper workflows

This is where engineering access should start. A lower floor encourages underuse, rationing, or avoidance.

Routine agentic product engineer

Expected range: $500–$1,500 per engineer per month

Use for:

daily code generation
test generation
codebase exploration
small refactors
PR summaries
routine debugging
first-pass reviews

This is the zone where agents become part of the daily engineering loop, but not yet the primary implementation substrate.

AI-heavy product engineer

Expected range: $1,500–$5,000 per engineer per month

Use for:

long-running implementation loops
multi-file refactors
design-to-plan-to-code workflows
substantial test generation
repeated review cycles
larger repo context
high cached-context reuse

This is plausible when daily work moves from “assistant” to “delegated workflow.” It should be governed by accepted diffs, review load, and defect outcomes, not by token volume alone.

Senior, staff, or tech lead supervising parallel streams

Expected range: $3,000–$10,000 per user per month

Use for:

architecture exploration
cross-service refactors
high-risk PR review
migration planning
release readiness
adversarial review
parallel agent supervision
incident/post-incident investigation

This is the category where hard caps are most dangerous.

These people amplify many other people and many other agents. Blocking them to save tokens can waste the scarcest resource in the company: senior judgment.

AI platform, release, security, or agent-heavy team pool

Expected range: $10,000–$50,000+ per team per month

Use for:

private evals
routing experiments
skill registries
repo-context services
model comparisons
automated review infrastructure
autonomous regression hunts
release and incident automation
enterprise observability and governance

This should usually be a pooled budget, not a per-seat cap.

Platform, incident, release, and evaluation work are all bursty.

The rule should be: no blank checks, but no dumb throttling during approved high-leverage windows.

The danger of false economy

A more capable model can be cheaper even when its tokens are more expensive.

This feels counterintuitive only if you measure the wrong thing.

Assume a cheaper model takes three attempts, creates uncertainty, and consumes 75 minutes of senior review.

Assume a stronger model costs more in tokens but finishes cleanly and consumes 20 minutes of review.

At a fully loaded senior-engineering cost of even $150–$300 per hour, the second path can easily be cheaper.

Not because the tokens are cheaper.

Because the loop is cheaper.

Human recovery time can dominate model spend; optimize the complete loop, not just token price.
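A quick sketch of that arithmetic, under stated assumptions. The review minutes (75 versus 20) and the hourly rates ($150 and $300) come from the text; the token spend per path ($2 for the cheaper model across three attempts, $15 for the stronger one) is hypothetical, as is the function name `loop_cost`.

```python
def loop_cost(token_usd: float, review_minutes: float, hourly_rate_usd: float) -> float:
    """Cost of one accepted change: model spend plus senior review time."""
    return token_usd + review_minutes * hourly_rate_usd / 60


# Token spend per path is an assumption; review time and rates are from the text.
for rate in (150, 300):
    cheap = loop_cost(token_usd=2.0, review_minutes=75, hourly_rate_usd=rate)
    strong = loop_cost(token_usd=15.0, review_minutes=20, hourly_rate_usd=rate)
    print(f"${rate}/h: cheaper model ${cheap:.2f}, stronger model ${strong:.2f}")
```

At $150 per hour the 55-minute review gap alone is worth $137.50, so in this sketch the stronger model wins as long as its extra token spend stays below that.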

Use cheaper/default models for bounded work:

formatting
summarization
low-risk docs
simple scaffolding
rote transformations
test naming

Use stronger models when the value is uncertainty reduction:

ambiguous architecture
cross-service changes
security-sensitive paths
data migrations
release readiness
incident response
adversarial review
poorly documented internal systems

The mistake is not spending too much on frontier models.

The mistake is spending frontier tokens where cheap models are enough, then being cheap where a better model would save senior attention, reduce rework, or avoid defects.

The question is not: What is the cheapest model that can attempt this?

The better question is: What is the cheapest complete loop that can get this safely accepted?

AI amplifies the engineering system you already have

DORA’s 2025 research found near-universal AI adoption among software professionals: 90% of respondents reported using AI at work, and more than 80% believed it increased productivity. Google Cloud’s summary of that research also notes that 30% reported little or no trust in AI-generated code, which is exactly the tension engineering leaders need to design around.

That matches what technical leaders see in practice.

AI does not fix the engineering system.

It amplifies it.

If your specs are crisp, tests are fast, ownership is clean, docs are usable, review norms are strong, and release paths are well instrumented, agents multiply the system.

If your specs are ambiguous, tests are flaky, docs are stale, ownership is fuzzy, and senior review is overloaded, agents multiply that too.

More generated code without more validation capacity is not productivity.

It is inventory.

The model should not become the control plane

There is a subtle danger as agents get more competent.

The workflow starts cleanly:

model proposes a plan
model writes code
model runs tests
model reviews the diff
model summarizes risk
model recommends human approval

Then the next step is tempting:

model decides whether human approval is needed

Then:

model decides which human
model decides whether evidence is sufficient
model decides whether a policy applies
model decides whether release is safe

At that point, the model is no longer just a cognitive engine.

It is becoming an authority layer.

This is where I would be conservative.

OpenAI’s Model Spec is a useful public example of how much behavior-level logic is moving into models: instruction following, conflict resolution, intended defaults, safety boundaries, and agentic side-effect control. OpenAI describes the Model Spec as a formal framework for desired model behavior and a target for training and evaluation, while also saying it is not a claim that models behave that way perfectly today.

That work is necessary.

It also proves the point: model behavior is itself a control surface.

If your engineering workflow depends on hidden model defaults, provider-side routing, or behavior updates you cannot inspect, you do not really own the control plane.

You are renting it.

The answer is not to distrust model vendors by default.

The answer is to separate intelligence from authority.

The model can advise.

The system must decide.

The model can classify.

The policy layer must enforce.

The model can produce evidence.

The evidence standard must be external.

The model can remember patterns.

The company must own durable memory.

Heuristics inside the model, invariants outside the model

This is the cleanest boundary I see today.

The model suggests heuristics. The control plane keeps authority, approvals, and auditability.

A heuristic is: this looks like a security-sensitive change

An invariant is: security-sensitive changes require named approval, passing checks, rollback notes, and an audit trail

A heuristic is: these tests are probably relevant

An invariant is: no merge without required checks

A heuristic is: this migration resembles prior incidents

An invariant is: customer-data migrations require staged rollout and post-release monitoring

A heuristic is: this evidence package looks complete

An invariant is: every agent-authored PR records the model, prompt/context version, tools used, tests run, unresolved risks, and human approver

A heuristic is: a human should review this

An invariant is: the system names which human reviews it, under what SLA, and with what responsibility
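One way to picture the split, as a minimal sketch rather than a real policy engine. The class `ChangeRequest`, the label string, and the idea of combining model labels with company-owned rule labels are all assumptions for illustration. The point is only that the model's label is an input, and the merge decision is made by code the company owns.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    checks_passed: bool                                   # recorded by CI, not by the model
    model_labels: set[str] = field(default_factory=set)   # heuristic, advisory
    rule_labels: set[str] = field(default_factory=set)    # e.g. path-based rules
    named_approver: str | None = None
    rollback_notes: str = ""


def merge_violations(cr: ChangeRequest) -> list[str]:
    """Return the invariants that block merge; an empty list means mergeable."""
    # The model can add scrutiny but never remove it: labels are a union.
    labels = cr.model_labels | cr.rule_labels
    violations = []
    if not cr.checks_passed:
        violations.append("required checks have not passed")
    if "security-sensitive" in labels:
        if not cr.named_approver:
            violations.append("security-sensitive change needs a named approver")
        if not cr.rollback_notes.strip():
            violations.append("security-sensitive change needs rollback notes")
    return violations


# The model flagged the change; the policy layer decides what that requires.
flagged = ChangeRequest(checks_passed=True, model_labels={"security-sensitive"})
print(merge_violations(flagged))
```

Taking the union of model labels and rule labels is a deliberate choice in this sketch: a missed classification by the model cannot switch a policy off.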

This matters because otherwise the model becomes the invisible control plane.

And that is the failure mode.

Not “AI writes bad code.”

That is annoying, but manageable.

The worse failure mode is decision laundering:

The model makes a judgment.
The human rubber-stamps it.
The audit trail says “human approved.”
Nobody can reconstruct where the real decision came from.

That is how organizations lose engineering memory while pretending to gain productivity.

Security makes this boundary non-optional

Prompt injection is not a side issue. It is a reminder that LLMs are not normal software components.

The UK National Cyber Security Centre argues that prompt injection is better treated not as normal code injection, but as exploitation of an “inherently confusable deputy.” OWASP’s 2025 Top 10 for LLM and generative AI applications lists prompt injection, sensitive information disclosure, supply-chain risk, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption.

This has a direct implication for engineering agents: never make the model its own permission system.

A model with tool access is powerful.

It is also confusable.

The system needs hard constraints outside the model:

permission boundaries
sandboxing
approval gates
allowlists and denylists
policy checks
secret handling
destructive-operation stops
release gates
audit logs
reproducible evidence
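A minimal sketch of a permission check that lives outside the model. The tool names, the two policy tables, and the function `authorize` are hypothetical; a real system would load its rules from company policy and enforce them in the runtime that executes tool calls.

```python
from __future__ import annotations

# Hypothetical policy tables; real tool names and rules are company-specific.
ALLOWED_TOOLS = {"read_file", "run_tests", "open_pr", "delete_branch"}
NEEDS_HUMAN_APPROVAL = {"delete_branch"}  # destructive-operation stop


def authorize(tool: str, approved_by: str | None = None) -> str:
    """Decide a tool call outside the model: 'allow', 'deny', or 'needs_approval'."""
    if tool not in ALLOWED_TOOLS:
        return "deny"              # default-deny: unknown tools never run
    if tool in NEEDS_HUMAN_APPROVAL and approved_by is None:
        return "needs_approval"    # approval gate before destructive calls
    return "allow"


audit_log: list[tuple[str, str]] = []
for requested in ("run_tests", "drop_database", "delete_branch"):
    decision = authorize(requested)
    audit_log.append((requested, decision))  # every decision is recorded

print(audit_log)
```

Whatever the model was told, or tricked into asking for, the call only runs if this layer says so, and the log records the decision either way.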

If a model can create code, run code, read internal docs, open tickets, comment on PRs, call tools, inspect logs, and recommend releases, then the model is no longer “just chat.”

It is part of the engineering system.

Treat it that way.

Every non-trivial agent task should start with a plan

Coding agents should not jump from prompt to diff for non-trivial work.

The first artifact should be a plan.

A useful plan names:

affected files and services
assumptions
open questions
sequencing
expected tests
risk areas
migration concerns
rollback concerns
expected evidence before review

This is not bureaucracy.

This is leverage.

A senior engineer can review a plan much faster than they can reverse-engineer a bad branch.

Plan review is where humans catch the shape of the work before the model spends tokens and creates state.

Then implementation happens against the approved plan.

Then a separate review pass attacks the implementation.

Then the output is packaged as evidence.

Then the human reviews the evidence, not just the diff.

OpenAI’s AI-native engineering guide makes a similar operational point in its getting-started checklist: start with well-specified tasks, have the agent use a planning tool or write a PLAN.md, check that commands succeed, and iterate on an AGENTS.md file that unlocks loops like running tests and linters.

The goal is fewer interruptions and clearer human gates, not less human accountability.

The minimum evidence package should include:

plan
diff summary
tests run
failing tests
unresolved questions
risk assessment
security notes
performance notes
rollout notes
rollback notes
model and context version
tool actions taken
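The checklist above is easy to enforce mechanically. A minimal sketch, assuming the package arrives as a plain dictionary; the field names are my own snake_case versions of the list, and `missing_evidence` is a hypothetical helper.

```python
from __future__ import annotations

REQUIRED_EVIDENCE = (
    "plan", "diff_summary", "tests_run", "failing_tests",
    "unresolved_questions", "risk_assessment", "security_notes",
    "performance_notes", "rollout_notes", "rollback_notes",
    "model_and_context_version", "tool_actions",
)


def missing_evidence(package: dict) -> list[str]:
    """List required fields that are absent or None.

    An empty list (for example, no failing tests) counts as reported;
    silence does not.
    """
    return [name for name in REQUIRED_EVIDENCE if package.get(name) is None]


package = {
    "plan": "PLAN.md at the reviewed commit",
    "diff_summary": "adds retry to the export job",
    "tests_run": ["unit", "integration"],
    "failing_tests": [],      # explicitly none
    "rollback_notes": None,   # not reported yet
}
print(missing_evidence(package))
```

A review queue could refuse any submission where this list is non-empty, so the human only sees work that arrives with its story attached.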

This is the practical version of “human at the gates.”

Without the evidence package, the reviewer has to reconstruct the story.

With it, the reviewer can spend judgment where judgment matters.

“Human in the loop” is too vague

The phrase “human in the loop” can mean anything.

It can mean thoughtful supervision.

It can also mean a senior engineer constantly babysitting a model, interrupting it, correcting it, rerunning tests, unwinding drift, and manually stitching together a final diff.

That is not autonomy.

That is interrupt-driven delegation.

The human should not be in the gears.

The human should be at the gates.

Before work starts

The human gate is scope:

product intent
constraints
blast radius
test strategy
migration risk
open questions

During work

The human gate is exception handling:

repeated failure
ambiguity
high-risk files
destructive operations
security-sensitive changes
missing evidence

Before merge

The human gate is judgment:

architecture
maintainability
security
performance
user impact
rollout plan
rollback path

After release

The human gate is responsibility:

customer impact
incident response
follow-up prioritization
what should become durable organizational memory

The model should do the mechanical reading, mapping, drafting, testing, summarizing, and first-pass review.

The human should decide what matters.

Routing is necessary, but auto-routing is not authority

No single model wins every task.

That makes routing inevitable.

LLMRouterBench, a 2026 routing benchmark, evaluates more than 400,000 instances across 21 datasets and 33 models. It confirms strong model complementarity, but also finds that many routing methods perform similarly under unified evaluation and that several recent methods do not reliably beat simple baselines.

That matches the production reality.

Auto mode is useful for convenience.

It is not an authority for high-risk engineering.

A better routing policy has explicit lanes.

Cheap/default lane

Use for:

docs
formatting
summarization
boilerplate
simple scaffolding
low-risk edits
test naming
mechanical transformations

Standard frontier lane

Use for:

ordinary production implementation
debugging
feature work
test generation
codebase exploration
moderate refactors

Top-tier escalation lane

Use for:

failed loops
repeated test failures
ambiguous architecture
security-sensitive changes
schema/data migrations
release-critical code
adversarial review

Human-gated lane

Use for:

destructive operations
production changes
secrets
permissions
customer-impacting rollout
incident response
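The lanes can be written down as explicit rules instead of left to an auto mode. This is a minimal sketch: the task kinds, the lane names, and the failed-attempt threshold are assumptions, and a real router would key off richer signals than a single `kind` string.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str                      # e.g. "docs", "feature", "migration"
    failed_attempts: int = 0
    touches_production: bool = False


# Hypothetical task kinds per lane, loosely following the lists above.
HUMAN_GATED = {"destructive_op", "secrets", "permissions", "rollout", "incident"}
ESCALATE = {"architecture", "security", "migration", "release_critical"}
CHEAP = {"docs", "formatting", "summarization", "boilerplate", "test_naming"}
MAX_FAILED_ATTEMPTS = 2            # assumed threshold for a "failed loop"


def route(task: Task) -> str:
    """Pick a lane with explicit, inspectable rules; the riskiest match wins."""
    if task.touches_production or task.kind in HUMAN_GATED:
        return "human_gated"
    if task.kind in ESCALATE or task.failed_attempts >= MAX_FAILED_ATTEMPTS:
        return "top_tier"
    if task.kind in CHEAP:
        return "cheap_default"
    return "standard_frontier"


print(route(Task("docs")), route(Task("docs", failed_attempts=2)))
```

Checking the riskiest lanes first means a docs task that keeps failing still escalates, and anything touching production is gated regardless of how routine it looks.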

The labels matter less than the telemetry.

Track:

wrong-route corrections
repeated attempts
escalation rate
human intervention count
model switches
review latency
escaped defects
cost per accepted diff

That is how routing becomes engineering instead of vibes.

Proprietary context is both tax and asset

Every company has internal abstractions.

Some are strategic.

Many are accidental.

Models have strong priors for common public stacks. They have weaker priors for your internal framework, your release ritual, your service boundaries, your custom metadata model, your old migration convention, and the bug everyone “just knows” to avoid.

That creates a context tax.

The model needs more examples, more docs, more traces, more corrective feedback, and more human steering to do the same job.

The wrong answer is to dump everything into context.

The right answer is to separate proprietary surface into two buckets.

First: proprietary surface that is not differentiated

Reduce it.

Standardize it.

Wrap it.

Replace it.

Make it boring to the model.

Second: proprietary surface that is differentiated

Invest in it.

Make it machine-readable.

Create canonical examples.

Build runbooks.

Version skills.

Capture golden traces.

Add private evals.

Make the hidden system legible.

Anthropic’s Agent Skills documentation is a useful public example of this pattern: skills package instructions, metadata, workflows, and optional resources so Claude can load relevant domain-specific expertise on demand instead of repeatedly consuming the same context. The same documentation emphasizes progressive disclosure: metadata is always available, instructions load when triggered, and deeper resources are accessed only as needed.

Skills should not be treated as prompt snippets.

They are a mechanism for turning repeated human corrections into reusable workflow capital.

But skills can also become prompt debt.

So they need:

owners
versioning
triggering rules
evaluation
conflict detection
deprecation
telemetry
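Conflict detection is the least obvious item on that list, so here is a minimal sketch of it. The `Skill` record and its fields describe a hypothetical internal registry, not the format of Anthropic's Agent Skills; the skill names and triggers are invented.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    owner: str
    version: str
    triggers: frozenset[str]
    deprecated: bool = False


def trigger_conflicts(skills: list[Skill]) -> dict[str, list[str]]:
    """Map each trigger claimed by more than one active skill to those skills."""
    claimed = defaultdict(list)
    for skill in skills:
        if skill.deprecated:
            continue                      # retired skills cannot conflict
        for trigger in sorted(skill.triggers):
            claimed[trigger].append(skill.name)
    return {t: names for t, names in claimed.items() if len(names) > 1}


registry = [
    Skill("db-migration", "data-platform", "1.2.0", frozenset({"migration", "schema"})),
    Skill("schema-review", "api-team", "0.3.0", frozenset({"schema"})),
    Skill("old-migrate", "unowned", "0.1.0", frozenset({"migration"}), deprecated=True),
]
print(trigger_conflicts(registry))
```

Here two active skills both claim the `schema` trigger, which is exactly the kind of overlap an owner should resolve before it turns into prompt debt.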

A good bug investigation should not end with a patch.

It should produce at least one durable artifact:

regression test
eval case
runbook
skill update
prompt/context rule
better dashboard
deleted abstraction
clearer API
improved release checklist

That is the compounding loop:

workflow → trace → evaluation → correction → skill → better workflow

If that loop belongs to the vendor, the company rents intelligence.

If that loop belongs to the company, the company compounds judgment.

Multi-agent work needs an operating system

The future is not one engineer chatting with one assistant.

It is many agents running in parallel:

one maps the codebase
one drafts a plan
one implements
one writes tests
one attacks the diff
one checks security
one prepares release notes
one watches post-release telemetry
one turns discoveries into reusable knowledge

Codex can work on tasks in the background, including in parallel, using its own cloud environment; it can also connect to GitHub repositories and create pull requests from its work.

That sounds powerful.

It is also how you create an expensive mess if the human becomes the scheduler.

Past a small number of concurrent streams, the bottleneck is not agent availability.

The bottleneck is state.

Which branch is blocked?
Which tests failed?
Which agent repeated work?
Which diff conflicts with another diff?
Which risk needs a human decision?
Which output is ready for serious review?
Which run should be stopped?
Which context is stale?
Which decision has already been made?

Multi-agent engineering needs a command center.

Not necessarily a fancy product.

But at least a shared state view:

task
branch
owner
agent role
model
prompt/context version
current plan
tests run
failures
blockers
risk level
cost so far
next required human decision
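A shared state view does not need to start as a product. A minimal sketch, assuming one record per agent run; `AgentRun` carries a subset of the fields listed above, and the names, branches, and values in the example are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AgentRun:
    task: str
    branch: str
    owner: str
    agent_role: str
    model: str
    context_version: str
    risk_level: str = "low"            # "low", "medium", or "high"
    cost_usd: float = 0.0
    blockers: tuple[str, ...] = ()
    next_human_decision: str | None = None


def needs_attention(runs: list[AgentRun]) -> list[AgentRun]:
    """Runs waiting on a human: highest risk first, then most expensive."""
    rank = {"high": 0, "medium": 1, "low": 2}
    waiting = [r for r in runs if r.next_human_decision or r.blockers]
    return sorted(waiting, key=lambda r: (rank[r.risk_level], -r.cost_usd))


runs = [
    AgentRun("refactor billing", "agent/billing", "dana", "implementer", "model-a",
             "ctx-v3", risk_level="medium", cost_usd=14.0,
             next_human_decision="approve plan"),
    AgentRun("update docs", "agent/docs", "lee", "writer", "model-b", "ctx-v3"),
    AgentRun("schema migration", "agent/schema", "dana", "implementer", "model-a",
             "ctx-v2", risk_level="high", cost_usd=6.5, blockers=("tests failing",)),
]
for run in needs_attention(runs):
    print(run.task, run.risk_level, run.next_human_decision or run.blockers)
```

The queue answers the question a supervising engineer actually has: which run needs me next, and why.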

Without this, the senior engineer stops being an architect and becomes an interrupt router.

That is a bad trade.

The org chart changes, but not in the lazy way

The lazy story is junior replacement.

The better story is supervision capacity.

As code generation gets cheaper, verification gets more valuable.

The shape of engineering work changes.

Staff and principal engineers spend more time on:

architecture
decomposition
eval design
model steering
review standards
high-risk release decisions

Senior engineers spend more time on:

running agent pods
supervising parallel streams
owning service quality
converting ambiguous work into reviewable plans

Engineering managers and tech leads spend more time on:

review queues
bottleneck removal
ownership clarity
sequencing
cross-team coordination
making sure agent output does not overwhelm the human system

Junior and mid-level engineers should not be pushed out of the pipeline.

Their apprenticeship path changes toward:

verification
debugging
tests
observability
code reading
small scoped changes
agent operation

AI platform engineers become more important:

routing
evals
skill registries
context stores
telemetry
secure permissions
sandboxing
cost controls

SRE, QA, release, and security capacity also matter more, not less.

More generated code without more validation capacity is not productivity.

It is inventory.

Governance should control the work system, not just total tokens

A token cap is easy to understand.

It is also usually the wrong control surface.

Better metrics:

cost per accepted diff
cost per reviewed and shipped change
human interventions per PR
repeated attempts per task
model switches per task
wrong-route corrections
review latency
escaped defects
cache hit rate
duplicated agent work
percentage of agent tasks with evidence packages
percentage of incidents that improve runbooks/evals/skills
percentage of high-risk PRs with named human approval
percentage of agent-authored PRs with model/context version recorded
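Several of these metrics fall out of a simple per-task record. A minimal sketch, assuming each agent task is logged with its cost, attempts, interventions, and outcome; `TaskRecord`, `outcome_metrics`, and the sample numbers are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    cost_usd: float
    attempts: int
    human_interventions: int
    accepted: bool
    has_evidence_package: bool


def outcome_metrics(records: list[TaskRecord]) -> dict[str, float]:
    """Compute outcome-based metrics over a batch of agent tasks."""
    accepted = [r for r in records if r.accepted]
    if not accepted:
        raise ValueError("need at least one accepted task to compute metrics")
    total_cost = sum(r.cost_usd for r in records)  # rejected work still costs money
    interventions = sum(r.human_interventions for r in records)
    return {
        "cost_per_accepted_diff": total_cost / len(accepted),
        "attempts_per_task": sum(r.attempts for r in records) / len(records),
        "interventions_per_accepted_diff": interventions / len(accepted),
        "evidence_package_rate": sum(r.has_evidence_package for r in records) / len(records),
    }


records = [
    TaskRecord(12.0, attempts=1, human_interventions=0, accepted=True, has_evidence_package=True),
    TaskRecord(30.0, attempts=3, human_interventions=4, accepted=True, has_evidence_package=False),
    TaskRecord(18.0, attempts=2, human_interventions=2, accepted=False, has_evidence_package=True),
]
print(outcome_metrics(records))
```

Note that the cost of the rejected task is charged to the accepted ones. That is the design choice that makes abandoned attempts visible instead of free.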

Also track the hidden cost:

re-reviewed diffs
conflicting branches
redundant context loading
repeated codebase reads
repeated explanations of the same internal abstraction
agent runs stopped because ownership was unclear
human attention lost to interruptions

That is where the waste hides.

The governance objective is not to minimize tokens.

The governance objective is to spend enough tokens to reduce avoidable human touchpoints while preserving human judgment at the places where judgment changes quality.

A narrow operating loop I would actually run

Start boring.

1. Set floors before caps

Do not plan below:

$100/user/month for any serious AI use case
$200/engineer/month for engineering use cases

The floor does not mean entitlement.

It means you are taking the workflow seriously enough to support real usage.

2. Every non-trivial agent task starts with a plan

The plan includes:

affected files/services
assumptions
open questions
risk areas
sequencing
test strategy
rollback concerns

Humans approve plans, not vibes.

3. Every agent output includes an evidence package

No evidence package, no review.

Minimum evidence:

plan
diff summary
tests run
failing tests
unresolved risks
security/performance notes
rollout notes
rollback notes
model and prompt/context version

4. Use adversarial review from a different model family for high-stakes work

Same-model self-review is better than nothing.

Cross-model review is better at catching correlated blind spots.

5. Put hard stops on repeated failure

No infinite retries.

Stop on:

repeated test failure
high-risk file paths
destructive operations
migration changes
missing tests
unclear ownership
unresolved ambiguity
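The retry half of this rule is a bounded loop with an explicit exit to a human. A minimal sketch: `attempt_once` stands in for whatever runs the agent and its tests, and the budget of three attempts and the status strings are assumptions.

```python
from __future__ import annotations

from typing import Callable

MAX_ATTEMPTS = 3  # assumed retry budget; tune per task class


def run_with_hard_stop(attempt_once: Callable[[int], bool],
                       max_attempts: int = MAX_ATTEMPTS) -> tuple[str, int]:
    """Run an agent step until tests pass or the retry budget is spent.

    attempt_once is a hypothetical callable: it receives the attempt number
    and returns True when the required tests pass.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_once(attempt):
            return ("accepted_for_review", attempt)
    return ("stopped_needs_human", max_attempts)


# Simulated agents: one succeeds on its second try, one never does.
print(run_with_hard_stop(lambda n: n == 2))
print(run_with_hard_stop(lambda n: False))
```

The second call stops after three attempts and hands the task back, instead of burning tokens on a fourth guess.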

6. Build a skill registry

Reusable procedures should become skills, not Slack lore.

But skills need owners, versioning, evals, and deconfliction.

7. Build private evals

Public benchmarks are useful.

They are not your production system.

Measure models on your real task classes:

migrations
debugging
release readiness
internal abstractions
security review
test repair
incident summaries
codebase navigation
PR evidence quality

8. Review budgets by outcome

Do not ask only:

who spent the most?

Ask:

what shipped?
how much human review did it consume?
how many attempts did it take?
what defects escaped?
what reusable knowledge was created?
what should be routed differently next time?

That is the difference between buying AI tools and building an AI-native engineering system.

The stable equilibrium

The model vendors will own a lot.

They will own more generic engineering intelligence than most companies expect.
They will make the average code review better.
They will make the average migration plan better.
They will make the average test-generation loop better.
They will make the average junior task cheaper.
They will absorb a lot of workflow scaffolding that companies currently think is special.

Good.

We should want that.

There is no moat in forcing humans to remember what a model can reliably surface.

But great companies will still own the harder thing:

context
authority
verification
accountability
taste
strategy
release risk
customer commitments
institutional memory
the learning loop around their own decisions

That is the equilibrium I trust.

More intelligence inside the model.

More authority outside the model.

Use the model as a cognitive engine.

Do not use it as the court of record.

Use model vendors aggressively.

Do not let them become your invisible control plane.

The future engineering function is not organized around who can type code fastest.

It is organized around who can define invariants, operate agents, verify outputs, own release risk, and turn failures into durable knowledge.

The model is not the moat.

The moat is the learning loop you can inspect, evaluate, govern, and carry across models.

Data Is the Control Surface for LLMs and Agents

Ruslan Belkin — Tue, 09 Jun 2026 18:15:25 GMT

Data quality is not a scalar. Data quality is learnable structure under a compute, context, and objective budget.

Introduction: data is where intent becomes behavior

I have always been biased toward data.

Not because data work is glamorous. It is not. It is messy, repetitive, expensive, and full of judgment calls.

But in AI systems, data is where intent becomes behavior.

A model is not trained on “knowledge.” It is trained on sequences, transformations, mixtures, orderings, rubrics, rewards, traces, tool outputs, verifiers, and corrections.

An agent is not improved by “prompting” in the abstract. It is improved by deciding what information belongs in context, what should be retrieved, what should be compressed, what should be isolated, what should be verified, and what should be converted into future training or evaluation data.

The old pipeline was too small:

collect data → clean data → train model

The modern pipeline is closer to:

source → extract → clean → deduplicate → filter → transform → mix → train → evaluate → deploy → trace → verify → compile → replay

That is the data flywheel behind modern LLMs and agents.

Figure 1. Data as a staged control surface across training, evaluation, deployment, verification, and replay.

The important shift is this: data quality is not a scalar. It is learnable structure under a compute, context, and objective budget.

1. Epiplexity gives us the right language

The best recent framing I have seen is epiplexity.

Classical information theory gives us a problem. It says deterministic transformations should not create new information. It says information should not depend on data order. It treats likelihood modeling largely as distribution matching.

Modern LLM practice disagrees.

Synthetic data can help.
Curriculum order matters.
Reformatting a document into instructions can improve downstream behavior.
A verifier can turn a raw rollout into a stronger learning signal.
A long context can be useless noise or a high-value playbook.

Epiplexity formalizes this gap by focusing on what a computationally bounded observer can learn from data. The paper introduces epiplexity to capture structured, learnable content for bounded learners, and explicitly addresses why deterministic transformations, ordering, and likelihood modeling can matter in ways classical Shannon or Kolmogorov framings do not capture. [1]

That is the core thesis of this post: data preparation is not janitorial work. It is the engineering of learnable structure.

The question is not “how many tokens do we have?” The better question is:

What useful structure does this data expose to this model,
at this stage,
under this objective,
with this amount of compute, context, and verification?

That is true for pretraining, post-training, context engineering, reinforcement learning, and production agents.

2. Cleaning is not just removing bad text

Cleaning is often treated as a low-status preprocessing step. That is a mistake.

For web-scale training, cleaning is one of the largest capability levers. It determines what the model repeatedly sees, what gets amplified, what gets memorized, and whether the model learns coherent human language or boilerplate, menus, spam, broken markup, affiliate pages, keyword lists, and duplicated templates.

The best public example is FineWeb. Hugging Face built FineWeb as a 15T-token dataset from 96 Common Crawl snapshots, and the paper is valuable because it documents the cleaning pipeline and ablates the choices instead of presenting them as folklore. The final pipeline extracts text from WARC rather than WET files, applies base filtering, runs MinHash deduplication independently on each crawl, applies selected C4-style filters, adds custom heuristic filters, and anonymizes email and public IP addresses. [2]

The details matter. FineWeb found that WET extraction retained too much boilerplate and menu text. The team instead extracted text from WARC files using trafilatura, and the WARC/trafilatura ablation produced a better model than WET extraction. The base filtering stage then used URL blocklists for adult content, fastText language identification, and MassiveText-style quality and repetition filters, producing roughly 36T tokens after base filtering. [2]

Deduplication was more subtle. FineWeb used fuzzy MinHash deduplication with document 5-grams, 112 hash functions, 14 buckets of 8 hashes, and a target of at least 75% similarity. But global deduplication across all 96 snapshots did not work as expected: it reduced the corpus to 4T tokens and gave little improvement. Worse, for an older crawl, the 10% of data retained by global deduplication was visually lower quality than the 90% removed. The better choice was independent per-snapshot MinHash deduplication, which produced 20T tokens and matched RefinedWeb performance in the ablation. [2]
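To see what those MinHash parameters imply, the standard banded-LSH collision formula can be evaluated directly. This is a minimal sketch, and the function name is illustrative rather than taken from the FineWeb code. With b buckets of r hashes each, two documents whose n-gram sets have Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b.

```python
def match_probability(similarity: float, buckets: int = 14, hashes_per_bucket: int = 8) -> float:
    """Probability that two documents collide in at least one MinHash bucket."""
    # One bucket matches only if all of its hashes agree: similarity ** r.
    bucket_match = similarity ** hashes_per_bucket
    # The pair is flagged as duplicate if any of the b buckets matches.
    return 1.0 - (1.0 - bucket_match) ** buckets


for s in (0.70, 0.75, 0.80, 0.85):
    print(f"similarity {s:.2f} -> flagged with probability {match_probability(s):.3f}")
```

The curve is an S-shape, not a hard cutoff. With 14 buckets of 8 hashes, a pair at exactly 0.75 similarity is flagged about 77% of the time, so "at least 75% similarity" describes where the curve turns, not a guarantee.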

Deduplication is not automatically good. Deduplication granularity changes the data distribution.

If you deduplicate too aggressively across time, you may accidentally retain the wrong tail. You may remove useful recurring explanations while keeping strange, low-quality pages that only appear once. The goal is not to maximize deletion. The goal is to remove harmful repetition while preserving useful coverage.

FineWeb also shows how heuristic filtering should be done. The team tested C4-style filters and found that the terminal-punctuation filter gave the largest individual HellaSwag boost but removed about 30% of tokens. They kept every C4 filter except terminal punctuation because that single filter deleted too much data. Then they built custom filters by collecting more than 50 document statistics, comparing their distributions on high- and low-quality data, selecting candidate thresholds, and validating them with 28B-token ablation runs. The final three custom filters removed about 22% of tokens and improved the aggregate benchmark score by about 1% in the ablations. [2]

That is how data cleaning should work:

inspect → hypothesize → filter → train proxy models → evaluate → keep only filters that improve behavior
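The last step of that loop, "keep only filters that improve behavior", can be written as an explicit gate. The sketch below is one possible shape, not FineWeb's code: the `FilterAblation` record, the thresholds, and every score in it are invented for illustration. Only the roughly 30% removal for terminal punctuation echoes the figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class FilterAblation:
    name: str
    score_delta: float     # proxy-model benchmark delta vs. the unfiltered baseline
    tokens_removed: float  # fraction of tokens the filter deletes


def promote(ablations, min_delta=0.0, max_removed=0.25):
    """Keep filters that help the proxy model without deleting too much data."""
    kept = []
    for a in ablations:
        if a.score_delta > min_delta and a.tokens_removed <= max_removed:
            kept.append(a.name)
    return kept


# Made-up numbers for illustration only; they are not FineWeb's measurements.
candidates = [
    FilterAblation("terminal_punctuation", score_delta=0.012, tokens_removed=0.30),
    FilterAblation("short_line_ratio", score_delta=0.004, tokens_removed=0.08),
    FilterAblation("char_duplication", score_delta=-0.001, tokens_removed=0.05),
]
print(promote(candidates))
```

The useful property is that a filter with the best individual score can still be rejected on data loss, and a filter that "looks obviously right" is rejected if the proxy model disagrees.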

FineWeb-Edu adds another useful datapoint. It filtered FineWeb into a 1.3T-token educational subset using a classifier trained from LLM annotations. The classifier achieved 82% F1 at the selected threshold; applying it over FineWeb required 6,000 H100 GPU hours. In a 1.82B model trained on 350B tokens, FineWeb-Edu increased MMLU from 33% to 37% and ARC from 46% to 57%, while matching Matrix’s MMLU performance with almost 10x fewer tokens. [2]

That is strong evidence for classifier-based filtering when it is measured. But it also comes with a warning. Apple’s “Data-Quality Illusion” work argues that classifier-based quality filtering can improve downstream task performance while not necessarily improving language modeling on the high-quality target set. The authors challenge the assumption that classifier score captures a universal notion of quality. [3]

The right conclusion is not “classifiers are bad.” The right conclusion is that quality classifiers are instruments, not truth. A quality filter must be attached to a target behavior, benchmark suite, training stage, and ablation record. Otherwise it becomes a magic number.

3. The cleaning stack should be empirical

A modern LLM data cleaning stack should usually include:

source validation
HTML / PDF / code extraction
boilerplate removal
language identification
document normalization
PII handling
policy filtering
exact deduplication
fuzzy deduplication
near-duplicate cluster analysis
repetition filters
low-information filters
classifier-based quality filters
contamination checks
source and license metadata
benchmark-backed ablations

The important part is not the checklist. The important part is the measurement discipline.

Figure 2. A practical cleaning and filtering stack with an ablation loop.

FineWeb’s ablation protocol is the right model. The team trained data-ablation models that were identical except for the data, used equal token budgets, ran multiple seeds, and evaluated on benchmarks selected for stable signal at small scale. Their ablations used 1.82B-parameter models, 28B-token filtering runs, 350B-token dedup and cumulative-filtering runs, more than 70 trained models, and roughly 80,000 H100 GPU hours. [2]

That is what data-driven data curation means. Cleaning choices should not be made by intuition alone. They should be promoted only if they improve the relevant evaluation suite.
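Running multiple seeds only helps if the promotion rule actually uses them. A minimal sketch of such a rule follows. The one-standard-deviation margin is an assumption chosen for the example, not FineWeb's criterion, and the scores are made up.

```python
from statistics import mean, stdev


def improves(baseline_scores, candidate_scores, margin_in_std=1.0):
    """Promote only if the mean gain exceeds seed-to-seed noise."""
    gain = mean(candidate_scores) - mean(baseline_scores)
    noise = max(stdev(baseline_scores), stdev(candidate_scores))
    return gain > margin_in_std * noise


# Hypothetical aggregate benchmark scores, one per seed.
baseline = [0.412, 0.418, 0.409]
clear_win = [0.431, 0.427, 0.436]
noisy = [0.405, 0.425, 0.414]

print(improves(baseline, clear_win))
print(improves(baseline, noisy))
```

A real pipeline would likely use more seeds and a proper statistical test. The point is only that a single-seed delta is not evidence.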

4. Data mixtures are learned, not guessed

The next mistake is treating data mixture as a fixed human recipe. It is not. Data mixture is a hyperparameter schedule.

Different training stages need different data distributions:

early pretraining → broad coverage and diversity
late pretraining → higher-quality and capability-dense data
reasoning annealing → math, code, STEM, synthetic reasoning
long-context training → long coherent documents and long-context QA
SFT → instruction, reasoning, tools, chat, safety
preference / DPO → ranked outputs and rejected alternatives
RL → verifier-backed tasks and environments
agent training → tool trajectories, state transitions, rubrics, rollouts

NVIDIA’s Nemotron Nano 2 is a good recent example. Its pretraining mixture has thirteen categories, including quality-bucketed crawl data, synthetic high-quality crawl, math, Wikipedia, code, academic data, multilingual data, and synthetic SFT-style data. NVIDIA says it weights higher-quality sources more heavily and uses a three-phase curriculum: Phase 1 emphasizes diversity, Phase 2 shifts toward higher-quality data at the 60% training point, and Phase 3 shifts again at the 90% point. [4]
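Treating the mixture as a schedule means the sampler has to know where training is. The sketch below encodes the 60% and 90% switch points described above. The phase names, the four categories, and all the weights are placeholders rather than NVIDIA's published mixture.

```python
# Phase boundaries follow the 60% and 90% switch points described above.
# The category weights are placeholders, not NVIDIA's published mixture.
PHASES = [
    (0.60, "phase_1_diversity", {"crawl": 0.70, "code": 0.15, "math": 0.05, "other": 0.10}),
    (0.90, "phase_2_quality", {"crawl": 0.50, "code": 0.20, "math": 0.15, "other": 0.15}),
    (1.00, "phase_3_quality", {"crawl": 0.35, "code": 0.25, "math": 0.25, "other": 0.15}),
]


def mixture_at(progress: float):
    """Return (phase name, sampling weights) for a training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    for end, name, weights in PHASES:
        # The final phase also covers progress == 1.0 exactly.
        if progress < end or end == 1.00:
            return name, weights


for p in (0.10, 0.60, 0.95):
    print(p, mixture_at(p)[0])
```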

NVIDIA also discloses ablations that show how these choices are made. For multilingual data, they continued a 1B checkpoint for another 100B tokens with 50% multilingual data and 50% default pretraining data, then evaluated on Global-MMLU. Curated Common Crawl averaged 37.0, FineWeb-2 averaged 35.1, DiverseQA-wiki averaged 42.1, and DiverseQA-crawl averaged 47.0. That result led them to upweight DiverseQA-crawl in the multilingual mixture. [4]

That is the pattern we want. Not “we think multilingual web is good.” Instead: compare multilingual source families under controlled continuation, measure Global-MMLU, and change the mixture.

The same report shows that adding only 5% fundamental-reasoning SFT-style data into a 100B-token continuation improved Nemotron-H 8B’s MMLU-Pro from 44.24 to 56.36, with average math increasing by about two points and no decrease in average commonsense or code benchmarks. [4]

Qwen3 describes a complementary methodology. It trained on 36T tokens across 119 languages, including coding, STEM, reasoning, books, multilingual data, and synthetic data. It used Qwen2.5-VL to extract text from PDF-like documents, Qwen2.5 to refine that text, and Qwen2.5/Qwen2.5-Math/Qwen2.5-Coder to synthesize trillions of tokens in formats such as textbooks, QA, instructions, and code snippets. Qwen also annotated more than 30T tokens for educational value, domain, safety, and related labels, then optimized the mixture at the instance level through proxy-model ablations rather than only at source or domain level. [5]

Qwen3’s stage schedule is explicit: more than 30T tokens of general pretraining at 4K context, then about 5T higher-quality tokens with increased STEM, coding, reasoning, and synthetic data, then hundreds of billions of long-context tokens, where 75% of texts were 16K–32K and 25% were 4K–16K. [5]

So the modern methodology looks like this:

label data finely
bucket by source and quality
run proxy-model ablations
evaluate downstream capability deltas
schedule mixtures by stage
anneal toward capability-dense data
validate against regression suites

RegMix formalizes part of this. It treats mixture selection as a regression problem: train many small proxy models on different mixtures, fit a regression model to predict performance, then train the larger model on the predicted best mixture. In its experiments, RegMix trained 512 one-million-parameter models for 1B tokens, used the regression model to choose a mixture, and trained a 1B model for 25B tokens that performed best among 64 candidate mixtures. It also outperformed human selection in experiments with models up to 7B parameters trained on 100B tokens, and matched or exceeded DoReMi with about 10% of the compute. [6]
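The mixture-as-regression idea fits in a few dozen lines. The sketch below is a toy version under loud assumptions: `proxy_loss` is an invented formula standing in for "train a tiny model and measure validation loss", there are only three sources, and plain linear least squares is swapped in to keep the example dependency-free. The paper's actual regressors and targets may differ.

```python
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    xs = [[1.0] + list(f) for f in features]  # prepend an intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, feature):
    return coef[0] + sum(c * f for c, f in zip(coef[1:], feature))


def proxy_loss(web, code):
    """Stand-in for a tiny proxy training run; the formula is invented for the demo."""
    return 3.0 - 0.4 * web - 0.9 * code  # the remaining weight goes to a third source


# 1) "Train" many small proxies on random mixtures (web, code, rest).
rng = random.Random(0)
mixtures = []
for _ in range(32):
    web = rng.uniform(0.0, 0.8)
    code = rng.uniform(0.0, 1.0 - web)
    mixtures.append((web, code))
losses = [proxy_loss(w, c) + rng.gauss(0.0, 0.01) for w, c in mixtures]

# 2) Fit the regression, 3) rank candidate mixtures by predicted loss.
coef = fit_linear(mixtures, losses)
candidates = [(0.6, 0.2), (0.4, 0.4), (0.2, 0.3), (0.3, 0.6)]
best = min(candidates, key=lambda m: predict(coef, m))
print("predicted best mixture (web, code):", best)
```

The design choice worth copying is the separation of concerns: cheap proxy runs produce (mixture, outcome) pairs, and the expensive run is spent only on the mixture the regression predicts to be best.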

The practical point is blunt: if you are not measuring mixture effects, you are guessing.

And loss is not enough. Nemotron-H reported downstream accuracy jumps after mixture changes and noted cases where validation loss was not a reliable proxy for downstream task performance. FineWeb selected filters using benchmark deltas, not just perplexity. DataComp-LM makes this broader: it provides a 240T-token Common Crawl corpus, training recipes, and 53 downstream evaluations so teams can test deduplication, filtering, and data mixing. Its baseline trained a 7B model on 2.6T tokens to 64% 5-shot MMLU, a 6.6-point MMLU gain over MAP-Neo with 40% less compute. [7]

Figure 3. Stage-wise mixture design as an experimental loop.

5. Synthetic data is a compiler, not a shortcut

Synthetic data has a bad reputation when it means low-diversity model slop. But the best synthetic data work is not “ask a model to make more examples.” It is compilation.

A raw document is not a dataset. A book is not a dataset. A manual is not a dataset. A GitHub repository is not a dataset. A support transcript is not a dataset. A tool trace is not a dataset. They are source material.

The data engineering task is to compile source material into useful training, evaluation, retrieval, reward, and context artifacts.

Instruction Pre-Training augments raw corpora with instruction-response pairs generated by an instruction synthesizer. The paper synthesized 200M instruction-response pairs across 40+ task categories and showed that instruction-augmented pretraining improved base models and made them benefit more from later instruction tuning. [17]

MAmmoTH2 recalls relevant web documents, extracts instruction-response pairs, and refines them using open-source LLMs. The project harvested 10M instruction examples from the pretraining web corpus; the Mistral-based MAmmoTH2-7B improved from 11% to 34% on MATH and from 36% to 67% on GSM8K without in-domain training data. [18]

Web Reconstruction treats each web document as either an instruction or a response and reconstructs the missing side. Its WebR datasets outperformed prior instruction-tuning baselines by up to 16.65% across four instruction-following benchmarks. [19]

Qwen3’s use of synthetic textbooks, QA, instructions, and code snippets is one example. NVIDIA’s Nemotron Nano 2 uses synthetic high-quality crawl, synthetic SFT-style data, long-document QA, and translated DiverseQA. For long-context extension, NVIDIA used academic documents longer than 32K tokens, split them into 1,024-token chunks, selected 10% of chunks for Qwen2.5-72B-Instruct to generate QA pairs, concatenated those QA pairs, and appended them to the original document. A 20% allocation of this long-context document-QA data improved RULER-128K in ablations. [4]
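The long-document recipe is mechanical enough to sketch. In the version below, whitespace splitting stands in for a real tokenizer, `generate_qa` is a placeholder for the LLM call, and the sampling details (a fixed seed, selected chunks kept in document order) are assumptions rather than NVIDIA's implementation.

```python
import random


def chunk(tokens, size=1024):
    """Split a token list into consecutive fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def generate_qa(chunk_tokens):
    """Placeholder for the LLM call that writes a QA pair about one chunk."""
    preview = " ".join(chunk_tokens[:5])
    return f"Q: What does the passage starting '{preview}' discuss?\nA: <model answer>"


def augment_document(tokens, sample_rate=0.10, seed=0):
    """Append QA pairs about a sampled 10% of chunks to the end of the document."""
    chunks = chunk(tokens)
    rng = random.Random(seed)
    n_selected = max(1, round(len(chunks) * sample_rate))
    selected = sorted(rng.sample(range(len(chunks)), n_selected))
    qa_block = "\n\n".join(generate_qa(chunks[i]) for i in selected)
    return " ".join(tokens) + "\n\n" + qa_block, selected


tokens = [f"tok{i}" for i in range(40_000)]  # stand-in for a >32K-token document
text, selected = augment_document(tokens)
print(len(selected), "chunks selected for QA generation")
```

Placing the questions after the full document forces the model to look back across the whole context to answer them, which is the behavior the long-context stage is trying to teach.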

This is the right abstraction:

source corpus → compiled training, evaluation, retrieval, reward, and context artifacts

A book can become:

chapters
sections
claims
definitions
examples
procedures
tables
equations
edge cases
contradictions
factual QA
procedural QA
retrieval tasks
citation tasks
summaries
critiques
tool-use tasks
rubrics
unit tests
replay episodes

The invariant is provenance. Every generated artifact should point back to source spans, tool outputs, environment states, or verifier results. Without provenance, synthetic data becomes impossible to debug.

A generated QA pair is not good because it looks good. It is good if it teaches a capability, survives verification, preserves provenance, improves evals, and does not regress adjacent behavior.
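Provenance becomes enforceable once it is part of the artifact's type. A minimal sketch follows. The class names, the character-offset span convention, and the `verifier_result` strings are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceSpan:
    doc_id: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass
class Artifact:
    kind: str                  # e.g. "factual_qa", "rubric", "unit_test"
    content: str
    spans: list = field(default_factory=list)
    verifier_result: str = ""  # "passed", "failed", or "" if never checked


def provenance_errors(artifact, corpus):
    """Return reasons an artifact cannot be traced or trusted (empty list means OK)."""
    errors = []
    if not artifact.spans:
        errors.append("no source spans")
    for span in artifact.spans:
        text = corpus.get(span.doc_id)
        if text is None:
            errors.append(f"unknown document {span.doc_id}")
        elif not 0 <= span.start < span.end <= len(text):
            errors.append(f"span out of range in {span.doc_id}")
    if artifact.verifier_result != "passed":
        errors.append("not verified")
    return errors


corpus = {"manual": "Torque the bolt to 40 Nm."}
good = Artifact("factual_qa", "Q: What torque? A: 40 Nm", [SourceSpan("manual", 19, 24)], "passed")
bad = Artifact("factual_qa", "Q: What torque? A: 55 Nm")
print(provenance_errors(good, corpus))
print(provenance_errors(bad, corpus))
```

An artifact that fails this check can still exist, but it should not be eligible for a training or evaluation set until someone can say where it came from.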

Figure 4. The document-to-dataset compiler.

6. RL steering is also data engineering

Reinforcement learning is usually described as optimization. For LLMs and agents, that is incomplete. RL is a data pipeline.

The prompt distribution is data.
The rollout is data.
The environment is data.
The tool result is data.
The verifier is data.
The judge rubric is data.
The reward vector is data.
The failed trajectory is data.
The replay set is data.

DeepSeek-R1-Zero is the clean example. It used rule-based rewards for math and code: boxed-answer verification for math, compiler/test feedback for code, and a format reward for tags. DeepSeek explicitly avoided neural outcome or process reward models in that stage because of reward hacking risk and retraining complexity. AIME 2024 pass@1 increased from 15.6% to 71.0%, and majority voting reached 86.7%. [8]

This is the key engineering point: if the environment can verify the work, prefer verification over taste.

For math, code, structured extraction, database queries, tool-use workflows, and many agent tasks, the best reward is often not a learned preference model. It is a test, execution result, schema check, constraint checker, exact match, or environment transition.
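A rule-based reward of this kind is short enough to read in full. The sketch below is a simplified illustration, not DeepSeek's implementation: the `<think>` tag convention and exact string matching are assumptions, and the regex deliberately handles only boxed answers without nested braces, so `\boxed{\frac{1}{2}}` would score zero here.

```python
import re

# Matches \boxed{...} where the content contains no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
# Reasoning inside <think>...</think>, followed by some non-empty answer text.
THINK_FORMAT = re.compile(r"^<think>.+?</think>\s*\S.*$", re.DOTALL)


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the last \\boxed{...} answer equals the reference, else 0.0."""
    answers = BOXED.findall(completion)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == reference.strip() else 0.0


def format_reward(completion: str) -> float:
    """1.0 if reasoning is wrapped in <think> tags and followed by an answer."""
    return 1.0 if THINK_FORMAT.match(completion.strip()) else 0.0


sample = r"<think>2 + 2 = 4</think> The answer is \boxed{4}."
print(accuracy_reward(sample, "4"), format_reward(sample))
```

Nothing in this reward can be flattered. It checks the answer and the format, and it has no opinion about how persuasive the reasoning sounds.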

DeepSeek’s later R1 pipeline added cold-start long-CoT data, reasoning-oriented RL, rejection sampling, roughly 600K reasoning samples, and final training that included broader non-reasoning data. The final DeepSeek-R1 report shows strong benchmark results such as 79.8 on AIME 2024, 97.3 on MATH-500, 65.9 on LiveCodeBench, and 49.2 on SWE-bench Verified. [8]

Qwen3 shows that RL data does not always need to be huge if the verifier is strong. Its reasoning RL stage used 3,995 query-verifier pairs, selected to be unused in cold start, learnable, challenging, and broad. Qwen3-235B-A22B improved on AIME’24 from 70.1 to 85.1 over 170 RL steps. [5]

Kimi K2 shows the agentic version. Moonshot built a large-scale tool-use synthesis pipeline with three stages: tool spec generation, agent/task generation, and trajectory generation. It fetched 3,000+ real MCP tools from GitHub and generated more than 20,000 synthetic tools; tasks were paired with rubrics specifying success criteria, expected tool-use patterns, and evaluation checkpoints. The simulator maintained state, returned realistic tool feedback, introduced edge cases, and retained only trajectories that passed LLM-judge rubric evaluation. For coding and software engineering, Kimi complemented simulation with real execution sandboxes and test-suite pass rates. [9]

NVIDIA Nemotron 3 Super shows the same pattern in an open training pipeline. It reports 25T pretraining tokens, a 7M-sample SFT stage, and a three-stage RL pipeline: multi-environment RLVR across 21 environments and 37 datasets, SWE-RL with OpenHands and Apptainer containers, and RLHF with a principle-following generative reward model. [10]

Figure 5. A verifier-backed RL environment stack.

The engineering lesson is simple:

Use verifiers where possible.

A scalar reward without provenance is dangerous. A judge without calibration is dangerous. An environment without anti-hacking checks is dangerous. A rollout without replay is lost data.

7. Context engineering is runtime data engineering

The same data problem now appears at inference time.

Prompt engineering asks: what should I say to the model?

Context engineering asks: what information state should the model see before the next action?

That information state includes instructions, task state, retrieved evidence, examples, memories, tool schemas, tool outputs, prior decisions, constraints, citations, rubrics, and open uncertainties.

This is not cosmetic. It is runtime curation. An agent context window is a temporary dataset assembled for one forward pass.

If the context is stale, the model reasons from stale evidence.
If the context is polluted, the model amplifies pollution.
If the context is too large, the model gets distracted.
If the context omits a constraint, the model violates it.
If the context includes sensitive data unnecessarily, the system creates avoidable risk.
If the context is over-compressed, the model loses the structure it needed.

ACE, or Agentic Context Engineering, is one of the most relevant recent papers here. It treats contexts as evolving playbooks rather than static prompts. ACE uses a Generator, Reflector, and Curator to accumulate, refine, and organize strategies through incremental delta updates rather than monolithic rewrites. Across agent and finance benchmarks, ACE reports +10.6% on agents and +8.6% on finance, while adapting without labeled supervision by using execution feedback. [11]

The most important ACE result is the failure mode: context collapse. In one AppWorld case study, a context at step 60 contained 18,282 tokens and achieved 66.7 accuracy. One rewrite collapsed it to 122 tokens, and accuracy fell to 57.1, worse than the 63.7 baseline without adaptation. [11]

That is epiplexity in practice. Compression removed useful structure. Shorter was not better. Cleaner was not better. A summary destroyed operational knowledge.

The right rule is not “use more context.” The right rule is: preserve useful structure while controlling distraction, staleness, privacy, and cost.

ACE’s cost data is also useful. On AppWorld offline adaptation, ACE reduced latency by 82.3% and rollouts by 75.1% versus GEPA. On online FiNER, it reduced latency by 91.5% and token cost by 83.6% versus Dynamic Cheatsheet. That happened because ACE used incremental delta updates and non-LLM merging/deduplication instead of repeatedly rewriting the whole context. [11]
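The difference between a delta update and a rewrite is easy to show. The sketch below is not ACE's code: it uses a normalized-string key as a stand-in for whatever non-LLM deduplication a real system would use, and the playbook bullets are invented.

```python
import re


def normalize(text: str) -> str:
    """Cheap, deterministic key for spotting near-identical bullets."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def apply_delta(playbook: dict, delta: list) -> dict:
    """Merge new bullets into the playbook without rewriting existing ones."""
    merged = dict(playbook)  # existing entries are preserved untouched
    for bullet in delta:
        key = normalize(bullet)
        if key and key not in merged:
            merged[key] = bullet
    return merged


playbook = apply_delta({}, [
    "Check the API docs before calling a tool.",
    "Paginate list endpoints.",
])
playbook = apply_delta(playbook, [
    "check the api docs before calling a tool",  # duplicate, ignored
    "Retry on HTTP 429 with backoff.",
])
print(list(playbook.values()))
```

Because existing bullets are never regenerated, a bad update can add noise but cannot silently erase what the agent already learned, which is the failure the context-collapse example illustrates.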

Figure 6. Context engineering as runtime data curation.

Retrieval should optimize utility, not similarity

Most RAG systems still retrieve by semantic similarity. That is not enough for in-context learning.

ICLERB reframes retrieval for ICL as a recommendation problem: retrieve the documents that maximize downstream LLM utility, not merely the documents most semantically similar to the query. Its benchmark evaluates retrievers by whether retrieved contexts improve LLM accuracy in ICL settings. [12]

The results are important. ICLERB found that rankings diverge from MTEB-style retrieval rankings, meaning strong search embeddings are not necessarily strong ICL context selectors. It also found that fine-tuning for semantic similarity can be detrimental for ICL utility in some cases. [12]

The RLRAIF method goes further: it fine-tunes a retriever using LLM feedback as a reward signal. With only about 10K DPO values, estimated at 5M tokens, and a small adapter trained on a consumer GPU, the RLRAIF reranker achieved 0.7238 nDCG@10 and 0.7225 nDCG@50, outperforming much larger models such as bge-en-icl and NV-Embed-v2 on ICLERB. [12]

The practical implication is direct: do not evaluate retrieval only by relevance. Evaluate whether the retrieved context changes the answer.
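Measuring whether context changes the answer needs nothing more than a with-and-without comparison. In the sketch below, `toy_model` is a stand-in for an LLM call and both contexts are invented. The "similar" context shares the query's topic but contains no answer.

```python
def context_utility(answer_fn, examples, context):
    """Accuracy gain from adding `context`, measured on labeled examples."""
    def accuracy(ctx):
        correct = sum(answer_fn(q, ctx) == gold for q, gold in examples)
        return correct / len(examples)
    return accuracy(context) - accuracy("")


def toy_model(question, context):
    """Stand-in for an LLM call: answers correctly only if the fact is in context."""
    facts = {"capital of France?": "Paris", "capital of Peru?": "Lima"}
    answer = facts[question]
    return answer if answer in context else "unknown"


examples = [("capital of France?", "Paris"), ("capital of Peru?", "Lima")]
similar = "France is a country in Europe. Peru is a country in South America."
useful = "Paris is the capital of France. Lima is the capital of Peru."

print("similar context utility:", context_utility(toy_model, examples, similar))
print("useful context utility:", context_utility(toy_model, examples, useful))
```

A similarity-based retriever would plausibly rank both contexts as relevant to the query. Only the utility measurement separates them.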

For agents, the retrieval target is not top-k similar chunks. It is the minimum sufficient evidence pack for the next decision.

Likely useful in the evidence pack

task state; hard constraints; source-grounded facts
prior decisions; tool schemas; recent tool outputs
similar solved cases; known failure modes; rubrics; unit tests

Usually exclude or isolate

duplicate chunks; stale state; irrelevant logs
raw tool dumps; summaries that erase operational detail
sensitive data not needed by the model

Long-context benchmarks show why this matters

Needle-in-a-haystack is too easy.

RULER was designed because vanilla needle tests measure superficial retrieval. It adds diverse needle types, multiple needles, multi-hop tracing, aggregation, and configurable length/task complexity. In its evaluation of 17 long-context LMs across 13 tasks, many models that were nearly perfect on vanilla needle-in-a-haystack degraded sharply as length and complexity increased; only about half maintained satisfactory performance at 32K despite claiming 32K+ context. [13]

NoLiMa removes literal overlap between the question and the relevant evidence. That matters because many long-context systems can exploit lexical matching rather than understanding. At 32K context, 11 evaluated models dropped below 50% of their short-context baselines, and GPT-4o dropped from 99.3% to 69.7%. [14]

LongBench v2 tests deeper long-context reasoning with 503 multiple-choice questions, contexts from 8K to 2M words, and six task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history, code-repository understanding, and structured-data understanding. Human experts reached 53.7% accuracy under a 15-minute constraint; the best direct-answer model reached 50.1%, while o1-preview reached 57.7% with longer reasoning. [15]

For coding agents, ContextBench is more directly operational. It contains 1,136 issue-resolution tasks from 66 repositories across eight programming languages, with human-annotated gold contexts. It measures context recall, precision, and efficiency across agent trajectories. [16]

That gives us a practical evaluation stack for context engineering:

RULER → synthetic long-context retrieval, tracing, aggregation
NoLiMa → long-context retrieval without lexical shortcuts
LongBench v2 → realistic deep long-context reasoning
ContextBench → coding-agent context recall, precision, efficiency
ICLERB → utility of retrieved examples/documents for ICL
AppWorld/ACE → adaptive context and agent memory under feedback

The conclusion is direct: context must be evaluated as a data product.

Every context policy should be ablated:

remove a memory component
change the retriever
change the reranker
change the compression method
change the evidence budget
change tool-output handling
change citation requirements
replay the same episodes
measure success, grounding, latency, token cost, and regressions
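The ablation list above maps onto a small replay harness: hold the episodes fixed, change one policy knob at a time, and report deltas against the base policy. Everything in this sketch is hypothetical, including the policy keys, the metrics, and the `toy_run` function that stands in for actually running an agent.

```python
def replay(episodes, policy, run_episode):
    """Run the same episodes under one context policy and aggregate metrics."""
    results = [run_episode(ep, policy) for ep in episodes]
    n = len(results)
    return {
        "success": sum(r["success"] for r in results) / n,
        "tokens": sum(r["tokens"] for r in results) / n,
    }


def ablate(episodes, base_policy, variants, run_episode):
    """Compare each single-change variant against the base policy."""
    base = replay(episodes, base_policy, run_episode)
    report = {}
    for name, overrides in variants.items():
        metrics = replay(episodes, {**base_policy, **overrides}, run_episode)
        report[name] = {k: metrics[k] - base[k] for k in base}
    return report


def toy_run(episode, policy):
    """Stand-in for an agent run: some episodes fail without the memory component."""
    success = (not episode["needs_memory"]) or policy["memory"]
    tokens = policy["evidence_budget"] + (200 if policy["memory"] else 0)
    return {"success": success, "tokens": tokens}


episodes = [{"needs_memory": True}, {"needs_memory": False}]
base_policy = {"memory": True, "evidence_budget": 1000}
variants = {"no_memory": {"memory": False}, "half_budget": {"evidence_budget": 500}}
print(ablate(episodes, base_policy, variants, toy_run))
```

In this toy setup, removing memory saves tokens and halves success, while halving the evidence budget saves more tokens at no cost. A real harness would add grounding, latency, and regression metrics, but the shape of the report stays the same.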

8. Production traces are the next dataset

The highest-value data often appears after deployment.

Production traces show what users actually ask, what the model retrieves, which tools it calls, where it fails, which evidence it ignores, which judge disagrees, where policies are ambiguous, and where the model sounds confident without being grounded.

Those traces should not disappear into logs. They should become training, evaluation, context, and reward artifacts.

The loop is:

observe → classify → verify → compile → replay → promote

A failure can compile into many things:

bad answer → corrected SFT example
bad retrieval → new chunking rule or index update
bad tool call → tool-use trajectory or schema constraint
bad judgment → rubric update or judge calibration item
unsafe response → safety eval and reward example
missing knowledge → source update and retrieval test
long-context miss → context benchmark replay
agent dead-end → environment transition test
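That mapping can live in code rather than in someone's head. The sketch below mirrors the table above. The identifiers, the `CompiledArtifact` record, and the `promoted` flag are illustrative names, not part of any existing tool.

```python
from dataclasses import dataclass

# Mirrors the failure-to-artifact mapping above; the identifiers are illustrative.
ROUTES = {
    "bad_answer": "corrected_sft_example",
    "bad_retrieval": "chunking_rule_or_index_update",
    "bad_tool_call": "tool_trajectory_or_schema_constraint",
    "bad_judgment": "rubric_update_or_judge_calibration_item",
    "unsafe_response": "safety_eval_and_reward_example",
    "missing_knowledge": "source_update_and_retrieval_test",
    "long_context_miss": "context_benchmark_replay",
    "agent_dead_end": "environment_transition_test",
}


@dataclass
class CompiledArtifact:
    trace_id: str           # provenance: which production trace produced this
    failure_class: str
    artifact_type: str
    promoted: bool = False  # flipped only after replay shows no regression


def compile_failure(trace_id: str, failure_class: str) -> CompiledArtifact:
    """Turn a classified, verified failure into a typed artifact awaiting replay."""
    if failure_class not in ROUTES:
        raise ValueError(f"unclassified failure: {failure_class}")
    return CompiledArtifact(trace_id, failure_class, ROUTES[failure_class])


artifact = compile_failure("trace-001", "bad_retrieval")
print(artifact)
```

Refusing to compile an unclassified failure is deliberate: it forces the classification step to happen before anything is added to a dataset.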

This is the practical version of the Verifier-Compiler Loop: observe traces and outcomes, evaluate with checks and judges, intervene safely, then compile expert decisions into a versioned behavioral contract. [20]

Do not just patch the prompt. Classify the failure. Compile the correction. Replay the system. Promote only when the change improves the target behavior without regressing adjacent behavior.

9. Data quality needs contracts

A serious LLM or agent program needs data contracts. Not bureaucracy. Engineering contracts.

For every dataset:

Dataset name:
Training stage:
Source classes:
License / usage constraints:
Freshness window:
Extraction method:
Filtering rules:
Dedup method:
PII / policy filters:
Contamination checks:
Capability tags:
Difficulty tags:
Mixture weight:
Verifier:
Judge model / rubric:
Known failure modes:
Evaluation suite:
Ablation evidence:
Owner:
Promotion criteria:
Rollback criteria:
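A contract only matters if something checks it. The sketch below gates promotion on a subset of the dataset fields above. Which fields are blocking is a policy choice, so the `REQUIRED_FOR_PROMOTION` tuple is an assumption, and the example dataset is hypothetical.

```python
# Which fields block promotion is a policy choice; this subset is an assumption.
REQUIRED_FOR_PROMOTION = (
    "owner",
    "license",
    "dedup_method",
    "contamination_checks",
    "evaluation_suite",
    "ablation_evidence",
    "rollback_criteria",
)


def missing_fields(contract: dict) -> list:
    """List required contract fields that are absent or blank."""
    return [
        name for name in REQUIRED_FOR_PROMOTION
        if not str(contract.get(name, "")).strip()
    ]


def can_promote(contract: dict) -> bool:
    """A dataset is eligible for promotion only when its contract is complete."""
    return not missing_fields(contract)


draft = {
    "dataset_name": "support-transcripts-v3",  # hypothetical dataset
    "owner": "data-platform",
    "license": "internal",
    "dedup_method": "exact + MinHash",
    "evaluation_suite": "support-qa-regression",
}
print(missing_fields(draft))
print(can_promote(draft))
```

The same pattern extends to the context-policy and RL-environment templates: a flat record, a list of blocking fields, and a check that runs before anything ships.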

For every context policy:

Context sources:
Retrieval policy:
Reranking objective:
Memory write policy:
Compression policy:
Isolation boundaries:
Citation requirements:
Tool-output handling:
Sensitive-data policy:
Staleness policy:
Replay evals:
Observed regressions:

For every RL environment:

Task distribution:
Environment interface:
Allowed tools:
State transition rules:
Reward components:
Verifier implementation:
Judge model and version:
Rubric:
Anti-hacking tests:
Timeouts and budgets:
Replay set:
Human escalation path:

This is where production AI becomes engineering rather than prompt folklore.

10. Conclusion

The next wave of progress will not come only from larger models.

It will come from better data control.

Better extraction.
Better cleaning.
Better deduplication.
Better filtering.
Better source mixtures.
Better ordering.
Better synthetic compilers.
Better long-context curricula.
Better context playbooks.
Better verifier-backed environments.
Better reward models.
Better replay systems.
Better governance.

The teams that win will not be the teams with the largest pile of tokens.

They will be the teams that know which data changes behavior, why it changes behavior, how to measure that change, and how to keep improving it without corrupting the system.

Data is not just the input to training.

Data is the control surface.

References

[1] Epiplexity: Information Theory for Computationally Bounded Learners. https://arxiv.org/abs/2601.03220

[2] FineWeb: Decanting the Web for the Finest Text Data at Scale. https://arxiv.org/html/2406.17557v1

[3] Apple Machine Learning Research: The Data-Quality Illusion. https://machinelearning.apple.com/research/data-quality-illusion

[4] NVIDIA Nemotron Nano 2 Technical Report. https://arxiv.org/html/2508.14444v2

[5] Qwen3 Technical Report. https://arxiv.org/html/2505.09388v1

[6] RegMix: Data Mixture as Regression. https://arxiv.org/abs/2407.01492

[7] DataComp-LM: In Search of the Next Generation of Training Sets for Language Models. https://arxiv.org/abs/2406.11794

[8] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/html/2501.12948v1

[9] Kimi K2 Technical Report. https://arxiv.org/html/2507.20534v2

[10] NVIDIA Nemotron 3 Super Technical README. https://github.com/NVIDIA-NeMo/Nemotron/blob/main/docs/nemotron/super3/README.md

[11] ACE: Agentic Context Engineering. https://arxiv.org/html/2510.04618v1

[12] ICLERB and RLRAIF: Learning to Retrieve In-Context Examples by Utility. https://arxiv.org/html/2411.18947v1

[13] RULER: What’s the Real Context Size of Your Long-Context Language Models? https://openreview.net/forum?id=kIoBbc76Sy

[14] NoLiMa: Long-Context Evaluation Beyond Literal Matching. https://openreview.net/forum?id=0OshX1hiSa

[15] LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks. https://arxiv.org/abs/2412.15204

[16] ContextBench: A Benchmark for Measuring Context Engineering in Coding Agents. https://arxiv.org/html/2602.05892v2

[17] Instruction Pre-Training: Language Models are Supervised Multitask Learners. https://arxiv.org/html/2406.14491v1

[18] MAmmoTH2: Scaling Instructions from the Web. https://arxiv.org/html/2405.03548v3

[19] Web Reconstruction: Reconstructing Instruction Data from the Web. https://arxiv.org/html/2504.15573v1

[20] The Verifier-Compiler Loop. https://www.equationblog.com/p/the-verifiercompiler-loop-turning

Updated workout routine - Part 3

Ruslan Belkin — Sat, 09 May 2026 14:15:58 GMT

“Train to stimulate, not annihilate.” —Lee Haney

I’ve recently tweaked my workout schedule again.

In the previous two posts I described the updated weight-lifting block and the updated functional block. The first one was more muscle / VO2max maintenance focused. The second one was more full-body power / functional movement focused.

This post is about a different idea: combining multiple styles of training in the same schedule — strength, hypertrophy, power, agility, combat-style conditioning, grip strength, yoga / mobility and VO2max work.

The goal is not to make the week random. The goal is to make it dense, but still sequenced.

The main constraint is not just muscular fatigue. It is nervous system fatigue. Muscles can often handle more work than attention, coordination and movement quality can handle. Power, agility, plyometrics, MMA / boxing and interval work are not just “muscle” work. They require coordination, speed, braking, balance, bracing and intent. When mental fatigue sets in, form starts getting sloppy before the muscles are truly done.

Because I work out first thing in the morning, this matters a lot. The workout cannot consume the whole mental budget for the day. I still need to be useful at work at 9 am, 1 pm and 5 pm.

So this routine is built around a simple idea: enough stimulus to keep progressing, but not so much nervous system fatigue that the rest of the day becomes a tax.

As before, I work out first thing in the morning, with a 5:30 am wake-up, quick oral/face hygiene, pre-workout supplements, and a 10-15 min meditation.

Because the schedule is dense, I’m still skipping pre-workout foam rolling during the work week. I do try to compensate on the weekend with a quick 20 min foam-rolling session once a week, or, when possible, a professional massage every few weeks. If I had unlimited time, I’d keep the full pre-workout mobility work, but this is the realistic compromise.

I still use guided workouts, mostly Beachbody / BODi historical programs, mainly for timing and pacing, and my week starts on Sundays.

The general structure: a main workout, followed by a cardio segment, with optional add-ons for abs, pull-ups, chin-ups or core depending on the day.

This is a 4-week block. Muscle building remains the primary goal, but athleticism is no longer separated into a completely different block. It is built into the same week.

Abbreviations below:

Body Beast = Body Beast / Beast Up / Deluxe family. In this block, that includes the Build, Bulk, Tempo, Beast Up, Beast: Abs and Beast: Abs Classic workouts. This is still the main hypertrophy engine of the schedule.

P90X GN = P90X Generation Next
T30 = Tough Mudder T-MINUS 30
PO4 = The Power of 4
Zone 2 and interval work are elliptical-based in this schedule.

A note on the elliptical: I am using it intentionally, not as a compromise because I cannot think of something better. I have an old tendon injury from MMA, so high-volume running is not the best risk/reward trade for me. The elliptical gives me a close enough running-like stimulus for Zone 2 and interval work, but with much lower impact and less tendon irritation. It is not identical to running — the lack of impact and fixed path are part of the point — but it gets close enough for my current goals. The other advantage is that, with variable incline, the elliptical can hit the posterior side of the legs surprisingly well. Higher incline and more deliberate drive turn it into useful work for glutes, hamstrings and calves, not just generic “cardio.”

Week 1 - accumulation and continuity week

Week 1 is the classic foundation week. There is a normal Beast chest / back / shoulders structure, but the lower body and athletic work are already present.

Tuesday is dense because legs are followed by P90X3: The Challenge, so Zone 2 is cut to 30 min.

Friday is the first athletic / power day. Saturday is long on the clock, but yoga plus Zone 2 is not the same type of stress as heavy lifting plus intervals.

Week 2 - athletic leg variation week

Week 2 shifts lower-body work from classic leg hypertrophy to more athletic lower-body strength.

Friday also becomes a true agility day. This is one of the reasons the block feels different from a standard bodybuilding plan — the muscle stimulus is still there, but the movement quality matters just as much.

The upper body still gets enough direct work: Beast Up push, Tempo Back/Bis, shoulders, arms, The Challenge and the pull-up / chin-up finishers.

Week 3 - peak overload week

Week 3 is the hardest week on paper.

Bulk Chest, Build Back/Bis, Bulk Legs and Full Body Power all land in the same week. This is where form quality matters more than chasing extra load.

Full Body Power is exactly the kind of workout where the nervous system gets tired faster than the muscles. You may still be able to move the weight, but the movement is no longer crisp. That is usually the signal to reduce weight or stop increasing intensity.

The goal here is not to survive the workout. The goal is to create a strong enough training stimulus without borrowing too much from the rest of the day.

Week 4 - consolidation without going soft

Week 4 is not a deload. It is more of a consolidation week.

Tempo Chest/Tris and Tempo Back/Bis reduce the need to chase maximum weight, but still create a strong muscle stimulus through time under tension.

Acceleration / Deceleration on Friday is also a good fit here. It is not just about moving quickly. It is about braking, redirection and movement precision. That makes it very useful, but also very form-dependent.

The week is somewhat easier neurologically than Week 3, but it is not soft.

Why this structure

There are four hypertrophy-led days per week: push, pull, legs, shoulders / arms.

There is one combat / core bridge day per week: Cardio Boxing or MMX, followed by Core Circuit.

Note: That slot is somewhat flexible. Sometimes I follow the prescribed P90X GN Cardio Boxing or P90X3 MMX routine exactly, especially when I feel unimaginative and just want a guided cardio workout. Other times, I use the same slot to practice specific Krav Maga techniques and combinations from prior training: footwork, strikes, kicks, defensive movements, and transitions. The point is not the exact brand-name workout. The point is combat-style conditioning, coordination, rotation, trunk control and movement under fatigue.

There is one athletic / power day per week: Plyometrix, Speed and Agility, Full Body Power, or Acceleration / Deceleration.

There is one full yoga anchor per week.

This may be the opposite for most people, but yoga is one of the hardest routines for me mentally. Not because it is the most metabolically demanding, but because it requires patience, stillness, balance and sustained attention. That is exactly why I put it on Saturday. I do not want to start a dense workday by forcing myself through the routine that costs me the most mentally. Saturday gives me more room to do it properly.

That is the key difference between “hard” and “expensive.” Some workouts are hard muscularly. Some workouts are expensive neurologically. I am trying to avoid too many expensive sessions stacked together.

Tuesday is intentionally controlled. Leg day plus P90X3: The Challenge is already enough. That is why Zone 2 is only 30 min.

Wednesday uses short intervals, but the base workout is combat / MMA, not heavy lifting.

Friday is the main athletic day, but it is followed by Zone 2 and a short pull-up routine, not another heavy strength block.

Saturday is the recovery anchor, even though it is not a rest day.

P90X3: The Challenge

P90X3: The Challenge consists of four blocks. Each block has two sets of pull-ups followed by push-ups, with different variations.

My current numbers are 15 pull-ups and 40 push-ups per set. That’s 120 pull-ups and 320 push-ups per workout.

There’s also a burnout at the end — 2 pull-ups and 4 push-ups, no rest, for 6 sets — so add another 12 pull-ups and 24 push-ups to the total.

In this block, The Challenge shows up every Tuesday after leg work. This is intentionally aggressive. It keeps upper-body endurance and grip strength in the schedule without making the whole block a pure bodybuilding program.

It is also one more reason why Tuesday Zone 2 is shorter.

T-MINUS 30 pull-up and chin-up routines

The T30 Pull Up and Chin Up routines are short, roughly 11-12 min.

They are not meant to replace the main strength work. They are there as grip-strength and pulling-practice finishers.

My ladders are usually 2-4-8-16. That gives enough volume to matter, but the routines are short enough that I can still treat them as add-ons.

Friday gets the Pull Up routine. Saturday gets the Chin Up routine.

If I am short on time or if grip is already too fatigued, these are optional and can be dropped.

Intervals

Long intervals: warm-up, 3 × 5-min hard efforts with 2-min easy segments in between, then cool-down.

Short intervals: warm-up, 9 × 1-min hard efforts with 1-min easy segments in between, then cool-down.

All interval and Zone 2 work in this block is done on the elliptical. For intervals, I adjust resistance / pace to create the hard segments. For Zone 2, I mostly use incline and resistance to keep the effort steady while biasing the posterior chain.

In this schedule, Sunday has a longer short-interval block, Wednesday has a shorter short-interval block, Monday has the longer long-interval block, and Thursday has a shorter long-interval block.

All other cardio is Zone 2.

When time is short, abs get dropped first. Then the optional pull-up / chin-up routines. If Zone 2 is 50 min, I cut it to 30. If something still has to go, I would rather preserve the main workout and drop the supplemental cardio than rush the main movement work.

Notes on fatigue

This block averages a little under 2 hours per day. The shortest day is around 93 min and the longest day is around 130 min.

That is a lot. The only reason this works is that not every minute has the same recovery cost.

Heavy lifting, power moves, agility work, boxing / MMA, intervals and high-rep pull-up work all hit differently. The schedule is designed so that the hard work is spread out, and the most coordination-heavy work does not sit on top of maximal lifting every day.

For me, the nervous system fatigue is the main constraint. If I train too hard in the morning, the problem is not just soreness. The problem is reduced focus later in the day.

That is a bad trade.

The point of morning training is to improve the body and improve the day, not to win the workout and lose the workday.

Form

Form is critical in this block, especially once mental fatigue sets in.

This is true for heavy lifting, but it is even more true for power and agility work. Bad curls are usually not a great idea. Bad acceleration / deceleration or sloppy plyometrics are much worse.

Seemingly lower weights can be deceptive. Do not raise the weights unless you can complete the exercise with nearly perfect form. The risk of injury with complex power moves goes up considerably.

Note: doing exercises with perfect form is an easy way to get more out of lower weights while greatly reducing the risk of injury.

For tempo workouts, I am not chasing weight. I am chasing tension and control.

For power workouts, I am not chasing fatigue. I am chasing crisp movement.

For intervals, I am not trying to turn every segment into a max-effort suffer fest. The goal is repeatable quality.

As always — if the last set is too easy, add a few extra reps; next time, increase the weight. The key to performance is sufficient stimulus.

When traveling

When traveling, if a hotel gym with weights is available, the strength portions can mostly continue, depending on space and how crowded the place is.

If no gym is available, I normally switch to bodyweight conditioning for the duration of travel. Insanity Max 30 Month Two workouts are still a decent fallback. They are not the same stimulus, but they are good enough for maintenance.

If there is a pull-up bar, I can keep some version of The Challenge or the T30 pull-up / chin-up work. If not, I don’t overthink it.

Travel workouts are for maintenance, not perfect replication.

Again - this is what I do. This is not a recommendation that everyone should train 7 days per week or spend roughly 2 hours per day working out.

Consistency and a purposeful plan are key, and there are many other ways to stay in shape.

Reliability knobs for agents

Ruslan Belkin — Mon, 30 Mar 2026 14:15:11 GMT

The agent conversation is as noisy as it has ever been:

One camp says base models are now good enough, just give them tools.
Another (the one I am mostly in) says evaluation is the bottleneck, just build better judges.

While I would assign greater weight to the second camp, they are both directionally right and still too coarse.

The durable gains are showing up somewhere less glamorous and much more useful: new reliability knobs. Not one magical agent architecture, but a stack of control surfaces that help systems lose less intent, preserve more capability, remember more of the right state, and fail in ways teams can actually inspect and correct.

Several papers of note on these problems:

1. PPS: make intent explicit before the model starts guessing

Natural-language prompts have a hidden failure mode: intent transmission loss. The user knows what they mean; the model only sees the compressed, underspecified surface form. The PPS paper attacks that directly. Across 60 tasks, 3 domains, 3 models, and 540 generations, natural-language-rendered PPS outperformed both simple prompts and raw JSON on goal alignment. The gains were strongest in ambiguous business tasks, weaker in technical tasks, and actually reversed in low-ambiguity travel planning. That last part matters. It means structured prompting is not magic; it is most valuable when the user’s objective is fuzzy enough to be misread. The preliminary survey result is also practical: fewer follow-up rounds, from 3.33 to 1.13 on average.

Why we should care: a lot of agent failure is downstream of an upstream ambiguity. If the goal, audience, constraints, tone, and success criteria are hazy at turn one, the rest of the trajectory is just confident error propagation. The other useful lesson here is that raw structure is not enough. In the study, rendered PPS beat raw JSON. So the knob is not “add schema everywhere.” It is “make intent explicit in a form the model can actually use.”

Paper: https://arxiv.org/abs/2603.18976
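
A minimal sketch of the "make intent explicit" knob, assuming a hypothetical IntentSpec built from the five things named above (goal, audience, constraints, tone, success criteria). The field names and the rendering are my own illustration, not the PPS schema from the paper. The only point it shows is rendering structure into sentences instead of handing the model raw JSON.

```python
from dataclasses import dataclass, field


@dataclass
class IntentSpec:
    """Hypothetical intent record; field names are illustrative, not the PPS schema."""
    goal: str
    audience: str
    tone: str
    constraints: list[str] = field(default_factory=list)
    success_criteria: list[str] = field(default_factory=list)


def render_intent(spec: IntentSpec) -> str:
    """Render the spec as plain sentences rather than raw JSON."""
    lines = [
        f"Goal: {spec.goal}.",
        f"Audience: {spec.audience}.",
        f"Tone: {spec.tone}.",
    ]
    if spec.constraints:
        lines.append("Constraints: " + "; ".join(spec.constraints) + ".")
    if spec.success_criteria:
        lines.append("The result is successful if: " + "; ".join(spec.success_criteria) + ".")
    return "\n".join(lines)


spec = IntentSpec(
    goal="Draft a renewal email for an at-risk customer",
    audience="a VP of Engineering",
    tone="direct and respectful",
    constraints=["under 150 words", "no discount offers"],
    success_criteria=["asks for a 30-minute call"],
)
prompt = render_intent(spec)
print(prompt)
```

Given the paper's own result, I would only bother with this for ambiguous tasks. For low-ambiguity requests the extra structure may not help.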

2. Prompt repetition: a surprisingly cheap robustness hack

This one sounds silly until you think about what causal attention is doing.

The core idea is simple: when reasoning is not enabled, repeat the prompt. The paper shows that prompt repetition improved accuracy across popular models without increasing output length or materially increasing latency in most settings. In their experiments, it won 47 of 70 benchmark-model combinations with 0 losses when reasoning was disabled. The gains were especially strong when the prompt order was hostile to the model, such as options-first multiple choice, and on custom tasks like NameIndex and MiddleMatch.

Why we should care: not every model call inside an agent should be a long reasoning trace. Some calls are cheap subroutines: route this, extract that, normalize this, draft tool arguments, re-check the user constraint. For those non-reasoning calls, prompt repetition looks like a genuinely useful default knob. It is not a substitute for reasoning, and it is not free on arbitrarily long prompts, but for short operational subcalls it is exactly the kind of low-cost robustness trick teams tend to underrate.

Paper: https://ar5iv.labs.arxiv.org/html/2512.14982v1
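
A rough sketch of how this could look as a default for cheap subcalls. The function names, the separator and the reasoning_enabled flag are assumptions for illustration; the paper's exact repetition template may differ.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`."""
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


def build_subcall_prompt(prompt: str, reasoning_enabled: bool) -> str:
    """Apply repetition only when the call does not use a reasoning trace."""
    return prompt if reasoning_enabled else repeat_prompt(prompt)


routing_prompt = "Options: billing, outage, other. Which queue fits: 'I was charged twice'?"
sent = build_subcall_prompt(routing_prompt, reasoning_enabled=False)
print(sent)
```

The input doubles in length, which is why I would keep this to short operational prompts.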

3. GLM-5: do not buy agentic RL by deleting earlier skills

Sequential post-training has a nasty habit: each new stage can quietly sand down the thing the previous stage got good at.

GLM-5 is interesting partly because it says that out loud. Their pipeline runs sequential RL stages for reasoning, then agentic behavior, then general helpfulness, and then uses on-policy cross-stage distillation as a final refinement to recover skills from earlier stages. Previous stage checkpoints become teachers; the final pass is meant to stop the classic “more agentic, less sharp” tradeoff from becoming acceptable collateral damage. On their reported benchmarks, GLM-5 posts about a 20% average gain over GLM-4.7 across agentic, reasoning, and coding tasks, including 77.8 on SWE-bench Verified.

The more durable lesson is even lower-level than that. The paper is refreshingly explicit that agentic RL stability lives in systems details as much as in objectives. They switched to a deterministic top-k operator because nondeterministic sparse-attention selection caused sharp RL degradation, froze the indexer during RL for stability, and emphasized token-in-token-out handling so the trainer learns on exactly the same token stream produced by the rollout engine. That is the kind of detail that separates “agent demo” from “agent training system.”

Why we should care: a lot of agent improvement work still acts as if new capabilities can simply be stacked. In practice, they interfere. If post-training adds planning but dulls reasoning, or adds autonomy but destabilizes the learning loop, you have not really improved the agent. You just moved the failure somewhere harder to notice.

Paper: https://ar5iv.labs.arxiv.org/html/2602.15763v2
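
To make the determinism point concrete, here is a toy top-k selection with a fixed tie-break. This is not GLM-5's operator; it is a plain-Python illustration of the property that matters, which is that equal scores resolve the same way on every run.

```python
def deterministic_top_k(scores: list[float], k: int) -> list[int]:
    """Return indices of the k largest scores, breaking ties by lowest index.

    Sorting on (-score, index) gives a total order, so equal scores
    always resolve the same way on every run.
    """
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


scores = [0.5, 0.9, 0.5, 0.9, 0.1]
selected = deterministic_top_k(scores, 3)
print(selected)  # [0, 1, 3]
```

Index 0 and index 2 tie at 0.5, and the rule always keeps index 0. Without a rule like this, the rollout engine and the trainer can end up attending to different tokens for the same input.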

4. FullStack-Agent: round-trip the artifact, then test the hidden surfaces

I like this one because it attacks a very real agent failure mode: the frontend looks right, the demo works, and the backend is still fake.

FullStack-Agent combines three ideas. First, a multi-agent development workflow with specialized debugging tools for frontend and backend work. Second, FullStack-Bench, which evaluates not just frontend behavior but backend APIs and database state as well. Third, Repository Back-Translation, which converts existing real-world repositories into agent trajectories the model can learn from. The benchmark itself is notable: 101 instructions, 647 frontend tests, 604 backend tests, and 389 database tests. Even better, frontend success is not counted unless the required database interaction is real. That is exactly the kind of hidden-surface check agent evaluation needs more of.

The results are strong, but the more interesting pattern is the training and evaluation shape. FullStack-Dev with a Qwen backbone reached 64.7 frontend, 77.8 backend, and 77.9 database accuracy, while FullStack-Learn improved a 30B model through self-improvement using repository back-translation and augmentation. The debugging tools also mattered a lot: removing the backend debugging tool increased average backend iterations from 74.9 to 115.5. That is not just a model story. It is a workflow design story.

Why we should care: reliable coding agents need falsifiable artifacts. A useful practical extension of this idea is round-tripping: code to spec, spec back to code, compare the two, and inspect the mismatch. That creates a verifier surface instead of treating the codebase as one opaque blob. More broadly, the paper is a reminder that real artifacts and real tests are better teachers than synthetic vibes.

Paper: https://arxiv.org/html/2602.03798v1
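
The hidden-surface idea can be sketched in a few lines. The handlers and the table below are hypothetical and have nothing to do with FullStack-Bench's actual harness. The check counts the visible success only when the required database row really exists.

```python
import sqlite3


def real_signup(db: sqlite3.Connection, email: str) -> dict:
    """Handler that actually persists the user."""
    db.execute("INSERT INTO users (email) VALUES (?)", (email,))
    db.commit()
    return {"status": "ok", "message": "Welcome!"}


def fake_signup(db: sqlite3.Connection, email: str) -> dict:
    """Handler that returns the same response but never touches the database."""
    return {"status": "ok", "message": "Welcome!"}


def signup_passes(handler, email: str = "a@example.com") -> bool:
    """Count the visible success only if the required database row exists."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
    response = handler(db, email)
    visible_ok = response["status"] == "ok"
    row = db.execute("SELECT COUNT(*) FROM users WHERE email = ?", (email,)).fetchone()
    db.close()
    return visible_ok and row[0] == 1


print(signup_passes(real_signup), signup_passes(fake_signup))  # True False
```

Both handlers return the same response. Only the database check tells them apart.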

5. MSA: memory should be part of the model, not a retrieval afterthought

Agents with long histories do not just need more context. They need memory that stays usable when the history becomes absurd.

MSA pushes that idea hard. The paper proposes an end-to-end trainable memory framework with sparse attention, document-wise RoPE, KV compression, and a Memory Parallel inference path. The headline is the kind of number people usually ignore until it becomes operationally relevant: less than 9% degradation while scaling from 16K to 100M tokens, with 100M-token inference on 2xA800 GPUs. The other important piece is Memory Interleave, which alternates retrieval, context expansion, and generation so the model can reason across scattered memory segments instead of just pulling one flat chunk and hoping.

Why we should care: a lot of current agent memory stacks are really retrieval pipelines wearing a memory costume. That works until the task needs long-range consistency, multi-hop evidence integration, or stable persona/state over time. MSA is interesting because it tries to make memory intrinsic and differentiable rather than bolted on. The real caveat is operational: the current setup still relies on offline pre-encoding of the corpus. So it is not a universal replacement for dynamic knowledge systems yet. But as a direction, it is much closer to agent memory than “just add bigger RAG.”

Paper: https://arxiv.org/abs/2603.23516

6. Verifier–compiler loops: verification is becoming its own stack

This is the one I keep coming back to.

The core production fact is ugly and simple: long workflows multiply small defects. In the verifier–compiler loop framing, a 1% failure rate across 100 steps leaves only about 36.6% end-to-end success. Even 0.1% per-step failure still leaves only about 90.5%. That is the march-of-nines problem. The implication is that agent reliability is not mainly a prompt problem. It is an error-correction problem. The system needs to observe the episode, judge it against institutional standards, intervene conservatively, replay changes before release, and keep durable evidence of what changed and why. That is also why the distinction between execution knowledge and institutional judgment matters: the agent can know the facts and still fail the organization.

Recent judge work mostly points in the same direction. JudgeBench shows hard evaluator tasks are genuinely hard, with strong models like GPT-4o only slightly above random on some challenging judge settings. RewardBench 2 makes reward evaluation meaningfully harder than RewardBench 1 and emphasizes correlation with downstream use. DeepSeek’s GRM/SPCT line is also important because it argues that reward modeling itself can scale with more inference compute through principle generation, critique, and voting, not just with bigger training runs.

But the field is also getting more honest about calibration. Evaluative Fingerprints found near-zero inter-judge agreement while also showing that judges are individually stable enough to be fingerprinted from their rubric behavior. In other words: they are not random, they are systematically different. Separate work on LLM-as-a-judge reporting shows that evaluator bias and uncertainty should be corrected statistically, not hand-waved. On user simulation, the news is similarly mixed: SimulatorArena suggests profile-conditioned simulators can track human judgments reasonably well on some tasks, but Lost in Simulation shows simulator choice can move measured success rates by up to 9 points and systematically miscalibrate difficulty.

Why we should care: one judge score is not a control system. High-reliability agents are going to need judge stacks, not judge monocultures: crisp gates for obvious defects, stronger reasoning judges for nuance, replay before release, disagreement review for hard cases, and humans on the highest-risk boundaries. Simulation will help widen coverage, but only if it is continuously calibrated against real traces.

Blog: https://www.equationblog.com/p/the-verifiercompiler-loop-turning
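
A minimal sketch of a judge stack as a routing function. Every name and threshold here is an assumption for illustration, and replay before release is left out because it belongs to the offline loop rather than the per-response path.

```python
from typing import Callable, Optional

# All names below are hypothetical. A gate returns a reason string when it
# finds a clear defect and None otherwise; a judge returns a score in [0, 1].
Gate = Callable[[str], Optional[str]]
Judge = Callable[[str], float]


def run_judge_stack(
    draft: str,
    gates: list[Gate],
    judges: list[Judge],
    high_risk: bool,
    pass_threshold: float = 0.7,
    max_spread: float = 0.3,
) -> str:
    """Return 'block', 'human_review', 'disagreement_review', 'rewrite' or 'pass'."""
    if not judges:
        raise ValueError("at least one judge is required")
    for gate in gates:  # crisp gates run first
        if gate(draft) is not None:
            return "block"
    if high_risk:  # humans own the highest-risk boundary
        return "human_review"
    scores = [judge(draft) for judge in judges]
    if max(scores) - min(scores) > max_spread:
        return "disagreement_review"  # judges differ too much to trust a mean
    mean = sum(scores) / len(scores)
    return "pass" if mean >= pass_threshold else "rewrite"


def leak_gate(draft: str) -> Optional[str]:
    """Toy gate: flag drafts that contain an account identifier."""
    return "contains account number" if "ACCT-" in draft else None


verdict = run_judge_stack(
    "Your refund is on its way.",
    gates=[leak_gate],
    judges=[lambda d: 0.9, lambda d: 0.4],
    high_risk=False,
)
print(verdict)  # disagreement_review
```

The two toy judges disagree by 0.5, so the draft goes to review instead of being averaged into a mediocre 0.65 and quietly rewritten.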

7. IndexCache: systems work is reliability work too

This one is more infrastructure than alignment, but it belongs in the same conversation.

IndexCache starts from a simple observation: in sparse attention, adjacent layers often choose very similar top-k token sets. So instead of recomputing the indexer at every layer, reuse it across layers. On the reported results, that removes up to 75% of indexer computation with negligible quality loss, while reaching 1.82x prefill speedup and 1.48x decode speedup at 200K context. The paper also reports 70–100% top-k overlap across adjacent layers, which is the structural reason the trick works.

Why we should care: efficiency is not separate from reliability. Every unit of inference cost you remove from the serving path can be reinvested into something reliability-shaped: longer context, more retrieval, more search, more verifier passes, more replay budget, or simply lower latency at the same control quality. That is why inference-side engineering keeps mattering more than people think.

Paper: https://arxiv.org/html/2603.12201v1
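
The structural observation is easy to measure on toy data. The sketch below is my own illustration, not IndexCache code. It computes what fraction of top-k indices adjacent layers share, which is the quantity that makes reuse plausible in the first place.

```python
def top_k_set(scores: list[float], k: int) -> set[int]:
    """Indices of the k highest scores, ties broken by lowest index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def adjacent_overlap(layer_scores: list[list[float]], k: int) -> list[float]:
    """Fraction of top-k indices shared by each pair of adjacent layers."""
    sets = [top_k_set(scores, k) for scores in layer_scores]
    return [len(a & b) / k for a, b in zip(sets, sets[1:])]


# Toy indexer scores for 3 layers over 4 tokens (made-up numbers).
layer_scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.9, 0.3, 0.1],
    [0.1, 0.9, 0.8, 0.2],
]
overlaps = adjacent_overlap(layer_scores, k=2)
print(overlaps)  # [1.0, 0.5]
```

Where the overlap is high, the next layer could reuse the previous layer's indices and skip its own indexer pass.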

The connective tissue

If I had to compress the direction into one line, it is this: reliable agents are becoming layered control systems.

Structured intent reduces loss before the trajectory begins. Prompt repetition stabilizes cheap non-reasoning subcalls. Post-training methods like cross-stage distillation try to make new capabilities additive instead of destructive. Artifact-grounded training and hidden-surface testing make agent outputs more falsifiable. Long-memory work tries to decouple memory capacity from reasoning quality. Judge research is forcing evaluation to become calibrated, replayable, and auditable. Systems work buys the budget to do more of all of it in real time.

No single magical architecture, just a growing stack of knobs that make agent behavior narrower, more inspectable, and a little less mysterious week over week.

The Verifier–Compiler Loop: Turning Human Preferences into Production Agent Judgment

Ruslan Belkin — Mon, 09 Mar 2026 14:14:40 GMT

Agents do not usually fail in production because a prompt suddenly stopped working. They fail because a workflow that looked 98% fine in isolation turns into 30 turns, six tool calls, two handoffs, a compliance boundary, and a frustrated human on the other side.

That is the march-of-nines problem (as originally described by Andrej Karpathy). In a long workflow, tiny per-step defects compound into very large end-to-end losses. At 100 steps, a 1% failure rate at each step leaves only about 36.6% end-to-end success; even 0.1% still leaves only about 90.5%. For customer service, regulated operations, and multi-agent workflows, that gap is the difference between a promising demo and a system the business can trust.
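
The arithmetic is worth running once. This sketch assumes steps fail independently, which real workflows only approximate, and the function names are mine.

```python
def end_to_end_success(per_step_failure: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return (1.0 - per_step_failure) ** steps


def max_per_step_failure(target_success: float, steps: int) -> float:
    """Largest per-step failure rate that still meets the end-to-end target."""
    return 1.0 - target_success ** (1.0 / steps)


for failure in (0.01, 0.001):
    rate = end_to_end_success(failure, 100)
    print(f"{failure:.1%} per step over 100 steps -> {rate:.1%}")
# 1.0% per step over 100 steps -> 36.6%
# 0.1% per step over 100 steps -> 90.5%

print(f"{max_per_step_failure(0.99, 100):.4%}")  # 0.0100%
```

Run in the other direction, a 99% end-to-end target over 100 steps allows roughly a 0.01% failure rate per step.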

The practical implication is that production reliability should be treated as an error-correction problem. Better base models help, but they are only part of the story. The system also needs a way to observe what happened, judge it against the organization’s standards, intervene safely when necessary, replay changes before release, and keep durable evidence of what changed and why.

Figure 1. In long workflows, small per-step errors compound into large end-to-end losses.

Reliability should come from continuously catching, correcting, compiling, and proving small errors — not from hoping one prompt patch will cover every edge case.

Production failures are not only knowledge failures

A surprisingly large share of production failures are not missing-fact failures. The agent may know the product terms, retrieve the right document, or call the right tool and still fail the institution.

Four surfaces matter in practice: missing knowledge, institutional judgment, user affect, and evidence. A response can be factually correct but still violate policy, choose the wrong level of certainty, worsen the user’s emotional trajectory, or leave the team unable to reconstruct what happened well enough to approve a safe fix.

Missing knowledge: The agent lacks a fact, a tool result, or an updated policy exception.
Judgment misalignment: The facts are present, but the trade-off between speed, certainty, empathy, policy, or escalation is wrong.
Affect regression: The reply is technically valid but increases frustration, distrust, or confusion.
Evidence gap: The team cannot replay the episode, inspect the rewrite, or approve the next release with confidence.

A banking example makes the distinction concrete. An assistant can know the product, know the account state, and still reply in a way that is too dismissive, too certain, or too slow to escalate.

Figure 2. A customer-facing answer can know the facts and still miss the institution’s judgment on policy, style, affect, or action.

Judgment should be explicit

One way to make this tractable is to keep execution knowledge and institutional judgment separate, but versioned together. Skills describe how the work gets done: task-specific success criteria, approved procedures, tool permissions, and required evidence. Judgment describes how the organization wants that work done: risk boundaries, policy rules, quality bars, tone, trade-offs, escalation behavior, and release gates.

That separation matters because business policy changes faster than workflow logic. If a new escalation rule or compliance boundary requires rewriting every skill from scratch, the system becomes brittle. If judgment is explicit, global changes can be compiled once and enforced across many workflows.

Figure 3. A reliable agent should keep workflow execution and institutional judgment separate, but versioned together as a behavioral contract.
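
One possible shape for such a contract, as a sketch. The class names and fields are hypothetical, chosen only to show skills and judgment as separate objects released under a single version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """How the work gets done (hypothetical fields)."""
    name: str
    procedure: tuple[str, ...]
    allowed_tools: frozenset[str]


@dataclass(frozen=True)
class Judgment:
    """How the organization wants work done, shared across all skills."""
    version: str
    escalate_above_amount: float
    banned_phrases: tuple[str, ...]


@dataclass(frozen=True)
class BehavioralContract:
    """Skills and judgment stay separate but ship under one version."""
    version: str
    skills: dict[str, Skill]
    judgment: Judgment

    def with_judgment(self, new_judgment: Judgment, version: str) -> "BehavioralContract":
        """A policy change swaps judgment once; no skill is rewritten."""
        return BehavioralContract(version, self.skills, new_judgment)


refund = Skill("refund", ("verify identity", "issue refund"), frozenset({"payments_api"}))
v1 = BehavioralContract("1.0", {"refund": refund}, Judgment("j1", 500.0, ("guaranteed",)))
# New escalation rule: lower the threshold without touching any skill.
v2 = v1.with_judgment(Judgment("j2", 200.0, ("guaranteed",)), version="1.1")
print(v2.version, v2.judgment.escalate_above_amount, v2.skills is v1.skills)
```

The new escalation threshold produces a new contract version while every skill definition is carried over unchanged.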

An example of a production flywheel

A useful production flywheel is not a vague analytics dashboard. It turns one live interaction into an ordered, auditable episode. In one practical pattern, a conversation becomes: user message → draft response → affect and intent signals → initial evaluation → rewrite decision → post-rewrite evaluation → delivered message.

Once that structure exists, the same episode can be replayed later, reviewed by humans, approved or rejected, and compiled into the next version of the behavioral contract. Runtime interventions can stay conservative — rewrite, route, pause, slow-halt, or escalate — while the offline loop decides what should become a durable product change.

This is the shift from prompt folklore to production engineering. A corrected response is useful in the moment; a replayable, reviewable correction is useful week over week.
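
A minimal sketch of the episode as a data structure. The stage names follow the sequence described above; the Episode class, its methods and the payloads are assumptions for illustration.

```python
from dataclasses import dataclass, field

# Stage names follow the sequence in the text; everything else is hypothetical.
STAGES = (
    "user_message",
    "draft_response",
    "affect_and_intent_signals",
    "initial_evaluation",
    "rewrite_decision",
    "post_rewrite_evaluation",
    "delivered_message",
)


@dataclass
class Episode:
    episode_id: str
    events: list[tuple[str, dict]] = field(default_factory=list)

    def record(self, stage: str, payload: dict) -> None:
        """Append an event, enforcing stage order so replays stay comparable."""
        expected = STAGES[len(self.events)] if len(self.events) < len(STAGES) else None
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        self.events.append((stage, payload))

    def is_complete(self) -> bool:
        """True once every stage has been recorded in order."""
        return len(self.events) == len(STAGES)


episode = Episode("ep-001")
for stage in STAGES:
    episode.record(stage, {"note": f"payload for {stage}"})
print(episode.is_complete())  # True
```

Enforcing the order at write time is a design choice. It keeps replayed episodes directly comparable with the originals.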

A few operating patterns tend to work

Observe real traces, not only synthetic eval sets.
Keep affect, policy, and tool behavior in the same episode view.
Replay proposed changes before release rather than patching blindly in production.
Treat approvals and evidence as part of the product, not as after-the-fact documentation.

Figure 4. A production flywheel turns one conversation into an ordered, replayable episode that can be inspected, approved, and released with confidence.

Judge quality now depends on calibration

Recent judge research points in two directions at once. On one hand, judges are improving. Benchmarks such as JudgeBench and RewardBench 2 are making judge quality easier to measure, while DeepSeek’s GRM / SPCT work suggests that principle generation, critique, and inference-time aggregation can make reward modeling and preference judging much stronger in practice.

On the other hand, the field is getting more honest about calibration. Newer 2025–2026 work argues that one raw judge score is not enough. Some evaluators show stable but different “evaluative fingerprints,” meaning they can be internally consistent while still disagreeing systematically with one another. Other papers show that rubric order, pointwise versus pairwise framing, and judge allocation can shift rankings if those choices are left uncalibrated.

The production lesson is simple: one judge should not be the whole control system. High-reliability systems tend to use a judge stack — crisp gates for clear defects, calibrated reasoning judges for nuance, replay for release decisions, disagreement review for hard cases, and humans for the highest-risk boundaries.

The newest question is no longer only “Which judge scores highest on a benchmark?” It is also “Which judge remains stable, interpretable, and properly calibrated inside a release process?”
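
One concrete reading of "properly calibrated" is to correct a judge's raw pass rate using its error rates against human labels. The sketch below uses the Rogan-Gladen estimator, a standard correction for a misclassifying test. I am not claiming it is the method used in the papers mentioned above, and the numbers are made up.

```python
def corrected_pass_rate(observed: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen estimate of the true pass rate from a noisy judge.

    sensitivity: P(judge says pass | human says pass)
    specificity: P(judge says fail | human says fail)
    Both would be measured on a human-labeled calibration set.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    estimate = (observed + specificity - 1.0) / denominator
    return min(1.0, max(0.0, estimate))  # clamp to a valid probability


raw = 0.80  # judge-reported pass rate on a release candidate (made-up number)
corrected = corrected_pass_rate(raw, sensitivity=0.95, specificity=0.70)
print(round(corrected, 3))  # 0.769
```

This judge passes 30% of the responses humans would fail, so its raw 80% overstates quality. The corrected estimate is closer to 77%.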

User simulation can help, but only with calibration

User simulation is becoming necessary because replay without synthetic users does not cover enough edge cases. Recent work such as SimulatorArena suggests that profile-conditioned simulators can track human ratings reasonably well on some multi-turn tasks, especially when the simulator has access to richer user profiles rather than a generic system prompt.

But simulation should not be treated as ground truth. Lost in Simulation is the warning label: simulator choice can materially change measured success, and simulated populations can drift away from real human behavior. The practical pattern is simulation plus calibration — use synthetic users to widen test coverage, then measure the gap to hold-out human traces and correct for it.
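
A deliberately crude sketch of "simulation plus calibration", assuming paired outcomes on the same hold-out tasks. The constant-offset correction and the function names are my simplification; a real gap will vary by task type.

```python
def calibration_gap(simulated: list[bool], human: list[bool]) -> float:
    """Simulated success rate minus human success rate on the same hold-out tasks."""
    if not simulated or len(simulated) != len(human):
        raise ValueError("need paired, non-empty outcome lists")
    return sum(simulated) / len(simulated) - sum(human) / len(human)


def adjust_simulated_rate(rate: float, gap: float) -> float:
    """Shift a new simulated rate by the measured gap, clamped to [0, 1]."""
    return min(1.0, max(0.0, rate - gap))


# Made-up hold-out outcomes: the simulator is easier to satisfy than real users.
simulated_holdout = [True, True, True, False]
human_holdout = [True, True, False, False]
gap = calibration_gap(simulated_holdout, human_holdout)
print(gap)  # 0.25
print(round(adjust_simulated_rate(0.80, gap), 2))  # 0.55
```

The gap should be re-measured whenever the simulator, the agent or the user population changes.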

Auditability should be part of the product

If the team cannot reconstruct which draft was blocked, which rewrite was sent, what evidence triggered the intervention, and which version of the behavioral contract approved the change, then it does not really have a production flywheel. It has prompt folklore.

Auditability is what turns one corrected episode into a durable release decision. In practice, that usually means keeping trace and event correlation, judge verdicts, rewrites, replay results, approval records, and release decisions together. The goal is not paperwork. The goal is to make the next change easier to inspect, safer to ship, and easier to trust.

How this connects to my ODSC East talk

At ODSC East 2026, I’ll go deeper into the mechanics behind this pattern: how judge stacks can be calibrated, how runtime interventions can be chosen without creating new risk, how replay should precede release, and how week-over-week improvements can turn human preferences into durable production agent judgment.

The talk title is “The Verifier–Compiler Loop: Turning Human Preferences into Production Agent Judgment.” This article only sketches the frame. The session will go much deeper into the system design and operating model behind it.

Agent reliability

Ruslan Belkin — Mon, 26 Jan 2026 15:15:42 GMT

The field is moving extremely fast right now. New agent stacks, new evals, new post-training tricks - the whole ecosystem shifts weekly.

But if you ship agents, you learn a painful lesson fast:

An agent that succeeds once is not a reliable agent.

Single-run success rates are demo metrics. Production reliability is a different game:

Consistency across runs (the same task, same setup, multiple attempts)
Robustness to “equivalent” user inputs (paraphrases, small spec changes, harmless reorderings)
Grace under tool/API failures (because they will fail - timeouts, rate limits, partial responses, schema drift)

If I had to compress the theme of the last ~90 days into one line, it’s this:

Reliability is a surface, not a score.

This post is intentionally narrow: recent work that treats agent reliability as a first-class object - not a vibe.

The production reality check: humans are still the reliability layer

A paper I keep pointing people to is “Measuring Agents in Production.” It’s one of the rare efforts that asks practitioners what’s actually working (and what’s breaking).

A few findings that stuck with me:

Many production agents are built to be simple and controllable: 68% run at most 10 steps before requiring human intervention.
Most teams lean on prompting off-the-shelf models vs weight tuning (70%), and rely primarily on human evaluation (74%).
Reliability shows up as the top challenge - especially “ensuring and evaluating correctness.”

That’s the current equilibrium: humans as circuit breakers.

The real question is how we scale beyond that without lying to ourselves about what “reliable” means.

ReliabilityBench: measuring reliability as a surface, not a score

ReliabilityBench is exactly the kind of benchmark we’ve needed.

Instead of asking “did it succeed,” it asks:

Does it succeed again (consistency)
Does it succeed under equivalent variations of the task (robustness)
Does it survive tool/API failures (fault tolerance)

They formalize this across three dimensions:

pass^k for repeated execution
perturbation intensity ε
fault intensity λ
...and propose a unified reliability surface: R(k, ε, λ).
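
To make the pass^k axis concrete, here is a sketch that estimates it from n recorded attempts with c successes, using the combinatorial estimator C(c, k) / C(n, k) known from τ-bench. Whether ReliabilityBench uses this exact estimator is an assumption on my part, and `reliability_surface` is a hypothetical helper that tabulates the estimate per (ε, λ) condition.

```python
from math import comb
from typing import Dict, Iterable, List, Tuple


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent attempts all succeed."""
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and n")
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    return comb(c, k) / comb(n, k)


def reliability_surface(
    runs: Dict[Tuple[float, float], List[bool]],
    ks: Iterable[int],
) -> Dict[Tuple[int, float, float], float]:
    """Map (k, eps, lam) to estimated pass^k, given outcomes per (eps, lam)."""
    ks = list(ks)
    surface = {}
    for (eps, lam), outcomes in runs.items():
        n, c = len(outcomes), sum(outcomes)
        for k in ks:
            surface[(k, eps, lam)] = pass_hat_k(n, c, k)
    return surface
```

With 9 successes out of 10 attempts, the single-run estimate is 0.9 but the k=2 estimate is 36/45 = 0.8, which is the gap between a demo metric and a consistency metric.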

Two ideas here that I think will stick:

Action metamorphic relations: judge correctness by end-state equivalence rather than brittle text matching.
Chaos-style fault injection: simulate timeouts, rate limits, partial responses, schema drift.
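
Both ideas fit in a few lines. The sketch below is a toy, not the benchmark's harness: `with_faults` injects timeouts at a chosen rate, and the metamorphic check compares end states (a cart dictionary) instead of comparing transcripts. All names here are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple


class ToolTimeout(Exception):
    """Injected fault standing in for a real tool/API timeout."""


def with_faults(tool: Callable, fault_rate: float, rng: random.Random) -> Callable:
    """Wrap a tool so that roughly fault_rate of calls fail before executing."""
    def wrapped(*args, **kwargs):
        if rng.random() < fault_rate:
            raise ToolTimeout("injected timeout")
        return tool(*args, **kwargs)
    return wrapped


def add_item(cart: Dict[str, int], item: str, qty: int) -> None:
    cart[item] = cart.get(item, 0) + qty


def run_plan(actions: List[Tuple[str, int]], tool: Callable) -> Dict[str, int]:
    """Apply actions to an empty cart, retrying each call once on timeout."""
    cart: Dict[str, int] = {}
    for item, qty in actions:
        try:
            tool(cart, item, qty)
        except ToolTimeout:
            tool(cart, item, qty)  # single retry; a second fault propagates
    return cart


def same_end_state(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """Metamorphic check: a harmless reordering must yield the same cart."""
    return a == b
```

Because the fault is raised before the wrapped tool runs, the retry cannot double-apply an action. Real tools do not always fail that cleanly, which is part of what fault injection is meant to expose.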

The reported results are the point:

Perturbations alone reduced success from 96.9% at ε=0 to 88.1% at ε=0.2.
Rate limiting was especially damaging.

This is what “production-like” really means: not one clean run, but performance under stress.

Why we should care:

If you only track single-run pass rates, you end up optimizing for demos.
A reliability surface forces the conversation into repeatability, robustness, and failure modes.

E-valuator: turn “judge scores” into runtime decisions (with guarantees)

Assume you’ve built a verifier (LLM judge, PRM, heuristics). You can score trajectories - but can you trust the score enough to make a runtime decision?

E-valuator reframes this as a sequential hypothesis testing problem: distinguish successful vs unsuccessful trajectories as actions unfold, using a statistically valid test at every step.

They propose converting any black-box verifier score into a decision rule with controlled false-alarm rates, and show it can both improve monitoring and terminate problematic trajectories early to save tokens.
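
To give a feel for the mechanics, here is a textbook e-process sketch. It is not the procedure from the paper. It assumes each step yields an e-value: a non-negative number whose expectation, given the earlier steps, is at most 1 when the trajectory is actually fine. Multiplying e-values and stopping once the product reaches 1/α keeps the false-alarm probability at or below α (Ville's inequality). The score densities in `toy_e_value` are invented for illustration.

```python
from typing import Callable, Iterable, Optional


def toy_e_value(score: float) -> float:
    """Likelihood ratio for a toy model of verifier scores in [0, 1].

    Assumed densities (hypothetical): 2*s for good trajectories and
    2*(1 - s) for bad ones, so the ratio bad/good is (1 - s) / s.
    """
    s = min(max(score, 1e-6), 1.0)
    return (1.0 - s) / s


def monitor(scores: Iterable[float],
            e_value: Callable[[float], float] = toy_e_value,
            alpha: float = 0.05) -> Optional[int]:
    """Return the step index at which to terminate, or None if never flagged."""
    wealth = 1.0
    for step, score in enumerate(scores):
        wealth *= e_value(score)
        if wealth >= 1.0 / alpha:
            return step
    return None
```

The guarantee is only as good as the null model behind the e-values. With densities estimated from calibration data, it holds approximately at best.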

Why we should care:

“Judge reliability” is now a core dependency for agent reliability.
This is one path from heuristics to operational control.

LLMdoctor: test-time steering as a reliability tool

Benchmarks and verifiers tell you “it broke.”

But reliability also requires a way to “fix it now.”

That’s why I like test-time alignment approaches that are modular. LLMdoctor has a clean patient-doctor framing: steer a frozen model with a smaller controller trained on token-level preference signals, via token-level flow-guided preference optimization.

Even if you ignore the specific algorithm, the pattern matters:

You can steer without retraining the foundation.
You can make reliability interventions fast and reversible.
You can version and evaluate the controller like a product.

Why we should care:

Most teams treat reliability fixes as either “change the prompt” or “fine-tune and pray.”
Controller-style steering gives a third option: a scoped, testable intervention layer.

Human-in-the-loop rubrics: reliability is often a “shared standard” problem

The hardest part of agent reliability isn’t always the model.

Sometimes it’s the absence of a shared, auditable definition of “correct.”

A recent paper on patch evaluation proposes a simple but scalable framework:

use an LLM to draft a task-specific rubric,
have a human review/refine it once,
use the rubric-guided LLM judge to evaluate many candidates.

They report improved agreement with human consensus (e.g., Cohen’s kappa 0.75 on the subset with unanimous human agreement), plus high recall/precision in that setting.
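
For readers who have not computed it by hand, Cohen's kappa is observed agreement corrected for the agreement two raters would reach by chance. A small self-contained sketch (the labels in the usage note are made up, not data from the paper):

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement: both raters pick the same label independently.
    p_chance = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)
```

Usage: `cohens_kappa(human_labels, judge_labels)`, where both lists hold one label per patch, such as "ok" or "bad". Raw agreement of 75% can correspond to a much lower kappa when one label dominates.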

Even though the domain is program repair, the reliability lesson generalizes:

When humans disagree, it’s often because the rubric is implicit. Make it explicit once - then scale it.

A narrow reliability loop I’d actually run

If I had to condense the above into a practical loop (without turning it into a platform pitch), it would look like this:

Define correctness in end-states, not text: Use metamorphic relations / end-state equivalence where possible.
Stress-test, don’t just benchmark: Measure a reliability surface across repeated runs (k), perturbations (ε), and tool failures (λ).
Monitor online with calibrated decision rules: Turn verifier scores into stop / continue / escalate decisions you can defend.
Keep humans as reviewers of standards, not full-time graders: Use human time to approve/refine rubrics and resolve disagreements.
Treat steering as a first-class intervention: Controller models (doctor -> patient) are a pragmatic way to improve behavior without turning every fix into a full retrain.

Agent reliability is not a single feature and not a single metric. It’s a contract:

measured under stress
monitored online
improved with small, controlled interventions
audited through shared standards humans can actually read.

The best recent work is finally treating reliability like an object we can engineer - not a hope we can prompt.

Quick credit: Andy Wong consistently finds great new papers early, and we end up debating the implications together before they show up in writing.

Control knobs from recent LLM papers

Ruslan Belkin — Tue, 20 Jan 2026 15:15:42 GMT

The field is moving extremely fast right now. The half-life of a new idea is measured in weeks (sometimes days), and it is getting harder to tell what will actually stick.

Most weeks, the discourse around language models collapses into one of two modes:

“Look at this new leaderboard bump.”
“Agents are coming, everything changes.”

Both can be true - and still miss the point.

The stuff that actually moves production outcomes tends to look like new control knobs: better ways to steer models, keep them stable, make them faster, and make failures more legible.

Credit where it is due: a bunch of these paper finds came from Andy Wong, who consistently surfaces great work. We usually end up debating the implications together before it shows up here.

So here is my January reading stack: papers that feel unusually primitive-shaped. Each one adds a knob I expect we will keep using.

1. Recursive Language Models: “infinite” context via self-calls (no retraining)

Recursive Language Models (RLMs) are a different answer to long context: do not cram the prompt into the transformer - treat it like part of the environment.

One concrete instantiation: the prompt becomes a variable inside a Python REPL, and the model writes code to inspect the prompt, decompose it, and recursively call sub-instances of itself over slices of the prompt.

They report handling inputs two orders of magnitude beyond typical context windows - and even mention strong performance at the 10M+ token scale.
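
A heavily simplified sketch of the recursion idea. In the paper's setup the model writes its own inspection and decomposition code; here the split-in-half strategy is hard-coded, and `llm` is a hypothetical callable taking (question, context) and returning text.

```python
from typing import Callable

LLM = Callable[[str, str], str]  # hypothetical interface: (question, context) -> answer


def recursive_answer(question: str, context: str, llm: LLM,
                     max_chars: int = 1000) -> str:
    """Answer over a context of any length by recursing on slices."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(context) <= max_chars:
        return llm(question, context)  # base case: the slice fits in one call
    mid = len(context) // 2
    left = recursive_answer(question, context[:mid], llm, max_chars)
    right = recursive_answer(question, context[mid:], llm, max_chars)
    # Combine step: one more call over the two partial answers.
    return llm(question, left + "\n" + right)
```

Splitting at the midpoint can cut a relevant passage in two, which is exactly why the quality of the indexing and recursion strategy becomes the bottleneck.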

Why we should care:

This pushes long context from architecture into systems. Not ‘train longer,’ but ‘reason out-of-core.’
It is an agent-shaped pattern: when the model can write the loop it thinks inside, you get a new class of tool-use + decomposition behaviors.
It reframes the bottleneck: the limit becomes less context length and more how good the model is at building the right indexing + recursion strategy.

2. LLMdoctor: alignment at test-time, token by token

Most alignment work still assumes you either fine-tune the whole model (expensive, slow, often brittle), or you do test-time tricks that are coarse (trajectory-level) and compute-hungry.

LLMdoctor proposes a clean separation: keep a big ‘patient’ model frozen, and steer it with a smaller ‘doctor’ model using token-level signals.

Their claim is that many test-time alignment methods rely on distorted trajectory-level rewards or inefficient sampling that caps performance and harms diversity. The patient-doctor setup extracts token-level preference signals from the patient’s behavioral variations, then trains the doctor via token-level flow-guided preference optimization (TFPO) to preserve diversity while aligning outputs.

Why we should care:

Steering becomes modular. Iterate on the doctor without re-baking the patient.
Granularity matters. Token-level intervention is the difference between ‘mostly aligned’ and ‘aligned where it counts.’
Closer to how agents fail. Agents do not fail at the end of the trajectory - they fail mid-trajectory.

3. Entropy-Adaptive Fine-Tuning: a practical take on “don’t forget”

Supervised fine-tuning is still the workhorse for specialization - and “catastrophic” (or I would rather say “annoying”) forgetting is still the bill we pay.

This paper’s framing is crisp: it contrasts SFT with on-policy RL and argues the gap comes from distribution mismatch. In RL, the model’s learning signal is more consistent with its internal beliefs; in SFT, the model is forced to fit external supervision even when that conflicts sharply with what it ‘knows.’

They focus on confident conflicts: cases where the label token is low-probability under the model, while the model’s distribution is low entropy (i.e., it is confidently predicting something else). That is where gradients get destructive.

Their proposal, Entropy-Adaptive Fine-Tuning (EAFT), uses token-level entropy as a gating mechanism: learn aggressively when the model is uncertain; suppress gradients when the model is confident-but-disagreeing.
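
A toy illustration of entropy gating on a single token. The exact gating function in the paper may differ; using normalized entropy as a multiplicative weight on the cross-entropy loss is my assumption here, and in real training the weight would be treated as a constant rather than differentiated through.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy of the predicted distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def gated_token_loss(probs: Sequence[float], label: int) -> float:
    """Cross-entropy on the label token, down-weighted when the model is confident."""
    cross_entropy = -math.log(max(probs[label], 1e-12))
    return normalized_entropy(probs) * cross_entropy
```

For a confident conflict such as `[0.97, 0.01, 0.01, 0.01]` with the label on a 0.01 token, the gate shrinks the loss to a fraction of its ungated value, while a uniform (uncertain) prediction keeps the full loss.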

From my most recent post: I also think EAFT is a genuinely useful alternative to LoRA in the ‘don’t wreck the base model’ sense - rather than constraining where we update (parameter-efficient adapters), EAFT constrains when updates should matter (skip the destructive ones).

Why we should care:

This is the kind of idea that turns continuous updates from scary to feasible.
It maps to a real production vibe: most of the time we want to learn; sometimes we want to refuse the lesson.
It is a different safety knob than LoRA - but it is targeting the same anxiety: regressions.

4. From Entropy to Epiplexity: measuring “useful information” for bounded learners

Data quality is still the hidden kingmaker. The hard part is: we are not data-rich, we are signal-poor.

This paper asks a deceptively simple question: can we quantify learnable content in data without tying it to a downstream task?

They argue classic information measures (Shannon entropy, Kolmogorov complexity) do not capture what matters for computationally bounded learners, and they propose a new measure: epiplexity.

The vibe: epiplexity is meant to capture structural content while excluding ‘time-bounded entropy’ (random/unpredictable content), and the authors claim it helps explain why deterministic transformations and data ordering can still create useful learnable structure in practice.

Why we should care:

If inputs are becoming the product, we eventually want a metric for the informational value of inputs.
Epiplexity feels like a step toward data selection as an engineering discipline, not an art project.

5. LLaDA2.0: diffusion language models to 100B

Autoregressive decoding is powerful, but it is fundamentally serial.

LLaDA2.0 pushes discrete diffusion language models to 100B parameters via a conversion process: take a pretrained AR model and convert it to a dLLM using a 3-phase block-level training scheme (warm-up with increasing block size, stable full-sequence diffusion, decay back to compact block diffusion).

They also discuss post-training alignment with SFT and DPO, framing this as a path to frontier-scale efficiency while preserving parallel decoding advantages.

Why we should care:

Parallel decoding is not just a speed story. It changes how we can spend compute at inference time.
Faster sampling = more room for verification, search, and self-checking within real latency budgets.

6. PoPE: decoupling the “what” and “where” in positional embeddings

I have a soft spot for papers that say: ‘this popular thing is entangled in a way that quietly hurts you,’ and then fix it cleanly.

PoPE (Polar Coordinate Positional Embeddings) argues RoPE entangles content (what) and position (where), which can impair tasks requiring independent matching on the two. They propose PoPE to remove the confound, show better performance on diagnostics and across sequence modeling domains, and highlight strong zero-shot length extrapolation vs RoPE - and even vs YaRN.

Why we should care:

Long context is now table stakes for serious agent workflows.
“Works at 8k” is not the same as “behaves at 80k.”

DeepSeek + stuff that ships

Engram: conditional memory as a second sparsity axis

The Whale does it again. MoE gave us conditional computation. But knowledge lookup is still mostly simulated via dense compute.

Engram proposes conditional memory - a complementary axis of sparsity - implemented via an O(1) lookup module modernizing classic N-gram embeddings.

They describe a ‘Sparsity Allocation’ tradeoff between neural computation (MoE) and static memory (Engram), claim a U-shaped scaling law, and report scaling Engram to 27B parameters with gains not just on knowledge tasks but also reasoning, code/math, and long-context retrieval.

mHC: Manifold-Constrained Hyper-Connections

This zooms in on a real training pathology: expanding residual streams/connectivity can improve performance, but it can also break the identity mapping property residual connections rely on - leading to instability and scalability issues.

mHC proposes projecting the residual connection space onto a manifold to restore identity mapping while keeping things efficient.

DeepSeek-R1: RL-first reasoning + the GRPO refresher

Even if you are excited about the next base model drop, R1’s training recipe is the more durable lesson.

The core claim: reasoning behaviors can emerge via pure RL (with a cold-start SFT phase for readability/stability), and they lean on GRPO - which is worth revisiting if your PPO mental model is rusty.

Quick intuition: GRPO drops the critic and estimates a baseline from grouped samples, which matters a lot for scaling RL in LLM land.
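
The baseline trick fits in a few lines: sample a group of completions for the same prompt, score them, and normalize each reward against the group's mean and standard deviation. The sketch below covers only that advantage computation, not the clipped policy objective or the KL term; the small `eps` guarding against a zero standard deviation is an implementation detail I am assuming.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Advantage of each sample relative to its own group, with no critic."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is one reason prompt difficulty matters for this style of training.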

Solving LLM repetition in production

This one earns points for being unapologetically real: repetition loops that stall batch tasks.

They identify repetition patterns, frame the root cause via Markov analysis + greedy decoding getting stuck in loops, and evaluate mitigations: beam search with early_stopping=True (universal post-hoc), presence_penalty (case-specific), and DPO fine-tuning (model-level universal).

The connective tissue

If I had to summarize the direction across these papers in one line: we are entering the era of systems that add knobs: inference-time recursion for extreme context, token-level steering, entropy-gated learning, explicit memory, better information measures, disentangled position representations, and faster decoding.

The fun part: these knobs compound.

Notes on Product Development Process?

Ruslan Belkin — Fri, 21 Nov 2025 23:15:14 GMT

“When a number of people concentrate on a single thought, they can compel the world to accept it.” — Paramahansa Yogananda

“What is your Product Development Process?” - I’ve been asked this question several times lately—including during the ELC podcast.

I’ve always been an engineering person. Systems, infrastructure, APIs - that’s the way my brain was wired. My interest in product and users started early at LinkedIn. Perhaps I was influenced by my product partner there, Allen Blue, one of the co-founders and an amazingly forward-looking product visionary. Somewhere in those early cycles it clicked: we do what we do in service of our users, whether those users are developers, end users, or enterprise customers. That realization pulled me from “just engineering” into caring about product outcomes. And yet, at the start, I was a complete novice at product (and truthfully probably still am).

I read the usual suspects (The Lean Startup, Competing Against Luck, The Mom Test, and a mountain of blogs) about experiments, rapid iteration, speed, listening to customers, etc. Yet the truth is that a lot of product and company building is about luck—and if you play the lottery long enough, sometimes you win and the technique was perhaps less important than professed. At some point I came to a realization: there is no such thing as “finding” product‑market fit beyond random luck. There is only compelling the world to accept your vision of the future.

So, here’s how I think about the product development process today:

1. Imagine the future

Think of yourself as a sci‑fi writer. Imagine the future you want to see. In detail and with maximum nuance. Feel it, smell it, live it, write about it (for yourself mostly). This is not entirely a logical process. Works of art don’t come from the mind alone; they come from the soul. The more exact, detailed, and precise the imagination—the better. The approach works not only for big ideas, but also for small problems, design decisions and similar challenges. Try it. Obviously, you need a background in the field so your imagination has grounding.

Where this can go wrong:

The idea isn’t yours. Works of art are not a collective exercise. We often adopt other people’s ideas unconsciously—we are imitators by design (see René Girard’s mimetic theory). Your founding team can absolutely enhance and improve the idea—just don’t confuse where it came from. Maybe the product idea isn’t yours, but you came up with a technological solution that makes it work—that’s OK, join the team. The worst failure mode is convincing yourself you like someone else’s idea on logical merits when you’re not passionate about it. If you’re going to be a mercenary, be honest about it—and love the craft of being a mercenary.
The idea isn’t well envisioned. It’s a fleeting, fuzzy thought. The world can’t materialize a fuzzy image. This failure mode often happens to non-technical people, or when we simply don’t yet have the background needed to imagine the end product in precise enough detail.
It’s purely cerebral. A product born only from logic usually isn’t art. You need conviction that lives below the neck.

2. Test it

Pitch it to your smartest friends, as well as to friends not in the field. Stress‑test it from different angles. Run experiments to prove or disprove your thinking. Is the idea actually new in the first place, or did you just not bother to do basic research? Think like an investor; evaluate it as you would an investment pitch. This is the logical part. Is the technology feasible? Is the timeframe right? You have to allow some uncertainty, perhaps a lot of uncertainty. You will hear many “no”s and there will never be enough data. But there should be a threshold of probability you can settle on - and then decide.

Where this can go wrong:

No serious scrutiny. Less of an issue with startups, but a very common failure mode in large companies. Apply VC‑type scrutiny: explicit assumptions (why, why now, why this team, how is it different, what category are we creating or capturing, how do we solve distribution). Do this even for small projects; it can be done very quickly with modern tools.
Never‑ending search for 100% conviction. It doesn’t exist. Set the risk level you can carry and move.

3. Focus on it

This is about compelling the world to accept your vision. It requires assembling as many smart, values‑aligned people as you can and maintaining persistent focus on execution. This is also the place where much traditional advice can work—to an extent.

Where this can go wrong:

Insufficient critical mass of smart people, or settling for the wrong, mediocre team (or a team of mercenaries) that isn’t excited or hasn’t adopted the idea almost as a new religion. I have yet to hear anyone say they have a mediocre team (the “we have a very high bar” cliché). Be honest, is it really true? Most teams in fact are mediocre or worse.
Insufficient funding. It is perhaps a sad truth, but funding is a competitive differentiator for startups. Inadequate capital can impair both ambition and execution when speed matters most.
Loss of focus during execution. This can happen for a variety of reasons: loss of conviction on the part of the team, poor growth execution or team dilution (hiring too many or the wrong people too fast), and sometimes too much funding.
Bailing too early—not giving the idea enough time to work. This one is hard as you never know for sure.

I’m writing this for myself—as a reminder—as much as for you. It’s incredibly easy to fall into these traps.

Imagine clearly. Test honestly. Focus completely.
Compel the world.

I’d love to hear your thoughts.

Agents That Feel the Room—and Fix Themselves

Ruslan Belkin — Sat, 25 Oct 2025 14:15:25 GMT

If you watch today’s agents for more than a few minutes, you see the mood swings. In one chat they’re warm and crisp; in the next they over‑explain, miss an obvious cue, or bungle a tool call with the wrong parameter. And even when there’s no “human feel” at all—just a heads down task—they drift off spec: skip required steps, optimize the wrong objective, or simply neglect or misinterpret human inputs.

I don’t think this is a “just scale it” problem. It’s a feedback problem. We’re still treating behavior—tone, choices, task adherence, even API hygiene—like vibes in a prompt instead of something we can observe, grade, and steadily improve. That’s why I keep coming back to intelligent data flywheels: small, concrete loops that turn context ↔ behavior trace into a living artifact for training, evaluation, and live steering.

What I mean by “intelligent data flywheels”

The picture in my head is simple. Keep a map of the situations your agent encounters (support vs. sales, calm vs. upset, low‑stakes vs. high‑stakes) and the behaviors you want in each (clarity, empathy, brevity, caution, brand voice and values). Use that map to (a) generate realistic multi‑turn data, by having a “Human LLM” act out the human side of conversations, and (b) judge the agent against the behavior playbook with a grader that you also train. Then run the same primitives at three speeds: observability (find where it breaks or excels), inference‑time nudging (steer it right now), and training (make the fix stick). When human raters and the judge disagree beyond a reasonable margin, you update the judge; when they align, you let the judge carry more load and feed the next enhancement round. That’s the loop.

I still care a lot about emotional intelligence—human-facing agents that communicate across different media and modalities—but the same flywheel helps with boring, high‑impact stuff that isn’t “EQ” at all: tool use, retrieval grounding, latency/cost tradeoffs, and safety drift.

Why this suddenly feels practical

A few research threads clicked into place:

Principle‑following reward models showed you can align behavior to a collection of human-written rubrics instead of massive preference sets.
Inference‑time scaling for judges matured (e.g. DeepSeek’s GRM). Spend more test‑time compute (parallel samples plus a meta‑judge) and you get more reliable reward signals for both training and live guardrails.
Judges that actually reason generate a case-specific rubric before scoring, which humans can verify, align, and generalize.
Open, strong preference data (e.g. HelpSteer3) and closed, experiential data from your deployed agent provide a solid base you can specialize with your own behavior map.

Caveat: LLM‑as‑a‑judge isn’t magic. Benchmarks like JudgeBench show judges can be brittle or biased if you don’t treat them like first‑class products: versioned, monitored, retrained, and given context. That’s another reason to put them inside the flywheel.

The unglamorous stuff agents fail at

Function calling & schema correctness. Even top models still fumble basic format rules (quote this string; ISO date there) and multi‑step tool chains. Recent work—BFCL, JSONSchemaBench, IFEval‑FC—quantifies how often calls are syntactically valid yet semantically wrong. In my head, the “judge” can be a schema/trace checker with scenario‑aware penalties, and the generator can synthesize tricky, long‑horizon tool graphs to close the gap.
Grounding & hallucinations in RAG. Datasets like RAGTruth and newer lenses like HalluLens keep reminding us that extrinsic hallucinations haven’t vanished; high‑certainty hallucinations are especially sneaky. A flywheel can grade answers on entailment against retrieved context and choose the next hard cases to label or synthesize.
Open‑world task reliability. Real agent work looks like OS‑level workflows and the messy web. OSWorld, WebArena, and AgentBench have moved the bar here and highlight recurring failure modes—state tracking, planning depth, visual grounding. Using their task taxonomies as “contexts” and step‑level success as “behaviors” gives you a clean contract for the flywheel to optimize.
Safety and the “persona dial.” OpenAI’s emergent‑misalignment results point to interpretable persona features—directions in activation space that modulate toxic or deceptive modes. That turns safety from a black box into a dial you can monitor and counter‑steer inside the flywheel.
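
For the function-calling item above, the simplest form of a "schema/trace checker" judge might look like the sketch below. The tool, its `SCHEMA`, and the date-ordering rule are all hypothetical; the point is the split between format checks and a semantic check that only runs once the format is valid.

```python
from datetime import date
from typing import Any, Dict, List

# Hypothetical schema for a hotel-search tool. "iso_date" means a string
# that date.fromisoformat accepts, such as "2026-03-01".
SCHEMA = {"city": str, "check_in": "iso_date", "check_out": "iso_date", "guests": int}


def check_call(args: Dict[str, Any]) -> List[str]:
    """Return a list of problems with a tool call; an empty list means it passes."""
    errors = []
    for name, kind in SCHEMA.items():
        if name not in args:
            errors.append(f"missing field: {name}")
        elif kind == "iso_date":
            try:
                date.fromisoformat(args[name])
            except (TypeError, ValueError):
                errors.append(f"{name} is not an ISO date")
        elif type(args[name]) is not kind:
            errors.append(f"{name} should be {kind.__name__}")
    # Semantic check: syntactically valid calls can still be wrong.
    if not errors:
        if date.fromisoformat(args["check_out"]) <= date.fromisoformat(args["check_in"]):
            errors.append("check_out must be after check_in")
    return errors
```

Scenario-aware penalties would sit on top of this: the same error list could be weighted differently for a low-stakes lookup and for a call that moves money.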

And yes, EQ still matters because humans are in the loop. Benchmarks like EmoBench, EmotionQueen, and multimodal EmoBench‑M show a persistent gap to humans on “understand + respond appropriately.” That’s the sweet spot for a behavior‑by‑context map coupled to a judge that also reasons about emotion.

How this looks in practice—my mental movie

I picture an analytics view that doesn’t just say “CSAT dropped,” but where and why: “In escalations from upset users, brevity overrode clarity; schema errors spiked in step‑3 tool calls; the judge drifted on empathy.” From there, the loop suggests more of what it needs: a batch of synthetic escalations with complicated tool chains; a judge tune on Emotion‑Application items; a tweak to the behavior weights for this scenario. We close the loop in three places: surface the issue (observability), compensate now (inference), and make it permanent (training). Publish the change, re‑benchmark, repeat.

Over time you get a system that not only thinks better, but behaves predictably under stress—because behavior stopped being vibes and started being data.

And one more thing: over time, those results compound into a differentiated, market-adapting playbook — a living operational memory co-built by your team and the system, shaped by every success, failure, and fix.

Is code still the source of truth?

Ruslan Belkin — Wed, 15 Oct 2025 14:15:48 GMT

I argued at ELC Annual that the center of gravity is drifting—from code to inputs about the code. Prompts, datasets, models, and evals are becoming the primary artifacts. That’s not a prediction; it’s merely connecting the dots backward and noticing what’s already changed.

The constraints we actually fight

Most of engineering is just pressure-testing three constraints: human cognition (complexity, coordination, bugs), provisioning at scale (deploy, redundancy, cost), and change risk (feedback loops). The tools that moved the needle before LLMs—managed runtimes, DVCS+CI/CD, containerization/cloud, observability—were all bets against those constraints.

What shifted with LLMs

Treating the model like a “compiler” works—until it doesn’t. Traditional compilers are deterministic; auto‑regressive models aren’t. Tiny prompt edits (or swaps between model families) yield materially different outcomes. That puts inputs—prompts, retrieval/context rules, tool schemas—and evaluation at the center. The inputs are the product.

Two practical implications:

Inputs as first‑class artifacts.
Prompts (and flows), input datasets (docs, tickets, logs), models/fine‑tunes, and eval suites/user simulators all need real versioning, lineage, and regression checks—because models change under your feet.
Evaluation is your safety rail.
You won’t guarantee determinism; you can bound behavior and catch drift. Invest in evals and forward simulation before you put agents anywhere near money or production.

Legacy, repos—and the new “rewrite”

“Full rewrite” used to be a dirty word. With LLM‑accelerated throughput, wholesale rewrites are increasingly viable when entropy makes understanding costlier than regeneration + hardening with evals.

What engineering is becoming

Zoom out and the job collapses into tooling and data.

Humans become what the models aren’t: (a) evaluators and (b) carriers of undocumented institutional knowledge. That’s where leverage lives.

Tech debt is dirty data.
Clean it, or it compounds.

A minimal operating checklist (until better tooling is available):

Check in prompts with code. Include tool schemas, context rules, guardrails, and tests.
Pin and record models. Track families/versions and fine‑tuning metadata like compiler flags. Expect drift.
Build evals before features. Scenarios, simulators, acceptance criteria—gate releases on them.
Prefer rewrite when entropy wins. If understanding cost > regeneration + eval hardening, start over.
Instrument everything. You can’t lead probabilistic systems blind.
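
One lightweight way to act on the first two checklist items, sketched under the assumption that a release is identified by a hash over its inputs. The manifest fields and the `fingerprint` helper are hypothetical, not an existing tool.

```python
import hashlib
import json
from typing import Any, Dict, List


def fingerprint(prompt: str,
                tool_schemas: List[Dict[str, Any]],
                model: Dict[str, str],
                eval_suite_version: str) -> str:
    """Stable SHA-256 over every input that can change agent behavior."""
    manifest = {
        "prompt": prompt,
        "tool_schemas": tool_schemas,
        "model": model,  # e.g. family, version, fine-tune ID
        "eval_suite_version": eval_suite_version,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(manifest, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing this fingerprint next to each eval run makes it obvious when a regression coincides with a prompt edit, a schema change, or a silent model swap.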

None of this requires prophecy (Bohr and Feynman would approve). It only needs to be tempered by the reminder that auto‑regressive generation diverges without control.

The link to the full video

— Ruslan

Updated workout routine - Part 2/2

Ruslan Belkin — Sat, 11 Oct 2025 14:15:37 GMT

“I am good at pullups” —Tony Horton

As mentioned in the previous post, I’ve recently tweaked my workout schedule to harmonize the functional block (focused on full-body power movements) and the weight-lifting block (focused on muscle building) so that each is 6 weeks long. I alternate those blocks.

In my previous post I described the prior routine: two blocks — a 6-week full-body (mostly power-focused) functional block, followed by a 4-week weight-lifting block.

In this post, I’m outlining a modified functional block. This is the hardest block because of the high demand on the neuromuscular system (even though the weights are considerably lower), and the follow-on weight-lifting block always feels like a break.

As before, I work out first thing in the morning (work scheduling), with a 5:30 am wake-up, quick oral/face hygiene, pre-workout supplements, and a 10-15 min meditation.

Because the blocks are now a bit longer, I’m skipping pre-workout foam rolling during the work week. If I had unlimited time, I’d keep it.

I still use guided workouts (Beachbody — historical) mainly for timing and pacing.

The core workout is built around the 6 Weeks of THE WORK program by Amoila Cesar, which consists of 3 blocks of 2 weeks each. The workouts get progressively more complex (although not necessarily harder) in each subsequent block. Since the program has a formal rest day and a really easy 20-25 min stretch routine day, I simply skip the rest day and tack the stretching routine onto the tail end of the last workout of the week. So the program’s 2 weeks end up being 10 days for me. To still fill 6 calendar weeks, I simply repeat week 1 or week 5 at the end (week 1 is actually one of the hardest weeks if you push the weights up). For reference - I am sharing my workout sheets here.

Below are two blocks, each matching alternating weeks in the program.

Weeks 1/3/5:

Weeks 2/4/6:

P90X3 Challenge consists of four blocks. Each block has two sets of pull-ups followed by push-ups (different variations). My current numbers are: 15 pull-ups, 40 push-ups per set. That’s 120 pull-ups and 320 push-ups per workout. There’s also a burnout at the end — 2 pull-ups and 4 push-ups, no rest, for 6 sets — so add 12 pull-ups and 24 push-ups to the total.
10min Ab Hammer is from the Hammer and Chisel program, P90X Ab Ripper is from P90X, and 10min Abs is from Insanity Max 30.

Intervals

Long intervals: 3-min warm-up jog; 3 × 5-min runs with 2-min walks in between; cool-down jog/walk.
Short intervals: 3-min warm-up jog; 9 × 1-min max-effort sprints with 1-min walks; cool-down jog/walk.

When time is short, abs get dropped first. If Zone 2 is 50 min, I cut it to 30; if Zone 2 follows intervals, I drop it.

Form is critical for these workouts, especially once mental fatigue sets in. Seemingly lower weights can be deceptive - do not raise the weights unless you can complete the exercise with nearly perfect form, as the risk of injury with complex power moves goes up considerably.

Note: doing exercises with a perfect form is an easy way to enhance effectiveness with lower weights and much reduced risk of injury.

When traveling

When I am traveling and a hotel gym with weights is available, both circuits can continue, depending on the space and how crowded the place is. When no gym is available, I normally switch to Insanity Max 30 Month Two workouts for the duration. Those are body weight only and, while not sufficient by themselves, they are decent enough for maintenance while traveling.

Again - this is what I do. Consistency and a purposeful plan are key, and there are many other ways to stay in shape.

Updated workout routine - Part 1/2

Ruslan Belkin — Sat, 23 Aug 2025 14:15:23 GMT

“Key to performance: fit the benchmark.” —unspoken rule at major AI labs

DEXA Update (August, 2025):

As you can see, total body fat went up by 3.3 pounds, while total lean mass went up by 4.4 pounds since the last scan. It is actually very difficult to increase lean mass without gaining some fat mass.

I’ve recently tweaked my workout schedule, shifting more toward muscle and VO2max maintenance.

In my previous post I described the prior routine: two blocks — a 6-week full-body (mostly power-focused) functional block, followed by a 4-week weight-lifting block.

In this post, I’m outlining a modified weight-lifting block that’s now 6 weeks to match the functional power block.

As before, I work out first thing in the morning (work scheduling), with a 5:30 am wake-up, quick oral/face hygiene, pre-workout supplements, and a 10-15 min meditation.

Because the blocks are a bit longer, I’m skipping pre-workout foam rolling during the work week. If I had unlimited time, I’d keep it.

I still use guided workouts (Beachbody — historical) mainly for timing and pacing, and my week starts on Sundays.

The general structure: a weight-lifting segment followed by a cardio segment (where I listen to podcasts or other educational audio).

The core workouts are built around the Body Beast program, with additional components added on.
P90X3 Challenge consists of four blocks. Each block has two sets of pull-ups followed by push-ups (different variations). My current numbers are: 15 pull-ups, 40 push-ups per set. That’s 120 pull-ups and 320 push-ups per workout. There’s also a burnout at the end — 2 pull-ups and 4 push-ups, no rest, for 6 sets — so add 12 pull-ups and 24 push-ups to the total.
T-30 Chin-up and Pull-up routines are short (~12m) 2X chin-up/pull-up ladders followed by hanging leg raises, monkey bar simulation and a burn-out. My ladder is usually 2-4-8-16 (30 total in each ladder).

When time is short, abs get dropped first. If Zone 2 is 50 min, I cut it to 30; if Zone 2 follows intervals, I drop it.

Intervals

Long intervals: 3-min warm-up jog; 3 × 5-min runs with 2-min walks between; cool-down jog/walk.
Short intervals: 3-min warm-up jog; 9 × 1-min max-effort sprints with 1-min walks; cool-down jog/walk.

Notes on the weight-lifting pattern

Standard sets follow a light–medium–heavy–drop-set pattern.

Light set (15 reps): warm-up for the muscle group and a chance to fine-tune the form.
Medium set (≈12 reps): calibration (heavy but not maximal). Sometimes I skip medium and go straight to a heavy 8-rep set, then a second heavy set after 45–60 s rest.
Heavy set (6–8 reps): stop 1–2 reps shy of failure, immediately followed by a drop set at medium weight for another 6–8 reps.

There are no true power moves in this block, but I often simulate them by moving fast on the concentric and slowing down on the eccentric (e.g., a rapid pull-up followed by a slow controlled descent).

Note 1: doing exercises with a perfect form is an easy way to enhance effectiveness with lower weights and much reduced risk of injury.

Note 2: the shoulder upright row is biomechanically not a great exercise, as it puts the shoulders into an impingement position (EZ-bar or not), so I replace it with a dumbbell high pull as the closest alternative.

As always—if the last set is too easy, add a few extra reps; next time, increase the weight. The key to performance is sufficient stimulus.

Why Inverting AI Workflows Is Key to Enterprise Accuracy

Ruslan Belkin — Sun, 08 Dec 2024 15:15:47 GMT

In this post, I want to highlight a persistent challenge that often stalls efforts to improve AI accuracy in enterprise environments. While many acknowledge that insufficient data can be a limiting factor—especially when proprietary information is needed to enhance a general-purpose large language model (LLM)—the real issue goes beyond mere scarcity. To achieve top-tier domain performance, enterprises may need to fine-tune LLMs so that they internalize specialized knowledge and context. Unlike Retrieval-Augmented Generation (RAG) or prompt-based methods, which can inject additional information at inference time without altering the model’s parameters, fine-tuning permanently integrates the provided training data into the model’s internal representations.

This direct integration makes data quality crucial. Fine-tuning can deliver significant improvements in accuracy and domain alignment, but only if the training data is both reliable and relevant. Unfortunately, many enterprise datasets—sourced from CRMs, ERPs, and other operational systems—are incomplete, inconsistent, and lack clear indicators of quality. These repositories often contain “dirty” data that, even after cleaning, remains difficult to differentiate in terms of importance or strategic value. Merely connecting an LLM to these data sources, hoping they will serve as a source of refined knowledge, is insufficient. Without a methodology to identify which subsets truly matter, organizations risk overwhelming the model with undifferentiated content, ultimately diluting its domain expertise rather than enhancing it.

Only with robust frameworks for scoring, filtering, and weighting data can organizations ensure that their fine-tuned LLMs prioritize the most accurate, timely, and contextually relevant information. In the absence of these data differentiation strategies, the promise of fine-tuning remains out of reach, and valuable proprietary knowledge fails to translate into tangible performance gains for the AI model.
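To make "scoring, filtering, and weighting" concrete, here is a minimal Python sketch. The `Record` fields, the multiplicative score, the one-year freshness half-life, and the 0.2 cutoff are all assumptions chosen for illustration. A real pipeline would derive reliability and relevance from its own signals.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Record:
    text: str
    last_verified: date
    source_reliability: float  # 0..1, assumed to be assigned upstream
    relevance: float           # 0..1, assumed to come from a reviewer or a classifier


def score(record: Record, today: date, half_life_days: float = 365.0) -> float:
    """Combine reliability, relevance, and freshness into one score in [0, 1]."""
    age_days = max((today - record.last_verified).days, 0)
    freshness = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return record.source_reliability * record.relevance * freshness


def select_training_set(records: List[Record], today: date,
                        min_score: float = 0.2) -> List[Tuple[Record, float]]:
    """Filter out low-scoring records, then normalize scores into sampling weights."""
    scored = [(r, score(r, today)) for r in records]
    kept = [(r, s) for r, s in scored if s >= min_score]
    total = sum(s for _, s in kept)
    return [(r, s / total) for r, s in kept] if total else []
```

The returned weights sum to 1 and could be used as sampling probabilities when assembling fine-tuning batches, so fresher and more reliable records are seen more often.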

An Even Bigger Problem: Human Knowledge Gaps

Even if enterprise data could be perfectly cleaned, quantified and integrated, it still wouldn’t guarantee optimal AI-driven decisions. A significant portion of critical knowledge remains locked in human minds—varied in expertise, judgment, and subjectivity. This disparity is why organizations hold meetings, rely on structured decision-making processes, and depend on human interpretation to bridge data with hopefully sound judgment.

To overcome this challenge, enterprises must rethink their AI workflows. Rather than AI serving as a co-pilot to human decision-makers, it should take the role of the pilot. In this model, humans shift from being the central drivers of decision-making to acting as data providers or experts who contribute insights only when the AI specifically requests them. The ideal scenario minimizes human intervention, enabling the AI to steer autonomously.

Extending the aviation analogy, this transition must be continuous. Instead of a static handover, the AI system should operate within a near real-time decision loop, constantly refining its approach based on new information and feedback. In doing so, the enterprise moves toward a state where human judgment is available on demand, but rarely required at the forefront.

A Real-World Example: Customer Service Chatbots

Let’s consider a simple but real-world scenario: a customer service chatbot. Imagine this bot has access to all relevant information—customer databases, Jira tickets, FAQs, and documented workflows. For simplicity, assume it achieves perfect accuracy in intent detection.

What failure modes remain?

The data is incomplete (e.g., it doesn’t know the answer).
The data is wrong (e.g., outdated FAQs or incorrect workflows).

In these scenarios, the fallback is often to engage a human. This is illustrative because it highlights what the AI system should ideally do: engage humans strategically when gaps arise.

Addressing Failure Scenarios

Two critical questions emerge:

What should happen next after a failure is detected and a human is engaged?
How could the failure have been avoided in the first place?

For the first question, the optimal response is to escalate the issue to a knowledgeable human (or multiple humans for high-stakes decisions), supported by the AI system in data lookup and verification. These escalated humans are likely not the same as the first-line responder—they must have the expertise to guide the system to the best resolution. Think of this as a "pager duty" for data management.

For the second question, prevention lies in robust simulations. For example, problem queries could be simulated using past customer interactions, exploring various branches of inquiry to proactively identify gaps in data or workflows.
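Both answers can be sketched in a few lines of Python. This is a toy model under stated assumptions: intent detection is taken as perfect (as in the scenario above), the knowledge base is a plain dictionary, and `KnowledgeBase`, `answer_or_escalate`, and `simulate` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class KnowledgeBase:
    answers: Dict[str, str]                                # intent -> documented answer
    stale_intents: Set[str] = field(default_factory=set)   # answers known to be outdated


def answer_or_escalate(kb: KnowledgeBase, intent: str) -> Tuple[str, str]:
    """Answer when the data is present and current; otherwise page a human expert."""
    if intent not in kb.answers:
        return ("escalate", "missing")   # failure mode 1: data is incomplete
    if intent in kb.stale_intents:
        return ("escalate", "stale")     # failure mode 2: data is wrong or outdated
    return ("answer", kb.answers[intent])


def simulate(kb: KnowledgeBase, past_intents: List[str]) -> Dict[str, Set[str]]:
    """Replay past interactions offline and collect the gaps before customers hit them."""
    gaps: Dict[str, Set[str]] = {}
    for intent in past_intents:
        action, detail = answer_or_escalate(kb, intent)
        if action == "escalate":
            gaps.setdefault(detail, set()).add(intent)
    return gaps
```

Running `simulate` over historical queries yields a worklist of missing and stale entries that the on-call experts can fix ahead of time.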

This approach scales beyond customer service to more complex systems, such as financial modeling, supply chain management, and strategic decision-making.

Scaling the Solution

Inverting traditional workflows—placing AI in the pilot’s seat and using humans as expert data sources—can fundamentally reshape enterprise operations. When properly implemented, this approach not only taps into the best of human expertise on demand, but also ensures that AI systems maintain efficiency, consistency, and domain relevance over time. Yet, to realize these benefits, organizations must couple this human-in-the-loop paradigm with robust data-centric strategies, particularly when it comes to fine-tuning AI models. A carefully curated and differentiated dataset ensures that the AI’s internal parameters remain aligned with business goals and processes and reflective of the most reliable, high-quality information.

For this approach to succeed, enterprises should:

Automatically Query Human Input:
Treat humans as dynamic, on-call domain experts who provide clarifications, judgments, and domain-specific insights exactly when the situation demands. This continuous, as-needed engagement ensures that fine-tuned models incorporate high-value human knowledge without depending too heavily on subjective inputs.
Simulate Task Inputs:
Integrate continuous simulation runs. By proactively refining data quality and selection, the AI system stays prepared to handle new inputs effectively.
Continuously Update Data and Models:
Implement iterative pipelines that not only refine data selection, weighting, and filtering, but also continuously adjust the model’s fine-tuning parameters as it encounters new scenarios.

Enterprise AI systems can’t succeed if they’re treated as mere co-pilots. The future of AI in the enterprise requires rethinking workflows, prioritizing AI as the primary decision-maker, and leveraging humans as on-demand knowledge sources—brought into the loop only when their judgment or domain expertise is truly needed. This shift in perspective not only demands robust strategies for filtering, scoring, and fine-tuning proprietary data, but also calls for continuous refinement of how organizations manage, interpret, and integrate information into AI models.

By giving the AI system the reins, enterprises can better address the persistent issues of “dirty” data, the hidden complexities of human judgment, and the inherent challenges of operating at scale. The goal is not to diminish human input, but to transform it into a strategic resource that the AI can query as needed. This inversion—where humans assist AI, rather than the other way around—represents the logical next step in unlocking the full potential of AI within the future enterprise landscape.

Eating out hacks

Ruslan Belkin — Sat, 14 Sep 2024 14:15:58 GMT

Disclaimer: The following is based on my personal experience and is not intended as medical advice. Everyone's body and health situation are different, and what works for one person may not work for another. Always consult with a healthcare professional or physician before starting or changing any medication, supplement, or diet plan.

Let’s face it—we can’t always control what we eat, especially if we want to participate in business and social life.

You eat at work, attend business or family functions, year-end parties, and let’s not even mention holidays and vacations. The energy balance equation can quickly get out of whack.

Let’s first address the topic everyone is talking about: GLP-1 receptor agonists. While I don’t have personal experience with them (don’t need it), I can see how they could be beneficial for many people. From what I understand, I don’t share many of the concerns about them. In biology, there’s no such thing as a "free lunch"—there’s always a risk/reward balance, and in the general case, being overweight and metabolically unhealthy presents a much higher risk.

GLP-1 agonists, such as semaglutide (sold under brand names like Ozempic and Wegovy), liraglutide (Saxenda), and tirzepatide (Mounjaro), have been used for years in the treatment of diabetes and off label in the body building community. These medications work by enhancing insulin secretion, slowing gastric emptying, and increasing satiety, leading to reduced food intake. The mechanism of action is well understood. However, the downsides are also known. One of the primary concerns is the loss of muscle mass along with fat, which can be a serious issue for sedentary individuals. So yes—if you’re considering using these medications, lift weights and eat plenty of protein.

There are some lesser-known issues, such as reduced heart rate variability (HRV), so I wouldn’t suggest casually using these drugs just to lose 10 pounds. But honestly, I see a lot of people who could have benefited from starting on them yesterday.

As for nutraceuticals, there are plenty of marketing claims about manipulating GLP-1 levels with probiotics, herbs, and other supplements. However, these claims are not supported by solid science. Mechanistically, to achieve the same effect as pharmaceutical GLP-1 agonists, you would need a 1,000-fold increase in the efficacy of these substances. So don’t bother.

The second approach is purely behavioral and has worked for me and many people I know—aligning your meal schedule with your circadian rhythm. This means eating three meals (breakfast, lunch, and dinner) at as consistent a time as possible each day, while absolutely avoiding any snacking between meals. Doing this helps prevent the constant need to eat by keeping your body in a more natural rhythm.

Regarding hunger between meals, there’s a remedy that can help (though, if you avoid snacking long enough, you will not need it). This involves using bitters, an ancient remedy believed to stimulate digestion and reduce appetite. Bitters work by activating bitter taste receptors in the mouth and gastrointestinal tract, which in turn stimulates the release of digestive enzymes and bile, potentially promoting a feeling of fullness or reducing cravings.

One specific supplement is Amarasate extract, derived from New Zealand hops. It contains bitter compounds that target receptors in the gut, signaling the brain to trigger satiety, which can help suppress appetite between meals. While these supplements aren’t highly effective for everyone, they can be a useful temporary hack until you get into the no-snacking habit.

Now, assuming you’ve got the basics under control but find yourself at a function with an overwhelming amount of tasty food (think authentic pasta dishes), self-control might not be enough—except for leaving early (which is often a good idea for other reasons).

First, front-loading with protein-rich foods when available (instead of carbs or alcohol) can help. This will increase satiety upfront, making you less likely to overeat.

Secondly, there are two pharmaceutical options you can consider:

Orlistat blocks pancreatic lipase, the enzyme necessary to digest fats. The over-the-counter dose is 60 mg, which is mildly effective, while the prescription dose of 120 mg is more effective, reducing fat calorie absorption by about 30%. However, a warning: overuse will reliably lead to gastrointestinal side effects, especially diarrhea.
Acarbose is a reversible alpha-glucosidase inhibitor that slows the breakdown of complex carbohydrates (such as pasta, rice, and bread) into simple sugars, delaying glucose absorption. Acarbose can lower postprandial blood glucose by about 20-40 mg/dL (closer to 20 mg/dL for me) per 25 mg dose, up to around 75 mg. Most of the calories from these carbs will eventually be absorbed, but some may pass into the large intestine, where they are fermented by gut bacteria. Beyond metabolic benefits, avoiding blood sugar spikes helps control hunger. Fun fact: Acarbose is one of only 3 drugs found to prolong lifespan in ITP (Interventions Testing Program) studies (the other two being rapamycin and SGLT2 inhibitors).

Both of these medications must be taken at the right time—specifically with meals. The official recommendation is to take them with the first bite of food, but I’ve personally found they work a little better if you take them a few bites into the meal.

When it comes to simple sugars, that’s the toughest challenge. Unfortunately, there’s no highly effective remedy for handling them (aside from SGLT2 inhibitors, which wouldn’t qualify as a party hack). One supplement with some evidence, though not strong, is Gymnema Sylvestre extract. This herb contains compounds known as gymnemic acids, which are thought to work by blocking the sugar receptors on your taste buds, reducing the sweetness sensation and potentially curbing sugar cravings. It may also slow the absorption of sugar in the intestines by interacting with glucose transporters. However, while Gymnema Sylvestre might help a little if taken about 30 minutes before a meal, especially with sugary foods, it’s far from a magic bullet for controlling simple sugars.

In conclusion, these strategies can help you shave off some extra calories and reduce hunger, allowing you to get through meals with minimal damage.

Hope this helps and love to hear your feedback,

Ruslan

Note: when I say something does or does not work for me, I rely on objective measurements, such as CGM in real-time, as well as calorie tracking (using Cronometer app), calorie expenditure (using Apple Watch), body composition (a scale as well as periodic DEXA scans).

Differentiation in SaaS AI

Ruslan Belkin — Thu, 12 Sep 2024 17:30:04 GMT

A lot has happened in the last few months as the field has advanced (new models, new choices, lower prices), several companies have "inflected," and some things (CUDA dominance) have stayed the same.

What has remained elusive, however, is where differentiation will ultimately lie, specifically among startups.

In the interest of minimizing noise, I will omit opining on the most common themes that other people have been discussing at length (inference vs. training hardware, open-source vs. closed-source models, large vs. small models, various picks and shovels, and fine-tuning—though the latter remains an unsolved problem for application developers). All these areas have one thing in common: they require a lot of money and are thus mostly suitable for larger players.

The question is—where is, or is there, a differentiated opportunity? (And by differentiated, I mean outside of dumping a ton of money into a seed-stage company against a deck and a team—which can be differentiating on its own—or likely not.)

Let’s look at the basic structure of an LLM-first SaaS app. What does it need?

Access to a set of competent LLMs of different sizes at a reasonable cost and speed (many choices, nothing super differentiated).
A RAG system (e.g., pgvector): not differentiated, especially with larger context windows, where the emphasis on retrieval accuracy is somewhat diminished.
An orchestration infrastructure: there’s basically a classic flow, such as intent detection, query rewriting, summarization, output checking, and correction. These frameworks will evolve, and there will be many options. In fact, with more and more LLM usage to write code itself, one could argue that frameworks like LangChain add more layers of complexity and problems than they solve—though I could be proven wrong. There could be new ways to think about frameworks in a world where most of the code would be generated.
A testing and observability infrastructure: definitely a major pain, but the problem area is obvious and many people are working on it. The question is whether it will spur a slew of new SaaS companies or, considering the immaturity of frameworks, be more tightly coupled with service providers offering models and data services.
Coding tools (Copilot, Cursor, etc.)—again, quite obvious, and various tools will exist wrapping around ever-better models. The IDE becomes more of a very complex prompting interface and a feedback collection tool. There are several possibilities here:
- A revolutionary new interface that we have not yet seen or thought of
- A plethora of Cursor-like forks and VSCode plug-ins, though it’s difficult to see how they win, unless Microsoft simply yields. Cursor being superior to GitHub Copilot is mostly a function of Cursor RAG-ing your codebase, and GitHub being presumably more “privacy” conscious. A simple business decision by Kevin Scott can alter that equation overnight.
Differential fine-tuning and associated dataset management—no great solutions, but probably best addressed by model providers and companies like Scale.AI.
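For the orchestration item in the list above, the classic flow can be written without any framework. The Python sketch below is a simplification under stated assumptions: `llm` is a hypothetical callable (prompt in, completion out), the prompt strings are placeholders, and retrieval and summarization are omitted for brevity.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def run_pipeline(query: str, llm: LLM, is_valid: Callable[[str], bool],
                 max_corrections: int = 1) -> str:
    """Intent detection -> query rewriting -> answer -> output check -> correction."""
    intent = llm(f"Classify the intent of: {query}")
    rewritten = llm(f"Rewrite for retrieval (intent={intent}): {query}")
    draft = llm(f"Answer concisely: {rewritten}")
    for _ in range(max_corrections):
        if is_valid(draft):
            break
        # Ask the model to repair its own output; the result of the final
        # correction is returned without being re-checked.
        draft = llm(f"Correct this answer: {draft}")
    return draft
```

Each stage is a plain function call, so swapping a model or adding a step is a local change rather than a framework migration.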

Alright. Some winners could emerge (especially around testing, and observability). It will likely converge into a few larger winners, most closely tied to model and data providers.

What are we overlooking? Can building an LLM wrapper be intrinsically differentiated? By intrinsically, I mean: is there an accumulation of leverage that creates a competitive moat over time? There’s always an opportunity to build a better product—better UX, clever influencer marketing, superior GTM—that leads to the acquisition of market share, and that can be differentiated, albeit hardly a sure thing.

There is likely one area that can be a source of leverage. We know proprietary data is that source in AI. But that’s too unstructured; to be differentiated, it needs to lead to a vastly superior outcome that is not obtainable by just using commonly available models.

So, what form is that, and how might it work?

When we look at a classic SaaS workflow (for example, responding to a customer service query), what happens is quite specific to each customer in several ways:

Prompts are likely different for each domain and further different for each customer.
Prompts involving function calling (e.g., opening a Jira ticket) are definitely different, especially if a custom integration is involved.
Workflow routing logic may be different (though this is technically outside LLM data).
Error checking and validation at the end of the action can likewise be very specific to a customer domain and environment.

Moreover, in other cases where user feedback is part of the loop, prompts may be dynamically updated (via LLM).

So, these prompts, in fact, constitute a customer-specific and user-specific dataset that grows over time and requires deep knowledge of customer workflows (and thus is not easily replicable).
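One way to picture that dataset is as a versioned prompt store keyed by customer and task. The Python sketch below is purely illustrative: `PromptStore`, its methods, and the example customer and task names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PromptVersion:
    text: str
    reason: str  # why the prompt changed, e.g. "user feedback"


@dataclass
class PromptStore:
    # (customer_id, task) -> history of prompt versions, oldest first
    history: Dict[Tuple[str, str], List[PromptVersion]] = field(default_factory=dict)

    def update(self, customer_id: str, task: str, text: str, reason: str) -> None:
        """Append a new version instead of overwriting, so the history accumulates."""
        self.history.setdefault((customer_id, task), []).append(PromptVersion(text, reason))

    def current(self, customer_id: str, task: str) -> str:
        """Return the latest prompt text for this customer and task."""
        return self.history[(customer_id, task)][-1].text

    def size(self) -> int:
        """Total number of prompt versions stored across all customers and tasks."""
        return sum(len(versions) for versions in self.history.values())


store = PromptStore()
store.update("acme", "open_ticket", "Open a Jira ticket in project SUP.", "initial")
store.update("acme", "open_ticket",
             "Open a Jira ticket in project SUP and set a priority.", "user feedback")
```

The accumulated history, with the reasons for each change, is the part a competitor cannot copy by switching to the same base model.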

Is this enough differentiation? Unclear—I think it depends on whether business workflows can converge and be serviced by a more intelligent model. However, to do so will require understanding these workflows first. While I think ultimately it will converge, there’s probably room on the way there.

Love to hear your thoughts,

Ruslan

The supplements

Ruslan Belkin — Sun, 26 May 2024 14:15:57 GMT

Update (Mar, 2026)

Added:

TUDCA: 1 serving, 250mg AM/PM
- TUDCA is here primarily as a liver and bile-flow support compound. Unlike many “liver support” botanicals, it is a bile acid derivative, so the rationale is more targeted: it helps make the bile acid pool more hydrophilic, may reduce stress on hepatocytes, and has a cleaner mechanistic case in the setting of a crowded stack plus mild liver-enzyme drift. I’m treating it as pragmatic insurance rather than a performance supplement — the goal is preserving liver resilience and bile handling, not expecting any noticeable day-to-day effect.

Removed:

Acetyl L-Carnitine HCl: 1 serving, 1g
- It does help transport fatty acids into mitochondria and acts as an acetyl group donor, with brain health benefits and good nootropic effects, and I did feel moderate benefits for the workouts. The downside - it feeds bacteria in the gut that produce TMAO. While the impact of TMAO is not completely settled, I am removing ALCAR on a risk/reward basis, especially considering absorption problems with oral forms in general.
Cordyceps: 1 serving, 2g
- Preliminary research suggested that Cordyceps may have positive effects on telomere length. Subjectively I did not feel any effect and upon further examination telomere length rationale is quite weak. Removing to simplify the stack.

Removed:

EGCG 1/2 serving, 1 cap, 200mg 2x/day
- Originally the thesis here was colorectal risk mitigation (there is mechanistic plausibility + some human data in the adenoma space), but on the latest labs I saw a mild upward drift in liver enzymes. Green tea extracts are one of the more common “usually fine, but occasionally not” botanicals from a liver standpoint — and when you’re monitoring LFTs, anything with even a small hepatotoxicity signal is a poor fit. Net: risk/reward no longer pencils out for me right now, so it’s out until LFTs normalize and a future re-challenge (if any) can be done cleanly.
AC-11 1cap 350mg into the morning shake
- This was a pure experiment around the “DNA repair / resilience” claims. After a reasonable run, the subjective benefit was barely detectable and I couldn’t find anything objective to justify keeping it in. Given how crowded the stack already is, anything that isn’t clearly pulling its weight gets cut.
Gotu Kola extract: 1/2 serving, 1 cap, 120mg
- I like Gotu Kola for microcirculation / skin-collagen support, but similarly to EGCG it’s a botanical with rare (but real) hepatotoxicity case reports. With LFTs already trending the wrong way, this becomes an obvious “first things to pull” candidate. Removed for now.

Added:

Urolithin A, 1/2 serving, 1 cap, 1000mg, PM
- Urolithin A is here for mitochondrial “quality control” via mitophagy — i.e., preferential cleanup of dysfunctional mitochondria rather than just suppressing ROS. Another reason I like it: humans vary a lot in how well they can generate urolithins from food (microbiome-dependent), so supplementing can bypass that variability. The human RCT data isn’t perfect, but it’s stronger than most “mitochondrial” supplements, with safety/tolerability plus measurable signals on muscle endurance / mitochondrial health markers.
PQQ: 1 cap, 20mg (AM)
- PQQ is a redox-active compound with a plausible role in mitochondrial biogenesis / cellular energy metabolism, and there are a few small human studies with signals on inflammation markers and cognitive endpoints — nothing definitive. I previously ran it and couldn’t find a clean personal signal, so conviction remains low. I’m re-running it only because it pairs conceptually with Urolithin A (different mechanism, same “mitochondrial cleanup + support” direction) and it’s easy to discontinue if it stays silent.
Colostrum 1 scoop, 1g, morning shake
- Colostrum is a “gut insurance” lever for me: bovine colostrum concentrates immunoglobulins (IgG), lactoferrin, and other bioactives that support barrier integrity and innate immune function. The best human data I’ve seen is in athletes / high training stress contexts, where colostrum has been associated with improved gut permeability markers and fewer URTI-type issues. I’m using a conservative dose (1g/day) and treating it as supportive — not magic. Avoid if dairy allergy is a concern.

Updated:

Bacopa Monnieri (Synapsa): 1 serving, 320mg
- Formulation changed to Cognance 1 serving, 100mg with rationale being a better overall profile with no potential sedation effects.

Update (Oct, 2025)

Replaced Reishi mushroom powder with Turkey Tail mushroom powder: 1 serving, 1g (2 days on, 1 day off - non overlapping with Lion’s Mane). Reasons:
- Safety/monitoring—Reishi has rare but documented hepatotoxicity case reports and inclusion in LiverTox, which matters if you’re watching LFTs; Turkey Tail has a long clinical track record as an adjuvant (PSK) with generally favorable tolerability.
- Evidence fit—Turkey Tail’s PSK/PSP have been studied extensively as immune modulators (including RCTs), with signals on NK-cell activity and broader immune function, making it a reasonable like-for-like immunologic substitute
- Potential 5-AR inhibition: There’s some weak evidence that Reishi may have an anti-androgenic potential, which we obviously don’t want
Added another Creatine: 1 serving, 5g to the morning shake for 10g total (between pre-workout and the morning shake). New research strongly suggests cognitive benefits (especially when short on sleep) from this dosing, and subjectively I feel a noticeable boost and no afternoon slump at all.

Update (Aug, 2025)

Added:

Tongkat Ali was replaced with Eurycomax, and correspondingly 2.5mg of Pregnenolone was removed. In total, between Cistamax, Eurycomax, and direct supplementation, DHEA and Pregnenolone dosages stayed at 5mg each. So far it appears to be a much smoother stack.
Maca extract powder was added to the morning shake at 200mg. See notes.
Gotu Kola was added into the morning shake along with the evening dose.
EGCG was added into the morning shake along with the evening dose. See notes.
Stinging Nettle Root extract was added into the morning shake along with the evening dose.
NMN/NR stack was restructured and replaced with NR (Tru Niagen) plus Pterostilbene+Quercetin. See detailed discussion.
AC-11 was added experimentally, 1 cap, 350mg, into the morning shake. So far it seems to generate a slight mental kick. Data-wise I am not yet convinced.
Creatine was moved from morning shake to pre-workout. Seemed to make a significant difference, probably due to transient DHT elevation.
Rosuvastatin (Crestor) along with Geranylgeraniol were moved from AM to PM to avoid (theoretical) transporter conflicts with other supplements.
H2 Molecular Hydrogen was added along with the morning supplements. See notes.

Removed:

Gaia Herbs Adrenal Health Daily - I took this for years; it made my original list because it seemed helpful during a heavy-stress period. That said, I’ve always been skeptical of “adrenal support” blends—sensible ingredients, but usually well below clinically studied doses. After multiple on/off cycles, I’m convinced it isn’t doing anything meaningful (and at worst may slightly lower morning cortisol from the small amount of ashwagandha).

Update (Sep, 2024):

Added:

Spermidine (see AM morning shake section)
Magnesium Glycinate, 200mg in the evening replacing Magnesium complex

Removed:

C60 (in olive oil): 1 tsp
The premise: an interesting compound due to its direct and unique antioxidant action that doesn’t blunt the effects of exercise.
It was an experiment and thus far I have found neither subjective nor objective evidence of it doing anything (perhaps it was a problem with a specific brand).
PQQ: 1 cap, 20mg
The premise: research around mitochondrial support effects is quite solid. Quantitatively or subjectively I couldn’t see or feel anything.
Probiotics
After experimenting with several over the years, I found them to be useful for a specific purpose, but not as part of a maintenance routine.

Everyone is asking what supplements I take. I am going to share the full list upfront and then we are going to have a discussion.

Several disclaimers:

This list is highly personalized and is periodically revised based on quantitative testing, research updates, and subjective evaluation. What you see here is a snapshot as of today.
None of this is medical advice. Please do not apply this list to yourself without proper medical guidance.
It's crucial to remember that, from a scientific perspective, supplements (and even diet) are not as significant as one might think compared to exercise. Nutrition and supplement studies are typically of low quality and should be viewed with skepticism. They are most often based on epidemiology, which is unreliable due to confounding factors, biases, and difficulty in establishing causality. Although Mendelian randomization trials are more robust in inferring causality, they are limited in number due to costs and incentives. When they are done, these trials are often underpowered, too short, and tend to focus on populations that may not be relevant to our purposes.
Therefore, the totality of data, including observational data from various traditions, along with personal biomarkers, functional fitness metrics, and subjective evaluations, must all be considered.
Given the difficulty of evaluating efficacy, why do I pursue this? Partially, it's for the fun of biohacking.
The main goals for me are health span and performance (not longevity by itself - who wants to live longer without being able to do things anyway).

Here’s how to read the list:

Supplement: (including the specific brand I use and dosages; note that I have no affiliation with any of the companies mentioned).
Commentary: I’ll provide brief commentary without references (for the sake of my time). Accuracy is not guaranteed. Feel free to discuss it further with ChatGPT.
Risk: low/medium/high (I don’t use anything I consider high risk). No supplement is zero risk as unknown effects may exist, and there could be discrepancies versus the label due to manufacturing or quality control issues.
Personal outcome: none/minor/notable/significant - a subjective evaluation of the benefit based on an aggregate of biomarker, functional testing, and subjective feelings of well-being.
Conviction: low/medium/high - my current level of conviction in the supplement's efficacy based on the aggregate of data, research and personal outcome.

Let’s get right to it.

AM - Pre-workout

As you know from my earlier post, I work out first thing in the morning and that drives some of the logic here (if I were to work out in the afternoon or the evening, the supplementation would be different).

The following supplements are mixed in as powders into a pre-workout drink:

Essential Amino Acids (EAAs): 1 serving, 11g

As I work out in the morning and we know that in a hypo-caloric state the body can break down muscle for energy, having circulating amino acids (in addition to the stimulus from exercise) helps preserve muscle mass. Essential Amino Acids (EAAs) are beneficial because they are already in their simplest form and are readily available for muscle protein synthesis. This makes them easier for the body to use immediately compared to whole proteins, which need to be broken down first. Additionally, EAAs are easier on the digestive system.

Risk: low

Personal outcome: significant

Conviction: high

Legion Pulse Pre-Workout drink: 1/2 serving, 11.75g

This product serves as both a nootropic supplement and a pre-workout aid. It’s a well-designed product with no filler or questionable ingredients. The main components are: a moderate dose of caffeine for increased energy and focus, L-Citrulline for nitric oxide enhancement and improved blood flow, Betaine/TMG which helps control homocysteine levels (this ingredient combines nicely with other supplements, such as NAD precursors) and has several performance benefits, L-Theanine for its calming and focus-enhancing effects, Alpha-GPC as an effective acetylcholine precursor for cognitive enhancement, and electrolytes (sodium and potassium) to support hydration and muscle function.

Risk: low

Personal outcome: significant

Conviction: high

Taurine: 1 serving, 2g

The research on taurine is impressive on many levels, demonstrating benefits for cardiovascular health and muscle function. However, I am not sure it’s doing anything noticeable for me personally. Taurine supplementation might be more important for people on a vegetarian or vegan diet, as they may have lower taurine levels due to the absence of taurine-rich animal products in their diet.

Risk: low

Personal outcome: none

Conviction: low

Collagen: 1 serving, 10.29g

Collagen helps complement the amino acid profile provided by EAAs, offering specific amino acids like glycine, proline, and hydroxyproline. Research suggests that taking collagen prior to a workout, assuming adequate exercise stimulus, can support connective tissue and joint health, with some evidence also indicating benefits for bone health. This particular product uses a special formulation of collagen and has additional ingredients such as Hyaluronic Acid, ch-OSA, and Buffered Vitamin C, which are known for their synergistic benefits for joint health.

A side note: excess collagen can be a concern due to conversion of hydroxyproline into oxalates. Between this dose and the morning shake it comes to 20g / day for me, which may be close to an upper safe dose.

Risk: moderate

Personal outcome: notable

Conviction: high

N-Acetyl L-Tyrosine (occasionally): 1 serving, 0.4g

N-Acetyl L-Tyrosine (NALT) is a precursor to neurotransmitters such as dopamine, norepinephrine, and epinephrine, as well as thyroid hormones like thyroxine (T4) and triiodothyronine (T3), which can enhance mental focus and cognitive performance. I take it occasionally for an extra kick during workouts, as it helps improve alertness and mental clarity, contributing to a more energized workout experience.

Risk: low

Personal outcome: notable

Conviction: moderate

The following supplements are taken separately as capsules along with the pre-workout drink at the same time:

CistaMax (5 days on, 2 days off, 2 weeks washout every 10 weeks): 1 serving, 1 cap

This is an amazing combo and you have to admire the thinking that went into designing it. It would take a couple of pages to dissect it - so for the sake of time do your own research here.

Risk: moderate

Personal outcome: significant

Conviction: high

Eurycomax (5 days on, 2 days off, 2 weeks washout every 10 weeks): 1 serving, 2 caps

Eurycomax is a great new combo based around Tongkat Ali with additional ingredients, and it seems to avoid some of the concerns with pure Tongkat Ali by better balancing estrogen suppression. A reminder: anytime you are tinkering with hormones even a little, quantitative testing is a must to ensure safety and efficacy.

Risk: moderate

Personal outcome: significant

Conviction: high

DHEA: 1/2 serving, 1/2 tab, 2.5mg (5 days on, 2 days off)

At very low doses (in my case a total of 5mg of DHEA and 5mg of Pregnenolone between the DHEA tablet, Eurycomax and CistaMax), this helps nudge the process of steroidogenesis in the right direction, especially as we age: towards testosterone synthesis without triggering negative effects and feedback loops.

Risk: moderate

Personal outcome: significant

Conviction: high

Creatine: 1 serving, 5g

The benefits of creatine are well known - both for muscle endurance and brain (and even systemically).

There’s a plausible risk of DHT elevation (and resulting hair loss). The one frequently quoted study was not replicated and there are other issues with it (however, anecdotally there are a lot of bald bodybuilders). Either way, keeping an eye on DHT levels is recommended if hair loss is a concern.

Secondarily, creatine supplementation will elevate blood creatinine levels that will show up on a standard metabolic panel test (such as CMP). This is not a concern - if the cause is Creatine supplementation. However tell your doctor, so a more accurate biomarker can be used to assess kidney health (such as Cystatin C).

Finally - note that creatinine as well as liver enzymes (ALT most notably) can also be elevated due to normal muscle breakdown as a result of a strength training session, so likewise tell your doctor or use a wash out period (72 hours - I am not doing it).

Note: when you are severely underslept for whatever reason, an additional 5g of Creatine later in the day (I’d just add that extra dose into the morning shake) could be quite helpful as a temporary fix.

Risk: low

Personal outcome: significant

Conviction: high

AM - Breakfast (Post workout)

These are taken in conjunction with my morning shake. Whenever possible I use powders (instead of capsules, to avoid filler ingredients and flow agents); when a powder form is not available, capsules are broken up and put into the shake. Using powders is also more cost effective.

Core Morning Shake recipe

Whey Protein: 25g

Collagen Protein: 9g

Fiber Powder (most days, but sometimes I skip adding it when I want to give my gut a break from extra fiber)

Wild blueberry or elderberry powder: 5g (alternating)

Pomegranate powder: 1 tbsp

Tart cherry powder: 4.8g (among many other benefits this helps to keep uric acid at low levels).

Olive oil, 2 tsps (to make sure there’s a little bit of fat to help with absorption of fat soluble vitamins).

Some sort of fruit from whatever is around, which could be fresh or frozen: a banana, a cup of strawberries, raspberries, an apple, 2 peaches, 3 apricots, etc.

Supplements that are mixed into the shake

O.N.E. multivitamin: 1 serving, 1 cap

Think of it as an insurance policy, although given my genetics - methylated B vitamins are great. There’s no perfect multi-vitamin and it’s a compromise for convenience.

Risk: low

Personal outcome: minor

Conviction: high

Magnesium complex: 1/2 serving, 1 cap, 120mg

The benefits are well documented, especially considering that it is nearly impossible to get enough magnesium through the regular modern diet (with all the soil depletion).

Risk: low

Personal outcome: minor

Conviction: high

Glucosamine HCl: 1 serving, 1g

Decent research for joint (formation and repair of cartilage, synovial fluid production) and even brain health.

Risk: low

Personal outcome: minor

Conviction: moderate

Colostrum: 1 scoop, 1g

Colostrum is a “gut insurance” lever for me: bovine colostrum concentrates immunoglobulins (IgG), lactoferrin, and other bioactives that may support barrier integrity and innate immune function. The best human data I’ve seen is in athletes / high training stress contexts, where colostrum has been associated with improved gut permeability markers and fewer URTI-type issues. I’m using a conservative dose (1g/day) and treating it as supportive — not magic.

Risk: low

Personal outcome: minor

Conviction: moderate

Stinging Nettle Root Extract, 1/2 serving, 1 cap, 250mg

Mechanism of action is anti-androgen activity specific to prostate, thus not affecting androgens systemically, which is quite nice. The outcomes are supported by solid data, including human trials.

Risk: low

Personal outcome: significant

Conviction: high

PQQ: 1 cap, 20mg

PQQ is a redox-active compound with a plausible role in mitochondrial biogenesis / cellular energy metabolism, and there are a few small human studies with signals on inflammation markers and cognitive endpoints — nothing definitive. I previously ran it and couldn’t find a clean personal signal, so conviction remains low. I’m re-running it only because it pairs conceptually with Urolithin A (different mechanism, same “mitochondrial cleanup + support” direction) and it’s easy to discontinue if it stays silent.

Risk: low

Personal outcome: none

Conviction: low

Bacopa Monnieri (Cognance): 1 serving, 100mg

Bacopa is a long-running cognitive “insurance” herb with human trial data suggesting benefits for memory/processing after weeks of consistent use (with GI side effects being the most common downside). The trade-off with traditional bacopa extracts is that some people feel slightly sedated or “flat.” Cognance is a newer bacopa extract with a different standardization profile, and the bet here is keeping the brain-support upside while avoiding the sedation tax.

Risk: low

Personal outcome: minor

Conviction: moderate

Maca Extract Powder (5% macamides): 1 serving, 125mg

An interesting compound that inhibits FAAH (the enzyme that breaks down anandamide), which can raise endocannabinoid tone. The result is a noticeable mood and well being elevation. The safety record is excellent and it appears not to affect hormones or major neurotransmitters.

Risk: low

Personal outcome: notable

Conviction: moderate

Lion’s Mane: 1 serving, 1g, 1 day on, 2 days off

For brain benefits, as it is known to increase both BDNF (Brain-Derived Neurotrophic Factor) and especially NGF (Nerve Growth Factor).

Risk: low

Personal outcome: minor

Conviction: moderate

Turkey Tail mushroom powder: 1 serving, 1g (2 days on, 1 day off - non overlapping with Lion’s Mane)

The idea is to boost NK-cell activity as a preventative. Turkey Tail —notably its PSK/PSP fractions and mycelium—has human data showing increased NK-cell tumoricidal activity and higher circulating NK cells vs. control: a phase-I study in breast-cancer survivors reported improved NK function and lymphocyte counts; a randomized double-blind trial linked rises in the NK-activation marker CD69 to functional gains; and oncology RCTs using PSK showed peripheral NK-cell increases compared with control.

Note: if you have any kind of autoimmunity, I would avoid.

Risk: low

Personal outcome: minor

Conviction: moderate
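
The Lion’s Mane and Turkey Tail rotations described above fit together as a repeating three-day cycle. A minimal sketch of one way to lay that out, assuming the cycle begins on a Lion’s Mane day; the start date, the `ROTATION` list and the `mushroom_for` helper are hypothetical names for illustration, not part of any tracker mentioned here:

```python
from datetime import date, timedelta

# Hypothetical 3-day rotation: day 0 is Lion's Mane, days 1-2 are Turkey Tail.
# Only the 1-on/2-off and 2-on/1-off pattern comes from the text;
# which day the cycle starts on is an assumption.
ROTATION = ["Lion's Mane", "Turkey Tail", "Turkey Tail"]


def mushroom_for(day: date, cycle_start: date) -> str:
    """Return which mushroom powder falls on `day` for a cycle begun on `cycle_start`."""
    offset = (day - cycle_start).days % len(ROTATION)
    return ROTATION[offset]


if __name__ == "__main__":
    start = date(2025, 10, 1)  # arbitrary illustrative start date
    for i in range(6):
        d = start + timedelta(days=i)
        print(d.isoformat(), mushroom_for(d, start))
```

Because each day maps to exactly one entry in the rotation, the two powders can never land on the same day, which is the "non overlapping" constraint.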

Inositol: 1 serving, 1g

Supports healthy thyroid function and helps keep TSH levels in the optimal range.

Risk: low

Personal outcome: notable

Conviction: high

Milk Thistle: 1 serving, 100mg

Milk thistle is well-researched for its liver support properties, primarily due to its active compound, silymarin. Silymarin has antioxidant, anti-inflammatory, and antifibrotic effects, which help protect liver cells from damage, support liver regeneration, and improve liver function. Probably a good insurance with all the other supplements I am taking 🙂.

While research supports these benefits, it hasn’t been enough time to ascertain the results for me personally.

Risk: low

Personal outcome: none

Conviction: low

Spermidine: 1 serving, 2 caps, 13mg equivalent

Spermidine induces autophagy by inhibiting the activity of key enzymes involved in acetylation, such as histone acetyltransferases. This inhibition leads to the deacetylation of autophagy-related genes, which promotes the formation of autophagosomes—vesicles that engulf and degrade damaged proteins, organelles, and cellular debris.

Risk: low

Personal outcome: none

Conviction: moderate

TruNiagen NR: 1 serving, 1 cap, 300mg, combined with Quercetin + Pterostilbene (1 serving, 2 caps: 500mg of Quercetin, 50mg of Pterostilbene)

This deserves a bit of discussion. The whole idea of using NAD+ precursors came out of research by David Sinclair. We know NAD+ decreases as we age and that it is involved in thousands of different processes, most interestingly Sirtuin activation (via Resveratrol or similar compounds, such as Pterostilbene).

However, there are many issues with it:

No positive effects have been observed in humans thus far (except for one study on cognitive health with NR).
Both NR and Resveratrol failed ITP studies (which thus far are the gold standard for longevity research).
The original Resveratrol study has been thoroughly discredited due to study design (in fact it turned out to be so bad, people wonder how it and its authors got so much attention in the first place).
There is a rodent study showing cancer growth acceleration effects from NR (for pre-existing cancers - mechanistically this makes sense as NAD+ is quickly depleted by hungry cancer cells, and replenishing NAD+ will only accelerate that process).
There is a concern of up-regulating CD38, although in moderate doses this should not be a concern (there is a yet-to-be-determined U-shaped curve here that may depend on many variables). This could potentially be mitigated by Quercetin.
Bio-availability of Resveratrol is low (I am using Pterostilbene instead for that reason, although it’s not clear that it is doing anything either).
In addition, any of these precursors needs to be supplemented with TMG (in my case there’s plenty in the Legion pre-workout drink) to prevent elevation of homocysteine due to back conversion to nicotinamide.

That said - mechanistic theory around how all this could work for longevity is quite compelling and it is possible that it would take considerable time for it to work.

Subjectively, I do feel more energetic; however, the question is whether supplementing with plain old nicotinamide would accomplish the same outcome (including NAD+ support) for a fraction of the cost.

Finally - there’s still a serious, unresolved conundrum here. NAD+ enhances cellular energy by participating in mitochondrial redox reactions, boosting ATP synthesis (hence feeling energetic). It is also essential for sirtuin activation, which signals autophagy and repair—processes that act as evolutionary sensors for low nutrient levels.

However, consuming a lot of protein, necessary for maintaining muscle mass, stimulates the mTOR pathway, contradicting these aims. The mTOR pathway promotes cell growth and inhibits autophagy, counteracting the benefits. An alternative approach involving cycling and agents like Rapamycin might be better, but the exact protocols and pros and cons are unclear. Most importantly, I would have no idea how to reconcile it with my everyday workout routine.

Risk: moderate

Personal outcome: notable

Conviction: low

Supplements taken along with the shake / breakfast

These are supplements that are not (or cannot be) available in a powder form or breakable capsules.

EPA/DHA: 1/2 serving, 1 cap

Not much to opine on, the benefits of EPA/DHA are well researched. I keep my omega index >8 (8.94% as of last test) on OmegaQuant test (your dose may vary, so test it).

Risk: low

Personal outcome: minor

Conviction: high

Vitamin K1/MK-4/MK-9: 1 serving, 1 cap

A must considering supplementation with Vitamin D (O.N.E. multivitamin has 2000 IU), plus other vascular and bone health benefits.

Risk: low

Personal outcome: none (as far as subjective feeling; calcium score continues to stay at 0 and bone density on DEXA went up, so that’s pretty good)

Conviction: high

COQ10: 1 serving, 1 cap, 100mg

The idea here is to counteract the effects of a statin, which whacks the HMG-CoA reductase enzyme and decreases COQ10 levels as a result. While studies are conflicting, the subjective effect is noticeable for me.

Risk: low

Personal outcome: notable

Conviction: high

TUDCA: 1 serving, 250mg

TUDCA is here primarily as a liver and bile-flow support compound. Unlike many “liver support” botanicals, it is a bile acid derivative, so the rationale is more targeted: it helps make the bile acid pool more hydrophilic, may reduce stress on hepatocytes, and has a cleaner mechanistic case in the setting of a crowded stack plus mild liver-enzyme drift. I’m treating it as pragmatic insurance rather than a performance supplement — the goal is preserving liver resilience and bile handling, not expecting any noticeable day-to-day effect.

Risk: low

Personal outcome: none

Conviction: high

Smart PS™ Phosphatidylserine: 1 serving, 1 cap

This is a blend of phosphatidylserine, phosphatidylcholine, and phosphatidylethanolamine and supports brain health by providing essential phospholipids that contribute to cognitive function, memory, and overall neural health by playing a key role in cell membrane integrity and the formation of synaptic vesicles.

Risk: low

Personal outcome: minor

Conviction: high

Aged Garlic Extract: 1/2 serving, 1 cap, 300mg

This is to mitigate the potential increase in TMAO levels caused by ALCAR and choline precursors, as elevated TMAO has been associated with cardiovascular risk. While the link between TMAO and ASCVD (Atherosclerotic Cardiovascular Disease) is not definitively proven (as otherwise people who eat a lot of fish would all die from heart attacks - which is obviously not the case), there is enough evidence to warrant some precaution. AGE has additional benefits, such as mild cholesterol-lowering effects, but those are not the primary reasons for taking it in this case. Note that garlic is a FODMAP and may cause digestive issues for some people, especially those with SIBO.

Risk: low (see above)

Personal outcome: none

Conviction: low

H2 Molecular Hydrogen: 1 serving, 1 tablet (80mg magnesium, 8ppm), dissolved in water when taking other supplements.

There is a significant body of research suggesting that H2 is a selective antioxidant neutralizing mostly the harsh stuff (•OH, ONOO⁻) while sparing signaling ROS (e.g., H₂O₂) that drive exercise adaptations.

Risk: low

Personal outcome: none

Conviction: low

Prescription medications

Goes without saying that these always must be discussed with and prescribed by your doctor.

Rosuvastatin (Crestor): 5mg (PM)

While there’s a lot of controversy around the use of statins, I am quite convinced based on available research. ASCVD progression is a stochastic process and keeping APO-B concentrations low is, in my opinion, very much worth it.

That said, statins are imperfect drugs with a lot of off-target effects. First, they work by suppressing the HMG-CoA cascade, which causes all kinds of problems. Second, there seems to be causal evidence of an increase in insulin resistance over time. On top of that, there are negative effects on the liver (observed as a modest elevation of liver enzymes; despite unclear clinical significance, this is still not cool).

Even so, for me Rosuvastatin is still the best of the (imperfect) choices. It is the only effective hydrophilic statin that, unlike lipophilic statins such as Atorvastatin (Lipitor), doesn’t massively diffuse into tissues or cross the blood-brain barrier (thus reducing potential side effects, such as muscle soreness).

Risk: moderate

Personal outcome: significant (as measured by calcium score staying at zero)

Conviction: high

Ezetimibe (Zetia): 10mg (PM)

Ezetimibe works through a completely different mechanism than statins. Instead of reducing cholesterol synthesis in the liver, it blocks cholesterol absorption in the small intestine. This makes it a particularly attractive companion to low-dose statin therapy because it can provide meaningful additional LDL-C and APO-B reduction without substantially increasing statin-related side effects.

What convinced me was the growing body of evidence suggesting that lower cumulative lifetime exposure to APO-B-containing particles is strongly associated with lower cardiovascular risk. If lowering APO-B is the objective, adding Ezetimibe often appears to be one of the most efficient and lowest-risk ways to push levels down further before considering more aggressive interventions.

Risk: low

Personal outcome: notable

Conviction: high

Tadalafil (Cialis): 5mg (AM)

While it was developed as an ED drug, I take it for its off-target systemic effects on improving blood flow. As we know, most ills stem from poor blood flow as we age. Improved circulation also enhances workout performance and supports overall cardiovascular health, contributing to smoother functioning of many bodily systems. Tadalafil also slightly lowers blood pressure, helping to maintain it within the optimal range.

Risk: low

Personal outcome: significant

Conviction: high

Minoxidil (Loniten): 2.5 mg (PM)

I am using it as a preventative for hair loss at low dose (it is an old, repurposed blood pressure medication). It doesn’t seem to affect the blood pressure at all for me at this micro dose. Otherwise no noticeable side effects.

Telmisartan (Micardis): 20mg (PM)

Telmisartan is an ARB, so the primary reason I take it is simple: keep blood pressure consistently in the optimal range, with minimal spikes and day-to-day variance. For vascular and brain health, I care not only about average blood pressure, but also about the repeated mechanical stress that pressure excursions place on small arteries over time.

What makes Telmisartan more interesting than a generic “blood pressure drug” is its broader vascular profile. Angiotensin II signaling contributes to vasoconstriction, oxidative stress, inflammation, collagen deposition, and adverse arterial remodeling. By blocking the AT1 receptor, Telmisartan helps reduce some of that remodeling pressure, not just lower the number on the cuff.

It is also one of the longer-acting ARBs, which matters because it gives smoother 24-hour coverage. Telmisartan also has partial PPAR-γ activity, which may explain some of its reported effects on insulin sensitivity, endothelial function, and vascular inflammation.

So the short version is: I take Telmisartan mainly for tight, stable blood pressure control, but I like that it may also improve endothelial function, arterial stiffness, and the biology of vascular remodeling.

Risk: low

Personal outcome: significant

Conviction: high

Coffee

The discussion would not be complete without coffee. Coffee is one of the most widely used drugs in the world. 67% of Americans drink coffee (that’s why I suppose it just got removed from the CPI by the BLS as the price of coffee has been going up).

I take my coffee black (Americano) with no milk or sugar of course. I do add 1 tsp of C8 MCT oil into my morning coffee.

On most days that would be just one mug, but sometimes I also get a second cup around lunchtime. More than that starts adversely affecting my sleep quality, so if I must, I switch to decaf.

PM (with dinner)

EPA/DHA: 1/2 serving, 1 cap

The second dose as in AM.

Magnesium glycinate: 1 serving, 1 cap, 200mg

Seems to be more effective than the standard magnesium complex and has a notable calming effect for sleep.

Risk: low

Personal outcome: significant

Conviction: high

Magnesium L-Threonate: 2/3 serving, 2 caps, 96mg

Brain health benefits and a minor calming effect for sleep.

Risk: low

Personal outcome: minor

Conviction: low

COQ10: 1 serving, 1 cap, 100mg

The second dose as in AM.

TUDCA: 1 serving, 250mg

The second dose as in AM.

BroccoMax: 1 serving, 2 caps

The research is decent, especially around colon cancer prevention effects. However I can’t quantitatively put a handle on it for me personally thus far.

Risk: low

Personal outcome: none

Conviction: low

Urolithin A: 1/2 serving, 1 cap, 1000mg

Urolithin A is here for mitochondrial “quality control” via mitophagy — i.e., preferential cleanup of dysfunctional mitochondria rather than just suppressing ROS. Another reason I like it: humans vary a lot in how well they can generate urolithins from food (microbiome-dependent), so supplementing can bypass that variability. The human RCT data isn’t perfect, but it’s stronger than most “mitochondrial” supplements, with safety/tolerability plus measurable signals on muscle endurance / mitochondrial health markers.

Risk: low

Personal outcome: none

Conviction: moderate

Stinging Nettle Root Extract, 1/2 serving, 1 cap, 250mg

The second dose as in AM.

Risk: low

Personal outcome: significant

Conviction: high

Aged Garlic Extract: 1/2 serving, 1 cap, 300mg

The second dose as in AM.

Risk: low

Personal outcome: none

Conviction: low

Geranylgeraniol: 1 serving, 1 cap

A fairly new compound that also counteracts negative effects of statins, by a different mechanism.

Geranylgeraniol (GGOH) is a critical intermediate in the mevalonate pathway, downstream of HMG-CoA reductase and it is necessary for the prenylation of small GTP-binding proteins. Prenylation is a post-translational modification that allows these proteins to anchor to cell membranes and function properly.

By supplementing with GGOH, the normal function of these prenylated proteins can be restored. Subjectively for me it appears to be even more efficacious than COQ10 (although I would keep both as mechanisms and effects are different).

Risk: low

Personal outcome: notable

Conviction: high

Discussion

So, that's a lot of supplements. The question is: can you achieve the same results with just a good diet? Probably not.

Even if you source high-quality ingredients and are prepared to spend an inordinate amount of time cooking everything perfectly, it's unlikely you'll achieve all the effects and benefits tailored to your genetics, personal health circumstances, and performance goals without some scientific enhancement.

The second question: can you avoid supplements altogether? Of course, you can. The human body is very resilient and can tolerate suboptimal conditions for a while. However, if performance optimization is your goal, then supplementation should be considered.

Another point: ingredients in food are presumably in more natural ratios than in a synthetically constructed supplement regimen. True. However, our body is highly adaptable and can handle a lot of variation. Besides, common processed foods like french fries and soda are far from natural or balanced.

Another concern is supplement quality, fillers, and labeling. This is valid. Fillers are problematic, which is why I opt for powders, choose the cleanest possible options, and stick with reputable brands. Risks, such as heavy metal contamination, do exist, but they are manageable if you monitor important biomarkers regularly. On the flip side - common food products are hardly immune from this problem, from Glyphosate, micro-plastics, heavy metals and other types of contaminants.

Does it take a lot of work? Some, but not a lot, as I've fine-tuned my routine over the years. I also batch my shakes, preparing and mixing them on the weekend for the week, so the actual breakfast prep takes very little time.

Hope this helps - in whatever way it does 🙂

Cold Sales Outreach Emails Inefficacy - a sign of upcoming disruption?

Ruslan Belkin — Sun, 12 May 2024 14:15:58 GMT

Everyone does it, everyone hates it, large companies are built on it (Outreach, Apollo), and it underpins CRM systems (like Salesforce and HubSpot).

What prompted me to opine here was an accidental feature we developed at Jelled.AI that sorts out communications that you don’t need to pay attention to. Looking at its results got me thinking. I will explain.

But first - it is no surprise that the efficacy of cold outreach emails has been declining (despite all the hacks, which largely do not work) and has probably reached the point (with the introduction of AI tools into sales workflows) where the cost to send them is near zero, but the efficacy is also asymptotically approaching zero. In fact, I would argue there’s a high probability of a net negative return, as receiving cold emails could drain whatever goodwill the customer might have had before opening them.

My personal routine handling inbound sales outreach was as follows:

I never respond to cold sales outreach, except if it’s from founders.
I mark repeated sales outreach emails as spam (although that doesn’t seem to help, due to the various email distribution tricks people use) and block the sender.

Nonetheless - that’s still annoying.

So - with the new feature, Jelled.ai automatically detects inbound sales emails (so far with 100% accuracy) and sorts them into a separate folder (the AI engine itself keeps the content if you ever need to do research on it in the future). As a result - my inbox is super clean and I no longer get annoyed by all the inbound.

However - this brings up several interesting observations. While using LLMs to generate emails is practically free (compared to other methods involving humans) - getting rid of these emails is also free. Therefore the usefulness of the cold outbound channel is nil, and the only people making money here are at OpenAI.

The voice channel was already useless (I never ever pick up the phone in the first place if you are not in my address book). Text will largely meet a similar fate (including mentions).

So, if cold outbound is of no value - that means SDRs are just an expense (the introduction research can be done by AI agents, while the introductions will still need to be done through the influencer network).

What does this do? Does it invert the SaaS sales model, as it empowers the buyer to run a competitive bidding process through AI - which is already aware of company plans, budgets, timelines, etc.?

This, being on top of other AI advances, brings into question the need for CRM systems—which really have nothing to do with customer relationship management but are designed to manage the salesforce. Humans are forgetful, unorganized, need complex compensation plans, and are generally flawed. AI-first designed systems will obviously not have any of those problems. Additionally, as we have fewer humans, we will need fewer seats—there goes your pricing model as well.

Are we going to see an upheaval in the world of Salesforce? Does it apply to other professionals (in marketing for example)?

Something possibly to keep an eye on in the near future.

Thoughts welcome!

A quick look at some popular AI predictions

Ruslan Belkin — Mon, 04 Mar 2024 03:26:27 GMT

Recently, there have been a number of predictions voiced in several respected channels to which we all pay attention.

A quick evening note here—without going too deep or referencing sources and benchmarks—on the 3 claims that I am personally skeptical about.

Open source models will catch up with closed source models

The argument goes something like this: "The web is finite, therefore all data is/will ultimately be available for training, and what is not available (proprietary data) will remain a small fraction and thus of little impact. From there, it follows that if all models are trained on the same data, the performance will inevitably converge, and therefore there is not much value add in proprietary models."

The thing is, even if we assume that most relevant data is freely available (various legal issues aside), it takes a lot of resources to train and update the base model and even more resources to instruct it to make it useful (considering all the infrastructure setup and humans involved). While off-the-shelf models like LlamaN… may work in some contexts, they are unlikely to form a foundation for a true competitive advantage.

What can alter this dynamic? Perhaps new architectures and approaches to training and fine-tuning, making the process a lot less resource and cost intensive.

LLM utility is limited to text generation

(or image generation)

From a first principles standpoint, this is true. We see that auto-regressive generation (LLMs) is "an exponentially divergent diffusion process, hence not controllable."

However, a combination of LLM with pure logic software may be able to yield good planning/reasoning outcomes, albeit not yet broadly generalizable.

For example, having an LLM generate code, attempt to run/fix that code, and later evaluate the output demonstrates this is feasible. In addition, this limitation is so well understood and so many people are working on solving it that scalable solutions will inevitably come, or at least it’s not unreasonable to bet on it.
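The generate/run/fix loop mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fake_llm` is a hypothetical stand-in for a real model call, and `generate_run_fix`, `run_python`, and the `accept` callback are names invented for this sketch. The "pure logic software" part is the plain control loop that runs the code and feeds errors back.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple

# Hypothetical generator signature: (task, previous_code, feedback) -> new source code.
Generator = Callable[[str, Optional[str], Optional[str]], str]


def run_python(source: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run source in a separate interpreter; return (succeeded, stdout or stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def generate_run_fix(task: str, generate: Generator,
                     accept: Callable[[str], bool],
                     max_attempts: int = 3) -> Tuple[str, str, int]:
    """Ask for code, run it, and feed failures back until the output is accepted."""
    code: Optional[str] = None
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(task, code, feedback)
        ok, output = run_python(code)
        if ok and accept(output):
            return code, output, attempt
        # Either the run failed or the evaluation step rejected the output.
        feedback = output if not ok else f"unexpected output: {output!r}"
    raise RuntimeError(f"no accepted solution after {max_attempts} attempts")


def fake_llm(task: str, previous_code: Optional[str],
             feedback: Optional[str]) -> str:
    """Stand-in for a model: the first draft has a bug, the retry fixes it."""
    if previous_code is None:
        return "print(sum(range(1, 11)) / 0)"
    return "print(sum(range(1, 11)))"
```

With the stand-in generator, `generate_run_fix("sum 1..10", fake_llm, lambda out: out == "55")` fails once on the division by zero and succeeds on the second attempt. A real setup would also need sandboxing, since this sketch runs generated code directly.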

LLMs are too slow for real-time user workflows

This is true on the face of it, and a massive improvement is unlikely in the medium term.

However,

there are a lot of (if not the majority of) workflows, especially enterprise workflows, that are not real-time and do not need to be real-time (communication, document generation, analytics, and many others). So, not only can we take our sweet time, but we can also execute very complex, multi-step LLM workflows without issues (except for cost). In fact, the cost right now is a much bigger problem.

There are also a ton of performance tricks one can play (from caching, progressive enhancement, different models for different tasks) to make the system look a lot more performant than it really is.
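Two of these tricks, caching and using different models for different tasks, can be sketched in a few lines. Everything here is hypothetical: the model names, the `MODEL_FOR_TASK` table, and `call_model` (a stand-in that only records that it was invoked) are assumptions for illustration, not a real API.

```python
from functools import lru_cache
from typing import List, Tuple

# Hypothetical task-to-model table; names are illustrative only.
MODEL_FOR_TASK = {
    "classify": "small-fast",
    "summarize": "medium",
    "plan": "large-slow",
}

# Records every (model, prompt) pair that actually reached the "model".
CALLS: List[Tuple[str, str]] = []


def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real (slow, costly) inference call."""
    CALLS.append((model, prompt))
    return f"[{model}] {prompt}"


@lru_cache(maxsize=1024)
def _cached_call(model: str, prompt: str) -> str:
    # Identical (model, prompt) pairs are served from memory after the first call.
    return call_model(model, prompt)


def answer(task: str, prompt: str) -> str:
    """Route the task to the cheapest suitable model, then check the cache."""
    model = MODEL_FOR_TASK.get(task, "large-slow")  # unknown tasks go to the big model
    return _cached_call(model, prompt.strip())
```

Calling `answer("classify", "Is this spam?")` twice reaches the stand-in model only once; the second response comes from the cache. An exact-match cache like this only helps with repeated prompts, so it is a starting point rather than a complete answer.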

Happy Sunday, and here's to a productive Monday,

Ruslan

The uncertain AI landscape

Ruslan Belkin — Sun, 31 Dec 2023 15:15:08 GMT

It has been more than a year since the release of ChatGPT and the ensuing repositioning of the industry. I jotted down some quick notes at the time, and they seem not to have aged all that poorly. Yet, it is time to check the pulse as the year ends.

These days, everyone is a Nostradamus when it comes to AI, and I am going to try to avoid direct predictions. The future is always uncertain, yet some clusters of opportunity are still visible.

To refresh the basics, the main breakthroughs so far have come from:

Compute (training/inference) and the ability to store and handle large amounts of data
Development of Word2Vec and subsequent advances in efficiently learning word embeddings
Transformer architecture that dispensed with the sequential processing of RNN-type architectures, enabling the practical training of large models

The main unknown factor was when exactly the compute and the data would lead to breakthroughs. Just like with CNNs for vision, there turned out to be a threshold past which things just started to work.

Again, it's important to remember that all an LLM does is predict the next token. So, how is it exactly useful?

Well - isn't that what people do? That's why hallucination is not a bug, but the key feature. One could liken the LLM temperature setting (a hyperparameter used to tweak the probability distribution for selecting the next word) to the number of drinks at a party.
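To make the temperature remark concrete, here is a minimal sketch of temperature-scaled sampling over a toy vocabulary. The tokens and logits are made up for illustration, and the function names are hypothetical; real models do the same thing over a vocabulary of many thousands of tokens.

```python
import math
import random
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(tokens: Sequence[str], logits: Sequence[float],
                      temperature: float, rng: random.Random) -> str:
    """Pick the next token at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(list(tokens), weights=probs, k=1)[0]


# Toy example: made-up scores for three candidate next words.
TOKENS = ["water", "wine", "tequila"]
LOGITS = [2.0, 1.0, 0.1]
```

At a temperature of 0.2 nearly all of the probability mass sits on the top-scoring token, while at 2.0 the distribution flattens and the less likely tokens get picked far more often.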

It has also been fascinating to observe the evolution of open-source LLMs and various claims of models of GPT 3.x quality (interestingly, no one so far has reached GPT-4 quality even according to benchmarks - and one should be very skeptical of benchmarks at this stage). Why is that?

Well - let's look deeper at what influences LLM quality:

The dataset to train the base model: if your base dataset contains clean, large, high-quality data - such as well-written books, legal libraries, scientific articles, as opposed to data scraped from the web - your next-word predictions will be more aligned with what Seneca would have said versus a modern-day guru on Twitter/X.
The amount and quality of effort put into instructing the model - that is, old-fashioned continuous human-assisted feedback.
The ability to train a large enough model, appropriately matched (in terms of the number of parameters) to the size of the dataset.

The bottleneck is actually not (3), but (1) and (2). Training the model is mostly just money and fixed time (not to minimize the effort going into optimizing and managing training pipelines, but it is still reasonably straightforward), while (1) and (2) require a lot of custom infrastructure (both compute and human), sufficient user feedback (how many users are using the platform), and a lot of (indeterminable) time, and are quite error-prone. The outcome is highly sensitive to the quality of data and feedback. Finally, it inevitably raises a number of legal and data ownership issues. Most importantly, these advantages (or lack thereof) will compound over time, making it largely a winner-take-all game.

So - how probable is it that open-source LLMs will catch up? Not very. That’s not to say that there’s no room for smaller task-oriented models (i.e., classification tasks) - but it is hard to see, short of a seismic shift in how data is shared and infrastructure is made available, how open-source/small players will be practically viable in a general-purpose sense.

I should also mention the dangers of overdoing the fine-tuning step. You can kind of see that with the constant flux of quality with well-known public models as they kept being fine-tuned for safety, etc. Fine-tune it long enough - and you will get mostly canned answers (a phenomenon known as catastrophic forgetting). Fine-tuning in general, without access to the original data and training artifacts, is tricky (for obvious basic reasons) - that’s why releasing “open-source” model weights without releasing the original dataset is hardly “open”.

Now - what about AGI that people are so concerned about? Here, you must agree with Jedi Master Yann LeCun that auto-regressive generation (LLMs) is “an exponentially-divergent diffusion process, hence not controllable”. Hence, ultimately, a new architecture, combined with a practical way to efficiently learn and optimize to a world model, capable of hierarchical planning, is needed. So far, this is nowhere in sight.

In the meantime, we have to do the latter part by hand, via software (even though written with assistance from LLMs), where LLMs become like building blocks in the operating system.

What does this mean then for the ecosystem and for us as humans? Here are some possible outcomes:

Jobs

While everyone is talking about many jobs that will become unnecessary for humans to perform, a more important question is: what kind of jobs will be created instead?

It is already clear that “clean high-quality” data is the new oil. Human expertise that can be put to use to improve these data (in a broad sense, including expert fine-tuning) is necessarily going to be valuable. Hence, it is not a stretch to imagine people will be paid to instruct the models.

For example, a legal brief can be presented to several top lawyers for a review, or a code PR can be presented to top coders for review and be incorporated into the model. Tools and services that enable that will therefore be new machine tools for humans. Of course - AI will determine how much those humans will be paid, possibly in AI usage credits 🙂

Inept regulation could throw a monkey wrench into the ecosystem and its progress - yet, if done reasonably well, regulation could take society to the next level.

Model Wars

It is hard to see open-source or second-tier LLMs becoming practical due to data and effort limitations, and it is not likely that hardware advancements will be overly impactful (considering the constant need to improve quality/performance/costs, and there’s still a long way to go there).

Furthermore - if, shall we say, a major player has a large, good-quality LLM with all the datasets and infrastructure - it is easy to downsize it and produce special-purpose small models almost at will.

The overlooked point here is that when one has the infrastructure to collect and manage data and feedback - that affords an interesting advantage at scale. Companies like Scale.ai could be extremely well-positioned here in the long run.

The Software

Until radical new architectures come into play, a combination of LLM pipelines will need to be used to construct applications. Dynamic performance will be increasingly important as LLMs continue to be exceedingly resource-hungry, and multiple interactions are required to produce a good-quality outcome. What are the components of such frameworks?

- Inference pipeline execution/orchestration - both in streaming and batch fashion, focused on controlling response times, failure management, costs, and routing between different task-specific models.

- The routing logic needs to be a lot more sophisticated, controllable, and self-programmable.

- The usual conveniences for prompt management, context, history, tools, and RAG.

- Multi-modal/image/video side of the world will more commonly need to be incorporated.

- User feedback response and dataset management for fine-tuning.

- System performance evaluation/testing in a continuous fashion, especially when more advanced fine-tuning tricks are used (such as LwF and EWC).
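As a rough illustration of the first item in the list above (orchestration with failure management, cost control, and routing), here is a minimal sketch. The `Route` and `Pipeline` classes, the cost units, and the stand-in model functions are all hypothetical; response-time control, streaming, and batching are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Route:
    name: str
    cost_per_call: float          # hypothetical cost units
    call: Callable[[str], str]    # stand-in for a model invocation


@dataclass
class Pipeline:
    routes: List[Route]           # tried in order: preferred route first, fallbacks after
    budget: float
    spent: float = 0.0
    log: List[str] = field(default_factory=list)

    def run(self, prompt: str) -> str:
        """Try each route in order, skipping ones that would exceed the budget."""
        for route in self.routes:
            if self.spent + route.cost_per_call > self.budget:
                self.log.append(f"skip {route.name}: over budget")
                continue
            self.spent += route.cost_per_call  # failed calls still cost money
            try:
                result = route.call(prompt)
            except Exception as exc:
                self.log.append(f"fail {route.name}: {exc}")
                continue
            self.log.append(f"ok {route.name}")
            return result
        raise RuntimeError("no route succeeded within budget")


def flaky_model(prompt: str) -> str:
    """Stand-in for a model endpoint that is currently failing."""
    raise TimeoutError("timed out")


def steady_model(prompt: str) -> str:
    """Stand-in for a cheaper fallback model."""
    return f"steady: {prompt}"
```

For example, `Pipeline([Route("primary", 1.0, flaky_model), Route("fallback", 0.2, steady_model)], budget=2.0).run("hi")` logs the primary failure, falls back, and returns the fallback's answer. The more sophisticated, self-programmable routing logic described in the list would replace the simple ordered loop.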

The bigger elephant in the room is Python. While super easy to use and convenient, it is arguably a JavaScript/Ruby/Perl-class language in terms of its (non)-safety, extreme error-proneness (that exponentially compounds with larger teams), and (lack of)-performance. So - either a JIT and a syntactic layer (akin to TypeScript) are going to need to be invented, or we are going to need a lot more multi-language frameworks to go mainstream.

One interesting evolution to observe will be how much self-writing code capability will be incorporated natively into the frameworks themselves (and into their evolution) to produce a truly LLM-first system.

Let’s see how this ages.

Happy New Year,

Ruslan

My current workout routine

Ruslan Belkin — Sun, 24 Dec 2023 23:23:34 GMT

The Results from the latest DEXA scan:

“I don’t count my sit-ups; I only start counting when it starts hurting because they’re the only ones that count.” - Mike Tyson

In my earlier post, "My Personal Daily Routine," I described in detail the types of exercise routines I use and the rationale behind those choices. Here, I am going to outline precisely my current exercise regimen - with a disclaimer that I do change things up from time to time.

This regimen is based on a blend of several guided Beach Body workouts. The choice of Beach Body is purely historical as I am a long time user since P90X came out. There are many other excellent choices.

In "My Personal Daily Routine," I also go into detail as to why I prefer guided workouts (and yes, you can't listen to the All-In Pod while doing them, on purpose, unless you are doing Zone 2 training on an elliptical).

The overall length of the circuit is 10 weeks (6 weeks of it is based on 6 Weeks of the Work by Amoila Cesar and 4 weeks is based on Body Beast by Sagi Kalev with several routines from T-30 by Hunter McIntyre and other programs along with running / elliptical workouts).

I exercise every day and don't have a formal rest day. That said, I will throttle the intensity when I feel I am not fully recovered (based on both subjective and objective metrics, such as sleep quality, resting HR, HRV, respiratory rate, etc.).

As you have noticed, the workouts are front-loaded in the morning and are of considerable intensity. This is a compromise within my schedule - as I am able to burn enough calories to not worry about the exercise and energy expenditure for the rest of the day.

The quote from Mike Tyson is important. In order to get an adaptation, we need to push it. So whatever the recommended reps and loads are - those are just for reference. You need to continue to push it up (provided you are fully recovered and the form is not compromised). For example, if you did the last set of bicep curls with 30 pounds, go up to 35 next time (even if reps go down) - and if you don’t have 35s and just did 8 reps with 30s and feel like you can do more with good form - well, do 2-3 more.

Every workout starts with a 10-15mins foam rolling routine and I never skip it. Makes me feel like a teenager all day long.

The week for me starts on Sunday.

The first 6 weeks:

The 6-weeks of the work program consists of 3 blocks of 2 weeks each, each consisting of a mix of 5 different functional routines followed by a 20-25 min recovery routine on the 6th day, with 1 rest day in the middle. As I mentioned earlier - I don’t do rest days, therefore I use a modified schedule.

Sun, Mon, Tue, Wed, Thu is the appropriate sequential workout from 6-weeks of the work with a requisite recovery routine (called Range & Repair) followed right after the last weekly workout on Thu.

Fri - whatever I feel like: a full body workout or an MMA-focused workout (could be as free form as practicing Krav Maga moves), optionally followed by a 30-min Zone 2 on an elliptical (depending on how intense / long the main workout was).

Sat - 50m elliptical Zone 2 workout followed by a T-30 chin-up routine, followed by T-30 Sheriff Abs, and ending with a short stretch.

In addition (after the main workout):

On Sun - a short interval run (or a simulated run using an elliptical) consisting of 3m warm-up jog, 9 intervals of 1 min on / 1 min off followed by a cool down, followed by a 5-min stretch.

On Wed - a long interval run (or a simulated run using an elliptical) consisting of a 3m warm-up jog, 3 intervals of 5 min on / 2 min off followed by a cooldown and a 5 min stretch.

On Mon for weeks 1, 3 and 5 (the 6-weeks of the work legs routine) - a T-30 pull-up routine, followed by a 10min Ab-routine from Insanity Max 30.

On Tue for weeks 2, 4 and 6 (the 6 weeks of the work Cardio & Core routine) - a 30min Zone 2 on an elliptical. Cardio & Core is the easiest workout in the series and the day overall feels like a rest day.

I am sharing the exact weight and rep ranges for me here. On T-30 pull / chin ups I normally go up to 12 reps on the ladder (unassisted). Of course this is with maximally good form and a minimum amount of ego, making these weight / rep ranges quite challenging every time.

The last 4 weeks:

Sun: Beast-Up Chest/Shoulders/Triceps followed by T-30 short intervals.

Mon: Beast up Legs, followed by a T-30 pull-up routine.

Tue: Beast Cardio followed by Beast Abs optionally followed by a 30min Zone 2 on an elliptical.

Wed: Build Back/Biceps optionally followed by a 30min Zone 2 on an elliptical. This is one of the toughest workouts in the series and Zone 2 here is truly optional.

Thu: Bulk Shoulders (Thursday #1), Bulk Chest (Thursday #2), Bulk Arms (Thursday #3), Build Chest / Triceps (Thursday #4) or a similar auxiliary lift followed by T-30 long intervals.

Fri: Beast Total Body or if feeling really good, Week of Hard Labour: Total Body (another program by Sagi Kalev) followed by a recovery routine.

Sat - 50m elliptical Zone 2 followed by a T-30 chin-up routine, followed by a T-30 Sheriff Abs routine, and ending with a short stretch.

Unlike 6-weeks of the Work routines - these are done with heavy weights - and I mean heavy.

While it may look like the last 4 weeks are more demanding - they are not. 6-weeks of the work type workouts focus on compound and power movements, putting a lot of stress on the neuromuscular system, so these 4 weeks feel like a break.

All of this is not set in stone, and I do vary and experiment with things from time to time. When I have to shorten the workout, the ab routines get skipped as they are the least essential.

Enjoy!