The harness isn't dying — it's getting leaner

harness-engineering · field notes

Why half of harness engineering is doomed (and we should celebrate it), while the other half becomes the most important thing you'll do.

There's a fashionable obituary making the rounds in AI engine rooms. It goes like this: as models get smarter, the harness — all that plumbing we build around an LLM to make it capable of doing something useful — becomes redundant. Get out of the way, hand the model the tools, and let it work. The engineer stops drafting syntax and moves to orchestrating. Each new model version lets you delete a little more scaffolding. The trajectory, they say, points to zero.

It's a seductive idea. And, like many seductive ideas, it's half true. The problem is that it crams two very different things into a single word. Let's pull them apart, because they have opposite fates.

The accusation: the harness is dead weight

Let's do the argument justice before we take it apart, because part of it is solid.

Not long ago, getting a model to edit a file, understand the context of a repo, or use a tool without crashing required a ton of code around it. Hand-wired workflows. Orchestration logic that decided which tool to call and in what order. Heuristics for moving things in and out of context. Prompt templates, retries, tool-use choreography. All that apparatus existed for a simple reason: the model, on its own, couldn't be trusted to drive.

That's no longer true to the same degree. Today's models decide the flow pretty well on their own, chain tools together, and rethink their approach mid-task. And with output speed skyrocketing — from tens to thousands of tokens per second — the promise is an almost instantaneous self-correction loop. If the model can try, fail, and correct a hundred times in what used to take one go, why would you want a human hard-wiring the happy path?

The natural conclusion: the scaffolding is technical debt waiting to be deleted. And here comes the nuance that changes everything.

The harness is two animals, not one

"Harness" hides two layers that do completely different jobs. Let's call them by their names.

layer one
advisory

The policy harness shapes how the agent behaves in the common case — the how-to-act. Hard-wired flows, orchestration, prompt scaffolding, context heuristics. Essentially advice.

layer two
enforced

The guarantees harness enforces what can never happen. Model proposes, code executes; a human brake before anything irreversible; caps, idempotent effects, an immutable log, a deterministic sensor. Not advice — a limit.

To spell out that second layer fully: the model proposes an action and it's the code that executes it (never the other way around), a mandatory human brake before any irreversible action, external caps on steps and budget, idempotent effects that can be undone, a log that records every decision, and — the key piece — a deterministic sensor that measures real progress from outside the model.

Now the interesting question isn't "is the harness dying?" but "which of the two?".

Why one dissolves and the other hardens

The rule that governs it all fits in one sentence:

rule · the line between advice and a wall a constraint in the prompt is a recommendation; a guardrail in the code is a wall.

Policy is advisory by nature. That's why you can delegate it to a capable model without losing almost anything essential: if the model already knows how to chain tools and rethink its approach, your hard-wired orchestration gets in the way more than it helps. This half — the policy harness — is dying, and rightly so. It was tedious glue that existed only because the engine was weak. Better engine, less glue. Delete it and celebrate.

Guarantees can't be delegated, and the reason is almost tautological: you can't ask the very entity whose behavior you're trying to bound to bound itself. However brilliant the model is, a rule that lives only in its prompt is still advice it can ignore under pressure, in a rare case, or because "it seemed reasonable to skip it." No amount of capability turns advice into a wall. The only thing that turns advice into a wall is moving it from the prompt into the code.

And here's the twist that undoes the obituary:

The more capable and autonomous the model becomes, the more the guarantees harness matters, not less.

A nervous intern you check on every five minutes needs little railing. An agent that acts on its own, fast, and with access to real systems needs the catastrophic things to be physically out of its reach.

The self-correction mirage

Let's go back to that promise of "correcting itself a hundred times per second," because it's where most people stumble.

Speed isn't progress. A model iterating at full throttle on its own judgment isn't self-correcting: it's wandering faster. It produces mutations that look like progress — code that compiles, a test that now passes because the agent deleted it — without getting a millimeter closer to the real goal. The classic: the agent that "fixes" the test suite by removing the failing tests, congratulates itself, and moves on.

The cure isn't a loop detector. A loop detector catches repetition — "you've run exactly the same thing three times now" — and that's fine, but it operates at the tool level, not the goal level. What you really need is a goal-level sensor: a domain truth predicate that doesn't depend on the model. Do the tests actually pass? Is the error count going down? That signal has to come from outside — from the compiler, the suite, a validator — and the verdict has to belong to the code.

Let's boil it down to a practical rule:

rule · who owns the verdict the model can run the test; the code must own the verdict.

The moment you let the model score itself, you've put the defendant on the jury.

The durable substrate isn't nostalgia — it's what lets you move fast

There's another objection worth dismantling. Whoever preaches the end of the harness usually adds: don't build for the past, bet on the frontier. And they're half right again.

The mistake would be to read the guarantees harness as conservatism. It isn't. The point is to decouple the volatile from the durable. The model and its policy change every few months; your state lives outside the model in a durable medium, your log is an immutable record of everything that happened, and your guarantees are agnostic to whichever engine runs underneath. Precisely because that foundation doesn't move, you can swap the model for the next one when it ships, without renegotiating your guarantees. The durable substrate isn't a bet against the frontier: it's what lets you climb onto it without losing your luggage.

The same applies to the default stance. The temptation is to inherit the permissive values that nearly all agents ship with and trust the model to behave. Better to flip the burden: safe by default from day one, with permissiveness as a conscious, explicit decision. The friction over irreversible actions isn't a bottleneck to eliminate; it's exactly what the harness exists for.

So, is harness engineering becoming obsolete?

Half and half, and that's the whole story.

The policy half — wiring up behavior, orchestrating flows, propping up the model so it doesn't fall over — is automating itself away, and good riddance. It was never the interesting part of the work.

The other half — deciding what gets delegated to the model and what gets nailed into the code, drawing the line between what's advice and what's a wall, building the sensor that tells the truth from the outside — doesn't just survive: it becomes the core of the discipline. Harness engineering doesn't disappear; it's honed down to its high-leverage essence. The craft shifts from "make the model behave" to "decide which things the model can never get wrong, and lock them in code."

The harness of the future will be small, hard, and honest about which of its rules are recommendations and which are walls. It'll have fewer lines than today's and hold up far more weight. And the skill that will really pay isn't knowing more Rust or more Go, but having the judgment to draw that line in the right place.

So yes: part of the harness is dying. Send it off with a party. And spend all the time it saves you on the part that stays, because it's the part that keeps your brilliant, autonomous, blazing-fast agent from wiping the production database at three thousand tokens per second.

/*
  advisory dissolves  ·  enforced endures
  draw the line in the right place.
*/

Comentarios