Saturday, March 28, 2026

Harness in AI: The Revival of Cybernetics

In 2026, all of Silicon Valley is obsessed with something that sounds decidedly unglamorous: building "scaffolding" around large language models.

When you watch an agent autonomously squash a gnarly bug or fluidly refactor a complex project, you might marvel at the model's brainpower. But look under the hood of these dazzling demos and there's no magic to be found.

What you find instead is a meticulous execution environment: automated linters and type-checkers intercepting syntax errors in milliseconds, sandboxed test suites validating every output, permission systems strictly governing which files and tools the agent can touch, state managers maintaining memory across sessions, and observability pipelines tracing every step of reasoning.

This is Harness Engineering. A harness — in the original sense — is the full set of tack used to control a horse: saddle, bridle, bit, and reins. In the age of AI agents, it refers to the complete runtime architecture wrapped around a model. It doesn't tune the model's parameters; it builds the operating system that lets the model interact with the real world safely, reliably, and under control.

Dig into the core of this engineering discipline — actions, environment sensing, error feedback, retry-and-correct loops — and you'll arrive at a surprising realization: none of this was invented in the AI era. It is the full-scale revival of Cybernetics in the age of silicon — a theoretical framework born eighty years ago.

AI cannot do without cybernetics. When we can't peer inside the internal state of a hundred-billion-parameter black box, the only thing we can do is fit it with a precision harness.


The Soul of Cybernetics, Woven Through the History of Technology

To understand why Harness Engineering matters so much, we need to go back to 1948.

That year, Norbert Wiener published Cybernetics. His epiphany came from the predictive fire-control systems used in World War II anti-aircraft guns.

To shoot down a plane, you can't aim where the aircraft is — you have to aim where it's going to be. This required the weapon system to continuously capture deviations in the target's trajectory and feed those "error signals" back as inputs for the next firing command.

That is cybernetics in its purest form: using error feedback to correct a system's actions. Complex systems naturally drift. The more complex they are, the faster they drift. Cybernetics is the discipline of fighting that drift.

From anti-aircraft guns to far more intricate systems, cybernetics proved a powerful insight: you don't need to understand every internal state of a system. As long as you can observe its outputs and inject corrective signals, you can force the system toward its target.

Over the following half-century, this logic permeated modern engineering. TCP congestion control dynamically adjusts the send window using packet-loss feedback. Kubernetes reconciliation loops continuously compare a cluster's actual state against its desired state and auto-heal discrepancies. Even the auto-brightness on your phone is driven by an optical control loop.
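The pattern all of these systems share can be compressed into a few lines: observe the output, compute the deviation from the desired state, inject a correction. The sketch below is purely illustrative, in the spirit of a Kubernetes-style reconciliation loop; the names and the toy "cluster" are invented for this example, not any real controller API.

```python
# A minimal cybernetic feedback loop: sense, compare, correct.
# The "cluster" here is a plain dict standing in for a real system.

def reconcile(desired: int, observe, scale) -> None:
    """Drive the observed replica count toward the desired count."""
    actual = observe()        # sense: read the system's output
    error = desired - actual  # compare: compute the deviation
    if error != 0:
        scale(error)          # act: inject a corrective signal

state = {"replicas": 2}       # actual state starts out of spec

def observe() -> int:
    return state["replicas"]

def scale(delta: int) -> None:
    state["replicas"] += delta

for _ in range(3):            # in practice this loop never stops running
    reconcile(desired=5, observe=observe, scale=scale)

print(state["replicas"])      # converges to the desired state: 5
```

Note that the loop never inspects how the system works internally; it only reads outputs and pushes corrections, which is exactly the point Wiener made about black boxes.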

Large language models — the biggest black boxes humanity has ever constructed — have given this theory its most demanding proving ground.


Three Paradigms, Each Tightening the Reins

The engineering methodology for deploying large models has undergone three metamorphoses, each expanding the boundary of control.

Prompt Engineering concerned itself with "what I say to the model." Carefully crafted system instructions, few-shot examples, persona definitions — that was the defining motif of 2023. At its core, this is open-loop control: you pack all your instructions in, then hope the model gets it right in one shot. No sensors, no feedback channel. Like a letter dropped into a mailbox — you won't know if the contents are correct until the recipient opens it.

Context Engineering shifted the focus from "what to say" to "what the model sees right now." RAG retrieval, memory management, context compression, selective information injection — no longer hand-assembled static text, but dynamically curated information environments for each reasoning step. It solved the signal-to-noise problem. Yet it still left a fundamental question unanswered: what happens when the model gets it wrong?

Prompt engineering lets you issue a precise instruction. Context engineering ensures the model sees the right information when making a decision. But both are missing the same thing — an error-correction mechanism.

Harness Engineering fills that gap.


Anatomy of the Harness

Harness Engineering doesn't focus on "input." It focuses on "how the entire system operates."

This is the most radical paradigm shift of the three. If a large model is a thoroughbred racehorse, the harness is the full set of tack that controls it. It encompasses all the peripheral infrastructure a production-grade agent needs:

  • Deterministic execution environments: Sandboxes, compilers, test suites — every action the model takes is verified in a controlled environment.
  • Multi-tiered feedback loops: Instant feedback from type-checkers (milliseconds), rapid feedback from linters (seconds), medium-speed feedback from automated tests (minutes), and slow feedback from CI/CD pipelines and human review.
  • Tool interfaces and permission governance: Strict definitions of which file systems, APIs, and databases the agent can access — and what it cannot touch. In a previous article on the desktop takeover, I mentioned MCP's security vulnerabilities — 30 CVEs in 60 days, path-traversal flaws in 82% of implementations. Harness Engineering is the systematic response to that attack surface.
  • State and memory management: Letting agents maintain context across sessions rather than starting from a blank slate every time.
  • Observability pipelines: Logs, traces, metrics — giving developers the ability to retrace every reasoning step and tool invocation.
  • Circuit breakers and guardrails: When an agent fails repeatedly or approaches a dangerous boundary, force a hard stop and escalate to a human.

This isn't a theoretical framework. This is what's running inside real products right now.

Claude Code automatically fires linters and type-checkers after every code generation, then formats the error messages and feeds them back to the model — a textbook feedback loop. Cursor's agent runs full test suites inside isolated sandboxes, climbing pass rates from roughly 60% on the first attempt to over 85% through automated retries. Google's Jules achieved a 51.8% solve rate on SWE-bench Verified, automatically parsing stack traces and restructuring its fix strategy after each failure.

None of these products claim to be "doing cybernetics." But everything they do — act, sense, error-signal, correct — is the exact closed loop Wiener drew on a blackboard eighty years ago.


The Power of Closed-Loop Control

Harness Engineering openly assumes the model will make mistakes. It never demands perfection on the first try.

When an agent outputs code with a syntax error, the harness doesn't "prevent" the mistake. It pushes the code into a sandbox, catches the error output, formats it, and feeds it back to the model: "This line is out of bounds. Here's the stack trace. Rewrite it." If three retries still fail, a circuit breaker fires, the task is suspended, and a human is notified.
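That retry-and-escalate loop fits in a few lines. Everything below is an illustrative stub, not any real product's harness: `stub_model` and `stub_check` stand in for the model call and the sandboxed checker, and the retry limit is arbitrary.

```python
# Act -> sense -> error-signal -> correct, with a circuit breaker.

MAX_RETRIES = 3

def run_with_harness(task, model, check):
    feedback = None
    for attempt in range(1, MAX_RETRIES + 1):
        output = model(task, feedback)       # act: generate a candidate
        error = check(output)                # sense: sandbox/linter/tests
        if error is None:
            return output                    # the closed loop converged
        feedback = f"Attempt {attempt} failed: {error}"  # error signal
    # Circuit breaker: repeated failure escalates to a human.
    raise RuntimeError("circuit breaker tripped: escalate to a human")

# Stub model: succeeds only after it has seen an error signal fed back.
def stub_model(task, feedback):
    return "fixed" if feedback else "broken"

def stub_check(output):
    return None if output == "fixed" else "SyntaxError on line 1"

print(run_with_harness("demo", stub_model, stub_check))  # prints "fixed"
```

The design choice worth noticing: the harness never tries to prevent the first error. It budgets for failure and spends its engineering effort on the quality of the feedback channel.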

This is the purest form of a closed-loop control system in cybernetics. The model is the actuator; the harness is the sensor plus controller. The system continuously measures the deviation between the action and the desired target, driving the model into a corrective cycle.

Why is a closed loop so critical? Because traditional imperative programming — "if A, then do B" — is deterministic: the same input always yields the same output. LLMs don't work that way. They are high-dimensional, nonlinear systems. The same prompt can produce wildly different results depending on the decoding path. In multi-step planning tasks, a small deviation at each step compounds through subsequent ones — 5 degrees off at step 3 can become 90 degrees by step 8. An LLM without a feedback loop is a rocket without a gyroscope: it might be pointed in the right direction at launch, but the farther it flies, the farther it strays.

Data from METR's agent evaluation framework backs this up: agents equipped with a full harness (sandbox plus tests plus feedback injection) complete complex coding tasks at rates several times higher than bare models. Not because the model got smarter, but because the system allows it to err, detect the error, and correct it. Without this harness, even the most capable foundation model spirals into hallucination after a handful of complex interactions.


From "Human in the Loop" to "Human on the Loop"

Why elevate Harness Engineering to such a prominent position among all engineering practices?

Because it changes the role of humans in the system.

Over the past few years, most AI engineering teams have done one thing: human in the loop — take the model's output, manually inspect it, find problems, manually tweak the prompt, and try again. The human is perpetually jammed inside the loop, every iteration dependent on human judgment.

The core philosophy of Harness Engineering is to pull the human out of the loop and place them above it: human on the loop. You stop reviewing every line of the model's output and start designing the entire environment. When an agent makes a mistake, instead of blaming the prompt, you ask: "What information was missing from the environment that could have prevented this error?" Then you add a lint rule, a structural test, a machine-readable architectural constraint document.

This is precisely how cybernetics operates. Wiener never tried to understand the full internal state of a controlled system — he cared only about output deviation and the quality of the feedback channel.

Intelligence lives inside the model. Reliability lives inside the harness.

The industry consensus of 2026 is clear: model intelligence is no longer the bottleneck — reliability is. And reliability depends almost entirely on the engineering quality of the harness. Anthropic's engineering team has shared data showing that Claude's task-completion time dropped roughly 40% after deploying a robust tool-use feedback chain. Multiple coding-agent teams have echoed the same finding in technical talks: the single largest source of performance gains is not model upgrades, but improvements to the execution environment.

In a previous piece on the reasoning era, I wrote that reasoning efficiency is the Moore's Law of the AI age. Harness engineering quality is yet another multiplier on top of that — no matter how cheap inference becomes, if a large share of calls are wasted on errors and retries, every cost advantage is devoured.

System robustness is migrating from inside the model to the surrounding infrastructure. The star engineering teams of the future may not be competing for fine-tuning wizards who can cook up training scripts, but for harness engineers who can architect execution environments and build feedback loops.


Who Holds the Reins

No amount of compute, no ocean of data, produces anything more than an immensely powerful stallion. It surges forward with raw momentum, yet cannot see the cliff ahead on its own.

To steer this horse and carriage onto the demanding production lines of human enterprise — and create value with precision — what you need is that seemingly unglamorous set of tack: a bit to limit the mouth, reins to transmit intent, a bridle to calibrate direction. The cold predictions Wiener scribbled into his manuscripts decades ago are descending upon the AI engineering of 2026.

And this is what makes technology so fascinating. The paradigm disrupting our era is, at its foundation, paying homage to the intuitions of its forebears.

Yet Wiener himself would probably not be so optimistic. In his final book, God and Golem, Inc. (1964), he warned that the danger of machines is not that they will rebel, but that they will faithfully execute the objectives they are given, even when the consequences of those objectives are catastrophic.

The reins can steer direction, but who sets the direction itself? As we build ever more sophisticated harnesses to drive AI toward ever more sharply defined goals, the real question may not be "are the reins strong enough" — but whose hands hold the other end.

Thursday, March 19, 2026

The AI Big Three: The End of the Model War and the Beginning of the Product War

In the 2026 AI race, a clear triopoly has emerged: OpenAI, Anthropic, and Google.

While each has its distinct strengths and weaknesses, one thing is becoming increasingly certain—the decisive battleground of the next phase is no longer the model, but the product.

OpenAI: Sitting on a Gold Mine, but Missing the Key

OpenAI's models are the strongest—there is little debate about this today. The GPT-5 series continues to lead outright on programming benchmarks like Terminal-Bench, and its reasoning capabilities remain squarely in the top tier.

But the problem is, a strong model does not automatically equate to a usable product.

Anyone who has used ChatGPT likely shares this sentiment: features are piled high, but the experience is far from seamless. The interface changes constantly, and not always for the better; behavior is erratic, and what works today might be hidden under a different menu tomorrow. A top-tier model stuffed inside a mediocre product shell often leaves users with the frustrating sense of sitting on a gold mine without knowing how to spend it.

What makes OpenAI even more anxious is that competitors are closing in. Their lead at the model layer is shrinking, yet they haven't built a sufficiently deep moat at the product layer. As a result, we're seeing some interesting moves—like partnering with early-stage tools such as OpenClaw to expand the market, attempting to solidify their position through ecosystem building.

The logic behind this move is sound, but execution is another story. If product quality lags, even the most powerful model merely buys time for latecomers to catch up. The era of crushing competitors purely through model parameters is drawing to a close.

Anthropic: Thriving in Silence

Claude's model might rank second, but if you ask, "Which is actually the most usable for real work?", the answer is very likely Anthropic.

Anthropic's style has always been clear: slow and steady, with no gimmicks. They prioritize safety, quality, and stability. Their engineering execution is arguably the best of the big three, and their product polish is meticulous. Whether you're writing code with Claude Code or processing documents with Cowork, the fluidity and reliability of the experience are noticeably a cut above the competition.

Of course, Claude has glaring weaknesses. Image generation and multimedia processing have always been its Achilles' heel, and in an era where multimodal capabilities are increasingly paramount, this deficiency hurts.

Interestingly, there’s been a recent shift. Under market pressure from tools like OpenClaw, Anthropic rolled out Dispatch. I've used it for a while, and honestly—the actual experience is quite good. While it has minor quirks, the core loop works well, and its completeness exceeded expectations. This proves that when backed into a corner, Anthropic's speed and quality of execution are still rock-solid.

Anthropic's greatest advantage is exactly its "unsexy" demeanor. No headline grabs, no arms-race-style launch events—just building good products and polishing the user experience. In a race where everyone is screaming slogans, keeping a low profile and staying out of trouble has ironically become their ultimate competitive edge.

Looking at the landscape today, if I had to bet on the most promising of the three, my money is on Anthropic.

Google: The Giant Pivots, Comprehensive but Blunt

Google's playbook differs completely from the other two. It doesn't start from a single model or product, but rather from an entire ecosystem.

Its engineering prowess is unquestionable. It moves fast, too—the iteration speed of the Gemini series is arguably the fastest among the big three. But speed comes at a cost: the model is highly verbose, and its actual problem-solving capability is mediocre. When using Gemini for deep-focus tasks, you often feel that it says a lot without actually resolving the issue.

However, Google holds a trump card that the others do not: its image processing capability is currently the strongest. Whether in image comprehension or generation, Gemini is clearly a step ahead in this dimension. In an environment where multimodal competition is intensifying daily, this is a tangible highlight.

An even more critical moat is Google's ecosystem. Chrome, Gmail, Google Docs, Android—Gemini can be seamlessly embedded into these billion-user products. This distribution power is something OpenAI and Anthropic cannot even begin to chase. Users don't need to specifically seek out AI; AI is already embedded in the very first app they open every day.

Google's strategy is to be the most comprehensive. As long as it avoids major blunders, it possesses a natural, structural advantage in the long run. However, "avoiding major blunders" is never a given for a massive bureaucracy.

The New Battlefield: Smart Models Don't Equal Smart Products

Having sketched the profiles of all three, let's discuss something even more crucial—the dimension of competition is undergoing a fundamental shift.

Over the past two years, the arms race in the AI sector was concentrated squarely on models and compute. Whoever had more parameters and higher benchmark scores was king. But as we move deep into 2026, this logic is increasingly falling apart.

The reason is simple: model capabilities are converging.

The gap between the three flagship models on most tasks has shrunk to single-digit percentages. You could argue that GPT leads by 5% in raw intelligence, Claude leads by 10% in workflow efficiency, and Gemini leads by 10% in imaging—these gaps certainly exist, but for the vast majority of users, they no longer constitute the critical factor in decision-making.

What truly widens the gap is using roughly identical models to create highly usable products.

With the same baseline model capabilities, how you orchestrate workflows, how you manage context, and how you design interactions—the experience differences brought about by this "product wisdom" are vastly greater than a few percentage points on a model leaderboard.

Two Key Metrics

What is the measuring stick for this product war? I believe there are two core metrics.

1. Response Speed

I've discussed this extensively in previous articles: in human-machine collaboration scenarios, speed is everything.

No matter how brilliant a model is, if every interaction requires a five- or ten-second wait, your train of thought breaks and the rhythm of collaboration collapses. Users don't need a genius who takes half a day to think; they need a highly responsive partner.

Whoever can compress end-to-end latency to the absolute minimum will win the user's mindshare in daily usage. This isn't solved just by stacking compute; it's a systems engineering problem encompassing model distillation, inference optimization, caching strategies, and architectural design.

OpenAI has already cut token overhead in half and boosted speeds by 25% with GPT-5.3. Their recently launched GPT-5.3-Codex-Spark, designed specifically for high-frequency interactions, and its corresponding Fast Mode push response speeds into a new order of magnitude: over 1,000 tokens per second. This is absolutely the right direction. But it's only the beginning.

2. Driving Down Costs

If speed determines whether a product is "delightful to use," then cost determines whether it is "affordable to use."

For AI to transition from geek toys to everyday tools, and from enterprise pilot projects to full-scale deployments, cost is the unavoidable hurdle. Current API pricing for flagship models remains too steep for massive-scale applications. True democratization still has a long way to go.

Whoever can be the first to drive costs down to an order-of-magnitude inflection point will devour the largest slice of the market.

This means all-out competition in inference efficiency, model distillation, Mixture of Experts (MoE), and edge deployment. The race is no longer about whose model is larger or smarter, but about who can provide sufficiently good capabilities at a fraction of the cost.

The Alternative Path of Chinese Large Models: Extreme Cost and Scenario Breakthroughs

If Silicon Valley's "Big Three" are still attempting to anchor pricing power through ever-more-powerful general-purpose models, the Chinese legion of large language models (such as DeepSeek, Kimi, Qwen, and Doubao) has taken a deeply pragmatic alternative route: aggressively reshaping the cost structure, then relentlessly dominating specific use cases.

Whether it's DeepSeek driving inference prices down to the floor through extreme engineering, Kimi's relentless extension of ultra-long context windows, or the speed with which various models are deploying across e-commerce, search, and entertainment ecosystems, the underlying logic is highly consistent: I don't necessarily need to place first on every benchmark, but I must make you an offer you can't refuse on real-world pricing and user experience in specific scenarios.

This is, in fact, an alternative solution to the "product war." While foreign giants were still polishing the UI interactions of their flagship products, domestic vendors were already enduring grueling price wars, slashing API costs to mere pennies, or even offering them for free.

This seemingly hyper-competitive "price butcher" strategy has actually vastly accelerated the prosperity of downstream AI products. When the cost of underlying API calls approaches zero, product managers no longer need to nervously tabulate token fees. Various features that were once considered luxuries—like silent background analysis, massive pre-loading, and continuous workflows—have finally become structurally viable.

To some extent, Silicon Valley is defining the upper limit of the models, while China is radically dragging down the threshold for applications.

The competition among AI giants is shifting entirely from "who is smarter" to "who is more usable."

The gap in raw model capabilities is shrinking, while the gap in product experience is magnifying. The future winners of this market won't be the ones holding the strongest model, but rather the ones who can most efficiently convert model capabilities into tangible user value.

The model is the engine, but the product is the car.

When the engines are all more or less the same, what really matters is who builds the tightest, most stable, fastest, and most affordable car.

This race has only just begun.