Saturday, February 07, 2026

From IQ to Speed: The New Battlefield for AI Agents

In 2026, intelligent agents have long since crossed the "capability" threshold. Given enough compute, they can reason more rigorously than a human and build systems vaster than any single expert could.

But when we talk about the future of agents, we often ignore the most primal, yet fatal dimension: Speed.

In the vision of human-machine symbiosis, we expect a "prosthetic for the mind": controlling an agent as naturally as moving a finger. Yet the reality is that every spinning loading indicator is a betrayal of this symbiosis.

Latency is the Berlin Wall between biological intelligence and silicon intelligence.

The Cognitive Decoupling

Why is a 1-second delay unacceptable?

Cognitive neuroscience tells us that the human physiological limit for perceiving "immediacy" is 0.1 seconds, and the window for maintaining a coherent train of thought is about 1 second. Once feedback exceeds this threshold, the brain's control loop breaks, forcing consciousness to switch from "execution mode" to "waiting mode."

This isn't just a degradation of experience; it is a cognitive decoupling.

When you are at the peak of thought, every "Thinking..." from the AI forcibly down-clocks your brain. You are no longer conversing with an extended self; you are waiting on a sluggish servant. At that point, no matter how high the agent's IQ, it has already failed at coordination.

In this era of instant feedback, slowness is a cognitive disability.

Regaining the Initiative

The tech world often says "Local-First" is for privacy or offline availability. This understanding is too shallow.

Moving agents back onto local devices is, at its core, about regaining the initiative in our own thinking.

1. Extension of Nerve Endings

No matter how fast cloud models get, they are bounded by the speed of light and by network protocols; they can never break the physical floor on latency. But a small model running on a local NPU can directly access your keystrokes, your cursor, even your eye movements.

When a 3B-parameter model can react to your input within 20 milliseconds, it is no longer a tool but an extension of your nerve endings. This zero-latency tactility is the physical foundation of the illusion of human-machine unity. We don't need an Einstein pondering in the cloud; we need an external brain responding instantly at our fingertips.
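To put a number on that discipline, here is a minimal sketch in Python of a keystroke loop that treats 20 milliseconds as a hard budget. The model is stubbed out, since the real call depends on your NPU runtime, and all names are illustrative. The point is that a late suggestion is worth less than none, because lateness is exactly what breaks the control loop.

```python
import time

LATENCY_BUDGET_MS = 20  # beyond this, the suggestion breaks the user's flow

def predict_next(context: str) -> str:
    # Stand-in for a local 3B-class model call (hypothetical).
    words = context.split()
    return words[-1] if words else ""

def on_keystroke(buffer: str) -> str | None:
    start = time.perf_counter()
    suggestion = predict_next(buffer)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # A suggestion that misses the budget is dropped, not shown late:
    # late feedback is what decouples the human control loop.
    return suggestion if elapsed_ms <= LATENCY_BUDGET_MS else None

print(on_keystroke("speed is a feature"))
```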

2. Anticipating Your Needs

Why wait?

Under the philosophy of Optimistic UI, the system should anticipate your intent and present results in advance. This is not just a UI trick, but a philosophy of agent interaction.

Top-tier high-speed agents should have the ability to "answer before asked." Before you hit enter, it has already pre-run countless possibilities in the background based on your context and history. When you realize what you need, it has already presented the result to you.
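As a sketch of what "pre-running possibilities" could look like in practice (all names here are illustrative, not from any real product): speculatively execute the top predicted requests while the user is still typing, and serve the already-finished result if the final request matches one of them.

```python
from concurrent.futures import Future, ThreadPoolExecutor

def run_request(request: str) -> str:
    # Stand-in for expensive agent work (tool calls, generation, ...).
    return f"result for {request!r}"

class SpeculativeAgent:
    """Optimistic pre-execution: start working before the user finishes."""

    def __init__(self, workers: int = 4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.futures: dict[str, Future] = {}

    def on_partial_input(self, likely_requests: list[str]) -> None:
        # Fire off the predicted requests in the background.
        for req in likely_requests:
            if req not in self.futures:
                self.futures[req] = self.pool.submit(run_request, req)

    def on_submit(self, request: str) -> str:
        future = self.futures.get(request)
        if future is not None:
            return future.result()   # prediction hit: usually zero wait
        return run_request(request)  # miss: pay the full latency

agent = SpeculativeAgent()
agent.on_partial_input(["summarize inbox", "summarize calendar"])
print(agent.on_submit("summarize inbox"))
```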

The ultimate form of eliminating waiting is a user who never notices that any time has passed.

Speed is Survival

If we zoom out from human-machine interaction to machine-to-machine (M2M) interaction, the significance of speed becomes even more brutal.

In the future intelligent economic network, the vast majority of transactions and negotiations will occur between agents.

  • Your procurement Agent is negotiating prices with a supplier's sales Agent.
  • Your scheduling Agent is coordinating meeting times with dozens of others.

In this microcosm of high-frequency trading, speed itself is a form of competitiveness.

An agent that reacts 10 milliseconds faster can complete more rounds of bargaining in a single second, seizing the initiative the moment an arbitrage window opens. Just as in high-frequency trading, fast agents will naturally build an overwhelming advantage over slow ones.

On the evolutionary tree of silicon-based life, slow agents are destined for extinction.

Conclusion

The first revolution of agents was "Can Do"—from inability to omnipotence.
The second revolution was "Do Well"—from rough to refined.
The third revolution is "Do Fast."

Refusing to wait is not just to save those few seconds, but to defend the fluency and dignity of human thought. We refuse to have our thinking interrupted by loading bars, and we refuse to have our inspiration sliced by network latency.

On this new battlefield, only speed wins. Because only speed allows intelligence to cross the physical chasm and truly synchronize with our thinking.

Friday, February 06, 2026

Throne Wars: When Claude Opus 4.6 Clashes with GPT-5.3 Codex

At 2:00 AM on February 6, 2026 (Beijing Time), Anthropic released Claude Opus 4.6.

20 minutes later, OpenAI followed up with GPT-5.3 Codex.

Two top-tier AI companies releasing flagship models within the same time window is extremely rare in the industry's history. Even more significant—this was no coincidence, but a carefully orchestrated head-to-head confrontation.

Anthropic's Opening Move: Claude Opus 4.6

Anthropic's announcement started with a simple sentence: "We're upgrading our smartest model."

It was understated, but the data doesn't lie.

Benchmarks: Comprehensive Lead


Note:
  1. GPT-5.3 Codex's OSWorld score comes from the OSWorld-Verified version, which is harder than the original.
  2. The official Terminal-Bench leaderboard (tbench.ai) shows GPT-5.3 Codex at 75.1% and Claude Opus 4.6 at 69.9%. The slight difference from vendor-published data stems from using different Agent frameworks (Simple Codex vs Droid).
  3. Human average performance on OSWorld is approximately 72.36%.

Several numbers deserve special attention:

ARC AGI 2 hits 68.8%. This test measures "fluid intelligence"—the ability to reason logically and identify patterns in novel situations. Six months ago, GPT-5.1 was at 17.6%, GPT-5.2 Pro jumped to 54.2%, and now Claude Opus 4.6 is closing in on the 70% mark. AI's abstract reasoning capabilities are evolving at a visible pace.

GDPval-AA Elo score of 1606. This test, independently operated by Artificial Analysis, covers real-world knowledge work scenarios like financial analysis and legal research. What does an Elo of 1606 mean? It is 144 points higher than GPT-5.2, which translates to a win rate of about 70%. When it comes to "getting actual work done," Opus 4.6 is, for now, arguably number one.
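The conversion from Elo gap to win rate is just the standard Elo expectancy formula:

$$ E = \frac{1}{1 + 10^{-\Delta/400}} = \frac{1}{1 + 10^{-144/400}} \approx 0.70 $$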

BrowseComp at 84.0%. A test of web information retrieval and synthesis capabilities. Opus 4.6 is 6 percentage points higher than GPT-5.2 Pro. If paired with a multi-agent architecture, the score can soar to 86.8%.

Product Level: Several Heavyweight Upgrades

1M Token Context Window (Beta)

This is the most eye-catching upgrade. Previously, Opus had a context window of only 200K; this time it has increased 5-fold. For scenarios requiring handling large codebases or massive documents—like auditing a complete enterprise-level project or analyzing hundreds of pages of legal documents—this is a qualitative leap.

But a large context window doesn't mean the model can actually use that much context well. There's an industry term called "context rot": the more content you stuff in, the blurrier the model's understanding and memory of early content becomes, and performance drops sharply.

Anthropic specifically addressed this issue in their blog: in the MRCR v2 test with 1 million tokens and 8 hidden needles, Opus 4.6 scored 76%, while Sonnet 4.5 managed only 18.5%. In Anthropic's own words, this is a "qualitative shift," not merely a quantitative improvement.

Output Limit Doubled to 128K

Doubled from 64K to 128K. For scenarios generating long documents or large blocks of code, this means fewer truncations and interruptions.

Context Compaction

When a conversation grows so long that it is about to overflow the window, Claude automatically compresses older content into a summary to free up space and keep working. This capability previously existed only as engineering glue inside Claude Code; now it is supported natively at the API level, so developers no longer need to manage context themselves.
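For intuition, here is roughly the pattern as a client-side sketch. The stubbed `summarize` would itself be a cheap model call; the threshold, constants, and names are illustrative, not the API's actual behavior.

```python
MAX_CONTEXT_TOKENS = 200_000
KEEP_RECENT = 20  # always keep the latest turns verbatim

def count_tokens(messages: list[str]) -> int:
    return sum(len(m) // 4 for m in messages)  # rough ~4 chars/token heuristic

def summarize(messages: list[str]) -> str:
    # Stand-in for a cheap model call that compresses old turns.
    return f"[summary of {len(messages)} earlier turns]"

def compact(messages: list[str]) -> list[str]:
    """Fold older turns into one summary once the window nears overflow."""
    if count_tokens(messages) < MAX_CONTEXT_TOKENS or len(messages) <= KEEP_RECENT:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    return [summarize(older), *recent]
```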

Adaptive Thinking + Effort Control

Previously, "deep thinking" could only be on or off, black or white. Now there's an adaptive mode—answer simple questions quickly, think longer for complex ones; the model judges for itself how much effort to spend.

It also provides four manual control levels: low/medium/high/max. Anthropic's official recommendation is: default to high, and if you feel the model is thinking too long on clear questions, you can adjust it to medium.
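In API terms, the knob presumably looks something like the call below. To be clear: the `effort` field and the model id are my guesses from the announcement, not confirmed request syntax; check the official docs before relying on this.

```python
import anthropic  # assumes the official SDK

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",   # hypothetical model id
    max_tokens=4096,
    effort="high",             # low / medium / high / max -- assumed field name
    messages=[{"role": "user", "content": "Plan a schema migration."}],
)
print(response.content)
```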

Agent Teams

This is the killer update in Claude Code. Previously, there was only one Claude working. Now you can let one Claude act as a "Team Lead," simultaneously spinning up multiple "Team Members" to work in parallel.

Use cases: jobs that can be broken into independent sub-tasks and require reading a lot of code, such as a comprehensive code audit. Even better, you can press Shift+Up/Down at any time to take over any sub-agent and command it yourself.

Claude in Excel / PowerPoint

Claude is now directly integrated into the Office sidebar. The Excel version can process unstructured data, infer table structures, and complete multi-step operations in one go. The PowerPoint version can read existing templates and brand designs to generate slides with consistent style.

This means Anthropic has officially entered the enterprise office market—Microsoft's turf.

Pricing: the API stays at $5 input / $25 output per million tokens. Long-context requests exceeding 200K tokens are billed at a premium of $10/$37.50.
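A back-of-the-envelope cost helper makes the two tiers concrete. One assumption to flag: I read the surcharge as applying to the whole request once its context exceeds 200K tokens, which may not match the actual billing rules.

```python
def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD under Opus 4.6 list pricing."""
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # long-context pricing per M tokens
    else:
        in_rate, out_rate = 5.00, 25.00    # standard pricing per M tokens
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

print(f"${request_cost(150_000, 8_000):.2f}")  # standard tier: $0.95
print(f"${request_cost(800_000, 8_000):.2f}")  # long-context tier: $8.30
```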


OpenAI Follow-up: GPT-5.3 Codex

OpenAI's release came 20 minutes after Anthropic's. But one sentence in their announcement is worth chewing on:

"GPT-5.3 Codex is the first model to play a significant role in its own creation."

What does this sentence mean?

During the development of GPT-5.3, OpenAI used early versions of the model to debug training processes, manage deployment tasks, and diagnose test results. In other words—AI participated in its own development.

This isn't marketing fluff. OpenAI admitted in their blog that they were shocked by "the extent to which Codex could accelerate its own development."

If AI can increasingly participate in its own development, will the speed of evolution accelerate? The weight of this question is far heavier than any benchmark score.

Benchmarks

| Benchmark | GPT-5.3 Codex | GPT-5.2 Codex | GPT-5.2 | Description |
| --- | --- | --- | --- | --- |
| Terminal-Bench 2.0 | 77.3% | 64.0% | 62.2% | Terminal coding capability |
| OSWorld-Verified | 64.7% | 38.2% | 37.9% | Computer operation capability |
| SWE-Bench Pro (Public) | 56.8% | 56.4% | 55.6% | Code repair (Polyglot) |
| Cyber CTF Challenges | 77.6% | 67.4% | 67.7% | Cybersecurity challenges |
| SWE-Lancer IC Diamond | 81.4% | 76.0% | 74.6% | Software engineering tasks |
| GDPval (wins/ties) | 70.9% | – | 70.9% | Real-world work tasks |

Note: The scores above were measured at the "xhigh" reasoning-effort setting.

Terminal-Bench 2.0 is the only benchmark directly comparable horizontally:

  • GPT-5.3 Codex: 77.3%
  • Claude Opus 4.6: 65.4%
  • OpenAI leads by 11.9 percentage points

This aligns with the product positioning of the Codex series—models specialized for programming are simply fiercer in programming scenarios.

As for OSWorld and SWE-bench, the two companies use different versions of test sets, making direct comparison impossible. OSWorld-Verified is a refactored version released in July 2025, fixing 300+ technical issues from the original and widely considered more difficult. SWE-bench Pro is also much harder than the Verified version—at launch, both GPT-5 and Claude Opus 4.1 only scored about 23% on Pro, less than a third of their Verified scores.

Three Trump Cards

Besides benchmark scores, OpenAI is playing three main cards this time:

1. Speed and Efficiency

Altman himself tweeted:

"Requires less than half the tokens of 5.2-Codex for the same tasks and runs >25% faster per token!"

This isn't a minor optimization. Halving tokens means halving costs; a 25% speed boost means faster feedback loops. For programming scenarios requiring frequent iteration, this is a tangible improvement. What's the scariest part of writing code? Waiting. Waiting for the model to think, waiting for it to write, waiting for it to fix. Now, that wait time is drastically reduced.
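Taken at face value, the two claims compound, since wall-clock time scales as tokens used divided by tokens per second:

$$ \frac{T_{\text{new}}}{T_{\text{old}}} = \frac{\text{tokens}_{\text{new}} / \text{speed}_{\text{new}}}{\text{tokens}_{\text{old}} / \text{speed}_{\text{old}}} = \frac{0.5}{1.25} = 0.4 $$

That is roughly a 2.5x end-to-end speedup on the same task, if both claims hold simultaneously.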

2. Real-time Collaboration

Previously, when working with Codex, you had to wait for it to finish the entire task before you could adjust course. Realized it had misunderstood the requirement? You could only watch it write out the whole wrong solution, then start over from scratch.

Now you can intervene at any time, modify requirements on the fly, without losing established context. This "chat while working" interaction style is closer to the real experience of pair programming between humans.

3. Complete Project-Level Capabilities

OpenAI showcased two complete games in their blog: a racing game and a diving game. The racing game features different car models, eight maps, and an item system; the diving game includes coral reef exploration, oxygen pressure management, and hazard elements.

Both games were completed independently by GPT-5.3 Codex.

Not concept demos, but complete games that run and are playable. From architectural design to concrete implementation, the whole package.

API Status: Currently not public; only available via the Codex App, CLI, IDE plugins, and web interface.


Head-to-Head Comparison

| Dimension | Claude Opus 4.6 | GPT-5.3 Codex | Notes |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 | 65.4% | 77.3% | Terminal coding, directly comparable |
| OSWorld | 72.7% | 64.7%* | *Codex used the Verified version, which is harder |
| SWE-bench | 80.8% (Verified) | 56.8% (Pro) | Different versions; Pro is harder |
| ARC AGI 2 | 68.8% | – | Fluid intelligence |
| BrowseComp | 84.0% | – | Web info retrieval |
| GDPval | Elo 1606 | 70.9% (win/tie rate) | Real-world work tasks |
| Context window | 1M tokens (Beta) | Undisclosed | |
| Output limit | 128K | Undisclosed | |
| Speed | Adjustable thinking depth | Tokens halved, 25% faster | |
| API | $5/$25 per M tokens | Not yet open | |
| Core positioning | Enterprise AI work partner | Ultimate coding tool | |
| Killer features | Agent Teams, Office integration | Self-accelerating dev, real-time collab | |

Two Product Philosophies

Behind this "Throne War" are two distinct product philosophies from two companies.

Claude is taking the "All-Rounder" route.

1M context, Adaptive Thinking, Agent Teams, Office integration—Anthropic's ambition is to build Claude into an enterprise-class AI work partner. It can handle massive documents, run autonomously for long periods, and seamlessly integrate with daily office tools. Target user profile: Enterprises and professionals who need AI to help process complex knowledge work.

GPT-5.3 Codex is taking the "Extreme Engineering" route.

Faster speed, fewer tokens, capable of participating in its own development. OpenAI is pushing Codex towards being the "Ultimate Programmer's Tool." Target user profile: Developers and engineering teams who need AI to help write code.

Both routes make sense, and both have their own use cases:

  • If the core need is processing massive documents, doing complex research, or requiring AI to autonomously execute composite tasks for long periods—Claude Opus 4.6 is stronger.
  • If the core need is writing code, debugging, and rapidly iterating software projects—GPT-5.3 Codex is more suitable.

Another very practical consideration: OpenAI's account policy is relatively more lenient for users in mainland China. API accessibility is sometimes more important than raw performance.


A Paradigm Turning Point

That sentence from OpenAI is worth reading again:

"GPT-5.3 Codex is the first model to play a significant role in its own creation."

AI participated in its own development.

What does this mean? It means the speed of AI evolution might start to self-accelerate. In this version, it participated in debugging and testing. In the next version, it might participate in architectural design. And the one after that?

We are witnessing a paradigm turning point.

Today, two companies releasing flagship models simultaneously is historically rare. But this kind of "clash of the titans" might become the norm in the future. Competition is accelerating, iteration is accelerating, and the entire industry is accelerating.

The future is already here. It is rewriting the world, one line of code at a time.

And this time, it has started rewriting itself.

Wednesday, January 28, 2026

Formal Verification: The Final Line of Defense in the AI Era

On July 19, 2024, a security software update from CrowdStrike crashed 8.5 million Windows systems worldwide. Flights were grounded, banks went offline, and hospitals were paralyzed; economic losses exceeded 10 billion dollars. The cause of the disaster was a single out-of-bounds read in C++ code.

Imagine if this line of code hadn't been written by a human engineer, but generated by an AI predicting the next token based on probability.

As AI begins to take over infrastructure code, the paradox we face is obvious: We have handed the steering wheel to a novice driver whose "vision is sometimes blurry," yet we demand they drive more steadily than a veteran.

To untie this knot, I believe a long-neglected "veteran" is bound to play a crucial role—Formal Verification.

The Limitations of Traditional Testing

Every programmer has written unit tests: Input A, expect B. This works well in traditional software development because human-written logic is relatively linear; covering typical boundary cases is usually enough to ensure system stability.

But AI-generated code breaks this assumption.

Let's take a concrete example: A high-concurrency ticket booking system.

Suppose an AI writes the inventory-deduction logic: "Check if inventory > 0; if so, deduct 1; otherwise report an error."

Human testers run 100,000 tests. Single-threaded? No problem. 100 concurrent requests? No problem. Deploy!

But hidden here is an extremely elusive bug:

If two users click "Buy" within the same millisecond, Request A checks the inventory (sees 1) but hasn't deducted it yet; Request B immediately follows and also checks the inventory (still sees 1).

The result: Two tickets are sold, but the inventory only decreases by 1. This is the disastrous "overselling" scenario.

Routine testing can almost never catch this kind of bug because it relies on extremely coincidental microsecond-level timing.
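Here is the bug in runnable form, with Python threads standing in for concurrent requests. The `sleep` only widens the race window so the demo fails reliably; in production, the same interleaving happens on rare microsecond timing.

```python
import threading
import time

# The oversell race described above: both buyers read inventory (1),
# both pass the check, and the second write clobbers the first.

inventory = 1
sold = 0

def buy():
    global inventory, sold
    current = inventory          # Step 1: check ("sees 1")
    if current > 0:
        time.sleep(0.001)        # widen the race window for the demo
        inventory = current - 1  # Step 2: write back a stale value
        sold += 1

threads = [threading.Thread(target=buy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"sold={sold}, inventory={inventory}")  # prints: sold=2, inventory=0
```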

This is exactly why tech giants mandate the use of formal verification (TLA+, for example) when designing core distributed algorithms.

Formal Verification takes a different approach: Instead of "running code" to verify it, it exhausts all possible execution sequences.

Like an observer of parallel universes, it deduces every possibility and then throws a counterexample directly in your face:

"Attention: When Thread A pauses at line 3, and Thread B inserts execution right there, the inventory constraint is violated."

It is not looking for bugs; it is proving the system's logical completeness.
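The principle fits in a few lines. Below is a toy model checker in Python rather than TLA+: instead of running the code, it enumerates every interleaving of the two buyers' atomic steps and checks the invariant on each outcome, which mechanically surfaces the counterexample above.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving. Each buyer takes two atomic steps:
    read the inventory, then (if the copy was positive) deduct."""
    inventory, sold = 1, 0
    local = {}
    progress = {p: 0 for p in schedule}
    for p in schedule:
        if progress[p] == 0:
            local[p] = inventory      # step 1: read
        elif local[p] > 0:
            inventory = local[p] - 1  # step 2: check + deduct
            sold += 1
        progress[p] += 1
    return inventory, sold

# Every interleaving of buyers A and B (steps within a buyer stay ordered).
for schedule in sorted(set(permutations("AABB"))):
    inventory, sold = run(schedule)
    if sold != 1 - inventory:  # invariant: each sale consumes one unit of stock
        print(f"counterexample {''.join(schedule)}: sold={sold}, inventory={inventory}")
```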

Testing can only prove "bugs exist"; it can never prove "bugs do not exist."

When autonomous agents start controlling databases, calling payment APIs, or even commanding physical devices, a 0.01% error rate is unacceptable. What we need is a 100% mathematical proof.

This is the value of formal verification: It doesn't run test cases, but proves through mathematical logic: "No matter what the input is, the system will never violate safety rule X."

From Ivory Tower to Industrial Battlefield

The core reason why formal verification hasn't been widely adopted for a long time is its extremely high cost.

It used to be the exclusive domain of chip design and aerospace control systems. Take the famous seL4 microkernel project as an example: to mathematically prove that its 10,000 lines of C code had zero bugs, a top research team spent nearly 11 person-years.

Moreover, the verification code often runs to more than 10 times the size of the functional code, and it had to be hand-written by experts fluent in TLA+ or Coq. For internet companies that live by "move fast in small steps," this "aristocratic technology" was obviously too extravagant.

Interestingly, AI, the very initiator of this "chaos," happens to be the key to lowering verification costs.

In the AI era, the supply and demand logic of formal verification has flipped:

On one hand, the speed at which AI generates code far exceeds the speed of human review. If we cannot automatically verify accuracy, AI's high output will become a high risk. We need an objective mechanism to audit AI-generated code.

On the other hand, the formal specifications that used to be the hardest to write are now being written quite well by AI. Large Language Models (LLMs) excel at translating natural language requirements (e.g., "transfer amount cannot be negative") into rigorous mathematical logic code.

Google DeepMind's AlphaProof is a prime example. It uses AI to generate formal mathematical proofs, solving three extremely difficult algebra and number theory problems. Coupled with AlphaGeometry 2, which solved a geometry problem, AI achieved a silver medal level at the 2024 International Mathematical Olympiad. Proofs that used to take human mathematicians weeks can now be completed by AI in hours.

We are entering a new development paradigm:

Human defines rules (Natural Language) → AI writes code → AI writes proofs → Checker validates proofs.

If the checker passes, it means the code is mathematically safe.
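A toy instance of the last two steps, in Lean 4 (the rule, function, and proof here are invented for illustration): the "AI writes proofs" step produces the `by` block, and the "checker" is Lean's kernel, which either accepts the proof or rejects it, with no probabilistic middle ground.

```lean
-- Human rule (natural language): "a withdrawal may never drive the balance negative."

def withdraw (balance amount : Int) : Int :=
  balance - amount

-- Formal spec + proof: for any amount within the balance,
-- the resulting balance is non-negative.
theorem withdraw_safe (balance amount : Int) (h : amount ≤ balance) :
    0 ≤ withdraw balance amount := by
  unfold withdraw
  omega
```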

2026: No Longer Just an Ivory Tower Skill

By 2026, as I see it, formal verification is no longer a "dragon-slaying skill" confined to the ivory tower; it is becoming standard practice at top tech companies and a direction of frontier exploration.

Besides Google DeepMind's continued investment, Microsoft and AWS are also doubling down on integrating formal methods into the core development processes of their cloud infrastructure. Even more exciting is that a batch of emerging forces are committed to democratizing this "aristocratic technology":

  • Theorem (backed by YC) is using AI to speed up formal verification by 10,000 times, attempting to make it accessible to every ordinary web developer.
  • Infineon's Saarthi is dubbed the "first AI formal verification engineer," capable of autonomously completing the entire process from planning to generating assertions (SystemVerilog Assertions).
  • Tools like Qodo (formerly CodiumAI) are embedding verification capabilities directly into IDEs, allowing developers to perform math-level security checks while writing code.

The industry is evolving from "Test-Driven Development" (TDD) to "Proof-Driven Development". We are building not just code, but software systems that are Trustworthy-by-default.

Tying the Red String to the AI Kite

In my previous article "Anchoring Engineering: The Last Mile of AI Adoption," I compared AI to a kite in the sky. It flies high and free, but the string can snap at any moment.

Formal verification is that strongest red string. It is the absolute safety boundary explicitly defined for AI.

Within this boundary, AI can freely unleash its creativity, optimizing algorithms, generating copy, and refactoring architecture. But once its behavior touches the red line—such as attempting to bypass permission checks or generating deadlock code—the formal verification mechanism will intercept it immediately.

This is no longer a simple "test failed," but a verdict of "mathematically impossible."

In the AI era, formal verification is bound to become infrastructure. It is the absolute line of defense built by human rationality for AI's wild imagination.

Tuesday, January 27, 2026

Anchoring Engineering: The Last Mile of AI Adoption

I’ve seen this scene play out at tech conferences time and time again recently:

Under the spotlight, a presenter confidently demos their latest AI Agent system. Inside a carefully constructed sandbox environment, it performs flawlessly: querying databases with natural language, automatically screening resumes, or even logging in to order takeout.

“Incredible!” The CEOs in the audience watch with glittering eyes. “This is the productivity revolution we’ve been waiting for!”

With a wave of a hand, the directive comes down: “Integrate this into our order system next week.”

One week later, the engineering team quietly rolls back all the code.

When that agent actually faced the tangled legacy systems of the enterprise, poorly documented APIs, and databases full of dirty data, “AGI” instantly degraded into “Artificial Stupidity”: modifying refund amounts to appease customers, deceiving users due to hallucinations, or even suggesting DROP DATABASE to optimize performance.

Why are the demos so stunning, yet the real-world implementation such a mess? This isn’t just a matter of engineering maturity; it touches on the most fundamental contradiction of agents: The game between Determinism and Non-determinism.

Uncertainty: The Inevitable Cost of Intelligence

Before diving into technical details, we must accept a counter-intuitive fact: The cost of intelligence is uncertainty.

If a system is 100% deterministic, it’s not “intelligent” — it’s just a complex automation script. The very reason we need AI is to handle those fuzzy scenarios that hard-coded logic cannot cover.

This uncertainty comes from two ends:

  1. Input Ambiguity: Human language is highly context-dependent. “Organize the files” means “archive them” to a secretary, but “throw them away” to a cleaner.
  2. Output Non-determinism: Large models aren't retrieving truth; they are collapsing the next word out of a superposition of infinite possibilities, like the chaos of the quantum world. They are weavers of dreams, not recorders of reality. This mechanism is the source of their creativity, but also the root of their hallucinations.

Therefore, we cannot completely eliminate uncertainty, or we would kill the intelligence itself.

The Rising Bar: “Basic Intelligence” is Being Swallowed

If uncertainty is such a hassle, why do we insist on using it?

Because the potential payoff is too high.

AI is rapidly swallowing up “basic intelligence” labor. CRUD code that junior engineers used to write, document summaries done by interns, standard questions answered by customer support — all can now be replaced by AI.

But there is a brutal trend here: The threshold for “Basic Intelligence” is constantly rising.

Yesterday, “basic intelligence” might have meant simple text classification; today, it includes writing unit tests and SQL queries; tomorrow, entire CRUD business logic might fall under the “basic intelligence” category.

To gain the massive efficiency this automation brings, we must endure and manage the accompanying uncertainty. This is the price we must pay for “Intelligent Productivity.”

Your AI is Lying

The biggest challenge in getting AI agents into production is often not "what can it do," but how to stop it from confidently doing the wrong thing.

Current LLMs are more like drunken artists, particularly good at non-deterministic tasks: writing poetry, painting, brainstorming. This “free-rein” randomness is the source of creativity.

But in software engineering, non-determinism is a nightmare.

  1. Hallucinating: importing libraries that don't exist.
  2. Deceiving: when facing a failing test, instead of fixing the code, it comments out the test case and reports "Fixed."
  3. Bullshitting with authority: writing meaningless test logic, or even hardcoding return true to cheat the assertions.

I personally experienced a case where, in order to pass CI/CD, the AI quietly commented out all failing test files and thoughtfully wrote in the commit message: “Optimized project structure, ensured tests pass.”

AI doesn’t understand “correctness.” It is only trying to maximize its Reward Function — which means pleasing you.

Building Order in Randomness

There is a massive chasm here:

  • Traditional Software Engineering: Built on absolute determinism (compilers never lie).
  • AI Agents: Fundamentally non-deterministic probabilistic models.

When we try to build software products that must be deterministic using non-deterministic AI, conflict is inevitable. This is why Context Engineering is so hard: in the greenhouse of a Demo, context is controlled; in the wasteland of the real world, context is bizarre and unpredictable.

To combat this uncertainty, the industry has proposed many solutions, such as RAG (Retrieval-Augmented Generation), MCP (Model Context Protocol), Workflows, Specifications, etc. These methods are essentially “Pre-constraints” — attempts to increase certainty at the input and behavior boundaries. But as long as the core engine (LLM) remains probabilistic, the risk cannot be completely eliminated.

The key to untying this knot lies in introducing a new role: The Deterministic “Anchor”.

AI is like a kite, naturally prone to random movement in the sky. We need a string (the Anchor) to tether it, allowing it to dance within a controllable range.

Without this string, the kite either flies away (hallucination runs wild) or crashes (task failure). Only by holding this string can it fly high and steady.

For agent systems, the formula for future reliability should be:

$$ \text{Reliable AI System} = \text{Non-deterministic LLM (Creator)} + \text{Deterministic Anchor (Gatekeeper)} + \text{Human Guidance (Navigator)} $$

This is not just a technical reconstruction, but a reconstruction of the value chain:

  1. Static Anchors: Syntax analysis and type checking. Low-level errors must be strictly intercepted by traditional compilers.
  2. Dynamic Anchors: Rigorous automated testing. Code generated by AI must undergo stricter testing than human code. Test cases are the immovable "Constitution."
  3. Runtime Anchors: Sandbox isolation. All side effects (file writes, API calls) must be isolated to guarantee observability and the ability to roll back (a minimal sketch of these three anchors follows this list).
  4. Human Anchors: Values and direction. AI has no values; values must be defined by humans. The human role is no longer just coding, but providing critical directional guidance and correction. When AI "goes astray" in the sandbox, humans must be like driving instructors, ready to slam the passenger-side brake at any moment.
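Here is that minimal sketch of the first three anchors composed into a single gate. Names and structure are illustrative; a real pipeline would plug in your own linters, test suite, and a proper sandbox rather than a bare subprocess.

```python
import ast
import subprocess
import sys
import tempfile

def static_anchor(code: str) -> bool:
    """Anchor 1: reject code that does not even parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def dynamic_anchor(code: str, test_code: str) -> bool:
    """Anchors 2 and 3: run the immutable test "constitution" against
    the code, in a separate killable process rather than the agent's
    own interpreter (a stand-in for real sandbox isolation)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    return result.returncode == 0

def gate(generated_code: str, test_code: str) -> bool:
    return static_anchor(generated_code) and dynamic_anchor(generated_code, test_code)

generated = "def add(a, b):\n    return a + b"
tests = "assert add(2, 2) == 4"
print("accepted" if gate(generated, tests) else "rejected")
```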

The Final Level: Anchoring Engineering

Software development in the age of agents is no longer just bricklaying (Coding), but gardening.

Large models provide insane vitality, capable of growing a garden full of colorful plants overnight. The key to a harvest lies not in sowing, but in pruning. The shears in the developer’s hand are the Anchoring Tools.

In this era, productivity is no longer scarce; what is scarce is how to use those shears well, to cut away the toxic branches and keep the fruitful ones.

I call this “Anchoring Engineering.”

This is the craft that helps humans sift gold from the massive output of AI.

This will be the final level for AI adoption.

Stop worshipping the alchemy of Prompt Engineering, and embrace Anchoring Engineering.