At 2:00 AM on February 6, 2026 (Beijing Time), Anthropic released Claude Opus 4.6.
20 minutes later, OpenAI followed up with GPT-5.3 Codex.
Two top-tier AI companies releasing flagship models within the same time window is extremely rare in the industry's history. Even more significant—this was no coincidence, but a carefully orchestrated head-to-head confrontation.
Anthropic's Opening Move: Claude Opus 4.6
Anthropic's announcement started with a simple sentence: "We're upgrading our smartest model."
It was understated, but the data doesn't lie.
Benchmarks: Comprehensive Lead
A few caveats before comparing the published numbers:
- GPT-5.3 Codex's OSWorld score comes from the OSWorld-Verified version, which is harder than the original.
- The official Terminal-Bench leaderboard (tbench.ai) shows GPT-5.3 Codex at 75.1% and Claude Opus 4.6 at 69.9%. The slight difference from vendor-published data stems from using different Agent frameworks (Simple Codex vs Droid).
- Human average performance on OSWorld is approximately 72.36%.
Several numbers deserve special attention:
ARC AGI 2 hits 68.8%. This test measures "fluid intelligence"—the ability to reason logically and identify patterns in novel situations. Six months ago, GPT-5.1 was at 17.6%, GPT-5.2 Pro jumped to 54.2%, and now Claude Opus 4.6 is closing in on the 70% mark. AI's abstract reasoning capabilities are evolving at a visible pace.
GDPval-AA Elo score of 1606. This test, independently operated by Artificial Analysis, covers real-world knowledge work scenarios like financial analysis and legal research. What does an Elo of 1606 mean? It's 144 points higher than GPT-5.2, which translates to a win rate of about 70% (a quick sanity check of that conversion follows below). When it comes to "getting actual work done," Opus 4.6 is currently the clear front-runner.
BrowseComp at 84.0%. A test of web information retrieval and synthesis capabilities. Opus 4.6 is 6 percentage points higher than GPT-5.2 Pro. If paired with a multi-agent architecture, the score can soar to 86.8%.
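For anyone who wants to sanity-check the Elo-to-win-rate conversion mentioned above, the standard Elo expectation formula is enough. The snippet below treats the 144-point lead as a plain Elo gap; Artificial Analysis may apply its own scaling, so read it as an approximation.

```python
# Standard Elo expectation: probability that the higher-rated side wins.
# Generic formula used as a sanity check; Artificial Analysis may scale
# its GDPval-AA ratings differently.
def elo_win_probability(rating_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

print(f"{elo_win_probability(144):.1%}")  # 144-point gap -> ~69.6%, i.e. roughly 70%
```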
Product Level: Several Heavyweight Upgrades
1M Token Context Window (Beta)
This is the most eye-catching upgrade. Previously, Opus had a context window of only 200K; this time it has increased 5-fold. For scenarios requiring handling large codebases or massive documents—like auditing a complete enterprise-level project or analyzing hundreds of pages of legal documents—this is a qualitative leap.
But a large context window doesn't mean the model can actually use that much context well. There's an industry term called "context rot": the more content you stuff in, the blurrier the model's understanding and memory of early content becomes, and performance drops sharply.
Anthropic specifically addressed this issue in their blog: in the MRCR v2 test with 1 million tokens and 8 hidden needles, Opus 4.6 scored 76%, while Sonnet 4.5 managed only 18.5%. In Anthropic's own words, this is a "qualitative shift," not merely a bigger number.
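To make the "hidden needles" idea concrete, here is a toy harness in that spirit. It is emphatically not the real MRCR v2 protocol (which involves ordered retrieval of near-identical targets); the filler text, needle format, and grading below are invented purely for illustration.

```python
import random

# Toy "needles in a haystack" harness: bury a few facts in a sea of filler,
# then grade how many of them a model's answer reproduces. Illustration only;
# the real MRCR v2 benchmark uses a more demanding protocol.
def build_haystack(needles: list[str], filler: str, total_sentences: int) -> str:
    sentences = [filler] * (total_sentences - len(needles))
    for needle in needles:
        sentences.insert(random.randrange(len(sentences) + 1), needle)
    return " ".join(sentences)

def recall_score(answer: str, needles: list[str]) -> float:
    return sum(needle in answer for needle in needles) / len(needles)

needles = [f"The access code for vault {i} is {1000 + i}." for i in range(8)]
haystack = build_haystack(needles, "Nothing of note happened that day.", 50_000)
# A real harness would send `haystack` plus a question to the model under test;
# here we only demonstrate the grading step on a dummy answer.
print(recall_score("The access code for vault 3 is 1003.", needles))  # 0.125
```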
Output Limit Doubled to 128K
Doubled from 64K to 128K. For scenarios generating long documents or large blocks of code, this means fewer truncations and interruptions.
Context Compaction
When a conversation gets too long and is about to overflow, Claude automatically and intelligently compresses old content into a summary to free up space to continue working. This feature was previously only implemented via engineering means in Claude Code; now it's natively supported at the API level. Developers no longer need to worry about context management themselves.
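As a rough mental model of what compaction does, here is a minimal sketch. It is not Anthropic's API or implementation; the token counter and summarizer below are crude stand-ins invented for illustration.

```python
# Conceptual sketch of context compaction: once a conversation nears the
# window budget, older turns are folded into a short summary. Not Anthropic's
# implementation; the helpers below are deliberately crude stand-ins.
def count_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) for m in messages) // 4  # ~4 chars per token, rough

def summarize(messages: list[dict]) -> str:
    # A real system would ask a model to condense these turns.
    return f"{len(messages)} earlier messages condensed into a summary."

def compact(messages: list[dict], token_budget: int, keep_recent: int = 4) -> list[dict]:
    if count_tokens(messages) <= token_budget:
        return messages  # still fits, nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_turn = {"role": "user", "content": "Summary of earlier conversation: " + summarize(old)}
    return [summary_turn] + recent
```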
Adaptive Thinking + Effort Control
Previously, "deep thinking" could only be on or off, black or white. Now there's an adaptive mode—answer simple questions quickly, think longer for complex ones; the model judges for itself how much effort to spend.
It also provides four manual control levels: low/medium/high/max. Anthropic's official recommendation is: default to high, and if you feel the model is thinking too long on clear questions, you can adjust it to medium.
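From a developer's point of view, the effort dial presumably shows up as a request parameter. The payload below is only a guess at what that might look like; the field name, its allowed values, and the model ID are assumptions, not a confirmed API schema.

```python
# Hypothetical request payload illustrating effort control. The "effort" field,
# its allowed values, and the model ID are assumptions made for illustration.
request = {
    "model": "claude-opus-4-6",   # hypothetical model ID
    "max_tokens": 8192,
    "effort": "high",             # "low" | "medium" | "high" | "max"; "high" is the recommended default
    "messages": [
        {"role": "user", "content": "Audit this repository for race conditions."}
    ],
}
```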
Agent Teams
This is the killer update in Claude Code. Previously, there was only one Claude working. Now you can let one Claude act as a "Team Lead," simultaneously spinning up multiple "Team Members" to work in parallel.
Use cases: jobs that can be broken down into independent sub-tasks and require reading a lot of code, like a comprehensive code audit. Even cooler, you can use Shift+Up/Down at any time to take over any sub-agent and issue commands yourself.
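Conceptually it is a lead/worker fan-out. The sketch below shows the generic pattern in Python; it is not Claude Code's implementation, and `run_agent` is a made-up stand-in for dispatching a sub-agent.

```python
from concurrent.futures import ThreadPoolExecutor

# Generic lead/worker fan-out: a "team lead" splits a job into independent
# sub-tasks, runs a worker per task in parallel, then merges the reports.
# Illustration of the pattern only, not Claude Code's implementation.
def run_agent(task: str) -> str:
    # Stand-in for dispatching a sub-agent; a real system would hand the task
    # and the relevant slice of the codebase to a model.
    return f"[report] {task}: no critical issues found"

def team_lead(tasks: list[str]) -> str:
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        reports = list(pool.map(run_agent, tasks))
    return "\n".join(reports)  # the lead would normally synthesize these into one summary

print(team_lead([
    "audit the auth module",
    "audit the payment module",
    "audit logging and error handling",
]))
```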
Claude in Excel / PowerPoint
Claude is now directly integrated into the Office sidebar. The Excel version can process unstructured data, infer table structures, and complete multi-step operations in one go. The PowerPoint version can read existing templates and brand designs to generate slides with consistent style.
This means Anthropic has officially entered the enterprise office market—Microsoft's turf.
Pricing: The API stays at $5/$25 per million input/output tokens. Long-context requests exceeding 200K tokens are billed at a higher rate of $10/$37.50 per million.
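A quick cost sketch using those rates. One assumption is baked in: the higher rate applies to the whole request once input exceeds 200K tokens, mirroring Anthropic's earlier long-context pricing; the announcement itself does not spell out the billing rule.

```python
# Rough cost estimate from the published Opus 4.6 rates.
# Assumption: the $10 / $37.50 long-context rate covers the entire request
# once input exceeds 200K tokens (the exact billing rule is not stated).
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = (10.00, 37.50) if input_tokens > 200_000 else (5.00, 25.00)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(f"${estimate_cost(150_000, 8_000):.2f}")   # standard request: $0.95
print(f"${estimate_cost(800_000, 20_000):.2f}")  # long-context request: $8.75
```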
OpenAI Follow-up: GPT-5.3 Codex
OpenAI released 20 minutes later than Anthropic. But there's a sentence in their announcement worth chewing over:
"GPT-5.3 Codex is the first model to play a significant role in its own creation."
What does this sentence mean?
During the development of GPT-5.3, OpenAI used early versions of the model to debug training processes, manage deployment tasks, and diagnose test results. In other words—AI participated in its own development.
This isn't marketing fluff. OpenAI admitted in their blog that they were shocked by "the extent to which Codex could accelerate its own development."
If AI can increasingly participate in its own development, will the speed of evolution accelerate? The weight of this question is far heavier than any benchmark score.
Benchmarks
| Benchmark | GPT-5.3 Codex | GPT-5.2 Codex | GPT-5.2 | Description |
|---|---|---|---|---|
| Terminal-Bench 2.0 | 77.3% | 64.0% | 62.2% | Terminal coding capability |
| OSWorld-Verified | 64.7% | 38.2% | 37.9% | Computer operation capability |
| SWE-Bench Pro (Public) | 56.8% | 56.4% | 55.6% | Code repair (Polyglot) |
| Cyber CTF Challenges | 77.6% | 67.4% | 67.7% | Cybersecurity challenges |
| SWE-Lancer IC Diamond | 81.4% | 76.0% | 74.6% | Software engineering tasks |
| GDPval (wins/ties) | 70.9% | — | 70.9% | Real-world work tasks |
Note: The above scores were measured at the "xhigh" reasoning-effort setting.
Terminal-Bench 2.0 is the only benchmark where the two models' published scores are directly comparable:
- GPT-5.3 Codex: 77.3%
- Claude Opus 4.6: 65.4%
- OpenAI leads by 11.9 percentage points
This aligns with the product positioning of the Codex series—models specialized for programming are simply stronger in programming scenarios.
As for OSWorld and SWE-bench, the two companies use different versions of test sets, making direct comparison impossible. OSWorld-Verified is a refactored version released in July 2025, fixing 300+ technical issues from the original and widely considered more difficult. SWE-bench Pro is also much harder than the Verified version—at launch, both GPT-5 and Claude Opus 4.1 only scored about 23% on Pro, less than a third of their Verified scores.
Three Trump Cards
Besides benchmark scores, OpenAI is playing three main cards this time:
1. Speed and Efficiency
Altman tweeted personally:
"Requires less than half the tokens of 5.2-Codex for the same tasks and runs >25% faster per token!"
This isn't a minor optimization. Halving tokens means halving costs; a 25% speed boost means faster feedback loops. For programming scenarios requiring frequent iteration, this is a tangible improvement. What's the scariest part of writing code? Waiting. Waiting for the model to think, waiting for it to write, waiting for it to fix. Now, that wait time is drastically reduced.
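Taking both vendor figures at face value, the two claims compound: half the tokens generated 25% faster per token works out to roughly 40% of the original wall-clock time for the same task, i.e. about 2.5x faster end to end.

```python
# Back-of-the-envelope combination of the two claims (vendor figures taken at face value).
tokens_ratio = 0.5        # "less than half the tokens"
per_token_speedup = 1.25  # ">25% faster per token"
time_ratio = tokens_ratio / per_token_speedup
print(f"Wall-clock time vs GPT-5.2 Codex: {time_ratio:.0%}")  # 40%
```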
2. Real-time Collaboration
Previously when working with Codex, you had to wait for it to finish the whole task before adjusting direction. Realized it misunderstood the requirement? You could only watch it finish writing the entire wrong solution, then start over from scratch.
Now you can intervene at any time, modify requirements on the fly, without losing established context. This "chat while working" interaction style is closer to the real experience of pair programming between humans.
3. Complete Project-Level Capabilities
OpenAI showcased two complete games in their blog: a racing game and a diving game. The racing game features different car models, eight maps, and an item system; the diving game includes coral reef exploration, oxygen pressure management, and hazard elements.
Both games were completed independently by GPT-5.3 Codex.
Not concept demos, but complete games that run and are playable. From architectural design to concrete implementation, the whole package.
API Status: Currently not public; only available via the Codex App, CLI, IDE plugins, and web interface.
Head-to-Head Comparison
| Dimension | Claude Opus 4.6 | GPT-5.3 Codex | Notes |
|---|---|---|---|
| Terminal-Bench 2.0 | 65.4% | 77.3% | Terminal coding, directly comparable |
| OSWorld | 72.7% | 64.7%* | Codex used Verified version, harder |
| SWE-bench | 80.8% (Verified) | 56.8% (Pro) | Different versions, Pro is harder |
| ARC AGI 2 | 68.8% | — | Fluid intelligence |
| BrowseComp | 84.0% | — | Web info retrieval |
| GDPval | Elo 1606 | 70.9% (win/tie rate) | Real-world work tasks |
| Context Window | 1M Token (Beta) | Undisclosed | |
| Output Limit | 128K | Undisclosed | |
| Speed | Adjustable thinking depth | Tokens halved, 25% faster | |
| API | $5/$25 per M token | Not yet open | |
| Core Positioning | Enterprise AI Work Partner | Ultimate Coding Tool | |
| Killer Features | Agent Teams, Office Integration | Self-evolving dev, Real-time collab | |
Two Product Philosophies
Behind this "Throne War" are two distinct product philosophies from two companies.
Claude is taking the "All-Rounder" route.
1M context, Adaptive Thinking, Agent Teams, Office integration—Anthropic's ambition is to build Claude into an enterprise-class AI work partner. It can handle massive documents, run autonomously for long periods, and seamlessly integrate with daily office tools. Target user profile: Enterprises and professionals who need AI to help process complex knowledge work.
GPT-5.3 Codex is taking the "Extreme Engineering" route.
Faster speed, fewer tokens, capable of participating in its own development. OpenAI is pushing Codex towards being the "Ultimate Programmer's Tool." Target user profile: Developers and engineering teams who need AI to help write code.
Both routes make sense, and both have their own use cases:
- If the core need is processing massive documents, doing complex research, or requiring AI to autonomously execute composite tasks for long periods—Claude Opus 4.6 is stronger.
- If the core need is writing code, debugging, and rapidly iterating software projects—GPT-5.3 Codex is more suitable.
Another very practical consideration: OpenAI's account policy is comparatively more lenient toward users in mainland China. API accessibility sometimes matters more than raw performance.
A Paradigm Turning Point
That sentence from OpenAI is worth reading again:
"GPT-5.3 Codex is the first model to play a significant role in its own creation."
AI participated in its own development.
What does this mean? It means the speed of AI evolution might start to self-accelerate. In this version, it participated in debugging and testing. In the next version, it might participate in architectural design. And the one after that?
We are witnessing a paradigm turning point.
Today, two companies releasing flagship models simultaneously is historically rare. But this kind of "clash of the titans" might become the norm in the future. Competition is accelerating, iteration is accelerating, and the entire industry is accelerating.
The future is already here. It is rewriting the world, one line of code at a time.
And this time, it has started rewriting itself.

