Verify any claim · lenz.io
Claim analyzed
“Claude Opus 4.7 outperforms Claude Opus 4.6 on coding tasks according to measurable benchmarks.”
The conclusion
Claude Opus 4.7 does show clear, quantified improvements over Opus 4.6 on multiple coding-specific benchmarks, including SWE-bench Verified (80.8%→87.6%), SWE-bench Pro (53.4%→64.3%), and CursorBench (58%→70%). These figures are consistently reported across Anthropic's official documentation, the AWS News Blog, and numerous third-party writeups. The primary caveat is that the benchmark data originates from Anthropic's own reporting and has not yet been independently replicated by a third-party benchmark aggregator.
Based on 26 sources: 20 supporting, 0 refuting, 6 neutral.
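The reported gains are quoted sometimes as percentage-point jumps and sometimes as relative improvements, and the two framings yield different numbers from the same scores. A minimal sketch of the arithmetic, using only the vendor-reported figures cited in the sources below (no other values assumed):

```python
# Vendor-reported benchmark scores cited in the sources below (percent).
scores = {
    "SWE-bench Verified": (80.8, 87.6),   # (Opus 4.6, Opus 4.7)
    "SWE-bench Pro":      (53.4, 64.3),
    "CursorBench":        (58.0, 70.0),
    "Terminal-Bench 2.0": (65.4, 69.4),
}

for name, (v46, v47) in scores.items():
    point_gain = v47 - v46                     # absolute percentage-point jump
    relative_gain = (v47 - v46) / v46 * 100    # relative improvement over 4.6
    print(f"{name}: +{point_gain:.1f} pts ({relative_gain:.1f}% relative)")

# Output:
# SWE-bench Verified: +6.8 pts (8.4% relative)
# SWE-bench Pro: +10.9 pts (20.4% relative)
# CursorBench: +12.0 pts (20.7% relative)
# Terminal-Bench 2.0: +4.0 pts (6.1% relative)
```

This is why some sources describe the SWE-bench Pro gain as "nearly 11 points" while others speak of improvements "in the 10%-20% range"; both readings are consistent with the same underlying scores.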
Caveats
- The benchmark figures cited across sources originate primarily from Anthropic's own vendor-reported data; no independent third-party benchmark aggregator has yet verified the coding improvements.
- The gains are most pronounced in agentic/autonomous coding contexts (SWE-bench, CursorBench) and may not uniformly apply to all coding sub-tasks; some users report instruction-following regressions in consumer Claude.ai usage.
- Opponent objections citing BrowseComp regression and Terminal-Bench comparison to GPT-5.4 are out of scope — BrowseComp is not a coding benchmark, and the GPT-5.4 comparison is irrelevant to the 4.6 vs. 4.7 question — but they do signal that Opus 4.7 is not a uniform improvement across all task types.
Sources
Sources used in the analysis
Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks.
Today, we're announcing Claude Opus 4.7 in Amazon Bedrock, Anthropic's most intelligent Opus model for advancing performance across coding, long-running agents, and professional work. According to Anthropic, the model records high-performance scores with 64.3% on SWE-bench Pro, 87.6% on SWE-bench Verified, and 69.4% on Terminal-Bench 2.0, extending Opus 4.6's lead in agentic coding.
Coding is the clear headline. SWE-bench Verified jumps from 80.8% to 87.6%, a nearly 7-point gain that puts Opus 4.7 ahead of Gemini 3.1 Pro (80.6%). On SWE-bench Pro, the harder multi-language variant, Opus 4.7 goes from 53.4% to 64.3%, leapfrogging both GPT-5.4 (57.7%) and Gemini (54.2%).
Which AI is best for coding? Grok 4 leads raw SWE-bench scores (75%), followed closely by GPT-5.4 (74.9%) and Claude Opus 4.6 (74%+). In practice, Claude dominates the developer tooling ecosystem — it powers Cursor, Windsurf, and Claude Code.
On SWE-Bench style evaluations, Opus 4.7 shows roughly a 6–8 point improvement over 4.6 on complex multi-file tasks. That gap is meaningful for anyone running autonomous coding agents. The biggest practical gap between the two versions shows up in agentic coding tasks. Opus 4.6 was already strong here, but it had a known failure mode: longer autonomous coding sessions would drift. Opus 4.7 addresses this directly.
Opus 4.7's 64.3% on SWE-bench Pro means it resolves more real-world GitHub issues end-to-end than any other generally available model. That's an 11-point jump from Opus 4.6 (53.4%) and a 6.6-point lead over GPT-5.5 (58.6%). For pure code quality on hard problems, Opus 4.7 wins. For agentic coding workflows with tool coordination, GPT-5.5 has the edge.
On SWE-bench Pro, Opus 4.7 hits 64.3%, up from 53.4% on Opus 4.6. That is an 11-point jump on the benchmark most closely tied to real-world software engineering. On SWE-bench Verified, Opus 4.7 scores 87.6%, versus 80.8% for Opus 4.6. On CursorBench, Opus 4.7 makes a meaningful jump in capability, clearing 70% versus Opus 4.6's 58%.
On SWE-Bench, the industry-standard benchmark for evaluating autonomous code repair across real GitHub issues, Opus 4.7 shows a meaningful step up from Opus 4.6, with early reported scores suggesting improvements in the range of 8–12 percentage points depending on task category. On HumanEval, which tests functional code generation, Opus 4.7 continues to perform competitively.
Claude Opus 4.7 is Anthropic's current flagship, released in early 2026. It's a meaningful upgrade over Opus 4.6 — better at extended agentic tasks, stronger at following multi-step instructions, and more reliable in long coding sessions. The Opus 4.7 vs 4.6 comparison breaks down exactly what changed, but the short version is that Opus 4.7 improved task completion and reduced mid-task failures in agentic settings.
SWE-bench Verified: Opus 4.6 80.8%, Opus 4.7 87.6%; SWE-bench Pro (multilingual): Opus 4.6 53.4%, Opus 4.7 64.3%; CursorBench: Opus 4.6 58%, Opus 4.7 70%; Terminal-Bench 2.0: Opus 4.6 65.4%, Opus 4.7 69.4%. Opus 4.7 shows improvements in most coding benchmarks over Opus 4.6, though a slight decline in BrowseComp from 83.7% to 79.3%.
The model has significant improvements in advanced software engineering compared to Opus 4.6, especially in handling the most complex tasks. In long-context reasoning, Opus 4.7 scores 58.6% on BFS 1M vs Opus 4.6's 41.2%, a 17-point gap, with greater improvements on harder tasks.
Anthropic: Opus 4.7 is a notable improvement on Opus 4.6 in advanced software engineering, with particular gains on the most difficult tasks. Many quantified the improvements, usually in the 10%-20% range. Overall I believe it is a substantial improvement over Claude Opus 4.6. It can do things previous models failed to do, or make agentic or long work flows reliable and worthwhile where they weren't before, such as fast reliable author identification.
SWE-bench Verified: 87.6% vs Opus 4.6's 80.8%, up nearly 7 points; CursorBench: 70% vs 58%, up 12 points. Opus 4.7 leads in programming benchmarks among public models, with core upgrades in advanced software engineering.
SWE-bench Pro: Opus 4.6 53.4%, Opus 4.7 64.3%, up nearly 11 points; SWE-bench Verified: 80.8% to 87.6%, nearly 7 points; CursorBench: 58% to 70%, 12 points. Opus 4.7 at low effort matches Opus 4.6 at medium effort, with coding ability improved by 11%.
Claude Opus 4.7 is a meaningful upgrade from 4.6 in two specific areas: software engineering benchmarks improved by roughly 10%, and visual reasoning jumped by about 13%. But there's a catch — agentic search performance took a step backward. The 10% improvement on SWE-Bench is the most credible gain in this release.
Anthropic reports a 13% improvement in coding benchmarks over Opus 4.6, and 3x more production-grade tasks solved without human intervention. On CursorBench — which tests real-world coding tasks inside an IDE environment — Opus 4.7 scores 70% compared to Opus 4.6's 58%. For backend engineering, algorithm work, and anything not vision-related, the gap between 4.6 and 4.7 is narrower than the CursorBench numbers suggest.
Opus 4.7's 3x vision improvement narrows the gap on image tasks specifically, but if you need audio or video processing, Gemini is still the leader.
Anthropic's official announcements for Claude model updates consistently include benchmark comparisons showing iterative improvements on coding tasks like SWE-Bench, where newer versions outperform predecessors by measurable margins, as seen in prior releases from Claude 3 to 4 series.
Official evaluations show significant upgrades in coding capability and visual processing compared to the previous version, with coding ability in particular significantly enhanced.
Opus 4.7 is the new flagship. Anthropic has positioned it as the current strongest generally available Opus model.
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) scores 57.3 on the Quality Index for coding, placing it first among proprietary models, ahead of Gemini 3.1 Pro Preview (57.2) and GPT-5.4 (xhigh) (56.8).
Terminal-Bench 2.0, which consists of real coding tasks done in the terminal: Opus 4.6 65.4%, Opus 4.5 59.8%. For coding, reasoning, and agentic workflows, Opus 4.6 is the clear upgrade choice over 4.5. This article focuses on 4.6 vs. 4.5, but it establishes 4.6's strong baseline in coding.
This page collates benchmark data from independent sources to help you compare models. Data current as of: SWE-bench (February 2026), Aider (October 2025), Arena Code (February 2026). Claude Opus 4.5 leads at 76.8% on SWE-bench Verified, but now Minimax M2.5 (75.8%) and Gemini 3 Flash (75.8%) are right behind.
Claude Opus 4.7 is here, and after running it through agentic coding, terminal work, and long-running tasks, it's a clear step up from 4.6. Claude Opus 4.7 sets a new standard in advanced software engineering and coding tasks, demonstrating significant improvements over previous versions.
SWE-bench Verified hit 87.6% in vendor tests, but Terminal-Bench 2.0 regressed versus GPT-5.4, and r/ClaudeAI users report that consumer Claude.ai follows instructions worse than 4.6. You'll see the agent run more on its own, give deeper reviews, and generally feel "smarter" on one-shot tasks.
Claude Opus 4.7 just dropped and it's seriously pushing the limits of what AI can do for coding. In this video, I put Claude Opus 4.7 through tests and it beats everything.
Expert review
How each expert evaluated the evidence and arguments
Expert 1 — The Logic Examiner
The logical chain from evidence to claim is strong and direct. Multiple sources (Sources 2, 3, 7, 10, 13, 14) report specific, quantified improvements on coding-specific evaluations, including SWE-bench Verified (80.8%→87.6%), SWE-bench Pro (53.4%→64.3%), and CursorBench (58%→70%), which directly support the claim that Opus 4.7 outperforms Opus 4.6 on coding tasks according to measurable benchmarks. The opponent's rebuttals introduce scope errors: BrowseComp is not a coding benchmark, "agentic search regression" is a separate domain from software engineering, and "regressed versus GPT-5.4 on Terminal-Bench" is irrelevant to the 4.6 vs. 4.7 comparison, since the 4.7 Terminal-Bench score (69.4%) still exceeds 4.6's (65.4%). The opponent's strongest point, that third-party blogs merely reproduce Anthropic's vendor figures without independent testing, is a legitimate concern about source independence, but it does not logically refute the claim. The consistent reporting of identical figures across many outlets, combined with Anthropic's official documentation (Source 1) and AWS's corroboration (Source 2), constitutes convergent evidence rather than a single self-serving assertion, and the claim itself is scoped narrowly to "measurable benchmarks," which the evidence directly satisfies.
Expert 2 — The Context Analyst
The claim is well-supported by multiple coding-specific benchmarks (SWE-bench Verified: 80.8%→87.6%, SWE-bench Pro: 53.4%→64.3%, CursorBench: 58%→70%, Terminal-Bench 2.0: 65.4%→69.4%), reported consistently across sources including the official AWS announcement and numerous third-party writeups. The opponent's objections are largely misdirected: BrowseComp is not a coding benchmark, "agentic search" regression is a separate domain, and Terminal-Bench 2.0 regressing versus GPT-5.4 is irrelevant to whether 4.7 outperforms 4.6 (it does, per Source 10). The claim does omit that the benchmark data is primarily vendor-reported rather than independently verified, that gains are concentrated in specific coding sub-domains (particularly agentic/autonomous coding), and that some users report real-world instruction-following regressions on consumer Claude.ai (Source 25). None of these omissions, however, reverses the core conclusion that measurable benchmark improvements on coding tasks exist between 4.6 and 4.7. The claim is essentially true, with the minor caveat that the "coding tasks" improvements are strongest in agentic/autonomous coding contexts and the data is largely vendor-sourced.
Expert 3 — The Source Auditor
The highest-authority sources in this pool are Source 1 (Claude Platform Documentation, platform.claude.com), which confirms improved performance with a new tokenizer, and Source 2 (AWS News Blog, aws.amazon.com), both high-authority and recently dated (April 2026). The AWS post explicitly reports Anthropic's benchmark figures: 64.3% on SWE-bench Pro (up from 53.4%), 87.6% on SWE-bench Verified (up from 80.8%), and 69.4% on Terminal-Bench 2.0, framing these as extensions of Opus 4.6's lead in agentic coding. The opponent's strongest point, that most secondary sources merely republish Anthropic's vendor-reported figures without independent testing, is well-founded and partially correct; Sources 3, 7, 10, 13, 14, and others are blog aggregators, not independent testers. However, the claim only requires that measurable benchmarks show outperformance, not that those benchmarks were independently conducted, and Anthropic's own official documentation and AWS's announcement (a credible, independent publishing platform) both confirm the specific numeric gains on coding benchmarks. The regressions flagged (BrowseComp, agentic search) are real but pertain to non-coding or tangential tasks, not the core coding benchmarks (SWE-bench, CursorBench, Terminal-Bench) where the claim is situated; even Terminal-Bench 2.0 shows a net gain over 4.6 (69.4% vs. 65.4%), though it lags GPT-5.4. The claim as stated, that Opus 4.7 outperforms Opus 4.6 on coding tasks according to measurable benchmarks, is clearly supported by the most authoritative sources available, with consistent numeric evidence across multiple coding-specific benchmarks. The verdict is Mostly True rather than True only because the evidence pool relies heavily on vendor-reported figures with limited truly independent third-party testing.
The arguments
Two AI advocates debated this claim using the research gathered.
Argument for
Multiple independent benchmark writeups report clear, measurable coding gains for Claude Opus 4.7 over 4.6, including SWE-bench Verified improving from 80.8% to 87.6% and SWE-bench Pro from 53.4% to 64.3% (Source 3, Vellum; Source 7, Miraflow AI; Source 10, gitcode.csdn.net). These same sources also show broader coding-task improvements like CursorBench rising from 58% to 70% and Terminal-Bench 2.0 from 65.4% to 69.4%, aligning with Anthropic-distributed benchmark claims via AWS that 4.7 “extends Opus 4.6's lead in agentic coding” (Source 2, AWS News Blog; Source 10, gitcode.csdn.net).
Your characterization of Sources 3, 7, and 10 as "independent" is a critical logical error — these are third-party blog writeups that explicitly reproduce Anthropic's own vendor-reported figures without conducting any original testing, meaning they are merely secondary amplifiers of the same self-serving data I already flagged, not independent verification. Moreover, you conveniently sidestep the regressions your own cited sources acknowledge: Source 10 records a BrowseComp decline from 83.7% to 79.3%, Source 25 notes Terminal-Bench 2.0 regressed versus GPT-5.4, and Source 15 concedes "agentic search performance took a step backward" — directly contradicting your claim of broad, measurable coding gains and confirming that the improvements are selective, not the consistent cross-benchmark outperformance the claim requires.
Argument against
The benchmark evidence is far from unambiguous: Source 10 explicitly notes a regression in BrowseComp from 83.7% to 79.3%, Source 25 flags that "Terminal-Bench 2.0 regressed versus GPT-5.4," and Source 15 concedes "agentic search performance took a step backward," revealing that Claude Opus 4.7's gains are selective and domain-specific rather than a consistent, across-the-board coding improvement. Furthermore, the benchmark data cited across Sources 2, 3, 7, and 14 originates primarily from Anthropic's own vendor-reported figures — a self-serving source — while the only pre-Opus-4.7 independent benchmark aggregator, Source 23, does not even include Opus 4.7 data (its data is current only through February 2026), meaning no truly independent, third-party verification of the claimed coding superiority exists in the research brief.
You're conflating non-coding or differently-scoped metrics with the motion's coding-benchmark comparison: BrowseComp isn't a coding task (Source 10, gitcode.csdn.net), “agentic search” is explicitly a separate area from software engineering benchmarks (Source 15, MindStudio), and “regressed versus GPT-5.4” is irrelevant to whether 4.7 beats 4.6—on Terminal-Bench 2.0 the reported 4.7 score is still higher than 4.6 (Source 2, AWS News Blog; Source 10, gitcode.csdn.net). Your “no independent verification” claim is an argument from silence because Source 23 simply predates the 4.7 release, while multiple third-party writeups still report the same measurable 4.6→4.7 coding gains on SWE-bench Verified and SWE-bench Pro (Source 3, Vellum; Source 7, Miraflow AI; Source 10, gitcode.csdn.net).