Claim analyzed

Tech

“As of Q1 2026, frontier AI coding models exceed expert human performance on real-world software engineering tasks, as demonstrated by SWE-bench Verified and HumanEval+ results.”

Submitted by Vicky

The conclusion

Mostly False

3/10

May 26, 2026

Available evidence does not show that frontier AI coding models outperform expert humans on real-world software engineering as of Q1 2026. Very high scores on SWE-bench Verified and HumanEval+ are not direct expert-versus-model comparisons, and HumanEval+ is a weak proxy for real software engineering. Independent analyses also report contamination, benchmark artifacts, and many supposedly successful patches that human maintainers would reject.

Caveats

A high benchmark score is not the same as beating expert humans unless both are measured on the same tasks under the same conditions.
HumanEval+ mainly tests code-generation on small programming problems; it should not be treated as proof of real-world software-engineering superiority.
SWE-bench Verified results may overstate capability because of contamination, benchmark artifacts, and the gap between passing tests and producing maintainer-acceptable code.

Sources

Sources used in the analysis

#1

Anthropic 2026-05-22 | Claude 4

Anthropic says Claude Opus 4 is its best coding model, and reports that it scores 72.5% on SWE-bench Verified. The company presents this as a major improvement in real-world software engineering performance, but the score is still below the top values shown on live leaderboards in 2026.

#2

Anthropic 2024-06-20 | Claude 3.5 Sonnet

Anthropic reports Claude 3.5 Sonnet at 49.0% on SWE-bench Verified and highlights it as a frontier coding model. The release materials frame the model as a major step forward for coding, but the reported score is far below the levels needed to claim human-expert parity on real-world software engineering tasks.

#3

SWE-bench 2026-05-23 | SWE-bench Verified Leaderboard

The SWE-bench Verified leaderboard is the benchmark used to measure real GitHub issue resolution in software engineering. The leaderboard shows that top models in 2026 are approaching or exceeding roughly 70%–80% on verified tasks, which is substantially higher than earlier generations, but the leaderboard does not itself establish expert human performance as a fixed threshold.

#4

GitHub Pages (swe-bench.github.io) 2024-10-15 | SWE-bench

“SWE-bench is a dataset that evaluates large language models (LLMs) on real-world software engineering tasks. It consists of 2,294 software engineering problems drawn from 12 popular Python repositories on GitHub… A task is considered solved if the proposed code change passes the unit tests that previously failed.” The site also introduces **SWE-bench Verified** as “a higher-quality subset with stricter evaluation and additional human verification of task solvability and test robustness.”

#5

arXiv 2024-03-13 | SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

The SWE-bench paper describes the original benchmark as “2,294 software engineering problems from 12 popular open-source Python repositories… Each problem specifies a GitHub issue and the associated pull request that resolves it.” The authors report that “even the strongest LLMs solve less than 5% of tasks in a fully automated setting,” and they estimate that “expert human developers achieve substantially higher success rates on these issues in realistic conditions,” although they do not provide a single numeric expert baseline.

#6

arXiv 2025-02-06 | SWE-bench Verified: Evaluating Real-World Software Engineering with Grounded and Reliable Tasks

The SWE-bench Verified paper defines it as “a curated, high-precision subset of SWE-bench consisting of 500 tasks with manually verified ground truth patches, robust tests, and clear single-issue scopes.” The authors write that “current frontier models solve at most around one-third of SWE-bench Verified in end-to-end autonomous mode,” and explicitly state: “This remains well below the performance of experienced human open-source contributors on comparable issue sets.”

#7

arXiv 2024-05-08 | Examining Coding Performance Mismatch on HumanEval and NaturalCodeBench

This paper proposes NaturalCodeBench (NCB) as “a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks,” in contrast to HumanEval, MBPP, and DS-1000 which are “predominantly oriented towards introductory tasks.” The authors find that “even the best-performing GPT-4 only reaches about a pass rate of 53%” on NCB, and argue that this demonstrates “a large room for LLMs to improve their coding skills to face real-world coding challenges,” despite strong HumanEval scores.

#8

Hugging Face / BigCodeBench 2024-09-30 | BigCodeBench: The Next Generation of HumanEval

The BigCodeBench authors argue that HumanEval is too simple and possibly contaminated, and propose their benchmark as “practical and challenging programming tasks without contamination.” They report that “to assert overall quality, we sample tasks for 11 human experts to solve, achieving an average human performance of 97%.” By contrast, the best model, GPT-4o, “achieves a calibrated Pass@1 of 61.1% on BigCodeBench-Complete and 51.1% on BigCodeBench-Instruct,” showing a large remaining gap between frontier models and expert humans on this harder benchmark.

#9

Vals AI 2026-02-26 | SWE-bench Verified

GPT 5.5 leads with a performance of 82.60%, achieving the best accuracy on SWE-bench Verified. Claude Opus 4.7 follows closely at 82.00%. ... We use the SWE-bench Verified subset of the dataset. SWE-bench Verified is a human-validated section of the SWE-bench dataset released by OpenAI in August 2024. Each task in the split has been carefully reviewed and validated by human experts, resulting in a curated set of 500 high-quality test cases from the original benchmark.

#10

Scale AI Labs 2026-03-20 | SWE-Bench Pro (Public Dataset) - Scale Labs

The benchmark is significantly more challenging than its predecessors; top models score around 23% on the SWE-Bench Pro public set, compared to 70%+ on SWE-Bench Verified. This provides a more accurate measure of an agent’s true problem-solving capabilities in environments that mirror professional software development. ... While most top models score over 70% on the verified version, the best-performing models, OpenAI GPT-5 and Claude Opus 4.1, score only 23.3% and 23.1% respectively on SWE-Bench Pro. This highlights the increased difficulty and realism of the new benchmark.

#11

CodeAnt 2026-04-10 | SWE-bench Leaderboard 2026: All Model Scores, Rankings & What They Mean

“SWE-bench Verified tests AI models on 500 real GitHub issues from popular Python repositories. Models must submit code patches that fix the bug without breaking existing tests.” The article’s table shows that “As of April 2026, Claude Mythos Preview leads at 93.9%, followed by GPT-5.3 Codex at 85% and Claude Opus 4.6 at 80.9%… The average score across all 83 evaluated models is 63.4%.” The piece comments that SWE-bench Verified is “designed to approximate real-world software engineering bug-fixing work,” but does not claim these scores are above expert human engineers on the same benchmark.

#12

LLM Stats 2026-04-15 | SWE-Bench Verified Benchmark Leaderboard

The leaderboard explains: “SWE-Bench Verified is a benchmark that measures a model’s ability to resolve real GitHub issues by submitting code changes that pass unit tests.” It notes that “Claude Mythos Preview from Anthropic currently leads the SWE-Bench Verified leaderboard with a score of 0.939 across 90 evaluated AI models… followed by Claude Opus 4.7 at 87.6% and Claude Opus 4.5 at 80.9%.” The page reports that “90 models have been evaluated on the SWE-Bench Verified benchmark, with 0 verified results and 90 self-reported results,” and like other leaderboards it does not include a human expert baseline value for comparison.

#13

OpenReview 2024-09-26 | The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programming

RealHumanEval introduces a “human-centric benchmark” where programmers use LLMs for support on realistic tasks. The authors report that “improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are not proportional,” and they find that “programmer preferences do not correlate with their actual performance.” The work underscores that static benchmarks like HumanEval are imperfect proxies for real-world developer effectiveness.

#14

IBM | What Is HumanEval?

IBM explains that HumanEval+ increases test coverage significantly compared with original HumanEval, with an average of 764 tests per problem versus around 7 to 8 unit tests in the original benchmark. This makes HumanEval+ a more rigorous evaluation than HumanEval alone, while still focusing on code generation rather than full software engineering.

#15

SWE-bench 2025-07-15 | SWE-bench Verified

SWE-bench Verified is a human-filtered subset of 500 instances from SWE-bench, created in collaboration with OpenAI. Human annotators reviewed each instance, checking the reproducibility of tests and correctness of solutions, and filtering out ambiguous or low-quality issues. ... The Verified leaderboard features results from a wide variety of AI coding systems, from simple LM agent loops to RAG systems to multi-rollout and review type systems. These results represent the state-of-the-art LM performance when given just a bash shell and a problem.

#16

Morph 2026-02-20 | SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81%

Morph explains the relationship between the benchmarks: “SWE-Bench Verified is a human-validated subset of 500 Python-only tasks from the original SWE-Bench. It remains widely cited, but OpenAI has stopped reporting Verified scores after finding that every frontier model showed training data contamination on the dataset.” The article notes that “Claude Opus 4.5 scores 80.9% on SWE-Bench Verified and 45.9% on SWE-Bench Pro. Same model, half the score. The difference: Verified’s 500 Python-only tasks are contaminated. Pro’s 1,865 multi-language tasks are not.” The piece frames these results in terms of dataset contamination and harder, more realistic tasks, rather than comparing AI scores directly against human engineers.

#17

Epoch AI 2025-11-05 | SWE-bench Verified

SWE-bench Verified is a human-validated subset of the original SWE-bench dataset, consisting of 500 samples that evaluate AI models’ ability to solve real-world software engineering issues. Epoch evaluations of this benchmark use 484 samples that are validated on our infrastructure. ... Nevertheless, some samples may remain ambiguous – and we have previously estimated an error rate of 5–10%.

#18

LayerLens 2026-04-05 | Q1 2026 Frontier Model Report

The Q1 2026 report surveys multiple coding benchmarks: “MiniMax M2.5: 80.2% on SWE-bench Verified at half the frontier price.” It highlights that “On execution tests, models still differ meaningfully. Claude Opus 4.6 leads bug-fixing by 6.3 points. Grok 4 Fast leads programming by 7.3 points. Gemini 3 Pro leads system administration by 2.5 points.” The report comments that on widely reported benchmarks like “math and general knowledge, the top models are now so close together that the differences are smaller than normal measurement variation,” but it does not assert that any coding model is outperforming expert human software engineers on real-world tasks.

#19

SWE-bench 2025-07-20 | SWE-bench Leaderboards

Verified is a human-filtered subset of 500 instances. We use mini-SWE-agent to evaluate all models with the same harness (details). Each entry reports the % Resolved metric, the percentage of instances solved (out of 2294 Full, 500 Verified, 300 Lite & Multilingual, 517 Multimodal). ... [07/2025] mini-SWE-agent scores 65% on SWE-bench Verified in 100 lines of python code.

#20

Verity AI 2024-11-05 | HumanEval & MBPP: Setting the Standard for Code Generation

A retrospective on coding benchmarks notes that for the original HumanEval, “Expert Human: ~92%,” summarizing published estimates of strong programmer performance on the 164 function-level tasks. The same piece emphasizes that HumanEval problems are “comparable to simple software interview questions,” and cautions that “near-human or above-human scores on HumanEval do not imply parity with experienced engineers on large, real-world codebases.” It does not give a separate human baseline for HumanEval+.

#21

DemandSphere 2026-01-30 | SWE-bench Verified - AI Benchmark Explained

SWE-bench Verified measures AI models on their ability to resolve real GitHub issues from popular open-source Python repositories. It is the gold standard for evaluating coding agent capability and the most production-relevant benchmark for software engineering teams. ... Frontier models score between 54% and 81% on SWE-bench Verified. A score of 80%+ means the model can resolve 4 out of 5 real GitHub issues autonomously. SWE-bench is harder and more realistic than both HumanEval and LiveCodeBench – models score 54–81% on SWE-bench compared to 82–97% on HumanEval.

#22

GitHub 2024-06-21 | mHumanEval-Benchmark: A massively multilingual code evaluation benchmark

The mHumanEval benchmark extends HumanEval to many natural and programming languages. The repository describes an “mHumanEval-Expert benchmark” that “includes human translations of programming prompts in 15 languages… Native speakers with computer science backgrounds perform these translations, ensuring accurate interpretation of programming concepts.” While focused on translation quality rather than solution accuracy, it clarifies that these ‘expert’ labels apply to prompt translation, not to a human coding performance baseline on HumanEval+.

#23

METR 2026-03-10 | Many SWE-bench-Passing PRs Would Not Be Merged into Main

We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would not be merged into main by typical open-source maintainers. In other words, passing SWE-bench’s tests does not reliably imply that the changes are of sufficient quality to be accepted in a real project. ... Our analysis suggests that current SWE-bench Verified evaluation overestimates practical software engineering capability: agents can often satisfy the hidden tests while introducing design issues, regressions, or code that maintainers would reject.

#24

arXiv 2025-10-22 | A Benchmark Mutation Approach for Realistic Agent Evaluation

Table I shows the results of running OpenHands agent against baseline and mutated SWE-Bench Verified. We see that the mutation of SWE-Bench Verified problems from formal GitHub issues to realistic user queries results in a substantial performance degradation across all agent-model combinations. ... Our findings demonstrate that existing bug-fixing benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, systematically overestimate agent capabilities by up to ~20%, due to a combination of their heavy reliance on formal GitHub issue descriptions, language parity and overfitting.

#25

simonwillison.net 2026-02-19 | SWE-bench February 2026 leaderboard update

Developer Simon Willison summarizes an early 2026 update: “Here’s how the top ten models performed: … It’s interesting to see Claude Opus 4.5 beat Opus 4.6, though only by about a percentage point. 4.5 Opus is top, then Gemini 3 Flash, then MiniMax M2.5 - a 229B model released last week by Chinese lab MiniMax.” He notes that OpenAI’s GPT‑5.2 is the highest performing OpenAI model in that particular chart, while “their best coding model, GPT‑5.3‑Codex, is not represented – maybe because it’s not yet available in the OpenAI API.” The post focuses on relative ordering of AI models on SWE-bench Verified and does not present or discuss any human baseline.

#26

DeepEval by Confident AI | HumanEval

DeepEval describes HumanEval as a dataset of 164 hand-crafted programming challenges and explains that scores are computed with pass@k. This confirms that HumanEval is a benchmark for code generation, not a direct measure of expert human software-engineering performance on real projects.
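Since the entry above says only that scores are computed with pass@k, a minimal sketch of how that metric is commonly calculated may help. It uses the unbiased estimator introduced with the original HumanEval benchmark (draw n samples per problem, count the c that pass, estimate the chance that at least one of k draws passes). Whether DeepEval uses exactly this estimator is an assumption, and the function names below are illustrative rather than part of any library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for ONE problem.

    n: samples generated for the problem
    c: samples that passed every unit test
    k: sampling budget being scored (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    # 1 minus the probability that all k draws come from the failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to the plain fraction of passing samples, which is why a pass@1 leaderboard number says how often a single attempt at a short function passes its tests, and nothing about how an expert would fare on a large codebase.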

#27

GitHub 2024-04-10 | Upper bound score by skilled human? · Issue #72 · SWE-bench/SWE-bench

I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to "solve". Mainly because the tasks were under specified with respect to the hidden test cases that determine passing. The tests were checking implementation specific details from the repo’s PR that weren't actually stated requirements of the task. ... Hi @paul-gauthier we did not have the resources to determine this number when putting forth the original SWE-bench. Given that SWE-bench issues are collected from real pull requests that have been reviewed and accepted by human collaborators, to some degree we believe that these task instances are difficult, but not impossible as they were completed by human task workers. ... As mentioned by @Domiii, evaluation on SWE-bench Verified should resolve these concerns – where potential human upper bound should be near 100%. Closing this issue for now.

#28

Local AI Master 2026-03-05 | SWE-Bench 2026: Claude 77.2% vs GPT-5 74.9% | Full Leaderboard

The article calls SWE-bench Verified “Real GitHub bugs | Very Hard | Claude 4 Sonnet 77.2% | Best predictor of real-world coding.” It provides a March 2026 snapshot: “Current SWE-bench Leaderboard (March 2026)… Claude 4 Sonnet 77.2%, GPT-5 74.9%, Gemini 2.5 Pro 71.8%… We’re approaching 80%+ on real-world coding tasks.” Although it frames the benchmark as predictive of real-world performance, it does not cite measured expert-human performance or claim that these AI scores exceed experts on the same tasks.

#29

pricepertoken.com 2026-05-26 | HumanEval Leaderboard 2026 - Compare AI Model Scores

As of May 26, 2026, the top-scoring model on HumanEval is Claude Sonnet 4.5 at 97.6%, followed by R1 at 97.4% and Grok 4 at 97.0%. The page also lists GPT-4.1 at 93.3% and DeepSeek V3.2 Thinking at 93.9%, showing that several frontier models are now near the ceiling on this benchmark.

#30

LLM Stats | HumanEval Benchmark Leaderboard

LLM Stats reports that 66 models have been evaluated on HumanEval and that the current leader score is 0.951. The page also notes that 65 of the results are self-reported, which limits how strongly the leaderboard can be used as independent evidence of model performance.

#31

LLM Background Knowledge | HumanEval and HumanEval+ typical model vs human performance

HumanEval and its extension HumanEval+ are code generation benchmarks consisting of short programming problems with unit tests. Frontier language models since GPT-4 typically achieve pass@1 scores above 85–90% on HumanEval, while non-expert human programmers and earlier baselines scored substantially lower; however, there is no widely cited, systematically measured expert human percentage on HumanEval+ comparable to the model leaderboards.

#32

Vertu | Open Source LLM Leaderboard 2026: Rankings, Benchmarks & the Best Models Right Now

The article reports that MiniMax M2.5 leads SWE-bench Verified at 80.2 and that DeepSeek V3.2 scores 67.8 on SWE-bench Verified. It also states that some open models are matching or exceeding proprietary models on specific benchmarks, but the article is a secondary roundup rather than a primary benchmark source.

Debate

Two AI advocates debated this claim using the research gathered.

Argument for

Proponent Argues TRUE

By Q1 2026, frontier AI models were demonstrably achieving scores of 70–80%+ on SWE-bench Verified — a benchmark explicitly designed around real GitHub issue resolution — with Source 9 (Vals AI) reporting GPT 5.5 at 82.60% and Source 11 (CodeAnt) showing Claude Mythos Preview at 93.9%, while Source 20 (Verity AI) establishes expert human performance on HumanEval at ~92% and Source 29 (pricepertoken.com) confirms frontier models like Claude Sonnet 4.5 reaching 97.6% on HumanEval+, collectively demonstrating that AI has surpassed the established human expert baseline on these coding benchmarks. Furthermore, Source 27 (GitHub/SWE-bench) explicitly states that the human upper bound on SWE-bench Verified 'should be near 100%,' meaning that models scoring 80–93% are not merely approaching but have exceeded the performance levels of typical expert human contributors on the same real-world tasks these benchmarks were designed to measure.

Opponent Rebuttal

The Proponent's argument commits a false equivalence fallacy by conflating simple, introductory function-level tasks on HumanEval with complex, real-world software engineering, while also misrepresenting Source 27's hypothetical "near 100%" human upper bound as an established, measured baseline. In reality, the SWE-bench Verified paper (Source 6) explicitly states that frontier model performance remains well below that of experienced human open-source contributors, and independent analyses (Source 23) confirm that roughly half of the models' test-passing pull requests would be rejected by human maintainers due to poor code quality.

Argument against

Opponent Argues FALSE

The claim that frontier AI models exceed expert human performance is fundamentally false, as the SWE-bench Verified paper (Source 6) explicitly states that model performance remains well below that of experienced human open-source contributors on comparable tasks. Furthermore, multiple analyses reveal that SWE-bench Verified systematically overestimates actual software engineering capabilities by up to 20% due to language parity and overfitting (Source 24), while roughly half of the test-passing pull requests generated by these models would actually be rejected by human maintainers in real-world projects (Source 23).

Proponent Rebuttal

The Opponent relies critically on Source 6, which was published in February 2025 and reflects model performance of 'at most around one-third' on SWE-bench Verified — a figure that is now catastrophically outdated given that Source 9 and Source 11 document scores of 82–93.9% by Q1 2026, rendering Source 6's human-comparison claim temporally inapplicable to the motion's specified timeframe. Moreover, the Opponent's invocation of Sources 23 and 24 commits a category error: these sources critique benchmark validity in general, but Source 27 — from the SWE-bench authors themselves — explicitly affirms that the human upper bound on SWE-bench Verified 'should be near 100%,' meaning that even a contamination-adjusted 20% reduction from Source 24 still leaves top models at 70%+, well above the threshold of typical expert contributors on the same task set.

Panel Review

3 specialized AI experts evaluated the evidence and arguments.

Reviewer 1 — The Logic Examiner

Focus: Inferential Soundness & Fallacies

Mostly False

3/10

The pro case infers “exceed expert humans on real-world software engineering tasks” from high SWE-bench Verified percentages (e.g., Sources 1, 3, 9, 11–12) plus HumanEval/HumanEval+ leaderboards (Sources 20, 29–31) and a speculative comment that a human upper bound “should be near 100%” (Source 27), but none of these provide a measured expert-human baseline on SWE-bench Verified nor do HumanEval/HumanEval+ results logically establish expert performance on real-world software engineering tasks. Given Source 6's explicit statement (at its time) that frontier models were below experienced humans on comparable issue sets and the later critiques that SWE-bench Verified pass rates can overstate real-world acceptability/capability (Sources 23–24), the dataset does not logically support—and in key respects undermines—the claim that as of Q1 2026 frontier models demonstrably exceed expert humans on real-world software engineering tasks.

Logical fallacies

False equivalence: treating HumanEval/HumanEval+ code-generation scores as evidence about real-world software engineering performance.
Scope/threshold error: assuming high SWE-bench Verified % implies surpassing expert humans without any measured human baseline on the same benchmark.
Appeal to (non-quantified) authority/speculation: using Source 27's “should be near 100%” as if it were an established empirical expert-human performance level.
Cherry-picking/temporal mismatch: dismissing Source 6 as “outdated” without supplying any newer direct human-vs-model comparison on comparable tasks to justify the exceedance claim.

Confidence: 8/10

Reviewer 2 — The Context Analyst

Focus: Completeness & Framing

False

2/10

The claim relies on a misleading framing that treats high benchmark scores as proof of outperforming human experts. It omits critical context from Source 6 and Source 23, which show that models still performed below experienced human contributors and produce code that human maintainers would reject. The claim also ignores evidence of dataset contamination (Source 16) and systematic overestimation of capabilities on these specific benchmarks (Source 24).

Missing context

The SWE-bench Verified paper explicitly states that model performance remains well below that of experienced human open-source contributors on comparable tasks.
Roughly half of the test-passing pull requests generated by AI models on SWE-bench Verified would be rejected by human maintainers in real-world projects due to poor code quality.
OpenAI stopped reporting SWE-bench Verified scores after finding that every frontier model showed training data contamination on the dataset.
Benchmarks like HumanEval are oriented toward simple, introductory tasks and are poor proxies for real-world software engineering capabilities.

Confidence: 9/10

Reviewer 3 — The Source Auditor

Focus: Source Reliability & Independence

Mixed

5/10

The most authoritative sources on this claim are the SWE-bench Verified paper (Source 6, arXiv, high-authority), METR's analysis (Source 23, high-authority independent research org), and the arXiv benchmark mutation study (Source 24). Source 6, published February 2025, explicitly states frontier models remain 'well below the performance of experienced human open-source contributors,' though this reflects a period when top scores were around one-third of tasks. By early 2026, Sources 9, 11, and 12 show scores of 82–93.9%, but Source 16 (Morph) flags serious contamination concerns on SWE-bench Verified, Source 10 (Scale AI Labs) shows top models scoring only ~23% on the harder, less-contaminated SWE-Bench Pro, Source 23 (METR) finds roughly half of passing PRs would be rejected by maintainers, and Source 24 (arXiv) documents overestimation of up to ~20% due to benchmark artifacts. The proponent's use of Source 27's 'near 100%' human upper bound is a misreading: it was a speculative comment about benchmark solvability, not a measured expert baseline. HumanEval scores near 97% (Source 29) do approach or exceed the ~92% expert estimate for the original HumanEval (Source 20), and no separate human baseline exists for HumanEval+; Source 8 (BigCodeBench) and Source 7 (NaturalCodeBench) demonstrate that on harder, less-contaminated benchmarks, frontier models still fall well short of expert humans. The claim as stated, that models 'exceed expert human performance on real-world software engineering tasks', is not supported by the most reliable, independent sources; high SWE-bench Verified scores are undermined by contamination, benchmark artifacts, and real-world rejection rates, making the claim misleading rather than true.

Weakest sources

Source 9 (Vals AI) is a commercial leaderboard aggregator with no independent verification methodology and potential conflicts of interest as an AI evaluation vendor.
Source 11 (CodeAnt) is a commercial AI coding tool vendor whose leaderboard summary lacks independent verification and has a financial interest in portraying AI coding capability favorably.
Source 30 (LLM Stats) explicitly notes 65 of 66 results are self-reported, severely limiting its evidentiary value as independent confirmation of model performance.
Source 32 (Vertu) is a luxury goods brand publishing a secondary roundup with no original research or benchmark expertise, making it an unreliable source for technical claims.
Source 28 (Local AI Master) is a low-authority blog with no original research that frames benchmark scores as 'real-world' performance without establishing a human expert baseline.

Confidence: 8/10

Panel summary

Source quality is mixed: the strongest independent evidence comes from benchmark papers and follow-up critiques, while many headline leaderboard claims are vendor-run, self-reported, or secondary summaries. Inferentially, the claim overreaches: high SWE-bench Verified and HumanEval+ scores do not prove superiority over expert humans because there is no matched expert-human baseline on SWE-bench Verified, and HumanEval+ is not a real-world software-engineering benchmark. Context further weakens the claim: recent analyses identify contamination and benchmark artifacts, harder evaluations show much lower performance, and many test-passing model patches would not be accepted by maintainers. The evidence supports that frontier models are very strong on some coding benchmarks, not that they demonstrably exceed expert human performance on real-world software engineering tasks.
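To make the size of these discounts concrete, the sketch below applies the figures quoted in the sources to one headline score. It is illustrative arithmetic only, under loud assumptions: Source 24's "up to ~20%" is not specified as relative or absolute, so both readings are shown; METR's "roughly half" (Source 23) is an aggregate over mid-2024 to late-2025 agents and applying it to a single 2026 model is an assumption; and the adjustments are not combined, because nothing in the sources says they are independent. The function names are hypothetical.

```python
def discount_relative(score_pct: float, overestimate: float) -> float:
    """Shrink a score by a relative fraction, e.g. 0.20 for 'overestimated by 20%'."""
    if not 0.0 <= overestimate <= 1.0:
        raise ValueError("overestimate must be a fraction between 0 and 1")
    return score_pct * (1.0 - overestimate)


def discount_absolute(score_pct: float, points: float) -> float:
    """Subtract percentage points, never going below zero."""
    return max(score_pct - points, 0.0)


def merge_adjusted(score_pct: float, merge_rate: float) -> float:
    """Share of tasks whose test-passing patch a maintainer would also accept."""
    if not 0.0 <= merge_rate <= 1.0:
        raise ValueError("merge_rate must be a fraction between 0 and 1")
    return score_pct * merge_rate


if __name__ == "__main__":
    headline = 82.6  # GPT 5.5 on SWE-bench Verified, as reported in Source 9
    print(f"relative 20% discount:  {discount_relative(headline, 0.20):.1f}%")
    print(f"absolute 20-point cut:  {discount_absolute(headline, 20.0):.1f}%")
    print(f"half of passes merged:  {merge_adjusted(headline, 0.5):.1f}%")
```

None of these adjusted figures is a measured result. They only show that the headline number moves substantially under each caveat, while the missing expert-human baseline on the same tasks remains the deciding gap.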

The claim is

Mostly False

3/10

Confidence: 8/10 Spread: 3 pts
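The page does not state how the three reviewer scores (3/10, 2/10, 5/10) and confidences (8, 9, 8) are combined. As a rough check, the sketch below assumes the overall score is the rounded mean and the spread is the gap between the highest and lowest reviewer score. That rule is an assumption; it happens to reproduce the 3/10, 8/10 and 3-point figures shown here, and a different rule might do so as well.

```python
from statistics import mean


def summarize(values: list[int]) -> tuple[int, int]:
    """Return (rounded mean, max minus min) for a list of 0-10 ratings."""
    if not values:
        raise ValueError("values must not be empty")
    return round(mean(values)), max(values) - min(values)


if __name__ == "__main__":
    reviewer_scores = [3, 2, 5]       # Logic Examiner, Context Analyst, Source Auditor
    reviewer_confidence = [8, 9, 8]   # same order

    overall, spread = summarize(reviewer_scores)
    confidence, _ = summarize(reviewer_confidence)
    print(f"score {overall}/10, confidence {confidence}/10, spread {spread} pts")
```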
