Verify any claim · lenz.io
Claim analyzed
Tech
“AI language models generate hallucinated or factually incorrect outputs in more than 20% of cases.”
Submitted by Vicky
The conclusion
Hallucination rates above 20% are documented in specific high-stakes domains like medical literature review and clinical decision support, but the claim's unqualified framing suggests this is typical across all AI language model use — which the evidence does not support. Broad benchmarks show top current models averaging under 10%, and sometimes below 1%. The rate varies dramatically by model, task, domain, and how "hallucination" is measured, making a single blanket figure misleading.
Based on 12 sources: 5 supporting, 5 refuting, 2 neutral.
Caveats
- The >20% figures cited in supporting studies come primarily from specialized medical/clinical tasks and older models (e.g., GPT-3.5, Bard), not general-use scenarios.
- Hallucination rates are highly dependent on task type, model generation, prompting method, and evaluation metric — 'cases' is undefined in the claim and could refer to prompts, answers, tokens, or citations (see the sketch after this list).
- Broad benchmark data (Vectara Hallucination Leaderboard, Frontiers survey) shows current top models averaging well under 10% on factual consistency tasks, directly contradicting the >20% threshold as a general rule.
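To make the measurement-unit ambiguity concrete, here is a minimal Python sketch. It reuses the "1 out of every 20 tokens" figure quoted by Vectara's CEO in the sources below, and assumes a hypothetical 50-token answer with statistically independent token errors; neither assumption comes from the sources, and real errors are not independent. The point is only that the same underlying error rate reads as 5% if "cases" means tokens but as a far larger number if "cases" means whole responses.

```python
# Illustrative sketch only: the same underlying error rate yields very different
# "hallucination rates" depending on what a "case" is (tokens vs. responses).
# The per-token rate echoes the ~1-in-20 figure quoted in the sources below;
# the response length and the independence assumption are hypothetical.

per_token_rate = 0.05    # ~1 in 20 tokens, per the quoted figure
response_length = 50     # hypothetical number of tokens per answer

# If "cases" means tokens, the rate is simply the per-token figure.
token_level_rate = per_token_rate

# If "cases" means whole responses, and token errors were independent,
# the chance a response contains at least one hallucinated token is far higher.
response_level_rate = 1 - (1 - per_token_rate) ** response_length

print(f"Per-token rate:    {token_level_rate:.1%}")    # 5.0%
print(f"Per-response rate: {response_level_rate:.1%}")  # ~92.3%
```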
Sources
Sources used in the analysis
Source 1: We then computed the overall factual consistency rate (no hallucinations) and hallucination rate (100 - accuracy) for each model. The leaderboard tracks hallucination rates across many models, with top models like Google Gemini-2.0-Flash-001 and OpenAI o3-mini-high showing rates as low as 0.7% to 0.9%, while averages across models reach around 9.2%.
Source 2: Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001). These rates exceed 20% for all tested models in this medical literature review task.
Source 3: A February 2026 study found that hallucination rates ranged from 50% to 82% across models and prompting methods in clinical decision support scenarios. For the best-performing model, GPT-4o, rates declined from 53% to 23% with prompt-based mitigation, but still remained above 20%.
Source 4: Statistical measures from this dataset revealed significantly lower factual and intrinsic hallucination rates for GPT-4 (under 10%) compared to other models. Models with low MV like GPT-4 achieved better factual accuracy, aligning with benchmarks like TruthfulQA.
Source 5: In ChatGPT-4, 76.2% of responses were accurate, which was higher compared to 50.0% in Bard and 45.2% in ChatGPT-3.5. This implies that ChatGPT-4 had an inaccuracy rate of 23.8%, Bard 50.0%, and ChatGPT-3.5 54.8%. Of 65/109 unique references from Bard, 13.8% were non-existent/fabricated.
Source 6: Recent updates to hallucination surveys note that average rates across LLMs on factual QA benchmarks range from 10% to 25%, with 2025 models improving to under 10% for leaders, though some reasoning models exceed 30% on complex queries.
Source 7: A January 2026 benchmark revealed that even the latest models have greater than 15% hallucination rates when asked to analyze provided statements.
Source 8: On comparable benchmarks, hallucinations are declining year over year for non-complex cases. Top models dropped from roughly 1–3% in 2024 to 0.7–1.5% in 2025 on grounded summarization tasks. However, hallucinations remain high in complex reasoning and open-domain factual recall, where rates can exceed 33%.
Source 9: “Even when you ground LLMs, 1 out of every 20 tokens coming out might be completely wrong, completely off topic, or not true,” said Amr Awadallah, CEO of GenAI platform Vectara. “Gemini 2.0 from Google broke new benchmarks and they're around 0.8%, 0.9% hallucinations, which is amazing. But I think we're going to be saturating around 0.5%. I don't think we'll be able to beat 0.5%. There are many, many fields where that 0.5% is not acceptable.”
Source 10: When a session drags past 20 minutes, almost 9 out of 10 users (88%) say they 'very often' have to revise outputs for hallucinations. Heavy users run into frequent hallucinations nearly three times as often as casual users (34% vs. 12%).
Source 11: A February 2025 analysis noted that while hallucinations are on the decline, one study found 29% false references from ChatGPT-4, compared to 40% from ChatGPT-3.5. The analysis projects that AI could reach zero hallucinations by February 2027, though that remains a projection rather than a measured result.
Source 12: Established benchmarks like TruthfulQA show top LLMs like GPT-4o and Llama 3.1 averaging 5-15% hallucination rates on general tasks, below 20% overall, though domain-specific tasks like legal or medical can exceed 20% in older models.
Expert review
How each expert evaluated the evidence and arguments
Expert 1 — The Logic Examiner
The claim asserts that AI language models generate hallucinated or factually incorrect outputs "in more than 20% of cases" — a universal, unqualified scope. The evidence is deeply bifurcated: Sources 2, 3, and 5 directly measure rates exceeding 20% in domain-specific tasks (medical literature review, clinical decision support, disease epidemiology), while Sources 1, 4, 8, 9, and 12 show top models and general benchmarks averaging well below 20% (9.2% average on Vectara, sub-10% for GPT-4 on factual QA). The logical flaw in the proponent's case is a hasty generalization — domain-specific high-stakes results (medical, clinical) cannot be extrapolated to "AI language models in general" without scope qualification. Conversely, the opponent commits a composition fallacy in reverse — cherry-picking top-performing models and grounded summarization benchmarks to represent all models across all use cases, while dismissing realistic domain studies as "narrow." The claim as written is unqualified and sweeping ("in more than 20% of cases"), which the aggregate evidence does not support: the best available cross-model benchmark (Source 1) shows a ~9.2% average, and Source 6 places the range at 10–25% with leaders under 10%, meaning the claim is only true in specific high-stakes or complex domains, not as a general rule. The claim is therefore misleading — it is true in certain contexts but false as a universal statement, and the reasoning used to support it relies on scope overgeneralization from domain-specific studies.
Expert 2 — The Context Analyst
The claim is framed as a general, across-the-board rate (“more than 20% of cases”) but the >20% figures come largely from domain-specific, high-stakes medical/clinical tasks and/or older models (e.g., GPT‑3.5/Bard) where error rates are known to be higher, while broader benchmark-style measurements for many current top models report well under 20% (often under 10% and sometimes ~1%) and even an across-model average around ~9% on certain grounded factual-consistency setups (Sources 1, 4, 6, 12 vs. 2, 3, 5). With full context, it's true that LLMs can exceed 20% in some settings, but the unqualified wording implies this is typical overall, which is misleading rather than generally true (Sources 1, 2, 3, 5, 6).
Expert 3 — The Source Auditor
The highest-authority, most directly quantitative sources are the peer‑reviewed/PMC studies (Source 2 PubMed Central; Source 5 PMC; Source 3 PMC), which show >20% factual error/hallucination rates in specific medical/clinical tasks (e.g., GPT‑4 at 28.6% in one literature-review setup in Source 2; GPT‑4 at 23.8% inaccurate in Source 5; mitigated best case still ~23% in Source 3), while other credible but differently-scoped evidence (Source 4 Frontiers survey; Source 1 Vectara leaderboard) reports substantially lower rates (often <10% and even ~1%) on particular benchmarks like grounded summarization/factual consistency leaderboards. Because the claim is framed as a broad, general rate “in more than 20% of cases” across AI language model use, the most trustworthy evidence does not support that as a generalizable overall figure (it is highly task/model dependent and often below 20% on major benchmarks), so the claim is misleading rather than clearly true or false.
The arguments
Two AI advocates debated this claim using the research gathered.
Argument for
Multiple peer‑reviewed evaluations show hallucination/inaccuracy rates well above 20% in realistic information-seeking settings: PubMed Central reports 39.6% for GPT‑3.5 and 28.6% for GPT‑4 in a medical literature review task (Source 2, “Hallucination Rates and Reference Accuracy of ChatGPT and Bard ...”), and a disease-epidemiology accuracy study finds GPT‑4 responses 76.2% accurate—i.e., 23.8% inaccurate—while Bard and GPT‑3.5 are far worse (Source 5, “Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology”). Reinforcing that this is not an isolated artifact, a 2026 clinical decision-support assurance analysis reports 50–82% hallucination rates across models and prompting methods, with even the best mitigated condition still at 23% (Source 3, “Multi-model assurance analysis ...”), establishing that AI language models can and do generate factually incorrect outputs in more than 20% of cases.
Rebuttal
Your entire case rests on a classic composition fallacy — you are extrapolating from narrow, high-stakes domain studies (medical literature review in Source 2, clinical decision support in Source 3, disease epidemiology in Source 5) to a sweeping claim about AI language models "in general," yet none of these sources test general-use scenarios, and Source 1, the most rigorous and comprehensive benchmark available, directly contradicts you with an average hallucination rate of ~9.2% across all tracked models. Furthermore, your reliance on older models like GPT-3.5 and Bard — which Source 4 and Source 12 confirm are far outperformed by current models operating well under 10% — means you are cherry-picking outdated, worst-case data points to prop up a claim that the best available evidence, measured across diverse models and tasks, decisively refutes.
Argument against
The claim that AI language models hallucinate in "more than 20% of cases" as a general rule is flatly contradicted by the most rigorous and authoritative benchmark data available: Source 1 (Vectara Hallucination Leaderboard) shows top models hallucinating at rates as low as 0.7–0.9%, with an average across all tracked models of only ~9.2%, and Source 4 (Frontiers in Artificial Intelligence) confirms GPT-4 operates well under 10% on factual tasks — both figures decisively below the 20% threshold. The supporting sources that appear to exceed 20% are narrowly scoped to adversarial clinical scenarios (Source 3), specialized medical literature review tasks (Source 2), or domain-specific inaccuracy metrics (Source 5), none of which represent general-use hallucination rates and therefore cannot sustain a sweeping claim about AI language models broadly.
Rebuttal
You're treating Source 1 (Vectara Hallucination Leaderboard) and Source 4 (Frontiers in Artificial Intelligence) as if they define “cases” for all real-world use, but they're benchmark- and task-dependent snapshots where grounded summarization/factual QA can be unusually low—so citing sub‑1% leaders and a ~9% average doesn't logically refute that other common information-seeking settings exceed 20%. More importantly, you dismiss Source 2 (Hallucination Rates and Reference Accuracy of ChatGPT and Bard ...), Source 5 (Disease Epidemiology accuracy), and Source 3 (clinical decision support) as “narrow,” yet they directly measure factual incorrectness in realistic high-stakes user queries and repeatedly clear the 20% bar (28.6%, 23.8%, and ~23% even after mitigation), which is enough to validate the claim that LLMs do generate >20% incorrect outputs in practice.