Verify any claim · lenz.io
Claim analyzed
Tech
“AI language models generate hallucinated or factually incorrect outputs in more than 20% of cases.”
Submitted by Vicky
The conclusion
Hallucination rates above 20% are documented in specific high-stakes domains like medical literature review and clinical decision support, but the claim's unqualified framing suggests this is typical across all AI language model use — which the evidence does not support. Broad benchmarks show top current models averaging under 10%, and sometimes below 1%. The rate varies dramatically by model, task, domain, and how "hallucination" is measured, making a single blanket figure misleading.
Based on 12 sources: 5 supporting, 5 refuting, 2 neutral.
Caveats
- The >20% figures cited in supporting studies come primarily from specialized medical/clinical tasks and older models (e.g., GPT-3.5, Bard), not general-use scenarios.
- Hallucination rates are highly dependent on task type, model generation, prompting method, and evaluation metric — 'cases' is undefined in the claim and could refer to prompts, answers, tokens, or citations (see the sketch after this list).
- Broad benchmark data (Vectara Hallucination Leaderboard, Frontiers survey) shows current top models averaging well under 10% on factual consistency tasks, directly contradicting the >20% threshold as a general rule.
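To make the measurement-unit ambiguity concrete, here is a minimal Python sketch. It reuses the "1 out of every 20 tokens" figure quoted by Vectara's CEO in the sources below, and assumes a hypothetical 50-token answer with statistically independent token errors; neither assumption comes from the sources, and real errors are not independent. The point is only that the same underlying error rate reads as 5% if "cases" means tokens but as a far larger number if "cases" means whole responses.

```python
# Illustrative sketch only: the same underlying error rate yields very different
# "hallucination rates" depending on what a "case" is (tokens vs. responses).
# The per-token rate echoes the ~1-in-20 figure quoted in the sources below;
# the response length and the independence assumption are hypothetical.

per_token_rate = 0.05    # ~1 in 20 tokens, per the quoted figure
response_length = 50     # hypothetical number of tokens per answer

# If "cases" means tokens, the rate is simply the per-token figure.
token_level_rate = per_token_rate

# If "cases" means whole responses, and token errors were independent,
# the chance a response contains at least one hallucinated token is far higher.
response_level_rate = 1 - (1 - per_token_rate) ** response_length

print(f"Per-token rate:    {token_level_rate:.1%}")    # 5.0%
print(f"Per-response rate: {response_level_rate:.1%}")  # ~92.3%
```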
Sources
Sources used in the analysis
Source 1: We then computed the overall factual consistency rate (no hallucinations) and hallucination rate (100 - accuracy) for each model. The leaderboard tracks hallucination rates across many models, with top models like Google Gemini-2.0-Flash-001 and OpenAI o3-mini-high showing rates as low as 0.7% to 0.9%, while averages across models reach around 9.2%.
Source 2: Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001). These rates exceed 20% for all tested models in this medical literature review task.
Source 3: A February 2026 study found that hallucination rates ranged from 50% to 82% across models and prompting methods in clinical decision support scenarios. For the best-performing model, GPT-4o, rates declined from 53% to 23% with prompt-based mitigation, but still remained above 20%.
Source 4: Statistical measures from this dataset revealed significantly lower factual and intrinsic hallucination rates for GPT-4 (under 10%) compared to other models. Models with low MV like GPT-4 achieved better factual accuracy, aligning with benchmarks like TruthfulQA.
Source 5: In ChatGPT-4, 76.2% of responses were accurate, which was higher compared to 50.0% in Bard and 45.2% in ChatGPT-3.5. This implies that ChatGPT-4 had an inaccuracy rate of 23.8%, Bard 50.0%, and ChatGPT-3.5 54.8%. Of 65/109 unique references from Bard, 13.8% were non-existent/fabricated.
Source 6: Recent updates to hallucination surveys note that average rates across LLMs on factual QA benchmarks range from 10% to 25%, with 2025 models improving to under 10% for leaders, though some reasoning models exceed 30% on complex queries.
Source 7: A January 2026 benchmark revealed that even the latest models have greater than 15% hallucination rates when asked to analyze provided statements.
Source 8: On comparable benchmarks, hallucinations are declining year over year for non-complex cases. Top models dropped from roughly 1–3% in 2024 to 0.7–1.5% in 2025 on grounded summarization tasks. However, hallucinations remain high in complex reasoning and open-domain factual recall, where rates can exceed 33%.
Source 9: “Even when you ground LLMs, 1 out of every 20 tokens coming out might be completely wrong, completely off topic, or not true,” said Amr Awadallah, CEO of GenAI platform Vectara. “Gemini 2.0 from Google broke new benchmarks and they're around 0.8%, 0.9% hallucinations, which is amazing. But I think we're going to be saturating around 0.5%. I don't think we'll be able to beat 0.5%. There are many, many fields where that 0.5% is not acceptable.”
Source 10: When a session drags past 20 minutes, almost 9 out of 10 users (88%) say they 'very often' have to revise outputs for hallucinations. Heavy users run into frequent hallucinations nearly three times as often as casual users (34% vs. 12%).
Source 11: A February 2025 analysis noted that while hallucinations are on the decline, one study found 29% false references from ChatGPT-4, compared to 40% from ChatGPT-3.5. The analysis projects that AI could reach zero hallucinations by February 2027, though that remains a projection rather than a measured result.
Source 12: Established benchmarks like TruthfulQA show top LLMs like GPT-4o and Llama 3.1 averaging 5-15% hallucination rates on general tasks, below 20% overall, though domain-specific tasks like legal or medical can exceed 20% in older models.
Expert review
How each expert evaluated the evidence and arguments
Expert 1 — The Logic Examiner
The claim asserts that AI language models generate hallucinated or factually incorrect outputs "in more than 20% of cases" — a universal, unqualified scope. The evidence is deeply bifurcated: Sources 2, 3, and 5 directly measure rates exceeding 20% in domain-specific tasks (medical literature review, clinical decision support, disease epidemiology), while Sources 1, 4, 8, 9, and 12 show top models and general benchmarks averaging well below 20% (9.2% average on Vectara, sub-10% for GPT-4 on factual QA). The logical flaw in the proponent's case is a hasty generalization — domain-specific high-stakes results (medical, clinical) cannot be extrapolated to "AI language models in general" without scope qualification. Conversely, the opponent commits a composition fallacy in reverse — cherry-picking top-performing models and grounded summarization benchmarks to represent all models across all use cases, while dismissing realistic domain studies as "narrow." The claim as written is unqualified and sweeping ("in more than 20% of cases"), which the aggregate evidence does not support: the best available cross-model benchmark (Source 1) shows a ~9.2% average, and Source 6 places the range at 10–25% with leaders under 10%, meaning the claim is only true in specific high-stakes or complex domains, not as a general rule. The claim is therefore misleading — it is true in certain contexts but false as a universal statement, and the reasoning used to support it relies on scope overgeneralization from domain-specific studies.
Expert 2 — The Context Analyst
The claim is framed as a general, across-the-board rate (“more than 20% of cases”) but the >20% figures come largely from domain-specific, high-stakes medical/clinical tasks and/or older models (e.g., GPT‑3.5/Bard) where error rates are known to be higher, while broader benchmark-style measurements for many current top models report well under 20% (often under 10% and sometimes ~1%) and even an across-model average around ~9% on certain grounded factual-consistency setups (Sources 1, 4, 6, 12 vs. 2, 3, 5). With full context, it's true that LLMs can exceed 20% in some settings, but the unqualified wording implies this is typical overall, which is misleading rather than generally true (Sources 1, 2, 3, 5, 6).
Expert 3 — The Source Auditor
The highest-authority, most directly quantitative sources are the peer‑reviewed/PMC studies (Source 2 PubMed Central; Source 5 PMC; Source 3 PMC), which show >20% factual error/hallucination rates in specific medical/clinical tasks (e.g., GPT‑4 at 28.6% in one literature-review setup in Source 2; GPT‑4 at 23.8% inaccurate in Source 5; mitigated best case still ~23% in Source 3), while other credible but differently-scoped evidence (Source 4 Frontiers survey; Source 1 Vectara leaderboard) reports substantially lower rates (often <10% and even ~1%) on particular benchmarks like grounded summarization/factual consistency leaderboards. Because the claim is framed as a broad, general rate “in more than 20% of cases” across AI language model use, the most trustworthy evidence does not support that as a generalizable overall figure (it is highly task/model dependent and often below 20% on major benchmarks), so the claim is misleading rather than clearly true or false.
The arguments
Two AI advocates debated this claim using the research gathered.
Argument for
Multiple peer‑reviewed evaluations show hallucination/inaccuracy rates well above 20% in realistic information-seeking settings: PubMed Central reports 39.6% for GPT‑3.5 and 28.6% for GPT‑4 in a medical literature review task (Source 2, “Hallucination Rates and Reference Accuracy of ChatGPT and Bard ...”), and a disease-epidemiology accuracy study finds GPT‑4 responses 76.2% accurate—i.e., 23.8% inaccurate—while Bard and GPT‑3.5 are far worse (Source 5, “Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology”). Reinforcing that this is not an isolated artifact, a 2026 clinical decision-support assurance analysis reports 50–82% hallucination rates across models and prompting methods, with even the best mitigated condition still at 23% (Source 3, “Multi-model assurance analysis ...”), establishing that AI language models can and do generate factually incorrect outputs in more than 20% of cases.
Rebuttal
Your entire case rests on a classic composition fallacy — you are extrapolating from narrow, high-stakes domain studies (medical literature review in Source 2, clinical decision support in Source 3, disease epidemiology in Source 5) to a sweeping claim about AI language models "in general," yet none of these sources test general-use scenarios, and Source 1, the most rigorous and comprehensive benchmark available, directly contradicts you with an average hallucination rate of ~9.2% across all tracked models. Furthermore, your reliance on older models like GPT-3.5 and Bard — which Source 4 and Source 12 confirm are far outperformed by current models operating well under 10% — means you are cherry-picking outdated, worst-case data points to prop up a claim that the best available evidence, measured across diverse models and tasks, decisively refutes.
Argument against
The claim that AI language models hallucinate in "more than 20% of cases" as a general rule is flatly contradicted by the most rigorous and authoritative benchmark data available: Source 1 (Vectara Hallucination Leaderboard) shows top models hallucinating at rates as low as 0.7–0.9%, with an average across all tracked models of only ~9.2%, and Source 4 (Frontiers in Artificial Intelligence) confirms GPT-4 operates well under 10% on factual tasks — both figures decisively below the 20% threshold. The supporting sources that appear to exceed 20% are narrowly scoped to adversarial clinical scenarios (Source 3), specialized medical literature review tasks (Source 2), or domain-specific inaccuracy metrics (Source 5), none of which represent general-use hallucination rates and therefore cannot sustain a sweeping claim about AI language models broadly.
Rebuttal
You're treating Source 1 (Vectara Hallucination Leaderboard) and Source 4 (Frontiers in Artificial Intelligence) as if they define “cases” for all real-world use, but they're benchmark- and task-dependent snapshots where grounded summarization/factual QA can be unusually low—so citing sub‑1% leaders and a ~9% average doesn't logically refute that other common information-seeking settings exceed 20%. More importantly, you dismiss Source 2 (Hallucination Rates and Reference Accuracy of ChatGPT and Bard ...), Source 5 (Disease Epidemiology accuracy), and Source 3 (clinical decision support) as “narrow,” yet they directly measure factual incorrectness in realistic high-stakes user queries and repeatedly clear the 20% bar (28.6%, 23.8%, and ~23% even after mitigation), which is enough to validate the claim that LLMs do generate >20% incorrect outputs in practice.