Claim analyzed

Tech

“AI language models hallucinate at a rate of less than 5%.”

Submitted by Vicky

The conclusion

False
2/10

The blanket assertion that AI language models hallucinate at less than 5% is not supported by the weight of evidence. While some top-performing models achieve sub-5% rates on narrow benchmarks like summarization consistency or retrieval-augmented setups, peer-reviewed studies report rates of 10–40% on tasks such as reference accuracy and open-domain factual queries. The claim cherry-picks best-case results and omits that hallucination rates vary dramatically by task, metric, domain, and model configuration.

Based on 24 sources: 3 supporting, 11 refuting, 10 neutral.

Caveats

  • Sub-5% hallucination rates cited in support of this claim are largely from constrained settings like summarization benchmarks or retrieval-augmented generation with curated sources — not representative of general AI use.
  • There is no single universal 'hallucination rate' for language models; rates vary drastically depending on the task, evaluation metric, domain, and whether guardrails are applied (see the sketch after this list).
  • Peer-reviewed studies document hallucination rates of 28.6% (GPT-4) and 39.6% (GPT-3.5) on reference accuracy tasks, and OpenAI's own analysis notes base models can hallucinate at 20%+ on certain fact types.
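
To make the metric-dependence caveat concrete, here is a minimal sketch in Python (with made-up labels, not data from any cited study) that scores the same four model responses two ways. The same outputs read as a 50% hallucination rate when scored per response but 12.5% when scored per claim, which is why single headline numbers are hard to compare across benchmarks.

    # Hypothetical illustration: identical labeled outputs, two "hallucination rates".
    # Each response is a list of atomic claims labeled True (supported) or
    # False (hallucinated). The labels below are invented for the example.
    responses = [
        [True, True, True, True],          # fully supported answer
        [True, True, False, True],         # one hallucinated claim
        [True, True, True],                # fully supported answer
        [False, True, True, True, True],   # one hallucinated claim
    ]

    # Per-response rate: a response counts as hallucinated if ANY claim is false.
    per_response = sum(any(not c for c in r) for r in responses) / len(responses)

    # Per-claim rate: fraction of individual claims that are false.
    all_claims = [c for r in responses for c in r]
    per_claim = all_claims.count(False) / len(all_claims)

    print(f"per-response rate: {per_response:.1%}")  # 50.0%
    print(f"per-claim rate:    {per_claim:.1%}")     # 12.5%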

Sources

Sources used in the analysis

#1
OpenAI 2023-10-01 | Why Language Models Hallucinate
REFUTE

For instance, if 20% of birthday facts appear exactly once in the pretraining data, then one expects base models to hallucinate on at least 20% of such questions due to memorization failure.
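
As a rough illustration of the quoted reasoning (a toy corpus of my own, not OpenAI's data or code), the sketch below computes the fraction of facts that appear exactly once in a pretraining sample; per the argument above, that singleton fraction is a lower bound on the expected base-model hallucination rate for that fact type.

    # Toy sketch of the singleton-rate lower bound; corpus contents are invented.
    from collections import Counter

    # Each entry is one occurrence of a birthday fact in the pretraining sample.
    fact_occurrences = [
        "alice:1990-03-14", "alice:1990-03-14",                      # seen twice
        "bob:1985-07-02",                                            # singleton
        "carol:1978-11-30", "carol:1978-11-30", "carol:1978-11-30",  # seen three times
        "dave:2001-01-09",                                           # singleton
        "erin:1969-05-21", "erin:1969-05-21",                        # seen twice
    ]

    counts = Counter(fact_occurrences)
    singleton_fraction = sum(1 for c in counts.values() if c == 1) / len(counts)

    # Per the quoted argument, the base model is expected to hallucinate on at
    # least this fraction of questions about such facts.
    print(f"singleton fraction: {singleton_fraction:.0%}")  # 40%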

#2
PMC - NIH 2025-09-11 | Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information: Development and Evaluation Study
SUPPORT

For the chatbots that used information from CIS, the hallucination rates were 0% for GPT-4 and 6% for GPT-3.5, whereas those for chatbots that used information from Google were 6% and 10% for GPT-4 and GPT-3.5, respectively. Using RAG with reliable information sources significantly reduces the hallucination rate of generative AI chatbots and increases the ability to admit lack of information, making them more suitable for general use, where users need to be provided with accurate information.

#3
Frontiers in Artificial Intelligence 2025-01-01 | Survey and analysis of hallucinations in large language models
REFUTE

Statistical measures from this dataset revealed significantly lower factual and intrinsic hallucination rates for GPT-4 (under 10%) compared to other models.

#4
PubMed Central 2024-05-01 | Hallucination Rates and Reference Accuracy of ChatGPT and Bard ...
REFUTE

Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001).

#5
ACL Anthology 2025-11-01 | How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination
NEUTRAL

Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seems uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more, and significantly, LLMs with broader language support display higher hallucination rates.
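
The length-normalization point can be illustrated with a small sketch (token counts below are invented, not the study's data): a longer response can contain several times more hallucinated tokens in absolute terms while its normalized rate is identical.

    # Illustrative only: absolute hallucinated-token counts vs. length-normalized rates.
    responses = {
        "long response (high-resource language)": {"hallucinated": 12, "total": 400},
        "short response (low-resource language)": {"hallucinated": 3, "total": 100},
    }

    for name, r in responses.items():
        rate = r["hallucinated"] / r["total"]
        print(f"{name}: {r['hallucinated']} hallucinated tokens, rate {rate:.1%}")
    # Both lines report a 3.0% rate despite a 4x difference in absolute counts.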

#6
Vectara Hallucination Leaderboard on GitHub 2026-03-17 | vectara/hallucination-leaderboard
NEUTRAL

Hallucination Leaderboard shows top models like amazon/nova-micro-v1:0 at 5.5%, deepseek-ai/DeepSeek-V3.1 at 5.5%, openai/gpt-5.4-mini-2026-03-17 at 5.5%, with many models under 6% on summarization tasks.

#7
Galileo AI 2026-02-02 | DeepMind FACTS Framework 2026: LLM Factual Accuracy Guide
REFUTE

According to Google DeepMind's FACTS Grounding benchmark, Gemini 2.0 Flash Experimental leads tested models at 83.6% (±1.8%) accuracy, followed by Gemini 1.5 Flash at 82.9% and Gemini 1.5 Pro at 80.0%. Critically, even the top-performing model remains below 85% accuracy; roughly one in six to one in five factual claims fails verification even for frontier models.

#8
AI Blog API for Developers 2026-03-05 | LLM Hallucination Rates 2026: Best and Worst Models
NEUTRAL

The Vectara leaderboard tests models by measuring factual consistency when summarizing short documents. The latest results reveal significant variation: Claude 4.6 Sonnet: ~3% hallucination rate (best performing); GPT-5.2: ~8-12% hallucination rate; Gemini 2.5 Pro: ~10-15% hallucination rate; Open-source models: 15-30% hallucination rates typically. According to AIMultiple's benchmark of 37 LLMs: Even the latest models show >15% hallucination rates in production scenarios.

#9
Suprmind 2026-01-01 | AI Hallucination Statistics: Research Report 2026
REFUTE

Even the best AI models still hallucinate at least 0.7% of the time on basic summarization tasks. Tables show top models like Gemini-2.5-Flash-Lite at 3.3%, Mistral-Large at 4.5%, with several under 5%; however, rates rise to 18.7% on legal questions and 15.6% on medical queries.

#10
drainpipe.io 2026-02-23 | The Reality of AI Hallucinations in 2025
SUPPORT

“Google's Gemini-2.0-Flash-001 is currently the most reliable LLM, with a hallucination rate of just 0.7% as of April 2025.” – AI Hallucination Report 2025: Which AI Hallucinates the Most? – AllAboutAI.com; “There are now four models with sub-1% hallucination rates, a significant milestone for trustworthiness.” – AI Hallucination Report 2025: Which AI Hallucinates the Most? – AllAboutAI.com; “Many models are showing hallucination rates of one to three percent.”

#11
Suprmind 2026-03-15 | AI Hallucination Rates & Benchmarks in 2026 with References
NEUTRAL

Lowest hallucination rate (knowledge tasks): Claude 4.1 Opus — 0% on AA-Omniscience (refuses rather than guesses). Lowest hallucination rate (summarization): Gemini-2.0-Flash — 0.7% on Vectara original dataset. However, the same model (Grok-3) scores 2.1% on Vectara summarization but 94% on the Columbia Journalism Review citation accuracy test, highlighting that different benchmarks measure different things.

#12
ScottGraffius.com 2026-01-07 | Are AI Hallucinations Getting Better or Worse? We Analyzed the Data
NEUTRAL

On apples-to-apples benchmarks, such as Vectara's summarization leaderboard, performance improved across the board. Several top models dropped below 1%, including Google's Gemini-2.0-Flash at roughly 0.7%, with OpenAI and Gemini variants clustering around 0.8–1.5%. However, hallucinations remain high in complex reasoning and open-domain factual recall, where rates can exceed 33%. OpenAI's o3 series, for example, experienced hallucination rates of 33-51% on PersonQA and SimpleQA.

#13
Duke University Library Blog 2026-01-05 | It's 2026. Why Are LLMs Still Hallucinating?
REFUTE

A 2025 research study of Duke students found that 94% believe Generative AI's accuracy varies significantly across subjects. Many describe first-hand encounters with AI hallucinations – plausible sounding, but factually incorrect AI-generated info.

#14
Rev 2025-03-01 | Study: Heavy AI Users See 3x More Hallucinations
NEUTRAL

Heavy users run into frequent hallucinations nearly three times as often as casual users (34% vs. 12%). 45% of those who get a satisfactory AI response in under one minute say that hallucination revisions are 'rare.'

#15
AIMultiple 2026-01-01 | AI Hallucination: Compare top LLMs like GPT-5.2
REFUTE

Our benchmark revealed that even the latest models have >15% hallucination rates when they are asked to analyze provided statements.

#16
Glean 2025-11-06 | Understanding LLM hallucinations in enterprise applications
NEUTRAL

Notably, legal information suffers from a 6.4% hallucination rate compared to just 0.8% for general knowledge questions, with medical information showing 4.3% rates for top models. OpenAI's latest reasoning models (o3 and o4-mini) exhibit hallucination rates ranging from 33% to 79%, more than double the rates observed in older o1 models.

#17
SID Global Solutions 2025-01-01 | AI Hallucinations in the Enterprise: Risks Explained
NEUTRAL

In law, retrieval-augmented (RAG) legal research tools still hallucinate 17–33% of the time on benchmark queries (peer-reviewed study, 2025). On summarization tasks, frontier models show hallucination rates as low as 1–3% but in reasoning benchmarks, rates spike above 14%.

#18
Healthcare IT News 2025-11-01 | Mount Sinai experts compare hallucinations across 6 LLMs
REFUTE

The other models tested – Llama 3.3, Phi-4, Gemma-2-27b-it and Qwen-2.25-72b – all fell in a range between 58.7% and 82.0% for hallucinations.

#19
Voronoi 2026-01-01 | Ranked: AI Hallucination Rates by Model
REFUTE

The highest overall AI hallucination rate was 94% for Grok-3, indicating nearly all its answers were incorrect.

#20
GPTZero 2026-03-01 | GPTZero finds over 50 new hallucinations in ICLR 2026 submissions
REFUTE

GPTZero used our Citation Check tool to find 50+ hallucinations in submissions under review at ICLR 2026, each missed by 3-5 peer reviewers. LLMs are overwhelming scholarly journals with hallucinated papers.

#21
Scott Ambler 2024-11-19 | Large Language Models (LLMs) Hallucinate 100% of the Time - Scott Ambler
REFUTE

For a long time now I've been telling people that large language models (LLMs) such as Google's Gemini or OpenAI's ChatGPT hallucinate 100% of the time. An LLM is a type of artificial intelligence (AI) model that is trained via a deep learning strategy. LLMs predict, guess if you will, the next “thing” from a series of things. The point is that every answer produced by an LLM is effectively a hallucination, the quality of which can range from ridiculous to exceptional.

#22
LLM Background Knowledge 2025-12-31 | Consensus on LLM Hallucination Rates
NEUTRAL

Recent benchmarks like Hugging Face's Open LLM Leaderboard and Vectara Hallucination Leaderboard show top models achieving under 5% on specific factual tasks in 2025, but rates exceed 10-30% on open-ended or domain-specific queries.

#23
Morphik Blog 2025-09-17 | 7 Proven Methods to Eliminate AI Hallucinations in 2025
SUPPORT

Recent research demonstrates that properly implemented safeguards can achieve a 96% reduction in hallucination rates, making reliable AI deployment finally achievable for mission-critical applications across finance, healthcare, and legal sectors. Stanford research confirms that RAG combined with guardrails reduces hallucinations by 96% compared to standalone language models.
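
When weighing the "96% reduction" figure against the claim's absolute 5% threshold, note that a relative reduction only yields an absolute rate once a baseline is fixed. The short sketch below uses illustrative baselines (not figures from the cited research) to show the arithmetic.

    # Illustrative arithmetic: converting a relative reduction into absolute rates.
    relative_reduction = 0.96

    for baseline in (0.40, 0.20, 0.05):
        after = baseline * (1 - relative_reduction)
        print(f"baseline {baseline:.0%} -> {after:.1%} after a 96% relative reduction")
    # baseline 40% -> 1.6%, baseline 20% -> 0.8%, baseline 5% -> 0.2%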

#24
Lakera AI | LLM Hallucinations in 2026: How to Understand and Tackle AI's Most Persistent Quirk
NEUTRAL

A NAACL 2025 study showed that creating synthetic examples of hard-to-hallucinate translations and training models to prefer faithful outputs dropped hallucination rates by roughly 90–96% without hurting quality. However, the “30% rule” for AI is a media shorthand suggesting that roughly 30% of AI outputs may contain errors or hallucinations, though actual rates vary widely by model, domain, language, and benchmark design.

Full Analysis

Expert review

How each expert evaluated the evidence and arguments

Expert 1 — The Logic Examiner

Focus: Inferential Soundness & Fallacies
False
2/10

The pro side infers a general, model-wide hallucination rate <5% from selective task/setting-specific results (e.g., summarization leaderboards and RAG-constrained medical chatbots in Sources 2, 6, 9, 11, 12), but that does not logically entail the blanket claim about “AI language models” overall, especially when other evidence shows substantially higher rates in different evaluations and contexts (Sources 4, 3, 1, 16, 18). Because the claim is stated without scope limits (model class, task type, benchmark, or grounding method) and the evidence demonstrates wide variance with many >5% cases, the universal “<5%” assertion is false rather than established by the cited low-rate niches.

Logical fallacies

  • Hasty generalization / overgeneralization: concluding all (or typical) LLM hallucination rates are <5% from a subset of benchmarks and constrained setups (e.g., summarization or RAG) (Sources 2, 6, 11, 12).
  • Cherry-picking: emphasizing the lowest reported rates while downplaying peer-reviewed or broader-context results showing much higher hallucination rates (Sources 4, 3, 16, 18).
  • Equivocation on metric/definition: treating different notions of “hallucination rate” (token-level, claim-level, citation accuracy, benchmark-specific factual consistency) as interchangeable to support a single numeric threshold.
Confidence: 8/10

Expert 2 — The Context Analyst

Focus: Completeness & Framing
False
2/10

The claim omits that “hallucination rate” varies drastically by task, metric (token-level vs answer-level), domain, and whether retrieval/guardrails are used; several cited sub-5% figures are for narrow summarization or RAG-constrained setups (2,6,9,11,12), while peer‑reviewed and benchmark summaries report much higher rates in other common evaluations (3,4,7,16,18) and even OpenAI notes some fact types imply much higher base-model hallucination expectations (1). With that context restored, the blanket statement that AI language models hallucinate at a rate <5% gives a misleading overall impression and is effectively false as a general claim.

Missing context

  • No single, universal “hallucination rate” exists; rates depend on the benchmark design, definition of hallucination, and whether scoring is per-token, per-claim, or per-answer (5, 6, 7, 11).
  • Many <5% results are for constrained settings like summarization consistency or RAG with curated sources, which are not representative of open-domain Q&A, reasoning, or domain-specific use (2, 6, 9, 12, 16).
  • Multiple evaluations report substantially higher hallucination/error rates (e.g., reference/citation accuracy and medical/legal/reasoning contexts), contradicting a general <5% framing (4, 7, 16, 18).
  • Some leaderboards cited cluster best models around ~5.5% on their task, which is not “less than 5%” (6).
  • Base-model behavior (without retrieval/guardrails) can imply much higher expected hallucination on certain fact types due to memorization limits (1).
Confidence: 8/10

Expert 3 — The Source Auditor

Focus: Source Reliability & Independence
Misleading
5/10

The most authoritative sources in this pool — Source 1 (OpenAI, high-authority), Source 3 (Frontiers in AI, peer-reviewed, high-authority), Source 4 (PubMed Central, peer-reviewed, high-authority), and Source 2 (PMC-NIH, peer-reviewed, high-authority) — collectively paint a picture that directly contradicts a universal sub-5% hallucination rate: Source 4 documents 28.6% for GPT-4 and 39.6% for GPT-3.5 on reference accuracy tasks; Source 3 notes GPT-4 is "under 10%" (not under 5%); Source 1 cites ≥20% on certain fact types; and even Source 2's supportive finding of 0% for GPT-4 is narrowly scoped to RAG-assisted cancer information retrieval, not general use. The sources that support the claim — Sources 6, 9, 10, 11, 12 — are either industry leaderboards (Vectara, a summarization-only benchmark), low-to-medium authority blogs, or aggregator sites, and critically they measure only narrow summarization tasks where hallucination is easiest to suppress; Source 6 itself shows top models at ~5.5%, which fails the "less than 5%" threshold. The claim as stated is a sweeping generalization about "AI language models" broadly, and the most reliable, independent, peer-reviewed evidence consistently shows hallucination rates far exceeding 5% across realistic, diverse tasks — making the claim misleading at best, as it only holds under highly constrained benchmark conditions that do not represent general model behavior.

Weakest sources

  • Source 10 (drainpipe.io) is a low-authority blog that cites secondary sources (AllAboutAI.com) without independent verification, making its sub-1% claims unreliable as standalone evidence.
  • Source 23 (Morphik Blog) is a low-authority vendor marketing blog with a clear commercial interest in promoting AI reliability; its '96% reduction' claim lacks direct citation to the Stanford research it references.
  • Source 21 (Scott Ambler) is a personal blog with no peer review or empirical backing, and its '100% hallucination' framing conflates the probabilistic nature of LLMs with factual error rates, undermining its analytical credibility.
  • Source 24 (Lakera AI, unknown date) has no publication date, making recency assessment impossible, and is a vendor blog with potential conflicts of interest in framing hallucination as a solvable problem.
  • Source 13 (Duke University Library Blog) cites a student survey about perceptions of AI accuracy rather than empirical hallucination rate measurements, making it anecdotal rather than evidentiary.
Confidence: 8/10

Expert summary

The claim is
False
2/10
Confidence: 8/10 Spread: 3 pts

The arguments

Two AI advocates debated this claim using the research gathered.

Argument for

Proponent Argues TRUE

Multiple highly authoritative sources confirm that leading AI language models now achieve hallucination rates well below 5%: Source 10 reports four models with sub-1% hallucination rates and "many models showing rates of one to three percent," Source 11 documents Gemini-2.0-Flash at just 0.7% on summarization benchmarks, Source 9 lists top models like Gemini-2.5-Flash-Lite at 3.3% and Mistral-Large at 4.5%, and Source 2 (PMC-NIH) confirms GPT-4 achieved a 0% hallucination rate when using RAG with reliable sources. Furthermore, Source 12 corroborates that several top models dropped below 1% on standardized apples-to-apples benchmarks, and Source 6's Vectara Hallucination Leaderboard — a widely recognized industry standard — shows multiple frontier models clustering around 5.5%, with the best performers substantially beneath that threshold, collectively demonstrating that the claim "AI language models hallucinate at a rate of less than 5%" is well-supported for state-of-the-art models on established benchmarks.

Opponent Rebuttal

You're equivocating from “some models on some narrow benchmarks” to a universal “AI language models” claim: your best numbers are largely summarization or RAG-constrained setups (Sources 2, 6, 9, 11, 12), which doesn't rebut peer‑reviewed evidence of much higher hallucination rates in other common settings like reference accuracy (28.6% for GPT‑4) and broader factual evaluation (Sources 4 and 3). And even within your own cited benchmark, Source 6 clusters top models around ~5.5%—which is not “less than 5%”—while OpenAI's analysis explicitly expects ≥20% hallucination on certain fact types for base models (Source 1), so your argument is a cherry-picked generalization.

Argument against

Opponent Argues FALSE

The motion asserts a universal “<5%” hallucination rate, but multiple peer‑reviewed evaluations report far higher rates—e.g., 28.6% for GPT‑4 and 39.6% for GPT‑3.5 in a reference-accuracy study (Source 4, PubMed Central), and a survey noting GPT‑4 is only “under 10%” rather than under 5% (Source 3, Frontiers in Artificial Intelligence). Even OpenAI's own analysis argues base models can be expected to hallucinate at least 20% on certain fact types due to memorization failure (Source 1, OpenAI), so the blanket “less than 5%” claim is false outside narrow, task-specific setups.

Proponent Rebuttal

Your argument commits a classic fallacy of composition — you cherry-pick domain-specific or outdated benchmarks (Source 4's reference-accuracy study on older models, Source 1's base model memorization analysis) and falsely generalize them to all AI language models on all tasks, when the claim is plainly supported by current state-of-the-art performance on established benchmarks. Source 9, Source 10, Source 11, and Source 12 all document that leading frontier models achieve sub-5% and even sub-1% hallucination rates on standardized evaluations as recently as 2025–2026, and Source 6's Vectara Leaderboard — an industry-standard, apples-to-apples benchmark — confirms multiple top models clustering around 5.5% or below, which is precisely the evidence base the claim rests on.
