Claim analyzed
“AI language models hallucinate at a rate of less than 5%.”
Submitted by Vicky
The conclusion
The blanket assertion that AI language models hallucinate at less than 5% is not supported by the weight of evidence. While some top-performing models achieve sub-5% rates on narrow benchmarks like summarization consistency or retrieval-augmented setups, peer-reviewed studies report rates of 10–40% on tasks such as reference accuracy and open-domain factual queries. The claim cherry-picks best-case results and omits that hallucination rates vary dramatically by task, metric, domain, and model configuration.
Based on 24 sources: 3 supporting, 11 refuting, 10 neutral.
Caveats
- Sub-5% hallucination rates cited in support of this claim are largely from constrained settings like summarization benchmarks or retrieval-augmented generation with curated sources — not representative of general AI use.
- There is no single universal 'hallucination rate' for language models; rates vary drastically depending on the task, evaluation metric, domain, and whether guardrails are applied.
- Peer-reviewed studies document hallucination rates of 28.6% (GPT-4) and 39.6% (GPT-3.5) on reference accuracy tasks, and OpenAI's own analysis notes base models can hallucinate at 20%+ on certain fact types.
Sources
Sources used in the analysis
Source 1: For instance, if 20% of birthday facts appear exactly once in the pretraining data, then one expects base models to hallucinate on at least 20% of such questions due to memorization failure.
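To make the arithmetic in Source 1 concrete, the sketch below computes the "singleton" fraction of facts that appear exactly once in a corpus, which the quoted argument treats as a rough lower bound on base-model hallucination for that fact type. The data and function name are invented for illustration; none of this comes from OpenAI's actual analysis.

```python
from collections import Counter

def singleton_fraction(fact_appearances):
    """Fraction of distinct facts that appear exactly once in the corpus.

    Under the argument in Source 1, this singleton fraction is a rough lower
    bound on how often a base model will hallucinate on questions about
    facts of this type.
    """
    counts = Counter(fact_appearances)                     # fact -> number of appearances
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)

# Toy data: each entry is one appearance of a (person, birthday) fact.
appearances = ["ada:1815-12-10", "ada:1815-12-10", "grace:1906-12-09",
               "alan:1912-06-23", "alan:1912-06-23", "emmy:1882-03-23"]
print(singleton_fraction(appearances))  # 0.5 -> expect hallucination on >=50% of such questions
```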
Source 2: For the chatbots that used information from CIS, the hallucination rates were 0% for GPT-4 and 6% for GPT-3.5, whereas those for chatbots that used information from Google were 6% and 10% for GPT-4 and GPT-3.5, respectively. Using RAG with reliable information sources significantly reduces the hallucination rate of generative AI chatbots and increases the ability to admit lack of information, making them more suitable for general use, where users need to be provided with accurate information.
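As a rough illustration of the retrieval-augmented setup described in Source 2, the following sketch shows a generic retrieve-then-generate loop. The `retrieve` and `generate` callables are hypothetical placeholders rather than any real API, and the prompt wording is invented; the study itself did not publish this code.

```python
def answer_with_rag(question, retrieve, generate, top_k=3):
    """Minimal retrieve-then-generate loop in the spirit of Source 2.

    `retrieve` and `generate` are placeholder callables standing in for a
    curated knowledge-base search and an LLM call; neither is a real API.
    """
    passages = retrieve(question, top_k=top_k)
    if not passages:
        # Admitting a lack of information instead of guessing is part of what
        # the study credits for the lower hallucination rates.
        return "I don't have reliable information to answer that."
    context = "\n\n".join(passages)
    prompt = (
        "Answer using ONLY the sources below. If they do not contain the answer, "
        "say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```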
Source 3: Statistical measures from this dataset revealed significantly lower factual and intrinsic hallucination rates for GPT-4 (under 10%) compared to other models.
Source 4: Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001).
Source 5: Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seem uncorrelated with the sizes of languages' digital footprints. We also find that smaller LLMs hallucinate more, and significantly, LLMs with broader language support display higher hallucination rates.
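The distinction Source 5 draws between absolute hallucinated-token counts and length-normalized rates can be shown with a minimal sketch. The token-level 0/1 labels are assumed to come from some external annotation step, and the data is invented for illustration.

```python
def hallucination_stats(responses):
    """Compare absolute hallucinated-token counts with a length-normalized rate.

    Each response is a list of (token, is_hallucinated) pairs; the 0/1 labels
    are assumed to come from an external annotation step not modeled here.
    """
    total_tokens = sum(len(resp) for resp in responses)
    hallucinated = sum(label for resp in responses for _, label in resp)
    return {
        "hallucinated_tokens": hallucinated,   # absolute count, grows with response length
        "rate": hallucinated / total_tokens,   # normalized for length
    }

# A longer response can contain more hallucinated tokens at the same rate.
short_resp = [[("a", 0), ("b", 1)]]                        # 1 of 2 tokens
long_resp = [[("a", 0), ("b", 1), ("c", 0), ("d", 1)]]     # 2 of 4 tokens
print(hallucination_stats(short_resp), hallucination_stats(long_resp))
```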
Source 6: Hallucination Leaderboard shows top models like amazon/nova-micro-v1:0 at 5.5%, deepseek-ai/DeepSeek-V3.1 at 5.5%, openai/gpt-5.4-mini-2026-03-17 at 5.5%, with many models under 6% on summarization tasks.
Source 7: According to Google DeepMind's FACTS Grounding benchmark, Gemini 2.0 Flash Experimental leads tested models at 83.6% (±1.8%) accuracy, followed by Gemini 1.5 Flash at 82.9% and Gemini 1.5 Pro at 80.0%. Critically, even the top-performing model remains below 85% accuracy; roughly one in six to one in five factual claims fails verification even for frontier models.
Source 8: The Vectara leaderboard tests models by measuring factual consistency when summarizing short documents. The latest results reveal significant variation: Claude 4.6 Sonnet: ~3% hallucination rate (best performing); GPT-5.2: ~8-12% hallucination rate; Gemini 2.5 Pro: ~10-15% hallucination rate; Open-source models: 15-30% hallucination rates typically. According to AIMultiple's benchmark of 37 LLMs: Even the latest models show >15% hallucination rates in production scenarios.
Source 9: Even the best AI models still hallucinate at least 0.7% of the time on basic summarization tasks. Tables show top models like Gemini-2.5-Flash-Lite at 3.3%, Mistral-Large at 4.5%, with several under 5%; however, rates rise to 18.7% on legal questions and 15.6% on medical queries.
Source 10: “Google's Gemini-2.0-Flash-001 is currently the most reliable LLM, with a hallucination rate of just 0.7% as of April 2025.” – AI Hallucination Report 2025: Which AI Hallucinates the Most? – AllAboutAI.com; “There are now four models with sub-1% hallucination rates, a significant milestone for trustworthiness.” – AI Hallucination Report 2025: Which AI Hallucinates the Most? – AllAboutAI.com; “Many models are showing hallucination rates of one to three percent.”
Source 11: Lowest hallucination rate (knowledge tasks): Claude 4.1 Opus — 0% on AA-Omniscience (refuses rather than guesses). Lowest hallucination rate (summarization): Gemini-2.0-Flash — 0.7% on Vectara original dataset. However, the same model (Grok-3) scores 2.1% on Vectara summarization but 94% on the Columbia Journalism Review citation accuracy test, highlighting that different benchmarks measure different things.
Source 12: On apples-to-apples benchmarks, such as Vectara's summarization leaderboard, performance improved across the board. Several top models dropped below 1%, including Google's Gemini-2.0-Flash at roughly 0.7%, with OpenAI and Gemini variants clustering around 0.8–1.5%. However, hallucinations remain high in complex reasoning and open-domain factual recall, where rates can exceed 33%. OpenAI's o3 series, for example, experienced hallucination rates of 33-51% on PersonQA and SimpleQA.
Source 13: A 2025 research study of Duke students found that 94% believe Generative AI's accuracy varies significantly across subjects. Many describe first-hand encounters with AI hallucinations – plausible-sounding but factually incorrect AI-generated information.
Source 14: Heavy users run into frequent hallucinations nearly three times as often as casual users (34% vs. 12%). 45% of those who get a satisfactory AI response in under one minute say that hallucination revisions are 'rare.'
Source 15: Our benchmark revealed that even the latest models have >15% hallucination rates when they are asked to analyze provided statements.
Source 16: Notably, legal information suffers from a 6.4% hallucination rate compared to just 0.8% for general knowledge questions, with medical information showing 4.3% rates for top models. OpenAI's latest reasoning models (o3 and o4-mini) exhibit hallucination rates ranging from 33% to 79%, more than double the rates observed in older o1 models.
Source 17: In law, retrieval-augmented (RAG) legal research tools still hallucinate 17–33% of the time on benchmark queries (peer-reviewed study, 2025). On summarization tasks, frontier models show hallucination rates as low as 1–3% but in reasoning benchmarks, rates spike above 14%.
Source 18: The other models tested – Llama 3.3, Phi-4, Gemma-2-27b-it and Qwen-2.25-72b – all fell in a range between 58.7% and 82.0% for hallucinations.
Source 19: The highest overall AI hallucination rate was 94% for Grok-3, indicating nearly all its answers were incorrect.
Source 20: GPTZero's Citation Check tool found 50+ hallucinations in papers under review at ICLR 2026, each missed by 3-5 peer reviewers. LLMs are overwhelming scholarly journals with hallucinated papers.
Source 21: For a long time now I've been telling people that large language models (LLMs) such as Google's Gemini or OpenAI's ChatGPT hallucinate 100% of the time. An LLM is a type of artificial intelligence (AI) model that is trained via a deep learning strategy. LLMs predict, guess if you will, the next “thing” from a series of things. The point is that every answer produced by an LLM is effectively a hallucination, the quality of which can range from ridiculous to exceptional.
Source 22: Recent benchmarks like Hugging Face's Open LLM Leaderboard and Vectara Hallucination Leaderboard show top models achieving under 5% on specific factual tasks in 2025, but rates rise to 10-30% on open-ended or domain-specific queries.
Source 23: Recent research demonstrates that properly implemented safeguards can achieve a 96% reduction in hallucination rates, making reliable AI deployment finally achievable for mission-critical applications across finance, healthcare, and legal sectors. Stanford research confirms that RAG combined with guardrails reduces hallucinations by 96% compared to standalone language models.
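Note that the "96% reduction" cited in Source 23 is relative to a baseline, so the resulting absolute hallucination rate depends entirely on where you start. A minimal sketch with invented baseline figures:

```python
def rate_after_relative_reduction(baseline_rate, reduction):
    """Absolute hallucination rate remaining after a relative reduction."""
    return baseline_rate * (1 - reduction)

# A "96% reduction" is relative to whatever the baseline was (figures invented):
for baseline in (0.20, 0.05):
    remaining = rate_after_relative_reduction(baseline, 0.96)
    print(f"{baseline:.0%} baseline -> {remaining:.2%} after a 96% relative reduction")
# 20% baseline -> 0.80%; 5% baseline -> 0.20%
```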
Source 24: A NAACL 2025 study showed that creating synthetic examples of hard-to-hallucinate translations and training models to prefer faithful outputs dropped hallucination rates by roughly 90–96% without hurting quality. However, the “30% rule” for AI is a media shorthand suggesting that roughly 30% of AI outputs may contain errors or hallucinations, though actual rates vary widely by model, domain, language, and benchmark design.
Expert review
How each expert evaluated the evidence and arguments
Expert 1 — The Logic Examiner
The pro side infers a general, model-wide hallucination rate <5% from selective task/setting-specific results (e.g., summarization leaderboards and RAG-constrained medical chatbots in Sources 2, 6, 9, 11, 12), but that does not logically entail the blanket claim about “AI language models” overall, especially when other evidence shows substantially higher rates in different evaluations and contexts (Sources 4, 3, 1, 16, 18). Because the claim is stated without scope limits (model class, task type, benchmark, or grounding method) and the evidence demonstrates wide variance with many >5% cases, the universal “<5%” assertion is false rather than established by the cited low-rate niches.
Expert 2 — The Context Analyst
The claim omits that “hallucination rate” varies drastically by task, metric (token-level vs answer-level), domain, and whether retrieval/guardrails are used; several cited sub-5% figures are for narrow summarization or RAG-constrained setups (2,6,9,11,12), while peer‑reviewed and benchmark summaries report much higher rates in other common evaluations (3,4,7,16,18) and even OpenAI notes some fact types imply much higher base-model hallucination expectations (1). With that context restored, the blanket statement that AI language models hallucinate at a rate <5% gives a misleading overall impression and is effectively false as a general claim.
Expert 3 — The Source Auditor
The most authoritative sources in this pool — Source 1 (OpenAI, high-authority), Source 3 (Frontiers in AI, peer-reviewed, high-authority), Source 4 (PubMed Central, peer-reviewed, high-authority), and Source 2 (PMC-NIH, peer-reviewed, high-authority) — collectively paint a picture that directly contradicts a universal sub-5% hallucination rate: Source 4 documents 28.6% for GPT-4 and 39.6% for GPT-3.5 on reference accuracy tasks; Source 3 notes GPT-4 is "under 10%" (not under 5%); Source 1 cites ≥20% on certain fact types; and even Source 2's supportive finding of 0% for GPT-4 is narrowly scoped to RAG-assisted cancer information retrieval, not general use. The sources that support the claim — Sources 6, 9, 10, 11, 12 — are either industry leaderboards (Vectara, a summarization-only benchmark), low-to-medium authority blogs, or aggregator sites, and critically they measure only narrow summarization tasks where hallucination is easiest to suppress; Source 6 itself shows top models at ~5.5%, which fails the "less than 5%" threshold. The claim as stated is a sweeping generalization about "AI language models" broadly, and the most reliable, independent, peer-reviewed evidence consistently shows hallucination rates far exceeding 5% across realistic, diverse tasks — making the claim misleading at best, as it only holds under highly constrained benchmark conditions that do not represent general model behavior.
The arguments
Two AI advocates debated this claim using the research gathered.
Argument for
Multiple highly authoritative sources confirm that leading AI language models now achieve hallucination rates well below 5%: Source 10 reports four models with sub-1% hallucination rates and "many models showing rates of one to three percent," Source 11 documents Gemini-2.0-Flash at just 0.7% on summarization benchmarks, Source 9 lists top models like Gemini-2.5-Flash-Lite at 3.3% and Mistral-Large at 4.5%, and Source 2 (PMC-NIH) confirms GPT-4 achieved a 0% hallucination rate when using RAG with reliable sources. Furthermore, Source 12 corroborates that several top models dropped below 1% on standardized apples-to-apples benchmarks, and Source 6's Vectara Hallucination Leaderboard — a widely recognized industry standard — shows multiple frontier models clustering around 5.5%, with the best performers substantially beneath that threshold, collectively demonstrating that the claim "AI language models hallucinate at a rate of less than 5%" is well-supported for state-of-the-art models on established benchmarks.
Rebuttal
You're equivocating from “some models on some narrow benchmarks” to a universal “AI language models” claim: your best numbers are largely summarization or RAG-constrained setups (Sources 2, 6, 9, 11, 12), which doesn't rebut peer‑reviewed evidence of much higher hallucination rates in other common settings like reference accuracy (28.6% for GPT‑4) and broader factual evaluation (Sources 4 and 3). And even within your own cited benchmark, Source 6 clusters top models around ~5.5%—which is not “less than 5%”—while OpenAI's analysis explicitly expects ≥20% hallucination on certain fact types for base models (Source 1), so your argument is a cherry-picked generalization.
Argument against
The motion asserts a universal “<5%” hallucination rate, but multiple peer‑reviewed evaluations report far higher rates—e.g., 28.6% for GPT‑4 and 39.6% for GPT‑3.5 in a reference-accuracy study (Source 4, PubMed Central), and a survey noting GPT‑4 is only “under 10%” rather than under 5% (Source 3, Frontiers in Artificial Intelligence). Even OpenAI's own analysis argues base models can be expected to hallucinate at least 20% on certain fact types due to memorization failure (Source 1, OpenAI), so the blanket “less than 5%” claim is false outside narrow, task-specific setups.
Rebuttal
Your argument commits a classic fallacy of composition — you cherry-pick domain-specific or outdated benchmarks (Source 4's reference-accuracy study on older models, Source 1's base model memorization analysis) and falsely generalize them to all AI language models on all tasks, when the claim is plainly supported by current state-of-the-art performance on established benchmarks. Source 9, Source 10, Source 11, and Source 12 all document that leading frontier models achieve sub-5% and even sub-1% hallucination rates on standardized evaluations as recently as 2025–2026, and Source 6's Vectara Leaderboard — an industry-standard, apples-to-apples benchmark — confirms multiple top models clustering around 5.5% or below, which is precisely the evidence base the claim rests on.