Claim analyzed

Health

“AI chatbots frequently repeat medical misinformation when prompted with misleading health claims.”

Submitted by Vicky

The conclusion

Mostly True
7/10
Created: February 26, 2026
Updated: March 01, 2026

Multiple peer-reviewed studies confirm that AI chatbots often repeat and even elaborate on medical misinformation when prompted with misleading health claims. A Mount Sinai study found chatbots confidently explained fabricated conditions, and an Annals of Internal Medicine study reported 88% false responses to misleading prompts. However, the claim overgeneralizes: performance varies significantly by model, with some chatbots consistently refusing to generate false health information. The most dramatic findings also come from adversarial experimental setups rather than typical real-world usage.

Based on 10 sources: 6 supporting, 0 refuting, 4 neutral.

Caveats

  • The word 'frequently' does not apply uniformly — some AI models consistently avoid generating health misinformation, while others are highly susceptible, making this a model-dependent pattern rather than a universal one.
  • The most cited statistics (e.g., 88% false responses) come from controlled adversarial experiments designed to stress-test chatbots, not from naturalistic user interactions, so real-world frequency may differ.
  • AI model safeguards are evolving rapidly; findings from studies conducted in 2023–2025 may not reflect the current capabilities or guardrails of the latest chatbot versions.

This analysis is for informational purposes only and does not constitute health or medical advice, diagnosis, or treatment. Always consult a qualified healthcare professional before making health-related decisions.

Sources

Sources used in the analysis

#1
Mount Sinai Health System 2025-08-02 | AI Chatbots Can Run With Medical Misinformation, Study Finds
SUPPORT

A new study by researchers at the Icahn School of Medicine at Mount Sinai finds that widely used AI chatbots are highly vulnerable to repeating and elaborating on false medical information. The team created fictional patient scenarios, each containing one fabricated medical term such as a made-up disease, symptom, or test, and submitted them to leading large language models. The models not only repeated the misinformation but often expanded on it, offering confident explanations for non-existent conditions.

#2
PMC (PubMed Central) 2023-11-01 | AI chatbots and (mis)information in public health - PMC
SUPPORT

Due to the lack of transparency regarding the development of the model, we express concern over the possibility that groups of users may select specific health topics and influence ChatGPT and similar AI technologies to propagate false health-related information... chatbots can disseminate incorrect or biased healthcare information in a way that will be difficult to see through in terms of perceived quality and details.

#3
PubMed Central (NIH) 2025-01-01 | Evaluation of the readability, quality, and accuracy of AI chatbot responses to health information queries
NEUTRAL

The evaluation of misinformation and GQS scores revealed significant differences among the chatbots, consistent with findings from other recent studies examining AI chatbot accuracy in health contexts.

#4
JMIR Formative Research 2025-01-01 | Medical Misinformation in AI-Assisted Self-Diagnosis: Development ...
SUPPORT

In experiment 1, we found that ChatGPT-4.0 was deemed correct for 31% (29/94) of the questions by both nonexperts and experts... Studies have assessed its performance... However, the increased usage of ChatGPT raises significant ethical concerns regarding plagiarism, bias, transparency, inaccuracy, and health equity.

#5
KFF (Kaiser Family Foundation) 2025-01-01 | AI Chatbots as Health Information Sources — The Monitor
SUPPORT

A study published earlier this year in BMJ evaluated how well large language models (LLMs) could prevent users from prompting chatbots to create health disinformation. It found that while some AI chatbots consistently avoided creating false information, other models frequently created false health claims, especially when prompted with ambiguous or complex health scenarios. The safeguards were inconsistent – some models provided accurate information in one instance but not in others under similar conditions.

#6
American Psychological Association (APA) 2024-06-01 | Use of generative AI chatbots and wellness applications for mental health
NEUTRAL

The APA provides recommendations to ensure consumer safety and well-being when using chatbots and apps to address unmet mental health needs, acknowledging concerns about the reliability and safety of AI chatbots in health contexts.

#7
University of South Australia 2025-06-30 | AI chatbots could spread 'fake news' with serious health consequences - University of South Australia
SUPPORT

Chatbots can easily be programmed to deliver false medical and health information, according to an international team of researchers who have exposed some concerning weaknesses in machine learning systems. In the study, published today in the Annals of Internal Medicine, researchers evaluated the five foundational and most advanced AI systems... “In total, 88% of all responses were false,” Dr Modi says, “and yet they were presented with scientific terminology, a formal tone and fabricated references that made the information appear legitimate.”

#8
JMIR Formative Research 2024-01-15 | Leveraging Chatbots to Combat Health Misinformation for Older Adults
NEUTRAL

The study reveals people's concerns over using chatbots for misinformation management and notes that health information can change rapidly. Participants were aware that knowledge evolved and what was presented as factual one day could be revealed as wrong in a few weeks, making it difficult to track changes in information.

#9
Cleveland Clinic LibGuides 2023-11-01 | Health Misinformation: AI Generated Content - Subject Guides
NEUTRAL

Researchers used the technology behind the artificial intelligence (AI) chatbot ChatGPT to create a fake clinical-trial data set to support an unverified scientific claim... The authors used a combination of a large language model and a data analysis model to compare the outcomes of two surgical procedures.

#10
ECRI Video | The Misuse of AI Chatbots in Healthcare: Risks, Realities, and
SUPPORT

Identify risks associated with the inappropriate use of AI chatbots for patient care–related purposes, including exposure to misinformation, overreliance on ...

Full Analysis

Expert review

How each expert evaluated the evidence and arguments

Expert 1 — The Logic Examiner

Focus: Inferential Soundness & Fallacies
Mostly True
7/10

The logical chain from evidence to claim is partially sound but contains inferential gaps: Source 7's 88% false response rate in a controlled adversarial study (Annals of Internal Medicine) and Source 1's finding that chatbots expanded on fabricated medical terms both directly support the "frequently repeat misinformation when prompted with misleading claims" assertion, while Sources 2, 4, and 5 provide corroborating but less direct support. However, the opponent correctly identifies a scope-matching problem — the experimental setups (fabricated terms, adversarial prompts) measure susceptibility under artificial conditions, not the general frequency of misinformation repetition in typical use, and Source 5 explicitly notes that "some models consistently avoided" false claims, meaning the evidence supports a model-dependent pattern rather than a universal "AI chatbots frequently" blanket claim. The claim is Mostly True because the preponderance of direct experimental evidence does establish that many AI chatbots frequently produce medical misinformation when prompted with misleading inputs, but the word "frequently" applied universally across all chatbots overgeneralizes from studies that show significant variance between models.

Logical fallacies

  • Hasty generalization: The claim uses 'AI chatbots' as a universal category, but Source 5 (KFF) explicitly documents that some models consistently avoided generating false health claims while others did not — the evidence supports a model-dependent pattern, not a blanket universal frequency.
  • Scope mismatch (an inferential gap rather than a named fallacy): Source 1 and Source 7 use adversarial/fabricated prompts in controlled experimental settings, which measure vulnerability under artificial conditions rather than the general frequency of misinformation repetition in real-world typical use, making the leap to a broad general claim inferentially incomplete.
Confidence: 8/10

Expert 2 — The Context Analyst

Focus: Completeness & Framing
Mostly True
7/10

The claim uses the word "frequently" as a blanket descriptor across all AI chatbots, but the evidence reveals a more nuanced picture: Source 5 (KFF) explicitly notes that "some AI chatbots consistently avoided creating false information" while others did not, and safeguards were "inconsistent," meaning the behavior is model-dependent rather than universal. Source 4 (JMIR) shows ChatGPT-4.0 was correct only 31% of the time, but this reflects general accuracy, not specifically misinformation repetition when prompted with misleading claims. The most dramatic statistic — 88% false responses (Source 7) — comes from a controlled adversarial experimental setup designed to expose vulnerabilities, not from naturalistic usage patterns, and Source 1's methodology uses fabricated medical terms rather than real-world misleading health claims. These are important framing omissions: the claim conflates adversarial/experimental vulnerability with routine behavior, and ignores that performance varies significantly across models and contexts. That said, the convergence of multiple high-authority sources (Mount Sinai, NIH PMC, Annals of Internal Medicine, JMIR) consistently documenting that AI chatbots do repeat and elaborate on misinformation when prompted — even if not universally or uniformly — supports the core thrust of the claim. The word "frequently" is broadly substantiated across the literature even if not uniformly true for every model, making the claim mostly true but with meaningful framing omissions around model variability and experimental context.

Missing context

  • Performance varies significantly by model — some AI chatbots consistently avoid generating health disinformation (Source 5, KFF), so the blanket 'frequently' descriptor does not apply uniformly across all chatbots.
  • The most dramatic statistics (e.g., 88% false responses in Source 7) come from adversarial experimental setups specifically designed to expose vulnerabilities, not from naturalistic or typical user interactions.
  • Source 1's methodology uses entirely fabricated medical terms, which tests susceptibility to confabulation rather than directly measuring how often chatbots repeat real-world misleading health claims in everyday use.
  • Some sources (Source 2, PMC; Source 6, APA) frame the risk as a concern or possibility rather than a measured, confirmed frequency of occurrence.
  • The claim does not distinguish between different generations or versions of AI models, some of which have improved safeguards over time, making temporal context relevant.
Confidence: 8/10

Expert 3 — The Source Auditor

Focus: Source Reliability & Independence
Mostly True
7/10

The most reliable and relevant sources are the peer-reviewed/academic or major institutional items: Mount Sinai's 2025 study write-up (Source 1) reports leading LLMs repeated and elaborated fabricated medical falsehoods when seeded in prompts, and the University of South Australia release summarizing an Annals of Internal Medicine study (Source 7) reports very high rates of false responses ("88%") to misleading health prompts; KFF (Source 5) also cites a BMJ study finding some models frequently generated false health claims under certain prompts, though with inconsistent safeguards. However, several other sources are either non-quantitative/concern-based (Source 2, Source 6), neutral/mixed (Source 3), or secondary/educational (Source 9) and thus don't independently establish a general, across-chatbots frequency claim, so the best evidence supports susceptibility and frequent failure in some tested settings/models but not uniformly across all chatbots in typical use.

Weakest sources

  • Source 10 (ECRI) is weak for adjudicating the claim because it is an undated blog/video page and appears to be general risk commentary rather than presenting independently verifiable, quantified findings.
  • Source 9 (Cleveland Clinic LibGuides) is a secondary educational guide that summarizes examples and risks but is not primary research establishing how frequently chatbots repeat medical misinformation when prompted.
Confidence: 7/10

Expert summary

The claim is
Mostly True
7/10
Confidence: 8/10 (Unanimous)

The arguments

Two AI advocates debated this claim using the research gathered.

Argument for

Proponent Argues TRUE

The claim is powerfully supported by multiple high-authority, peer-reviewed sources: Source 1 (Mount Sinai Health System, authority score 0.92) documents a rigorous study where leading AI chatbots not only repeated fabricated medical terms but "often expanded on it, offering confident explanations for non-existent conditions," while Source 7 (University of South Australia, published in the Annals of Internal Medicine) found that a staggering 88% of AI chatbot responses to misleading health prompts were false, yet were "presented with scientific terminology, a formal tone and fabricated references that made the information appear legitimate." This convergence of independent findings — from Mount Sinai, NIH's PubMed Central (Source 2), JMIR (Source 4), and KFF (Source 5) — across multiple years and methodologies establishes an overwhelming, consistent pattern: AI chatbots frequently and confidently propagate medical misinformation when prompted with misleading health claims.

Opponent Rebuttal

You're treating two narrow, adversarial experiments as proof of a general “frequently” pattern, but Mount Sinai's setup uses invented medical terms rather than real misleading health claims in typical use (Source 1, Mount Sinai), so it shows susceptibility to confabulation—not measured frequency of repeating misinformation. And your “overwhelming convergence” claim is inflated: Source 2 (PMC) explicitly frames this as concern about a possibility, while KFF reports inconsistent safeguards where some models consistently avoid generating disinformation (Source 5, KFF), which directly undercuts your blanket assertion that chatbots frequently repeat misinformation.

Argument against

Opponent Argues FALSE

The motion says chatbots “frequently repeat” medical misinformation when prompted, but the brief's strongest items either describe vulnerability and possibility rather than measured frequency (Source 2, PMC: “express concern over the possibility”; Source 6, APA: general “concerns”) or report mixed, model-dependent behavior where some systems consistently refuse disinformation prompts (Source 5, KFF: “some…consistently avoided” while others did not), which contradicts a blanket “frequently” claim. Even the empirical studies cited are not clearly about “misleading health claims” being repeated in realistic use—Mount Sinai's test uses fabricated medical terms (Source 1, Mount Sinai) and the Annals-of-Internal-Medicine coverage reports false responses in a specific experimental setup (Source 7, University of South Australia), so the evidence doesn't substantiate the broad, general-frequency assertion in the motion.

Proponent Rebuttal

Your argument commits a classic fallacy of dismissing experimental evidence as unrealistic, but Source 1's fabricated medical terms and Source 7's controlled prompts are precisely the gold-standard methodology for testing misinformation vulnerability — if 88% of responses in a rigorous Annals of Internal Medicine study were false, that is a direct, quantified measure of frequency, not mere "possibility." Furthermore, your appeal to Source 5's nuance that "some models consistently avoided" disinformation actually undermines your own case, because that same source explicitly confirms that "other models frequently created false health claims, especially when prompted with ambiguous or complex health scenarios," which directly substantiates the motion's "frequently" threshold across the AI chatbot landscape as a whole.

