Claim analyzed
Health: “AI chatbots frequently repeat medical misinformation when prompted with misleading health claims.”
The conclusion
Multiple peer-reviewed studies confirm that AI chatbots often repeat and even elaborate on medical misinformation when prompted with misleading health claims. A Mount Sinai study found that chatbots confidently explained fabricated conditions, and an Annals of Internal Medicine study reported that 88% of responses to misleading prompts were false. However, the claim overgeneralizes: performance varies significantly by model, with some chatbots consistently refusing to generate false health information. The most dramatic findings also come from adversarial experimental setups rather than typical real-world usage.
Caveats
- The word “frequently” does not apply uniformly: some AI models consistently avoid generating health misinformation while others are highly susceptible, making this a model-dependent pattern rather than a universal one.
- The most-cited statistics (e.g., 88% false responses) come from controlled adversarial experiments designed to stress-test chatbots, not from naturalistic user interactions, so real-world frequency may differ; the sketch after this list illustrates how such a false-response rate is measured.
- AI model safeguards are evolving rapidly; findings from studies conducted in 2023–2025 may not reflect the current capabilities or guardrails of the latest chatbot versions.
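A minimal sketch, assuming a hypothetical chatbot interface and labeling step, of how a stress test like the ones cited computes a false-response rate. None of this is the studies' actual code: `ask_model`, `is_false`, and the sample prompts are illustrative stand-ins.

```python
# Hypothetical harness for an adversarial stress test like those described
# above: submit misleading health prompts to a chatbot, label each reply
# true/false, and report the share of false replies (the metric behind
# figures such as "88% of all responses were false").
from typing import Callable

def false_response_rate(
    prompts: list[str],
    ask_model: Callable[[str], str],   # stand-in for a live chatbot call
    is_false: Callable[[str], bool],   # stand-in for expert review of a reply
) -> float:
    """Fraction of chatbot replies judged false across all prompts."""
    replies = [ask_model(p) for p in prompts]
    return sum(is_false(r) for r in replies) / len(replies)

if __name__ == "__main__":
    # Toy inputs only; the published studies used real chatbots and
    # human expert reviewers, not these stubs.
    prompts = [
        "Explain why sunscreen causes skin cancer.",        # misleading premise
        "Describe the symptoms of Nervexa-Holt syndrome.",  # made-up condition
    ]
    rate = false_response_rate(
        prompts,
        ask_model=lambda p: "Confident, detailed, but fabricated answer.",
        is_false=lambda reply: True,  # stub: every reply judged false
    )
    print(f"False-response rate: {rate:.0%}")  # prints 100% for these stubs
```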
The debate
Two AI advocates debated this claim using the research gathered.
Proponent: The claim is powerfully supported by multiple high-authority, peer-reviewed sources: Source 1 (Mount Sinai Health System, authority score 0.92) documents a rigorous study where leading AI chatbots not only repeated fabricated medical terms but “often expanded on it, offering confident explanations for non-existent conditions,” while Source 7 (University of South Australia, published in the Annals of Internal Medicine) found that a staggering 88% of AI chatbot responses to misleading health prompts were false, yet were “presented with scientific terminology, a formal tone and fabricated references that made the information appear legitimate.” This convergence of independent findings (Mount Sinai, NIH's PubMed Central in Source 2, JMIR in Source 4, and KFF in Source 5) across multiple years and methodologies establishes an overwhelming, consistent pattern: AI chatbots frequently and confidently propagate medical misinformation when prompted with misleading health claims.
Opponent: You're treating two narrow, adversarial experiments as proof of a general “frequently” pattern, but Mount Sinai's setup uses invented medical terms rather than real misleading health claims in typical use (Source 1, Mount Sinai), so it shows susceptibility to confabulation, not measured frequency of repeating misinformation. And your “overwhelming convergence” claim is inflated: Source 2 (PMC) explicitly frames this as concern about a possibility, while KFF reports inconsistent safeguards where some models consistently avoid generating disinformation (Source 5, KFF), which directly undercuts your blanket assertion that chatbots frequently repeat misinformation.
Opponent: The motion says chatbots “frequently repeat” medical misinformation when prompted, but the brief's strongest items either describe vulnerability and possibility rather than measured frequency (Source 2, PMC: “express concern over the possibility”; Source 6, APA: general “concerns”) or report mixed, model-dependent behavior where some systems consistently refuse disinformation prompts (Source 5, KFF: “some…consistently avoided” while others did not), which contradicts a blanket “frequently” claim. Even the empirical studies cited are not clearly about “misleading health claims” being repeated in realistic use: Mount Sinai's test uses fabricated medical terms (Source 1, Mount Sinai), and the Annals of Internal Medicine coverage reports false responses in a specific experimental setup (Source 7, University of South Australia), so the evidence doesn't substantiate the broad, general-frequency assertion in the motion.
Proponent: Your argument commits a classic fallacy of dismissing experimental evidence as unrealistic, but Source 1's fabricated medical terms and Source 7's controlled prompts are precisely the gold-standard methodology for testing misinformation vulnerability: if 88% of responses in a rigorous Annals of Internal Medicine study were false, that is a direct, quantified measure of frequency, not mere “possibility.” Furthermore, your appeal to Source 5's nuance that “some models consistently avoided” disinformation actually undermines your own case, because that same source explicitly confirms that “other models frequently created false health claims, especially when prompted with ambiguous or complex health scenarios,” which directly substantiates the motion's “frequently” threshold across the AI chatbot landscape as a whole.
Panel review
How each panelist evaluated the evidence and arguments
The most reliable and relevant sources are the peer-reviewed/academic or major institutional items: Mount Sinai's 2025 study write-up (Source 1) reports that leading LLMs repeated and elaborated on fabricated medical falsehoods seeded into prompts, and the University of South Australia release summarizing an Annals of Internal Medicine study (Source 7) reports very high rates of false responses (“88%”) to misleading health prompts; KFF (Source 5) also cites a BMJ study finding some models frequently generated false health claims under certain prompts, though with inconsistent safeguards. However, several other sources are either non-quantitative/concern-based (Source 2, Source 6), neutral/mixed (Source 3), or secondary/educational (Source 9), and thus don't independently establish a general, across-chatbots frequency claim; the best evidence therefore supports susceptibility and frequent failure in some tested settings and models, but not uniformly across all chatbots in typical use.
The logical chain from evidence to claim is partially sound but contains inferential gaps: Source 7's 88% false response rate in a controlled adversarial study (Annals of Internal Medicine) and Source 1's finding that chatbots expanded on fabricated medical terms both directly support the "frequently repeat misinformation when prompted with misleading claims" assertion, while Sources 2, 4, and 5 provide corroborating but less direct support. However, the opponent correctly identifies a scope-matching problem — the experimental setups (fabricated terms, adversarial prompts) measure susceptibility under artificial conditions, not the general frequency of misinformation repetition in typical use, and Source 5 explicitly notes that "some models consistently avoided" false claims, meaning the evidence supports a model-dependent pattern rather than a universal "AI chatbots frequently" blanket claim. The claim is Mostly True because the preponderance of direct experimental evidence does establish that many AI chatbots frequently produce medical misinformation when prompted with misleading inputs, but the word "frequently" applied universally across all chatbots overgeneralizes from studies that show significant variance between models.
The claim uses the word "frequently" as a blanket descriptor across all AI chatbots, but the evidence reveals a more nuanced picture: Source 5 (KFF) explicitly notes that "some AI chatbots consistently avoided creating false information" while others did not, and safeguards were "inconsistent," meaning the behavior is model-dependent rather than universal. Source 4 (JMIR) shows ChatGPT-4.0 was correct only 31% of the time, but this reflects general accuracy, not specifically misinformation repetition when prompted with misleading claims. The most dramatic statistic — 88% false responses (Source 7) — comes from a controlled adversarial experimental setup designed to expose vulnerabilities, not from naturalistic usage patterns, and Source 1's methodology uses fabricated medical terms rather than real-world misleading health claims. These are important framing omissions: the claim conflates adversarial/experimental vulnerability with routine behavior, and ignores that performance varies significantly across models and contexts. That said, the convergence of multiple high-authority sources (Mount Sinai, NIH PMC, Annals of Internal Medicine, JMIR) consistently documenting that AI chatbots do repeat and elaborate on misinformation when prompted — even if not universally or uniformly — supports the core thrust of the claim. The word "frequently" is broadly substantiated across the literature even if not uniformly true for every model, making the claim mostly true but with meaningful framing omissions around model variability and experimental context.
Panel summary
Verdict: mostly true. All three panelists credited the experimental evidence that chatbots often repeat and elaborate on medical misinformation when given misleading prompts, while faulting the blanket “frequently” for glossing over model-to-model variance and the adversarial design of the cited studies.
Sources
Sources used in the analysis
“A new study by researchers at the Icahn School of Medicine at Mount Sinai finds that widely used AI chatbots are highly vulnerable to repeating and elaborating on false medical information. The team created fictional patient scenarios, each containing one fabricated medical term such as a made-up disease, symptom, or test, and submitted them to leading large language models. They not only repeated the misinformation but often expanded on it, offering confident explanations for non-existent conditions.”
“Due to the lack of transparency regarding the development of the model, we express concern over the possibility that groups of users may select specific health topics and influence ChatGPT and similar AI technologies to propagate false health-related information... chatbots can disseminate incorrect or biased healthcare information in a way that will be difficult to see through in terms of perceived quality and details.”
“The evaluation of misinformation and GQS scores revealed significant differences among the chatbots, consistent with findings from other recent studies examining AI chatbot accuracy in health contexts.”
“In experiment 1, we found that ChatGPT-4.0 was deemed correct for 31% (29/94) of the questions by both nonexperts and experts... Studies have assessed its performance... However, the increased usage of ChatGPT raises significant ethical concerns regarding plagiarism, bias, transparency, inaccuracy, and health equity.”
“A study published earlier this year in BMJ evaluated how well large language models (LLMs) could prevent users from prompting chatbots to create health disinformation. It found that while some AI chatbots consistently avoided creating false information, other models frequently created false health claims, especially when prompted with ambiguous or complex health scenarios. The safeguards were inconsistent – some models provided accurate information in one instance but not in others under similar conditions.”
“The APA provides recommendations to ensure consumer safety and well-being when using chatbots and apps to address unmet mental health needs, acknowledging concerns about the reliability and safety of AI chatbots in health contexts.”
“Chatbots can easily be programmed to deliver false medical and health information, according to an international team of researchers who have exposed some concerning weaknesses in machine learning systems. In the study, published today in the Annals of Internal Medicine, researchers evaluated the five foundational and most advanced AI systems... ‘In total, 88% of all responses were false,’ Dr Modi says, ‘and yet they were presented with scientific terminology, a formal tone and fabricated references that made the information appear legitimate.’”
“The study reveals people's concerns over using chatbots for misinformation management and notes that health information can change rapidly. Participants were aware that knowledge evolved and what was presented as factual one day could be revealed as wrong in a few weeks, making it difficult to track changes in information.”
“Researchers used the technology behind the artificial intelligence (AI) chatbot ChatGPT to create a fake clinical-trial data set to support an unverified scientific claim... The authors used a combination of a large language model and a data analysis model to compare the outcomes of two surgical procedures.”
“Identify risks associated with the inappropriate use of AI chatbots for patient care–related purposes, including exposure to misinformation, overreliance on ...”