Verify any claim · lenz.io
Claim analyzed
Health
“AI chatbots, such as ChatGPT, provide medical advice that is consistently reliable and safe for users.”
The conclusion
The claim that AI chatbots like ChatGPT provide "consistently reliable and safe" medical advice is not supported by the evidence. Multiple high-quality studies from 2024–2026 show ChatGPT gave incorrect advice in over 51% of medical emergencies, exhibited hallucination rates of 50–82%, and correctly identified conditions in fewer than 34.5% of real-world cases. ECRI designated AI chatbot misuse as the top health technology hazard for 2026. While chatbots show promise in narrow, controlled tasks, their performance is neither consistent nor safe for general medical advice.
Caveats
- AI chatbots can hallucinate medical information with high apparent plausibility — studies document hallucination rates between 50% and 82%, meaning users may receive confidently stated but entirely fabricated guidance.
- Strong performance on curated benchmarks or common scenarios does not translate to reliable real-world medical advice; studies of actual user interactions show dramatically lower accuracy rates.
- General-purpose chatbots like ChatGPT are not regulated or validated as medical devices and should not be used as substitutes for professional medical consultation, especially in emergencies.
Sources
Sources used in the analysis
Due to the lack of transparency regarding the development of the model, we express concern over the possibility that groups of users may select specific health topics and influence ChatGPT and similar AI technologies to propagate false health-related information, a phenomenon that is already widespread, e.g., through the use of social media. In contrast to existing internet-based mis- and disinformation, chatbots can disseminate incorrect or biased healthcare information in a way that will be difficult to see through in terms of perceived quality and details.
AI has potential to assist clinicians in making better diagnoses, and has contributed to the fields of drug development, personalized medicine, and patient care monitoring. However, with the deployment of AI in health care, several risks and challenges can emerge at an individual level (eg, awareness, education, trust), macrolevel (eg, regulation and policies, risk of injuries due to AI errors), and technical level (eg, usability, performance, data privacy and security).
In Europe, the new AI Act classifies most AI-based medical devices as high-risk systems. By August 2026, these devices must comply with new conformity assessments, transparency rules, and risk management requirements. In the United States, the FDA continues to release new guidance documents for AI and machine learning devices, emphasizing transparency, validation, and post-market monitoring to ensure that learning algorithms remain safe and effective over time.
Our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients.
Artificial intelligence (AI) chatbot misuse ranks as the top health technology hazard for 2026, according to an annual report from ECRI, an independent, nonpartisan patient safety organization. ECRI cites the rapid adoption of chatbots, their lack of regulatory oversight, and mounting evidence that they can generate unsafe or misleading medical guidance as key reasons for the top rating. In its evaluation, ECRI found examples of chatbots suggesting incorrect diagnoses, recommending unnecessary tests, promoting substandard medical supplies, and even inventing nonexistent anatomy when asked medical questions.
This systematic review and network meta-analysis (NMA) examined 168 articles encompassing 35,896 questions and 3,063 clinical cases. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance in terms of accuracy for objective questions... In terms of accuracy for top 1 diagnosis and top 3 diagnosis of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed well at the top 5 diagnosis.
Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) in certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases.
Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.
During the study, 1,298 participants in the UK were asked to use a large language model, such as ChatGPT or Meta's Llama 3, for medical advice. When used in this way, the LLM correctly identified medical conditions in fewer than 34.5% of cases. After the initial diagnosis, the LLMs provided the correct follow-up steps to the person just 44.2% of the time.
A new study by researchers at the Icahn School of Medicine at Mount Sinai finds that widely used AI chatbots are highly vulnerable to repeating and elaborating on false medical information, revealing a critical need for stronger safeguards before these tools can be trusted in health care. The results revealed hallucination rates between 50 and 82 per cent, with chatbots often elaborating on the fake details as if they were genuine.
Misuse of artificial intelligence-powered chatbots in healthcare has topped ECRI's annual list of the top health technology hazards. The nonprofit ECRI said chatbots built on ChatGPT and other large language models can provide false or misleading information that could result in significant patient harm. While AI chatbots are not validated for healthcare purposes, ECRI said clinicians, patients and healthcare personnel are increasingly using the tools in that context.
People using the AI chatbots were only able to identify their health problem around a third of the time, while only around 45 percent figured out the right course of action. This was no better than the control group, according to the study, published in the Nature Medicine journal. In 52% of emergency cases, the bots 'under-triaged,' meaning treated the ailment as less serious than it was.
ChatGPT ended up providing the right advice for only 35.2 percent of non-urgent conditions and only 48.4 percent of medical emergencies that the research team offered it in the study. For 51.6 percent of the true emergencies (33 out of 54), ChatGPT recommended only 24-to-48-hour observation.
The largest user study of large language models (LLMs) for assisting the general public in medical decisions has found that they present risks to people seeking medical advice due to their tendency to provide inaccurate and inconsistent information. 'Despite all the hype, AI just isn't ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.'
In a study published recently in the journal Nature Medicine, researchers tried to simulate how people use AI chatbots by giving participants medical scenarios and asking them to consult AI tools. After conversing with the bots, participants correctly identified the hypothetical condition only about a third of the time. Only 43% made the correct decision about next steps, such as whether to go to the emergency room or stay home. In 52% of emergency cases, the bots "under-triaged," meaning treated the ailment as less serious than it was.
ChatGPT and Gemini both demonstrated potential for generating medical information. Despite their current limitations, both showed promise as complementary tools in patient education and clinical decision-making. Their accuracy and reliability can vary, and they often lack the completeness and adherence to guidelines that traditional sources provide.
While AI may serve as a beneficial tool in efforts to dispel misinformation, it may also increase the spread of false or misleading claims if misused. Notably, when it comes to information provided by AI chatbots, most adults (56%) – including half of AI users – are not confident that they can tell the difference between what is true and what is false.
If an AI system makes an error here, the impact can be severe: delayed treatment, missed diagnosis, privacy exposure, or unfair care decisions for specific patient groups. That's why “AI risk” in healthcare isn't just a technical topic. It's a patient safety topic.
Chatbots can easily be programmed to deliver false medical and health information, according to an international team of researchers who have exposed some concerning weaknesses in machine learning systems. In total, 88% of all responses were false, and yet they were presented with scientific terminology, a formal tone and fabricated references that made the information appear legitimate.
Hybrid chatbots in healthcare have shown significant benefits, such as reducing hospital readmissions by up to 25%, improving patient engagement by 30%, and cutting consultation wait times by 15%. They are widely used for chronic disease management, mental health support, and patient education, demonstrating their efficiency in both developed and developing countries. However, gaps remain in trust, data security, system integration, and user experience, which hinder widespread adoption.
AI medical advisors in 2025 offer unprecedented accessibility but require careful consideration of limitations. These systems excel at pattern recognition and data processing but struggle with nuanced clinical judgment. Safety depends on using AI as a supplement to, not replacement for, human medical care. The technology also struggles with complete patient complexity, including psychological factors and social determinants of health significantly impacting outcomes.
In February 2025, a new study from Stanford made the headline “Physicians make better decisions with the help of AI chatbots.” What was once considered a flashy add-on is now becoming a serious and functional part of healthcare business operations. AI-powered chatbots can assist in alleviating staffing shortages by automating administrative tasks and low-risk clinical duties.
Expert review
How each expert evaluated the evidence and arguments
The proponent infers “consistently reliable and safe” from evidence of high accuracy in constrained settings (objective questions in a systematic review and “common scenarios” in a performance study: Sources 6–7), plus general operational benefits of chatbots (Source 20). That chain does not establish consistency or safety for users in realistic medical-advice use, because it shifts scope from narrow benchmarks to broad real-world advising and ignores documented failure modes. The opponent's evidence directly targets reliability and safety in medical-advice contexts, showing frequent inaccuracies, hallucination vulnerability, under-triage, and explicit safety-risk conclusions (Sources 4–5, 9–15). The claim that such advice is consistently reliable and safe is therefore contradicted by the evidence and is false.
The claim's framing (“consistently reliable and safe”) omits that performance is highly task-, prompt-, and context-dependent and that real-world user studies and safety analyses report frequent inaccuracies, hallucinations, and under-triage risks, plus lack of validation/regulatory oversight for general-purpose chatbots (Sources 4, 5, 10, 11, 14). Even though some studies show strong accuracy on narrow, objective questions or selected/common scenarios (Sources 6, 7) and potential benefits in supportive roles (Source 20), the full context shows reliability is not consistent and safety is not assured for users, so the overall impression is false.
The most reliable, independent evidence in the pool is the peer‑reviewed/academic literature and major patient-safety bodies: PMC/NIH (Source 4) concludes current LLMs are not ready for autonomous clinical decision-making and can pose serious patient risk, Mount Sinai researchers (Source 10) find high vulnerability to medical misinformation with very high hallucination rates, and ECRI reporting via RISE (Source 5) and MedTech Dive (Source 11) flags chatbot misuse as a top safety hazard—together directly contradicting “consistently reliable and safe.” While PubMed-indexed studies (Sources 6 and 7) show strong performance in some constrained tasks or “common scenarios,” they do not establish consistent reliability/safety for users broadly and are outweighed by higher-salience safety findings and real-world risk evidence, so the claim is false as stated.
The arguments
Two AI advocates debated this claim using the research gathered.
Multiple peer-reviewed evaluations show that leading LLM chatbots can answer many medical questions with high accuracy—e.g., a large systematic review found ChatGPT-4o performed strongly on objective clinical questions (Source 6, PubMed: JMIR systematic review), and a separate performance study reported advanced models achieving >90% diagnostic accuracy in common scenarios (Source 7, PubMed). Given this demonstrated high accuracy in typical use-cases and the documented benefits of healthcare chatbots in patient education and chronic-disease support (Source 20, PMC), it is reasonable to conclude that AI chatbots like ChatGPT can provide medical advice that is consistently reliable and safe for users when used as intended.
Your selective citation of Source 6 and Source 7 commits a classic cherry-picking fallacy — Source 6 itself explicitly ranks human experts highest for clinical diagnosis, and Source 7's "common scenarios" performance tells us nothing about safety in the real-world conditions documented by Source 9 (CNET), where LLMs correctly identified conditions fewer than 34.5% of the time in a study of 1,298 actual users, and Source 13 (Forbes), where ChatGPT failed in over 51% of true medical emergencies. Your appeal to Source 20's "benefits" in chronic disease management cannot rescue the claim of "consistently reliable and safe" advice when Source 5 (RISE) and Source 11 (MedTech Dive) confirm that ECRI — an independent patient safety body — designated AI chatbot misuse the single top health technology hazard for 2026 precisely because these tools lack regulatory validation and generate unsafe guidance at scale.
The claim that AI chatbots provide "consistently reliable and safe" medical advice is decisively refuted by an overwhelming body of high-authority evidence: Source 9 (CNET) reveals that LLMs correctly identified medical conditions in fewer than 34.5% of cases, Source 13 (Forbes) shows ChatGPT gave correct advice in only 35.2% of non-urgent conditions and failed in over 51% of true medical emergencies, and Source 10 (Mount Sinai) documents hallucination rates between 50% and 82%, with chatbots elaborating on false medical information as if it were genuine. Furthermore, Source 5 (RISE) and Source 11 (MedTech Dive) report that ECRI, an independent patient safety organization, ranked AI chatbot misuse as the single top health technology hazard for 2026, explicitly citing the lack of regulatory validation and mounting evidence of unsafe guidance, while Source 4 (PMC-NIH) confirms that current LLMs "do not accurately diagnose patients across all pathologies, performing significantly worse than physicians." That makes the word "consistently" in the claim not just inaccurate but dangerously misleading.
You're treating “chatbots” as a monolith and then smuggling those worst-case outcomes into a blanket conclusion about ChatGPT's medical advice, but the higher-quality comparative evidence you ignore shows strong accuracy on objective clinical questions (Source 6, PubMed: JMIR systematic review) and >90% diagnostic accuracy in common scenarios for leading models (Source 7, PubMed), which directly undercuts your claim that unreliability is decisive and universal. You also commit a category error by citing hallucination vulnerability and “misuse” hazard warnings (Source 10, Mount Sinai; Sources 5 and 11 on ECRI) as if they prove intended-use guidance is inherently unsafe, when even the clinical decision-making critique you cite targets autonomous decision-making across all pathologies (Source 4, PMC-NIH) rather than the narrower, common, educational and support use-cases where benefits and safe complementarity are documented (Source 20, PMC).