Verify any claim · lenz.io
Claim analyzed
Health
“AI chatbots, such as ChatGPT, provide medical advice that is consistently reliable and safe for users.”
The conclusion
The claim that AI chatbots like ChatGPT provide "consistently reliable and safe" medical advice is not supported by the evidence. Multiple high-quality studies from 2024–2026 show ChatGPT gave incorrect advice in over 51% of medical emergencies, exhibited hallucination rates of 50–82%, and correctly identified conditions in fewer than 34.5% of real-world cases. ECRI designated AI chatbot misuse as the top health technology hazard for 2026. While chatbots show promise in narrow, controlled tasks, their performance is neither consistent nor safe for general medical advice.
Caveats
- AI chatbots can hallucinate medical information with high apparent plausibility — studies document hallucination rates between 50% and 82%, meaning users may receive confidently stated but entirely fabricated guidance.
- Strong performance on curated benchmarks or common scenarios does not translate to reliable real-world medical advice; studies of actual user interactions show dramatically lower accuracy rates.
- General-purpose chatbots like ChatGPT are not regulated or validated as medical devices and should not be used as substitutes for professional medical consultation, especially in emergencies.
Sources
Sources used in the analysis
Due to the lack of transparency regarding the development of the model, we express concern over the possibility that groups of users may select specific health topics and influence ChatGPT and similar AI technologies to propagate false health-related information, a phenomenon that is already widespread, e.g., through the use of social media. In contrast to existing internet-based mis- and disinformation, chatbots can disseminate incorrect or biased healthcare information in a way that will be difficult to see through in terms of perceived quality and details.
AI has potential to assist clinicians in making better diagnoses, and has contributed to the fields of drug development, personalized medicine, and patient care monitoring. However, with the deployment of AI in health care, several risks and challenges can emerge at an individual level (eg, awareness, education, trust), macrolevel (eg, regulation and policies, risk of injuries due to AI errors), and technical level (eg, usability, performance, data privacy and security).
In Europe, the new AI Act classifies most AI-based medical devices as high-risk systems. By August 2026, these devices must comply with new conformity assessments, transparency rules, and risk management requirements. In the United States, the FDA continues to release new guidance documents for AI and machine learning devices, emphasizing transparency, validation, and post-market monitoring to ensure that learning algorithms remain safe and effective over time.
Our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients.
Artificial intelligence (AI) chatbot misuse ranks as the top health technology hazard for 2026, according to an annual report from ECRI, an independent, nonpartisan patient safety organization. ECRI cites the rapid adoption of chatbots, their lack of regulatory oversight, and mounting evidence that they can generate unsafe or misleading medical guidance as key reasons for the top rating. In its evaluation, ECRI found examples of chatbots suggesting incorrect diagnoses, recommending unnecessary tests, promoting substandard medical supplies, and even inventing nonexistent anatomy when asked medical questions.
This systematic review and network meta-analysis (NMA) examined 168 articles encompassing 35,896 questions and 3,063 clinical cases. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance in terms of accuracy for objective questions... In terms of accuracy for top 1 diagnosis and top 3 diagnosis of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed well at the top 5 diagnosis.
Advanced LLMs showed high diagnostic accuracy (>90%) in common scenarios, with Claude 3.7 achieving perfect accuracy (100%) in certain conditions. In complex cases, Claude 3.7 achieved the highest accuracy (83.3%) at the final diagnostic stage, significantly outperforming smaller models. Leading LLMs show remarkable diagnostic accuracy in diverse clinical cases.
Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.
During the study, 1,298 participants in the UK were asked to use a large language model, such as ChatGPT or Meta's Llama 3, for medical advice. When used in this way, the LLM correctly identified medical conditions in fewer than 34.5% of cases. After the initial diagnosis, the LLMs provided the correct follow-up steps to the person just 44.2% of the time.
A new study by researchers at the Icahn School of Medicine at Mount Sinai finds that widely used AI chatbots are highly vulnerable to repeating and elaborating on false medical information, revealing a critical need for stronger safeguards before these tools can be trusted in health care. The results revealed hallucination rates between 50 and 82 per cent, with chatbots often elaborating on the fake details as if they were genuine.
Misuse of artificial intelligence-powered chatbots in healthcare has topped ECRI's annual list of the top health technology hazards. The nonprofit ECRI said chatbots built on ChatGPT and other large language models can provide false or misleading information that could result in significant patient harm. While AI chatbots are not validated for healthcare purposes, ECRI said clinicians, patients and healthcare personnel are increasingly using the tools in that context.
People using the AI chatbots were only able to identify their health problem around a third of the time, while only around 45 percent figured out the right course of action. This was no better than the control group, according to the study, published in the Nature Medicine journal. In 52% of emergency cases, the bots 'under-triaged,' meaning treated the ailment as less serious than it was.
ChatGPT ended up providing the right advice for only 35.2 percent of non-urgent conditions and only 48.4 percent of medical emergencies that the research team offered it in the study. For 51.6 percent of the true emergencies (33 out of 54), ChatGPT recommended only 24-to-48-hour observation.
The largest user study of large language models (LLMs) for assisting the general public in medical decisions has found that they present risks to people seeking medical advice due to their tendency to provide inaccurate and inconsistent information. 'Despite all the hype, AI just isn't ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed.'
In a study published recently in the journal Nature Medicine, researchers tried to simulate how people use AI chatbots by giving participants medical scenarios and asking them to consult AI tools. After conversing with the bots, participants correctly identified the hypothetical condition only about a third of the time. Only 43% made the correct decision about next steps, such as whether to go to the emergency room or stay home. In 52% of emergency cases, the bots "under-triaged," meaning treated the ailment as less serious than it was.
ChatGPT and Gemini both demonstrated potential for generating medical information. Despite their current limitations, both showed promise as complementary tools in patient education and clinical decision-making. Their accuracy and reliability can vary, and they often lack the completeness and adherence to guidelines that traditional sources provide.
While AI may serve as a beneficial tool in efforts to dispel misinformation, it may also increase the spread of false or misleading claims if misused. Notably, when it comes to information provided by AI chatbots, most adults (56%) – including half of AI users – are not confident that they can tell the difference between what is true and what is false.
If an AI system makes an error here, the impact can be severe: delayed treatment, missed diagnosis, privacy exposure, or unfair care decisions for specific patient groups. That's why “AI risk” in healthcare isn't just a technical topic. It's a patient safety topic.
Chatbots can easily be programmed to deliver false medical and health information, according to an international team of researchers who have exposed some concerning weaknesses in machine learning systems. In total, 88% of all responses were false, and yet they were presented with scientific terminology, a formal tone and fabricated references that made the information appear legitimate.
Hybrid chatbots in healthcare have shown significant benefits, such as reducing hospital readmissions by up to 25%, improving patient engagement by 30%, and cutting consultation wait times by 15%. They are widely used for chronic disease management, mental health support, and patient education, demonstrating their efficiency in both developed and developing countries. However, gaps remain in trust, data security, system integration, and user experience, which hinder widespread adoption.
AI medical advisors in 2025 offer unprecedented accessibility but require careful consideration of limitations. These systems excel at pattern recognition and data processing but struggle with nuanced clinical judgment. Safety depends on using AI as a supplement to, not replacement for, human medical care. The technology also struggles with complete patient complexity, including psychological factors and social determinants of health significantly impacting outcomes.
In February 2025, a new study from Stanford made the headline “Physicians make better decisions with the help of AI chatbots.” What was once considered a flashy add-on is now becoming a serious and functional part of healthcare business operations. AI-powered chatbots can assist in alleviating staffing shortages by automating administrative tasks and low-risk clinical duties.
Expert review
How each expert evaluated the evidence and arguments
The proponent infers “consistently reliable and safe” from evidence of high accuracy in constrained settings (objective questions in a systematic review and “common scenarios” in a performance study: Sources 6–7), plus general operational benefits of chatbots (Source 20). That chain does not establish consistency or safety for users in realistic medical-advice use, because it shifts scope from narrow benchmarks to broad real-world advising and ignores documented failure modes. The opponent's evidence directly targets reliability and safety in medical-advice contexts, showing frequent inaccuracies, hallucination vulnerability, under-triage, and explicit safety-risk conclusions (Sources 4–5, 9–15). The claim that such advice is consistently reliable and safe is therefore contradicted by the evidence and is false.
The claim's framing (“consistently reliable and safe”) omits that performance is highly task-, prompt-, and context-dependent and that real-world user studies and safety analyses report frequent inaccuracies, hallucinations, and under-triage risks, plus lack of validation/regulatory oversight for general-purpose chatbots (Sources 4, 5, 10, 11, 14). Even though some studies show strong accuracy on narrow, objective questions or selected/common scenarios (Sources 6, 7) and potential benefits in supportive roles (Source 20), the full context shows reliability is not consistent and safety is not assured for users, so the overall impression is false.
The most reliable, independent evidence in the pool is the peer‑reviewed/academic literature and major patient-safety bodies: PMC/NIH (Source 4) concludes current LLMs are not ready for autonomous clinical decision-making and can pose serious patient risk, Mount Sinai researchers (Source 10) find high vulnerability to medical misinformation with very high hallucination rates, and ECRI reporting via RISE (Source 5) and MedTech Dive (Source 11) flags chatbot misuse as a top safety hazard—together directly contradicting “consistently reliable and safe.” While PubMed-indexed studies (Sources 6 and 7) show strong performance in some constrained tasks or “common scenarios,” they do not establish consistent reliability/safety for users broadly and are outweighed by higher-salience safety findings and real-world risk evidence, so the claim is false as stated.
The arguments
Two AI advocates debated this claim using the research gathered.
Multiple peer-reviewed evaluations show that leading LLM chatbots can answer many medical questions with high accuracy—e.g., a large systematic review found ChatGPT-4o performed strongly on objective clinical questions (Source 6, PubMed: JMIR systematic review), and a separate performance study reported advanced models achieving >90% diagnostic accuracy in common scenarios (Source 7, PubMed). Given this demonstrated high accuracy in typical use-cases and the documented benefits of healthcare chatbots in patient education and chronic-disease support (Source 20, PMC), it is reasonable to conclude that AI chatbots like ChatGPT can provide medical advice that is consistently reliable and safe for users when used as intended.
Your selective citation of Source 6 and Source 7 commits a classic cherry-picking fallacy — Source 6 itself explicitly ranks human experts highest for clinical diagnosis, and Source 7's "common scenarios" performance tells us nothing about safety in the real-world conditions documented by Source 9 (CNET), where LLMs correctly identified conditions fewer than 34.5% of the time in a study of 1,298 actual users, and Source 13 (Forbes), where ChatGPT failed in over 51% of true medical emergencies. Your appeal to Source 20's "benefits" in chronic disease management cannot rescue the claim of "consistently reliable and safe" advice when Source 5 (RISE) and Source 11 (MedTech Dive) confirm that ECRI — an independent patient safety body — designated AI chatbot misuse the single top health technology hazard for 2026 precisely because these tools lack regulatory validation and generate unsafe guidance at scale.
The claim that AI chatbots provide "consistently reliable and safe" medical advice is decisively refuted by an overwhelming body of high-authority evidence: Source 9 (CNET) reveals that LLMs correctly identified medical conditions in fewer than 34.5% of cases, Source 13 (Forbes) shows ChatGPT gave correct advice in only 35.2% of non-urgent conditions and failed in over 51% of true medical emergencies, and Source 10 (Mount Sinai) documents hallucination rates between 50% and 82%, with chatbots elaborating on false medical information as if it were genuine. Furthermore, Source 5 (RISE) and Source 11 (MedTech Dive) report that ECRI, an independent patient safety organization, ranked AI chatbot misuse as the single top health technology hazard for 2026, explicitly citing the lack of regulatory validation and mounting evidence of unsafe guidance, while Source 4 (PMC-NIH) confirms that current LLMs "do not accurately diagnose patients across all pathologies, performing significantly worse than physicians." That makes the word "consistently" in the claim not just inaccurate but dangerously misleading.
You're treating “chatbots” as a monolith and then smuggling those worst-case outcomes into a blanket conclusion about ChatGPT's medical advice, but the higher-quality comparative evidence you ignore shows strong accuracy on objective clinical questions (Source 6, PubMed: JMIR systematic review) and >90% diagnostic accuracy in common scenarios for leading models (Source 7, PubMed), which directly undercuts your claim that unreliability is decisive and universal. You also commit a category error by citing hallucination vulnerability and “misuse” hazard warnings (Source 10, Mount Sinai; Sources 5 and 11 on ECRI) as if they prove intended-use guidance is inherently unsafe, when even the clinical decision-making critique you cite targets autonomous decision-making across all pathologies (Source 4, PMC-NIH) rather than the narrower, common, educational and support use-cases where benefits and safe complementarity are documented (Source 20, PMC).