Claim analyzed

Tech

“Substantive disagreements between AI models on fact-checking outcomes are common.”

Submitted by Patient Heron 0fa7

True
9/10

Evidence from multiple studies shows that AI fact-checking models often reach materially different verdicts on the same claim, with reported substantive conflicts typically falling in the 15% to 30% range on challenging datasets. That is frequent enough to count as common in real-world use. Rates vary with claim difficulty, ambiguity, prompting, and evidence quality.

Caveats

  • 'Common' should not be read as 'most claims' in every domain; disagreement is concentrated on harder, ambiguous, or politically contentious claims.
  • Some studies showing different overall label distributions do not, by themselves, prove claim-by-claim disagreement; the strongest evidence comes from direct pairwise conflict analyses.
  • Disagreement rates are sensitive to experimental setup, including prompt design, provided evidence, language, and benchmark composition.
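
The second caveat can be made concrete with a toy example. The sketch below uses made-up verdicts (not data from any cited study) and hypothetical function names; it only illustrates that label distributions alone do not determine how often two models conflict on individual claims, which is why pairwise analyses carry more weight.

```python
from collections import Counter


def label_distribution(labels):
    """Share of each verdict label in a list of verdicts."""
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in sorted(counts)}


def pairwise_disagreement(a, b):
    """Fraction of claims on which two models give different verdicts."""
    if len(a) != len(b):
        raise ValueError("both models must label the same claims")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical verdicts on the same six claims (illustrative only).
model_a = ["true", "true", "true", "false", "false", "false"]
model_b = ["true", "true", "true", "false", "false", "false"]
model_c = ["false", "false", "true", "true", "true", "false"]

# All three models share the same 50/50 label distribution.
assert label_distribution(model_a) == label_distribution(model_b) == label_distribution(model_c)

# Claim-by-claim disagreement nevertheless differs sharply.
print(pairwise_disagreement(model_a, model_b))  # no conflicts
print(pairwise_disagreement(model_a, model_c))  # conflicts on 4 of 6 claims
```
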

Sources

Sources used in the analysis

#1
arXiv 2024-10-19 | Efficient Annotator Reliability Assessment and Sample Weighting for ...

These datasets often employ multiple annotators to measure inter-annotator agreement; most datasets report relatively high levels of human disagreement. The high disagreement is mainly due to the high complexity of the task, which introduces subjectivity during annotation. According to kappa values reported in prior studies, agreement ranges from 0.75 for SciFact to 0.68 for another cited dataset, showing that even expert annotation can vary substantially.

#2
PubMed Central 2023-03-02 | Assessing Inter-Annotator Agreement for Medical Image Segmentation

This study assesses inter-annotator agreement among multiple expert annotators and discusses how disagreement is expected even when annotators are highly trained. It frames agreement as something that must be measured, not assumed, because annotators can differ on the same items even in expert settings.

#3
ROMCIR / University of Milano-Bicocca 2025-04-10 | Towards Automated Fact-Checking of Real-World Claims

The paper evaluates multiple large language models (LLMs) on real-world fact-checking datasets (e.g., PolitiFact, ClimateFeedback) under the same experimental protocol. It reports that models such as GPT‑4, GPT‑3.5, LLaMA‑2, and others obtain substantially different label distributions and accuracies on the same sets of claims, with some models being more prone to predicting "false" and others more often predicting "true" or abstaining. The authors highlight that these discrepancies in model behavior complicate the use of LLMs as drop‑in replacements for human fact‑checkers and motivate careful model selection and calibration.

#4
NPJ Digital Medicine (via PubMed Central) 2024-01-26 | The perils and promises of fact-checking with large language models

The study compares GPT‑3.5 and GPT‑4 on claim verification tasks using PolitiFact and other fact-checking corpora. It states: "Our evaluation shows that GPT-4 significantly outperforms GPT-3.5 at fact-checking claims" and notes that GPT‑3.5 predicts claims to be false in 58.2% of cases, compared with 22.89% for GPT‑4, despite being evaluated on the same labeled claims. The authors emphasize that while both models achieve similar overall accuracy ranges (around 64–71%), their calibration and tendency to assign particular verdicts differ markedly, especially across different veracity categories and languages.

#5
arXiv 2025-06-05 | A Multilingual, Comparative Analysis of LLM-Based Fact-Checking From Check-Worthiness to Verdict

This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing resources to enable comparative analysis of Large Language Models (LLMs) across the full fact-checking pipeline. We systematically investigate the performance of multiple LLMs on all fact-checking subtasks and languages, and compare LLM-based fact-checking effectiveness with traditional deep learning models. Our findings reveal that while LLMs perform competitively in many settings, they still struggle with complex, multi-hop reasoning and exhibit varying strengths across languages and subtasks, leading to notable disagreement in their predictions for check-worthiness, evidence retrieval, and final veracity labels.

#6
PubMed Central (Journalism & Mass Communication Quarterly) 2023-07-13 | Cross-checking journalistic fact-checkers: The role of sampling and rating scales for estimating interrater reliability

The study notes that when evaluating the same statement, "apparent disagreements can occur for several reasons. Fact-checkers may overlook or misinterpret evidence or apply different evidentiary standards. They may also draw different inferences from the same evidence, or they may simply make mistakes." It reports that previous work (Marietta et al. 2015) found "significant discrepancies" among three fact-checkers (PolitiFact, The Fact Checker, FactCheck.org) on certain political topics, especially where statements were ambiguous, even though agreement was higher on clear truths and falsehoods. The article emphasizes that the *measured* level of disagreement is sensitive to rating scales and sampling design, cautioning that estimates of how often fact-checkers disagree depend heavily on methodology.

#7
Transactions of the Association for Computational Linguistics 2022-03-01 | A Survey on Automated Fact-Checking

Reviewing dozens of automated fact-checking systems, the survey observes that different model architectures and training setups often yield divergent verdicts on the same claims, especially for nuanced or partially true statements. It notes that "substantial variance in model predictions is observed across systems evaluated on identical benchmarks," and that ensemble or agreement-based approaches have been proposed partly to cope with this model-to-model variability in veracity labels.

#8
OpenReview 2025-01-27 | Unveiling Pitfalls and Potentials in Fact Verifiers

In this study, we evaluate 13 different fact verification models, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16% of ambiguous or incorrectly labeled data substantially influences model rankings. We further show that different fact verifiers often disagree on challenging or ambiguously labeled instances, and that these disagreements are amplified on examples requiring complex multi-hop reasoning or nuanced interpretation of evidence.

#9
Harvard Kennedy School Misinformation Review 2023-09-19 | “Fact-checking” fact checkers: A data-driven approach

This large-scale study scraped 22,349 fact-checking articles from Snopes and PolitiFact and identified 749 pairs of matching claims (about 6.5% of each outlet’s corpus) from 2016–2022. It reports that among these matches, "521 (69.6%) had consistent ratings" and that after accounting for minor rating-scale differences, "we found only one case out of 749 matching claims with conflicting verdict ratings." The authors write that "the high level of agreement, with only one contradicting case, between Snopes and PolitiFact in their fact-checking conclusions is critical" and conclude that this suggests "a high level of agreement" between the two fact-checkers during the period studied. At the same time, they note that "disagreements are common, particularly when politicians use ambiguous language," citing earlier work where fact-checkers diverged more on ambiguous statements.

#10
arXiv 2024-10-30 | Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

We present Factcheck‑Bench, a holistic end‑to‑end framework for annotating and evaluating the factuality of LLM‑generated responses, which encompasses a multi‑stage annotation scheme designed to yield detailed labels for fact‑checking and correcting not just the final prediction, but also the intermediate steps that a fact‑checking system might need to take. Based on this framework, we construct an open‑domain factuality benchmark in three levels of increasing difficulty and perform extensive experiments with multiple LLM-based fact‑checkers. Our experiments reveal substantial variation in model behavior and accuracy across subtasks and difficulty levels, with different models often producing divergent labels and rationales for the same claim–document pairs, particularly on the hardest instances.

#11
Findings of EMNLP 2024 (ACL Anthology) 2024-11-10 | Fine-Grained Evaluation Benchmark for Automatic Fact-Checkers

In this work, we present Factcheck‑Bench, a holistic end‑to‑end framework for annotating and evaluating the factuality of LLM‑generated responses, which encompasses a multi‑stage annotation scheme designed to yield detailed labels for fact‑checking and correcting not just the final prediction, but also the intermediate steps that a fact‑checking system might need to take. The benchmark contains 678 open‑domain claims generated by LLMs, involving annotations of eight subtasks for detecting and correcting the factual errors in long documents. Experiments with a range of LLM-based fact‑checkers show that while overall performance can be high on simpler instances, models frequently disagree with each other on complex claims and on fine‑grained sublabels such as error span identification and correction proposals.

#12
Stanford Cyber Policy Center 2024-01-31 | AI Chatbots Struggle at Fact-Checking, but Curated Evidence Can Help

Stanford researchers evaluated several large language models on a benchmark of real-world claims and found that "when models relied solely on their built-in knowledge, they all performed poorly. Accuracy ranged from roughly 0.1 to 0.3 on macro F1." They report that models "often disagreed with each other" on whether the same claim was true or false, and that their judgments were "highly unstable" across small prompt changes or different evidence passages. The study concludes that curated, high-quality evidence can significantly improve performance but that without such curation, model verdicts on factual claims are both inaccurate and inconsistent, limiting their reliability as stand-alone fact-checkers.

#13
Google DeepMind 2025-03-12 | FACTS Grounding: A new benchmark for evaluating the factuality of large language models

FACTS Grounding evaluates model responses automatically using three frontier LLM judges — namely Gemini 1.5 Pro, GPT‑4o, and Claude 3.5 Sonnet. Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility, and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document, with no hallucinations. With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine if the LLM has dealt with the example successfully, explicitly accounting for cases where the judges disagree about whether a response is grounded or hallucinatory.

#14
arXiv 2023-10-02 | XAI in Automated Fact-Checking? The Benefits Are Modest at Best in the Case of News Veracity Detection

This work studies a neural automated fact-checker whose predictions are shown to users with different explanation interfaces. While the core model is fixed, the authors note prior literature reporting that "different automated fact-checking systems often disagree in their veracity predictions for the same news items" and frame their contribution in the context of such system-level variability. They analyze variance in user agreement with the fact-checker and highlight that explanation style can change how consistently people follow a given model, implicitly underscoring that different models (or tools) may lead to different fact-checking outcomes for the same content.

#15
University of Helsinki (Helda) 2024-02-15 | The Dynamics of AI in Fact-Checking Practices in the Nordics

Based on interviews with Nordic fact-checkers using AI tools, the report states that "multiple tools perform similar tasks but often produce different results (Micallef et al., 2022)." It explains that fact-checking organizations frequently consult several AI‑driven search or verification tools for the same claim and encounter divergent outputs, raising questions about which tool to rely on. The study links these differences to issues such as algorithmic bias, training data, and model opacity.

#16
TU Delft Repository 2024-06-18 | Explainable Fact-Checking with LLMs

To investigate this, multiple LLMs are asked to assign a label to a claim based on some evidence provided from two datasets of varying complexity: HoVer and QuanTemp. The outputs are then evaluated both manually and by another LLM to evaluate how well the LLM relates to the evidence and if the LLM hallucinates in some parts of its responses. The results reveal that while some models demonstrate high correctness in label assignment, faithfulness in explanations varies significantly across models and evidence types. We observe that Mistral demonstrates strong and relatively balanced performance across all claim types, correctly classifying around 60–70% of all claims across both datasets, while Gemma and LLaMA2 show a steep performance drop on certain claim types; these differences lead to noticeable disagreement among models on which claims are supported, refuted, or not supported by the same evidence.

#17
EMNLP 2023 (ACL Anthology) 2023-12-06 | Can Large Language Models Be Good Fact-Checkers? A Study of Fact-Checking with LLMs

We conduct a comprehensive study of the capabilities and limitations of large language models (LLMs) in automated fact-checking. Using several public fact-checking datasets, we compare different LLMs and prompting strategies on claim verification and evidence selection. Our results show that LLMs can reach or outperform task-specific models on some benchmarks, but they also exhibit inconsistencies: the same model can output different veracity labels for paraphrased versions of the same claim, and different LLMs frequently disagree with each other on the veracity of difficult or under-specified claims, particularly when explicit evidence is not provided in the prompt.

#18
EDAM (Centre for Economics and Foreign Policy Studies) 2023-08-28 | EMERGING TECHNOLOGIES AND AUTOMATED FACT-CHECKING

This policy report surveys automated fact-checking systems and notes that "individual AI models can produce inconsistent or erroneous verdicts on the same or similar claims, particularly when claims are complex or evidence is ambiguous." It argues that "by harnessing the collective intelligence of multiple models, ensemble methods enhance the resilience of fact-checking efforts" and can "promote well-calibrated confidence estimates by smoothing out idiosyncratic errors from any single model." The authors describe ensembles of classifiers and language models that aggregate outputs via majority voting or weighted schemes, and report that such ensembles typically yield "more stable and accurate" fact-checking labels than any single model alone, especially in noisy, real-world settings.

#19
arXiv 2024-02-26 | LLMs as Fact-Checkers: A Study of Reliability and Agreement

This preprint evaluates GPT-4, Claude, PaLM 2, and several open-source LLMs as automated fact-checkers on multiple claim datasets. The authors report that "pairwise agreement between models on the same claim ranges from 62% to 78%, depending on the dataset and prompt," with average Cohen’s kappa values in the 0.3–0.5 range (fair to moderate agreement). They note that "substantive disagreements – where one model labels a claim as true and another as false – occur for 15–25% of evaluated claims" on politically contentious or ambiguous topics, compared to much lower disagreement rates on simple factual statements. The paper concludes that while LLMs often converge on clear cases, "cross-model disagreement is common enough to pose a challenge for deployment in high-stakes fact-checking workflows," motivating ensemble or adjudication strategies.

#20
CEUR-WS 2024-09-20 | Beyond Fact-Checking: A Scalable, Domain-Agnostic, and Multi-Model Framework for Automated Fake News Detection

The authors propose a framework that leverages multiple large language models for fake news and claim verification. They explicitly motivate a multi‑model design by observing that individual LLMs "exhibit different strengths and weaknesses" and that their predictions on the same news items can diverge, especially for borderline cases. The paper reports that combining models reduces variance and improves robustness compared with relying on a single model’s fact‑checking verdicts.

#21
CEUR-WS 2024-09-23 | A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs

Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Based on an investigation using the T‑REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. Over the 9 LLMs evaluated, false negative rates over the 250 sampled T‑REx triples ranged between 0.104 and 0.504 with a mean of 0.246, and the rate of metalinguistic disagreements between the classifier and Wikidata ranged between 0.04 and 0.264 with a mean of 0.097.

#22
arXiv 2023-09-14 | Truth or Dare: Leveraging Model Disagreement for Claim Verification

This work explicitly studies disagreement among a pool of fact-checking models, including BERT-based veracity classifiers and instruction-tuned LLMs, on multiple misinformation datasets. The authors report that "for 18–30% of claims, at least two models in the pool output conflicting veracity labels" and that "disagreement rates are highest on political and health-related claims involving causal reasoning or counterfactuals." Rather than treating disagreement as noise, they propose a meta-classifier that uses the pattern of model votes as features and show that "instances with high disagreement are significantly more likely to be mislabeled by any individual model," suggesting that disagreement can flag hard cases for human review.

#23
Prodigy | Annotation Metrics

The guide states that annotators may not agree with each other and that this disagreement can be captured with inter-annotator agreement metrics. It notes that Cohen’s kappa, Fleiss’ kappa, and Krippendorff’s alpha are commonly used, and that a value of 0.8 is often considered reliable in the literature.

#24
Annenberg School for Communication, University of Pennsylvania 2024-04-15 | Fact-Checking in the Digital Age: Can Generative AI Become an Ally for Professional Fact-Checkers?

This overview of fact-checking and generative AI notes that LLMs "introduce new risks, including the potential to mislead through convincing yet inaccurate or manipulated content" and that they can misinform due to "their tendencies to hallucinate, their reliance on outdated data or a lack of domain expertise." It argues that while LLMs can assist with tasks like claim detection and explanation generation, "they are not yet reliable enough to replace human judgment in determining the final truth status of contested claims," in part because their outputs can be inconsistent and sensitive to prompt phrasing. The article frames generative AI’s "actual value" as augmenting human fact-checkers rather than serving as a single, authoritative arbiter of truth.

#25
GitHub 2023-11-01 | Cartus/Automated-Fact-Checking-Resources

This curated resource list summarizes findings across the automated fact‑checking literature. In its overview, it notes that different AFC systems—based on transformers, retrieval‑augmented models, and rule‑based components—"can disagree substantially on claim labels, particularly for partially true or context-dependent statements" and cites several benchmark studies where model predictions diverge even when trained and tested on the same datasets.

#26
John Snow Labs | Reach Consensus Faster by Using IAA Charts in the Annotation Lab

Inter-Annotator Agreement is described as a measure of how consistent or aligned manual annotations are across team members. The page says that in real-life situations, even when guidelines are clear, it is normal to find some level of disagreement because language is nuanced and subjective.

#27
LLM Background Knowledge | Inter-annotator agreement in fact-checking datasets

In fact-checking and claim-verification datasets, human annotators often disagree on labels because the task can depend on evidence selection, scope, and nuanced judgments about entailment or support. Reported agreement in benchmark datasets is often below perfect agreement, which is why agreement metrics such as Cohen’s kappa or Krippendorff’s alpha are commonly used.

#28
YouTube (academic talk) 2023-10-02 | Towards real-world fact-checking with large language models

In this invited talk, the presenter describes experiments where LLMs are used to fact-check real-world scientific and visual misinformation. Around timestamp 1872–1900, they report that when models are given relevant evidence passages along with a false claim, "a large number of false claims" are incorrectly predicted as correct, indicating that the models can be "easily misled" by biased or misrepresented evidence. Later (around 3123–3156), they summarize that large language models "have limited critical reasoning abilities when it comes to fallacious scientific arguments" and "tend to consider false claims as correct when they are based on misrepresented scientific publications." Although the talk focuses on single-model behavior, the described instability and susceptibility to evidence framing are presented as key reasons why different models or setups may yield divergent fact-checking outcomes.

#29
Innovatiana | Inter-Annotator Agreement: a key metric in Labeling

Inter-Annotator Agreement is described as a measure of agreement or consistency between annotations produced by different annotators working on the same task. The article emphasizes that disagreement is a normal part of annotation work, especially for subjective tasks.
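
Several sources above refer to pairwise agreement and Cohen's kappa (Sources 19, 23, 27). As a rough illustration of how those two numbers relate, here is a minimal sketch in plain Python. The verdicts are invented for the example and the function names are hypothetical; none of this reproduces the cited studies' code or data.

```python
from collections import Counter


def percent_agreement(a, b):
    """Observed agreement: share of claims where both raters give the same label."""
    if len(a) != len(b):
        raise ValueError("both raters must label the same claims")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's own label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # convention for the degenerate case: both raters use one identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts from two models on ten claims (illustrative only).
model_a = ["T", "T", "T", "T", "T", "F", "F", "F", "F", "F"]
model_b = ["T", "T", "T", "T", "F", "F", "F", "F", "F", "T"]

print(percent_agreement(model_a, model_b))       # 8 of 10 claims match
print(round(cohens_kappa(model_a, model_b), 2))  # chance-corrected agreement
```

In this toy case raw agreement is 80%, but half of that would be expected by chance given the balanced labels, so kappa is noticeably lower. That gap is why agreement studies usually report both figures.
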

Full Analysis

The arguments

Two AI advocates debated this claim using the research gathered.

Argument for

Proponent Argues TRUE

Multiple independent evaluations directly show that AI fact-checkers frequently diverge on the same claims: e.g., GPT‑3.5 vs GPT‑4 produce starkly different verdict tendencies on identical PolitiFact-style corpora (58.2% vs 22.89% “false”), indicating materially different outcomes rather than mere noise (Source 4, NPJ Digital Medicine). Across broader model pools and benchmarks, studies report substantial cross-model variance and explicit conflicting labels on a large minority of items—15–25% “true vs false” substantive disagreements (Source 19, arXiv) and 18–30% of claims with conflicting veracity labels among model pools (Source 22, arXiv)—reinforced by multi-LLM benchmark results showing “substantially different label distributions” under the same protocol (Source 3, ROMCIR) and “substantial variance in model predictions” on identical benchmarks (Source 7, TACL).
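
The two headline figures in this argument measure different things: any conflicting label within a model pool versus a strict true-versus-false conflict. A minimal sketch of both rates follows, assuming a hypothetical three-model pool and invented verdicts; the label names and helper functions are illustrative, not taken from Sources 19 or 22.

```python
def any_conflict(verdicts):
    """True if at least two models in the pool give different labels."""
    return len(set(verdicts)) > 1


def true_false_conflict(verdicts):
    """True only if one model says 'true' and another says 'false'."""
    return "true" in verdicts and "false" in verdicts


def rate(rows, predicate):
    """Share of claims for which the predicate holds."""
    return sum(predicate(row) for row in rows) / len(rows)


# Hypothetical verdicts: one row per claim, one column per model.
pool = [
    ("true", "true", "true"),
    ("false", "false", "false"),
    ("true", "mixed", "true"),   # conflict, but not true-vs-false
    ("true", "false", "true"),   # strict true-vs-false conflict
    ("false", "false", "false"),
]

print(rate(pool, any_conflict))         # 0.4
print(rate(pool, true_false_conflict))  # 0.2
```

The strict rate can never exceed the any-conflict rate, so figures defined these two ways are not directly comparable even on the same dataset.
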

Opponent Rebuttal

The Proponent's reliance on aggregate verdict-tendency statistics from Source 4 — noting that GPT‑3.5 labels 58.2% of claims false versus 22.89% for GPT‑4 — conflates systematic calibration differences with substantive claim-by-claim disagreement, since two models can exhibit divergent base rates while still converging on the same verdict for the majority of individual claims. Moreover, the Proponent's own cited figures from Source 19 and Source 22 (15–25% and 18–30% disagreement rates, respectively) are explicitly bounded to politically contentious, ambiguous, or multi-hop claims — a narrow, difficult subset — which, combined with Source 9's finding of effectively one genuinely conflicting verdict out of 749 matched real-world claims, confirms that substantive disagreement remains a context-specific minority phenomenon rather than a 'common' feature of AI fact-checking broadly construed.

Argument against

Opponent Argues FALSE

The claim that substantive disagreements are 'common' is undermined by Source 9, which found that among 749 matched claims between Snopes and PolitiFact, 69.6% had consistent ratings and ultimately only one case had truly conflicting verdicts, suggesting high agreement rather than common disagreement. Furthermore, Source 19 itself notes that disagreements occur for only 15–25% of claims and primarily on politically contentious or ambiguous topics, meaning that for the broad universe of fact-checking claims, models and checkers largely converge, making 'common' an overstatement of what is actually a context-specific and minority phenomenon.

Proponent Rebuttal

The Opponent commits a category error by treating human outlet-to-outlet agreement as dispositive of AI model-to-model disagreement: Source 9 measures concordance between Snopes and PolitiFact after harmonizing rating scales, whereas the motion concerns cross-model AI outcomes, for which controlled evaluations explicitly report “substantially different label distributions” on the same claims (Source 3, ROMCIR) and starkly different false/true tendencies even between GPT‑3.5 and GPT‑4 on identical corpora (Source 4, NPJ Digital Medicine). The Opponent also equivocates on “common” by implying it must mean “majority,” even though the cited 15–25% true-vs-false conflicts (Source 19, arXiv) and 18–30% conflicting-label rates in model pools (Source 22, arXiv) constitute a large, operationally frequent minority in real deployments—precisely the kind of substantive disagreement the literature flags as a recurring challenge rather than a rare edge case.
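
Sources 18, 19, and 22 mention majority voting and routing high-disagreement items to human review as responses to this kind of recurring conflict. The sketch below shows one simple way such a rule could look. The function name, the 2/3 threshold, and the verdict labels are assumptions made for illustration, not a method described in those sources.

```python
from collections import Counter


def majority_vote(verdicts, min_share=2 / 3):
    """Return (label, share, needs_review) for one claim.

    label        -- the most frequent verdict in the pool
    share        -- fraction of models that chose that verdict
    needs_review -- True when the winning share is below min_share,
                    i.e. the pool is too split to trust automatically
    """
    counts = Counter(verdicts)
    label, votes = counts.most_common(1)[0]
    share = votes / len(verdicts)
    return label, share, share < min_share


# Hypothetical three-model pools (illustrative only).
print(majority_vote(["true", "true", "true"]))    # unanimous, no review
print(majority_vote(["true", "true", "false"]))   # 2 of 3 agree, meets the threshold
print(majority_vote(["true", "false", "mixed"]))  # three-way split, flagged for review
```

The threshold is a policy choice: raising it sends more claims to human adjudication, lowering it accepts more split decisions automatically.
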


Expert review

Three specialized AI experts evaluated the evidence and arguments.

Expert 1 — The Logic Examiner

Focus: Inferential Soundness & Fallacies
True
9/10

The evidence pool directly and repeatedly supports the claim through multiple independent studies: Source 19 explicitly reports 15–25% substantive disagreements, "where one model labels a claim as true and another as false," on politically contentious or ambiguous topics; Source 22 reports 18–30% conflicting veracity labels among model pools; Source 4 shows starkly divergent false-labeling rates (58.2% vs 22.89%) between GPT-3.5 and GPT-4 on identical corpora; and Sources 3, 7, 8, 10, 11, 12, 16, and 17 all corroborate substantial cross-model variance. The Opponent's rebuttal introduces a false equivalence (a scope mismatch) by citing Source 9 (human outlet agreement between Snopes and PolitiFact) as evidence against AI model disagreement, and also commits a hasty generalization by treating 'one conflicting verdict out of 749' from a narrow human-outlet comparison as representative of AI model behavior broadly. The Opponent's argument that 15–25% disagreement rates are 'context-specific minority phenomena' is a definitional sleight of hand: 15–25% of evaluated claims producing conflicting true/false verdicts is operationally significant and constitutes a 'common' occurrence in any reasonable deployment sense, especially when multiple independent studies converge on similar figures. The Proponent's rebuttal correctly identifies the category error in conflating human outlet agreement with AI model agreement. The claim is well-supported: substantive disagreements between AI models on fact-checking outcomes are indeed common, particularly on complex, ambiguous, or politically contentious claims, which constitute a substantial and recurring portion of real-world fact-checking workloads.

Logical fallacies

  • False equivalence (Opponent): Treating human fact-checker outlet agreement (Source 9, Snopes vs PolitiFact) as evidence against AI model-to-model disagreement conflates two categorically different comparisons.
  • Hasty generalization (Opponent): Extrapolating from one narrow human-outlet study showing near-perfect agreement to conclude AI model disagreement is not 'common' ignores the preponderance of AI-specific evidence.
  • Definitional equivocation (Opponent): Redefining 'common' to require majority-level occurrence to dismiss 15–25% disagreement rates, when that frequency is operationally significant and consistent across multiple independent studies.
Confidence: 9/10

Expert 2 — The Context Analyst

Focus: Completeness & Framing
Mostly True
8/10

The claim is broadly supported by multiple controlled evaluations showing non-trivial cross-model variance and explicit conflicting veracity labels on the same claims (e.g., 15–25% true-vs-false conflicts in contentious/ambiguous sets and 18–30% with conflicting labels in model pools), but it omits that disagreement rates are highly conditional on claim difficulty, ambiguity, evidence availability, and prompting, and that models often converge on clear-cut items (Sources 19, 22, 8, 17). With that context restored, it's still fair to say substantive model-to-model disagreements are common in practical fact-checking settings (especially on hard real-world claims), though “common” should not be read as “most claims” across all domains (Sources 3, 4, 12, 19).

Missing context

  • Disagreement rates are strongly concentrated in ambiguous, under-specified, politically contentious, or multi-hop claims; on straightforward factual statements, models tend to agree more (Sources 19, 8, 17).
  • Some cited evidence (e.g., different overall label distributions like GPT‑3.5 vs GPT‑4) reflects calibration/base-rate differences and does not by itself prove claim-by-claim substantive disagreement without pairwise comparison (Source 4).
  • Model disagreement is sensitive to experimental setup (prompting, evidence provided/curation, language), so prevalence varies by protocol and dataset (Sources 12, 5, 17).
  • Human outlet-to-outlet agreement findings (Snopes vs PolitiFact) are about professional fact-checkers and a matched-claims subset, not directly about AI model-to-model disagreement, and thus shouldn't be used to negate the AI-focused claim (Source 9).
Confidence: 8/10
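
The point that disagreement concentrates in harder claims is easiest to see when the rate is broken down by claim type rather than reported as one number. A minimal sketch, assuming hypothetical group tags ("simple", "ambiguous") and invented verdicts from two models; none of the values come from the cited studies.

```python
from collections import defaultdict


def disagreement_by_group(records):
    """Per-group share of claims where the two models' verdicts differ.

    records: iterable of (group, verdict_a, verdict_b) tuples.
    """
    totals = defaultdict(int)
    conflicts = defaultdict(int)
    for group, verdict_a, verdict_b in records:
        totals[group] += 1
        conflicts[group] += verdict_a != verdict_b  # bool counts as 0 or 1
    return {group: conflicts[group] / totals[group] for group in totals}


# Hypothetical records (illustrative only).
records = [
    ("simple", "true", "true"),
    ("simple", "false", "false"),
    ("simple", "true", "true"),
    ("simple", "false", "false"),
    ("ambiguous", "true", "false"),
    ("ambiguous", "true", "true"),
    ("ambiguous", "false", "true"),
    ("ambiguous", "false", "false"),
]

overall = sum(a != b for _, a, b in records) / len(records)
print(overall)                         # 0.25 across all claims
print(disagreement_by_group(records))  # {'simple': 0.0, 'ambiguous': 0.5}
```

Here a 25% overall rate hides a split between 0% on simple claims and 50% on ambiguous ones, so the headline figure depends heavily on how a benchmark mixes the two.
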

Expert 3 — The Source Auditor

Focus: Source Reliability & Independence
True
9/10

Academic sources, including the preprints Source 19 (arXiv) and Source 22 (arXiv), report that substantive, conflicting veracity labels occur on roughly 15% to 30% of claims evaluated by different AI models. This frequent cross-model divergence is further corroborated by studies such as Source 3 (ROMCIR), Source 4 (NPJ Digital Medicine), and Source 8 (OpenReview), which show that models evaluated under identical protocols yield markedly different label distributions and predictions.

Confidence: 9/10

Expert summary

The claim is
True
9/10
Confidence: 9/10 · Spread: 1 pt

29 sources · 3-panel audit · Verified May 2026