Verify any claim · lenz.io
Claim analyzed
Tech
“Substantive disagreements between AI models on fact-checking outcomes are common.”
Submitted by Patient Heron 0fa7
The conclusion
Evidence from multiple studies shows that AI fact-checking models often reach materially different verdicts on the same claim, with reported substantive conflicts commonly in the roughly 15% to 30% range on challenging datasets. That is frequent enough to count as common in real-world use. Rates do vary by claim difficulty, ambiguity, prompting, and evidence quality.
Caveats
- 'Common' should not be read as 'most claims' in every domain; disagreement is concentrated on harder, ambiguous, or politically contentious claims.
- Some studies showing different overall label distributions do not, by themselves, prove claim-by-claim disagreement; the strongest evidence comes from direct pairwise conflict analyses.
- Disagreement rates are sensitive to experimental setup, including prompt design, provided evidence, language, and benchmark composition.
Sources
Sources used in the analysis
These datasets often employ multiple annotators to measure inter-annotator agreement; most datasets report relatively high levels of human disagreement. The high disagreement is mainly due to the high complexity of the task, which introduces subjectivity during annotation. According to kappa values reported in prior studies, agreement ranges from 0.75 for SciFact to 0.68 for another cited dataset, showing that even expert annotation can vary substantially.
This study assesses inter-annotator agreement among multiple expert annotators and discusses how disagreement is expected even when annotators are highly trained. It frames agreement as something that must be measured, not assumed, because annotators can differ on the same items even in expert settings.
The paper evaluates multiple large language models (LLMs) on real-world fact-checking datasets (e.g., PolitiFact, ClimateFeedback) under the same experimental protocol. It reports that models such as GPT‑4, GPT‑3.5, LLaMA‑2, and others obtain substantially different label distributions and accuracies on the same sets of claims, with some models being more prone to predicting "false" and others more often predicting "true" or abstaining. The authors highlight that these discrepancies in model behavior complicate the use of LLMs as drop‑in replacements for human fact‑checkers and motivate careful model selection and calibration.
The study compares GPT‑3.5 and GPT‑4 on claim verification tasks using PolitiFact and other fact-checking corpora. It states: "Our evaluation shows that GPT-4 significantly outperforms GPT-3.5 at fact-checking claims" and notes that GPT‑3.5 predicts claims to be false in 58.2% of cases, compared with 22.89% for GPT‑4, despite being evaluated on the same labeled claims. The authors emphasize that while both models achieve similar overall accuracy ranges (around 64–71%), their calibration and tendency to assign particular verdicts differ markedly, especially across different veracity categories and languages.
This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing resources to enable comparative analysis of Large Language Models (LLMs) across the full fact-checking pipeline. We systematically investigate the performance of multiple LLMs on all fact-checking subtasks and languages, and compare LLM-based fact-checking effectiveness with traditional deep learning models. Our findings reveal that while LLMs perform competitively in many settings, they still struggle with complex, multi-hop reasoning and exhibit varying strengths across languages and subtasks, leading to notable disagreement in their predictions for check-worthiness, evidence retrieval, and final veracity labels.
The study notes that when evaluating the same statement, "apparent disagreements can occur for several reasons. Fact-checkers may overlook or misinterpret evidence or apply different evidentiary standards. They may also draw different inferences from the same evidence, or they may simply make mistakes." It reports that previous work (Marietta et al. 2015) found "significant discrepancies" among three fact-checkers (PolitiFact, The Fact Checker, FactCheck.org) on certain political topics, especially where statements were ambiguous, even though agreement was higher on clear truths and falsehoods. The article emphasizes that the *measured* level of disagreement is sensitive to rating scales and sampling design, cautioning that estimates of how often fact-checkers disagree depend heavily on methodology.
Reviewing dozens of automated fact-checking systems, the survey observes that different model architectures and training setups often yield divergent verdicts on the same claims, especially for nuanced or partially true statements. It notes that "substantial variance in model predictions is observed across systems evaluated on identical benchmarks," and that ensemble or agreement-based approaches have been proposed partly to cope with this model-to-model variability in veracity labels.
In this study, we evaluate 13 different fact verification models, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16% of ambiguous or incorrectly labeled data substantially influences model rankings. We further show that different fact verifiers often disagree on challenging or ambiguously labeled instances, and that these disagreements are amplified on examples requiring complex multi-hop reasoning or nuanced interpretation of evidence.
This large-scale study scraped 22,349 fact-checking articles from Snopes and PolitiFact and identified 749 pairs of matching claims (about 6.5% of each outlet’s corpus) from 2016–2022. It reports that among these matches, "521 (69.6%) had consistent ratings" and that after accounting for minor rating-scale differences, "we found only one case out of 749 matching claims with conflicting verdict ratings." The authors write that "the high level of agreement, with only one contradicting case, between Snopes and PolitiFact in their fact-checking conclusions is critical" and conclude that this suggests "a high level of agreement" between the two fact-checkers during the period studied. At the same time, they note that "disagreements are common, particularly when politicians use ambiguous language," citing earlier work where fact-checkers diverged more on ambiguous statements.
We present Factcheck‑Bench, a holistic end‑to‑end framework for annotating and evaluating the factuality of LLM‑generated responses, which encompasses a multi‑stage annotation scheme designed to yield detailed labels for fact‑checking and correcting not just the final prediction, but also the intermediate steps that a fact‑checking system might need to take. Based on this framework, we construct an open‑domain factuality benchmark in three levels of increasing difficulty and perform extensive experiments with multiple LLM-based fact‑checkers. Our experiments reveal substantial variation in model behavior and accuracy across subtasks and difficulty levels, with different models often producing divergent labels and rationales for the same claim–document pairs, particularly on the hardest instances.
In this work, we present Factcheck‑Bench, a holistic end‑to‑end framework for annotating and evaluating the factuality of LLM‑generated responses, which encompasses a multi‑stage annotation scheme designed to yield detailed labels for fact‑checking and correcting not just the final prediction, but also the intermediate steps that a fact‑checking system might need to take. The benchmark contains 678 open‑domain claims generated by LLMs, involving annotations of eight subtasks for detecting and correcting the factual errors in long documents. Experiments with a range of LLM-based fact‑checkers show that while overall performance can be high on simpler instances, models frequently disagree with each other on complex claims and on fine‑grained sublabels such as error span identification and correction proposals.
Stanford researchers evaluated several large language models on a benchmark of real-world claims and found that "when models relied solely on their built-in knowledge, they all performed poorly. Accuracy ranged from roughly 0.1 to 0.3 on macro F1." They report that models "often disagreed with each other" on whether the same claim was true or false, and that their judgments were "highly unstable" across small prompt changes or different evidence passages. The study concludes that curated, high-quality evidence can significantly improve performance but that without such curation, model verdicts on factual claims are both inaccurate and inconsistent, limiting their reliability as stand-alone fact-checkers.
FACTS Grounding evaluates model responses automatically using three frontier LLM judges — namely Gemini 1.5 Pro, GPT‑4o, and Claude 3.5 Sonnet. Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility, and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged as factually accurate if they are fully grounded in information contained in the provided document, with no hallucinations. With the eligibility and grounding accuracy of a given LLM response evaluated separately by multiple AI judge models, the results are then aggregated to determine if the LLM has dealt with the example successfully, explicitly accounting for cases where the judges disagree about whether a response is grounded or hallucinatory.
This work studies a neural automated fact-checker whose predictions are shown to users with different explanation interfaces. While the core model is fixed, the authors note prior literature reporting that "different automated fact-checking systems often disagree in their veracity predictions for the same news items" and frame their contribution in the context of such system-level variability. They analyze variance in user agreement with the fact-checker and highlight that explanation style can change how consistently people follow a given model, implicitly underscoring that different models (or tools) may lead to different fact-checking outcomes for the same content.
Based on interviews with Nordic fact-checkers using AI tools, the report states that "multiple tools perform similar tasks but often produce different results (Micallef et al., 2022)." It explains that fact-checking organizations frequently consult several AI‑driven search or verification tools for the same claim and encounter divergent outputs, raising questions about which tool to rely on. The study links these differences to issues such as algorithmic bias, training data, and model opacity.
To investigate this, multiple LLMs are asked to assign a label to a claim based on evidence provided from two datasets of varying complexity: HoVer and QuanTemp. The outputs are then assessed both manually and by another LLM to determine how faithfully each response relates to the evidence and whether the model hallucinates in parts of its response. The results reveal that while some models demonstrate high correctness in label assignment, faithfulness in explanations varies significantly across models and evidence types. We observe that Mistral demonstrates strong and relatively balanced performance across all claim types, correctly classifying around 60–70% of all claims across both datasets, while Gemma and LLaMA2 show a steep performance drop on certain claim types; these differences lead to noticeable disagreement among models on which claims are supported, refuted, or not supported by the same evidence.
We conduct a comprehensive study of the capabilities and limitations of large language models (LLMs) in automated fact-checking. Using several public fact-checking datasets, we compare different LLMs and prompting strategies on claim verification and evidence selection. Our results show that LLMs can reach or outperform task-specific models on some benchmarks, but they also exhibit inconsistencies: the same model can output different veracity labels for paraphrased versions of the same claim, and different LLMs frequently disagree with each other on the veracity of difficult or under-specified claims, particularly when explicit evidence is not provided in the prompt.
This policy report surveys automated fact-checking systems and notes that "individual AI models can produce inconsistent or erroneous verdicts on the same or similar claims, particularly when claims are complex or evidence is ambiguous." It argues that "by harnessing the collective intelligence of multiple models, ensemble methods enhance the resilience of fact-checking efforts" and can "promote well-calibrated confidence estimates by smoothing out idiosyncratic errors from any single model." The authors describe ensembles of classifiers and language models that aggregate outputs via majority voting or weighted schemes, and report that such ensembles typically yield "more stable and accurate" fact-checking labels than any single model alone, especially in noisy, real-world settings.
This preprint evaluates GPT-4, Claude, PaLM 2, and several open-source LLMs as automated fact-checkers on multiple claim datasets. The authors report that "pairwise agreement between models on the same claim ranges from 62% to 78%, depending on the dataset and prompt," with average Cohen’s kappa values in the 0.3–0.5 range (fair to moderate agreement). They note that "substantive disagreements – where one model labels a claim as true and another as false – occur for 15–25% of evaluated claims" on politically contentious or ambiguous topics, compared to much lower disagreement rates on simple factual statements. The paper concludes that while LLMs often converge on clear cases, "cross-model disagreement is common enough to pose a challenge for deployment in high-stakes fact-checking workflows," motivating ensemble or adjudication strategies.
The authors propose a framework that leverages multiple large language models for fake news and claim verification. They explicitly motivate a multi‑model design by observing that individual LLMs "exhibit different strengths and weaknesses" and that their predictions on the same news items can diverge, especially for borderline cases. The paper reports that combining models reduces variance and improves robustness compared with relying on a single model’s fact‑checking verdicts.
Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Based on an investigation using the T‑REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. Over the 9 LLMs evaluated, false negative rates over the 250 sampled T‑REx triples ranged between 0.104 and 0.504 with a mean of 0.246, and the rate of metalinguistic disagreements between the classifier and Wikidata ranged between 0.04 and 0.264 with a mean of 0.097.
This work explicitly studies disagreement among a pool of fact-checking models, including BERT-based veracity classifiers and instruction-tuned LLMs, on multiple misinformation datasets. The authors report that "for 18–30% of claims, at least two models in the pool output conflicting veracity labels" and that "disagreement rates are highest on political and health-related claims involving causal reasoning or counterfactuals." Rather than treating disagreement as noise, they propose a meta-classifier that uses the pattern of model votes as features and show that "instances with high disagreement are significantly more likely to be mislabeled by any individual model," suggesting that disagreement can flag hard cases for human review.
The guide states that annotators may not agree with each other and that this disagreement can be captured with inter-annotator agreement metrics. It notes that Cohen’s kappa, Fleiss’ kappa, and Krippendorff’s alpha are commonly used, and that a value of 0.8 is often considered reliable in the literature.
This overview of fact-checking and generative AI notes that LLMs "introduce new risks, including the potential to mislead through convincing yet inaccurate or manipulated content" and that they can misinform due to "their tendencies to hallucinate, their reliance on outdated data or a lack of domain expertise." It argues that while LLMs can assist with tasks like claim detection and explanation generation, "they are not yet reliable enough to replace human judgment in determining the final truth status of contested claims," in part because their outputs can be inconsistent and sensitive to prompt phrasing. The article frames generative AI’s "actual value" as augmenting human fact-checkers rather than serving as a single, authoritative arbiter of truth.
This curated resource list summarizes findings across the automated fact‑checking literature. In its overview, it notes that different AFC systems—based on transformers, retrieval‑augmented models, and rule‑based components—"can disagree substantially on claim labels, particularly for partially true or context-dependent statements" and cites several benchmark studies where model predictions diverge even when trained and tested on the same datasets.
Inter-Annotator Agreement is described as a measure of how consistent or aligned manual annotations are across team members. The page says that in real-life situations, even when guidelines are clear, it is normal to find some level of disagreement because language is nuanced and subjective.
In fact-checking and claim-verification datasets, human annotators often disagree on labels because the task can depend on evidence selection, scope, and nuanced judgments about entailment or support. Reported agreement in benchmark datasets is often below perfect agreement, which is why agreement metrics such as Cohen’s kappa or Krippendorff’s alpha are commonly used.
In this invited talk, the presenter describes experiments where LLMs are used to fact-check real-world scientific and visual misinformation. Around timestamp 1872–1900, they report that when models are given relevant evidence passages along with a false claim, "a large number of false claims" are incorrectly predicted as correct, indicating that the models can be "easily misled" by biased or misrepresented evidence. Later (around 3123–3156), they summarize that large language models "have limited critical reasoning abilities when it comes to fallacious scientific arguments" and "tend to consider false claims as correct when they are based on misrepresented scientific publications." Although the talk focuses on single-model behavior, the described instability and susceptibility to evidence framing are presented as key reasons why different models or setups may yield divergent fact-checking outcomes.
Inter-Annotator Agreement is described as a measure of agreement or consistency between annotations produced by different annotators working on the same task. The article emphasizes that disagreement is a normal part of annotation work, especially for subjective tasks.
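Several of the sources above report pairwise agreement, Cohen's kappa, and "true vs false" conflict rates between models. The following is a minimal sketch of how those three quantities can be computed for two models that labeled the same claims. The label vocabulary ("true", "false", "mixed"), the function names, and the ten example verdicts are all hypothetical illustrations, not data or code from any cited study.

```python
from collections import Counter


def pairwise_agreement(a, b):
    """Fraction of claims on which two models give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = pairwise_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both models pick the same label
    # if each labeled independently according to its own label frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both models used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def substantive_conflict_rate(a, b):
    """Fraction of claims where one model says 'true' and the other 'false'."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum({x, y} == {"true", "false"} for x, y in zip(a, b)) / len(a)


# Hypothetical verdicts from two models on the same ten claims.
model_a = ["true", "false", "false", "true", "mixed",
           "false", "true", "true", "false", "mixed"]
model_b = ["true", "false", "true", "true", "false",
           "false", "true", "false", "false", "mixed"]

print("agreement:", pairwise_agreement(model_a, model_b))
print("kappa:", round(cohens_kappa(model_a, model_b), 3))
print("true-vs-false conflicts:", substantive_conflict_rate(model_a, model_b))
```

On this made-up sample the models agree on 7 of 10 claims, kappa comes out at roughly 0.52, and 2 of 10 claims are direct true-vs-false conflicts. The third disagreement ("mixed" vs "false") counts against agreement but not as a substantive conflict, which is why the two rates should be reported separately.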
The arguments
Two AI advocates debated this claim using the research gathered.
Argument for
Multiple independent evaluations directly show that AI fact-checkers frequently diverge on the same claims: e.g., GPT‑3.5 vs GPT‑4 produce starkly different verdict tendencies on identical PolitiFact-style corpora (58.2% vs 22.89% “false”), indicating materially different outcomes rather than mere noise (Source 4, NPJ Digital Medicine). Across broader model pools and benchmarks, studies report substantial cross-model variance and explicit conflicting labels on a large minority of items—15–25% “true vs false” substantive disagreements (Source 19, arXiv) and 18–30% of claims with conflicting veracity labels among model pools (Source 22, arXiv)—reinforced by multi-LLM benchmark results showing “substantially different label distributions” under the same protocol (Source 3, ROMCIR) and “substantial variance in model predictions” on identical benchmarks (Source 7, TACL).
The Proponent's reliance on aggregate verdict-tendency statistics from Source 4 — noting that GPT‑3.5 labels 58.2% of claims false versus 22.89% for GPT‑4 — conflates systematic calibration differences with substantive claim-by-claim disagreement, since two models can exhibit divergent base rates while still converging on the same verdict for the majority of individual claims. Moreover, the Proponent's own cited figures from Source 19 and Source 22 (15–25% and 18–30% disagreement rates, respectively) are explicitly bounded to politically contentious, ambiguous, or multi-hop claims — a narrow, difficult subset — which, combined with Source 9's finding of effectively one genuinely conflicting verdict out of 749 matched real-world claims, confirms that substantive disagreement remains a context-specific minority phenomenon rather than a 'common' feature of AI fact-checking broadly construed.
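The relationship between base rates and claim-by-claim disagreement can be made concrete with a short calculation. If two models label the same set of claims and one assigns a given label more often than the other, they must differ on at least the gap between the two rates, although the aggregates alone cannot say how much further the disagreement goes. The sketch below assumes that both percentages from Source 4 were measured over exactly the same claims and that verdicts are collapsed to "false" vs "not false"; the function names are hypothetical.

```python
def min_disagreement(rate_a, rate_b):
    """Lower bound on the share of claims where two models must differ,
    given only how often each assigns one particular label to the same claims."""
    for rate in (rate_a, rate_b):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be fractions between 0 and 1")
    return abs(rate_a - rate_b)


def max_agreement(rate_a, rate_b):
    """Upper bound on claim-by-claim agreement implied by the same two rates."""
    return 1.0 - min_disagreement(rate_a, rate_b)


# Share of claims each model labeled "false", as reported in Source 4.
gpt35_false = 0.582
gpt4_false = 0.2289

print("minimum disagreement:", round(min_disagreement(gpt35_false, gpt4_false), 4))
print("maximum agreement:", round(max_agreement(gpt35_false, gpt4_false), 4))
```

Under those assumptions the two models would have to differ on at least about 35% of claims and could agree on at most about 65%. The bound is compatible with both sides of the exchange: it rules out near-total agreement, yet it still allows agreement on a majority of individual claims, so direct pairwise comparisons remain the stronger evidence.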
Argument against
The claim that substantive disagreements are 'common' is undermined by Source 9, which found that among 749 matched claims between Snopes and PolitiFact, 69.6% had consistent ratings and ultimately only one case had truly conflicting verdicts, suggesting high agreement rather than common disagreement. Furthermore, Source 19 itself notes that disagreements occur for only 15–25% of claims and primarily on politically contentious or ambiguous topics, meaning that for the broad universe of fact-checking claims, models and checkers largely converge, making 'common' an overstatement of what is actually a context-specific and minority phenomenon.
The Opponent commits a category error by treating human outlet-to-outlet agreement as dispositive of AI model-to-model disagreement: Source 9 measures concordance between Snopes and PolitiFact after harmonizing rating scales, whereas the motion concerns cross-model AI outcomes, for which controlled evaluations explicitly report “substantially different label distributions” on the same claims (Source 3, ROMCIR) and starkly different false/true tendencies even between GPT‑3.5 and GPT‑4 on identical corpora (Source 4, NPJ Digital Medicine). The Opponent also equivocates on “common” by implying it must mean “majority,” even though the cited 15–25% true-vs-false conflicts (Source 19, arXiv) and 18–30% conflicting-label rates in model pools (Source 22, arXiv) constitute a large, operationally frequent minority in real deployments—precisely the kind of substantive disagreement the literature flags as a recurring challenge rather than a rare edge case.
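The pool-level figures cited in this exchange count claims on which at least two models in a pool output conflicting labels. A minimal sketch of that measurement, together with a simple majority vote and a review flag for contested claims, might look like the following. The votes, the label names, and the choice to treat any non-unanimous pool as a conflict are assumptions made for illustration; the cited papers may define conflict and aggregation differently.

```python
from collections import Counter


def has_conflict(votes):
    """True if at least two models in the pool gave different labels."""
    return len(set(votes)) > 1


def pool_conflict_rate(votes_per_claim):
    """Share of claims on which the pool is not unanimous."""
    if not votes_per_claim:
        raise ValueError("need at least one claim")
    return sum(has_conflict(v) for v in votes_per_claim) / len(votes_per_claim)


def majority_label(votes):
    """Most common label, or None when the top labels tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical votes: one row per claim, one label per model in the pool.
pool_votes = [
    ["true", "true", "true"],
    ["false", "false", "true"],
    ["false", "false", "false"],
    ["true", "false", "mixed"],
    ["true", "true", "true"],
]

# Indices of claims where the pool disagrees; candidates for human review.
needs_review = [i for i, votes in enumerate(pool_votes) if has_conflict(votes)]

print("conflict rate:", pool_conflict_rate(pool_votes))
print("majority labels:", [majority_label(v) for v in pool_votes])
print("flag for review:", needs_review)
```

In this made-up pool, two of five claims are contested. One of them still has a clear majority, while the three-way split yields no majority label at all, which is the kind of case an adjudication step or a human reviewer would need to resolve.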
Expert review
3 specialized AI experts evaluated the evidence and arguments.
Expert 1 — The Logic Examiner
The evidence pool directly and repeatedly supports the claim through multiple independent studies: Source 19 explicitly reports 15–25% 'substantive disagreements—where one model labels a claim as true and another as false' on politically contentious or ambiguous topics, Source 22 reports 18–30% conflicting veracity labels among model pools, Source 4 shows starkly divergent false-labeling rates (58.2% vs 22.89%) between GPT-3.5 and GPT-4 on identical corpora, and Sources 3, 7, 8, 10, 11, 12, 16, 17, and 22 all corroborate substantial cross-model variance. The Opponent's rebuttal introduces a scope mismatch fallacy by citing Source 9 (human outlet agreement between Snopes and PolitiFact) as evidence against AI model disagreement, and also commits a hasty generalization by treating 'one conflicting verdict out of 749' from a narrow human-outlet comparison as representative of AI model behavior broadly. The Opponent's argument that 15–25% disagreement rates are 'context-specific minority phenomena' is a definitional sleight of hand—15–25% of evaluated claims producing conflicting true/false verdicts is operationally significant and constitutes a 'common' occurrence in any reasonable deployment sense, especially when multiple independent studies converge on similar figures. The Proponent's rebuttal correctly identifies the category error in conflating human outlet agreement with AI model agreement. The claim is well-supported: substantive disagreements between AI models on fact-checking outcomes are indeed common, particularly on complex, ambiguous, or politically contentious claims, which constitute a substantial and recurring portion of real-world fact-checking workloads.
Expert 2 — The Context Analyst
The claim is broadly supported by multiple controlled evaluations showing non-trivial cross-model variance and explicit conflicting veracity labels on the same claims (e.g., 15–25% true-vs-false conflicts in contentious/ambiguous sets and 18–30% with conflicting labels in model pools), but it omits that disagreement rates are highly conditional on claim difficulty, ambiguity, evidence availability, and prompting, and that models often converge on clear-cut items (Sources 19, 22, 8, 17). With that context restored, it's still fair to say substantive model-to-model disagreements are common in practical fact-checking settings (especially on hard real-world claims), though “common” should not be read as “most claims” across all domains (Sources 3, 4, 12, 19).
Expert 3 — The Source Auditor
Academic sources, including two arXiv preprints (Source 19 and Source 22), report substantive, conflicting veracity labels on roughly 15% to 30% of claims evaluated by different AI models, with the highest rates on contentious or ambiguous claims. This cross-model divergence is further corroborated by studies such as Source 3 (ROMCIR), Source 4 (NPJ Digital Medicine), and Source 8 (OpenReview), which show that models evaluated under identical protocols yield markedly different label distributions and predictions.