Claim analyzed

Tech

“Publicly posted online content can be scraped and used to train artificial intelligence models.”

Submitted by Vicky

The conclusion

Reviewed by Vicky Dodeva, editor · Apr 03, 2026
Mostly True
8/10

The claim is accurate as a statement of technical capability and widespread industry practice. Publicly posted online content is routinely scraped to train AI models, as confirmed by academic research, corporate disclosures (e.g., Google's privacy policy), and the existence of major datasets like Common Crawl. However, the claim omits critical legal context: copyright law, privacy regulations, terms of service, and the EU AI Act (which Source 1 reports enters full enforcement in August 2026) all impose significant restrictions. "Can be done" is true; "can be done freely and lawfully in all cases" is not.

Based on 25 sources: 15 supporting, 5 refuting, 5 neutral.

Caveats

  • 'Publicly posted' does not mean 'free to use': most online content is copyrighted, and more than 70 copyright infringement lawsuits had been filed against AI companies over scraping as of early 2026 (Source 1).
  • The EU AI Act requires AI developers to disclose training data sources, respect copyright opt-outs, and comply with transparency obligations, so scraping without compliance steps may be unlawful.
  • Website terms of service, technical access controls, and computer-access laws (e.g., the Computer Fraud and Abuse Act, or CFAA, in the U.S.) can make automated scraping illegal even when content is publicly viewable in a browser.
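One of the technical opt-out signals in question is robots.txt, which Source 18 mentions by name. The snippet below is a minimal sketch of how a crawler could check that signal before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name `ExampleAIBot`, and the URL are hypothetical and for illustration only. Passing this check says nothing about whether a given use is lawful.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/post.html"  # hypothetical page
    for agent in ("ExampleAIBot", "OtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

A crawler that honors this signal would skip any URL for which `is_allowed` returns `False`. Copyright, privacy, and terms-of-service questions remain separate from the robots.txt check.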

Sources

Sources used in the analysis

#1
Use Apify 2026-03-04 | Ethical AI Scraping in 2026: Navigating the Legal Landscape | Use Apify
REFUTE

The legal landscape for web scraping has shifted dramatically in early 2026. Over 70 copyright infringement lawsuits have been filed against AI companies for scraping protected content. The EU AI Act enters full enforcement in August 2026, introducing specific requirements for data collection used in AI training, and a draft US bill (AI Accountability for Publishers Act) introduced in February 2026 would require AI companies to obtain explicit permission and pay for scraping publisher content.

#2
Scalevise 2026-02-28 | EU AI Act 2026: New Rules for Training Data and Copyright - Scalevise
REFUTE

Starting in 2026, the EU AI Act will require every AI company to disclose training data sources, respect copyright opt-outs, and label AI-generated content. Providers of general-purpose AI models will be required to publish a public summary of the datasets used for training, showing sources and types of data, and must ensure their data sources respect copyright law.

#3
bfvlaw.com 2026-01-13 | Training Data or Taking Data? How AI Copyright Lawsuits Are Reshaping Creative Rights
REFUTE

As generative AI tools become embedded in everyday life, copyright owners are increasingly asking a critical question: what happens when creative works are copied to train artificial intelligence without consent, compensation, or attribution? This question is now squarely before the courts in a series of lawsuits brought by authors and other creators against OpenAI and similar companies, alleging direct infringement through unauthorized copying and loss of exclusivity.

#4
imhuman.ai 2025-08-29 | Unveiling the Legal Battle: OpenAI Faces Lawsuit Over Data Collection Practices
REFUTE

A California law firm has filed a class-action lawsuit against OpenAI, alleging the unauthorized collection of personal data for training purposes, including private information from millions of internet users and minors without informed consent. The lawsuit contends that OpenAI scraped a staggering 300 billion words from the internet, including personal information from platforms like Twitter and Reddit, bypassing user consent and legal requirements.

#5
i2Coalition | Mindfully Training AI Models Using Public Data - i2Coalition
SUPPORT

A legitimate interest exists for using publicly-available personal data to train an AI model—as long as safeguards like public notices, subject access rights, and retention policies are built into the process. Given the vast amounts of personal data that people voluntarily make public, there should be a reasonable expectation that public data could be used to train AI models.

#6
IAPP 2025-04-30 | The EU AI Act and copyrights compliance - IAPP
SUPPORT

The DSM directive introduced a text and data mining exception to copyright protection. While text and data mining covers a wide range of computational analysis, including search engine indexing, it also extends to data scraping for AI training.

#7
arXiv 2025-02-21 | Generative AI Training and Copyright Law - arXiv
SUPPORT

A common practice is to collect such data through web scraping. Yet, much of what has been and is collected is copyright protected. Its use may be copyright infringement. In the USA, AI developers rely on “fair use” and in Europe, the prevailing view is that the exception for “Text and Data Mining” (TDM) applies.

#8
Harvard Law School 2024-12-12 | Harvard's Library Innovation Lab Launches Institutional Data Initiative
SUPPORT

At the Institutional Data Initiative (IDI), a new program hosted within the Harvard Law School Library, efforts are already underway to expand and enhance the data resources available for AI training. At the initiative’s public launch on Dec. 12, Library Innovation Lab faculty director, Jonathan Zittrain ’95, and IDI executive director, Greg Leppert, announced plans to expand the availability of public domain data from knowledge institutions — including the text of nearly one million books scanned at Harvard Library — to train AI models.

#9
Zyte 2025-12-12 | AI's legal frontier: What Europe's privacy regulators say about scraping personal data - Zyte
NEUTRAL

Explore how EU privacy regulators view AI web scraping, lawful bases like legitimate interest, risks of collecting personal data, and compliance best practices. We have seen a surge of lawsuits and regulations concerning AI and the web scraping methods often used to obtain AI training data. Companies doing so must follow all applicable laws and regulations, such as copyright laws, data protection laws, and new regulations specific to AI, such as the EU AI Act.

#10
Thurrott.com 2023-07-04 | Google's New Privacy Policy Confirms AI Data Scraping - Thurrott.com
SUPPORT

Google has updated its privacy policy to explicitly state that it uses publicly available information to help train its AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities. The policy notes, 'We may collect information that's publicly available online or from other public sources to help train Google's AI models.'

#11
Label Studio | Five Open Dataset Resources For Fine-Tuning and Training AI Models
SUPPORT

Similar to Google’s public datasets, AWS provides examples of how some open datasets have been leveraged before. Take, for example, Common Crawl — a non-profit organization that crawls the internet and makes every dataset and archive available for public use, for free. Through AWS, author Jonathan Dunn leveraged a dataset from Common Crawl to write a paper titled "Mapping Languages: The Corpus of Global Language Use."

#12
Google Cloud 2024-10-31 | Navigating the legal intricacies of scraping personal data for AI development
NEUTRAL

In the rapidly evolving world of artificial intelligence, data scraping is a hot topic. The copying of online text, images and videos has beneficial use cases (e.g. training AI models for more accurate fraud detection or collecting contact details of business representatives for marketing purposes). But is it legal? The answer isn't straightforward.

#13
Humans in the Loop 2025-01-01 | Best AI training datasets for machine learning in 2025
SUPPORT

Data.gov: A government-supported platform offering access to a wide range of public datasets in various sectors, such as finance, climate, healthcare, and transportation. This resource helps researchers and businesses seeking open data for training their AI models.

#14
Epoch AI | Data on AI Models - Epoch AI
NEUTRAL

Our public database, the largest of its kind, tracks over 3200 machine learning models from 1950 to today. Explore data and graphs showing the trajectory of ...

#15
Google Cloud Vertex AI Train and use your own models | Vertex AI
NEUTRAL

Develop and deploy ML models on Vertex AI. Choose AutoML, run custom training with serverless jobs or dedicated clusters, or scale with Ray.

#16
Grepsr 2025-10-16 | Ethical and Legal Considerations for AI Training Data - Grepsr
SUPPORT

Web scraping has become an essential tool for collecting large-scale datasets that power artificial intelligence (AI) and machine learning (ML) models. By gathering data from diverse online sources, organizations can build AI systems that are smarter, more accurate, and capable of understanding real-world variability.

#17
Data Annotation Company 2025-11-13 | AI Training Data: Top Sources and Dataset Providers - Data Annotation Company
SUPPORT

Training data for AI can come from public repositories such as Kaggle, Google Dataset Search, Hugging Face, or OpenML, as well as commercial vendors... Sources: Common Crawl, Wikipedia, PubMed, OpenWebText, multilingual corpora.

#18
CookieScript 2026-01-12 | Blocking AI Scrapers: Can Your Privacy Policy Stop LLM Training? - CookieScript
NEUTRAL

Europe's AI Act is the world's most comprehensive framework, regulating AI scrapers. In 2026, its transparency and Data Governance requirements regulate how generative AI could use personal data. ... Under the EU Copyright Directive, AI bots must respect websites' technical signals like robots.txt to opt out of AI training.

#19
Forbes 2026-03-10 | Judge Rules AI Agents Can't Act On Your Behalf Without Platform Permission - Forbes
REFUTE

A U.S. District Court ruling in March 2026 stated that AI agents cannot access platforms' electronic systems without permission from platform owners, meaning that even if a user authorizes an AI agent to log into a website, the website owners control access, and automated access can become illegal under computer access laws if permission is revoked.

#20
Invisible Technologies 2025-12-03 | AI training in 2026: anchoring synthetic data in human truth - Invisible Technologies
SUPPORT

The web corpus that fed GPT-3, GPT-4, Llama, DeepSeek and other foundation models is long exhausted. More scraping from blogs, docs with a DOI, and papers on arXiv doesn't magically teach an AI to run a hospital rota or a supply-chain control tower.

#21
PatentPC 2026-02-12 | Managing Copyright Claims for AI Training on Social Media Content - PatentPC
SUPPORT

AI systems, particularly those based on machine learning, often require massive datasets to train their models. These datasets typically consist of a wide range of data types, including social media content that is publicly available online.

#22
zdnet.com 2025-06-30 | How AI companies are secretly collecting training data from the web (and why it matters)
SUPPORT

AI companies are quietly harvesting your web content. AI chatbots like Google Gemini and ChatGPT are making token efforts to be good citizens. They scrape all our content and make billions off of it, but they're willing to provide links back to our work for the very few who bother to check sources.

#23
Matthew Butterick 2024-06-22 | AI scraping & "publicly available web data" - Matthew Butterick
SUPPORT

Publicly available web data is a phrase intended to conjure the idea of “data made publicly available on the web by the author,” because many web pages are. But in the context of automated scraping of AI training data, it means something simpler and grubbier: every byte that can be accessed via the public web.

#24
LLM Background Knowledge 2025-01-01 | Common Crawl and AI Training Practices
SUPPORT

Common Crawl is a widely used public web crawl dataset that has been scraped from the internet and extensively utilized to train large language models such as those from OpenAI, Google, and others, confirming the standard practice in the AI industry of using publicly posted online content for model training.

#25
Jetwriter AI 2024-06-22 | List of companies that train AI models using their user data - Jetwriter AI
SUPPORT

Companies like Google, OpenAI, Meta, Perplexity, Microsoft Copilot, Poe, xAI, and Merlin AI, leverage your data for training AI models, which involves analyzing your interactions to enhance their services. X (Twitter) uses personal information for research, trend analysis, and developing new features. They train their Language Learning Model using publicly available data and user feedback, so your interactions help improve their models.

Full Analysis

Expert review

How each expert evaluated the evidence and arguments

Expert 1 — The Logic Examiner

Focus: Inferential Soundness & Fallacies
True
9/10

The claim states that publicly posted online content "can be scraped and used to train AI models" — a statement of capability and established practice, not a claim of universal legal permissibility. The logical chain from evidence to claim is direct and robust: Sources 7 (arXiv), 10 (Thurrott/Google), 16 (Grepsr), 21 (PatentPC), 22 (ZDNet), and 24 (Common Crawl/LLM Knowledge) all confirm that web scraping of publicly posted content is a standard, real-world method used to train AI models, and even the refuting sources (1, 2, 3, 4, 19) presuppose the practice occurs by documenting lawsuits and regulations responding to it. The opponent's rebuttal commits a clear is-ought fallacy: the existence of legal disputes and regulatory frameworks does not negate the factual claim that scraping "can" and does occur; it merely establishes that the practice is legally contested in some contexts — which is entirely consistent with the claim being true as a statement of technical capability and industry practice. The proponent correctly identifies that the opponent conflates "can be done" with "is always lawful," and the evidence overwhelmingly supports that the practice is real, widespread, and technically feasible, even if legally complex.

Logical fallacies

  • Is-Ought Fallacy (Opponent): The opponent conflates 'what is legally permitted' with 'what can be done,' treating legal contestation as proof the practice cannot occur, but the claim asserts capability and practice, not legal license.
  • Straw Man (Opponent): The opponent reframes the claim as asserting lawful permissibility ('what AI companies are lawfully permitted to do'), then refutes that reframed version rather than the actual claim about capability and practice.
  • Cherry-Picking (Opponent): The opponent selects only litigation and regulatory sources while ignoring the preponderance of evidence confirming scraping is a standard, ongoing industry practice, a practice that the opponent's own sources presuppose.
Confidence: 9/10

Expert 2 — The Context Analyst

Focus: Completeness & Framing
Mostly True
7/10

The claim is framed as a broad capability statement, but it omits key qualifiers: “publicly posted” does not mean free of copyright, privacy, contract/ToS, or computer-access restrictions. In 2026 the EU AI Act's transparency/opt-out duties and ongoing copyright/privacy litigation underscore that scraping-for-training may be unlawful or constrained depending on jurisdiction, content type, and access method (Sources 1-4, 9, 18-19). Even with that context, it remains accurate that publicly accessible online content can in fact be scraped and has been used to train AI models (e.g., common web-scraping practice and public-policy acknowledgments), but the unqualified phrasing can mislead readers into thinking it is generally permitted (Sources 7, 10, 11, 24).

Missing context

  • “Publicly posted” content can still be copyrighted; copying for training may require a legal basis (e.g., fair use, TDM exception) and is heavily litigated.
  • Website terms of service and technical access controls (and some computer-access laws) can make automated scraping unlawful even if content is viewable in a browser.
  • Privacy/data-protection rules can restrict scraping and reuse of personal data, especially sensitive data or minors' data.
  • EU AI Act (full enforcement in 2026) adds transparency and governance duties (e.g., dataset summaries, respecting opt-outs), so 'can be used' often means 'can be used only with compliance steps' rather than freely.
Confidence: 8/10

Expert 3 — The Source Auditor

Focus: Source Reliability & Independence
Mostly True
8/10

Higher-authority, more independent sources in the pool (notably Source 8 Harvard Law School; Source 7 arXiv; and Source 6 IAPP summarizing EU text-and-data-mining rules) describe web scraping/publicly available online material as a real, established input to AI training (often with legal caveats), while the main “refute” items (Sources 1–4, 19) largely document legal risk, lawsuits, and compliance constraints rather than showing that such content cannot be scraped or used at all. Based on what the most trustworthy evidence says, the claim is broadly correct as a capability and common practice, but many sources emphasize that legality depends on copyright, privacy, and access/ToS constraints—so the unqualified phrasing is somewhat incomplete rather than wrong.

Weakest sources

  • Source 24 (LLM Background Knowledge) is not an independent, citable primary source and cannot be audited like a publication.
  • Source 25 (Jetwriter AI) is low-authority and has strong incentives to publish attention-grabbing lists; it is not a primary or independently verified record.
  • Source 1 (Use Apify) and Source 2 (Scalevise) are commercial blogs whose legal summaries may be useful context but are not primary legal authorities; their framing can overreach beyond what statutes/cases definitively establish.
  • Source 4 (imhuman.ai) appears to be a niche blog summarizing litigation with potentially promotional/advocacy framing, and it is not a primary court document or major newsroom report.
  • Source 19 (Forbes) is a secondary media account of a court ruling and, without the underlying docket/opinion, is hard to verify and may be overgeneralized from a fact-specific access dispute.
Confidence: 7/10

Expert summary

The claim is
Mostly True
8/10
Confidence: 8/10 · Spread: 2 pts
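The page does not state how the panel figures are derived. They are nonetheless consistent with a simple reading of the three expert reviews above: scores of 9, 7, and 8 and confidence values of 9, 8, and 7. The sketch below assumes the overall value is the rounded mean and the spread is the highest score minus the lowest. The function name `summarize` and that aggregation rule are assumptions, not the site's documented method.

```python
from statistics import mean

# Scores and confidence values as listed in the three expert reviews.
expert_scores = {"Logic Examiner": 9, "Context Analyst": 7, "Source Auditor": 8}
expert_confidence = {"Logic Examiner": 9, "Context Analyst": 8, "Source Auditor": 7}


def summarize(values):
    """Return (rounded mean, max minus min) for a collection of 0-10 ratings.

    Assumed aggregation rule; the source does not document its method.
    """
    vals = list(values)
    return round(mean(vals)), max(vals) - min(vals)


if __name__ == "__main__":
    score, spread = summarize(expert_scores.values())
    confidence, _ = summarize(expert_confidence.values())
    print(f"Score: {score}/10, Confidence: {confidence}/10, Spread: {spread} pts")
```

Under these assumptions the sketch reproduces the published figures: 8/10, confidence 8/10, spread 2 pts.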

The arguments

Two AI advocates debated this claim using the research gathered.

Argument for

Proponent Argues TRUE

Multiple independent sources describe web scraping of publicly accessible online material as a standard, real-world method for assembling AI training corpora—arXiv notes that “a common practice is to collect such data through web scraping” for model training (Source 7, arXiv), and Google explicitly states in its privacy policy that it “may collect information that's publicly available online… to help train Google's AI models” (Source 10, Thurrott.com). The existence of detailed compliance regimes and disputes about permissions (e.g., EU text-and-data-mining exceptions extending to scraping for AI training in Source 6, IAPP, and the 2026 EU AI Act's disclosure/opt-out requirements in Source 2, Scalevise) presupposes the underlying fact that publicly posted online content can be scraped and used to train AI models.

Opponent Rebuttal

Your argument commits a classic is-ought fallacy — you conflate what AI companies have done with what they can lawfully do, but Source 1 (Use Apify) documents over 70 active copyright infringement lawsuits and a draft U.S. bill requiring explicit permission before scraping, while Source 19 (Forbes) reports a March 2026 court ruling that automated access without platform permission can be outright illegal, meaning the practice is legally contested, not permissible. Furthermore, your appeal to Google's privacy policy and the EU's text-and-data-mining exception cherry-picks narrow carve-outs while ignoring that Source 2 (Scalevise) and Source 3 (Training Data or Taking Data) confirm courts and regulators are actively treating unauthorized scraping as infringement — so citing compliance frameworks as proof the practice "can" occur freely actually undermines your case, since those frameworks exist precisely because unconstrained scraping is no longer legally tenable.

Argument against

Opponent Argues FALSE

The claim that publicly posted online content can be scraped and used to train AI models is fundamentally misleading in 2026, as it ignores the rapidly solidifying legal barriers: Source 1 (Use Apify) documents over 70 copyright infringement lawsuits against AI companies for exactly this practice, Source 2 (Scalevise) confirms the EU AI Act now mandates copyright opt-out compliance and data source disclosure, and Source 19 (Forbes) reports a March 2026 U.S. District Court ruling that automated access to platforms without explicit permission can be illegal under computer access laws. Far from being a permissible practice, Source 3 (Training Data or Taking Data) and Source 4 (Unveiling the Legal Battle) demonstrate that courts are actively treating unauthorized scraping of publicly posted content as copyright infringement and unauthorized data collection, meaning the claim — without critical legal qualifications — is false as a general statement of what AI companies are lawfully permitted to do.

Proponent Rebuttal

You're equivocating between “can be scraped and used” (a capability and widespread practice) and “is always lawful,” but even your own evidence, namely the lawsuits (Source 1; Source 3; Source 4) and new compliance duties like opt-outs and disclosure (Source 2), presupposes that companies are in fact scraping and using publicly posted content for training, which is exactly what the claim asserts. And your reliance on Source 19 (Forbes) overgeneralizes a platform-access/permission dispute into a blanket ban on scraping, while the supporting sources explicitly describe scraping as common practice (Source 7, arXiv) and confirm real-world training on publicly available online information (Source 10, Thurrott.com), with EU TDM exceptions extending to scraping for AI training (Source 6, IAPP).
