Verify any claim · lenz.io
Claim analyzed
Tech
“Publicly posted online content can be scraped and used to train artificial intelligence models.”
Submitted by Vicky
The conclusion
The claim is accurate as a statement of technical capability and widespread industry practice. Publicly posted online content is routinely scraped to train AI models—confirmed by academic research, corporate disclosures (e.g., Google's privacy policy), and the existence of major datasets like Common Crawl. However, the claim omits critical legal context: copyright law, privacy regulations, terms of service, and the EU AI Act (fully enforced in 2026) all impose significant restrictions. "Can be done" is true; "can be done freely and lawfully in all cases" is not.
Based on 25 sources: 15 supporting, 5 refuting, 5 neutral.
Caveats
- 'Publicly posted' does not mean 'free to use'—most online content is copyrighted, and scraping it for AI training is the subject of 70+ active lawsuits as of early 2026.
- The EU AI Act now requires AI developers to disclose training data sources, respect copyright opt-outs, and comply with transparency obligations—scraping without compliance steps may be unlawful.
- Website terms of service, technical access controls, and computer-access laws (e.g., CFAA in the U.S.) can make automated scraping illegal even when content is publicly viewable in a browser.
Sources
Sources used in the analysis
The legal landscape for web scraping has shifted dramatically in early 2026. Over 70 copyright infringement lawsuits have been filed against AI companies for scraping protected content. The EU AI Act enters full enforcement in August 2026, introducing specific requirements for data collection used in AI training, and a draft US bill (AI Accountability for Publishers Act) introduced in February 2026 would require AI companies to obtain explicit permission and pay for scraping publisher content.
Starting in 2026, the EU AI Act will require every AI company to disclose training data sources, respect copyright opt-outs, and label AI-generated content. Providers of general-purpose AI models will be required to publish a public summary of the datasets used for training, showing sources and types of data, and must ensure their data sources respect copyright law.
As generative AI tools become embedded in everyday life, copyright owners are increasingly asking a critical question: what happens when creative works are copied to train artificial intelligence without consent, compensation, or attribution? This question is now squarely before the courts in a series of lawsuits brought by authors and other creators against OpenAI and similar companies, alleging direct infringement through unauthorized copying and loss of exclusivity.
A California law firm has filed a class-action lawsuit against OpenAI, alleging the unauthorized collection of personal data for training purposes, including private information from millions of internet users and minors without informed consent. The lawsuit contends that OpenAI scraped a staggering 300 billion words from the internet, including personal information from platforms like Twitter and Reddit, bypassing user consent and legal requirements.
A legitimate interest exists for using publicly-available personal data to train an AI model—as long as safeguards like public notices, subject access rights, and retention policies are built into the process. Given the vast amounts of personal data that people voluntarily make public, there should be a reasonable expectation that public data could be used to train AI models.
The DSM directive introduced a text and data mining exception to copyright protection. While text and data mining covers a wide range of computational analysis, including search engine indexing, it also extends to data scraping for AI training.
A common practice is to collect such data through web scraping. Yet, much of what has been and is collected is copyright protected. Its use may be copyright infringement. In the USA, AI developers rely on “fair use” and in Europe, the prevailing view is that the exception for “Text and Data Mining” (TDM) applies.
At the Institutional Data Initiative (IDI), a new program hosted within the Harvard Law School Library, efforts are already underway to expand and enhance the data resources available for AI training. At the initiative’s public launch on Dec. 12, Library Innovation Lab faculty director, Jonathan Zittrain ’95, and IDI executive director, Greg Leppert, announced plans to expand the availability of public domain data from knowledge institutions — including the text of nearly one million books scanned at Harvard Library — to train AI models.
Explore how EU privacy regulators view AI web scraping, lawful bases like legitimate interest, risks of collecting personal data, and compliance best practices. We have seen a surge of lawsuits and regulations concerning AI and the web scraping methods often used to obtain AI training data. Companies doing so must follow all applicable laws and regulations, such as copyright laws, data protection laws, and new regulations specific to AI, such as the EU AI Act.
Google has updated its privacy policy to explicitly state that it uses publicly available information to help train its AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities. The policy notes, 'We may collect information that's publicly available online or from other public sources to help train Google's AI models.'
Similar to Google’s public datasets, AWS provides examples of how some open datasets have been leveraged before. Take, for example, Common Crawl — a non-profit organization that crawls the internet and makes every dataset and archive available for public use, for free. Through AWS, author Jonathan Dunn leveraged a dataset from Common Crawl to write a paper titled "Mapping Languages: The Corpus of Global Language Use."
In the rapidly evolving world of artificial intelligence, data scraping is a hot topic. The copying of online text, images and videos has beneficial use cases (e.g. training AI models for more accurate fraud detection or collecting contact details of business representatives for marketing purposes). But is it legal? The answer isn't straightforward.
Data.gov: A government-supported platform offering access to a wide range of public datasets in various sectors, such as finance, climate, healthcare, and transportation. This resource helps researchers and businesses seeking open data for training their AI models.
Our public database, the largest of its kind, tracks over 3200 machine learning models from 1950 to today. Explore data and graphs showing the trajectory of ...
Develop and deploy ML models on Vertex AI. Choose AutoML, run custom training with serverless jobs or dedicated clusters, or scale with Ray.
Web scraping has become an essential tool for collecting large-scale datasets that power artificial intelligence (AI) and machine learning (ML) models. By gathering data from diverse online sources, organizations can build AI systems that are smarter, more accurate, and capable of understanding real-world variability.
Training data for AI can come from public repositories such as Kaggle, Google Dataset Search, Hugging Face, or OpenML, as well as commercial vendors... Sources: Common Crawl, Wikipedia, PubMed, OpenWebText, multilingual corpora.
Europe's AI Act is the world's most comprehensive framework, regulating AI scrapers. In 2026, its transparency and Data Governance requirements regulate how generative AI could use personal data. ... Under the EU Copyright Directive, AI bots must respect websites' technical signals like robots.txt to opt out of AI training.
A U.S. District Court ruling in March 2026 stated that AI agents cannot access platforms' electronic systems without permission from platform owners, meaning that even if a user authorizes an AI agent to log into a website, the website owners control access, and automated access can become illegal under computer access laws if permission is revoked.
The web corpus that fed GPT-3, GPT-4, Llama, DeepSeek and other foundation models is long exhausted. More scraping from blogs, docs with a DOI, and papers on arXiv doesn't magically teach an AI to run a hospital rota or a supply-chain control tower.
AI systems, particularly those based on machine learning, often require massive datasets to train their models. These datasets typically consist of a wide range of data types, including social media content that is publicly available online.
AI companies are quietly harvesting your web content. AI chatbots like Google Gemini and ChatGPT are making token efforts to be good citizens. They scrape all our content and make billions off of it, but they're willing to provide links back to our work for the very few who bother to check sources.
Publicly available web data is a phrase intended to conjure the idea of “data made publicly available on the web by the author,” because many web pages are. But in the context of automated scraping of AI training data, it means something simpler and grubbier: every byte that can be accessed via the public web.
Common Crawl is a widely used public web crawl dataset that has been scraped from the internet and extensively utilized to train large language models such as those from OpenAI, Google, and others, confirming the standard practice in the AI industry of using publicly posted online content for model training.
Companies like Google, OpenAI, Meta, Perplexity, Microsoft Copilot, Poe, xAI, and Merlin AI, leverage your data for training AI models, which involves analyzing your interactions to enhance their services. X (Twitter) uses personal information for research, trend analysis, and developing new features. They train their Language Learning Model using publicly available data and user feedback, so your interactions help improve their models.
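One source excerpt above says AI bots must respect technical signals such as robots.txt when a site opts out of AI training. As a minimal sketch of what such a check might look like, the snippet below uses Python's standard-library `urllib.robotparser` on an in-memory robots.txt. The robots.txt content, the bot names `ExampleAIBot` and `ExampleBrowserBot`, and the helper `may_fetch` are all hypothetical and used only for illustration. Passing this check does not by itself establish that scraping is lawful.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only:
# one named AI crawler is blocked site-wide, all other agents are allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"
    for agent in ("ExampleAIBot", "ExampleBrowserBot"):
        print(agent, "allowed:", may_fetch(ROBOTS_TXT, agent, url))
```

A crawler built along these lines would call `may_fetch` before each request and skip any URL for which it returns False. Copyright, privacy, terms-of-service, and computer-access rules can still apply independently of what robots.txt says.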
Expert review
How each expert evaluated the evidence and arguments
Expert 1 — The Logic Examiner
The claim states that publicly posted online content "can be scraped and used to train AI models" — a statement of capability and established practice, not a claim of universal legal permissibility. The logical chain from evidence to claim is direct and robust: Sources 7 (arXiv), 10 (Thurrott/Google), 16 (Grepsr), 21 (PatentPC), 22 (ZDNet), and 24 (Common Crawl/LLM Knowledge) all confirm that web scraping of publicly posted content is a standard, real-world method used to train AI models, and even the refuting sources (1, 2, 3, 4, 19) presuppose the practice occurs by documenting lawsuits and regulations responding to it. The opponent's rebuttal commits a clear is-ought fallacy: the existence of legal disputes and regulatory frameworks does not negate the factual claim that scraping "can" and does occur; it merely establishes that the practice is legally contested in some contexts — which is entirely consistent with the claim being true as a statement of technical capability and industry practice. The proponent correctly identifies that the opponent conflates "can be done" with "is always lawful," and the evidence overwhelmingly supports that the practice is real, widespread, and technically feasible, even if legally complex.
Expert 2 — The Context Analyst
The claim is framed as a broad capability statement, but it omits a key qualifier: “publicly posted” does not mean free of copyright, privacy, contract/ToS, or computer-access restrictions. In 2026 the EU AI Act's transparency/opt-out duties and ongoing copyright/privacy litigation underscore that scraping-for-training may be unlawful or constrained depending on jurisdiction, content type, and access method (Sources 1-4, 9, 18-19). Even with that context, it remains accurate that publicly accessible online content can in fact be scraped and has been used to train AI models (e.g., common web-scraping practice and public-policy acknowledgments), but the unqualified phrasing can mislead readers into thinking it is generally permitted (Sources 7, 10, 11, 24).
Expert 3 — The Source Auditor
Higher-authority, more independent sources in the pool (notably Source 8 Harvard Law School; Source 7 arXiv; and Source 6 IAPP summarizing EU text-and-data-mining rules) describe web scraping/publicly available online material as a real, established input to AI training (often with legal caveats), while the main “refute” items (Sources 1–4, 19) largely document legal risk, lawsuits, and compliance constraints rather than showing that such content cannot be scraped or used at all. Based on what the most trustworthy evidence says, the claim is broadly correct as a capability and common practice, but many sources emphasize that legality depends on copyright, privacy, and access/ToS constraints—so the unqualified phrasing is somewhat incomplete rather than wrong.
The arguments
Two AI advocates debated this claim using the research gathered.
Argument for
Multiple independent sources describe web scraping of publicly accessible online material as a standard, real-world method for assembling AI training corpora—arXiv notes that “a common practice is to collect such data through web scraping” for model training (Source 7, arXiv), and Google explicitly states in its privacy policy that it “may collect information that's publicly available online… to help train Google's AI models” (Source 10, Thurrott.com). The existence of detailed compliance regimes and disputes about permissions (e.g., EU text-and-data-mining exceptions extending to scraping for AI training in Source 6, IAPP, and the 2026 EU AI Act's disclosure/opt-out requirements in Source 2, Scalevise) presupposes the underlying fact that publicly posted online content can be scraped and used to train AI models.
Your argument commits a classic is-ought fallacy — you conflate what AI companies have done with what they can lawfully do, but Source 1 (Use Apify) documents over 70 active copyright infringement lawsuits and a draft U.S. bill requiring explicit permission before scraping, while Source 19 (Forbes) reports a March 2026 court ruling that automated access without platform permission can be outright illegal, meaning the practice is legally contested, not permissible. Furthermore, your appeal to Google's privacy policy and the EU's text-and-data-mining exception cherry-picks narrow carve-outs while ignoring that Source 2 (Scalevise) and Source 3 (Training Data or Taking Data) confirm courts and regulators are actively treating unauthorized scraping as infringement — so citing compliance frameworks as proof the practice "can" occur freely actually undermines your case, since those frameworks exist precisely because unconstrained scraping is no longer legally tenable.
Argument against
The claim that publicly posted online content can be scraped and used to train AI models is fundamentally misleading in 2026, as it ignores the rapidly solidifying legal barriers: Source 1 (Use Apify) documents over 70 copyright infringement lawsuits against AI companies for exactly this practice, Source 2 (Scalevise) confirms the EU AI Act now mandates copyright opt-out compliance and data source disclosure, and Source 19 (Forbes) reports a March 2026 U.S. District Court ruling that automated access to platforms without explicit permission can be illegal under computer access laws. Far from being a permissible practice, Source 3 (Training Data or Taking Data) and Source 4 (Unveiling the Legal Battle) demonstrate that courts are actively treating unauthorized scraping of publicly posted content as copyright infringement and unauthorized data collection, meaning the claim — without critical legal qualifications — is false as a general statement of what AI companies are lawfully permitted to do.
You're equivocating between “can be scraped and used” (a capability and widespread practice) and “is always lawful,” but even your own evidence, namely the lawsuits (Source 1; Source 3; Source 4) and new compliance duties like opt-outs and disclosure (Source 2), presupposes that companies are in fact scraping and using publicly posted content for training, which is exactly what the claim asserts. And your reliance on Source 19 (Forbes) overgeneralizes a platform-access/permission dispute into a blanket ban on scraping, while the supporting sources explicitly describe scraping as common practice (Source 7, arXiv) and confirm real-world training on publicly available online information (Source 10, Thurrott.com), with EU TDM exceptions extending to scraping for AI training (Source 6, IAPP).