The internet is eating itself: model collapse and what it means for the evidence base

The previous articles in this series traced evidence quality failures at the level of individual claims: fabricated citations, amplification cascades, fragile assurance, AI-generated peer reviews. This article traces a failure at the level of the system that produces the AI tools themselves. The contamination runs deeper than a bad source or a weak study.

Model collapse is not a new concern. Shumailov et al. raised it in 2023, and the theoretical problem was understood clearly from early on: if successive model generations train on their own outputs, quality degrades as rare signals are lost and errors compound. The concern attracted serious academic attention but remained, for practical purposes, a future risk. The fraction of synthetic content contaminating training data was small enough to treat as manageable.

That fraction is no longer small.

When researchers talk about the reliability of AI tools, the usual frame is accuracy: does the model get things right? The benchmarks test this. The leaderboards rank it. The sales decks cite it. The frame is correct as far as it goes, but it misses a prior question: what was the model trained on, and how much of that training data was itself generated by an earlier AI model?

The question matters because of model collapse: the progressive degradation of AI model quality when a significant portion of training data is synthetic, that is, generated by previous model generations rather than produced by human researchers, writers, and knowledge workers. The degradation is not a cliff edge. It is a slow compounding across generations, and it is already underway at internet scale.

How the contamination happens

The standard approach to training large language models relies on web crawls: automated scraping of text from the internet at massive scale. Common Crawl, the most widely used source, publishes regular crawl snapshots that each cover billions of pages. The models trained on it learn from whatever the web contains: academic papers, journalism, forum discussions, government documents, and, increasingly, content generated by AI tools.

This creates a feedback loop. A model trained in 2023 produces content that appears on the web in 2024. A model trained on a 2024 web crawl ingests that content alongside human-authored material, treating the two as equivalent training signal. The 2024 model produces more content, which enters the 2025 crawl. Each generation of models trains partly on the outputs of its predecessors, amplifying whatever patterns, errors, and distortions those predecessors introduced.

Figure: The self-reinforcing contamination loop. Each model generation's outputs enter the web, enter the next training crawl, and shape the next model. The dashed red path marks where quality degrades.

The scale of synthetic content on the web is no longer speculative. Analysis published in 2025 documented that 74.2% of newly published web pages contain AI-generated material. The proportion is rising faster than human-authored content can dilute it. A paper published in April 2026 (arXiv:2606.05168) modelled this contamination using epidemiological SIR dynamics — the same framework used to model infectious disease spread — and found supercritical dynamics across three scenarios. The R0 value, the average number of new contaminated data points produced by each existing one, exceeds 1 in all cases. The contamination is self-reinforcing rather than self-limiting.
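To make the R0 threshold concrete, here is a minimal sketch of textbook SIR dynamics using a simple forward-Euler step. It is not the cited paper's model: the function names, the parameter values, and the reading of "infected" as contaminated data are all assumptions chosen for illustration. In the standard SIR model R0 equals beta divided by gamma, and the contaminated share can only grow when that ratio exceeds 1.

```python
def basic_reproduction_number(beta, gamma):
    """R0 for the textbook SIR model: transmission rate over removal rate."""
    return beta / gamma


def simulate_sir(beta, gamma, i0=0.01, steps=200, dt=0.1):
    """Forward-Euler SIR on population fractions.

    Returns the 'infected' (here: contaminated) fraction at every step.
    All parameter values are illustrative assumptions, not fitted estimates.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    infected = [i]
    for _ in range(steps):
        new_infections = beta * s * i * dt   # clean data that becomes contaminated
        removals = gamma * i * dt            # contaminated data that stops spreading
        s -= new_infections
        i += new_infections - removals
        r += removals
        infected.append(i)
    return infected


supercritical = simulate_sir(beta=0.6, gamma=0.2)  # R0 above 1
subcritical = simulate_sir(beta=0.1, gamma=0.2)    # R0 below 1

print("R0 > 1: peak contaminated share", max(supercritical))
print("R0 < 1: peak contaminated share", max(subcritical))
```

In the subcritical run the contaminated share never rises above its starting value; in the supercritical run it grows before it declines. That is the sense in which R0 above 1 means self-reinforcing rather than self-limiting.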

74%: newly published web pages containing AI-generated material (2025)

As little as 1%: synthetic data fraction sufficient to initiate measurable model collapse

R0 > 1: contamination dynamics across all modelled scenarios; self-reinforcing, not self-limiting

The threshold finding is the one that should attract the most attention. Research from Stanford and other institutions has reported that even small proportions of synthetic training data, as little as 1%, can initiate measurable model collapse. Scaling the model or the dataset does not reliably prevent it. The contamination cannot be diluted away by adding more data if the additional data is itself synthetic. And with 74% of newly published pages containing AI-generated material, the assumption that web-scale training data is predominantly human-authored is no longer defensible.
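The dilution point is simple arithmetic, sketched below. The function name and the corpus sizes are hypothetical, and the sketch treats each document as wholly synthetic or wholly human, which simplifies the "pages containing AI-generated material" measure quoted above.

```python
def synthetic_share_after_adding(corpus_docs, corpus_share, added_docs, added_share):
    """Synthetic share of a corpus after new documents are mixed in.

    Shares are fractions between 0 and 1. All inputs are illustrative.
    """
    synthetic_docs = corpus_docs * corpus_share + added_docs * added_share
    return synthetic_docs / (corpus_docs + added_docs)


# A hypothetical corpus that starts 10% synthetic and doubles in size.
with_contaminated_crawl = synthetic_share_after_adding(1000, 0.10, 1000, 0.74)
with_clean_data = synthetic_share_after_adding(1000, 0.10, 1000, 0.0)

print(with_contaminated_crawl)  # share rises toward the crawl's own share
print(with_clean_data)          # share falls only if the added data is clean
```

Under these assumptions, doubling the corpus with data that is 74% synthetic lifts the overall share from 10% to roughly 42%. Adding data lowers the share only when the added data is cleaner than the corpus it joins.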

What collapse actually looks like

The term model collapse suggests a sudden failure. The actual phenomenon is more gradual and in some ways more troubling for that reason. Collapse manifests as homogenisation: the model's output distribution narrows over successive training generations, losing the diversity and tail-end richness of human-authored text. The model becomes more confident and less varied. It produces fluent, plausible, internally consistent outputs — but the range of those outputs contracts, edge cases disappear from the distribution, and errors become systematic rather than random.
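The narrowing can be illustrated with a toy experiment, a minimal sketch rather than a description of how any real model is trained. A long-tailed categorical distribution stands in for human-authored text, and each "generation" is re-estimated from a finite sample of the one before it. The vocabulary size, sample size, and generation count are arbitrary assumptions.

```python
import random
from collections import Counter


def resample_generations(weights, sample_size, generations, seed=0):
    """Repeatedly re-estimate a categorical distribution from its own samples.

    Returns the number of categories with non-zero probability after each
    generation; index 0 is the original distribution.
    """
    rng = random.Random(seed)
    categories = list(range(len(weights)))
    current = list(weights)
    support_sizes = [sum(1 for w in current if w > 0)]
    for _ in range(generations):
        # "Train" the next generation: draw a finite sample from the current
        # distribution and use the empirical frequencies as the new one.
        sample = rng.choices(categories, weights=current, k=sample_size)
        counts = Counter(sample)
        current = [counts.get(c, 0) / sample_size for c in categories]
        support_sizes.append(sum(1 for w in current if w > 0))
    return support_sizes


# Long-tailed toy vocabulary: category i has weight proportional to 1 / (i + 1).
long_tail = [1.0 / (i + 1) for i in range(200)]
sizes = resample_generations(long_tail, sample_size=500, generations=30)
print("categories still represented:", sizes[0], "->", sizes[-1])
```

Once a rare category draws zero samples it has zero probability in every later generation, so the set of things the toy model can say only shrinks. Common categories survive, which is why each generation's output still looks plausible.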

The photocopy analogy is accurate. A first-generation photocopy of a document is nearly indistinguishable from the original. A photocopy of that photocopy is slightly degraded. By the tenth generation, fine detail is gone, contrast is blown out, and the text is still readable but no longer faithful to the original. Each generation's errors propagate and compound. With model collapse, the analogous process operates on the distribution of knowledge itself: what the model knows, what it can express, and how confidently it asserts things that may no longer reflect the original human-generated evidence base.

This matters for the evidence quality problem the series has been tracing. The previous articles documented failures at the level of individual outputs: a specific citation that doesn't exist, a particular claim that overstates its evidence, a review that looks rigorous without being rigorous. Model collapse is a failure at the input level. If the training data for AI research tools is increasingly synthetic — drawn from earlier AI outputs rather than primary human research — then the tools are learning from a progressively distorted picture of what human knowledge actually contains. The outputs they generate reflect that distortion, not the original evidence base.

The knowledge field the model learned from is not the knowledge field that exists

There is a version of this problem that is subtle enough to deserve its own framing. A model trained on web data from 2024 knows what the web said in 2024. But a significant portion of what the web said in 2024 was a restatement, a paraphrase, or an AI-generated summary of what the web had said previously. The model's representation of any given fact or finding reflects not the original source of that fact, but the web's collective reprocessing of it across multiple generations of summarisation and restatement.

This is the semantic laundering problem the previous article touched on at the level of individual sources. At the scale of model training data, the same dynamic operates on entire knowledge fields. A model trained on the 2025 web's representation of, say, epidemiological research reflects not the primary literature of epidemiology but a web-scale summarisation of it that has passed through multiple rounds of AI restatement, selective emphasis, and compression. The model can produce confident, fluent answers about epidemiology. The accuracy of those answers depends on how faithfully each summarisation generation preserved the original evidence and its appropriate caveats.
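A back-of-envelope sketch shows why per-generation fidelity matters so much. It assumes, purely for illustration, that every round of restatement independently preserves the same fraction of the original caveats; the 90% figure is not a measured value.

```python
def retained_after(rounds, retention_per_round):
    """Fraction of original detail left after repeated lossy restatement.

    Assumes each round keeps the same fraction, independently of the others.
    """
    return retention_per_round ** rounds


for rounds in (1, 3, 5, 10):
    print(rounds, "rounds:", round(retained_after(rounds, 0.9), 3))
```

Under that assumption, five rounds at 90% fidelity leave about 59% of the original caveats intact, and ten rounds leave about 35%.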

Research from the University of Passau and Arizona State University, published in February 2026, identified a specific mechanism that makes this worse. It found that organisations relying heavily on AI tools to process and summarise information progressively lose the human expertise needed to evaluate that information independently. The in-house knowledge atrophies because the AI handles the task. When the AI's outputs reflect a distorted training distribution, the organisation no longer has the internal capacity to catch the distortion. The check on AI reliability is the very thing the AI has been substituted for.

The compounding problem: model collapse degrades the training distribution across generations. Human expertise atrophies as AI tools substitute for the tasks that expertise was exercised on. The two processes reinforce each other: as model outputs become less reliable, the human capacity to detect that unreliability declines in parallel.

Why this is different from the problems already documented

The fabricated citation problem, the amplification cascade, the fragile assurance failure, the AI peer review — each of these is an identifiable failure in a specific output that a sufficiently attentive researcher could in principle detect. A fabricated citation resolves to nothing when checked. An amplification cascade reveals itself when source lineage is traced. AI-generated reviews have detectable stylistic signatures. These are failures that produce artifacts, and artifacts can be examined.

Model collapse is different in kind. It is a failure in the prior that shapes all outputs, not a failure in any particular output. A model collapsing across training generations does not produce a detectable error in any given response. It produces responses that are correct on average, according to its training distribution — but the training distribution is drifting away from human knowledge and toward a synthetic reprocessing of human knowledge. The outputs look fine. The drift is in the substrate they are drawn from.

This means the standard tools for catching AI quality failures — fact-checking outputs, verifying citations, checking source independence — do not directly address model collapse. They remain necessary. They are no longer sufficient. The reliability of an AI research tool is not only a function of what it produces in any given session. It is also a function of what its training data represented, and whether that representation was faithful to primary human research or was itself a synthetic derivative of earlier AI outputs.

What this means for AI tools used in research

The practical implication for anyone using AI tools to do serious research is uncomfortable. The tools in wide use in 2026 were trained on data that almost certainly includes substantial synthetic content. The specific proportion is unknown because training data composition is not disclosed by most frontier model developers. The Stanford 2026 AI Index found that responsible AI benchmarks, including training data transparency, remain sparse and inconsistently reported even as frontier capability benchmarks proliferate. You can find out how well a model performs on SWE-bench. You cannot find out what fraction of its training data was generated by an earlier model.

This is not a hypothetical concern about a future risk. The LAION-5B dataset, a cornerstone for training many generative models, has measurable contamination from synthetic sources identified in published research. The web crawls that train frontier text models are ongoing processes that cannot distinguish synthetic content from human-authored content at scale without targeted filtering. Targeted filtering exists but its coverage and effectiveness are not independently audited. The contamination is present in the systems being used now.

The researchers studying this problem have identified detection-based filtering and source diversity as the highest-leverage interventions — finding and removing synthetic content before training, and ensuring training data draws from genuinely independent primary sources rather than from web pages that all trace back to the same underlying text. These are engineering problems with tractable solutions. They require disclosure, auditing, and standards that do not yet exist in a form that practitioners can rely on.
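One ingredient of a source-diversity check can be sketched as near-duplicate grouping: pages whose wording overlaps heavily are treated as one underlying source rather than as independent corroboration. This is an illustrative sketch, not the researchers' method. The word-shingle size, the Jaccard threshold of 0.5, and the function names are all assumptions, and a production pipeline would need something more scalable than pairwise comparison.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences in a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_shared_text(docs, threshold=0.5):
    """Greedily group documents that closely match a group's first member.

    Returns lists of document indices; each list is one presumed source.
    """
    groups = []
    for index, doc in enumerate(docs):
        signature = shingles(doc)
        for group in groups:
            if jaccard(signature, shingles(docs[group[0]])) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


pages = [
    "model collapse degrades quality when models train on their own outputs",
    "model collapse degrades quality when models train on their own generated outputs",
    "the committee approved the annual budget after a short debate on tuesday",
]
groups = group_by_shared_text(pages)
print(len(pages), "pages,", len(groups), "presumed independent sources:", groups)
```

Here three pages reduce to two presumed sources, because the first two differ by a single word. Paraphrased or AI-restated copies would evade a word-overlap check like this one, which is part of why coverage and effectiveness need independent auditing.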

Until they do, the series' evidence ladder needs an additional step above the ones already documented. Reference exists. Source supports the claim. Independent corroboration. Evidence is sufficient for this use. Expressed confidence is proportionate. And now: the model producing this output was trained on data that reflects human knowledge rather than a synthetic reprocessing of it. That last step has no standard mechanism for verification. It is the step that makes all the others harder to take seriously.
