AI research tools erode the one expertise that can catch what they get wrong

The previous article described model collapse: the degradation of AI training data as synthetic content contaminates web-scale corpora across successive model generations. This article describes a parallel process that operates not in the training pipeline but inside the organisations using AI tools. The two processes are linked, and together they close a trap.

Expertise is not a static possession. It is maintained by exercise. A radiologist who reads films every day retains the pattern recognition that makes their judgment valuable. A radiologist who delegates film reading to an AI system and reviews only its outputs is exercising a different skill: evaluating AI conclusions rather than forming independent ones. These are related tasks but not identical ones, and the gap between them grows over time.

This observation is not new, and it is not specific to AI. Skill atrophy from automation is a well-documented phenomenon. What makes the current situation distinctive is that the atrophy is occurring specifically in the domain of evidence evaluation, at precisely the moment when the AI tools being used to do that evaluation are producing outputs that most need expert scrutiny.

The economics of the substitution

The individual decision to use an AI tool for a research task is rational. The tool is faster, it never gets tired, and it processes more material than a human researcher can cover in the same time. If it gets something wrong, the human reviewer is supposed to catch it. This is the standard human-in-the-loop model and it is a reasonable design for many applications.

The problem is what happens to the human reviewer over time. Catching AI errors requires knowing what correct looks like independently of what the AI produced. A fact-checker who verifies AI-generated claims against primary sources maintains that knowledge. A researcher who reads AI summaries of papers without reading the papers themselves does not. Gradually, the basis for independent evaluation shifts from primary knowledge to familiarity with AI outputs. The reviewer becomes better at spotting AI-style errors and less equipped to spot errors that don't match the AI's characteristic failure modes.

A study by researchers at the University of Passau and Arizona State University, published in February 2026, documents this dynamic empirically. It found that organisations relying heavily on AI tools progressively lose the in-house expertise needed to evaluate the accuracy of those tools' outputs. The mechanism is straightforward: the tasks that exercised that expertise are the tasks that have been delegated to the AI. When the AI makes systematic errors that reflect a degraded training distribution — the model collapse problem described in the previous article — the organisation no longer has the internal capacity to recognise them as errors rather than facts.

47%: executives who report making at least one major business decision based on hallucinated AI content (Deloitte, 2025)

4.3h: weekly hours spent by employees verifying AI outputs (Forrester 2025 enterprise analysis)

$14.2K: estimated annual per-employee cost of AI output verification across enterprise deployments (Forrester, 2025)

The Deloitte figure is the most troubling of the three. Nearly half of executives report having made at least one major business decision based on hallucinated AI content. This is not recklessness — it is a rational response to volume. When AI tools produce more content than the organisation has capacity to verify independently, verification becomes selective, then nominal, then a formal box-ticking exercise conducted by reviewers who lack the primary knowledge to catch the errors that matter most.

A knowledge trap, formalised

Researchers studying the economics of AI training data have modelled the individual decision-making process underlying this dynamic. The argument runs as follows. An individual researcher faces a choice between drawing on the true knowledge distribution — reading primary sources, consulting domain experts, conducting original analysis — and drawing on AI-generated information, which is cheaper and faster. Over time, as more individuals choose the cheaper path, the public knowledge distribution reflects progressively less primary engagement with evidence and progressively more AI-mediated reprocessing of it.

Individual rationality produces collective degradation. Each person's decision to use the AI tool is sensible given their constraints. The aggregate effect of many such decisions is a research community that maintains progressively less direct contact with primary evidence and progressively more dependence on AI systems whose reliability depends on the quality of a training distribution that is itself being shaped by AI outputs.

This is what makes it a trap rather than simply a risk. The same properties that make AI tools attractive — speed, scale, consistent output — are the properties that accelerate the atrophy of the expertise needed to evaluate them critically. The more successful the adoption, the faster the atrophy. And the faster the atrophy, the less equipped organisations are to notice when the tools' reliability declines.
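The claim that individual rationality produces collective degradation can be made concrete with a toy payoff calculation. The sketch below is not the researchers' model; it is a minimal illustration under assumed numbers. The function `payoff` and the parameters `benefit`, `cost_primary` and `cost_ai` are hypothetical, and the quality of the shared knowledge pool is crudely assumed to equal the share of researchers who work from primary sources.

```python
# Toy social-dilemma sketch. All names and numbers are illustrative assumptions.

N = 100  # hypothetical number of researchers in the community


def payoff(choice, n_primary, n_total, benefit=10.0,
           cost_primary=3.0, cost_ai=1.0):
    """Payoff to one researcher.

    Everyone benefits from the quality of the shared knowledge pool, assumed
    here to be the fraction of researchers engaging with primary sources.
    Each researcher pays only the cost of their own chosen method.
    """
    pool_quality = n_primary / n_total
    cost = cost_primary if choice == "primary" else cost_ai
    return benefit * pool_quality - cost


# For every possible number of *other* researchers using primary sources,
# compare what one researcher earns by choosing AI versus choosing primary.
ai_always_better = all(
    payoff("ai", others, N) > payoff("primary", others + 1, N)
    for others in range(N)
)

# Collective outcomes at the two extremes.
all_primary_payoff = payoff("primary", N, N)
all_ai_payoff = payoff("ai", 0, N)

if __name__ == "__main__":
    print("AI is the better private choice in every case:", ai_always_better)
    print("Payoff each if everyone uses primary sources:", all_primary_payoff)
    print("Payoff each if everyone uses AI:", all_ai_payoff)
```

Under these assumed values, switching to the AI tool is the better private choice whatever the others do, because one person's contribution to pool quality is small compared with the cost they save. A community in which everyone makes that choice nonetheless ends up worse off than one in which nobody does.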

Where the atrophy appears first

Not all expertise atrophies at the same rate. The tasks most likely to be delegated to AI first are the tasks with the highest friction and lowest immediate feedback: literature searches, background research, initial summarisation, drafting. These are also the tasks that most develop the capacity to recognise when a summary misrepresents a source, when a background claim doesn't match the literature, when a draft argument rests on a weak evidential foundation. The early-delegated tasks are the ones that build the judgment the later verification requires.

The pattern is visible in three professional contexts where the evidence is most developed.

In healthcare, only 5% of clinical AI studies use actual clinical data, according to the Stanford 2026 AI Index. The majority evaluate AI tools against curated benchmarks that do not reflect the full range of presentations a clinician encounters. Clinicians using AI-assisted diagnosis tools validated mainly on benchmark data therefore lack a reliable basis for knowing when their patient falls outside the range of cases the tool was tested on. The clinical experience that would develop that intuition is precisely what AI assistance reduces the need for.

In legal research, AI tools have measurable hallucination rates of 17% to 34% on legal information even in the best-performing models. Catching those errors requires knowing what the correct answer is, which requires either having looked it up independently or having the domain knowledge to recognise the error without looking. Junior lawyers using AI research tools for the tasks that would previously have built that domain knowledge are developing a different skill set than their predecessors. The difference may not be apparent until a high-stakes matter surfaces an error that a more traditionally trained practitioner would have caught.

In policy research, the issue is less about individual error rates and more about the provenance of the framing itself. AI tools trained on web-scale data reflect the distribution of emphasis and argument in that data. A researcher who uses AI tools to survey a policy area will receive a summary shaped by the distribution of published commentary — which reflects prior debates, dominant framings, and the amplification cascades documented earlier in this series. The AI is not lying about the literature. It is reflecting the literature's existing distortions back at a scale and speed that makes independent triangulation difficult.

The feedback loop closes

The previous article described model collapse as a degradation in the training distribution that shapes AI outputs. This article has described expertise atrophy as a degradation in the human capacity to evaluate those outputs. The two processes are connected in a way that makes the combined problem harder to address than either alone.

How the loop closes

AI tools substitute for research tasks. Those tasks are where domain expertise is built and maintained. As expertise atrophies, the capacity to catch AI errors declines. Meanwhile, AI training data incorporates more synthetic content, degrading the tools' reliability. Each generation of AI outputs is less reliable and less scrutinised. The degradation compounds invisibly because both the source of error and the mechanism for detecting it are moving in the same direction.

As model reliability falls with each training generation, the human capacity to detect that decline also falls. The two processes reinforce each other.

The consequence is that the standard assurance model — humans verify AI outputs — becomes progressively less robust precisely as AI adoption scales. At low adoption rates, experienced practitioners who have spent years engaging with primary sources review AI outputs from a strong independent base. At high adoption rates, that base has eroded because the tasks that built it have been delegated. The review is conducted by practitioners whose primary knowledge of the domain derives increasingly from AI-mediated sources rather than original engagement.
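The compounding can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a measurement: the function `simulate_loop`, the starting values and the per-generation rates are all hypothetical, and the share of errors that slips through review is simply assumed to be the tool's error rate multiplied by the reviewers' miss rate.

```python
# Toy feedback-loop sketch. Every number here is an illustrative assumption.

def simulate_loop(generations=8, error_rate=0.05, detection=0.9,
                  error_growth=1.25, detection_decay=0.85):
    """Return the undetected error rate for each model generation.

    error_rate      - share of AI outputs that are wrong (assumed)
    detection       - share of those errors human reviewers catch (assumed)
    error_growth    - per-generation multiplier on error_rate (model collapse)
    detection_decay - per-generation multiplier on detection (expertise atrophy)
    """
    undetected = []
    for _ in range(generations):
        # Errors that are both produced and missed by review.
        undetected.append(error_rate * (1.0 - detection))
        # Tool reliability falls; cap the error rate at 100%.
        error_rate = min(1.0, error_rate * error_growth)
        # Reviewers' capacity to catch errors falls as expertise atrophies.
        detection = detection * detection_decay
    return undetected


# Both processes running together, then each one on its own.
combined = simulate_loop()
collapse_only = simulate_loop(detection_decay=1.0)
atrophy_only = simulate_loop(error_growth=1.0)

if __name__ == "__main__":
    print("gen  combined  collapse_only  atrophy_only")
    for g, (c, m, a) in enumerate(zip(combined, collapse_only, atrophy_only)):
        print(f"{g:>3}  {c:8.4f}  {m:13.4f}  {a:12.4f}")
```

With these assumed rates, the undetected error share rises in every generation and ends higher when both processes run together than when either runs alone, which is the sense in which the two reinforce each other.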

This is not an argument against AI tools in research. It is an argument about a structural property of how those tools interact with the expertise needed to use them responsibly. Recognising the structure matters because the interventions it implies are different from the ones usually discussed.

What this implies for practice

Most guidance on responsible AI use focuses on output verification: check the citations, confirm the claims, maintain human oversight. This advice is correct and necessary. It addresses the artifact-level failures — the fabricated reference, the misrepresented source, the overconfident claim — that earlier articles in this series described.

The expertise atrophy problem is a level below that. It concerns the capacity to verify, not the act of verification. Maintaining that capacity requires sustained engagement with primary sources and primary analysis tasks, even when AI tools make it unnecessary in the short term. A research team that never reads papers without an AI summary is a research team that is progressively less able to evaluate whether the summary is accurate. A policy analyst who never writes a literature review without AI assistance is developing a different relationship to the evidence base than one who does.

The disciplines that have thought most carefully about this are the ones where the consequences of getting it wrong are most visible and most rapidly attributed. Aviation uses mandatory manual flying hours alongside automated flight to preserve the judgment needed when automation fails. Surgery requires open procedure training alongside minimally invasive techniques for the same reason. The principle is not that automation is bad. It is that the human capacity to override or correct automation requires regular exercise against the actual task, not against the automation's outputs.

Research practice has not yet developed an equivalent norm. The pressures run the other way: faster output, broader coverage, lower cost. These are real pressures with real value. The question this series has been building toward is whether the evidence quality those pressures are trading against is visible enough to be weighed consciously — or whether it is being eroded quietly, in a way that will become apparent only when a specific decision, built on a specific AI-processed evidence base, fails in a specific and traceable way.

By that point, the expertise that could have caught the problem earlier may no longer exist, and with it the ability to reconstruct what went wrong.