The distinction is this: a citation resolving to a real paper is not the same as that paper supporting the claim attached to it. The first is a verifiability check. The second is an evidence check. Most AI research tools perform the first and call it the second.

This has a name now. A research paper published in February 2026 by Rasheed, Banerjee, Mukherjee, and Hazra -- "From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents" (arXiv 2602.13855) -- separates these two properties into distinct measurable metrics: provenance coverage and provenance soundness. Provenance coverage asks whether a claim has a traceable path to a source -- and importantly, it passes even when the link is only topical relevance at the document level. Provenance soundness asks whether the specific text in that source logically entails the specific claim, at the sentence or passage level. A tool can score perfectly on the first and near zero on the second. The paper has a term for what that produces: scientific pollution -- outputs that look credible while the underlying evidence trails are fragile or missing.

The term is apt. It captures something that is harder to detect than a fabricated citation precisely because the source is real.
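
To make the gap between the two metrics concrete, here is a minimal sketch in Python. It assumes each claim has already been labelled by a reviewer with two yes/no judgements, and it computes both metrics as simple fractions over the claim set. The class name, field names, and the ratio definitions are illustrative assumptions, not the paper's formal definitions.

```python
from dataclasses import dataclass


@dataclass
class AuditedClaim:
    """One claim from a synthesis, with two hypothetical reviewer judgements."""
    text: str
    has_traceable_source: bool   # a cited document can be located
    passage_entails_claim: bool  # a specific passage in it entails the claim


def provenance_coverage(claims):
    """Fraction of claims with any traceable source (illustrative ratio)."""
    if not claims:
        return 0.0
    return sum(c.has_traceable_source for c in claims) / len(claims)


def provenance_soundness(claims):
    """Fraction of claims whose cited passage entails them (illustrative ratio)."""
    if not claims:
        return 0.0
    sound = sum(c.has_traceable_source and c.passage_entails_claim for c in claims)
    return sound / len(claims)


# Hypothetical audit: every citation resolves, but only one passage entails its claim.
claims = [
    AuditedClaim("Claim A", True, True),
    AuditedClaim("Claim B", True, False),
    AuditedClaim("Claim C", True, False),
    AuditedClaim("Claim D", True, False),
]

coverage = provenance_coverage(claims)    # 1.0
soundness = provenance_soundness(claims)  # 0.25
```

In this made-up audit the synthesis passes a citation existence check on every claim while three of its four claims lack passage-level support.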

The problem that got solved first

The first problem with AI research tools to draw attention was fabricated citations -- papers that did not exist, authors who were not real, DOIs that resolved to nothing. It drew attention because it was easy to catch and embarrassing when caught. Tools built citation grounding into their architecture, and the rate of outright fabrication dropped. This was a genuine improvement.

What it did not solve was the question of whether a real, retrievable paper actually says what the AI claims it says. That question is harder to check, takes longer, and does not produce the kind of obvious error that gets noticed in a quick scan. You have to read the paper. You have to find the specific passage. You have to assess whether the passage supports the relationship the AI has asserted between the finding and the claim.

Most researchers using AI tools do not do this for every citation. The volume is too high and the format -- a confident synthesis with numbered references -- does not invite that level of scrutiny. The tool has already done the work. The citations are there. It reads like scholarship.

The Rasheed et al. paper, referred to here as the AAR paper, describes what happens at the architectural level when this gap is not closed: the dominant risk shifts from isolated factual errors to scientifically styled outputs whose claim-evidence links are weak, missing, or misleading. Fluency becomes the problem rather than a feature, because fluent text with real citations provides the social signals of credibility without the substance.

What the gap looks like in practice

Consider a policy researcher using an AI research tool to assess the evidence base for a regulatory claim. The tool surfaces twelve citations. All twelve resolve to real papers. The synthesis reads coherently. The confidence indicators are positive.

What the tool has not done: checked whether the specific passages in those papers support the specific relationships asserted in the synthesis. Some of the papers may support a weaker version of the claim. Some may have been published in a context where the finding does not generalise to the regulatory domain being researched. One or two may actually qualify or contradict the claim, but the tool has surfaced them as supporting sources because they discuss the same topic.

None of this produces an obviously wrong output. The citations are real. The papers exist. A reader who does not go back to the source text has no way to know that provenance coverage is high while provenance soundness is low.

This is not a hypothetical failure mode. It is a consequence of how most AI research tools are architected. The retrieval step finds relevant documents. The synthesis step generates claims grounded in those documents. The citation step attaches document references to claims. What is missing is a step that checks, at the passage level, whether the specific text in the cited document actually entails the specific claim being made. That step is what the AAR paper calls provenance soundness, and it is what most tools omit.
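
The missing step can be sketched as a function that runs after citations are attached. The sketch below assumes the pipeline has already produced claims paired with a cited passage, and it takes the entailment judgement as a pluggable function, because how that judgement is made (a human reviewer, a trained model) is outside what this article specifies. All names here are hypothetical, and the stand-in judge is a lookup of hand-written labels, not a real entailment method.

```python
from typing import Callable, List, NamedTuple


class CitedClaim(NamedTuple):
    claim: str
    doc_id: str
    passage: str  # the specific text the citation points at


# Hypothetical signature: (passage, claim) -> does the passage entail the claim?
EntailmentJudge = Callable[[str, str], bool]


def failed_soundness_check(cited: List[CitedClaim],
                           judge: EntailmentJudge) -> List[CitedClaim]:
    """Return the cited claims whose passage does not entail the claim."""
    return [c for c in cited if not judge(c.passage, c.claim)]


# Stand-in judge for the example: hand-written reviewer labels.
# Anything not explicitly labelled as entailed is treated as unsupported.
reviewer_labels = {
    ("Passage reporting the effect in the studied population.",
     "The effect was observed in the studied population."): True,
    ("Passage discussing the same topic without reporting the effect.",
     "The effect holds in the regulatory domain."): False,
}


def label_lookup_judge(passage: str, claim: str) -> bool:
    return reviewer_labels.get((passage, claim), False)


cited = [
    CitedClaim("The effect was observed in the studied population.",
               "doc-1",
               "Passage reporting the effect in the studied population."),
    CitedClaim("The effect holds in the regulatory domain.",
               "doc-2",
               "Passage discussing the same topic without reporting the effect."),
]

flagged = failed_soundness_check(cited, label_lookup_judge)
```

Both citations would pass a document-level relevance check. Only the passage-level step flags the second one.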

Why fluency makes this harder to catch

There is an additional dynamic worth naming. An earlier article in this series noted that AI research tools erode the expertise needed to catch AI errors -- the tasks AI substitutes for are precisely the tasks that build the critical reading capacity required to evaluate AI outputs. Provenance soundness failures are the most direct casualty of that erosion.

Checking whether a cited passage entails a specific claim is a skilled task. It requires knowing the methodological context of the paper, understanding the scope conditions of the finding, and recognising when a relationship that looks supportive is actually qualified or conditional. These are the skills that develop through doing the retrieval and synthesis work manually. When AI tools handle that work, those skills atrophy -- and the errors that require those skills to catch go undetected.

The AAR paper identifies this as a structural property of the failure mode, not an incidental one. Auditability becomes the bottleneck precisely because generation is cheap and fast. When a tool produces a fifty-page research synthesis in three minutes, the human reviewer does not have fifty pages worth of reading time in reserve to audit it. The tool's speed advantage becomes a verification deficit.

The distinction in a single sentence

Provenance coverage tells you a source exists. Provenance soundness tells you the source does the work the citation implies. Most AI research tools measure the first and present it as evidence of the second.

What an honest architecture looks like

The AAR paper proposes four metrics for evaluating research agent auditability: provenance coverage, provenance soundness, contradiction transparency, and audit effort. The first two address the citation gap described above. The third addresses a related problem -- whether the tool surfaces conflicts between sources or suppresses them in favour of a coherent synthesis. The fourth addresses the practical question of how long it takes a human to verify a claim from the evidence trail.

These metrics suggest what an architecture that takes the problem seriously would look like. Source documents would be declared and scoped before the evidence run begins, not retrieved opportunistically during synthesis. Claim-evidence links would be established at the passage level, not the document level -- the relevant question is not whether a paper exists but whether a specific sentence in that paper supports a specific claim. Contradictions between sources would be preserved and surfaced rather than resolved into a unified narrative. And the confidence attached to a claim would reflect the quality of the evidence links, not the fluency of the synthesis.
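
As a rough illustration of those four properties, the sketch below models an evidence run in Python. Sources are declared up front, links attach to passages rather than documents, contradicting links are kept rather than dropped, and the confidence label is derived only from the links. Every name and the labelling rule are assumptions made for this example, not a design taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stance(Enum):
    SUPPORTS = "supports"
    QUALIFIES = "qualifies"
    CONTRADICTS = "contradicts"


@dataclass(frozen=True)
class EvidenceLink:
    doc_id: str
    passage: str    # the specific passage, not just the document
    stance: Stance


@dataclass
class Claim:
    text: str
    links: list = field(default_factory=list)


@dataclass
class EvidenceRun:
    declared_sources: frozenset  # fixed before the run begins

    def link(self, claim: Claim, evidence: EvidenceLink) -> None:
        """Attach passage-level evidence, rejecting undeclared sources."""
        if evidence.doc_id not in self.declared_sources:
            raise ValueError(f"{evidence.doc_id} was not declared for this run")
        claim.links.append(evidence)


def contradictions(claim: Claim) -> list:
    """Contradicting links are preserved so they can be surfaced."""
    return [l for l in claim.links if l.stance is Stance.CONTRADICTS]


def confidence(claim: Claim) -> str:
    """Illustrative rule: the label depends only on the evidence links."""
    stances = {l.stance for l in claim.links}
    if not stances:
        return "unsupported"
    if Stance.CONTRADICTS in stances:
        return "contested"
    if Stance.SUPPORTS in stances:
        return "supported"
    return "qualified"


# Hypothetical run with two declared sources and placeholder passages.
run = EvidenceRun(frozenset({"doc-1", "doc-2"}))
claim = Claim("Hypothetical claim under review.")
run.link(claim, EvidenceLink("doc-1", "Placeholder supporting passage.", Stance.SUPPORTS))
run.link(claim, EvidenceLink("doc-2", "Placeholder contradicting passage.", Stance.CONTRADICTS))

label = confidence(claim)  # "contested"
```

The design choice worth noting is that nothing about the wording of the claim feeds into the label. A fluent claim with no links is still "unsupported".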

This is not a description of how most current AI research tools work. It is a description of the architectural direction the paper argues the field should move toward. The contribution is to name and define what measurable auditability would require -- not to establish that it has been achieved.

The compliance dimension

For researchers working in policy, legal, or regulatory contexts, the provenance soundness gap has a specific consequence that is worth stating directly. A compliance argument built on research where citations resolve but do not support is not a defensible compliance argument. It is a document that looks like one.

The distinction matters when the argument is reviewed. A regulator, a legal counterparty, or an internal audit function that goes back to the cited sources and finds that they do not support the claims they are attached to has identified a failure of the research process, not just an error in the output. The existence of citations is not a defence if the citations do not do the work they appear to do.

This is the compliance consequence that most discussions of AI research tool risk understate. The focus tends to be on fabricated citations -- a risk that is real but has become well-understood and partially addressed. The subtler risk is a research output that survives a citation existence check and fails an evidence support check. That is harder to detect, harder to attribute, and harder to explain when it surfaces.

A February 2026 research paper now names and defines the gap precisely -- provenance coverage and provenance soundness are distinct, measurable properties, and most tools optimise for the first while presenting it as evidence of the second. Whether the field moves to formalise that distinction into evaluation standards is a separate question. That the distinction exists and matters is not.

For anyone who has been relying on citation existence as a proxy for evidence quality, that is a useful distinction to have named. It is also a somewhat uncomfortable one.