The compounding evidence problem: why agentic research pipelines fail quietly

The previous articles in this series described specific failure modes in research evidence: citation cascades that mimic consensus, peer review captured by AI, training data contaminated by synthetic content, expertise atrophied by delegation. Each is a problem in a static system. This article describes what happens when you put a dynamic system in front of all of them. Agentic AI pipelines do not just produce errors. They inherit, transform, and propagate them.

There is a simple piece of mathematics that practitioners building agentic AI systems have started to take seriously. If each step in a multi-step workflow has a 5% probability of producing an error, the errors are independent of one another, and nothing corrects them between steps, then a 10-step pipeline delivers reliable output roughly 60% of the time (0.95 multiplied by itself ten times is about 0.60), even though no individual step appears problematic in isolation. The compounding is multiplicative, not additive. Each step's 95% success rate applies only to the share of runs that reached it intact, and every later step builds on whatever errors it inherits.
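The arithmetic can be checked in a few lines. The sketch below assumes every step fails independently at the same rate and that no error is caught downstream, which is the simplification the 60% figure rests on. The function name is illustrative, not taken from any library.

```python
def pipeline_reliability(per_step_error: float, steps: int) -> float:
    """Probability that every step succeeds.

    Assumes independent errors, the same error rate at each step,
    and no correction between steps.
    """
    return (1.0 - per_step_error) ** steps


if __name__ == "__main__":
    # Reliability falls quickly as the pipeline gets longer.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} steps: {pipeline_reliability(0.05, n):.1%} reliable")
```

Real pipelines violate these assumptions in both directions: a validation step can recover errors, while correlated errors can make the outcome worse than the independent case suggests.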

This arithmetic is well-established in reliability engineering, and the agentic AI research community has been formalising it. What has received less attention is a specific application of it: what happens to evidence quality, not just task accuracy, as errors compound across a research workflow. The distinction matters because task errors are often recoverable. Evidence quality errors frequently are not. By the time a degraded claim reaches a decision, its provenance has been overwritten so many times that it cannot be traced back to the source that would reveal it as weak, partial, or simply wrong.

What an agentic research pipeline actually does

A conventional AI research tool takes a query and returns results. The user sees the sources, evaluates the relevance, and decides what weight to place on each. The pipeline is short, the human is in the loop at every step, and the provenance of each piece of information is visible.

An agentic research pipeline does something structurally different. It breaks a research task into subtasks, assigns those subtasks to specialised agents, passes the outputs of early agents as inputs to later ones, synthesises across multiple intermediate results, and returns a finished work product to the user. The user sees the output of step ten, not the chain of decisions that produced it. The sources that shaped an early summary are embedded in the context that shaped a later synthesis, which in turn shaped the final conclusion. They are not gone. They are invisible.

A typical agentic research pipeline

Step 1: Query decomposition agent breaks the research question into sub-questions.

Steps 2-4: Retrieval agents fetch sources for each sub-question. Source selection reflects the agent's retrieval strategy, not the user's judgment about relevance or quality.

Steps 5-6: Summarisation agents compress each retrieved source. Nuance, caveats, and methodological limitations are candidates for compression.

Step 7: Synthesis agent combines summaries into a unified account. Sources that agreed with each other appear as convergent evidence. Sources that are all downstream of the same original study appear as independent corroboration.

Steps 8-9: Drafting and refinement agents produce the final output. The draft reflects step 7's synthesis, not the original sources.

Step 10: The user receives a finished document. The chain from original source to final claim is not visible in the output.

This is not a flaw in any particular pipeline design. It is the design. The purpose of an agentic pipeline is to absorb complexity so the user does not have to. The question is what gets absorbed along with it.

Three evidence problems that pipelines make worse

The earlier articles in this series identified specific mechanisms by which evidence quality degrades. Agentic pipelines do not create these mechanisms. They accelerate and compound them, and they make the resulting degradation harder to detect.

Source independence collapses invisibly

The first article in this series described the corroboration problem: a citation cascade in which many sources all trace back to a single originating study, creating an appearance of independent consensus where none exists. A human researcher who reads five papers in sequence notices when four of them cite the same source. The convergence is visible in the bibliography.

A summarisation agent processing those same five papers produces a summary of their claims. The summary reflects the claims' convergence, not their common origin. The synthesis agent that processes that summary sees strong agreement and weights it accordingly. By step seven, what was originally a single study with four downstream citations has become convergent evidence from five independent sources. No step was wrong in isolation. The cumulative effect is a false confidence signal that would not have survived a manual literature review.
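The gap between apparent and actual independence can be made concrete with a small sketch. It assumes the pipeline has a citation map for the retrieved papers, which many pipelines will not have. The paper identifiers, the `CITES` map, and both function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical citation map: each paper points to the study its key claim
# is taken from, or None if it reports original data.
CITES: Dict[str, Optional[str]] = {
    "paper_A": None,        # the originating study
    "paper_B": "paper_A",
    "paper_C": "paper_A",
    "paper_D": "paper_B",   # reaches paper_A indirectly, via paper_B
    "paper_E": "paper_A",
}


def root_source(paper: str, cites: Dict[str, Optional[str]]) -> str:
    """Follow the citation chain until reaching a paper with original data."""
    seen: Set[str] = set()
    # The `seen` set guards against accidental cycles in the map.
    while cites.get(paper) is not None and paper not in seen:
        seen.add(paper)
        paper = cites[paper]
    return paper


def independent_roots(papers: List[str], cites: Dict[str, Optional[str]]) -> Set[str]:
    """Collapse a list of papers to the distinct studies they rest on."""
    return {root_source(p, cites) for p in papers}


if __name__ == "__main__":
    papers = list(CITES)
    roots = independent_roots(papers, CITES)
    print(f"apparent sources: {len(papers)}, independent roots: {len(roots)}")
```

A synthesis step that weighted agreement by the number of roots rather than the number of papers would see one study here, not five.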

Caveats are the first thing compressed

Summarisation under token constraints is not a neutral operation. It preserves the central claim and discards what surrounds it. What surrounds central claims in research papers is frequently the most important content for evidence quality purposes: sample size limitations, confidence intervals, replication status, conflict of interest disclosures, the authors' own caveats about generalisability.

A paper that reports a significant finding in a sample of 47 participants with a p-value of 0.048 and notes explicitly that the finding has not been replicated in other settings is a paper that calls for caution. Its summary will report the significant finding. The synthesis built on that summary will treat the finding as established. The document produced by the drafting agent will assert it with the confidence appropriate to established findings, because that is what the evidence it was given supports.

This is not hallucination in the technical sense. The claim in the final document traces to a real paper. The paper does say what the document claims it says. What the document does not say, and cannot say, is that the paper's own authors considered the finding preliminary.
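One way to picture a partial remedy is a summary record that carries caveats in a field of their own, so that shortening the claim text cannot silently discard them. This is a minimal sketch under that assumption, not a description of how any existing pipeline works. The `Summary` type, `compress`, and `confidence_label` are all hypothetical names, and the labelling rule is deliberately crude.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Summary:
    source_id: str
    claim: str
    # Caveats travel separately so that compressing `claim`
    # cannot drop them.
    caveats: List[str] = field(default_factory=list)


def compress(summary: Summary, max_chars: int) -> Summary:
    """Shorten the claim text only; caveats are carried through untouched.

    Assumes max_chars is at least 3 so the ellipsis fits.
    """
    claim = summary.claim
    if len(claim) > max_chars:
        claim = claim[: max_chars - 3].rstrip() + "..."
    return Summary(summary.source_id, claim, list(summary.caveats))


def confidence_label(summary: Summary) -> str:
    """Illustrative rule: any recorded caveat marks the claim as preliminary."""
    return "preliminary" if summary.caveats else "unqualified"


if __name__ == "__main__":
    original = Summary(
        source_id="paper_A",
        claim="A significant effect was found in a sample of 47 participants.",
        caveats=["n = 47", "p = 0.048", "not replicated in other settings"],
    )
    short = compress(original, max_chars=40)
    print(short.claim, confidence_label(short), short.caveats)
```

The design choice is that compression and qualification are handled by different fields, so a downstream drafting agent receives the caveats whether or not the claim text survived intact.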

Each step resets the provenance clock

In a system that tracks claims back to their sources, every intermediate step that produces a new text is an opportunity to lose provenance. A summarisation agent that compresses a source into three sentences produces three sentences, not three sentences plus a link to the methodological context that would qualify them. A synthesis agent that combines three summaries produces a unified account, not three accounts with their independent limitations preserved.

By the time the final output reaches the user, the chain from claim to source has been interrupted multiple times. The claim exists in the document. The source exists somewhere in the retrieval logs. The connection between them, including everything that would allow a reader to evaluate whether the source actually supports the claim, has been progressively overwritten at each intermediate step.

~60%: reliable output rate from a 10-step pipeline at 5% per-step error, before any evidence-specific degradation

30%+: accuracy degradation in linear multi-agent workflows when a high-error agent appears upstream (COCO benchmark, Huang et al. 2025)

39-70%: sequential reasoning degradation in multi-agent vs single-agent settings across four benchmarks and three model families (2025 MAST study)

Epistemic error cascading — what the research calls it

Researchers studying multi-agent system reliability have formalised this phenomenon. The COCO paper (2025) uses the term "epistemic error cascading" to describe how errors in early agents propagate through graph-structured workflows, compounding at each interaction. The mechanism is structural rather than incidental: in a pipeline where Agent B's input is Agent A's output, Agent B does not have access to what Agent A's output should have been. It processes what it received. If what it received was wrong, it reasons from wrong premises, and the output it passes to Agent C inherits those premises.

The astrophysics workflow study from April 2026 makes the application to scientific research concrete. Evaluating an agentic system performing multi-stage reasoning across domain-specific data, the researchers found that early errors propagated throughout the pipeline in ways that were not detectable from the final output. The paper's title captures the quality of the failure: "Plausible but Wrong." The outputs looked credible. They read like research. The errors were embedded in the reasoning, not visible in the surface of the text.

"Plausible but wrong" is a more dangerous failure mode than obvious error in professional research contexts specifically. An obviously wrong output gets rejected at the point of use. A plausible but wrong output gets incorporated into the analysis, the briefing, or the submission. The downstream cost is not the error itself but everything built on top of it before anyone checks.

Why this is different from what came before

The problems described in earlier articles — citation cascades, fabricated references, fragile single-study claims, expertise atrophy — all occur in settings where a human researcher is, in principle, interacting with sources directly. The researcher may not catch the citation cascade. They may not notice the missing replication study. But they had the opportunity. The source was in front of them.

An agentic pipeline removes that opportunity structurally. The user receives a finished document. The intermediate steps are not presented for review because presenting them would defeat the purpose. The sources are not displayed because the pipeline processed them rather than retrieving them for the user. The caveats are not visible because the summarisation step that removed them operated several steps upstream of anything the user saw.

This matters because the standard advice for responsible AI use — check the citations, verify the claims, maintain human oversight — assumes that the citations are present to check and the claims are traceable to verify. In an agentic pipeline, the design intention is specifically that the user should not have to do this. The pipeline does the research. The user gets the output.

The oversight gap

Most guidance on responsible AI use in research assumes human review happens at the interface between the AI system and the user. In an agentic pipeline, the steps that most need review (source selection, caveat preservation, synthesis across sources with shared ancestry) happen between agents, not between the system and the user. Human-in-the-loop governance that operates only at the final output catches artifact-level errors. It does not catch the evidence quality degradation that accumulated upstream.

The scientific workflow is the highest-risk application

Agentic pipelines are being deployed across many professional domains. The evidence quality risks described here apply wherever the pipeline is being used to synthesise research, and they are most severe in domains where the stakes of a wrong conclusion are highest and the errors are hardest to detect from outside the workflow.

In regulatory affairs, an agentic pipeline summarising the evidence base for a product submission compresses the same papers a regulatory reviewer will read in full. The gaps between what the papers say and what the submission claims are exactly the gaps that create compliance risk. The pipeline did not fabricate anything. It produced a plausible summary of the evidence it was given. The summary omits the methodological limitations that the regulator will find when they read the primary source.

In investment research, an agentic pipeline synthesising analyst reports and market data produces a coherent narrative. If the underlying reports share a common framing bias — a sectoral consensus that has not yet been tested against disconfirming evidence — the pipeline's synthesis will reflect and amplify that bias. Each report appeared to be an independent source. The synthesis treated them as such. The investment thesis that emerges from it looks well-supported because the evidence it rests on looks diverse when it is not.

In policy research, the problem is the one described in the earlier article on expertise atrophy, operating at a different level. A policy team using an agentic pipeline to survey the evidence on a contentious question receives a synthesis shaped by the distribution of published research. The pipeline retrieved what it could find. What it could find reflects prior publication patterns. Prior publication patterns reflect prior policy debates, dominant methodological schools, and the amplification cascades that determine which papers get cited and which do not. The pipeline's output is internally coherent and well-sourced. It also reflects an existing structure of emphasis that the policy team had no opportunity to interrogate.

What claim-level provenance tracking changes

The structural problem with agentic pipelines is not that they make mistakes. All research systems make mistakes. The problem is that the architecture of a pipeline is specifically designed to hide the intermediate steps where mistakes accumulate, and where the evidence quality judgments that a skilled researcher would apply are most needed.

The intervention that matches the structure of the problem is one that operates at the level of individual claims rather than final outputs. If each claim in a research document carries a chain of provenance back to the source that generated it, including the intermediate steps that transformed it, then the opacity of the pipeline is reduced without requiring the user to do the research again. This creates a verifiable path from source to output — an audit trail that directly counters what the pipeline's design makes invisible. The document is still a finished product. The claims in it are still synthesised and readable. But each claim can be interrogated: which source did this come from, what did the source actually say, what happened to it in transit from source to synthesis.
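As a rough illustration of what such a chain might look like in practice, the sketch below attaches a source identifier and a history of rewrites to each claim. It assumes every agent step is willing to record the text it received before replacing it. The `Claim` type and its methods are hypothetical, not an existing provenance standard or library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Claim:
    text: str
    source_id: str  # where the claim originated
    # Ordered (step name, text as it stood before that step) pairs.
    history: List[Tuple[str, str]] = field(default_factory=list)

    def transformed(self, step: str, new_text: str) -> "Claim":
        """Return a new Claim whose history records this step's rewrite."""
        return Claim(
            text=new_text,
            source_id=self.source_id,
            history=self.history + [(step, self.text)],
        )

    def trace(self) -> List[str]:
        """Readable path from the source to the current wording."""
        lines = [f"source: {self.source_id}"]
        lines += [f"before {step}: {text}" for step, text in self.history]
        lines.append(f"current: {self.text}")
        return lines


if __name__ == "__main__":
    extracted = Claim(
        "Effect observed in 47 participants (p = 0.048); not yet replicated.",
        source_id="paper_A",
    )
    summarised = extracted.transformed("summarise", "A significant effect was observed.")
    final = summarised.transformed("synthesise", "The effect is established.")
    print("\n".join(final.trace()))
```

Reading the trace for the final claim shows where the sample size and replication status dropped out, which is the kind of interrogation the finished document alone does not allow.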

This does not solve the corroboration problem automatically. A claim that traces back to five downstream citations of the same study is still a claim with weak evidential support. But it makes the problem visible in a way that the pipeline's output, on its own, does not. A researcher who can see that five apparently independent sources all cite the same 2019 paper can make a judgment about the strength of the convergence. A researcher who sees five sources cited in support of a claim, with no visibility into their ancestry, cannot.

The earlier articles in this series described what can go wrong with evidence in static research settings. This article has described why those problems are harder to detect and more damaging in the agentic setting that is rapidly becoming the default for research-intensive professional work. The pipeline is not making research worse on purpose. It is doing exactly what it was designed to do: absorbing complexity so the user does not have to. The question is whether the complexity it is absorbing includes the evidence quality judgments that no pipeline should be making alone.