The finding came from an unlikely starting point. A Carnegie Mellon professor posted on social media offering a reward to anyone who could systematically scan ICLR 2026 submissions for AI-generated text. Max Spero, CEO of Pangram Labs, responded. Within twelve hours, his team had written code to analyse all 75,800 peer reviews submitted to the conference.
The scale of the results surprised most observers. Roughly one in five reviews (15,899 out of 75,800, or about 21%) were classified as fully AI-generated. Reviews with a lesser degree of AI involvement pushed the total past 50%. The 2024 figure for the same conference was 16.9%. In two years, the rate had climbed about four percentage points. The curve is not flattening.
The finding that matters most for evidence quality concerns scores rather than prevalence. AI-generated reviews gave systematically higher scores than human reviews even though, in Pangram's analysis, they correlated with lower paper quality. The review process is not just being automated. It is being gamed in a direction that inflates the apparent quality of the research record.
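The percentages quoted above can be checked directly from the counts. The short sketch below does the arithmetic, assuming the 16.9% figure for 2024 measures the same quantity as the 2026 count (fully AI-generated reviews); the variable names are illustrative.

```python
# Figures quoted in the text.
fully_ai_reviews = 15_899
total_reviews = 75_800
rate_2024_pct = 16.9  # assumed to measure the same quantity as the 2026 count

# Share of reviews classified as fully AI-generated.
rate_2026_pct = 100 * fully_ai_reviews / total_reviews

# Absolute change, in percentage points.
increase_pp = rate_2026_pct - rate_2024_pct

# Change relative to the 2024 rate, in percent.
relative_increase_pct = 100 * increase_pp / rate_2024_pct

print(f"2026 share: {rate_2026_pct:.1f}%")
print(f"Increase: {increase_pp:.1f} percentage points")
print(f"Relative increase: {relative_increase_pct:.0f}%")
```

A rise of about four percentage points sounds modest, but measured against the 2024 rate it is a relative increase of roughly a quarter.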
What AI peer review actually looks like
Pangram's analysis identified five recurring characteristics of AI-generated reviews. Among them: the reviews were longer than human reviews, they used section headers with bold formatting, and they had low information density, with extensive lists of weaknesses and questions that did not engage with the specific argument of the paper. One review ran to 3,000 words, listing 40 weaknesses and 40 questions. Exhaustive and weightless is a precise description of what AI reviewing looks like at scale.
This matters for how the output gets used downstream. A 3,000-word review with 40 listed weaknesses creates an impression of rigorous scrutiny. The weaknesses are real-looking — grammatically correct, technically plausible, specific enough to seem engaged. But they are not the result of a human researcher reading the paper carefully and identifying the specific methodological choice that undermines the central claim. They are the result of a model producing what a thorough review looks like, without the epistemic work that a thorough review requires.
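The surface traits described above (length, headers, list-heavy structure) are easy to measure, which is part of why they work as a proxy. The sketch below counts them for a review written in Markdown-style text. It is an illustration only: it is not Pangram's detection method, the function name and regular expressions are assumptions, and none of these features establishes who or what wrote a review.

```python
import re

def surface_features(review: str) -> dict:
    """Count surface traits of a review; says nothing about its substance."""
    lines = [ln.strip() for ln in review.splitlines() if ln.strip()]
    # Headers: "## Title" or a line consisting only of bold text.
    header = re.compile(r"^(#{1,6}\s+\S|\*\*[^*]+\*\*:?$)")
    # List items: "- x", "* x", "• x", "1. x" or "1) x".
    list_item = re.compile(r"^([-*•]\s+|\d+[.)]\s+)")
    n_headers = sum(1 for ln in lines if header.match(ln))
    n_items = sum(1 for ln in lines if list_item.match(ln))
    return {
        "word_count": len(review.split()),
        "header_count": n_headers,
        "list_item_count": n_items,
        "list_item_share": n_items / len(lines) if lines else 0.0,
    }

# Hypothetical review fragment, written for this example.
sample = """## Weaknesses
1. The evaluation could be more comprehensive.
2. The related work section could be expanded.
**Questions**
- Could the authors clarify the motivation?
"""

features = surface_features(sample)
print(features)
```

Every item in the sample could be pasted into a review of almost any paper. The counts go up while the engagement with a specific argument stays at zero, and that gap is what no surface measure captures.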
The distinction between looking like scrutiny and being scrutiny is exactly the distinction this series of articles has been tracing from a different direction: fabricated citations look like citations, amplification cascades look like consensus, fragile assurance looks like established evidence. The peer review finding is the same failure mode at a different point in the evidence chain.
The proxy-sovereign problem
A paper published in January 2026 by researchers at the University of Washington formalises what is happening here with useful precision. The paper introduces the concept of proxy-sovereign evaluation — a condition in which a field's evaluation process has shifted from tracking actual quality to tracking easier-to-measure proxies for quality.
The paper identifies two forces driving this shift: verification pressure, when claims outpace verification capacity, and signal shrinkage, when real improvements become hard to separate from noise. When the volume of submissions exceeds the community's ability to review them carefully, the review process migrates toward signals that are easier to produce and evaluate, such as benchmark scores, formal compliance, section structure, and word count. The AI review crisis is a direct consequence of this migration.
The paper's central recommendation is for verification-first AI: deploying AI as an adversarial auditor that generates auditable verification artifacts, rather than as a score predictor that adds more reviews to a process already failing at quality control. The difference is whether the tool produces something that can be independently checked — a specific claim about a methodology error, a verifiable citation check, a reproducibility test — or whether it produces a plausible-sounding evaluation that cannot be independently verified.
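One way to make the distinction concrete is to ask what a checkable output would have to contain. The sketch below is a hypothetical data structure, not something taken from the paper: the class name, its fields, and the `is_auditable` rule are all assumptions chosen for illustration, and the example contents are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VerificationArtifact:
    claim: str         # the specific statement being made about the paper
    kind: str          # e.g. "citation_check", "methodology_error", "reproducibility_test"
    location: str      # where in the paper the claim applies
    how_to_check: str  # steps a third party can follow to confirm or refute it

def is_auditable(artifact: VerificationArtifact) -> bool:
    """Treat an artifact as auditable only if it says what, where, and how to check."""
    required = (artifact.claim, artifact.location, artifact.how_to_check)
    return all(field.strip() for field in required)

# Invented examples: one specific and checkable, one plausible but unanchored.
specific = VerificationArtifact(
    claim="Reference [12] does not report the accuracy figure attributed to it.",
    kind="citation_check",
    location="Section 4.2, paragraph 3",
    how_to_check="Open reference [12] and search its results tables for the quoted figure.",
)
vague = VerificationArtifact(
    claim="The evaluation could be more comprehensive.",
    kind="methodology_error",
    location="",
    how_to_check="",
)

print(is_auditable(specific))
print(is_auditable(vague))
```

The second example reads like a review comment, but nobody can confirm or refute it. Under this assumed rule it does not count as a verification artifact.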
Why this matters beyond AI research
The ICLR finding is about an AI conference. But the mechanism it exposes is not domain-specific.
Policy research draws on academic literature as one of its primary evidence sources. When a policy analyst cites "research showing that X", they are often citing a paper that was accepted at a conference or published in a journal on the basis of peer review. If peer review is being conducted by AI systems that systematically inflate quality assessments, the policy analyst's evidence base is systematically degraded, not because the analyst did anything wrong, but because the upstream quality control has failed.
Regulatory research faces the same problem. EU AI Act compliance work, for example, regularly references published research on AI reliability, bias, and safety. If that research passed peer review through AI-generated evaluations that missed substantive methodological problems, the compliance conclusions built on it are fragile — in precisely the sense we described in the previous article on evidential fragility.
The contamination path runs: AI-generated paper, AI-generated review, acceptance, citation, policy brief, regulatory guidance. Each step in the chain looks legitimate. The problem is invisible unless you track the lineage.
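Tracking lineage can be as simple as recording what each document relies on and walking that record upstream. The sketch below is a minimal illustration under stated assumptions: the chain mirrors the one described above, the `independently_verified` flags are invented, and the function name is hypothetical.

```python
from collections import deque

# Hypothetical lineage: each item maps to the upstream items it relies on.
relies_on = {
    "regulatory guidance": ["policy brief"],
    "policy brief": ["accepted paper"],
    "accepted paper": ["peer review"],
    "peer review": [],
}

# Invented flags; in practice each would come from an actual check.
independently_verified = {
    "regulatory guidance": True,
    "policy brief": True,
    "accepted paper": True,
    "peer review": False,
}

def unverified_upstream(item, relies_on, verified):
    """Return every upstream dependency of `item` that lacks independent verification."""
    seen = {item}
    queue = deque([item])
    found = []
    while queue:
        current = queue.popleft()
        for source in relies_on.get(current, []):
            if source in seen:
                continue
            seen.add(source)
            if not verified.get(source, False):
                found.append(source)
            queue.append(source)
    return found

weak_links = unverified_upstream("regulatory guidance", relies_on, independently_verified)
print(weak_links)
```

Each document in the chain looks sound when inspected alone. The weak link only shows up when the query starts at the downstream end and follows the dependencies back.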
The incentive collapse condition
The verification-first paper derives what it calls an incentive-collapse condition: the point at which rational effort shifts from truth-seeking to proxy optimisation, even when current decisions still appear reliable. This is the most important insight in the paper for understanding why the problem compounds rather than self-corrects.
Once a significant fraction of reviewers use AI, the remaining human reviewers are at a disadvantage. They spend more time per review. They produce shorter, more focused critiques. The AI reviews are longer and more comprehensive-looking. If longer reviews with more listed weaknesses correlate with reviewer reputation or acceptance rates, the incentive to do careful short reviews disappears. The proxy — review length, apparent thoroughness — becomes the target rather than the underlying quality it was meant to signal.
This is not a moral failure by individual researchers. It is a structural consequence of a review system designed for a world where the volume of submissions was humanly manageable and where the cost of producing a high-quality review was roughly similar across all reviewers. Neither condition holds anymore.
What the community is doing about it
ICLR's leadership announced stricter guidelines following the Pangram finding, including mandatory declarations of AI use in reviews and enhanced verification processes. In May 2026, PLOS became the first major academic publisher to deploy an AI tool specifically for detecting suspicious peer reviews, flagging copied peer reviews to help uncover fraud in academic publishing. That is a notable escalation: from screening the content of papers to screening the review process itself.
A survey of 1,600 academics found that more than 50% have used AI tools while peer reviewing manuscripts, even when core judgments remained human-authored. This makes the detection problem considerably harder: a review that is partially AI-assisted but human-edited may be more problematic than a fully AI-generated review, because it is harder to detect and because the human editor may not have engaged critically with the AI's output before submitting it.
The arXiv community has responded with proposals for author feedback mechanisms and reviewer incentives — stipends, credits, reputation systems — to make careful human reviewing more attractive relative to delegating to an AI. These are reasonable responses to the incentive problem, but they address the symptom rather than the structural cause.
The deeper question for evidence-dependent research
The ICLR finding sits at the end of a chain of evidence quality failures that this series of articles has been tracing from different angles. Fabricated citations undermine the existence guarantee of references. Citation amplification cascades undermine the independence guarantee of corroboration. Evidential fragility undermines the sufficiency guarantee of cited evidence. AI-generated peer review undermines the quality guarantee of accepted research.
Each failure mode is distinct. But they share a common structure: a process that was designed to ensure evidence quality is being bypassed or corrupted in a way that preserves the appearance of quality while removing the substance.
The appropriate response is not to distrust all published research. The appropriate response is to treat peer review acceptance as a weaker signal than it was five years ago, and to build research workflows that do not rely on acceptance alone as a quality guarantee. What a paper claims, what evidence it cites, whether those citations exist and support the claim, and whether independent research paths arrive at the same conclusion — these questions matter more now, not less, because the upstream filtering that used to catch bad research is less reliable than it was.
Verification-first is not a technical feature. It is a research posture. The peer review crisis makes it newly urgent.