The peer review loop is breaking

At ICLR 2026, one of the world's largest AI conferences, 21% of peer reviews were classified as fully AI-generated. More than half showed signs of AI involvement. The same conference reviews AI research. This is the first measured instance of a closed loop: AI writing papers, AI reviewing them, with human oversight increasingly thin in between. The implications reach well beyond academia.

The finding came from an unlikely starting point. A Carnegie Mellon professor posted on social media offering a reward to anyone who could systematically scan ICLR 2026 submissions for AI-generated text. Max Spero, CEO of Pangram Labs, responded. Within twelve hours, his team had written code to analyse all 75,800 peer reviews submitted to the conference. Two caveats are worth naming upfront: Pangram is a commercial AI detection company with a business interest in demonstrating that AI content is widespread, and their detection model, EditLens, was itself a paper submitted to ICLR 2026. The findings should be read with that context. They are nonetheless consistent with what ICLR's own organisers had already acknowledged — announcing desk rejections for undisclosed LLM use and consequences for reviewers submitting hallucinated reviews — before Pangram published their numbers.

The scale was larger than most people expected. One in five reviews (15,899 out of 75,800, or about 21%) were classified as fully AI-generated. Reviews with partial AI involvement pushed the total past 50%. The comparable 2024 figure for the same conference was 16.9%. In two years, the rate had climbed by roughly four percentage points. The curve is not flattening.

21% of ICLR 2026 peer reviews fully AI-generated (Pangram Labs, 2026)

50%+ of all reviews showed some form of AI involvement

4.43 average score given by AI reviews vs 4.13 by fully human reviews

That last number is the one that matters most for evidence quality. AI-generated reviews gave systematically higher scores than human reviews — despite, in Pangram's analysis, correlating with lower paper quality. The review process is not just being automated. It is being gamed in a direction that inflates the apparent quality of the research record.
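The headline figures can be re-derived from the counts quoted above. The short sketch below does only that arithmetic; the variable names are illustrative, and the relative increase is a derived quantity that the original analysis does not report.

```python
# Figures as quoted in this article (Pangram Labs analysis of ICLR reviews).
TOTAL_REVIEWS = 75_800
FULLY_AI_REVIEWS = 15_899
SHARE_2024_PCT = 16.9      # share of fully AI-generated reviews in 2024
AI_MEAN_SCORE = 4.43       # average score given by AI reviews
HUMAN_MEAN_SCORE = 4.13    # average score given by fully human reviews

# Share of fully AI-generated reviews in 2026, in percent.
share_2026_pct = 100 * FULLY_AI_REVIEWS / TOTAL_REVIEWS

# Change since 2024 in percentage points (absolute difference of two shares).
pp_change = share_2026_pct - SHARE_2024_PCT

# Relative growth of the share since 2024 (derived here, not quoted in the source).
relative_change_pct = 100 * pp_change / SHARE_2024_PCT

# Gap between average AI and average human review scores.
score_gap = AI_MEAN_SCORE - HUMAN_MEAN_SCORE

print(f"2026 share of fully AI-generated reviews: {share_2026_pct:.1f}%")
print(f"Change since 2024: {pp_change:.1f} percentage points")
print(f"Relative increase since 2024: {relative_change_pct:.1f}%")
print(f"Score gap (AI minus human): {score_gap:.2f}")
```

A climb of about four percentage points sounds modest, but measured against the 2024 base it is a relative increase of roughly a quarter.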

What AI peer review actually looks like

Pangram's analysis identified five recurring characteristics of AI-generated reviews. Among them: they were longer than human reviews, they used section headers with bold formatting, and they had low information density, with extensive lists of weaknesses and questions that did not engage with the specific argument of the paper. One review ran to 3,000 words, listing 40 weaknesses and 40 questions. Exhaustive and weightless is a precise description of what AI reviewing looks like at scale.
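To make those surface traits concrete, here is a minimal sketch that counts them in a review's text. It is not Pangram's method or the EditLens model: the function name, the regular expressions, and the assumption that reviews are written in markdown-style plain text are all illustrative. Counting these features describes a review; it does not detect AI authorship.

```python
import re

# Hypothetical helper: counts surface features only, it is NOT an AI detector.
def surface_features(review: str) -> dict:
    lines = review.splitlines()

    # A "bold header" here means a line that is only **Some Title** (optionally with a colon).
    bold_headers = [
        ln for ln in lines if re.fullmatch(r"\s*\*\*[^*]+\*\*:?\s*", ln)
    ]

    # A "listed item" is a line starting with a bullet or with "1." / "1)" numbering.
    listed_items = [
        ln for ln in lines if re.match(r"\s*(?:[-*•]|\d+[.)])\s+", ln)
    ]

    return {
        "word_count": len(review.split()),
        "bold_header_count": len(bold_headers),
        "listed_item_count": len(listed_items),
    }


sample = (
    "**Summary**\n"
    "The paper studies X.\n"
    "**Weaknesses**\n"
    "- Lacks ablations.\n"
    "- Limited baselines.\n"
    "1. Why this dataset?\n"
)
print(surface_features(sample))
```

None of these counts measures information density, which is the trait that matters: a long, heavily structured review can still be a careful one, and a short one can still be machine-written.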

This matters for how the output gets used downstream. A 3,000-word review with 40 listed weaknesses creates an impression of rigorous scrutiny. The weaknesses are real-looking — grammatically correct, technically plausible, specific enough to seem engaged. But they are not the result of a human researcher reading the paper carefully and identifying the specific methodological choice that undermines the central claim. They are the result of a model producing what a thorough review looks like, without the epistemic work that a thorough review requires.

The distinction between looking like scrutiny and being scrutiny is exactly the distinction this series of articles has been tracing from a different direction: fabricated citations look like citations, amplification cascades look like consensus, fragile assurance looks like established evidence. The peer review finding is the same failure mode at a different point in the evidence chain.

The proxy-sovereign problem

A paper published in January 2026 by researchers at the Technical University of Denmark and Technical University of Darmstadt formalises what is happening here with useful precision. The paper introduces the concept of proxy-sovereign evaluation — a condition in which a field's evaluation process has shifted from tracking actual quality to tracking easier-to-measure proxies for quality.

The paper identifies two forces driving this shift, which it treats as a phase transition: verification pressure, when claims outpace verification capacity, and signal shrinkage, when real improvements become hard to separate from noise. When the volume of submissions exceeds the community's ability to review them carefully, the review process migrates toward signals that are easier to produce and evaluate: benchmark scores, formal compliance, section structure, word count. The AI review crisis is a direct consequence of this migration.

The paper's central recommendation is for verification-first AI: deploying AI as an adversarial auditor that generates auditable verification artifacts, rather than as a score predictor that adds more reviews to a process already failing at quality control. The difference is whether the tool produces something that can be independently checked — a specific claim about a methodology error, a verifiable citation check, a reproducibility test — or whether it produces a plausible-sounding evaluation that cannot be independently verified.

The proxy-sovereign condition: A field is in a proxy-sovereign state when its evaluation process has shifted from tracking what is actually true or good to tracking what is easier to measure. High prestige is no protection: the paper argues that high-profile venues may be more exposed, because the volume of submissions to prestigious conferences outpaces verification capacity faster than it does at smaller venues.

Why this matters beyond AI research

The ICLR finding is about an AI conference. But the mechanism it exposes is not domain-specific.

Policy research draws on academic literature as one of its primary evidence sources. When a policy analyst cites "research showing that X", they are often citing a paper that was accepted at a conference or published in a journal on the basis of peer review. If peer review is being conducted by AI systems that systematically inflate quality assessments, the policy analyst's evidence base is systematically degraded. This is not because the analyst did anything wrong, but because the upstream quality control has failed.

Regulatory research faces the same problem. EU AI Act compliance work, for example, regularly references published research on AI reliability, bias, and safety. If that research passed peer review through AI-generated evaluations that missed substantive methodological problems, the compliance conclusions built on it are fragile — in precisely the sense we described in the previous article on evidential fragility.

The contamination path runs: AI-generated paper, AI-generated review, acceptance, citation, policy brief, regulatory guidance. Each step in the chain looks legitimate. The problem is invisible unless you track the lineage.
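Tracking the lineage can be sketched as a walk over a citation graph. The example below is a minimal illustration under stated assumptions: the document names, the `CITES` mapping, and the `REVIEW_STATUS` labels are all hypothetical, and in practice the review provenance of a paper is rarely available in this form.

```python
from collections import deque

# Hypothetical lineage: each document maps to the sources it relies on.
CITES = {
    "regulatory_guidance": ["policy_brief"],
    "policy_brief": ["paper_A", "paper_B"],
    "paper_A": [],
    "paper_B": [],
}

# Hypothetical provenance labels for how each paper was reviewed.
REVIEW_STATUS = {
    "paper_A": "human_reviewed",
    "paper_B": "unverified",
}


def upstream_sources(doc: str, cites: dict) -> set:
    """Return every document that `doc` depends on, directly or indirectly."""
    seen, queue = set(), deque([doc])
    while queue:
        current = queue.popleft()
        for source in cites.get(current, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


def unverified_upstream(doc: str, cites: dict, status: dict) -> list:
    """List upstream sources whose review provenance is marked unverified."""
    return sorted(
        s for s in upstream_sources(doc, cites) if status.get(s) == "unverified"
    )


print(unverified_upstream("regulatory_guidance", CITES, REVIEW_STATUS))
```

The guidance document never cites `paper_B` directly, which is the point: the weak link only shows up when the walk goes past the first hop.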

The incentive collapse condition

The verification-first paper derives what it calls an incentive-collapse condition: the point at which rational effort shifts from truth-seeking to proxy optimisation, even when current decisions still appear reliable. This is the most important insight in the paper for understanding why the problem compounds rather than self-corrects.

Once a significant fraction of reviewers use AI, the remaining human reviewers are at a disadvantage. They spend more time per review. They produce shorter, more focused critiques. The AI reviews are longer and more comprehensive-looking. If longer reviews with more listed weaknesses correlate with reviewer reputation or acceptance rates, the incentive to do careful short reviews disappears. The proxy — review length, apparent thoroughness — becomes the target rather than the underlying quality it was meant to signal.

This is not a moral failure by individual researchers. It is a structural consequence of a review system designed for a world where the volume of submissions was humanly manageable and where the cost of producing a high-quality review was roughly similar across all reviewers. Neither condition holds anymore.

What the community is doing about it

ICLR's leadership announced stricter guidelines following the Pangram finding, including mandatory declarations of AI use in reviews and enhanced verification processes. In May 2026, PLOS became the first major academic publisher to deploy an AI tool specifically for detecting suspicious peer reviews, flagging copied peer reviews to help uncover fraud in academic publishing — a notable escalation from detecting AI content in papers to detecting it in the review process itself.

A survey of 1,600 academics found that more than 50% have used AI tools while peer reviewing manuscripts, even when core judgments remained human-authored. This makes the detection problem considerably harder: a review that is partially AI-assisted but human-edited may be more problematic than a fully AI-generated review, because it is harder to detect and because the human editor may not have engaged critically with the AI's output before submitting it.

The arXiv community has responded with proposals for author feedback mechanisms and reviewer incentives — stipends, credits, reputation systems — to make careful human reviewing more attractive relative to delegating to an AI. These are reasonable responses to the incentive problem, but they address the symptom rather than the structural cause.

The deeper question for evidence-dependent research

The ICLR finding sits at the end of a chain of evidence quality failures that this series of articles has been tracing from different angles. Fabricated citations undermine the existence guarantee of references. Citation amplification cascades undermine the independence guarantee of corroboration. Evidential fragility undermines the sufficiency guarantee of cited evidence. AI-generated peer review undermines the quality guarantee of accepted research.

Each failure mode is distinct. But they share a common structure: a process that was designed to ensure evidence quality is being bypassed or corrupted in a way that preserves the appearance of quality while removing the substance.

The appropriate response is not to distrust all published research. The appropriate response is to treat peer review acceptance as a weaker signal than it was five years ago, and to build research workflows that do not rely on acceptance alone as a quality guarantee. What a paper claims, what evidence it cites, whether those citations exist and support the claim, and whether independent research paths arrive at the same conclusion — these questions matter more now, not less, because the upstream filtering that used to catch bad research is less reliable than it was.
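One way to make those questions routine is to record them per source, so that acceptance is noted but never counted as sufficient on its own. The sketch below is illustrative only: the `SourceCheck` class, its field names, and the threshold of two independent research paths are assumptions, not an established standard.

```python
from dataclasses import dataclass


# Hypothetical checklist for one cited source; all names are illustrative.
@dataclass
class SourceCheck:
    claim_identified: bool         # do we know exactly what the paper claims?
    citations_exist: bool          # do its cited references actually exist?
    citations_support_claim: bool  # do those references support the claim?
    independent_paths: int         # independent lines of research reaching the same conclusion
    peer_reviewed: bool = False    # recorded, but deliberately not used as a pass condition

    def open_questions(self) -> list[str]:
        """Return the checks that still need work before relying on the source."""
        questions = []
        if not self.claim_identified:
            questions.append("state the paper's claim precisely")
        if not self.citations_exist:
            questions.append("confirm the cited references exist")
        if not self.citations_support_claim:
            questions.append("confirm the references support the claim")
        if self.independent_paths < 2:  # assumed threshold, adjust to the use case
            questions.append("find independent corroboration")
        return questions


accepted_paper = SourceCheck(
    claim_identified=True,
    citations_exist=True,
    citations_support_claim=False,
    independent_paths=1,
    peer_reviewed=True,
)
for question in accepted_paper.open_questions():
    print("TODO:", question)
```

The design choice is that `peer_reviewed` never appears in `open_questions`: an accepted paper with unsupported citations still generates work.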

Verification-first is not a technical feature. It is a research posture. The peer review crisis makes it newly urgent.
