In AI-assisted consulting work, the output that looks most confident is sometimes the work that most needs checking

The partner review is the quality gate the consulting model has always relied on. AI has introduced a failure mode it cannot see -- not because the output looks wrong, but because nothing in the output says which parts of it are.

Before a deliverable reaches a client, someone senior reads it and decides whether it holds up -- whether the analysis is sound, whether the claims are defensible, whether the evidence supports what the engagement team is asserting. That review is the quality gate the consulting model has always relied on.

The problem is that AI has introduced a failure mode the partner review cannot see.

The study that changed how this should be understood

A 2023 study conducted with Boston Consulting Group -- now formally published in Organization Science -- ran a pre-registered experiment with 758 BCG consultants to measure what actually happens when skilled knowledge workers use AI on realistic consulting tasks. The results on the positive side have been widely cited: consultants using GPT-4 completed 12.2% more tasks, worked 25.1% faster, and produced output rated more than 40% higher in quality by independent evaluators. Those are striking numbers and they have driven a significant portion of the AI adoption narrative in professional services.

The finding that received less attention is the one that matters more for how this plays out in practice.

For a task deliberately placed outside the boundary of what current AI handles well, consultants who used AI were 19 percentage points less likely to produce correct solutions than those who worked without it. The AI made them worse, not better. This is not an argument against AI use in consulting -- the positive findings are real and the productivity gains are substantial. The argument is more specific: a 40% quality uplift on tasks AI handles well, combined with a 19-percentage-point penalty on tasks it handles poorly, is only a good trade-off if the engagement team can tell which tasks sit where. The study's authors are specific about why that is the problem. The boundary between tasks AI handles well and tasks it handles poorly is not predictable in advance. It does not correspond to task difficulty. Tasks that appear similar in complexity can sit on opposite sides of the capability boundary, and the AI's output does not signal which side it is on.

The researchers named this the "jagged technological frontier." The frontier is not a smooth line between easy and hard. It is irregular, counterintuitive, and opaque. A consultant working across a typical engagement workflow will have tasks on both sides of it without knowing which is which.

What this means for the deliverable

A thorough partner review does more than read a synthesis for plausibility. It interrogates the analysis -- challenging the numbers, testing logical coherence against domain knowledge, questioning assumptions, and applying experience to spot conclusions that do not fit known market dynamics. This is real and important quality assurance, and it catches a significant class of errors.

The failure mode the BCG study describes is different from the ones partner review is calibrated to catch. The study found that consultants using AI on outside-frontier tasks produced analyses that were substantively wrong -- not stylistically implausible, not obviously thin, but analytically incorrect in ways that a reviewer without deep specialist knowledge of that specific sub-task would not necessarily detect. The failure is not that the output sounds confident. It is that the output is substantively plausible while being wrong, and there is no signal in the work product that flags which parts of the analysis crossed the capability boundary.

This matters most in cross-disciplinary or highly specialised work -- a due diligence report covering a technical domain the reviewing partner is not expert in, a regulatory impact assessment spanning multiple jurisdictions, a market sizing that combines macroeconomic modelling with sector-specific dynamics. In these cases, the partner's domain expertise cannot fully substitute for knowing where the AI's capability boundary fell within the specific analytical tasks the team performed.

This is distinct from the hallucinated citation problem that has attracted most of the attention around AI in consulting -- the Deloitte Australia government report, the EY Canada loyalty fraud paper, the fabricated court quotes. Those failures are detectable by checking whether sources exist. The failure mode the BCG study describes is more subtle: a synthesis that is analytically wrong rather than referentially fabricated, produced on a task the AI handled poorly, with no indicator in the output that it belongs to that category.

The evidence quality problem underneath the output

There is a related structural problem that sits beneath the visible output of consulting work.

When a consulting team produces a regulatory impact assessment, a market sizing analysis, or a due diligence report, the claims in that document rest on an evidence base that is mostly invisible to the person reading the final deliverable. The client sees the synthesis. The partner review sees a layer below that. But which claims are supported by what evidence, at what level of confidence, whether contradictions between sources were acknowledged or suppressed -- these typically exist only in working documents that do not travel with the deliverable.

AI-assisted research compounds this problem. The retrieval and synthesis work happens inside the tool, and the output is a confident narrative rather than a structured record of what was found, what was contested, and what was not answerable. Even when citations are present and real, a provenance gap opens between a citation existing and a citation actually supporting the claim it is attached to.

This is the provenance soundness problem that research published earlier this year has started to name precisely. A June 2026 benchmark study of frontier AI agents on biomedical research questions -- a domain where source matching is particularly demanding -- found that agents resolved over 99% of cited URLs correctly, yet approximately 15.9% of those citations linked to papers that did not support the claim being made. The domain is specific and the rate may differ in commercial consulting contexts, but the failure mode is architectural rather than domain-dependent: retrieval finds relevant documents, synthesis generates claims, citation attaches document references to claims, and no step checks whether the specific passage in the cited document actually entails the specific claim.

The gap that matters: a citation existing confirms a source is real. It says nothing about whether the specific passage in that source supports the specific claim attached to it. In consulting deliverables, this distinction is rarely visible to the partner review.
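The difference between a citation that exists and a citation that supports its claim can be made concrete by recording the two checks as separate fields. The sketch below is illustrative only: the `CitationCheck` record, its field names, and the toy data are assumptions made for this example, not part of any study or tool mentioned here. It also assumes the `passage_supports_claim` judgment is supplied by a human reviewer or a separate verification step.

```python
from dataclasses import dataclass


@dataclass
class CitationCheck:
    """One citation attached to one claim (hypothetical record shape)."""
    claim: str
    source_url: str
    url_resolves: bool            # existence check: is the source real and reachable?
    passage_supports_claim: bool  # support check: does the cited passage entail the claim?


def citation_rates(checks):
    """Return (share of citations that resolve, share of resolved citations
    that do not support their claim)."""
    resolved = [c for c in checks if c.url_resolves]
    if not resolved:
        raise ValueError("need at least one citation that resolves")
    existence_rate = len(resolved) / len(checks)
    unsupported = [c for c in resolved if not c.passage_supports_claim]
    return existence_rate, len(unsupported) / len(resolved)


# Toy data, invented for illustration only.
checks = [
    CitationCheck("Obligation X applies above threshold T", "https://example.org/a", True, True),
    CitationCheck("Regulator Y issued guidance on Z", "https://example.org/b", True, False),
    CitationCheck("The market grew in the last period", "https://example.org/c", True, True),
    CitationCheck("Rule Q has no exemptions", "https://example.org/d", True, True),
]

existence_rate, unsupported_rate = citation_rates(checks)
print(f"citations that resolve: {existence_rate:.0%}")
print(f"resolved citations that do not support their claim: {unsupported_rate:.0%}")
```

In this toy set every citation passes the existence check, yet one in four fails the support check. A review that only asks whether sources exist would report no problem.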

The capability boundary and the review process

The BCG study's finding about the 19-percentage-point penalty deserves to be read alongside a separate finding about how consultants interact with AI on tasks where it struggles.

The study found that consultants could not reliably identify in advance which tasks fell outside the AI's capability boundary. This is not primarily a story about being misled by tone or presentation -- the BCG experiment did not measure whether output style influenced trust. It is a story about the absence of a reliable signal. The work product produced on outside-frontier tasks was substantively indistinguishable, at the point of review, from work product produced on inside-frontier tasks. Consultants who relied most heavily on AI for the outside-frontier task were the most likely to produce wrong answers -- not because they were naive about AI limitations, but because there was nothing in the output that told them this particular task was one where those limitations applied.

This is the failure mode that sits beneath the partner review's visibility. Not because partners are insufficiently sceptical, but because the signal they would need -- some indicator that this specific piece of analysis crossed the capability boundary -- is not present in what the engagement team submits.

The conventional response to AI quality risk in consulting has been to tell practitioners to check the outputs. The BCG study suggests this framing misunderstands the problem. The checking requires knowing which outputs need to be checked, and the AI provides no reliable signal of that. A checking process applied uniformly to all AI output is expensive and defeats most of the speed benefit. A checking process applied selectively -- to the sections that seem least reliable -- will systematically miss the sections that look sound but are wrong, which are the ones that most need scrutiny.

What the review process actually needs

The alternative is not less AI use. It is a different architecture for what the AI-assisted workflow produces before the synthesis is written.

Consider a regulatory impact assessment covering three jurisdictions. The engagement team uses AI to research the applicable obligations in each. The AI produces a synthesis. What travels to the partner review is that synthesis -- a structured narrative with citations attached. What does not travel is anything that would let the review process do three specific things.

The first is claim-level validation: for each material assertion in the document, which specific passage in which specific source supports it, and does that passage actually entail the claim or merely discuss the same topic. This is different from citation existence checking, which confirms that a source exists but says nothing about whether it supports the claim attached to it.

The second is contradiction surfacing: where two or more sources in the evidence base disagree on a material point, was that disagreement acknowledged in the synthesis or resolved invisibly. An AI synthesis that encounters two sources with conflicting positions on the scope of a regulatory obligation will typically produce a confident unified statement. The contradiction disappears. The partner review sees the unified statement, not the disagreement underneath it.

The third is gap identification: where the evidence was insufficient to support a claim at a defensible standard, was that gap named or papered over. A synthesis that fills every evidentiary gap with confident prose is not more rigorous than one that names the gaps -- it is less honest and more dangerous, because it presents an absence of evidence as if it were evidence.
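One way to picture the record that would let a reviewer perform these three functions is a claim-level evidence ledger. The following is a minimal sketch under stated assumptions: the `Claim`, `Evidence`, and `Verdict` types, the flag names, and the sample claims are all hypothetical, and each verdict is assumed to be recorded by a human reviewer or a separate verification step rather than by the model that drafted the synthesis.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ENTAILS = "entails"          # the passage supports the claim as stated
    DISCUSSES = "discusses"      # same topic, but does not support the claim
    CONTRADICTS = "contradicts"  # the passage disagrees with the claim


@dataclass
class Evidence:
    source: str
    passage: str
    verdict: Verdict  # assumed to come from a reviewer, not the drafting model


@dataclass
class Claim:
    text: str
    evidence: list = field(default_factory=list)
    contradiction_acknowledged: bool = False  # does the deliverable name the disagreement?


def review_flags(claim):
    """Return the review flags raised by one claim's evidence record."""
    verdicts = [e.verdict for e in claim.evidence]
    flags = []
    if Verdict.ENTAILS not in verdicts:
        flags.append("gap")  # gap identification: nothing on file supports the claim
    if Verdict.DISCUSSES in verdicts:
        flags.append("weak_citation")  # claim validation: cited, but not entailed
    if Verdict.CONTRADICTS in verdicts and not claim.contradiction_acknowledged:
        flags.append("hidden_contradiction")  # contradiction surfacing
    return flags


# Hypothetical ledger for a three-jurisdiction assessment.
ledger = [
    Claim("Obligation A applies in jurisdiction 1",
          [Evidence("Source 1", "(passage text)", Verdict.ENTAILS)]),
    Claim("Obligation A applies in jurisdiction 2",
          [Evidence("Source 2", "(passage text)", Verdict.ENTAILS),
           Evidence("Source 3", "(passage text)", Verdict.CONTRADICTS)]),
    Claim("Obligation A applies in jurisdiction 3",
          [Evidence("Source 4", "(passage text)", Verdict.DISCUSSES)]),
]

for claim in ledger:
    print(claim.text, "->", review_flags(claim) or "no flags")
```

The design choice worth noting is that the ledger stores the verdicts and the flags are derived from them, so the reviewer sees which claims need attention without rereading the synthesis. How the verdicts are produced is left open here.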

One clarification is worth making explicit: this structured review layer cannot be built by the same generative models used for synthesis. Asking the AI that produced a wrong synthesis on an outside-frontier task to accurately audit its own claim-evidence links is the same problem restated. The verification step requires a dedicated process -- whether a specialised non-generative tool, a structured human review at the claim level, or a hybrid -- that is architecturally separate from the generation step. The instinctive response to an AI quality problem is often to add more AI, which does not resolve the underlying issue when the problem is that the AI cannot signal its own failures.

These three functions -- claim validation, contradiction surfacing, gap identification -- are what the partner review would need to catch the failure mode the BCG study describes. Not an additional reading of the synthesis, but a different input to the review entirely. The consulting model has always relied on human judgment to perform these functions informally. The BCG study's finding suggests that AI-assisted work makes the informal version increasingly inadequate -- not because AI makes consultants less capable, but because the capability boundary is invisible and the output gives the reviewer nothing to work with except the synthesis itself.

The output that looks most confident is sometimes the work that most needs checking. The question for professional services firms is whether the review process can see that -- and at the moment, the answer is that it cannot, not without a structured record of the evidence underneath the work.
