This field is moving fast. The account below reflects what we know as of June 2026 and will become incomplete. We think that is a reason to be precise about what we claim and what we don't, not a reason to avoid the question.
The core difficulty in AI-assisted research is not that language models make things up. It is that they cannot tell when they are making things up, and they report high confidence either way. A 2025 study in Memory & Cognition found that LLM confidence judgments are systematically miscalibrated. A separate analysis found that tool-using agents show higher calibration errors than standalone models, not lower, which means that adding retrieval to a language model often makes the confidence problem worse rather than better.
This creates a specific problem for professional research. A research brief that sounds authoritative and a research brief that is authoritative look identical in the output. The gap is only visible if you can trace each claim back to its source, assess the quality of that source, and check whether the claim survived scrutiny. Most AI research tools produce neither the trace nor the assessment.
The medical evidence synthesis community has documented this clearly. A 2026 scoping review in the Journal of Medical Internet Research covering 222 AI evidence synthesis tools concluded that current evidence does not support generative AI use in evidence synthesis without human involvement. Cochrane launched a formal platform study in 2026 to evaluate AI tools against traditional methods precisely because no tool has yet demonstrated sufficient reliability for systematic reviews.
The gap in that literature is significant. Most of the serious evaluation work on AI evidence quality is happening in health and medical research. Policy research, regulatory intelligence, investment due diligence, and general professional research have no equivalent framework and no equivalent scrutiny. That is the domain Epistamate works in.
Several systems address parts of the problem. None addresses the combination that matters for professional research. The following is not a dismissal of these systems. They are serious work and some are more rigorous than Epistamate on the components they focus on.
FActScore decomposes generated text into atomic claims and verifies each claim against a knowledge source. It is rigorous on the NLP side and has been influential in establishing claim-level evaluation as the correct unit of analysis. VeriScore extends this with better handling of complex claims.
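As a rough illustration of what claim-level scoring means, the sketch below computes the fraction of atomic claims judged supported. The `AtomicClaim` type and the pre-assigned `supported` flags are assumptions made for this example; in FActScore the decomposition and the verification against a knowledge source are done by models, and neither step is reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicClaim:
    text: str
    supported: bool  # hypothetical: verdict handed back by some verifier


def supported_fraction(claims: list[AtomicClaim]) -> float:
    """Share of atomic claims that the verifier marked as supported."""
    if not claims:
        raise ValueError("cannot score an empty list of claims")
    return sum(1 for c in claims if c.supported) / len(claims)


# One generated sentence, already split into three atomic claims (illustrative data).
claims = [
    AtomicClaim("Claim A", supported=True),
    AtomicClaim("Claim B", supported=True),
    AtomicClaim("Claim C", supported=False),
]
score = supported_fraction(claims)
```

Scoring at the level of atomic claims means one unsupported detail lowers the score without hiding the parts of the text that did check out.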
GraphRAG is the most technically serious system in the adjacent space. It builds a knowledge graph from ingested documents and uses that graph for retrieval rather than raw vector search. This produces better multi-hop reasoning and some degree of cross-session persistence. The engineering behind it is substantial and it is actively developed. As of Build 2026, GraphRAG and LazyGraphRAG are the knowledge layer inside Microsoft Discovery, an agentic R&D platform that reached general availability in June 2026. Discovery adds hypothesis generation, experimentation workflows, and reproducible scientific review on top of the GraphRAG retrieval layer.
Elicit is the most direct product competitor for the research workflow. It does structured extraction from academic papers, claim-level analysis, and some degree of evidence quality assessment. For researchers working primarily with peer-reviewed literature it is a serious tool. The 2026 updates added multimodal capabilities and a Research Agent feature (Pro and higher plans) that extends search to clinical trial data, regulatory documents, and press releases.
Scite's Smart Citations classify each citing statement as supporting, contrasting, or mentioning the cited work, with the exact text snippet from the citing paper included. The Reference Check feature lets users upload manuscripts to see if cited papers have been contradicted by subsequent research. The AI Assistant can answer research questions while explicitly noting where findings conflict. The platform processes over 1.2 billion citation statements from more than 180 million articles.
Undermind runs autonomously for minutes to tens of minutes per query, performing iterative searching and reading rather than single-pass retrieval. It produces a research report with paper-level citations addressing questions from multiple angles and noting contradictions and methodological variation. GSK has deployed Undermind as the research-evidence layer of its AI Scientist stack, used by both AI agents and human researchers across target discovery and clinical development workflows.
DebateCV (published at ACM Web Conference 2026, WWW '26) is the closest academic parallel to Epistamate's adversarial challenge stage. It uses multiple LLM agents in a debate structure to verify claims, with agents challenging each other's assessments. The peer-reviewed paper demonstrates that debate-driven verification improves claim assessment quality over single-model approaches.
MACI introduces what it calls dual-dial control: an information quality gate that filters evidence by credibility, and a behaviour dial that adjusts adversarial intensity across debate rounds. It achieves better confidence calibration than fixed-stance debate systems and uses fewer tokens. The calibration improvement is the relevant finding: a system that modulates how hard it challenges a claim based on evidence quality produces better-calibrated outputs than one that challenges everything equally.
MemGPT and similar systems solve the session persistence problem. They maintain memory across conversations and can retrieve prior context. This is real and useful.
A useful confidence scoring system uses evidence confidence, not model confidence. The score should tell you: we found three statutory provisions, two rulings, and a circular that directly address your question, and they agree, not: the model is 87% sure about its word choices.
Auryth, December 2025. Independent convergent framing of the same distinction Epistamate's formula is built on.
Three pieces of independent work, from different research groups with no connection to Epistamate, have reached conclusions that validate the core architectural choices.
WebTrust (Tsinghua University / Chandigarh University, 2025) built an automated source credibility scoring system trained on 140,000 articles across 21 domains, using 35 reliability labels. It achieved a Mean Absolute Error of 0.09 on credibility prediction. The finding relevant to Epistamate: automated source credibility scoring at the tier level is tractable and accurate. Epistamate's source tier system is currently a structured heuristic. Anchoring it to a system like WebTrust would make the tier assignments more defensible empirically.
The Confidence Dichotomy paper (2025) found that tool-using agents exhibit systematically higher calibration errors than standalone LLMs. The relevant finding: adding retrieval to a language model does not fix confidence miscalibration; it often makes it worse. This is exactly the failure mode Epistamate's formula is designed to address. A formula-computed score based on source tiers, consensus counts, and adversarial outcomes is structurally different from retrieval-augmented model confidence.
MACI's calibration results show that information quality gating, combined with scheduled adversarial intensity, reduces calibration error (ECE 0.081 vs 0.103) compared to fixed-stance debate. This is the strongest academic evidence so far that the architectural pattern Epistamate uses for its adversarial challenge stage is the right one.
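For readers unfamiliar with the metric, Expected Calibration Error is usually computed by binning predictions by stated confidence and taking the weighted average gap between confidence and accuracy in each bin. The sketch below follows that standard binned definition; the bin count and the half-open bin boundaries are assumptions, and the MACI paper's exact evaluation setup is not reproduced here.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")

    # Assumption: equal-width bins [k/n, (k+1)/n), with 1.0 placed in the last bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Four answers, each stated at 90% confidence, of which three were right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means stated confidence tracks observed accuracy more closely, which is the sense in which 0.081 improves on 0.103.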
None of these papers was written with knowledge of Epistamate. They address adjacent problems in academic settings. The convergence is the point. When independent groups reach similar conclusions about what the architecture needs to look like, it suggests the architecture is pointed in the right direction.
Independent research groups working on automated fact verification and claim verification have converged on a specific diagnosis of where the field has struggled. The bottleneck is not retrieval. It is the step between finding a source and determining what that source actually says about a specific claim.
Several distinct problems cluster at this interface. First, academic and policy sources use different terminology from the claims being researched. A system that matches lexically will miss relevant evidence that paraphrases, qualifies, or uses domain-specific synonyms for the same concept. Second, sources that are topically relevant are not necessarily evidentially relevant. A paper about AI in healthcare is not automatically evidence for a claim about AI diagnostic accuracy in emergency settings. Third, a source that supports part of a compound claim is not the same as a source that supports the whole claim. Fourth, two sources that both cite the same seminal paper are not independent corroborators, even if they reach similar conclusions.
The research literature addressing these problems has developed a consistent set of architectural recommendations: decompose compound claims into atomic units before attempting to verify them; classify the relationship between a source span and a claim (supports, partially supports, qualifies, contradicts, background only, method context) rather than treating evidence as binary; seek contradicting evidence deliberately rather than only accumulating support; and verify that corroborating sources are genuinely independent rather than tracing back to the same origin.
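Two of these recommendations, typed evidence relations and independence checking, can be sketched as a small data structure. The relation labels come from the list above; the `Evidence` type, the `derived_from` field, and the rule that sources sharing any upstream origin count as one corroborator are assumptions for illustration, not a description of any particular system.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    SUPPORTS = "supports"
    PARTIALLY_SUPPORTS = "partially supports"
    QUALIFIES = "qualifies"
    CONTRADICTS = "contradicts"
    BACKGROUND_ONLY = "background only"
    METHOD_CONTEXT = "method context"


@dataclass(frozen=True)
class Evidence:
    source_id: str
    claim_id: str
    relation: Relation
    # Hypothetical field: upstream works this source's finding traces back to.
    derived_from: frozenset[str] = frozenset()


def independent_support_count(evidence: list[Evidence], claim_id: str) -> int:
    """Count supporting sources, merging those that share an upstream origin."""
    groups: list[set[str]] = []  # disjoint sets of source and origin ids
    for item in evidence:
        if item.claim_id != claim_id or item.relation is not Relation.SUPPORTS:
            continue
        merged = set(item.derived_from) | {item.source_id}
        remaining: list[set[str]] = []
        for group in groups:
            if group & merged:
                merged |= group  # shares an origin: same corroborator
            else:
                remaining.append(group)
        remaining.append(merged)
        groups = remaining
    return len(groups)


evidence = [
    Evidence("paper_a", "c1", Relation.SUPPORTS, frozenset({"seminal_study"})),
    Evidence("paper_b", "c1", Relation.SUPPORTS, frozenset({"seminal_study"})),
    Evidence("report_c", "c1", Relation.SUPPORTS),
    Evidence("paper_d", "c1", Relation.CONTRADICTS),
]
independent = independent_support_count(evidence, "c1")
```

Here three sources support the claim but only two are counted as independent, because `paper_a` and `paper_b` trace back to the same study. The contradicting source is kept as typed evidence rather than discarded.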
These are unsolved problems in the field in the sense that no production research tool has fully addressed them. They are not unsolved in the sense of being intractable. The architectural pattern required is understood. What is missing is its implementation in a practitioner-facing system that maintains the provenance discipline and fail-closed behaviour that professional use requires. This is the problem Epistamate is building toward.
Epistamate does not claim to have invented claim extraction, knowledge graphs, adversarial verification, confidence scoring, or session persistence. All of these exist, some of them in more rigorous forms than what Epistamate implements.
The claim is that no existing system, academic or commercial, combines all six of the following properties in a single practitioner-facing deployment for general professional research: typed claim extraction, formula-computed evidence-quality confidence, mandatory adversarial challenge, typed gap tracking as a first-class output, cross-session compounding of evidential state, and bidirectional operation. Removing any one of these degrades the system to something existing tools already do.
The second claim is that the confidence formula, which computes a score from source credibility tier, cross-provider consensus, adversarial challenge outcome, evidence recency, and sufficiency, is decoupled from LLM self-report. This is not a technical novelty in itself. It is a design choice that has practical consequences. A claim sourced from a single trade blog scores differently from a claim corroborated by three Tier 1 documents, regardless of how confident the model sounds about either one.
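To make the design choice concrete, the sketch below computes a score from the five inputs named above and takes no model self-report as input. It is not Epistamate's published formula: the weighted-average form, every weight, the linear five-year recency decay, and the `ClaimEvidence` type are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# All numbers below are hypothetical placeholders, not Epistamate's values.
TIER_WEIGHT = {1: 1.0, 2: 0.7, 3: 0.4}
COMPONENT_WEIGHT = {
    "credibility": 0.30,
    "consensus": 0.25,
    "challenge": 0.20,
    "recency": 0.10,
    "sufficiency": 0.15,
}
RECENCY_HORIZON_DAYS = 5 * 365  # assumed linear decay to zero over five years


@dataclass(frozen=True)
class ClaimEvidence:
    source_tiers: tuple[int, ...]  # one tier per supporting source
    providers_agreeing: int
    providers_queried: int
    survived_challenge: bool
    newest_source_age_days: int
    min_sources_required: int = 3


def evidence_confidence(ev: ClaimEvidence) -> float:
    """Score in [0, 1] computed from evidence properties only."""
    if not ev.source_tiers or ev.providers_queried <= 0:
        return 0.0  # fail closed: no evidence means no confidence

    n_sources = len(ev.source_tiers)
    components = {
        "credibility": sum(TIER_WEIGHT.get(t, 0.0) for t in ev.source_tiers) / n_sources,
        "consensus": min(1.0, ev.providers_agreeing / ev.providers_queried),
        "challenge": 1.0 if ev.survived_challenge else 0.0,
        "recency": max(0.0, 1.0 - ev.newest_source_age_days / RECENCY_HORIZON_DAYS),
        "sufficiency": min(1.0, n_sources / max(1, ev.min_sources_required)),
    }
    score = sum(COMPONENT_WEIGHT[name] * value for name, value in components.items())
    return max(0.0, min(1.0, score))


trade_blog = ClaimEvidence((3,), 1, 3, False, 30)
tier_one_docs = ClaimEvidence((1, 1, 1), 3, 3, True, 30)
```

Under these assumed weights, `evidence_confidence(trade_blog)` comes out well below `evidence_confidence(tier_one_docs)`, and nothing about how fluent or certain the generated text sounds can move either number.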
The third claim is that the decision log, which records the full evidence state at the moment a decision is logged, satisfies the epistemic accountability requirement of Article 12 of the EU AI Act by construction rather than as a bolt-on. This is an architectural claim, not a legal one. Qualified legal counsel must assess applicability to specific deployments.
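One common way to make such a log tamper-evident is to chain entries by hash, so that altering any recorded evidence state invalidates every later entry. The sketch below shows that general technique only. The `DecisionLog` and `LogEntry` names and the JSON snapshot format are assumptions, and nothing here describes Epistamate's actual storage or implies anything about regulatory compliance.

```python
import hashlib
import json
from dataclasses import dataclass


def _digest(decision: str, evidence_state: dict, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the entry contents."""
    payload = json.dumps(
        {"decision": decision, "evidence_state": evidence_state, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class LogEntry:
    decision: str
    evidence_state: dict
    prev_hash: str
    entry_hash: str


class DecisionLog:
    GENESIS = "0" * 64  # placeholder hash preceding the first entry

    def __init__(self) -> None:
        self._entries: list[LogEntry] = []

    def append(self, decision: str, evidence_state: dict) -> LogEntry:
        prev = self._entries[-1].entry_hash if self._entries else self.GENESIS
        # Snapshot the state so later changes by the caller do not leak in.
        # Assumes the evidence state is JSON-serialisable.
        snapshot = json.loads(json.dumps(evidence_state))
        entry = LogEntry(decision, snapshot, prev, _digest(decision, snapshot, prev))
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; False if any entry was altered or reordered."""
        prev = self.GENESIS
        for entry in self._entries:
            expected = _digest(entry.decision, entry.evidence_state, prev)
            if entry.prev_hash != prev or entry.entry_hash != expected:
                return False
            prev = entry.entry_hash
        return True


log = DecisionLog()
log.append("proceed", {"claim": "c1", "confidence": 0.82, "sources": ["doc_a", "doc_b"]})
log.append("defer", {"claim": "c2", "confidence": 0.41, "sources": ["blog_x"]})
intact = log.verify()
```

A hash chain detects tampering but does not prevent it, so a production design would also need write controls and an external anchor for the latest hash.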
These claims are published and citable. The architecture paper is available at Zenodo (10.5281/zenodo.19204972). The RegWatch domain configuration paper, which includes a direct live comparison against a frontier LLM and a structured prior art analysis, is at Zenodo (10.5281/zenodo.19301680). The system is defensively disclosed at IP.com (IPCOM/000277741).
| Property | FActScore | GraphRAG | Elicit | Scite | Undermind | DebateCV | MemGPT | Epistamate |
|---|---|---|---|---|---|---|---|---|
| Typed claim extraction | Yes, atomic | Partial | Partial | Partial | No | Yes | No | Yes |
| Formula-computed confidence | Retrieval-based | No | No | No | No | No | No | Yes, decoupled from model |
| Source credibility tier | No | No | No | No | No | No | No | Yes |
| Adversarial challenge stage | No | No | No | Partial | No | Yes, core | No | Yes, mandatory |
| Typed gap tracking | No | No | No | No | No | No | No | Yes, first-class output |
| Cross-session compounding | No | Partial | No | No | No | No | Partial | Yes |
| Bidirectional operation | No | No | No | No | No | No | No | Yes |
| Decision log / audit trail | No | No | No | No | No | No | No | Yes, immutable |
| Practitioner deployment | No | Developer | Yes | Yes | Yes, enterprise | No | Developer | Yes, desktop |
| Non-academic source types | Limited | Yes | Partial (Pro) | No | No | Limited | Yes | Yes |
GraphRAG is the system most likely to close the gap. The launch of Microsoft Discovery at Build 2026 partially answers the question of whether Microsoft would build GraphRAG into a practitioner-facing platform: they have, for scientific R&D. The question that remains is whether they will extend that focus to policy, regulatory, and professional services research specifically, or whether those domains continue to require organisations to configure their own workflows on top of the Azure infrastructure. Epistamate's domain advantage is narrowing on the technical side and depends increasingly on whether it can establish user relationships and domain-specific depth before the larger platforms extend their reach.
The agentic AI space is moving toward what people are calling verification layers and epistemic harnesses: supervision systems that sit above research agents and check their outputs. This is the architectural role Epistamate occupies. As more organisations deploy AI research agents, the demand for a system that scores, challenges, and audits what those agents produce will grow. Whether Epistamate or a larger platform fills that role is partly a question of timing: Epistamate needs domain presence and user relationships before the larger platforms catch up.
The components Epistamate uses are not all state of the art individually. The source tier system is a structured heuristic where a WebTrust-class automated scoring system would be more rigorous. The adversarial challenge stage is a single-pass mechanism where MACI's dual-dial approach would produce better-calibrated outputs. The knowledge graph is SQLite-backed where a production graph database would scale better. These are known gaps. The relevant comparison is not against the academic frontier on each component. It is against what exists in practitioner-facing deployment for general professional research, where the gap is wider.
The architecture paper documents what the system is and what it claims. The RegWatch paper includes a structured self-critique and a proposed evaluation methodology for systems of this class. If you are assessing Epistamate seriously, both papers are the right starting point.
This page was last updated in July 2026. Several of the papers cited here appeared in the last six months. The DebateCV framework (now a peer-reviewed paper published at WWW '26), the MACI calibration results, the WebTrust source credibility system, and the Confidence Dichotomy findings all post-date the initial Epistamate architecture paper. Microsoft Discovery reaching general availability at Build 2026 has updated the competitive picture for GraphRAG, with a desktop preview app now targeting researchers and academic labs directly. Scite and Undermind have been added to this page following a June 2026 landscape review. The field is moving fast enough that an honest account of the landscape has a short shelf life. We will update this page as the picture changes. If you are aware of relevant work we have missed, we would like to hear about it.
The researcher behind Epistamate also writes on AI governance, the EU AI Act, and enterprise AI strategy from a CIO perspective at abhishek-sinha-bgl.github.io, a separate series written for technology executives and boards.