This field is moving fast. The account below reflects what we know as of May 2026 and will become incomplete. We think that is a reason to be precise about what we claim and what we don't, not a reason to avoid the question.
The core difficulty in AI-assisted research is not that language models make things up. It is that they do not know when they are making things up, and they report high confidence regardless. A 2025 study in Memory and Cognition found that LLM confidence judgments are systematically miscalibrated. A separate analysis found that tool-using agents show higher calibration errors than standalone models, not lower, meaning that adding retrieval to a language model often makes the confidence problem worse rather than better.
This creates a specific problem for professional research. A research brief that sounds authoritative and a research brief that is authoritative look identical in the output. The gap is only visible if you can trace each claim back to its source, assess the quality of that source, and check whether the claim survived scrutiny. Most AI research tools produce neither the trace nor the assessment.
The medical evidence synthesis community has documented this clearly. A 2026 scoping review in the Journal of Medical Internet Research covering 222 AI evidence synthesis tools concluded that current evidence does not support generative AI use in evidence synthesis without human involvement. Cochrane launched a formal platform study in 2026 to evaluate AI tools against traditional methods precisely because no tool has yet demonstrated sufficient reliability for systematic reviews.
The gap in that literature is significant. All serious work on AI evidence quality is happening in health and medical research. Policy research, regulatory intelligence, investment due diligence, and general professional research have no equivalent framework and no equivalent scrutiny. That is the domain Epistamate works in.
Several systems address parts of the problem. None addresses the combination that matters for professional research. The following is not a dismissal of these systems. They are serious work and some are more rigorous than Epistamate on the components they focus on.
FActScore decomposes generated text into atomic claims and verifies each claim against a knowledge source. It is rigorous on the NLP side and has been influential in establishing claim-level evaluation as the correct unit of analysis. VeriScore extends this with better handling of complex claims.
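The claim-level scoring idea can be illustrated in a few lines. The sketch below assumes the text has already been decomposed into atomic claims and that some verifier can say whether each claim is supported by the knowledge source. The names `supported_fraction`, `toy_verifier`, and `KNOWN_FACTS` are hypothetical and are not part of the FActScore codebase; the set lookup stands in for a real retrieval-backed verifier.

```python
from typing import Callable, Iterable


def supported_fraction(claims: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic claims that the verifier marks as supported."""
    claims = list(claims)
    if not claims:
        raise ValueError("no claims to score")
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)


# ASSUMPTION: a toy knowledge source, represented as a set of accepted statements.
KNOWN_FACTS = {
    "graphrag builds a knowledge graph",
    "cochrane launched a platform study in 2026",
}


def toy_verifier(claim: str) -> bool:
    """Exact-match lookup; a real verifier would retrieve and compare evidence."""
    return claim.strip().lower() in KNOWN_FACTS


claims = [
    "GraphRAG builds a knowledge graph",
    "Cochrane launched a platform study in 2026",
    "GraphRAG scores source credibility",
]
score = supported_fraction(claims, toy_verifier)  # two of the three claims are supported
```

Scoring at the level of individual claims is what makes a partially wrong brief visible: one unsupported claim lowers the score without hiding the two that held up.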
GraphRAG is the most technically serious system in the adjacent space. It builds a knowledge graph from ingested documents and uses that graph for retrieval rather than raw vector search. This produces better multi-hop reasoning and some degree of cross-session persistence. The engineering behind it is substantial and it is actively developed.
Elicit is the most direct product competitor for the research workflow. It does structured extraction from academic papers, claim-level analysis, and some degree of evidence quality assessment. For researchers working primarily with peer-reviewed literature it is a serious tool.
DebateCV (2025) is the closest academic parallel to Epistamate's adversarial challenge stage. It uses multiple LLM agents in a debate structure to verify claims, with agents challenging each other's assessments. The paper demonstrates that debate-driven verification improves claim assessment quality over single-model approaches.
MACI introduces what it calls dual-dial control: an information quality gate that filters evidence by credibility, and a behaviour dial that adjusts adversarial intensity across debate rounds. It achieves better confidence calibration than fixed-stance debate systems and uses fewer tokens. The calibration improvement is the relevant finding: a system that modulates how hard it challenges a claim based on evidence quality produces better-calibrated outputs than one that challenges everything equally.
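As a rough illustration of the two dials, the sketch below pairs a credibility gate with a per-round intensity schedule. It is not MACI's implementation. The function names, the 0.5 credibility threshold, and the linear decay from 0.9 to 0.3 are all our assumptions, chosen only to show the shape of the idea.

```python
from typing import Dict, List


def quality_gate(evidence: List[Dict], min_credibility: float = 0.5) -> List[Dict]:
    """Information dial: keep only evidence at or above a credibility threshold."""
    return [item for item in evidence if item["credibility"] >= min_credibility]


def intensity_schedule(round_index: int, total_rounds: int,
                       start: float = 0.9, end: float = 0.3) -> float:
    """Behaviour dial: adversarial intensity for a given debate round.

    ASSUMPTION: intensity moves linearly from `start` to `end` over the rounds.
    """
    if total_rounds < 1 or not 0 <= round_index < total_rounds:
        raise ValueError("round_index must be in [0, total_rounds)")
    if total_rounds == 1:
        return start
    progress = round_index / (total_rounds - 1)
    return start + (end - start) * progress


evidence = [
    {"source": "statute", "credibility": 0.95},
    {"source": "trade blog", "credibility": 0.35},
]
admitted = quality_gate(evidence)
intensities = [intensity_schedule(r, 3) for r in range(3)]
```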
MemGPT and similar systems solve the session persistence problem. They maintain memory across conversations and can retrieve prior context. This is real and useful.
A useful confidence scoring system uses evidence confidence, not model confidence. The score should tell you: we found three statutory provisions, two rulings, and a circular that directly address your question, and they agree, not: the model is 87% sure about its word choices.
Auryth, December 2025. Independent convergent framing of the same distinction Epistamate's formula is built on.
Three pieces of independent work, from different research groups with no connection to Epistamate, have reached conclusions that validate the core architectural choices.
WebTrust (Tsinghua University / Chandigarh University, 2025) built an automated source credibility scoring system trained on 140,000 articles across 21 domains, using 35 reliability labels. It achieved a Mean Absolute Error of 0.09 on credibility prediction. The finding relevant to Epistamate: automated source credibility scoring at the tier level is tractable and accurate. Epistamate's source tier system is currently a structured heuristic. Anchoring it to a system like WebTrust would make the tier assignments more defensible empirically.
The Confidence Dichotomy paper (2025) found that tool-using agents exhibit systematically higher calibration errors than standalone LLMs. The relevant finding: adding retrieval to a language model does not fix confidence miscalibration; it often makes it worse. This is exactly the failure mode Epistamate's formula is designed to address. A formula-computed score based on source tiers, consensus counts, and adversarial outcomes is structurally different from retrieval-augmented model confidence.
MACI's calibration results show that information quality gating, combined with scheduled adversarial intensity, reduces calibration error (ECE 0.081 vs 0.103) compared to fixed-stance debate. This is the strongest academic evidence so far that the architectural pattern Epistamate uses for its adversarial challenge stage is the right one.
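For readers unfamiliar with the metric, expected calibration error (ECE) groups predictions into bins by stated confidence, then averages the gap between mean confidence and observed accuracy in each bin, weighted by bin size. The sketch below is a generic equal-width-bin version, not the evaluation code from the MACI paper. The function name and the ten-bin default are our assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: size-weighted mean of |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# An overconfident system: it states 0.9 confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, False, False])
```

Lower is better: a system whose stated confidence matches how often it is right scores close to zero.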
None of these groups was working with knowledge of Epistamate. They are working on adjacent problems in academic settings. The convergence is the point. When independent groups reach similar conclusions about what the architecture needs to look like, it suggests the architecture is pointed in the right direction.
Epistamate does not claim to have invented claim extraction, knowledge graphs, adversarial verification, confidence scoring, or session persistence. All of these exist, some of them in more rigorous forms than what Epistamate implements.
The claim is that no existing system, academic or commercial, combines all six of the following properties in a single practitioner-facing deployment for general professional research: typed claim extraction, formula-computed evidence-quality confidence, mandatory adversarial challenge, typed gap tracking as a first-class output, cross-session compounding of evidential state, and bidirectional operation. Removing any one of these degrades the system to something existing tools already do.
The second claim is that the confidence formula, which computes a score from source credibility tier, cross-provider consensus, adversarial challenge outcome, evidence recency, and sufficiency, is decoupled from LLM self-report. This is not a technical novelty in itself. It is a design choice that has practical consequences. A claim sourced from a single trade blog scores differently from a claim corroborated by three Tier 1 documents, regardless of how confident the model sounds about either one.
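To make the design choice concrete, here is a minimal sketch of a score computed from the five inputs named above. It is not Epistamate's actual formula. The tier weights, challenge factors, component weights, half-life, and the weighted-sum form are all hypothetical values chosen for illustration. What it does show is that the model's self-reported confidence never enters the calculation.

```python
from dataclasses import dataclass
from typing import Sequence

# ASSUMPTION: illustrative tier weights and challenge factors, not Epistamate's values.
TIER_WEIGHT = {1: 1.0, 2: 0.7, 3: 0.4}
CHALLENGE_FACTOR = {"survived": 1.0, "weakened": 0.6, "refuted": 0.0}


@dataclass(frozen=True)
class Evidence:
    tier: int       # source credibility tier (1 = strongest)
    provider: str   # which retrieval provider surfaced the source
    age_days: int   # age of the source at scoring time


def evidence_confidence(evidence: Sequence[Evidence],
                        challenge_outcome: str,
                        needed_sources: int = 3,
                        half_life_days: float = 365.0) -> float:
    """Score in [0, 1] computed only from properties of the evidence."""
    if not evidence:
        return 0.0
    n = len(evidence)
    credibility = sum(TIER_WEIGHT[e.tier] for e in evidence) / n
    consensus = min(1.0, len({e.provider for e in evidence}) / needed_sources)
    recency = sum(0.5 ** (e.age_days / half_life_days) for e in evidence) / n
    sufficiency = min(1.0, n / needed_sources)
    # ASSUMPTION: hypothetical component weights that sum to 1.
    base = 0.4 * credibility + 0.25 * consensus + 0.15 * recency + 0.2 * sufficiency
    return base * CHALLENGE_FACTOR[challenge_outcome]


single_blog = [Evidence(tier=3, provider="web", age_days=30)]
three_tier1 = [
    Evidence(tier=1, provider="web", age_days=30),
    Evidence(tier=1, provider="legal_db", age_days=30),
    Evidence(tier=1, provider="gazette", age_days=30),
]
weak = evidence_confidence(single_blog, "survived")
strong = evidence_confidence(three_tier1, "survived")
```

Under these assumed weights the single trade blog scores well below the three corroborating Tier 1 documents, and a refuted claim scores zero regardless of how many sources support it.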
The third claim is that the decision log, which records the full evidence state at the moment a decision is logged, satisfies the epistemic accountability requirement of the EU AI Act Article 12 by construction rather than by bolt-on. This is an architectural claim, not a legal one. Qualified legal counsel must assess applicability to specific deployments.
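One common way to make such a log tamper-evident is to chain entries by hash, so that editing any past entry invalidates every hash after it. The sketch below illustrates that general technique only. The class name `DecisionLog`, its fields, and the example payload are hypothetical, and nothing here is taken from Epistamate's implementation or amounts to a compliance assessment.

```python
import hashlib
import json
from typing import Dict, List


def _digest(payload: Dict, prev_hash: str) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class DecisionLog:
    """Append-only log in which each entry commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries: List[Dict] = []

    def record(self, decision: str, evidence_state: Dict) -> str:
        prev = self._entries[-1]["hash"] if self._entries else self.GENESIS
        # The JSON round trip copies the state, so later changes to the
        # caller's dict cannot alter what was logged.
        payload = json.loads(json.dumps(
            {"decision": decision, "evidence_state": evidence_state}))
        entry = {"payload": payload, "prev_hash": prev, "hash": _digest(payload, prev)}
        self._entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = self.GENESIS
        for entry in self._entries:
            if entry["prev_hash"] != prev or entry["hash"] != _digest(entry["payload"], prev):
                return False
            prev = entry["hash"]
        return True


log = DecisionLog()
log.record("proceed", {"claim_id": "c1", "confidence": 0.82, "open_gaps": ["g3"]})
log.record("defer", {"claim_id": "c2", "confidence": 0.41, "open_gaps": []})
intact = log.verify()
```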
These claims are published and citable. The architecture paper is available at Zenodo (10.5281/zenodo.19204972). The RegWatch domain configuration paper, which includes a direct live comparison against a frontier LLM and a structured prior art analysis, is at Zenodo (10.5281/zenodo.19301680). The system is defensively disclosed at IP.com (IPCOM/000277741).
| Property | FActScore | GraphRAG | Elicit | DebateCV | MemGPT | Epistamate |
|---|---|---|---|---|---|---|
| Typed claim extraction | Yes, atomic | Partial | Partial | Yes | No | Yes |
| Formula-computed confidence | Retrieval-based | No | No | No | No | Yes, decoupled from model |
| Source credibility tier | No | No | No | No | No | Yes |
| Adversarial challenge stage | No | No | No | Yes, core | No | Yes, mandatory |
| Typed gap tracking | No | No | No | No | No | Yes, first-class output |
| Cross-session compounding | No | Partial | No | No | Partial | Yes |
| Bidirectional operation | No | No | No | No | No | Yes |
| Decision log / audit trail | No | No | No | No | No | Yes, immutable |
| Practitioner deployment | No | Developer | Yes | No | Developer | Yes, desktop |
| Non-academic source types | Limited | Yes | No | Limited | Yes | Yes |
GraphRAG is the system most likely to close the gap. Microsoft is actively developing it and has the engineering capacity to add adversarial challenge, source tier scoring, and typed gap tracking. The question is not whether they will eventually build these features. It is whether they will build them for the practitioner research workflow specifically, or whether they will continue to develop GraphRAG as a developer infrastructure layer that organisations have to configure themselves.
The agentic AI space is moving toward what people are calling verification layers and epistemic harnesses: supervision systems that sit above research agents and check their outputs. This is the architectural role Epistamate occupies. As more organisations deploy AI research agents, the demand for a system that scores, challenges, and audits what those agents produce will grow. Whether Epistamate or a larger platform fills that role depends partly on whether Epistamate can establish domain presence and user relationships before the larger platforms catch up.
Not every component Epistamate uses is state of the art on its own. The source tier system is a structured heuristic where a WebTrust-class automated scoring system would be more rigorous. The adversarial challenge stage is a single-pass mechanism where MACI's dual-dial approach would produce better-calibrated outputs. The knowledge graph is SQLite-backed where a production graph database would scale better. These are known gaps. The relevant comparison is not against the academic frontier on each component. It is against what exists in practitioner-facing deployment for general professional research, where the gap is wider.
The architecture paper documents what the system is and what it claims. The RegWatch paper includes a structured self-critique and a proposed evaluation methodology for systems of this class. If you are assessing Epistamate seriously, both papers are the right starting point.
This page was written in May 2026. Several of the papers cited here appeared in the last six months. The DebateCV framework, the MACI calibration results, the WebTrust source credibility system, and the Confidence Dichotomy findings all post-date the initial Epistamate architecture paper. The field is moving fast enough that an honest account of the landscape has a short shelf life. We will update this page as the picture changes. If you are aware of relevant work we have missed, we would like to hear about it.