Prior art and open problems

The research landscape

This field is moving fast. The account below reflects what we know as of May 2026 and will become incomplete. We think that is a reason to be precise about what we claim and what we don't, not a reason to avoid the question.

Why the problem is structural, not technical

The core difficulty in AI-assisted research is not that language models make things up. It is that they do not know when they are making things up, and they report high confidence regardless. A 2025 study in Memory & Cognition found that LLM confidence judgments are systematically miscalibrated. A separate analysis, the Confidence Dichotomy paper (2025), found that tool-using agents show higher calibration errors than standalone models, not lower. In other words, adding retrieval to a language model often makes the confidence problem worse rather than better.

This creates a specific problem for professional research. A research brief that sounds authoritative and a research brief that is authoritative look identical in the output. The gap is only visible if you can trace each claim back to its source, assess the quality of that source, and check whether the claim survived scrutiny. Most AI research tools produce neither the trace nor the assessment.

The medical evidence synthesis community has documented this clearly. A 2026 scoping review in the Journal of Medical Internet Research covering 222 AI evidence synthesis tools concluded that current evidence does not support generative AI use in evidence synthesis without human involvement. Cochrane launched a formal platform study in 2026 to evaluate AI tools against traditional methods precisely because no tool has yet demonstrated sufficient reliability for systematic reviews.

The gap in that literature is significant. Nearly all of the serious work on AI evidence quality is happening in health and medical research. Policy research, regulatory intelligence, investment due diligence, and general professional research have no equivalent framework and no equivalent scrutiny. That is the domain Epistamate works in.

Prior approaches worth understanding honestly

Several systems address parts of the problem. None addresses the combination that matters for professional research. The following is not a dismissal of these systems. They are serious work and some are more rigorous than Epistamate on the components they focus on.

FActScore / VeriScore
Academic · Factuality evaluation

FActScore decomposes generated text into atomic claims and verifies each claim against a knowledge source. It is rigorous on the NLP side and has been influential in establishing claim-level evaluation as the correct unit of analysis. VeriScore extends this with better handling of complex claims.
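
To make claim-level scoring concrete, the sketch below computes the share of atomic claims that a knowledge source supports. It is only an illustration of the idea: the sentence-splitting decomposition, the exact-match verifier, and every name in it (`split_into_claims`, `supported_fraction`, `KNOWLEDGE`, `toy_verifier`) are assumptions made for this example and do not come from the FActScore codebase, which relies on language models for both steps.

```python
from typing import Callable, Iterable


def split_into_claims(text: str) -> list[str]:
    """Toy decomposition: treat each sentence as one atomic claim."""
    return [part.strip() for part in text.split(".") if part.strip()]


def supported_fraction(
    claims: Iterable[str],
    is_supported: Callable[[str], bool],
) -> float:
    """Return the share of claims that the verifier marks as supported."""
    claim_list = list(claims)
    if not claim_list:
        return 0.0
    supported = sum(1 for claim in claim_list if is_supported(claim))
    return supported / len(claim_list)


# Hypothetical knowledge source: statements accepted as true for this example.
KNOWLEDGE = {
    "paris is the capital of france",
    "the seine flows through paris",
}


def toy_verifier(claim: str) -> bool:
    """Toy verification: case-insensitive exact match against KNOWLEDGE."""
    return claim.lower() in KNOWLEDGE


generated = "Paris is the capital of France. Paris has nine million residents."
claims = split_into_claims(generated)
score = supported_fraction(claims, toy_verifier)
print(claims, score)
```

The point of scoring per claim rather than per document is that a single unsupported statement lowers the score without hiding behind the supported ones.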

These are evaluation systems, not research tools. They measure factuality after the fact; they do not support a research workflow. They have no cross-session memory, no gap tracking, no adversarial challenge stage, and they are designed for academic benchmarks rather than practitioner use. The confidence measures are retrieval-based, not formula-computed from source credibility tiers.

GraphRAG (Microsoft Research)
Open source · Knowledge graph retrieval

GraphRAG is the most technically serious system in the adjacent space. It builds a knowledge graph from ingested documents and uses that graph for retrieval rather than raw vector search. This produces better multi-hop reasoning and some degree of cross-session persistence. The engineering behind it is substantial and it is actively developed.

GraphRAG does not have an adversarial challenge stage, typed gap tracking, a source credibility tier system, or a formula-computed confidence score that is decoupled from model self-report. It also does not produce a decision log. The gap between GraphRAG and Epistamate is real but it is narrowing, and Microsoft has significantly more engineering capacity. This is the competitive system to watch.

Elicit
Product · Academic literature synthesis

Elicit is the most direct product competitor for the research workflow. It does structured extraction from academic papers, claim-level analysis, and some degree of evidence quality assessment. For researchers working primarily with peer-reviewed literature it is a serious tool.

Elicit is locked to academic literature. It does not work well with grey literature, proprietary documents, regulatory sources, or the mix of source types that characterise professional research outside academia. It has no cross-session compounding, no adversarial challenge, and no typed gap tracking.

DebateCV
Academic · Multi-agent adversarial verification

DebateCV (2025) is the closest academic parallel to Epistamate's adversarial challenge stage. It uses multiple LLM agents in a debate structure to verify claims, with agents challenging each other's assessments. The paper demonstrates that debate-driven verification improves claim assessment quality over single-model approaches.

DebateCV is a claim verification framework, not a research workflow system. It addresses the adversarial challenge component in isolation. There is no knowledge graph, no cross-session memory, no gap tracking, and no practitioner deployment. The existence of this work is validation that mandatory adversarial challenge is the right architectural choice. It is not a competing product.

MACI (Multi-Agent Collaborative Intelligence)
Academic · Calibration-aware multi-agent debate

MACI introduces what it calls dual-dial control: an information quality gate that filters evidence by credibility, and a behaviour dial that adjusts adversarial intensity across debate rounds. It achieves better confidence calibration than fixed-stance debate systems and uses fewer tokens. The calibration improvement is the relevant finding: a system that modulates how hard it challenges a claim based on evidence quality produces better-calibrated outputs than one that challenges everything equally.
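
The two dials described above can be pictured with a small sketch. This is not MACI's algorithm: the credibility scale, the threshold, the linear schedule, the choice to lower intensity over time, and all names (`Evidence`, `quality_gate`, `intensity_schedule`) are assumptions made purely to illustrate the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    text: str
    credibility: float  # assumed scale: 0.0 (unreliable) to 1.0 (authoritative)


def quality_gate(evidence: list[Evidence], threshold: float) -> list[Evidence]:
    """First dial: admit only evidence at or above a credibility threshold."""
    return [item for item in evidence if item.credibility >= threshold]


def intensity_schedule(rounds: int, start: float = 0.9, end: float = 0.3) -> list[float]:
    """Second dial: move adversarial intensity linearly from start to end.

    The decreasing direction is an assumption of this sketch.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    if rounds == 1:
        return [start]
    step = (start - end) / (rounds - 1)
    return [round(start - step * i, 3) for i in range(rounds)]


evidence = [
    Evidence("Statutory provision text", credibility=0.95),
    Evidence("Anonymous forum post", credibility=0.20),
]
admitted = quality_gate(evidence, threshold=0.5)
schedule = intensity_schedule(rounds=4)
print([item.text for item in admitted], schedule)
```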

This is recent academic work without practitioner deployment. The dual-dial insight is relevant to how Epistamate's adversarial challenge stage could be developed: varying challenge intensity based on source tier rather than applying a uniform challenge to every claim.

MemGPT / long-context memory systems
Open source · Persistent memory

MemGPT and similar systems solve the session persistence problem. They maintain memory across conversations and can retrieve prior context. This is real and useful.

Memory without epistemic structure is not the same as a compounding knowledge graph. MemGPT remembers what was said. It does not track whether claims were verified, what confidence they were assigned, or what gaps remain. Storing a claim and storing a verified, scored, challenged claim with source provenance are different operations.
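
The difference between the two storage operations is easiest to see as data structures. The sketch below is hypothetical: the field names, the tier convention, and the definition of "verified" are assumptions for illustration, not Epistamate's schema or MemGPT's.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ChallengeOutcome(Enum):
    UNCHALLENGED = "unchallenged"
    SURVIVED = "survived"
    WEAKENED = "weakened"
    REFUTED = "refuted"


@dataclass(frozen=True)
class Source:
    title: str
    tier: int  # assumed convention: 1 is the most credible tier


@dataclass
class RememberedText:
    """What a plain memory layer keeps: the words and nothing else."""
    text: str


@dataclass
class EvidencedClaim:
    """The same statement with its epistemic state attached."""
    text: str
    sources: list[Source] = field(default_factory=list)
    challenge: ChallengeOutcome = ChallengeOutcome.UNCHALLENGED
    confidence: Optional[float] = None  # None until a score is computed
    open_gaps: list[str] = field(default_factory=list)

    def is_verified(self) -> bool:
        """In this sketch, verified means sourced, challenged, and not refuted."""
        return bool(self.sources) and self.challenge in (
            ChallengeOutcome.SURVIVED,
            ChallengeOutcome.WEAKENED,
        )


remembered = RememberedText("The filing deadline moved to 30 June.")
evidenced = EvidencedClaim(
    text="The filing deadline moved to 30 June.",
    sources=[Source("Regulator circular", tier=1)],
    challenge=ChallengeOutcome.SURVIVED,
    confidence=0.9,
    open_gaps=["No confirmation of transitional arrangements"],
)
print(evidenced.is_verified())
```

A later session can ask the second record what supports it and what is still missing. The first record can only be repeated.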

"A useful confidence scoring system uses evidence confidence, not model confidence. The score should tell you: we found three statutory provisions, two rulings, and a circular that directly address your question, and they agree, not: the model is 87% sure about its word choices."

Auryth, December 2025. Independent convergent framing of the same distinction Epistamate's formula is built on.

Independent work pointing in the same direction

Three pieces of independent work, from research groups with no connection to Epistamate, have reached conclusions that support its core architectural choices.

WebTrust (Tsinghua University / Chandigarh University, 2025) built an automated source credibility scoring system trained on 140,000 articles across 21 domains, using 35 reliability labels. It achieved a Mean Absolute Error of 0.09 on credibility prediction. The finding relevant to Epistamate: automated source credibility scoring at the tier level is tractable and accurate. Epistamate's source tier system is currently a structured heuristic. Anchoring it to a system like WebTrust would make the tier assignments more defensible empirically.

The Confidence Dichotomy paper (2025) found that tool-using agents exhibit systematically higher calibration errors than standalone LLMs. The relevant finding: adding retrieval to a language model does not fix confidence miscalibration; it often makes it worse. This is exactly the failure mode Epistamate's formula is designed to address. A formula-computed score based on source tiers, consensus counts, and adversarial outcomes is structurally different from retrieval-augmented model confidence.

MACI's calibration results show that information quality gating, combined with scheduled adversarial intensity, reduces expected calibration error (ECE) to 0.081, compared with 0.103 for fixed-stance debate. This is the strongest academic evidence so far that the architectural pattern Epistamate uses for its adversarial challenge stage is the right one.
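
For readers unfamiliar with the metric, ECE (expected calibration error) groups predictions into confidence bins and averages the gap between stated confidence and observed accuracy, weighted by how many predictions fall in each bin. The sketch below is a generic equal-width-bin version with toy inputs. It is not the evaluation code behind the MACI figures, and the number of bins and the bin boundaries are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to a bin; a confidence of 1.0 goes in the top bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Toy data: four answers stated at 90% confidence, only half of them right.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
print(overconfident)
```

A lower value means stated confidence tracks actual accuracy more closely, which is why a drop in ECE is read as better calibration.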

None of these papers refers to Epistamate. They are working on adjacent problems in academic settings. The convergence is the point. When independent groups reach similar conclusions about what the architecture needs to look like, it suggests the architecture is pointed in the right direction.

The claim is the co-presence, not any component

Epistamate does not claim to have invented claim extraction, knowledge graphs, adversarial verification, confidence scoring, or session persistence. All of these exist, some of them in more rigorous forms than what Epistamate implements.

The claim is that no existing system, academic or commercial, combines all six of the following properties in a single practitioner-facing deployment for general professional research: typed claim extraction, formula-computed evidence-quality confidence, mandatory adversarial challenge, typed gap tracking as a first-class output, cross-session compounding of evidential state, and bidirectional operation. Removing any one of these reduces the system to something existing tools already do.

The second claim is that the confidence formula, which computes a score from source credibility tier, cross-provider consensus, adversarial challenge outcome, evidence recency, and sufficiency, is decoupled from LLM self-report. This is not a technical novelty in itself. It is a design choice that has practical consequences. A claim sourced from a single trade blog scores differently from a claim corroborated by three Tier 1 documents, regardless of how confident the model sounds about either one.
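
A minimal sketch of what a formula-computed score of this kind could look like follows. The actual Epistamate formula is not given on this page, so everything here is an assumption for illustration: the tier-to-credibility mapping, the weights, the recency cutoff, the sufficiency target, and the names `EvidenceState` and `evidence_confidence`. The only property the sketch is meant to show is that the score depends on the evidence state and never on model self-report.

```python
from dataclasses import dataclass

# Hypothetical tier-to-credibility mapping (Tier 1 is the most credible).
TIER_CREDIBILITY = {1: 1.0, 2: 0.7, 3: 0.4}

# Hypothetical multipliers for the adversarial challenge outcome.
CHALLENGE_FACTOR = {"survived": 1.0, "weakened": 0.6, "refuted": 0.0}


@dataclass(frozen=True)
class EvidenceState:
    source_tiers: tuple[int, ...]  # one entry per corroborating source
    providers_agreeing: int
    providers_queried: int
    challenge_outcome: str  # a key of CHALLENGE_FACTOR
    newest_source_age_days: int
    sources_needed: int = 3  # assumed sufficiency target


def evidence_confidence(state: EvidenceState) -> float:
    """Score a claim from its evidence state; no model self-report is used."""
    if not state.source_tiers or state.providers_queried == 0:
        return 0.0
    credibility = max(TIER_CREDIBILITY[tier] for tier in state.source_tiers)
    consensus = state.providers_agreeing / state.providers_queried
    sufficiency = min(len(state.source_tiers) / state.sources_needed, 1.0)
    recency = 1.0 if state.newest_source_age_days <= 365 else 0.8
    base = 0.4 * credibility + 0.3 * consensus + 0.3 * sufficiency
    return round(base * recency * CHALLENGE_FACTOR[state.challenge_outcome], 3)


blog_claim = EvidenceState(
    source_tiers=(3,),
    providers_agreeing=1,
    providers_queried=3,
    challenge_outcome="weakened",
    newest_source_age_days=30,
)
corroborated_claim = EvidenceState(
    source_tiers=(1, 1, 1),
    providers_agreeing=3,
    providers_queried=3,
    challenge_outcome="survived",
    newest_source_age_days=30,
)
print(evidence_confidence(blog_claim), evidence_confidence(corroborated_claim))
```

Because the function takes no input from the model's own wording, two claims phrased with equal assurance can still receive very different scores.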

The third claim is that the decision log, which records the full evidence state at the moment a decision is logged, satisfies the epistemic accountability requirement of Article 12 of the EU AI Act (record-keeping) by construction rather than as a bolt-on. This is an architectural claim, not a legal one. Qualified legal counsel must assess applicability to specific deployments.

These claims are published and citable. The architecture paper is available at Zenodo (10.5281/zenodo.19204972). The RegWatch domain configuration paper, which includes a direct live comparison against a frontier LLM and a structured prior art analysis, is at Zenodo (10.5281/zenodo.19301680). The system is defensively disclosed at IP.com (IPCOM/000277741).

Property | FActScore | GraphRAG | Elicit | DebateCV | MemGPT | Epistamate
--- | --- | --- | --- | --- | --- | ---
Typed claim extraction | Yes, atomic | Partial | Partial | Yes | No | Yes
Formula-computed confidence | Retrieval-based | No | No | No | No | Yes, decoupled from model
Source credibility tier | No | No | No | No | No | Yes
Adversarial challenge stage | No | No | No | Yes, core | No | Yes, mandatory
Typed gap tracking | No | No | No | No | No | Yes, first-class output
Cross-session compounding | No | Partial | No | No | Partial | Yes
Bidirectional operation | No | No | No | No | No | Yes
Decision log / audit trail | No | No | No | No | No | Yes, immutable
Practitioner deployment | No | Developer | Yes | No | Developer | Yes, desktop
Non-academic source types | Limited | Yes | No | Limited | Yes | Yes

What could close the gap and when

GraphRAG is the system most likely to close the gap. Microsoft is actively developing it and has the engineering capacity to add adversarial challenge, source tier scoring, and typed gap tracking. The question is not whether they will eventually build these features. It is whether they will build them for the practitioner research workflow specifically, or whether they will continue to develop GraphRAG as a developer infrastructure layer that organisations have to configure themselves.

The agentic AI space is moving toward what people are calling verification layers and epistemic harnesses: supervision systems that sit above research agents and check their outputs. This is the architectural role Epistamate occupies. As more organisations deploy AI research agents, the demand for a system that scores, challenges, and audits what those agents produce will grow. Whether Epistamate or a larger platform fills that role depends partly on whether Epistamate can establish domain presence and user relationships before the larger platforms catch up.

The components Epistamate uses are not all state of the art individually. The source tier system is a structured heuristic where a WebTrust-class automated scoring system would be more rigorous. The adversarial challenge stage is a single-pass mechanism where MACI's dual-dial approach would produce better-calibrated outputs. The knowledge graph is SQLite-backed where a production graph database would scale better. These are known gaps. The relevant comparison, however, is not against the academic frontier on each component. It is against what exists in practitioner-facing deployment for general professional research, where the gap is wider.

The architecture paper documents what the system is and what it claims. The RegWatch paper includes a structured self-critique and a proposed evaluation methodology for systems of this class. If you are assessing Epistamate seriously, both papers are the right starting point.

A note on pace

This page was written in May 2026. Several of the papers cited here appeared in the last six months. The DebateCV framework, the MACI calibration results, the WebTrust source credibility system, and the Confidence Dichotomy findings all post-date the initial Epistamate architecture paper. The field is moving fast enough that an honest account of the landscape has a short shelf life. We will update this page as the picture changes. If you are aware of relevant work we have missed, we would like to hear about it.

References and further reading