Research sessions end, files are archived, and the next person starts from zero. A claim that should have been contested gets into the final report because nobody tracked the conflicting source. Six months later, the same question is being answered again.
This is not a technology problem. It is a structural problem with how evidence-based work is done, in consulting, in policy, in investigations, in accountability work. The format changes; the problem doesn't.
Epistamate is one implementation of a more general engine. This page describes the engine.
Claims are structured objects (text, status, computed confidence, credibility tier, citations, evidence type), not free-text summaries. Each claim is individually addressable: it can be verified, contested, weakened, or carried forward independently.
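As a rough illustration of what "structured and individually addressable" could mean in practice, here is a minimal sketch. The field names follow the list above; the `Claim` class, the `Status` values (in particular `UNVERIFIED`), the defaults, and the `contest` helper are all hypothetical and are not taken from the engine itself.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status vocabulary; UNVERIFIED is an assumed starting state.
    UNVERIFIED = "unverified"
    VERIFIED = "verified"
    CONTESTED = "contested"
    WEAKENED = "weakened"


@dataclass
class Claim:
    claim_id: str                 # stable address for this one claim
    text: str
    status: Status = Status.UNVERIFIED
    confidence: float = 5.0       # computed elsewhere, never self-reported
    credibility_tier: int = 3     # assumed convention: 1 is most trusted
    citations: list[str] = field(default_factory=list)
    evidence_type: str = "unspecified"


def contest(claim: Claim) -> Claim:
    """Mark a single claim as contested without touching any other claim."""
    claim.status = Status.CONTESTED
    return claim
```

Because each claim is its own object with its own identifier, contesting one leaves every other claim in the same brief untouched.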
Confidence is computed from a deterministic formula over source tier, consensus across providers, adversarial challenge outcome, and evidence recency, not from LLM self-report. Scores fall in the range [5, 95]. LLM-reported confidence is recorded in audit metadata but never used in the score.
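The shape of such a formula can be sketched as follows. Only the four inputs, the [5, 95] range, and the rule that LLM-reported confidence is audit-only come from the description above; the point values, the ten-year recency window, the clamping step, and the function name are illustrative assumptions, not the engine's actual weights.

```python
# Assumed point values per source tier (1 = most trusted). Illustrative only.
TIER_POINTS = {1: 45.0, 2: 25.0, 3: 0.0}


def compute_confidence(source_tier, provider_agreement, survived_challenge,
                       evidence_age_years, llm_reported=None):
    """Return (score, audit). Same inputs always give the same score.

    provider_agreement is a fraction in [0, 1]. llm_reported is stored in the
    audit record only and never enters the arithmetic.
    """
    tier_points = TIER_POINTS[source_tier]
    consensus_points = 25.0 * provider_agreement
    challenge_points = 20.0 if survived_challenge else 0.0
    # Assumed linear decay to zero over ten years.
    recency_points = 10.0 * max(0.0, 1.0 - evidence_age_years / 10.0)

    raw = tier_points + consensus_points + challenge_points + recency_points
    score = max(5.0, min(95.0, raw))  # keep the score inside [5, 95]
    audit = {"raw": raw, "llm_reported": llm_reported}
    return score, audit
```

With these assumed weights, a tier-2 source with half the providers agreeing, a survived challenge, and five-year-old evidence scores 25 + 12.5 + 20 + 5 = 62.5, whatever confidence the model itself reported.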
Adversarial challenge (Phase 3) is mandatory and runs before synthesis. Claims that don't survive lose their Socratic bonus and are reassessed. Challenges are persisted as typed output, not discarded after scoring. The brief reflects what survived scrutiny, not what sounded best.
Knowledge gaps are first-class objects with importance ratings. They accumulate across sessions and narrow as evidence arrives. The reader knows where the brief stops being reliable, not as a disclaimer, but as a structured finding.
VERIFIED claims from session N reduce re-verification burden in session N+1. The knowledge graph accumulates with use. Contradictions between sessions are preserved, not silently resolved. Session five builds on sessions one through four.
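One way to picture the carry-forward behaviour is the sketch below. It assumes claims are matched across sessions by a stable identifier and that each new observation either supports or contradicts a claim; the function name, the stance labels, and the data shapes are hypothetical.

```python
def carry_forward(prior_status, session_claims):
    """Split a new session's claims into work to do and contradictions to keep.

    prior_status: claim_id -> status string from earlier sessions.
    session_claims: list of (claim_id, stance), stance in {"supports", "contradicts"}.
    """
    to_verify = []
    contradictions = []
    for claim_id, stance in session_claims:
        previous = prior_status.get(claim_id)
        if previous == "VERIFIED" and stance == "supports":
            continue  # already verified earlier, nothing to redo
        if previous == "VERIFIED" and stance == "contradicts":
            contradictions.append(claim_id)  # recorded, not silently resolved
            continue
        to_verify.append(claim_id)  # new or not yet verified
    return to_verify, contradictions
```

In this sketch a previously verified claim that the new session supports drops out of the workload, while one that the new session contradicts is kept on a separate list rather than overwritten.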
Synthesis direction (Question → Brief) and Verification direction (Document → Decision Record) share the same graph, formula, and adversarial mechanism. Ingest an authoritative report; its claims enter the same evidence quality system as retrieved sources.
Source trust hierarchy, claim type vocabulary, scoring weights, and output format are runtime parameters, not prompts or hardcoded logic. The same binary runs policy research, investment due diligence, and regulatory compliance with no architectural change.
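A minimal sketch of what "runtime parameters" could look like, assuming a profile object is loaded at start-up. The `DomainProfile` class, its field names, the example profiles, and every value in them are hypothetical; only the four parameter categories come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainProfile:
    name: str
    source_tiers: dict[str, int]        # source kind -> trust tier (1 = highest)
    claim_types: tuple[str, ...]        # vocabulary the extractor looks for
    scoring_weights: dict[str, float]
    output_format: str


def tier_for(profile: DomainProfile, source_kind: str) -> int:
    """Look up a source's tier; unknown kinds rank below every listed tier."""
    lowest_listed = max(profile.source_tiers.values())
    return profile.source_tiers.get(source_kind, lowest_listed + 1)


# Two illustrative profiles served by the same code path.
POLICY = DomainProfile(
    name="policy_research",
    source_tiers={"peer_reviewed": 1, "government_report": 1, "news": 3},
    claim_types=("causal", "statistical", "normative"),
    scoring_weights={"tier": 0.4, "consensus": 0.3, "challenge": 0.2, "recency": 0.1},
    output_format="brief",
)
DUE_DILIGENCE = DomainProfile(
    name="investment_due_diligence",
    source_tiers={"audited_filing": 1, "peer_reviewed": 2, "vendor_material": 3},
    claim_types=("financial", "capability", "risk"),
    scoring_weights={"tier": 0.5, "consensus": 0.2, "challenge": 0.2, "recency": 0.1},
    output_format="decision_record",
)
```

The same `tier_for` function ranks a peer-reviewed paper differently depending on which profile is loaded, which is the sense in which the domain is configuration rather than code.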
The hardest problem in evidence-based research is not retrieval. It is the gap between what a source says and what a claim needs it to prove. A topically relevant source is not the same as a supporting source. A supporting source is not the same as an independently corroborating one. Most research tools collapse these distinctions. The engine does not.
Claims that have survived adversarial challenge, passed source credibility gates, and achieved independent corroboration from sources with no shared citation lineage. These are the claims the brief can stand behind. The gate is deliberately hard. Zero strict findings is a correct output when the evidence does not warrant them.
Evidence that is relevant and sourced but does not yet meet the bar for a formal finding. A single high-quality source. A source that partially supports the claim. A finding from a constrained study presented in a broader context. Surfaced explicitly so the analyst knows what it is and is not.
Background material, metadata-only records, vendor claims, adjacent literature. Useful for orienting a research direction. Not eligible for the findings layer. The engine labels these explicitly rather than mixing them into the output and letting the analyst mistake noise for signal.
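The three layers described above can be sketched as a single admission function. The layer labels other than "strict finding", the threshold of two independent sources, and the function signature are assumptions made for illustration; the real gates are likely richer than three inputs.

```python
from enum import Enum


class Layer(Enum):
    STRICT_FINDING = "strict_finding"
    QUALIFIED_EVIDENCE = "qualified_evidence"   # assumed label
    CONTEXT = "context"                         # assumed label


def assign_layer(survived_challenge: bool,
                 passed_credibility_gate: bool,
                 independent_sources: int) -> Layer:
    """Place a claim in exactly one layer; the strict gate is the hardest."""
    if not passed_credibility_gate:
        # e.g. vendor claims or metadata-only records
        return Layer.CONTEXT
    if survived_challenge and independent_sources >= 2:
        return Layer.STRICT_FINDING
    # Sourced and relevant, but below the bar (e.g. a single good source).
    return Layer.QUALIFIED_EVIDENCE


def strict_findings(claims):
    """claims: list of (survived, passed_gate, independent_sources) tuples."""
    return [c for c in claims if assign_layer(*c) is Layer.STRICT_FINDING]
```

Note that `strict_findings` returning an empty list is a valid result here: nothing in the function lowers the bar to guarantee output.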
When evidence does not meet admission standards, the engine fails closed. It does not lower the threshold to produce output. It does not present qualified evidence as a finding. It does not fabricate references to fill a gap. A brief that clearly separates what is established from what is provisional is more useful than one that presents everything with equal confidence. The gaps are typed, rated, and visible. They are part of the output, not an absence of it.
The research community working on automated claim verification has converged on the same diagnosis: the bottleneck is not acquiring sources. It is determining what a fetched source actually says about a specific claim. Academic papers use different terminology from the query. They qualify findings. They describe methods alongside results. A system that treats topical relevance as evidential support will surface plausible-looking material that does not, on close reading, warrant the claim it appears to support. The engine is designed around this bottleneck, not around retrieval volume.
None of these domains currently has a tool that does what the engine does. Each is doing the equivalent work manually, in spreadsheets, committee documents, and institutional reports that nobody reads systematically three years later.
Dozens of UN agencies produce simultaneous situation reports on the same crisis. The claims conflict. Nobody tracks which ones are established. OCHA's Humanitarian Needs Overviews are evidence synthesis documents built under time pressure with no structured memory between crises.
A claim circulates in 40 sources. All 40 trace back to one original. That's amplification, not corroboration, but standard tools can't tell the difference. Fact-checking organisations and digital forensics labs do structured evidence work that itself needs to be defensible.
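The amplification case can be made concrete with a small sketch that counts distinct origins rather than distinct sources. It assumes each source records at most one upstream source it took the claim from; the `cites` mapping and both function names are hypothetical.

```python
def lineage_root(source, cites):
    """Follow 'took the claim from' links until a source with no upstream."""
    seen = set()
    # The seen set only prevents an infinite loop if the data contains a cycle.
    while source in cites and source not in seen:
        seen.add(source)
        source = cites[source]
    return source


def independent_origins(sources, cites):
    """Return the set of distinct lineage roots behind a list of sources."""
    return {lineage_root(s, cites) for s in sources}


# Forty sources: one original plus a chain of 39 reposts.
cites = {"repost-0": "original"}
for i in range(1, 39):
    cites[f"repost-{i}"] = f"repost-{i - 1}"
sources = ["original"] + [f"repost-{i}" for i in range(39)]
```

Here `independent_origins(sources, cites)` contains a single root, so the forty sources count as one piece of evidence; adding a source with no link into that chain would raise the count to two.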
Government surveys, community testimonies, environmental studies, and corporate reports all exist in the same dispute. Contradictions between them are resolved by institutional power, not by evidence quality.
Treaty bodies assess state compliance claims cycle after cycle. Previous findings sit in PDF reports. Each new cycle starts from near-zero institutional memory. States report; bodies assess. Currently this is manual evidence synthesis with no compounding knowledge.
A company claims 95% accuracy. An independent study finds 60% on a specific demographic. Both claims exist in the public record. Neither is resolved; they just accumulate. Civil society and government auditors assessing AI harm need structured evidence work.
Government and defence procurement decisions are made over 18 months, across teams, based on claims that need to be traceable to source when the auditor arrives two years later. Capability, price, risk, compliance: each requires provenance, and each is subject to challenge.
The best strategy research already works the way the engine works: individual claims are sourced and graded, contradictions between data points are noted, gaps in the evidence are named, and the final recommendation is honest about its confidence level. What it doesn't do is carry that structure forward to the next engagement, the next client, the next analyst who joins the team.
The structured brief a senior consultant produces for a board is a claim vault; it just doesn't look like one, and it evaporates when the project ends. The engine is what that process looks like when the institutional memory is preserved rather than PDF'd into an archive.
The claim extraction, confidence scoring, gap tracking, and decision logging are the same across every deployment. What changes is narrower than it looks: the source tier hierarchy that governs which publications and databases count as primary, the claim type vocabulary that structures what the engine is looking for, and the output artefact format that reflects how findings get used. Seven working modes are configured in the current build.
The EU AI Act, UNESCO's AI Ethics Recommendation, and a growing number of national frameworks share one underlying requirement: AI used in high-stakes contexts must be explainable and traceable, not just technically, but epistemically. The question is not only what the system output, but what evidence it drew on, where that evidence conflicted, and what was uncertain when the decision was made.
Every claim carries its source tier, citations, and confidence derivation. Not a summary, but a structured assertion with provenance.
Weak and contested findings surface explicitly. The brief reflects what the evidence supports, not what sounds most authoritative.
When a decision is logged, the full evidence state is preserved at that moment: what was verified, what was contested, and which gaps were acknowledged. Article 12 record-keeping under the EU AI Act comes as a byproduct.
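A decision record of this kind can be sketched as an immutable snapshot. The `DecisionRecord` class, its fields, and the status strings are assumptions for illustration; the point is only that later changes to the evidence do not rewrite what was known when the decision was made.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionRecord:
    decision: str
    logged_at: str                 # UTC timestamp in ISO 8601 form
    verified: tuple[str, ...]      # claim ids verified at logging time
    contested: tuple[str, ...]     # claim ids contested at logging time
    gaps: tuple[str, ...]          # acknowledged knowledge gaps


def log_decision(decision, claim_status, gaps):
    """Copy the current evidence state into a frozen record."""
    return DecisionRecord(
        decision=decision,
        logged_at=datetime.now(timezone.utc).isoformat(),
        verified=tuple(sorted(c for c, s in claim_status.items() if s == "VERIFIED")),
        contested=tuple(sorted(c for c, s in claim_status.items() if s == "CONTESTED")),
        gaps=tuple(gaps),
    )
```

Because the record holds tuples copied at logging time, a claim that is later re-verified or a gap that is later closed changes the live graph but not the record an auditor reads.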
Marking a claim as evidence-grounded raises its confidence. The system reflects researcher judgment, not just model output.
Epistamate is in active development, and I am talking to organisations where defensible, compounding evidence matters. If you recognise your domain in this page, I'd like to hear about the specific problem before describing the solution.