LLM hallucination detection, grounded in real sources.
Lenz detects hallucinations the honest way: it pulls every verifiable claim out of your model's output, researches each one against real independent sources, and returns a sourced verdict with citations — not a confidence guess from another model grading its own kind.
Catch hallucinations in CI before they ship, or gate answers at runtime before they reach a user.
What a detected hallucination looks like.
A real model-written claim, verified by the real pipeline. Click through for the full evidence.
“Albert Einstein won the 1921 Nobel Prize in Physics for his theory of general relativity.”
Plausible, fluent, and wrong — the classic hallucination shape. The 1921 Nobel was awarded for the photoelectric effect, not relativity. Lenz catches it because the sources say so, not because a model felt unsure.
See the full verification: sources, debate, reasoning →
How detection works.
Three API primitives chain into a detection pipeline:
- POST /extract pulls the verifiable claims out of any model output (free, 1,000/day).
- POST /assess returns a fast 3-model panel verdict per claim in ~5-10s.
- POST /verify runs the full 8-model pipeline with citations on the claims that matter (~90s).
Evidence-grounded, not self-grading: verdicts come from research across real sources, argued through an adversarial debate and scored by independent reviewers — with the full audit trail attached.
Why multi-model beats a single checker.
A model grading its own vendor's output isn't independent — and a single checker inherits a single model's blind spots. Lenz runs frontier models from rival vendors against each other: one argues the claim is true, one argues it's false, and three independent reviewers adjudicate on the evidence.
The disagreement is the signal. When the panel splits, you see it — individual scores, not a blended guess.
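A minimal sketch of reading a split panel. The score shape here is an assumption made for illustration (integer scores from 0 to 100, one per reviewer, with hypothetical reviewer labels); the real response fields may differ. The point is only that the spread between individual scores carries information that an average hides.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReviewerScore:
    reviewer: str  # hypothetical reviewer label
    score: int     # assumed scale: 0 = clearly false, 100 = clearly true


def panel_summary(scores, split_threshold=40):
    """Summarise a panel without hiding disagreement.

    split_threshold is an illustrative cut-off, not a documented value.
    """
    values = [s.score for s in scores]
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "spread": spread,
        "split": spread >= split_threshold,
    }


scores = [
    ReviewerScore("reviewer_a", 90),
    ReviewerScore("reviewer_b", 20),
    ReviewerScore("reviewer_c", 85),
]
summary = panel_summary(scores)
print(summary)
```

Here the mean alone (65) would look like mild agreement, while the spread of 70 points shows that one reviewer strongly dissented.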
Pick your latency tier
- Gate a response while the user waits → /assess, ~5-10s, fits a chat-completion timeout
- Deep check with citations and audit trail → /verify, ~90s, async with webhooks
- Both → /assess everything, escalate low-confidence claims to /verify
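The "both" tier can be sketched as a simple routing step. This is an illustration under assumptions: the claim objects mirror the `claim`, `verdict`, and `confidence` fields used in the page's own Python snippet, but the confidence labels other than "high" are guesses, and the `AssessedClaim` class and `route` function are hypothetical stand-ins rather than SDK types.

```python
from dataclasses import dataclass


@dataclass
class AssessedClaim:
    claim: str
    verdict: str     # "True" / "False", as in the page's snippet
    confidence: str  # "high" appears on the page; "medium"/"low" are assumed


def route(claims, trusted=("high",)):
    """Split /assess results into settled claims and claims to escalate."""
    settled, escalate = [], []
    for c in claims:
        (settled if c.confidence in trusted else escalate).append(c)
    return settled, escalate


assessed = [
    AssessedClaim("Einstein won the 1921 Nobel Prize in Physics.", "True", "high"),
    AssessedClaim("The prize was awarded for general relativity.", "False", "high"),
    AssessedClaim("The award was announced in 1922.", "True", "low"),
]
settled, escalate = route(assessed)
for c in escalate:
    # In a real integration this is where the claim would be sent to /verify.
    print("escalate to /verify:", c.claim)
```

Only the escalated claims pay the ~90s cost; everything the fast panel is confident about is settled immediately.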
Detect hallucinations in 10 lines
Python
pip install lenz-io
# model output in — hallucinations out
from lenz_io import Lenz

client = Lenz(api_key="lenz_...")
claims = client.extract(text=model_output).identified_claims
r = client.assess(text=" ".join(claims))
for c in r.claims:
    if c.verdict == "False":
        print("HALLUCINATION:", c.claim, c.confidence)
# HALLUCINATION: Albert Einstein won the 1921 Nobel Prize ... high
TypeScript SDK too: npm install lenz-io. Same flow, typed end to end.
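For the CI use mentioned at the top of the page, one possible shape is a small gate that turns verdicts into an exit code. This sketch does not call the SDK: it uses stand-in objects with the same `claim`, `verdict`, and `confidence` attributes as the snippet above, and `ci_exit_code` is a hypothetical helper, not part of lenz-io.

```python
import sys
from types import SimpleNamespace


def ci_exit_code(claims):
    """Return 1 if any claim was judged "False", else 0, logging each failure."""
    failures = [c for c in claims if c.verdict == "False"]
    for c in failures:
        print(f"HALLUCINATION: {c.claim} (confidence: {c.confidence})", file=sys.stderr)
    return 1 if failures else 0


# Stand-ins for the objects in r.claims from the snippet above.
sample = [
    SimpleNamespace(
        claim="Einstein won the 1921 Nobel Prize for general relativity.",
        verdict="False",
        confidence="high",
    ),
    SimpleNamespace(
        claim="Einstein won the 1921 Nobel Prize in Physics.",
        verdict="True",
        confidence="high",
    ),
]
exit_code = ci_exit_code(sample)
# In a real CI script this would end with: sys.exit(ci_exit_code(r.claims))
```

A non-zero exit code fails the job, so a false claim blocks the build the same way a failing test would.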
Not an inline guardrail. An investigator.
Millisecond guardrails classify against the model's own context — fast, cheap, and blind to anything outside it. Lenz sits at the other end of the latency-fidelity axis: it researches each claim against fresh, independent sources and shows its work. Use a guardrail to filter; use Lenz when you need to actually know — and to prove it.
Pricing
Start free, no card required. /extract is free at 1,000 calls/day.
Free
$0
/extract · 1,000/day
/assess · 100/mo
/verify · 10/mo
Wire detection into CI today, zero spend.
Pro
$99/mo
/assess · 5,000/mo
/verify · 500/mo
/ask · 1,000/mo
Self-serve. $999/yr when billed annually.
Frequently asked questions
Checking whether the factual claims in a model's output are actually true. Lenz extracts each verifiable claim, researches it against real independent sources, and returns a verdict with citations — rather than asking another model to guess.
AI detectors guess whether a text was written by AI. Lenz does the opposite job: it takes text you already know came from a model and checks whether what it says is factually true.
Yes. Lenz verifies against fresh web research rather than your retrieval context, so it catches both RAG failure modes: answers unfaithful to the retrieved documents, and retrieved documents that are themselves wrong or stale.
Every verdict ships with cited sources and the full research, debate, and adjudication trail — you can verify the verification. We also publish open evaluations of frontier-model accuracy on real user claims.
Use /assess for a ~5-10s panel verdict that fits a chat-completion timeout; escalate low-confidence claims to the full /verify pipeline asynchronously with webhooks.
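A runtime gate along these lines might look like the sketch below. It assumes an async callable that returns one verdict string per claim; `fake_assess` and `slow_assess` are stubs standing in for a real /assess call, and the policy of holding the answer on timeout is an illustrative choice, not documented behaviour.

```python
import asyncio


async def gate(answer, assess, timeout_s=10.0):
    """Pass, block, or hold an answer based on a time-boxed assessment."""
    try:
        verdicts = await asyncio.wait_for(assess(answer), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumed policy: hold the answer rather than ship it unchecked.
        return {"action": "hold", "reason": "assessment timed out"}
    if any(v == "False" for v in verdicts):
        return {"action": "block", "reason": "false claim detected"}
    return {"action": "pass", "reason": "no false claims"}


async def fake_assess(text):
    # Stub: flags the Einstein example from this page, passes everything else.
    await asyncio.sleep(0)
    return ["False"] if "general relativity" in text else ["True"]


async def slow_assess(text):
    # Stub: simulates an assessment that outlives the timeout.
    await asyncio.sleep(1)
    return ["True"]


blocked = asyncio.run(gate("He won the 1921 Nobel for general relativity.", fake_assess))
passed = asyncio.run(gate("He won the 1921 Nobel Prize in Physics.", fake_assess))
held = asyncio.run(gate("Any answer.", slow_assess, timeout_s=0.01))
print(blocked["action"], passed["action"], held["action"])
```

The 10-second default matches the upper end of the ~5-10s figure quoted for /assess; claims that come back uncertain could be queued for /verify out of band.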
Catch the next hallucination before a customer does.
Self-serve from day one. Free tier, no card required.