For LLM product teams

Catch AI hallucinations before they ship.

LLM hallucination detection, done the honest way: Lenz pulls every verifiable claim out of your model's output, researches each against real independent sources, and returns a sourced verdict with citations — not a confidence guess from another model grading its own kind.

Wire it into CI to catch regressions, or gate answers at runtime before they reach a user.
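As a rough sketch of the CI use, a test can fail the build whenever a checked answer contains a claim judged false. The `ClaimResult` shape below mirrors the `claim`, `verdict`, and `confidence` fields used in the Python example on this page; the helper names, the "True" verdict label, and the sample data are hypothetical, and a real test would feed in results from the API rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class ClaimResult:
    # Hypothetical stand-in for one per-claim result.
    claim: str
    verdict: str      # "False" marks a hallucination; "True" is an assumed label
    confidence: str


def find_hallucinations(results):
    """Return the results whose verdict is "False"."""
    return [r for r in results if r.verdict == "False"]


def check_answer(results):
    """Raise AssertionError listing every false claim, so a CI test fails loudly."""
    bad = find_hallucinations(results)
    if bad:
        lines = "\n".join(f"- {r.claim} ({r.confidence})" for r in bad)
        raise AssertionError(f"{len(bad)} false claim(s):\n{lines}")


# Illustrative data only.
sample = [
    ClaimResult("Einstein won the 1921 Nobel Prize in Physics.", "True", "high"),
    ClaimResult("The prize was awarded for general relativity.", "False", "high"),
]
```

Calling `check_answer(sample)` inside a test function would fail the run and name the offending claim; a clean answer passes silently.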

See how the /verify pipeline works →
Try the pipeline on a claim first →

Get a free API key →
See the quickstart API reference

What a detected hallucination looks like.

A real model-written claim, verified by the real pipeline. Click through for the full evidence.

“Albert Einstein won the 1921 Nobel Prize in Physics for his theory of general relativity.”

Verdict: False · Lenz score: 1/10 · sourced verdict

Plausible, fluent, and wrong — the classic hallucination shape. The 1921 Nobel was awarded for the photoelectric effect, not relativity. Lenz catches it because the sources say so, not because a model felt unsure.

See the full verification — sources, debate, reasoning →

Built for RAG hallucinations, too.

Groundedness checkers tell you whether an answer is faithful to your retrieved context — not whether that context is actually right. Lenz works the other way: it extracts the claims from your model's answer and verifies each against fresh independent sources, catching what those tools miss — an answer built faithfully on a retrieved document that is itself wrong or out of date.
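The distinction can be pictured as two independent checks on the same answer. The sketch below is purely illustrative: the function and its labels are hypothetical, and the two booleans stand in for the outcome of a groundedness check and of a source-based verification.

```python
def diagnose(faithful_to_context: bool, holds_against_sources: bool) -> str:
    """Combine two independent checks on a RAG answer into one label."""
    if faithful_to_context and holds_against_sources:
        return "ok"
    if faithful_to_context:
        # The case a groundedness checker alone would pass.
        return "faithful to a wrong or stale document"
    if holds_against_sources:
        return "true, but not supported by the retrieved context"
    return "unsupported and false"
```

The second branch is the failure mode described above: the answer matches the retrieved document, yet the claim does not hold up against independent evidence.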

How detection works.

Three API primitives chain into a detection pipeline:

POST /extract pulls the verifiable claims out of any model output (free, 1,000/day).
POST /assess returns a fast 3-model panel verdict per claim in ~5-10s.
POST /verify runs the full 8-model pipeline with citations on the claims that matter (~90s).
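The chaining of the three steps can be sketched as a small triage function. Because the exact SDK signatures are not spelled out here, the extract and assess steps are passed in as plain callables; `triage`, `Assessment`, the "low" confidence label, and the stand-in implementations are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Assessment:
    # Hypothetical per-claim result from the fast panel step.
    claim: str
    verdict: str
    confidence: str  # assumed labels such as "high" or "low"


def triage(
    text: str,
    extract: Callable[[str], List[str]],
    assess: Callable[[str], Assessment],
) -> Tuple[List[Assessment], List[str]]:
    """Extract claims, assess each one, and split the results.

    Returns (settled, escalate): assessments confident enough to act on,
    and the claims that should go on to the slower full pipeline.
    """
    settled, escalate = [], []
    for claim in extract(text):
        result = assess(claim)
        if result.confidence == "low":
            escalate.append(claim)
        else:
            settled.append(result)
    return settled, escalate


# Stand-in implementations so the sketch runs without network access.
def fake_extract(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def fake_assess(claim):
    verdict = "False" if "relativity" in claim else "True"
    confidence = "low" if "probably" in claim else "high"
    return Assessment(claim, verdict, confidence)


settled, escalate = triage(
    "Einstein won the 1921 Nobel for relativity. It was probably awarded in 1922",
    fake_extract,
    fake_assess,
)
```

In a real integration the two callables would wrap the /extract and /assess calls, and the `escalate` list would be submitted to /verify.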

Evidence-grounded, not self-grading: verdicts come from research across real sources, argued through an adversarial debate and scored by independent reviewers — with the full audit trail attached.

Why multi-model beats a single checker.

A model grading its own vendor's output isn't independent — and a single checker inherits a single model's blind spots. Lenz runs frontier models from rival vendors against each other: one argues the claim is true, one argues it's false, and three independent reviewers adjudicate on the evidence.

The disagreement is the signal. When the panel splits, you see it — individual scores, not a blended guess.
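One way to keep a split visible instead of blending it away is to report the individual scores alongside their spread. This is a minimal sketch, assuming each reviewer returns a score on the 1 to 10 scale shown in the example verdict; the function name and the split threshold of 4 are illustrative choices, not documented behavior.

```python
from statistics import mean


def summarize_panel(scores, split_threshold=4):
    """Summarize independent reviewer scores without hiding disagreement.

    scores: one number per reviewer (assumed 1-10 scale).
    split_threshold: assumed cutoff; a max-min gap at or above it counts as a split.
    """
    if not scores:
        raise ValueError("need at least one reviewer score")
    spread = max(scores) - min(scores)
    return {
        "scores": list(scores),  # individual scores stay visible
        "mean": mean(scores),
        "spread": spread,
        "split": spread >= split_threshold,
    }


agree = summarize_panel([1, 2, 1])    # reviewers close together
divided = summarize_panel([2, 8, 5])  # reviewers far apart
```

A consumer can then treat `split` as a reason to escalate or to show the raw scores to a human, rather than trusting the mean.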

Pick your latency tier

Gate a response while the user waits → /assess, ~5-10s, fits a chat-completion timeout
Deep check with citations and audit trail → /verify, ~90s, async with webhooks
Both → /assess everything, escalate low-confidence claims to /verify
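The "both" tier can be sketched as a gate with a time budget. The example below uses `asyncio.wait_for` around a stand-in coroutine; `gate_response`, the tuple shape of the results, the 10-second default budget, and the sample claims are assumptions for illustration, and what to do when the budget runs out (release the answer unchecked or hold it) is left to the caller.

```python
import asyncio


async def gate_response(answer, assess, budget_s=10.0):
    """Run a fast check within a time budget before releasing an answer.

    assess: async callable returning a list of (claim, verdict, confidence)
            tuples; a hypothetical stand-in for the fast panel call.
    Returns (decision, to_escalate), where decision is "block", "allow",
    or "unchecked".
    """
    try:
        results = await asyncio.wait_for(assess(answer), timeout=budget_s)
    except asyncio.TimeoutError:
        # Budget exceeded: the caller decides whether to fail open or closed.
        return "unchecked", []
    if any(verdict == "False" for _, verdict, _ in results):
        return "block", []
    # Low-confidence claims are handed off for the slower deep check.
    to_escalate = [claim for claim, _, conf in results if conf == "low"]
    return "allow", to_escalate


# Stand-in checkers with made-up claims, so the sketch runs offline.
async def fake_assess(answer):
    await asyncio.sleep(0.01)
    return [
        ("Water boils at 100 C at sea level", "True", "high"),
        ("The bridge opened in 1931", "True", "low"),
    ]


async def slow_assess(answer):
    await asyncio.sleep(1.0)
    return []


decision, to_escalate = asyncio.run(gate_response("...", fake_assess))
timed_out = asyncio.run(gate_response("...", slow_assess, budget_s=0.05))
```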

Detect hallucinations in 10 lines

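Python: pip install lenz-io

# model output in, hallucinations out
from lenz_io import Lenz

client = Lenz(api_key="lenz_...")

# The text to check; in practice this is your model's response.
model_output = "Albert Einstein won the 1921 Nobel Prize in Physics for his theory of general relativity."

# Pull out the verifiable claims, then get a fast panel verdict on them.
claims = client.extract(text=model_output).identified_claims
r = client.assess(text=" ".join(claims))

# Report every claim the panel judged false.
for c in r.claims:
    if c.verdict == "False":
        print("HALLUCINATION:", c.claim, c.confidence)
# HALLUCINATION: Albert Einstein won the 1921 Nobel Prize ... high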

TypeScript SDK too: npm install lenz-io. Same flow, typed end to end.

Not an inline guardrail. An investigator.

Millisecond guardrails classify against the model's own context — fast, cheap, and blind to anything outside it. Lenz sits at the other end of the latency-fidelity axis: it researches each claim against fresh, independent sources and shows its work. Use a guardrail to filter; use Lenz when you need to actually know — and to prove it.

See our open evaluation of where frontier models disagree on real-world fact-checks →

Pricing

Start free, no card required. /extract is free at 1,000 calls/day.

Free

/extract · 1,000/day
/assess · 100/mo
/verify · 10/mo
/ask · 20/mo

Prototype today, zero spend. No card required.

Developer

$99/mo

/extract · 1,000/day
/assess · 5,000/mo
/verify · 500/mo
/ask · 1,000/mo

Self-serve. $999/yr when billed annually.

Scale

$399/mo

/extract · 1,000/day
/assess · 20,000/mo
/verify · 2,000/mo
/ask · 4,000/mo

$3,990/yr when billed annually. For production integrations.

Enterprise

Volume beyond Scale, SLAs, white-label, custom integration support.

Talk to us →

Compare all plans on the plans page →
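To compare tiers against an expected monthly volume, the published quotas can be put in a small lookup. The figures below are copied from the plan cards on this page; the function names are hypothetical, the Free tier's annual price of 0 is an assumption, and /extract is left out because its limit is the same on every tier.

```python
# Monthly quotas and prices as listed on this page (USD).
PLANS = [
    {"name": "Free", "monthly": 0, "annual": 0,
     "assess": 100, "verify": 10, "ask": 20},
    {"name": "Developer", "monthly": 99, "annual": 999,
     "assess": 5000, "verify": 500, "ask": 1000},
    {"name": "Scale", "monthly": 399, "annual": 3990,
     "assess": 20000, "verify": 2000, "ask": 4000},
]


def cheapest_plan(assess=0, verify=0, ask=0):
    """Return the first (cheapest) plan that covers the given monthly volume."""
    for plan in PLANS:
        if (assess <= plan["assess"]
                and verify <= plan["verify"]
                and ask <= plan["ask"]):
            return plan["name"]
    return "Enterprise"  # volume beyond Scale


def annual_saving(plan_name):
    """Difference between twelve monthly payments and the annual price."""
    plan = next(p for p in PLANS if p["name"] == plan_name)
    return plan["monthly"] * 12 - plan["annual"]
```

With these figures, `annual_saving("Developer")` is 189 and `annual_saving("Scale")` is 798, and a workload of 3,000 /assess calls plus 50 /verify calls per month lands on the Developer tier.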

Frequently asked questions

What is LLM hallucination detection?

Checking whether the factual claims in a model's output are actually true. Lenz extracts each verifiable claim, researches it against real independent sources, and returns a verdict with citations, rather than asking another model to guess.

How is this different from an AI detector?

AI detectors guess whether a text was written by AI. Lenz does the opposite job: it takes text you already know came from a model and checks whether what it says is factually true.

Does it work for RAG hallucinations?

Yes, but differently from a groundedness checker. Lenz doesn't compare the answer to your retrieval context; it extracts the claims from your model's output and verifies each against fresh independent sources. So it catches the failure mode faithfulness tools miss: an answer built faithfully on a retrieved document that's wrong or stale still gets flagged, because the claim itself doesn't hold up against the evidence.

How accurate is the detection?

Every verdict ships with cited sources and the full research, debate, and adjudication trail, so you can verify the verification. We also publish an open evaluation of where frontier models disagree on real-world fact-checks.

Is it fast enough to gate responses in production?

Use /assess for a ~5-10s panel verdict that fits a chat-completion timeout; escalate low-confidence claims to the full /verify pipeline (~90s) asynchronously with webhooks.

Catch the next hallucination before a customer does.

Self-serve from day one. Free tier, no card required.

Get a free API key →