A practitioner's guide to detecting fluent-but-wrong outputs — and why scores you can't reproduce aren't really scores.

Consider a clinical RAG system grounded in published treatment guidelines. It returns a response recommending a 500mg twice-daily dose, citing a 2019 study by "Dr. Chen et al." that confirmed the protocol across 12,000 patients.

The dose is correct. The study does not exist.

A typical LLM-as-judge eval suite (DeepEval's FaithfulnessMetric, or the equivalent in RAGAS, MiniCheck, or FActScore) would not flag the fabrication. The judge model reads the response, finds the citation grammatically well-formed, semantically plausible, and contextually appropriate, and stamps it as faithful. The eval is doing exactly what it was designed to do. The design is the bug.

The Fluency Trap: same response, different lenses. An LLM-as-judge eval reports 0.96 faithfulness while two of seven claims are unsupported by any source.

This is the failure mode that breaks the most expensive AI deployments — confident hallucinations that pass evaluation. We call it the Fluency Trap, and it's the single hardest problem in production LLM evaluation today.

This post is a long walk through what makes hallucination evaluation hard, why most graders miss the failures that actually matter, and how to build an evaluation pipeline you can defend in an audit eight months after the score was recorded. It's written for the engineer who has to design or pick the grader, not the executive who has to approve the budget. We'll cover anatomy, methodology, determinism, per-claim auditability, domain-specific tuning, public benchmark results, and an honest list of what the field still doesn't solve. Some of it is opinionated. Where the opinion is load-bearing, we name it.

Why Hallucination Evaluation Is a Different Discipline

Most "AI evaluation" content from the last 18 months has been about agent evaluation — multi-step workflows, tool calls, environment state. Important, but a different problem. Hallucination evaluation answers a narrower question:

"For each factual claim in this response, is there a source that supports it?"

That sounds simple. It isn't, and the reasons it isn't show up in three places that other eval categories don't have to deal with.

The grader must read the source, not just the response. Task-completion graders only need to see the agent's output. A hallucination grader has to read the source corpus, find the right span, and compare meaning against the claim. Two-document reasoning under uncertainty.

Single claims fail; aggregate scores hide it. A response with seven claims and one fabrication can score 86% (six of seven correct) and look healthy. In regulated domains, one fabricated claim out of seven is the failure, not a rounding error. Aggregate scores systematically understate the worst case.

The grader's failure mode is plausibility, not absence. A bad task-completion grader misses outputs. A bad hallucination grader approves fluent fabrications. The error mode of an LLM-as-judge grader on hallucination is to be talked into the answer by a confident-sounding response — the same failure mode the system under test is exhibiting. Two models converge on the same wrong answer.

These three properties — two-document grounding, per-claim resolution, plausibility-resistant verdicts — are what make hallucination evaluation a different discipline from agent evaluation, code evaluation, or open-ended quality scoring. The tools and trade-offs are different. You can't transfer instincts from one to the other.

The Anatomy of a Hallucination Grader

Every hallucination grader, whatever its underlying methodology, has to do three things in order. Understanding the three steps independently is the first move toward picking or building one.

Anatomy of a hallucination grader: decomposition produces atomic claims, grounding maps each claim to a source span, and the verdict step emits a three-way decision per claim. The methodology fork at each step separates deterministic graders from LLM-as-judge.

Step 1 — Decomposition

The response gets broken into atomic factual claims. "Metformin is a first-line treatment for type 2 diabetes; the recommended starting dose is 500mg twice daily, and clinical trials show a 47% improvement in outcomes when combined with lifestyle interventions" is one sentence, three claims:

Metformin is a first-line treatment for type 2 diabetes
The recommended starting dose is 500mg twice daily
Clinical trials show a 47% improvement in outcomes with lifestyle interventions

The grader's job at this step is to produce claim units that can be verified independently. Conjunctions, sub-clauses, and shared subjects all get expanded. Hedging language ("I think", "it's possible") gets dropped, because hedged statements aren't factual claims.

The decomposition step is where the first methodology fork happens. Most graders use an LLM to decompose — they prompt a model to "extract atomic claims from this response." That works, but it introduces non-determinism (same input → different claim lists) and adds cost (every eval pays for one LLM call before any grounding work begins). The alternative is deterministic linguistic decomposition — slower to design, but reproducible and free at runtime.

We covered why determinism matters in a previous post on reproducibility; the short version is that if your claim list moves between runs, your faithfulness score moves with it, even if nothing else changed.

Step 2 — Grounding

Each atomic claim gets compared against the source corpus. The grader has to answer: which source span, if any, supports this claim?

Most graders use a two-stage approach here. First, a fast retrieval step (embedding similarity, BM25, or hybrid) narrows the source corpus down to the most likely supporting spans. Then, a slower verification step — usually Natural Language Inference (NLI) for deterministic graders, or another LLM call for LLM-as-judge frameworks — produces a verdict on whether the candidate span actually entails the claim.

The retrieval step is the easy part. The verification step is where most graders fail.

Why retrieval-only grading isn't enough. A claim and a source span can be semantically similar (same topic, same domain, similar vocabulary) without the source actually supporting the claim. Two clinical statements about HbA1c thresholds will look almost identical to an embedding model even if one says "below 6.5%" and the other says "below 7%". The numbers differ; the embedding distance barely does. A grounding decision based on embedding similarity alone marks the claim as supported and moves on.

This is the central reason hallucination evaluation needs more than semantic search. The grader has to do logical verification, not just similarity.
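
A toy illustration of that gap, as a minimal sketch. Token overlap stands in for embedding similarity here, which is an assumption made so the example runs without a model; real embeddings differ in detail but share the blind spot. The function names are hypothetical.

```python
import re

def token_jaccard(a: str, b: str) -> float:
    """Crude lexical similarity: Jaccard overlap of lowercase tokens."""
    pattern = r"\d+(?:\.\d+)?%?|[a-z][a-z0-9]*"
    ta = set(re.findall(pattern, a.lower()))
    tb = set(re.findall(pattern, b.lower()))
    return len(ta & tb) / len(ta | tb)

def numbers(text: str) -> set[str]:
    """Standalone numeric values; digits inside words like 'HbA1c' are skipped."""
    return set(re.findall(r"(?<![A-Za-z\d.])\d+(?:\.\d+)?", text))

claim = "HbA1c below 6.5% is the diagnostic threshold"
source = "HbA1c below 7% is the diagnostic threshold"

similarity = token_jaccard(claim, source)        # 0.75: six of eight distinct tokens shared
values_agree = numbers(claim) == numbers(source)  # False: {"6.5"} vs {"7"}
```

The similarity score says "close match" while the only token that matters clinically is the one that differs.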

Step 3 — Verdict

The grader emits a per-claim verdict — usually a three-way decision rather than binary:

Supported. A source span entails the claim.
Contradicted. A source span explicitly disagrees with the claim.
Unsupported. No source span supports the claim, and none contradicts it.

Most LLM-as-judge frameworks collapse "contradicted" and "unsupported" into a single "not faithful" bucket. This is a mistake. Contradiction means the system stated something that conflicts with the source; unsupported means it stated something the source is silent on. Both are failures, but they have different remediation paths: a contradiction usually points to a generation problem, and an unsupported claim usually points to a retrieval problem. Conflating them hides the diagnosis.

Once the per-claim verdicts are computed, the grader aggregates them into a faithfulness score (typically grounded claims / total claims). The aggregation is the easy step; the per-claim verdicts are the auditable artifact that matters.
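
One possible shape for the per-claim artifact and the aggregation on top of it, as a sketch. The class and function names are hypothetical, not taken from any particular framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNSUPPORTED = "unsupported"

@dataclass(frozen=True)
class ClaimVerdict:
    claim: str
    verdict: Verdict
    source_span: Optional[str] = None  # None when no span supports the claim

def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Aggregate score: supported claims divided by total claims."""
    if not verdicts:
        raise ValueError("no claims to score")
    supported = sum(v.verdict is Verdict.SUPPORTED for v in verdicts)
    return supported / len(verdicts)

def failures(verdicts: list[ClaimVerdict]) -> list[ClaimVerdict]:
    """The auditable part: every claim that is not supported."""
    return [v for v in verdicts if v.verdict is not Verdict.SUPPORTED]

# Six grounded claims plus one fabricated citation (illustrative data).
verdicts = [ClaimVerdict(f"claim {i}", Verdict.SUPPORTED, f"span {i}") for i in range(1, 7)]
verdicts.append(ClaimVerdict("2019 study by Dr. Chen et al.", Verdict.UNSUPPORTED))

score = faithfulness(verdicts)  # 6 / 7, roughly 0.857
flagged = failures(verdicts)    # the one fabricated claim
```

The aggregate looks healthy; the list returned by `failures` is what a reviewer actually needs.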

Why Most Hallucination Graders Use LLMs (and What That Breaks)

The default methodology in 2026 is LLM-as-judge. Open up RAGAS, DeepEval, FActScore, MiniCheck, or the eval features in Langfuse and LangSmith, and the grounding/faithfulness check is almost always a downstream LLM call. The judge reads the response, reads the source, and emits a verdict.

This works. It also has three failure modes that compound at scale.

Failure Mode 1 — Non-determinism

LLMs are stochastic. Even at temperature 0, modern serving stacks introduce variance — that's why OpenAI added the seed parameter and system_fingerprint field, and why the docs explicitly note that completions can still vary when the underlying infrastructure changes.

For an LLM-as-judge faithfulness grader, this variance manifests as score drift between identical runs. We've seen the same input land at 0.71, 0.83, 0.79, 0.75, 0.83 across five runs of a popular open-source framework at temperature 0. Same response, same source, same judge model, same framework version. The score moved 12 points.

That variance breaks three downstream things:

Regression detection. A score moved from 0.86 to 0.79 after a prompt change. Did the prompt regress, or did the scorer hiccup? You can't tell.
Statistical experimentation. A/B significance tests assume a stable measurement instrument. Grader noise adds to the variance of the metric, and the required sample size grows in proportion to that total variance (the square of the standard deviation).
Audit defensibility. Under the EU AI Act, OCC AI guidance, and HIPAA-adjacent regimes, you have to defend an evaluation result months after the fact. If running the same input today produces a different score, the original score has no defensible meaning.
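
To put numbers on the drift, here is a small sketch using the five scores quoted above. The inflation formula assumes grader noise is independent of the true per-sample variation, which is the standard simplification; the function name is hypothetical.

```python
import statistics

scores = [0.71, 0.83, 0.79, 0.75, 0.83]  # five identical runs, from the text above

spread = max(scores) - min(scores)  # about 0.12, the "12 points"
mean = statistics.mean(scores)      # about 0.782
sd = statistics.pstdev(scores)      # about 0.047

def sample_size_inflation(true_sd: float, grader_sd: float) -> float:
    """Factor by which the required sample size grows once grader noise is added.

    Required n is proportional to total variance, and independent
    variances add, so the factor is (true^2 + grader^2) / true^2.
    """
    return (true_sd**2 + grader_sd**2) / true_sd**2

# If grader noise is as large as the real signal, the experiment needs twice the samples.
factor = sample_size_inflation(true_sd=0.1, grader_sd=0.1)
```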

Failure Mode 2 — Cost That Scales With Prompt Size

Every LLM-as-judge evaluation is another LLM call. In production, the judge is usually a more expensive model than the system being judged — because the methodology only works when the judge is stronger than the generator.

Run that math at scale: 10K evaluations per month with a strong judge can produce an evaluation bill larger than the inference bill of the system being evaluated. The judge bill also scales with prompt size: if your RAG context grows from 4K tokens to 16K tokens, the input side of every judge call gets ~4× more expensive, even though the response being evaluated didn't change. We walked through the cost mechanics in detail in a separate post.
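
The arithmetic can be sketched as follows. All prices and token counts are made-up placeholders, not any provider's actual rates, and the function name is hypothetical.

```python
def judge_bill_usd(n_evals: int, context_tokens: int, response_tokens: int,
                   verdict_tokens: int, usd_per_m_input: float,
                   usd_per_m_output: float) -> float:
    """Monthly judge cost: every eval re-sends the full context plus the response."""
    input_tokens = context_tokens + response_tokens
    per_call = (input_tokens * usd_per_m_input
                + verdict_tokens * usd_per_m_output) / 1_000_000
    return n_evals * per_call

# Hypothetical: 10K evals, 300-token responses, 200-token verdicts.
small = judge_bill_usd(10_000, 4_000, 300, 200, usd_per_m_input=3.0, usd_per_m_output=15.0)
large = judge_bill_usd(10_000, 16_000, 300, 200, usd_per_m_input=3.0, usd_per_m_output=15.0)
ratio = large / small
```

With these placeholder numbers the ratio lands a little under 4, because the response and verdict tokens don't grow with the context; the larger the context relative to everything else, the closer it gets to 4×.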

Failure Mode 3 — Reliability Drops at the Tail

This one is the least discussed. LLM-as-judge graders quietly fail on long or complex inputs in ways that don't show up in headline benchmarks.

We ran a public-benchmark comparison this spring against DeepEval's FaithfulnessMetric. On RGB-style samples — the kind of synthetic-distractor inputs that are the bread and butter of hallucination evaluation — the metric failed outright on roughly 32% of cases. Not "got the wrong answer." Failed, in the sense of throwing an error or producing no score, because the claim-by-claim decomposition step ran the judge model out of output tokens.

This is the kind of reliability problem that doesn't show up when you cherry-pick a few examples for a demo. It shows up when you run the grader at production scale and discover a third of your evaluations didn't produce a usable score. Most teams have never run that test.

The Fluency Trap

We've been circling around the central concept. Here it is in full.

The Fluency Trap is the failure mode where a fluent, well-structured, semantically plausible response earns a high evaluation score from a grader that lacks the methodological discipline to detect that the response is factually wrong.

The clinical RAG example in the opening is a Fluency Trap case. The response was readable, well-cited (the citation looked real), grammatically clean, and topically on-target. Every surface signal said "good response." An LLM-as-judge grader, asked to evaluate faithfulness, read the same surface signals and came to the same conclusion.

The trap has three structural ingredients:

The system under test produces fluent output. This is what modern LLMs are best at. Even when they hallucinate, they hallucinate fluently.
The grader uses a methodology that is itself sensitive to fluency. LLM-as-judge is the worst offender here — judge models are trained to reward fluent, well-structured prose.
The grading signal is aggregated. A single fabrication in a six-claim response gets diluted to ~17% error rate. The aggregate score still looks acceptable.

The mitigation means breaking the ingredients you control. The system under test will keep producing fluent output, so that isn't the lever. The levers are the grader and the aggregation.

A grader that doesn't fall into the Fluency Trap has three properties:

It verifies claims logically, not by similarity or plausibility.
It emits per-claim verdicts, so a single fabrication is visible and can be flagged.
It is deterministic, so the same input produces the same verdict across runs — which means a flagged fabrication stays flagged.

Industry methodology converges on Natural Language Inference (NLI) as the verification primitive for property one. NLI models are trained on entailment data — millions of premise/hypothesis pairs labeled with whether the premise entails, contradicts, or is neutral with respect to the hypothesis. Run claim ↔ source span through an NLI model and you get an entailment probability that doesn't depend on whether the claim sounds plausible — it depends on whether the source logically supports the claim.

The benchmark data backs this up. On the public RGB benchmark (n=592), claim-level NLI-based grading flags 94% of injected hallucinations as ungrounded. LLM-as-judge grading on the same data flags 62%. The gap isn't a measurement nuance; it's the Fluency Trap operating across the board.

Property two — per-claim verdicts — is partially a UI question and partially a methodology one. A faithfulness score of 0.86 is uninteresting. A faithfulness score of 0.86 with a per-claim trail showing which claim failed, what source it failed against, and why (contradiction vs. unsupported vs. numeric mismatch) is auditable.

Property three — determinism — is the foundation. We've written about it in depth before; the short version is that a grader whose output moves between runs cannot be the basis for any production decision that has to hold up to scrutiny later.

The Six Categories of Hallucination

"Hallucination" as a single category is too coarse to be useful. In practice, what gets called a hallucination breaks down into at least six structurally different failure modes — and the methodology for catching them differs by category.

We worked out the six-category breakdown over the last year of production deployments. The categories aren't unique to us; the methodology literature converges on roughly the same set under different names (CheckList, FActScore, AlignScore, and the SummaC family all map to subsets of these).

The six categories of hallucination — grounding, faithfulness, attribution, coherence, safety, and semantic — each with its own primary methodology. Most production graders only score the first two.

A well-designed hallucination eval suite scores across all six. Most production graders only score the first two — grounding and faithfulness — which is why so many of them score the clinical RAG example as 0.96 faithful. The fabrication was in the attribution layer (a non-existent citation) and the safety layer (medical misinformation, even if technically aligned with guidelines) — neither of which the grader was looking at.

We expose all six categories as a single framework on our evaluation methodology page, with ~49 individual metrics distributed across them. The point isn't the number; it's that "one faithfulness score" is too low-resolution to drive a production decision.

Determinism: The Property Everything Else Depends On

We've mentioned determinism three times already. It deserves its own section, because the conversation about hallucination eval keeps drifting back to it.

A deterministic grader is one where the same input — same response, same source corpus, same configuration — produces the same output every time. No variance. The score doesn't drift between runs. The per-claim verdicts don't flip.

For LLM-as-judge on a hosted API, determinism is effectively out of reach. Even at temperature 0, with explicit seeds, on a pinned model version, you cannot guarantee identical outputs, because the serving stack introduces variance you don't control. We covered this with full citations in the deterministic evaluation post.

For algorithmic graders (NLI-based, with fixed-weight inference, fixed tokenization, and fixed claim decomposition), determinism is achievable but not automatic. Every step in the pipeline has to be deterministic. The decomposition step has to produce the same claim list. The retrieval step has to surface the same candidate spans. The NLI verification has to use a fixed model and a fixed inference configuration. Numeric extraction has to use the same regex precedence.

The reproducibility test is the same one we've published before:

Take 100 evaluations from your production traffic. Run them through your grader. Save the scores. Wait an hour. Run them again. Compute the per-sample delta.

For a deterministic grader, every delta is exactly 0.

For an LLM-as-judge grader, the median delta is small but nonzero, the long tail is sometimes large, and a non-trivial fraction of samples flip across decision thresholds (e.g., 0.49 → 0.51, which flips a pass/fail gate).
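
A minimal sketch of the comparison step, assuming you already have the two lists of scores in the same sample order. The function name and the 0.5 gate are illustrative.

```python
import statistics

def reproducibility_report(run_a: list[float], run_b: list[float],
                           gate: float = 0.5) -> dict:
    """Compare two runs of the same grader over the same samples."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same samples")
    deltas = [abs(a - b) for a, b in zip(run_a, run_b)]
    return {
        "max_delta": max(deltas),
        "median_delta": statistics.median(deltas),
        "nonzero_deltas": sum(d != 0 for d in deltas),
        # Samples that land on different sides of the pass/fail gate.
        "gate_flips": sum((a >= gate) != (b >= gate) for a, b in zip(run_a, run_b)),
    }

# Illustrative scores: the first sample crosses the gate between runs.
noisy = reproducibility_report([0.49, 0.86, 0.91], [0.51, 0.86, 0.88])
stable = reproducibility_report([0.49, 0.86, 0.91], [0.49, 0.86, 0.91])
```

For a deterministic grader the report should look like `stable`: zero nonzero deltas and zero gate flips.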

Same input, five runs: LLM-as-judge scores drift across identical runs (σ ≠ 0); a deterministic NLI grader returns the same score every time (σ = 0). The bottom row is what reproducible looks like.

This is the test most teams have never run. It takes an afternoon. It costs nothing. It tells you whether the downstream regression detection, A/B significance math, and audit trail your team is building on top of the grader actually rest on stable ground.

Why does this matter for hallucination eval specifically? Because hallucination decisions are binary in consequence. A response either gets shipped or it doesn't. A claim either gets flagged or it doesn't. If your grader emits a per-claim verdict that flips between "supported" and "unsupported" across runs on the same input, you cannot ship a production gate on it. You cannot audit it. You cannot debug a regression.

The cost of LLM-as-judge in dollars is high but knowable. The cost in lost reproducibility is the one that quietly compounds.

Per-Claim Verdicts vs. Aggregate Scores

We've referenced this distinction throughout the post. Here's why it's the practical difference between a hallucination eval you can ship a regulated product on and one you can't.

An aggregate score is a single number summarizing the response: "faithfulness = 0.86." It's useful for dashboards, trending, and gating ("block any response below 0.80"). It's useless for anything else.

A per-claim verdict is a structured artifact: a list of claims, each with a verdict (supported / contradicted / unsupported), the source span that drove the verdict, and a failure reason if it failed (contradiction vs. unsupported vs. numeric mismatch). It's useful for the dashboards too, but more importantly:

Per-claim audit trail: each claim has a verdict, a source reference, an entailment score, and a failure reason when applicable. Auditors drill from the aggregate score to the individual claim that caused the drop.

Debugging. When a response is flagged, you can see which claim failed and why. That's the difference between "fix the prompt" and "fix the retrieval."
Auditing. A compliance reviewer six months from now needs to defend a specific shipping decision. "We shipped because faithfulness was 0.91" is a weak defense. "We shipped because all seven claims were grounded, here are the seven source spans" is a strong one.
Iteration. False positives (claims flagged that shouldn't have been) and false negatives (hallucinated claims that slipped through) point to different gaps: the former usually to decomposition or thresholds, the latter usually to the grading methodology. Without per-claim verdicts, you can't tell them apart.
Threshold tuning. Different domains have different tolerances. Per-claim verdicts let you tune by failure type — "block any contradiction" is different from "block any unsupported claim" is different from "block any numeric mismatch."
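
Tuning by failure type can be as simple as a per-domain policy table. The sketch below is one way to express it; the domain names, reason strings, and policy contents are hypothetical examples rather than recommended settings.

```python
# Which failure reasons block a response, per domain (illustrative policy).
BLOCKING_REASONS = {
    "clinical": {"contradiction", "unsupported", "numeric_mismatch"},
    "support_chat": {"contradiction", "numeric_mismatch"},
    "default": {"contradiction"},
}

def should_block(failure_reasons: list[str], domain: str) -> bool:
    """Block if any per-claim failure reason is on the domain's blocking list."""
    blocking = BLOCKING_REASONS.get(domain, BLOCKING_REASONS["default"])
    return any(reason in blocking for reason in failure_reasons)

# The same single unsupported claim is tolerated in one domain and blocked in another.
block_clinical = should_block(["unsupported"], "clinical")
block_chat = should_block(["unsupported"], "support_chat")
```

An aggregate score cannot express this policy, because it has already discarded the failure reason.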

Most LLM-as-judge frameworks produce aggregate scores by default. Some can be coaxed into producing claim-level output, but the per-claim output suffers from the same fluency-trap, non-determinism, and reliability issues as the aggregate score — only now you're paying for them at higher resolution.

The economic argument for aggregate-only scoring is that per-claim verdicts are expensive to produce. For LLM-as-judge, this is true. For deterministic NLI-based graders, it isn't — per-claim verdicts are a natural byproduct of the methodology, not an extra cost.

Domain Matters: One Model Doesn't Work Everywhere

Generic NLI models work well on generic text. They underperform on domain-specific text — clinical, legal, financial — in predictable ways.

We hit this directly when we ran our pipeline on PubMedQA, a clinical-text benchmark. The general-purpose NLI model that performed strongly on RGB (and on most agent/RAG benchmarks) underperformed on clinical claims because clinical entailment requires domain knowledge the model doesn't have. "Plan A includes metformin" entails "the patient is on first-line therapy" only if you know metformin is a first-line treatment. A general-purpose NLI model trained on news and Wikipedia text doesn't reliably bridge that gap.

The fix is per-domain routing. The architecture looks like this:

Classify the incoming evaluation by domain (the query, the retrieved context, and the response together).
Route to a domain-specialized NLI model with domain-tuned thresholds.
Fall back to the default model when the classifier is uncertain.

Domain-aware routing: a zero-shot classifier picks the domain; the input is routed to a specialized NLI model with a domain-tuned threshold; the routing decision is logged on every evaluation for audit.

The classifier is a zero-shot text classifier that doesn't need per-customer training data. It picks the domain bucket; the bucket determines which model and threshold apply.

The classifier output and routing decision are exposed in the evaluation telemetry. This matters for two reasons: first, you can audit which model graded which evaluation; second, you can monitor the empirical domain distribution and add new specialized models when you see a domain getting routed to default frequently.
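
A sketch of the routing step, assuming the classifier returns a score per domain. The model names, thresholds, and confidence cutoff are placeholders, not our production values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str        # which NLI model grades this evaluation
    threshold: float  # domain-tuned entailment threshold

# Hypothetical routing table; adding a domain means adding an entry here.
ROUTES = {
    "clinical": Route(model="clinical-nli", threshold=0.60),
    "default": Route(model="general-nli", threshold=0.75),
}
MIN_CLASSIFIER_CONFIDENCE = 0.70

def route(domain_scores: dict[str, float]) -> dict:
    """Pick a route and return the full decision so it can be logged for audit."""
    domain, confidence = max(domain_scores.items(), key=lambda kv: kv[1])
    confident = confidence >= MIN_CLASSIFIER_CONFIDENCE
    routed_to = domain if confident and domain in ROUTES else "default"
    chosen = ROUTES[routed_to]
    return {
        "predicted_domain": domain,
        "confidence": confidence,
        "routed_to": routed_to,
        "model": chosen.model,
        "threshold": chosen.threshold,
    }

decision = route({"clinical": 0.91, "legal": 0.09})
```

Logging `predicted_domain` separately from `routed_to` is what makes the monitoring described above possible: a domain that is often predicted but always routed to default is a candidate for its own specialized model.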

For our production deployments today, clinical text routes to a domain-specialized model; everything else routes to the default. We have data for one specialized domain because we ran a public benchmark on it (PubMedQA) and tuned against it. Adding additional specialized domains — telecom, legal, finance — is a benchmark-and-tune exercise rather than a fundamental architecture change. The router accepts new domain entries without code changes.

Two honest caveats:

Per-domain tuning isn't optional. A clinical evaluation run through a general-purpose NLI model will underperform — sometimes badly. If you're shipping a regulated product, this is the kind of methodology gap that shows up in incident reports.
Per-domain models still don't solve every case. A clinical model trained on biomedical entailment data won't reliably catch a fabricated drug interaction that's not represented in the training distribution. Domain models reduce the error rate; they don't eliminate it.

The Public Benchmark Methodology

Most evaluation vendors describe their accuracy with curated examples and selective metrics. We took a different approach. The public Apache-2.0 variably-benchmark repo has the scripts, the raw per-sample outputs, and a reproducibility runner that lets anyone verify the math without an account.

Two layers of verification:

Verify the math from committed JSON outputs. Zero cost. No account. Run compare.py. The JSON outputs in {dataset}/results/ are the per-sample raw scores from every system; the comparator computes the headline numbers from them.
Re-run the scorers from scratch. Requires API keys — OpenAI for RAGAS, a free variA/Bly account for our scorer (because the algorithm is proprietary and runs on our hosted infra). Higher trust ceiling; same numbers.

The headline numbers from RGB (n=592, April 2026):

Hallucination catch rate: variA/Bly 94%, RAGAS 62%
False positive rate: variA/Bly 6.1%, RAGAS 38.2%

RGB benchmark (n=592): variA/Bly catches 94% of hallucinations with a 6.1% false positive rate. RAGAS catches 62% with a 38.2% false positive rate.

The interpretation: RAGAS approves 38.2% of injected distractors as grounded (in these numbers, a "false positive" is a hallucinated sample that passes as grounded). That's not a tuning issue; it's the Fluency Trap operating across the methodology. The distractors are designed to be fluent and plausible, the LLM-judge methodology is sensitive to fluency and plausibility, and so the false positive rate goes up.

The PubMedQA numbers tell a more nuanced story. On clinical text with the domain-routed pipeline:

Hallucination catch rate: variA/Bly 63.9%, RAGAS 27.2%, DeepEval 4.5–21.8%
Precision: variA/Bly 64.7% (best of the three)
False positive rate: variA/Bly 36.1% (best — i.e., lowest)
Recall: variA/Bly 66.1%

The recall number is the honest part. Our pipeline misses some clinical hallucinations the field doesn't yet have a good answer for — particularly multi-hop logical inferences that require domain knowledge across multiple source spans. We've documented this directly in the benchmark repo and our internal roadmap. It's the kind of gap a benchmark surfaces and a roadmap addresses, not a marketing fact to dress up.

The DeepEval FaithfulnessMetric failure rate of ~32% on RGB samples — the reliability finding we mentioned earlier — is the other artifact worth flagging. It's not in our headline numbers because it's a reliability failure rather than a correctness failure. When the metric fails entirely, it doesn't produce a wrong score; it produces no score. But in production, "no score" is the same as "ungated." A grader that fails on a third of inputs is not a grader.

Going From Zero to One: Building Your Hallucination Eval Suite

If you're starting from scratch — no grader, no benchmark, no dataset — here's the roadmap that works for most teams. Treat it as a sequence; each step depends on the previous one.

Step 1 — Hand-label 20 to 50 examples

This is the same advice Anthropic's eval guide gives, and it's right. The instinct most teams have is to generate large synthetic datasets and hope volume substitutes for quality. It doesn't. A small set of carefully labeled examples — each one with a clear annotation for which claim is grounded, which is contradicted, which is unsupported — is worth more than a thousand auto-generated samples.

The 20-to-50 range is empirical: below 20 you don't have enough variation to detect grader weaknesses; above 50 the marginal value of each example drops sharply. We've labeled new datasets at this scale dozens of times and the curve is consistent.

The set should include obvious positives (well-grounded responses), obvious negatives (clearly hallucinated responses), and — most importantly — the boundary cases. The boundary cases are where graders actually differ. Anyone can detect "Dr. Chen et al. 2019 study, fabricated." The grader differentiator is whether it catches "the recommended dose is 500mg" when the source says "the recommended dose is 250mg starting, titrated to 500mg." Numeric mismatch buried in otherwise-correct prose.

Step 2 — Pick a grader methodology, not a vendor

Before evaluating tools, pick the methodology you want. The choices:

LLM-as-judge. Fast to set up, expensive at scale, non-deterministic, sensitive to the Fluency Trap. Fine for prototyping and for evaluations where determinism doesn't matter (subjective creative judgment, vibes-checks). Wrong for production gates.
NLI-based deterministic. Slower to set up (someone has to load and serve NLI models, design decomposition, tune thresholds), but reproducible, cheaper per evaluation at scale, and Fluency-Trap-resistant. Right for production gates and audit-grade evaluation.
Hybrid (NLI + LLM-as-judge for edge cases). The right answer for some teams. Use deterministic NLI for the high-volume baseline; route ambiguous cases to an LLM judge for human-style adjudication. Adds complexity.

Most production teams that have done this thinking end up at NLI-based or hybrid. The teams that stay on LLM-as-judge for hallucination evaluation usually haven't run the reproducibility test yet.

Step 3 — Design the decomposition

Decomposition is the unsung hard part. Most graders get it wrong in one of two directions:

Over-decomposition. Splitting a single claim into multiple pseudo-claims that all share a subject. "Patients should monitor blood glucose every 3 months for the first year, then every 6 months after" gets split into three claims, two about timing and one about duration. The grounding check fails on the timing claims because the source phrases them differently, even though the meaning is correct. False positive: a correct claim gets flagged.
Under-decomposition. Treating a compound sentence with multiple verbs as a single claim. "The protocol recommends metformin and shows a 47% improvement in outcomes" is two distinct claims; the first might be grounded and the second fabricated. Treating it as one claim hides the fabrication under the average. False negative.

Industry-standard atomic decomposition handles patterns like coordinated clauses ("X is A and is B" → two claims) and shared-subject predicates ("Patients should limit X, increase Y, and consider Z" → three claims). Implementation can be LLM-based or rule-based; the deterministic path doesn't call an LLM.
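
As a deliberately naive sketch of the rule-based path, here is the shared-subject pattern handled with regular expressions. This is an illustration only: a real decomposer would work from a syntactic parse, and this version will wrongly split noun phrases that contain "and". The function name is hypothetical.

```python
import re

# Subject plus modal verb, then the coordinated predicates.
_HEAD = re.compile(r"^(?P<head>.+?\b(?:should|must|may|can))\s+(?P<rest>.+?)\.?$")
# Split on commas (with optional "and") or on a bare "and".
_SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")

def split_shared_subject(sentence: str) -> list[str]:
    """Expand 'S should A, B, and C' into ['S should A', 'S should B', 'S should C']."""
    sentence = sentence.strip()
    match = _HEAD.match(sentence)
    if not match:
        return [sentence]  # pattern does not apply; leave the sentence whole
    head = match.group("head")
    parts = _SEPARATOR.split(match.group("rest"))
    return [f"{head} {part.strip()}" for part in parts if part.strip()]

claims = split_shared_subject(
    "Patients should limit sodium, increase fluids, and consider exercise."
)
```

Because the function is pure string manipulation, the same input always yields the same claim list, which is the property the deterministic path depends on.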

Step 4 — Design the verdict logic

A claim is grounded only if all three of the following hold:

NLI entailment above threshold. Some source span entails the claim.
No source contradicts. No source span has a contradiction probability above threshold.
Numerics match. Any specific numbers in the claim (dosages, percentages, dates, currencies) appear in at least one source span.

All three are required because they catch different failure modes. Entailment alone misses contradictions buried in long source spans. Contradiction alone misses claims with no source at all. Numeric checks catch the highest-impact category of hallucination in domains where specific values matter — medicine, finance, law.
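
The three conditions can be sketched as a single decision function. The NLI probabilities are passed in as plain lists, one entry per source span, so the sketch stays independent of any particular model; the threshold defaults and the "numeric_mismatch" label are placeholders.

```python
import re

_NUMBER = re.compile(r"(?<![A-Za-z\d.])\d+(?:\.\d+)?")

def numbers(text: str) -> set[str]:
    """Standalone numeric values; digits inside words like 'HbA1c' are skipped."""
    return set(_NUMBER.findall(text))

def decide(claim: str, spans: list[str], entail: list[float], contra: list[float],
           entail_threshold: float = 0.7, contra_threshold: float = 0.5) -> str:
    """Apply the three checks in order and return a per-claim verdict.

    entail[i] and contra[i] are the NLI probabilities for spans[i].
    """
    if any(c >= contra_threshold for c in contra):
        return "contradicted"
    if not any(e >= entail_threshold for e in entail):
        return "unsupported"
    source_numbers = set().union(*(numbers(s) for s in spans))
    if not numbers(claim) <= source_numbers:
        return "numeric_mismatch"
    return "supported"

# High entailment score, wrong number: the numeric check is what catches it.
verdict = decide("HbA1c target is below 6.5%", ["HbA1c target is below 7%"],
                 entail=[0.92], contra=[0.10])
```

Note the limit of a set-membership check: "the dose is 500mg" passes against "250mg starting, titrated to 500mg" because 500 does appear in the source. Catching that case needs the number tied to its role in the sentence, not just its presence.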

The "numerics must match" check is the most under-implemented one. Most LLM-as-judge graders don't verify numerics explicitly; they let the judge model decide whether 500mg and 250mg are "close enough" — and the judge model, being trained to be helpful, often says yes.

Step 5 — Set thresholds per domain

Generic NLI thresholds work as a starting point but rarely survive a real domain. A threshold tuned on a general-purpose dataset will be too high for clinical text (where domain models have lower confidence on average) and too low for legal text (where domain models are more confident on irrelevant similarities).

Per-domain threshold tuning is a small calibration exercise: 50–100 labeled examples per domain, sweep the threshold, pick the point that maximizes F1 against the labels. The point isn't the maximum-F1 threshold itself; it's that you've calibrated against the empirical distribution of the model's outputs on the domain.
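
The sweep itself is a few lines. In this sketch `scores` are entailment scores from the domain model and `labels` mark whether a human judged the claim supported; the data is invented for illustration and far smaller than the 50 to 100 examples a real calibration needs.

```python
def f1_at(threshold: float, scores: list[float], labels: list[bool]) -> float:
    """F1 for the 'supported' class when predicting supported iff score >= threshold."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores: list[float], labels: list[bool]) -> float:
    """Try every observed score as a cutoff; ties resolve to the lowest cutoff."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(t, scores, labels))

# Illustrative labeled set: True means a human judged the claim supported.
scores = [0.95, 0.80, 0.62, 0.55, 0.40, 0.30]
labels = [True, True, True, False, True, False]

threshold = best_threshold(scores, labels)  # 0.40 on this toy data
```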

Step 6 — Wire it into CI and production

The grader has to run in two places.

In CI, against a fixed regression set, gated on a threshold. The regression set is the labeled examples from Step 1, expanded as new failures get found in production. The gate is a hard fail — pull requests that lower faithfulness on the regression set don't merge.

In production, against live traffic, with the per-claim verdicts logged. The aggregate score becomes a dashboard signal; the per-claim verdicts become an alert source ("any response with a contradiction-class failure goes to a queue for human review").

Most teams skip the CI step and run the grader only in production. That's a mistake — it means the grader catches regressions only after they're deployed, which is the wrong end of the pipeline.

Step 7 — Iterate on the failures, not the wins

Once the grader is running, the high-value work is studying its failures. False positives (supported claims the grader flagged) usually trace back to decomposition gaps or threshold over-tuning. False negatives (unsupported claims the grader passed) usually trace back to gaps in grader methodology, most often the Fluency Trap, and need methodology changes, not parameter changes.

The teams that learn the most from their eval suites are the ones that maintain a log of every false positive and false negative, with a hypothesis for each. After a few weeks, the patterns become obvious — "we keep missing numeric hallucinations buried in long responses" or "we keep flagging legitimate paraphrases as unsupported because the decomposition is too aggressive." Those patterns are the eval suite telling you where to invest.
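
The log does not need tooling beyond a list and a counter. A minimal sketch, assuming each entry carries a short hypothesis tag written by whoever triaged it (the `GraderMiss` type and its fields are hypothetical):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GraderMiss:
    response_id: str
    kind: str        # "false_positive" or "false_negative"
    hypothesis: str  # short, reusable tag for the suspected cause


def top_patterns(
    log: Iterable[GraderMiss], n: int = 3
) -> list[tuple[tuple[str, str], int]]:
    """Return the n most frequent (kind, hypothesis) pairs with their counts."""
    return Counter((miss.kind, miss.hypothesis) for miss in log).most_common(n)
```

Keeping the hypothesis tags short and reusable is what makes the counts meaningful; free-text notes can live in a separate field.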

The Honest Limits

Hallucination evaluation in 2026 is real but unfinished. Here are three limits that neither we nor the field have solved well.

Limit 1 — Omission Detection

Most graders, including ours, are precision-focused. They ask: is each emitted claim supported? They don't ask: was critical information from the source omitted?

Omission is a different shape of failure. A response that says "metformin is a first-line treatment" without mentioning the contraindication that's in the source isn't lying — it's leaving out something important. NLI-based grounding scores this as fully faithful. The clinical reviewer who reads the source would call it dangerous.

Detecting omission requires running the claim-extraction step in reverse: extract atomic facts from the source corpus, check which of them are reflected in the response, and flag the ones that should have been included but weren't. We have a coverage signal that flags ignored source chunks, but it isn't true semantic-omission detection. The methodology gap is industry-wide; nobody has shipped a great answer for it yet.
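
The shape of the reverse check is easy to write down, even though the hard parts are not. In the sketch below, `is_reflected` is a hypothetical stand-in for an NLI-style check, and the `critical` flag assumes someone or something has already decided which source facts matter. Both assumptions are exactly where the unsolved work lives.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SourceFact:
    text: str
    critical: bool  # assumed label: must this fact appear in a good response?


def omitted_critical_facts(
    facts: Sequence[SourceFact],
    response: str,
    is_reflected: Callable[[str, str], bool],  # hypothetical (fact, response) check
) -> list[str]:
    """Return critical source facts that the response does not reflect."""
    return [
        fact.text
        for fact in facts
        if fact.critical and not is_reflected(fact.text, response)
    ]
```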

Limit 2 — Multi-Hop Logical Reasoning

Some claims require combining facts from multiple source spans to verify. "Plan A includes VoLTE" and "VoLTE requires 4G" together imply "Plan A is at least 4G." Our pipeline doesn't chain these inferences today, and most graders don't either.

The structural reason is that the obvious way to add multi-hop reasoning — an LLM at the verification step — reintroduces the non-determinism that deterministic graders are built to avoid. The methodology fork is real: you can have multi-hop reasoning, or you can have determinism, but not both with current techniques. We've made the determinism trade-off explicitly. It's a roadmap item to revisit when better symbolic-reasoning techniques mature.

Limit 3 — Coverage Outside Tuned Domains

Domain-specialized NLI models help dramatically on the domains they're tuned for. Outside those domains, you're running with the default model, which has the failure modes of any general-purpose NLI model — including some domain-specific blind spots that show up only after deployment.

The mitigation is data-driven: monitor which domains get routed to the default classifier, look at the false-positive and false-negative patterns on those domains, and decide whether to add a specialized model. It's a benchmark-and-tune exercise per domain. Not magic; just work that has to be done case by case.
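
The monitoring half of that loop is a counting exercise. A minimal sketch, assuming the router logs a (domain, model name) pair per graded response and that the fallback model is identified by the name `"default"` (both assumptions for illustration):

```python
from collections import Counter
from typing import Iterable


def default_routing_share(
    routing_log: Iterable[tuple[str, str]], default_name: str = "default"
) -> dict[str, float]:
    """Per domain, the fraction of responses graded by the default model."""
    totals: Counter[str] = Counter()
    defaults: Counter[str] = Counter()
    for domain, model_name in routing_log:
        totals[domain] += 1
        if model_name == default_name:
            defaults[domain] += 1
    return {domain: defaults[domain] / totals[domain] for domain in totals}
```

Domains with both a high share and high traffic are the first candidates for a closer look at false-positive and false-negative patterns.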

These three limits aren't reasons not to ship a hallucination eval suite. They're reasons to be honest about what your suite catches and what it doesn't, and to write that into your audit-defensibility documentation rather than pretending the gaps don't exist.

Where Hallucination Eval Fits in the Larger Picture

A hallucination eval suite is one layer of a production AI quality stack. It complements rather than replaces:

Observability (Langfuse, LangSmith, Arize) tells you what your LLM is doing — traces, latency, token usage.
Hallucination evaluation (the topic of this post) tells you whether what your LLM produced is true.
Experimentation and decision tooling tells you which variant to ship given the evaluation results.

We covered the observability-vs-evaluation comparison in a previous post, and the decision-layer question is its own essay. The point here is that hallucination eval is necessary but not sufficient. A team that has a great hallucination grader and no observability will catch hallucinations but won't know what triggered them. A team with great observability and no hallucination grader will see every trace but won't know which traces are wrong. Pick both layers; pick the right tool for each.

Closing — What "Good" Looks Like

A team that has hallucination evaluation working well has the following properties.

They can answer, for any production response from the last 90 days, four questions: Which claims were grounded? Which weren't? Which sources supported each grounded claim? What was the failure reason for each ungrounded claim? The answers don't require a re-run; they're in the logged per-claim verdicts.

They can prove, to a regulator or an internal auditor, that the same input today produces the same score it produced when the response was generated. The score reproduces because the grader is deterministic and version-pinned.
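
One way to make that proof cheap is to store a fingerprint of everything the score depends on next to the score itself. The sketch below is an assumption about what such a record could contain (response text, source texts, a grader version string, and thresholds); the function name and field names are hypothetical.

```python
import hashlib
import json
from typing import Mapping, Sequence


def audit_fingerprint(
    response: str,
    sources: Sequence[str],
    grader_version: str,
    thresholds: Mapping[str, float],
) -> str:
    """SHA-256 over a canonical JSON encoding of the grader's inputs and config."""
    payload = json.dumps(
        {
            "response": response,
            "sources": list(sources),
            "grader_version": grader_version,
            "thresholds": dict(thresholds),
        },
        sort_keys=True,       # stable key order so equal inputs hash equally
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

At audit time, recomputing the fingerprint confirms the inputs and configuration are the ones originally graded; re-running the pinned grader should then reproduce the logged score. A matching fingerprint with a different score would point to non-determinism somewhere in the grader.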

They catch numeric hallucinations explicitly — dosages, percentages, dates — because the verdict logic includes a numeric-match step that's separate from the NLI step.

They run their grader in CI against a regression set and in production against live traffic, with different thresholds for different domains, and per-claim verdicts flowing into a review queue for the contradiction-class failures.

They are honest about what the grader doesn't catch — omission, multi-hop logical inference, claims requiring out-of-distribution domain knowledge — and they have a working hypothesis for how each gap evolves over time.

They don't conflate evaluation with decision-making. The grader tells them how good a response was. A separate experimentation layer tells them which variant to ship and how confident they are in that choice.

Hallucination evaluation is a discipline, not a feature. The teams that treat it as a discipline ship faster than the teams that treat it as a checkbox — because the failures they catch in CI don't make it to production, the regressions they detect don't get blamed on the model when the actual culprit is retrieval, and the audit trail they accumulate doesn't have to be reconstructed under deadline pressure when the regulator asks.

The most expensive mistake in production AI isn't a hallucination. It's a confident hallucination that your evaluation stack told you was safe to ship. Hallucination evaluation, done right, is the discipline of catching that mistake before it ships — every time, reproducibly, with a trail you can defend.