Your AI Isn't Hallucinating — Your Retrieval Is
A diabetes clinical agent scored 96% faithfulness on every standard hallucination metric — and was still completely wrong. The number nobody checks tells the real story.
We ran a clinical diabetes agent through deep grounding evaluation — 4 prompt variants, 5 dimensions, real patient queries.
One variant stood out: optimized_1773485979.
- Faithfulness: 96%
- Hallucination Rate: 4%
By any standard metric, this is a well-behaved LLM. It's sticking to its sources. It's not making things up. Nearly every claim it makes can be traced back to a retrieved document.
Ship it, right?
Not so fast.
The number nobody checks
Same variant. Same evaluation. One more dimension:
- Retrieval Relevance: 2%
Read that again. The documents the retrieval pipeline pulled were almost entirely irrelevant to the user's actual query.
The LLM didn't hallucinate. It faithfully summarized the wrong documents.
How both can be true at the same time
This confuses people at first, so let's be precise:
- Hallucination Rate (4%) measures whether the LLM's response is faithful to the sources it was given. It was. The model stuck to what the documents said.
- Retrieval Relevance (2%) measures whether those source documents were actually relevant to what the user asked. They weren't.
Both numbers are real. Both are correct. And together, they tell a story that neither one tells alone:
The generation was faithful. The retrieval was broken.
This is a retrieval pipeline problem, not a generation problem. But if you're only measuring hallucination, you'd never know.
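To make the distinction concrete, here is a minimal sketch of how the two numbers can be computed from separate labels. It is an illustration under stated assumptions, not the scoring method used in this evaluation: the `Claim` and `RetrievedDoc` types, the ratio definitions, and the toy labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported_by_context: bool  # is the claim backed by a retrieved document?


@dataclass
class RetrievedDoc:
    doc_id: str
    relevant_to_query: bool  # does the document address the user's question?


def faithfulness(claims: list[Claim]) -> float:
    """Share of response claims supported by the retrieved context (assumed definition)."""
    if not claims:
        return 0.0
    return sum(c.supported_by_context for c in claims) / len(claims)


def hallucination_rate(claims: list[Claim]) -> float:
    """Share of response claims NOT supported by the retrieved context."""
    if not claims:
        return 0.0
    return 1.0 - faithfulness(claims)


def retrieval_relevance(docs: list[RetrievedDoc]) -> float:
    """Share of retrieved documents that are relevant to the query (assumed definition)."""
    if not docs:
        return 0.0
    return sum(d.relevant_to_query for d in docs) / len(docs)


# Invented toy case: the user asks about insulin dosing, the retriever returns
# glucose-monitoring documents, and the model summarizes them faithfully.
claims = [
    Claim("monitoring claim 1", supported_by_context=True),
    Claim("monitoring claim 2", supported_by_context=True),
    Claim("monitoring claim 3", supported_by_context=True),
]
docs = [
    RetrievedDoc("glucose-monitoring-01", relevant_to_query=False),
    RetrievedDoc("glucose-monitoring-02", relevant_to_query=False),
]

print(faithfulness(claims))        # 1.0
print(hallucination_rate(claims))  # 0.0
print(retrieval_relevance(docs))   # 0.0
```

The two functions read different labels, so nothing forces them to agree: a perfect faithfulness score says nothing about whether the documents were the right ones.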
What the full picture looks like
Here's the grounding breakdown across all 4 variants:
| Dimension | Optimized | Concise Bullet Points | Algorithmic Decision Support | Empathetic Clinical Guide |
|---|---|---|---|---|
| Faithfulness | 96% | 80% | 43% | 79% |
| Hallucination Rate | 4% | 20% | 57% | 21% |
| Attribution Accuracy | 61% | 46% | 48% | 46% |
| Context Utilization | 60% | 32% | 24% | 90% |
| Retrieval Relevance | 2% | 8% | 7% | 2% |
Every variant has a different failure mode:
- Optimized — faithful to irrelevant sources (retrieval problem)
- Algorithmic Decision Support — 57% hallucination AND 7% retrieval relevance (both generation and retrieval are broken)
- Empathetic Clinical Guide — uses 90% of its context but retrieved the wrong context (high utilization of bad documents)
- Concise Bullet Points — most balanced, but still only 8% retrieval relevance
Not a single variant has good retrieval relevance. That tells you the problem isn't the prompt — it's the pipeline upstream.
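The same reading of the table can be written as a small triage script. This is a sketch only: the scores are copied from the table above, but the two cut-offs (`MIN_FAITHFULNESS`, `MIN_RETRIEVAL_RELEVANCE`) and the `diagnose` helper are hypothetical choices made for illustration, not thresholds from the evaluation.

```python
# Scores copied from the table above, in percent.
SCORES = {
    "Optimized": {"faithfulness": 96, "retrieval_relevance": 2},
    "Concise Bullet Points": {"faithfulness": 80, "retrieval_relevance": 8},
    "Algorithmic Decision Support": {"faithfulness": 43, "retrieval_relevance": 7},
    "Empathetic Clinical Guide": {"faithfulness": 79, "retrieval_relevance": 2},
}

# Hypothetical cut-offs; pick values that suit your own risk tolerance.
MIN_FAITHFULNESS = 75
MIN_RETRIEVAL_RELEVANCE = 50


def diagnose(scores: dict[str, int]) -> str:
    """Label a variant by which stage of the pipeline falls below its cut-off."""
    generation_ok = scores["faithfulness"] >= MIN_FAITHFULNESS
    retrieval_ok = scores["retrieval_relevance"] >= MIN_RETRIEVAL_RELEVANCE
    if generation_ok and retrieval_ok:
        return "healthy"
    if generation_ok:
        return "retrieval problem"
    if retrieval_ok:
        return "generation problem"
    return "retrieval and generation problems"


diagnoses = {name: diagnose(s) for name, s in SCORES.items()}

# If retrieval is below the cut-off for every prompt variant, changing the
# prompt is unlikely to help: the shared upstream pipeline is the suspect.
pipeline_suspect = all(
    s["retrieval_relevance"] < MIN_RETRIEVAL_RELEVANCE for s in SCORES.values()
)

for name, label in diagnoses.items():
    print(f"{name}: {label}")
print("Retrieval low across all variants:", pipeline_suspect)
```

With these cut-offs, three variants are flagged for retrieval only, Algorithmic Decision Support is flagged for both stages, and the cross-variant check comes back true.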
Why this matters in healthcare
In a clinical setting, this failure mode is silent and dangerous.
A diabetes patient asks about insulin dose adjustments. The retrieval pipeline pulls documents about glucose monitoring protocols. The LLM faithfully summarizes those documents. The response is well-written, medically accurate for what it's talking about, and contains zero hallucinated claims.
But it didn't answer the question.
A hallucination detector won't catch this, and neither will an LLM-as-judge that only checks groundedness. The output is grounded, coherent, and confident. It's just grounded in the wrong context.
The evaluation gap most teams have
Most evaluation setups measure one of two things:
- Is the output good? (quality, fluency, helpfulness)
- Is the output faithful? (hallucination, groundedness)
Almost nobody measures:
- Were the right documents retrieved in the first place?
This is the gap between "the LLM is working" and "the system is working." Your LLM can be perfect and your system can still be broken.
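One way to close that gap is to score the retriever by itself, before any LLM is involved. The sketch below uses precision@k against a small hand-labeled set. It is an assumption that this resembles the retrieval relevance score reported here, and the gold set, document IDs, and `toy_retrieve` function are invented for illustration.

```python
from typing import Callable


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(doc_id in relevant_ids for doc_id in top_k) / len(top_k)


def evaluate_retriever(
    retrieve: Callable[[str], list[str]],
    gold: dict[str, set[str]],
    k: int = 3,
) -> float:
    """Mean precision@k over a labeled set of query -> relevant document IDs."""
    scores = [precision_at_k(retrieve(query), relevant, k) for query, relevant in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical gold labels: which documents SHOULD come back for each query.
GOLD = {
    "How should I adjust my insulin dose?": {"insulin-dosing-01", "insulin-dosing-02"},
    "How often should I check my glucose?": {"glucose-monitoring-01"},
}


def toy_retrieve(query: str) -> list[str]:
    """Stand-in for a real retriever; it ignores the query on purpose."""
    return ["glucose-monitoring-01", "glucose-monitoring-02", "glucose-monitoring-03"]


mean_precision = evaluate_retriever(toy_retrieve, GOLD, k=3)
print(f"mean precision@3: {mean_precision:.2f}")  # mean precision@3: 0.17
```

Because this check never calls the model, a low score here points at chunking, embedding, or retrieval logic regardless of how faithful the generated answers are.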
What we learned
Three takeaways from this evaluation:
1. Hallucination rate alone is misleading. A 4% hallucination rate sounds great until you realize the model is faithfully citing irrelevant sources. Low hallucination + low retrieval relevance = a system that's confidently wrong.
2. You need to evaluate the pipeline, not just the model. Generation quality and retrieval quality are independent dimensions. Measuring one without the other gives you half the picture.
3. The prompt isn't always the problem. We tested 4 different prompt strategies. All had low retrieval relevance. That means the issue is upstream — in chunking, embedding, or retrieval logic — not in how you're asking the LLM to respond.
The uncomfortable question
If you're running a RAG system in production right now, ask yourself:
Do you know your retrieval relevance score?
Not your hallucination rate. Not your RAGAS score. Not whether the output "looks right."
Do you know whether the documents your pipeline retrieves are actually relevant to what your users are asking?
If you don't, you might have a system that scores beautifully on faithfulness — and is still answering the wrong question.
This evaluation was run using variA/Bly's claim-level grounding analysis, which independently scores faithfulness, hallucination, attribution accuracy, context utilization, and retrieval relevance across prompt variants. If you're building RAG systems and want to see what your grounding scores actually look like, reach out.