Your RAG Evaluation Is Blind. Here's What It's Missing.
Standard evaluation scores your RAG responses without checking the retrieved context. Hallucinations pass. Fabricated citations go undetected. Quality drifts silently.
RAG eval needs claim-level grounding. Not response-level scoring.
General LLM evaluation grades the whole response. That works for fluency, safety, and coherence. It fails for RAG because a single response can mix claims grounded in the retrieved chunks with hallucinated ones, and the averaged score hides the problem.
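A small arithmetic sketch of the averaging problem. The scores, the 0.7 pass mark, and the 0.5 per-claim cutoff are made-up illustrative values, not measurements or product defaults.

```python
from statistics import mean

# Hypothetical grounding scores for the four claims in one response.
# Illustrative values only: three well-supported claims, one hallucinated.
claim_scores = [0.97, 0.95, 0.96, 0.10]

PASS_MARK = 0.7      # assumed response-level pass threshold
CLAIM_CUTOFF = 0.5   # assumed per-claim grounding threshold

# Response-level view: one averaged number.
response_score = mean(claim_scores)
response_passes = response_score >= PASS_MARK

# Claim-level view: indices of claims that fall below the cutoff.
ungrounded = [i for i, score in enumerate(claim_scores) if score < CLAIM_CUTOFF]

print(f"response score: {response_score:.3f}, passes: {response_passes}")
print(f"ungrounded claim indices: {ungrounded}")
```

The averaged score clears the pass mark even though one claim in four has almost no support. Only the per-claim view shows it.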
variA/Bly extracts every factual claim, matches each one against the actual retrieved chunks using natural language inference (NLI) plus numeric verification, then surfaces per-claim grounding scores. The audit trail your compliance reviewer needs is a side effect of how the scoring works.
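To make the idea concrete, here is a minimal sketch of claim-level grounding. It is not the product's implementation: every name in it (`split_claims`, `overlap_support`, `numbers_match`, `ground`, `ClaimResult`) is hypothetical, sentence splitting stands in for real claim extraction, and plain token overlap stands in for an NLI model. The 0.6 support threshold is an arbitrary assumption.

```python
import re
from dataclasses import dataclass

NUMBER = re.compile(r"\d+(?:\.\d+)?")
TOKEN = re.compile(r"[a-z]+|\d+(?:\.\d+)?")
SUPPORT_THRESHOLD = 0.6  # assumed cutoff, chosen only for this illustration


@dataclass
class ClaimResult:
    claim: str
    best_chunk: int      # index of the best-matching chunk, -1 if none
    support: float       # overlap score in [0, 1]; a stand-in for an NLI entailment score
    numbers_ok: bool     # True if every number in the claim appears in that chunk

    @property
    def grounded(self) -> bool:
        return self.support >= SUPPORT_THRESHOLD and self.numbers_ok


def split_claims(response: str) -> list[str]:
    # Naive assumption: one sentence is one claim.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def overlap_support(claim: str, chunk: str) -> float:
    # Fraction of the claim's tokens that also occur in the chunk.
    claim_tokens = set(TOKEN.findall(claim.lower()))
    chunk_tokens = set(TOKEN.findall(chunk.lower()))
    return len(claim_tokens & chunk_tokens) / len(claim_tokens) if claim_tokens else 0.0


def numbers_match(claim: str, chunk: str) -> bool:
    # Numeric verification: each number stated in the claim must appear in the chunk.
    return set(NUMBER.findall(claim)) <= set(NUMBER.findall(chunk))


def ground(response: str, chunks: list[str]) -> list[ClaimResult]:
    results = []
    for claim in split_claims(response):
        scores = [overlap_support(claim, chunk) for chunk in chunks]
        best = max(range(len(chunks)), key=scores.__getitem__) if chunks else -1
        support = scores[best] if best >= 0 else 0.0
        numbers_ok = numbers_match(claim, chunks[best]) if best >= 0 else False
        results.append(ClaimResult(claim, best, support, numbers_ok))
    return results


# Made-up example data.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Shipping takes 5 business days.",
]
response = "Refunds are available within 30 days. Shipping takes 2 business days."
results = ground(response, chunks)
for r in results:
    print(f"grounded={r.grounded} chunk={r.best_chunk} support={r.support:.2f} | {r.claim}")
```

In this toy run the second claim overlaps heavily with the shipping chunk, so a similarity score alone would accept it. The numeric check is what flags the mismatched day count. Each result carries the claim, the chunk it was matched to, and both checks, which is the kind of per-claim record an audit trail can be built from.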
Four layers your evaluation never sees
Retrieved Context
The chunks your retriever returned. Were they relevant? Were they used? Were claims grounded in them?
Source Documents
The original documents behind the chunks. Does the AI cite them accurately or fabricate references?
Conversation History
Previous turns in the conversation. Does turn 5 contradict turn 2? Does the AI retain context?
Claim Grounding
Every factual claim mapped to its source. Which claims are supported? Which are hallucinated?
Ready to see what your RAG evaluation is missing?
Stop trusting blind scores. Start verifying every claim against its source.
1,000 free credits.