Your RAG Evaluation Is Blind. Here's What It's Missing.

Standard evaluation scores your RAG responses without checking the retrieved context. Hallucinations pass. Fabricated citations go undetected. Quality drifts silently.

why RAG is harder

RAG eval needs claim-level grounding. Not response-level scoring.

General LLM evaluation grades the whole response. That works for fluency, safety, and coherence. It fails for RAG because a single response can mix claims grounded in the retrieved chunks with hallucinated ones, and the averaged score hides the problem.

variA/Bly extracts every factual claim, matches each one against the actual retrieved chunks using natural language inference (NLI) plus numeric verification, then surfaces per-claim grounding scores. The audit trail your compliance reviewer needs is a side effect of how the scoring works.
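To make the idea concrete, here is a minimal sketch of claim-level grounding in Python. It is an illustration under stated assumptions, not variA/Bly's implementation: sentence splitting stands in for claim extraction, a word-overlap ratio stands in for an NLI entailment score, and every name here (`split_claims`, `overlap_score`, `numbers_match`, `ground_claims`, the `0.6` threshold) is hypothetical.

```python
import re
from dataclasses import dataclass

NUMBER = re.compile(r"\d+(?:\.\d+)?")
WORD = re.compile(r"[a-z]+")


@dataclass
class ClaimResult:
    claim: str
    best_chunk: int   # index of the best-matching chunk, -1 if none matched
    score: float      # grounding score in [0, 1]
    grounded: bool


def split_claims(response: str) -> list[str]:
    # Assumption: one claim per sentence. Real claim extraction is harder.
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


def overlap_score(claim: str, chunk: str) -> float:
    # Stand-in for an NLI entailment score (assumption):
    # the fraction of the claim's words that also appear in the chunk.
    claim_words = set(WORD.findall(claim.lower()))
    if not claim_words:
        return 0.0
    chunk_words = set(WORD.findall(chunk.lower()))
    return len(claim_words & chunk_words) / len(claim_words)


def numbers_match(claim: str, chunk: str) -> bool:
    # Numeric verification: every number in the claim must appear in the chunk.
    return set(NUMBER.findall(claim)) <= set(NUMBER.findall(chunk))


def ground_claims(response: str, chunks: list[str], threshold: float = 0.6) -> list[ClaimResult]:
    results = []
    for claim in split_claims(response):
        best_idx, best = -1, 0.0
        for i, chunk in enumerate(chunks):
            # A numeric mismatch zeroes the score, however similar the wording.
            score = overlap_score(claim, chunk) if numbers_match(claim, chunk) else 0.0
            if score > best:
                best_idx, best = i, score
        results.append(ClaimResult(claim, best_idx, best, best >= threshold))
    return results


chunks = [
    "The refund window is 30 days from delivery.",
    "Shipping is free for orders over 50 dollars.",
]
response = "The refund window is 30 days. Shipping is free for orders over 20 dollars."

results = ground_claims(response, chunks)
for r in results:
    print(f"grounded={r.grounded} score={r.score:.2f} chunk={r.best_chunk} :: {r.claim}")
print("mean score:", sum(r.score for r in results) / len(results))
```

In this toy example the first claim is fully grounded and the second is rejected because its number does not appear in any chunk. The mean score of 0.5 hides that one claim is unsupported, which is exactly what per-claim results expose.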

missing layers

Four layers your evaluation never sees

Retrieved Context

The chunks your retriever returned. Were they relevant? Were they used? Were claims grounded in them?

Source Documents

The original documents behind the chunks. Does the AI cite them accurately or fabricate references?

Conversation History

Previous turns in the conversation. Does turn 5 contradict turn 2? Does the AI retain context?

Claim Grounding

Every factual claim mapped to its source. Which claims are supported? Which are hallucinated?
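The Source Documents check above, whether a reference is real or fabricated, can be sketched in a few lines. This is a rough illustration only: it assumes citations appear inline as `[doc-id]` and that the set of retrieved document IDs is available; the function name `check_citations` and the ID format are hypothetical.

```python
import re

# Assumption: citations are written inline as [doc-id].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(response: str, retrieved_ids: set[str]) -> dict[str, bool]:
    # Map each cited ID to True if it was actually retrieved, False if not.
    cited = (doc_id.strip() for doc_id in CITATION.findall(response))
    return {doc_id: doc_id in retrieved_ids for doc_id in cited}


response = "Refunds take 30 days [policy-v2]. Gift cards never expire [faq-9]."
report = check_citations(response, {"policy-v2", "shipping-v1"})
for doc_id, found in report.items():
    print(doc_id, "retrieved" if found else "NOT RETRIEVED (possibly fabricated)")
```

A cited ID that was never retrieved is only a signal, not proof of fabrication. A fuller check would also verify that the cited document supports the claim attached to it.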

Ready to see what your RAG evaluation is missing?

Stop trusting blind scores. Start verifying every claim against its source.

Get free credits

Run your first evaluation

1,000 free credits.