RGB Benchmark Whitepaper · April 30, 2026

Standard LLM-as-judge eval approves 38% of ungrounded AI answers as “grounded.”
We measured the gap, and held it to 6.1%.

We ran 592 RAG evaluations (296 responses, each scored against a supporting and a distractor reference set) through deterministic grounding scoring alongside the industry-standard LLM-as-judge approach (RAGAS). The gap is large enough to determine whether a clinical, financial, or compliance-bound AI system passes audit. Methodology, dataset, and reproduction code are all open below.

If you're using RAGAS or LLM-as-judge based evaluators in production today, 38% of your hallucinations are slipping past your scorer.

More than 1 in 3 hallucinated AI responses is approved by your current evaluator and reaches your customers.

In a production RAG system that scores 10,000 ungrounded responses per day, that's the difference between ~610 hallucinations reaching your customers (with variA/Bly) and ~3,820 (with RAGAS) - every day.
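The arithmetic behind those two daily figures can be checked in a few lines. This is a minimal sketch that assumes all 10,000 responses reaching the scorer are ungrounded, so the false positive rate applies to every one of them; the function name is illustrative and not part of any published tooling.

```python
def approved_ungrounded(ungrounded_per_day: int, false_positive_rate: float) -> int:
    """Expected number of ungrounded responses a scorer wrongly approves per day."""
    return round(ungrounded_per_day * false_positive_rate)


# Assumption: every response scored that day is ungrounded.
daily_ungrounded = 10_000

variably_missed = approved_ungrounded(daily_ungrounded, 0.061)  # FPR reported for variA/Bly
ragas_missed = approved_ungrounded(daily_ungrounded, 0.382)     # FPR reported for RAGAS

print(variably_missed, ragas_missed)
```

If only a fraction of daily traffic is ungrounded, scale `daily_ungrounded` down accordingly; the ratio between the two scorers stays the same.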

What this means

RAGAS misses 38 out of every 100 AI hallucinations.
variA/Bly catches 94 out of every 100.

A “distractor passage” is a source the AI cites that does not actually support the claim it just made. When that happens, the AI is hallucinating - and a grounding scorer's job is to flag it before the response ships. That's exactly the failure mode this benchmark measured, on 592 cases.

RAGAS

38 of 100 hallucinations
slip through silently

Run RAGAS on 1,000 hallucinated AI responses and it tells you ~380 are fine when they aren't. Each of those is a potentially wrong answer auto-shipped to your customer.

variA/Bly

94 of 100 hallucinations
caught and surfaced

On the same 1,000 responses, variA/Bly flags ~940. The hallucinations don't reach your customer; they reach your review queue with a per-claim audit trail.

That's over 6× cleaner hallucination detection (6.1% vs 38.2% wrongly approved) - same 592 cases, same data, two scorers. For regulated AI (healthcare, finance, legal, insurance) that gap is the difference between a deployable system and a compliance liability.

A real disagreement from the benchmark

Look at one example side-by-side.

Sample rgb_id=12 from the public RGB dataset. This is a real evaluation from the run, not a hand-picked illustration.

Query

How much is Microsoft acquiring Activision Blizzard for?

AI response

Microsoft is acquiring Activision Blizzard for $68.7 billion, which equates to $95.00 per share.

Reference (one of five retrieved)

“When the news of the acquisition first broke, in January 2022, Reuters ran the headline ‘Microsoft to gobble up Activision in a $69 billion metaverse bet’…”

How each scorer rated this response:

RAGAS: ✓ Grounded (score 1.00)

The LLM judge decided the claims were close enough to what the references say. The verdict is a gestalt impression with no per-claim trace.

variA/Bly: ✗ Not grounded (score 0.00)

The dollar figure in the response ($68.7 billion) does not appear in any reference. Closest match in the sources is $69 billion. The numeric mismatch is the failure.

Why variA/Bly said no - the full audit trail

Claim under test: “…acquiring Activision Blizzard for $68.7 billion”
NLI entailment: 0.18 (low)
NLI contradiction: 0.04
Numeric mismatch: $68.7B not in refs; closest $69B
Verdict: Not grounded - numeric verification failed

RAGAS gives you a score. variA/Bly gives you a courtroom-grade trace of why. Every claim variA/Bly flags comes with the reference excerpt, the entailment score, the contradiction score, and the specific numeric or factual mismatch - reproducible byte-for-byte on rerun. That's what makes the verdict auditable.
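To make the "numeric mismatch" row of that trail concrete, here is a minimal sketch of one way a deterministic numeric check could work. It is an illustration only, not variA/Bly's hosted algorithm: the regex, the helper names (`extract_amounts`, `check_numbers`), and the exact-match rule are all assumptions made for this example.

```python
import math
import re

# Dollar figures such as "$68.7 billion" or "$95.00".
# Simplification: thousands separators ("$1,000") are not handled.
MONEY = re.compile(r"\$\s?(\d+(?:\.\d+)?)\s*(billion|million|thousand)?", re.IGNORECASE)
SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def extract_amounts(text: str) -> list[float]:
    """Return every dollar amount found in the text, converted to plain dollars."""
    return [float(num) * SCALE.get(unit.lower(), 1.0) for num, unit in MONEY.findall(text)]


def check_numbers(claim: str, references: list[str]) -> list[dict]:
    """For each amount in the claim, record whether any reference states the same amount."""
    ref_amounts = [a for ref in references for a in extract_amounts(ref)]
    report = []
    for amount in extract_amounts(claim):
        matched = any(math.isclose(amount, ref) for ref in ref_amounts)
        closest = min(ref_amounts, key=lambda ref: abs(ref - amount), default=None)
        report.append({"amount": amount, "matched": matched, "closest": closest})
    return report


claim = "Microsoft is acquiring Activision Blizzard for $68.7 billion"
reference = "Microsoft to gobble up Activision in a $69 billion metaverse bet"
report = check_numbers(claim, [reference])
print(report)
```

On the rgb_id=12 example, this sketch reports the $68.7 billion figure as unmatched, with $69 billion as the closest reference amount. Because it is pure string and arithmetic logic, the same input always produces the same report.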

The number nobody else publishes

False positive rate on distractor passages

How often each scorer wrongly approves a passage that doesn't support the AI's claim. Lower is better. This is the failure mode that produces compliance breaches - every false positive is a wrong response auto-passed to the customer.

variA/Bly: 6.1%

RAGAS: 38.2%

RAGAS publishes AUC. We publish AUC and FPR - and on FPR, variA/Bly is over 6× lower across the same 592-evaluation run.

Why we publish FPR (and nobody else does)

AUC tells you “is the scorer good?” FPR tells you “is this verdict trustworthy?”

Every benchmark vendor publishes AUC, including us (variA/Bly 0.748, RAGAS 0.864; RAGAS is ahead on this metric). But AUC measures ranking quality across every possible threshold. In a deployed system, you pick one threshold and live with the precision and FPR it produces.
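The distinction can be seen on a toy example. AUC is the probability that a randomly chosen supported case scores higher than a randomly chosen distractor case, which is why it does not depend on any threshold. The sketch below computes it pairwise; the scores are made up for illustration and are not benchmark data.

```python
def auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores, chosen only to illustrate the point.
supported = [0.9, 0.8, 0.4]    # cases whose references back the response
distractor = [0.7, 0.3, 0.2]   # cases whose references do not

toy_auc = auc(supported, distractor)
# At a deployed threshold of 0.5, one of the three distractor cases (0.7) is still approved.
toy_fpr = sum(score >= 0.5 for score in distractor) / len(distractor)
print(round(toy_auc, 3), round(toy_fpr, 3))
```

Here the ranking is nearly perfect (8 of 9 pairs ordered correctly), yet a third of the distractor cases still pass at the chosen threshold. A strong AUC and a high FPR can coexist.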

For regulated, customer-facing AI, the number that decides whether a scorer is shippable is its false-positive rate at the threshold you actually deploy with - how often it wrongly approves a passage that doesn't support the claim. Every wrong approval is a potential compliance breach that lands on the audit trail your regulator reads.

That single number is what compliance teams audit against. RAGAS doesn't publish it. variA/Bly does. And the gap - 6.1% vs 38.2% on RGB - is the variA/Bly product in one chart.

variA/Bly leads on

When the scorer says “grounded” - is it actually right? 89.9% (variA/Bly) vs 71.8% (RAGAS)
What fraction of hallucinations ship to your customers? 6.1% (variA/Bly) vs 38.2% (RAGAS)
Same input twice - does the verdict change? No (deterministic)
Can you show an auditor why a verdict was made? Yes (per-claim trail)
How much does scoring add to your SDK roundtrip? <1 ms
Will your scoring bill be the same next month? Yes (no judge-model surcharge)

The metrics compliance teams audit against. variA/Bly's strict entailment + numeric verification + contradiction signal is what holds the FPR at 6.1% - by design.

How to read this

Built for opposite kinds of AI workflow.

variA/Bly - precision-first, by design.

When variA/Bly says “grounded”, it's right 9 out of 10 times - and rejects 94% of distractor passages. Deterministic, audit-trail-ready, sub-1ms SDK integration, predictable per-eval price. The right tool for regulated, customer-facing AI: healthcare, finance, legal, insurance, compliance-led workflows where being right when you say yes is non-negotiable.

RAGAS - recall-first, LLM-as-judge.

Catches almost every real grounding (97% recall) - and approves 38.2% of distractor passages along the way. A reasonable fit for low-stakes search, exploration, and summarisation, where the user is in the loop and a wrong “yes” verdict is a redundant click rather than a compliance breach.
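The RAGAS figures quoted on this page can be cross-checked against each other. On a balanced set (296 positive and 296 distractor cases, per the Methodology section), precision follows directly from recall and FPR. The sketch below is a sanity check using only numbers published here; the function name is illustrative.

```python
def precision_from_rates(recall: float, fpr: float, n_pos: int, n_neg: int) -> float:
    """Precision implied by recall and false positive rate for given class sizes."""
    true_positives = recall * n_pos
    false_positives = fpr * n_neg
    return true_positives / (true_positives + false_positives)


# RAGAS: 97% recall and 38.2% FPR, as reported on this page.
ragas_precision = precision_from_rates(recall=0.97, fpr=0.382, n_pos=296, n_neg=296)
print(f"{ragas_precision:.3f}")  # 0.717
```

The result, about 71.7%, is within rounding of the 71.8% precision reported for RAGAS, given that recall is quoted only to the nearest percent.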

Methodology

Same data into both scorers. Reproducible scripts.

Dataset: RGB (Retrieval-augmented Generation Benchmark, Chen et al. 2023) - public, labeled, cited by competitors. github.com/chen700564/RGB
Sample count: 296 RGB samples × 2 test cases each = 592 evaluations per scorer. Each sample produces a positive case (response paired with reference passages that contain the answer) and a negative case (paired with distractor passages that don't).
Realistic responses: generated with gpt-4o-mini from the positive chunks, then used unchanged in both positive and negative cases. The same response sees the two reference sets - only the references swap. This isolates scorer behaviour from response quality.
RAGAS judge: gpt-4o-mini with the default RAGAS faithfulness pipeline. Measured cost across the run: ~$17.76 (~$0.030 per evaluation).
Threshold: 0.5 for the precision / recall / FPR / FNR numbers. AUC is threshold-independent.
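The paired-case design and the 0.5 threshold translate into a small amount of bookkeeping. The following is a minimal sketch of how precision, recall, FPR, and FNR could be computed from per-case scores. The `Case` structure, its field names, and the choice of `>=` at the threshold are assumptions for illustration; they are not taken from compare.py or from the JSON schema in the benchmark repo.

```python
from dataclasses import dataclass


@dataclass
class Case:
    score: float      # scorer output in [0, 1]
    supported: bool   # True for the positive case, False for the distractor case


def confusion_metrics(cases: list[Case], threshold: float = 0.5) -> dict[str, float]:
    """Precision, recall, FPR, and FNR when scores at or above the threshold count as grounded."""
    tp = fp = tn = fn = 0
    for case in cases:
        approved = case.score >= threshold  # assumption: the boundary value counts as grounded
        if approved and case.supported:
            tp += 1
        elif approved and not case.supported:
            fp += 1
        elif not approved and case.supported:
            fn += 1
        else:
            tn += 1

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
    }


# Hypothetical scores for two samples, each producing one positive and one distractor case.
toy_cases = [
    Case(score=0.9, supported=True), Case(score=0.7, supported=False),
    Case(score=0.3, supported=True), Case(score=0.1, supported=False),
]
toy_metrics = confusion_metrics(toy_cases)
print(toy_metrics)
```

Because each response appears once with supporting references and once with distractors, the positive and negative classes are always the same size, which keeps precision and FPR directly comparable between scorers.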

Get the whitepaper

Three editions, same numbers.

Whitepaper

~5 page whitepaper · headline numbers · methodology · cost analysis. Best place to start.

Technical edition

25 pages · methodology · 12-phase journey · error-mode breakdown · reproduction scripts. For CTOs / ML platform leads.

Business edition

5-7 pages · plain English · compliance-breach math at scale. For ops directors / compliance leads.

Verify the math (no account, no cost)

Our per-sample scores from the April 2026 run are committed to the public benchmark repo as JSON. To recompute the headline AUC / precision / recall / FPR / FNR straight from those raw scores - no API key, no rescoring needed:

git clone https://github.com/varia-bly/variably-benchmark
cd variably-benchmark/rgb
python3 compare.py

This reads the per-sample JSON in rgb/results/ and prints the same comparison table shown at the top of this page. If a hand calculation on the same JSON disagrees with compare.py, file an issue - we mean it when we say reproducible.

Re-run the scorers from scratch

Both scorers cost money to run end-to-end and require their own account. The RAGAS side is fully self-contained - just an OpenAI key. The variA/Bly side requires a free API key from variably.tech because the scoring algorithm runs on Variably's hosted infrastructure.

# Re-run RAGAS independently (OpenAI key only):
git clone https://github.com/varia-bly/variably-benchmark
cd variably-benchmark
pip install -r requirements.txt
git clone --depth 1 https://github.com/chen700564/RGB rgb/data/RGB

export OPENAI_API_KEY="<your key>"
python3 rgb/run_ragas.py
python3 rgb/compare.py

RAGAS run: ~$17.76 (gpt-4o-mini judge × 592 evals), ~28 min wall-clock. For the variA/Bly rerun and step-by-step commands, see rgb/README.md.
