LLM bill shock is no longer a punchline.

In 2023, a16z's "Navigating the High Cost of AI Compute" warned that the cost of running models in production would surface as a real P&L line item. By 2024, founder Twitter was littered with screenshots of $30K, $50K, and $100K OpenAI bills posted with stunned-face emojis. Every CFO running a GenAI product now has a Slack channel watching tokens.

Most teams responded the way you'd expect: shipped a smaller model in production, cached aggressively, optimized prompts.

What almost nobody did: look at their evaluation bill.

Here's the thing — your AI evaluation infrastructure is making LLM calls too. Often more of them than your production code. And the model doing the evaluating is usually more expensive than the one doing the generating.

That's the second bill shock. It's quieter, but at scale it's bigger.

What "LLM-as-judge" actually does to your bill

Two of the most popular open-source evaluation libraries — RAGAS and DeepEval — share the same underlying methodology: LLM-as-judge.

The pattern works like this:

1. Your production system generates a response (using, say, gpt-4o-mini).
2. The evaluation library sends the question, the response, and the retrieved context to a separate LLM and asks: "Is this answer faithful? Is it grounded? Is it relevant?"
3. That second LLM, the "judge," produces a score.

Two LLM calls per evaluation. Sometimes more, when the methodology decomposes a response into individual claims and judges each one independently (that's how RAGAS's faithfulness metric is implemented under the hood).
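The call pattern can be sketched in a few lines. This is a minimal illustration, not how RAGAS or DeepEval are implemented: `CountingLLM`, `evaluate`, and the stub models are hypothetical names, and the naive split on periods stands in for the LLM-driven claim decomposition a real library would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CountingLLM:
    """Wraps any prompt -> text function and counts how often it is called."""
    fn: Callable[[str], str]
    calls: int = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self.fn(prompt)


def evaluate(question: str, context: str, generator: CountingLLM,
             judge: CountingLLM, per_claim: bool = False) -> float:
    """Generate an answer, then ask the judge whether it is supported."""
    answer = generator(f"Context: {context}\nQuestion: {question}")

    claims: List[str] = [answer]
    if per_claim:
        # Assumption: a naive split on periods stands in for claim decomposition.
        split = [c.strip() for c in answer.split(".") if c.strip()]
        claims = split or [answer]

    verdicts = [
        judge(f"Context: {context}\nClaim: {claim}\nSupported? Answer yes or no.")
        for claim in claims
    ]
    supported = sum(v.strip().lower().startswith("yes") for v in verdicts)
    return supported / len(verdicts)


# Stub models so the sketch runs without any API key.
generator = CountingLLM(lambda prompt: "The deductible is $500. Flood damage is excluded.")
judge = CountingLLM(lambda prompt: "yes")

score = evaluate("What does the policy cover?", "policy text here",
                 generator, judge, per_claim=True)
total_calls = generator.calls + judge.calls
```

With claim-level judging, this two-claim answer costs three LLM calls rather than two, and a real decomposition step would typically add at least one more.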

Per evaluation, this looks cheap. About $0.030 with gpt-4o-mini as judge. Easy to defend in a vendor pitch.

It gets expensive in three ways most teams underestimate.

#1: In production, the judge is bigger than the generator

Teams optimize their generation stack ruthlessly — cheaper, faster, smaller models. gpt-4o-mini. Haiku. Llama 70B.

But for evaluation, every team I've talked to has the same instinct: "I want the smart model judging the smaller model's output." So they wire up Sonnet, GPT-4o, or Opus as the judge.

That decision is reasonable: you don't want your evaluator to be more error-prone than the thing it's evaluating. The cost consequence is that a judge call routinely costs 3–4× more than a generator call.

So when a vendor quotes you their judge cost at "$0.030/eval with gpt-4o-mini," your actual production bill probably looks closer to $0.10–$0.15/eval, because no one ships gpt-4o-mini as a judge in a high-stakes domain.

The naive math undersells the real bill by 3–5×.

#2: The bill scales with prompt size, not just call count

LLM-as-judge bills are token-based. So your evaluation cost moves with:

Length of the user query
Length of the retrieved context
Length of the response
Length of the judge's chain-of-thought reasoning

A 500-token prompt and a 5,000-token prompt look identical in your evaluation dashboard. They cost very different amounts on the invoice.

Predicting your monthly evaluation bill requires predicting your prompt-length distribution. Most teams can't. That's how the screenshots get posted.
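To make the token dependence concrete, here is a rough sketch of per-call judge cost under simple per-token pricing. The two prices are placeholder assumptions chosen for illustration, so substitute your provider's current rates; `judge_call_cost` and `monthly_judge_cost` are hypothetical helpers, not part of any evaluation library.

```python
from typing import Iterable, Tuple

# Placeholder prices in USD per million tokens (illustrative assumptions only).
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 4.00


def judge_call_cost(input_tokens: int, output_tokens: int,
                    input_price_per_m: float = INPUT_PRICE_PER_M,
                    output_price_per_m: float = OUTPUT_PRICE_PER_M) -> float:
    """Cost in USD of one judge call under simple per-token pricing."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def monthly_judge_cost(calls: Iterable[Tuple[int, int]]) -> float:
    """Sum the cost over (input_tokens, output_tokens) pairs for a month."""
    return sum(judge_call_cost(i, o) for i, o in calls)


# Same judge, same amount of reasoning output, 10x longer prompt.
short_cost = judge_call_cost(input_tokens=500, output_tokens=300)
long_cost = judge_call_cost(input_tokens=5_000, output_tokens=300)
ratio = long_cost / short_cost
```

Feeding `monthly_judge_cost` a sample of real (input, output) token counts from your logs is one way to estimate the prompt-length distribution before the invoice does it for you.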

#3: You pay for evaluations that don't even complete

This one surprised us.

When we benchmarked DeepEval against the public RGB hallucination dataset, its FaithfulnessMetric failed on ~32% of real samples. The cause is mechanical: the claim-decomposition step generates more output tokens than gpt-4o-mini's 16K output limit allows, the judge call truncates, and the metric returns "score unavailable."

You still pay for those calls. So the effective cost per successful evaluation is ~47% higher than the per-call number suggests.
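The ~47% figure follows from spreading the cost of every call over only the calls that return a score. A small sketch, assuming failed calls cost the same as successful ones and are not retried (`cost_per_successful_eval` is a hypothetical helper):

```python
def cost_per_successful_eval(cost_per_call: float, failure_rate: float) -> float:
    """Spread the cost of all calls over only the calls that return a score."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return cost_per_call / (1.0 - failure_rate)


per_call = 0.030      # per-eval judge cost quoted earlier
failure_rate = 0.32   # FaithfulnessMetric failure rate reported on RGB

effective = cost_per_successful_eval(per_call, failure_rate)
uplift = effective / per_call - 1.0  # fraction above the quoted per-call price
```

Retrying failed evaluations would push the number higher still, since each retry is another paid call.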

DeepEval's HallucinationMetric doesn't have this failure mode. Neither does variA/Bly. But "your scorer actually worked" is not a given when the methodology is LLM-as-judge.

The voice-AI-for-insurance example

One of our customers builds voice AI agents for the insurance industry. Each customer call gets transcribed, summarized, and routed; every claim about a policy term, coverage limit, or quote needs to be grounded in actual policy documents.

That's about 10,000 evaluations per month, and that's per scoring dimension. With compliance, policy grounding, hallucination, and tone checked separately, the real number is 4× higher: about 40,000.

If they'd built that on LLM-as-judge:

Naive math (gpt-4o-mini judge): 40,000 evals × ~$0.030 = $1,200/mo, judge tokens alone
Realistic math (Sonnet judge for regulated content): 40,000 × ~$0.10 = $4,000/mo, judge tokens alone
Plus the RAGAS/DeepEval ops overhead — hosting, queue management, retry logic
Plus the failed claim-decomposition calls you still pay for (~32% in our RGB benchmark of DeepEval's FaithfulnessMetric)
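The judge-token arithmetic, as a small sketch. `monthly_judge_bill` is a hypothetical helper that counts judge tokens only and assumes a flat per-eval cost; the larger volumes in the loop assume spend grows linearly with evaluation count.

```python
def monthly_judge_bill(evals_per_dimension: int, dimensions: int,
                       cost_per_eval: float) -> float:
    """Judge-token spend per month across all scoring dimensions."""
    return evals_per_dimension * dimensions * cost_per_eval


DIMENSIONS = 4  # compliance, policy grounding, hallucination, tone

naive = monthly_judge_bill(10_000, DIMENSIONS, 0.030)     # gpt-4o-mini judge
realistic = monthly_judge_bill(10_000, DIMENSIONS, 0.10)  # stronger judge

# Assumption: the per-eval cost stays flat as volume grows.
for volume in (10_000, 100_000, 1_000_000):
    bill = monthly_judge_bill(volume, DIMENSIONS, 0.10)
    print(f"{volume:>9,} evals/dimension/mo -> ${bill:,.0f}/mo in judge tokens")
```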

Now multiply that by every voice AI startup scaling from 10K to 100K to 1M calls a month. The line moves from "annoying" to "a real ARR-equivalent cost."

The 10K/month math, side by side

Here's the apples-to-apples cost at the volume most early-stage AI teams hit first:

| | RAGAS (LLM-as-judge, gpt-4o-mini) | DeepEval (LLM-as-judge, gpt-4o-mini) | variA/Bly (deterministic) |
| --- | --- | --- | --- |
| Per-eval cost | ~$0.030 (judge tokens only) | ~$0.030 (judge tokens only) | $0.015 (entry tier, all-in) |
| Monthly cost @ 10K evals | ~$300+ (judge alone) | ~$300+ (judge alone) | $150 |
| With realistic stronger judge (Sonnet) | $1,000–$1,500+ | $1,000–$1,500+ | unchanged ($150) |
| Bill varies with prompt length? | Yes | Yes | No |
| Pays for failed evaluations? | Yes | Yes (~32% RGB failure rate) | No |

variA/Bly costs ~50–60% less than RAGAS's judge tokens alone, and ~70% less once you include RAGAS's ops overhead. At enterprise volume tiers, the per-eval rate drops to $0.012, widening the gap.
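The percentages can be rechecked from the per-eval rates. A quick sketch, with `savings_pct` as a hypothetical helper; ops overhead is not quantified in this post, so it is left out, and the enterprise rate is applied at the same 10K volume purely for comparison.

```python
def savings_pct(baseline: float, alternative: float) -> float:
    """Percent saved by paying `alternative` instead of `baseline`."""
    return (baseline - alternative) / baseline * 100.0


EVALS = 10_000
judge_mini = EVALS * 0.030       # gpt-4o-mini judge tokens, ~$300
flat_entry = EVALS * 0.015       # entry tier, $150
flat_enterprise = EVALS * 0.012  # enterprise per-eval rate at the same volume

entry_saving = savings_pct(judge_mini, flat_entry)
enterprise_saving = savings_pct(judge_mini, flat_enterprise)

# Against the stronger-judge range of $1,000 to $1,500 per month.
vs_strong_low = savings_pct(1_000, flat_entry)
vs_strong_high = savings_pct(1_500, flat_entry)
```

On these inputs the saving is about 50% at the entry rate and 60% at the enterprise rate, and roughly 85–90% against the stronger-judge range. A ~70% saving would correspond to a baseline of about $500 a month, which would imply roughly $200 of monthly ops overhead on top of judge tokens.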

"Cheaper" without "more accurate" is a false economy

Cost only matters in the context of quality. A cheaper scorer that misses hallucinations ends up more expensive, because the incidents, audits, and rollbacks live downstream of bad scoring.

Two public benchmark results:

RGB hallucination benchmark (n=592) — RAGAS approves 38.2% of distractor passages as grounded; variA/Bly holds that to 6.1%.
PubMedQA hallucination catch rate — variA/Bly catches 63.9% of hallucinations on clinical text. RAGAS catches 27.2%. DeepEval lands between 4.5% and 21.8% depending on the metric.

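On RGB, the same low-cost evaluation stack approves about 6× fewer distractor passages than RAGAS. On PubMedQA, it catches about 2.3× as many hallucinations as RAGAS and roughly 3× to 14× as many as DeepEval, depending on the metric. That's not a coincidence: claim-level deterministic evaluation has fundamentally different failure modes than asking an LLM to judge another LLM.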

You can verify the math from the public Apache-2.0 benchmark scripts.

Why variA/Bly's pricing is structured the way it is

variA/Bly doesn't run LLM-as-judge under the hood. The scoring pipeline is deterministic — NLI models, embedding similarity, claim-level grounding, retrieval relevance — purpose-built for evaluation, not LLM calls reasoning about other LLM calls.

That's where the predictability comes from. The bill doesn't move when prompts get longer. It doesn't move when the next frontier model launches and changes judge-model pricing. The Standard Evaluation Unit (SEU) is one evaluation, all-in, regardless of input size.

Three questions to ask any evaluation vendor:

What's the per-eval cost at my token size? If the answer references "judge tokens," your bill will surprise you.
What's the score-failure rate? Anything LLM-as-judge based should publish this. Most don't.
Does the cost scale linearly with volume, or does it surge with prompt length?

The hidden bill shock isn't a question of whether evaluation should cost money. It's whether you can predict the number on next month's invoice.

If you can't, you're already overpaying — you just don't know by how much yet.

See variA/Bly's pricing, the public benchmark repo, or read about the underlying evaluation methodology.