All posts
by Amit Kumar

Run It Twice. Did You Get the Same Score?

Why deterministic LLM evaluation is the foundation regression detection, statistical A/B testing, and compliance audits all build on — and the reproducibility test most eval vendors don't want you to run.

We ran the same prompt through RAGAS twice last month. Same response. Same retrieved context. Same RAGAS version. Same judge model (gpt-4o-mini at temperature 0).

The faithfulness score moved from 0.71 to 0.83.

We ran it five more times. The scores landed at 0.71, 0.83, 0.79, 0.75, 0.83.

Nothing about the input had changed. Only the measurement had.

This is the hidden variance built into every LLM-as-judge evaluation tool, and it quietly breaks the three things production AI depends on: regression detection, statistical A/B testing, and audit-grade compliance.

Why "same input, same score" matters

Three places this matters most:

1. Regression detection. You shipped a new prompt. Faithfulness dropped from 0.86 to 0.79. Did the prompt regress, or did the scorer? With a non-deterministic evaluator, you can't tell — and you'll burn an afternoon trying to find out.

2. Statistical A/B testing. Significance math assumes a stable measurement instrument. If your scorer adds ±5% noise to every measurement, your test needs roughly 4× the sample size to detect the same effect — and even then, a meaningful chunk of your "wins" are measurement artifacts.

3. Compliance and audit. Under the EU AI Act, OCC guidance on model risk, and HIPAA-adjacent enforcement, you need to defend an evaluation result months after the fact. "We scored this response at 87% faithful" only holds up if running the same input today produces 87% again. Anything else is unreproducible — which, in compliance language, means undefendable.

Where the variance comes from

LLM-as-judge isn't deterministic. Four reasons, in increasing order of how hard they are to engineer around:

(a) Temperature > 0. Many evaluation tools default to a non-zero judge temperature. Setting it to 0 helps, but doesn't fully fix the problem.

(b) Even at temperature 0, modern LLMs aren't fully deterministic. This is well-enough understood that OpenAI added the seed parameter and system_fingerprint field specifically to give developers reproducibility hooks — and even with those, the docs note that completions can still vary when the underlying serving stack changes. Anthropic's API behaves similarly. At scale, temperature-0 LLM-as-judge still produces measurable variance.

(c) Silent model updates. The convenience aliases — gpt-4o-mini, claude-haiku — point to the latest version. If your evaluation tool doesn't explicitly pin to a dated snapshot like gpt-4o-mini-2024-07-18, your judge model changes every time the provider updates the default. Most evaluation tools don't pin by default. Your historical scores have no defensible meaning if the judge changed underneath you.

(d) Quantization and serving drift. At scale, providers route requests to different hardware tiers with different quantization. Two identical calls to the same model can land on different physical infrastructure and produce different outputs.

You cannot engineer your way out of (c) and (d). They are properties of the LLM-as-a-service model itself.

What deterministic evaluation actually requires

For a score to be deterministic, the entire evaluation pipeline has to be deterministic:

  • Pre-processing (tokenization, claim decomposition) must produce identical intermediate representations on the same input.
  • The scoring algorithm must be a fixed function, not a sampled output.
  • The model weights used must be versioned and immutable.

That last requirement is the load-bearing one. The moment you ask "an LLM" to judge, your output is a sample from a probability distribution. Same input → distribution over outputs → variance.

variA/Bly's evaluation pipeline is built differently. Each scoring dimension runs on a fixed-weight model with deterministic decoding — Natural Language Inference (NLI) models for grounding, embedding similarity for retrieval relevance, rule-driven claim decomposition rather than LLM-generated, domain-routed models for clinical and other specialized text. Same input, same intermediate, same score. Run it a hundred times; the answer doesn't move.

The reproducibility test

Here's the test that separates real deterministic evaluation from "low-temperature LLM-as-judge":

Take 100 evaluations from your production traffic. Run them through your scorer. Save the scores. Wait an hour. Run them again. Compute the per-sample score delta.

For variA/Bly: every delta is exactly 0.

For RAGAS or DeepEval at temperature 0: the median delta is small but nonzero; the long tail is sometimes large; some samples flip across decision thresholds (e.g., 0.49 → 0.51, which flips a pass/fail gate).

This is the test most evaluation vendors don't want you to run. It's free, runs in an afternoon, and tells you whether your downstream regression detection, A/B tests, and audit trail can actually be trusted.

You can run it against the public Apache-2.0 benchmark repo — the scripts are already there, and the run-to-run determinism shows up byte-for-byte in the committed JSON outputs.

Where LLM-as-judge is still the right tool

Not every evaluation needs determinism. Some judgments are fundamentally subjective — "is this marketing copy more compelling than that one?", "does this poem rhyme well?", "is the tone empathetic enough?". For open-ended creative or stylistic evaluation, the variance in LLM-as-judge is part of why it can capture nuance that fixed-weight scorers miss.

But for production gates — does this response hallucinate, is this claim grounded in the retrieved context, did the retrieval pull the right documents — variance is a bug, not a feature. The thing you are measuring exists. The score should reflect it. Twice in a row.

The deeper consequence

The reason teams accept evaluator variance is that they've never run the reproducibility test. The score comes back, it looks plausible, and they move on. Until the day the score changes for unclear reasons — and there's no way to tell whether the model regressed, the prompt drifted, the user behavior shifted, or the scorer just hiccupped.

You can solve three of those four. You cannot solve the scorer hiccup unless you change scorers.

Same input. Same score. Every run. It sounds boring. It's the foundation everything else builds on.


Read more about variA/Bly's evaluation methodology, or verify the math from the public benchmark scripts.

Want this kind of evaluation for your RAG system?

Talk to us