blog

Field notes on production AI evaluation.

Claim-level grounding, retrieval relevance, deterministic scoring — what we learn building the evaluation methodology behind variA/Bly.

  • Hallucination evaluation, NLI, Deterministic evaluation, RAG, AI agents, Fluency Trap

    Demystifying Hallucination Evaluation for RAG and AI Agents

    A practitioner's guide to detecting fluent-but-wrong outputs, and why scores you can't reproduce aren't really scores. Covers the anatomy of a hallucination grader, the Fluency Trap, the six categories of hallucination, determinism, per-claim verdicts, domain routing, public benchmark numbers, and an honest 0-to-1 roadmap.

  • Cost, LLM-as-judge, RAGAS, DeepEval, Low-cost AI evaluation

    The Hidden Bill Shock in LLM-as-Judge Evaluation

    Your LLM bill shocked you last year. The evaluation bill is next. Why RAGAS and DeepEval — both LLM-as-judge under the hood — can quietly double your AI cost, with worked math at 10K calls/month.

  • Deterministic evaluation, Reproducibility, LLM-as-judge, Compliance

    Run It Twice. Did You Get the Same Score?

    Why deterministic LLM evaluation is the foundation that regression detection, statistical A/B testing, and compliance audits all build on, and the reproducibility test most eval vendors don't want you to run.
