blog

Field notes on production AI evaluation.

Claim-level grounding, retrieval relevance, deterministic scoring — what we learn building the evaluation methodology behind variA/Bly.

June 8, 2026 · Hallucination evaluation, NLI, Deterministic evaluation, RAG, AI agents, Fluency Trap
Demystifying Hallucination Evaluation for RAG and AI Agents
A practitioner's guide to detecting fluent-but-wrong outputs — and why scores you can't reproduce aren't really scores. Anatomy of a hallucination grader, the Fluency Trap, the six categories of hallucination, determinism, per-claim verdicts, domain routing, public benchmark numbers, and an honest 0-to-1 roadmap.
May 20, 2026 · LLMOps, Langfuse, LangSmith, Comparison, Tool selection
When to Use Langfuse, LangSmith, or variA/Bly: Choosing Your LLMOps Stack in 2026
Langfuse, LangSmith, and variA/Bly all show up in the same "which LLMOps tool should I use" conversations — but they answer different questions. A practical, honest comparison with feature matrix, methodology breakdown, and a decision tree.
April 8, 2026 · Cost, LLM-as-judge, RAGAS, DeepEval, Low-cost AI evaluation
The Hidden Bill Shock in LLM-as-Judge Evaluation
Your LLM bill shocked you last year. The evaluation bill is next. Why RAGAS and DeepEval — both LLM-as-judge under the hood — can quietly double your AI cost, with worked math at 10K calls/month.
February 11, 2026 · Deterministic evaluation, Reproducibility, LLM-as-judge, Compliance
Run It Twice. Did You Get the Same Score?
Why deterministic LLM evaluation is the foundation that regression detection, statistical A/B testing, and compliance audits all build on, and the reproducibility test most eval vendors don't want you to run.
December 17, 2025 · RAG, Evaluation, Retrieval, Hallucination
Your AI Isn't Hallucinating — Your Retrieval Is
A diabetes clinical agent scored 96% faithfulness on every standard hallucination metric — and was still completely wrong. The number nobody checks tells the real story.
