how it works

One evaluation framework. Any LLM output.

The deterministic alternative to LLM-as-judge. Six scoring categories. Per-claim audit trail. The same methodology grades RAG responses, agent outputs, copilot drafts, and anything else your LLM produces — same input, same score, every run.

the illusion

Your eval says everything is fine.

These scores look great. The reality is different.

Standard evaluation report
Fluency: 0.95
Safety: 0.98
Relevance: 0.91
Coherence: 0.93
All checks passed

What it doesn't tell you
23% of claims have no source document: ungrounded claims passed fluency checks
3 fabricated citations detected: references to documents that don't exist in the corpus
Turn 4 contradicts Turn 2: context loss across conversation turns
Retrieved chunks unused: 40% of relevant context was ignored by the model

hidden hallucination

Confident responses. Fabricated claims.

The response reads perfectly. But look closer.

Example shown: healthcare RAG copilot. The same hallucination pattern shows up in agent outputs, sales drafts, and any LLM-generated content.

AI Response

The treatment protocol recommends starting with metformin at 500mg twice daily. Clinical trials showed a 47% improvement in outcomes when combined with lifestyle interventions.

Patients should be monitored every 3 months for the first year. The landmark 2019 study by Dr. Chen et al. confirmed these findings across 12,000 patients.

Context-Aware Verification
Grounded: Metformin 500mg dosing (source: guidelines_v3.pdf, chunk 47)
Fabricated: "47% improvement" claim (no matching claim in any retrieved document)
Grounded: 3-month monitoring schedule (source: protocol_2024.pdf, chunk 12)
Fabricated: "Dr. Chen et al. 2019 study" (fabricated citation; no such study in corpus)
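
To make the per-claim audit trail concrete, here is a minimal Python sketch of how the four verdicts above could be recorded and rolled up into a hallucination rate. The `ClaimVerdict` class and `hallucination_rate` function are hypothetical names chosen for illustration, not the product's API, and the sketch assumes a claim counts as grounded whenever a supporting source chunk was found.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimVerdict:
    """One row of the audit trail (hypothetical structure)."""
    claim: str
    source: Optional[str]  # None means no supporting chunk was found

    @property
    def grounded(self) -> bool:
        return self.source is not None


def hallucination_rate(verdicts):
    """Fraction of claims with no supporting source (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    ungrounded = sum(1 for v in verdicts if not v.grounded)
    return ungrounded / len(verdicts)


# The four claims from the healthcare example above.
audit_trail = [
    ClaimVerdict("Metformin 500mg dosing", "guidelines_v3.pdf, chunk 47"),
    ClaimVerdict('"47% improvement" claim', None),
    ClaimVerdict("3-month monitoring schedule", "protocol_2024.pdf, chunk 12"),
    ClaimVerdict('"Dr. Chen et al. 2019 study"', None),
]

rate = hallucination_rate(audit_trail)  # 2 of 4 claims are ungrounded
```

Keeping the source reference next to each claim is what turns a single score into something a reviewer can audit line by line.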

the framework

Six categories. Deterministic scoring. Same answer, every run.

40+ metrics across six categories cover the full surface of what can go wrong with an LLM output. Each score reproduces byte-for-byte on the same input.
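
One way to test a byte-for-byte reproducibility claim like this is to serialize each score report canonically, hash it, and compare hashes across runs. The sketch below assumes a scorer that returns a plain dictionary; `toy_score` is a hypothetical stand-in (simple substring matching), not the framework's scoring logic.

```python
import hashlib
import json


def score_fingerprint(scores: dict) -> str:
    """Hash a score report using a canonical JSON encoding (sorted keys)."""
    canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def toy_score(response: str, sources: list) -> dict:
    """Hypothetical scorer: share of sentences found verbatim in a source."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    supported = sum(
        1 for s in sentences
        if any(s.lower() in src.lower() for src in sources)
    )
    grounding = round(supported / len(sentences), 4) if sentences else 0.0
    return {"grounding": grounding, "claims": len(sentences)}


def is_reproducible(scorer, response: str, sources: list, runs: int = 3) -> bool:
    """True if every run yields an identical fingerprint for the same input."""
    fingerprints = {
        score_fingerprint(scorer(response, sources)) for _ in range(runs)
    }
    return len(fingerprints) == 1


report = toy_score(
    "Take metformin. It cures everything.",
    ["Patients take metformin daily."],
)
stable = is_reproducible(
    toy_score,
    "Take metformin. It cures everything.",
    ["Patients take metformin daily."],
)
```

A check like this fits naturally in CI: if the fingerprint for a fixed input ever changes, either the scorer or its dependencies changed.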

Quality

Correctness, completeness, relevance to the request. Did the response actually answer the question?

Safety

Toxicity, bias, PII leakage, prompt-injection resistance. Catch unsafe outputs before they reach users.

Grounding

Claim-by-claim verification against source material, using natural language inference (NLI) plus numeric checks. Hallucination rate per response.

Semantic

Embedding-based similarity to reference answers. Topic drift, off-domain detection, semantic faithfulness.

Coherence

Multi-turn consistency, context retention, contradiction detection across an interaction sequence.

Advanced

Reasoning depth, citation quality, structured-output validity, edge-case behaviors. The metrics that catch what generic scoring misses.
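
As an illustration of the numeric checks mentioned under Grounding, the Python sketch below flags any number in a claim that does not appear in the supporting chunk. It is a simplified assumption about how such a check could work: the function names are hypothetical, the chunk text is invented for the example, and a real check would also need to handle units, ranges, and rounding.

```python
import re

# Matches integers, thousands-separated integers, and decimals (e.g. 47, 12,000, 0.95).
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> set:
    """Return every number in the text as a float, ignoring thousands separators."""
    return {float(m.replace(",", "")) for m in _NUMBER.findall(text)}


def unmatched_numbers(claim: str, chunk: str) -> list:
    """Numbers stated in the claim that never appear in the source chunk."""
    return sorted(extract_numbers(claim) - extract_numbers(chunk))


# Hypothetical chunk text, for illustration only.
chunk = "Start metformin at 500 mg twice daily."

dosing = unmatched_numbers("Start with metformin at 500mg twice daily", chunk)
improvement = unmatched_numbers("Clinical trials showed a 47% improvement", chunk)
```

An empty result means every number in the claim was found in the chunk; a non-empty result lists the figures that need a source.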

Validated on public benchmarks: 6× cleaner hallucination detection than RAGAS on RGB (n=592)

built for any LLM workload

One framework. Many applications.

The same scoring methodology grades whatever your LLM produces — the response text, the synthesized answer, the inline draft.

RAG

Retrieval-augmented generation

Claim-level grounding against retrieved chunks. Citation verification. Hallucination detection 6× cleaner than RAGAS on the public RGB benchmark (n=592).

Agents

Multi-step agent workflows

Agent text outputs are grounded in tool results — structurally RAG-shaped. Score grounding, coherence across steps, and safety with the same framework. Multi-turn consistency for sequential reasoning.

Copilots

Inline assistants & copilots

Score inline drafts, completions, and suggestions across all six categories. Compare suggestion variants on quality, safety, and groundedness before rolling out company-wide.

Stop guessing. Start scoring.

1,000 free credits. No credit card. See your first deterministic evaluation in under 10 minutes.