How it works

From a suite to a release decision

The pipeline pairs deterministic checks with an LLM judge so the score is both cheap and defensible — and frames the result as a release gate, not a leaderboard.

1
Suite

A task, a set of test cases (each with deterministic checks), and the prompt/model variants under test.

2
Generate

Each variant produces an output for each case. Mock mode returns deterministic synthetic outputs; live mode runs the real model.

3
Deterministic checks

Schema validity, must-include / exclude, regex, max-length — run first because they're free and catch the obvious failures.

4
LLM-as-judge

claude-haiku-4-5 scores relevance, faithfulness, and safety (1–5) against the task rubric, with a one-line rationale and a confidence label.

5
Aggregate

Per-variant averages for quality, check pass-rate, latency, cost, and unsafe-case count.

6
Release gate

The best variant's average quality is compared to the suite threshold to produce a ship / hold recommendation.

Why mock-first

The default mode needs no API key and no database, so the demo always works and CI never depends on a paid model call. The runner exposes one interface; live mode swaps in the Anthropic adapter when ANTHROPIC_API_KEY is set.

Why a judge needs guardrails

An LLM judge alone is noisy. Pairing it with deterministic checks, an explicit rubric, and confidence labels — and surfacing disagreement rather than hiding it — is what makes the eval trustworthy enough to gate a release on.