Evaluation / LLMOps · Mock demo

AgentEval Studio

An evaluation and observability workbench that compares AI prompt / RAG / agent variants on quality, cost, latency, and failure modes, and recommends a release gate.

Next.js · TypeScript · Anthropic API · LLM-as-judge · Supabase (optional) · Vercel

At a glance

ICP
Small AI product teams and solo builders shipping LLM features without enterprise-grade eval tooling.
Features
  • Create eval suites from test cases
  • Register prompt / model variants
  • Run batch evals (deterministic checks + LLM-as-judge)
  • Score on relevance, faithfulness, safety, latency, and cost
  • Compare experiments side by side
  • Release-gate recommendation + exportable report
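
To make the scoring and comparison features concrete, here is a minimal sketch of how per-case results might roll up into a scorecard and a pass/block recommendation. Everything in it is an assumption for illustration: the field names, the 0..1 score scale, the variant names, the sample numbers, and in particular the gate rule (the weakest quality dimension must reach the threshold). The project may aggregate and gate differently.

```typescript
// Hypothetical shapes: field names and the 0..1 score scale are assumptions.
interface CaseResult {
  relevance: number;
  faithfulness: number;
  safety: number;
  latencyMs: number;
  costUsd: number;
}

interface Scorecard {
  variant: string;
  relevance: number;
  faithfulness: number;
  safety: number;
  meanLatencyMs: number;
  totalCostUsd: number;
}

// Arithmetic mean; an empty run scores 0 so it can never pass the gate.
const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

// Collapse per-case results into per-dimension aggregates for one variant.
function buildScorecard(variant: string, results: CaseResult[]): Scorecard {
  return {
    variant,
    relevance: mean(results.map((r) => r.relevance)),
    faithfulness: mean(results.map((r) => r.faithfulness)),
    safety: mean(results.map((r) => r.safety)),
    meanLatencyMs: mean(results.map((r) => r.latencyMs)),
    totalCostUsd: results.reduce((sum, r) => sum + r.costUsd, 0),
  };
}

// Assumed gate rule: the weakest quality dimension must reach the threshold.
function recommend(card: Scorecard, minQuality: number): "pass" | "block" {
  const weakest = Math.min(card.relevance, card.faithfulness, card.safety);
  return weakest >= minQuality ? "pass" : "block";
}

// Side-by-side comparison of two variants on made-up numbers.
const baseline = buildScorecard("prompt-v1", [
  { relevance: 0.5, faithfulness: 1, safety: 1, latencyMs: 100, costUsd: 0.25 },
  { relevance: 1, faithfulness: 1, safety: 1, latencyMs: 300, costUsd: 0.5 },
]);
const candidate = buildScorecard("prompt-v2", [
  { relevance: 1, faithfulness: 1, safety: 1, latencyMs: 200, costUsd: 0.5 },
  { relevance: 1, faithfulness: 1, safety: 1, latencyMs: 400, costUsd: 0.5 },
]);

for (const card of [baseline, candidate]) {
  console.log(card.variant, recommend(card, 0.8), card);
}
```

Gating on the weakest dimension rather than an overall average keeps a strong relevance score from masking a faithfulness or safety regression; a weighted average would be an equally reasonable choice for a different risk profile.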

AI architecture

  1. Prompt / version registry
    Register variants under test with metadata.
  2. Test dataset
    10–25 curated cases per suite, versioned.
  3. Evaluator runners
    Deterministic checks (schema, regex, must-include) run first.
  4. LLM-as-judge
    claude-haiku-4-5 scores each output against a rubric.
  5. Scorecards
    Per-dimension quality, cost, and latency aggregates.
  6. Release gate
    Pass/block recommendation against a configurable threshold.
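
Steps 3 and 4 above can be sketched as a small evaluator that runs the deterministic checks before any judge call. This is an illustrative sketch, not the project's code: the `Check` shapes, `passesCheck`, `evaluateOutput`, and the injected `Judge` function are hypothetical names, the "schema" check is simplified to a required-keys test on a JSON object, and skipping the judge when a deterministic check fails is an assumption (the source only says the deterministic checks run first).

```typescript
// Hypothetical check definitions mirroring "schema, regex, must-include".
type Check =
  | { kind: "schema"; requiredKeys: string[] }
  | { kind: "regex"; pattern: string }
  | { kind: "mustInclude"; phrases: string[] };

function passesCheck(output: string, check: Check): boolean {
  switch (check.kind) {
    case "schema": {
      // Simplified schema check: output must parse as a JSON object
      // that contains every required key.
      try {
        const parsed: unknown = JSON.parse(output);
        if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
          return false;
        }
        return check.requiredKeys.every((key) => key in parsed);
      } catch {
        return false; // not valid JSON
      }
    }
    case "regex":
      return new RegExp(check.pattern).test(output);
    case "mustInclude": {
      // Case-insensitive substring match for every required phrase.
      const haystack = output.toLowerCase();
      return check.phrases.every((p) => haystack.includes(p.toLowerCase()));
    }
  }
}

// The judge is injected, so this sketch makes no real model call.
// It is assumed to return a rubric score in the range 0..1.
type Judge = (output: string, rubric: string) => Promise<number>;

interface CaseVerdict {
  deterministicPassed: boolean;
  judgeScore: number | null; // null when the judge was skipped
}

async function evaluateOutput(
  output: string,
  checks: Check[],
  rubric: string,
  judge: Judge,
): Promise<CaseVerdict> {
  const deterministicPassed = checks.every((c) => passesCheck(output, c));
  // Assumption: a failed deterministic check short-circuits the paid judge call.
  if (!deterministicPassed) {
    return { deterministicPassed, judgeScore: null };
  }
  return { deterministicPassed, judgeScore: await judge(output, rubric) };
}
```

Running the cheap checks first means malformed outputs never reach the judge, which keeps per-run cost and latency down and makes failures easier to attribute.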

Case study

Product problem

AI PMs and eng leads need a defensible answer to 'is this good enough to ship?' Today that decision is vibes-based. AgentEval turns it into a measured, repeatable gate.

ICP & MVP scope

ICP: a small AI product team or solo builder deploying LLM features. In scope for MVP: suite creation, batch eval, scorecards, comparison, and a release-gate recommendation. Out of scope: org RBAC, dataset labelling workflows, and live production tracing.

Metrics & experiments

The north-star metric is the share of eval suites that clear the release threshold. A natural first experiment: does showing a release-gate recommendation (rather than raw scores alone) increase the rate at which users act on a failing eval?

Resume bullets · AI Engineering
  • Built an LLM evaluation harness combining deterministic checks with an LLM-as-judge (Claude Haiku) and per-run cost/latency tracking.
  • Designed a mock-first runner so the hosted demo works with zero API keys and CI never depends on a paid model call.
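
A mock-first runner like the one described in the bullet above could be sketched as follows. The names (`ModelRunner`, `MockRunner`, `selectRunner`, `hashString`) and the output format are hypothetical, and the live runner is left as a caller-supplied factory so the sketch never touches a real API. The idea shown is only the selection rule: with no API key, fall back to a deterministic mock.

```typescript
// Hypothetical runner interface; `mock` flags outputs that did not come from a model.
interface RunOutput {
  text: string;
  mock: boolean;
}

interface ModelRunner {
  run(prompt: string): Promise<RunOutput>;
}

// Small deterministic string hash (unsigned 32-bit), so the same prompt
// always yields the same mock output and CI runs are reproducible.
function hashString(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

class MockRunner implements ModelRunner {
  async run(prompt: string): Promise<RunOutput> {
    return { text: `MOCK:${hashString(prompt).toString(16)}`, mock: true };
  }
}

// Mock-first selection: only build the live runner when a non-empty key is present.
// `makeLive` is a caller-supplied factory (an assumption of this sketch).
function selectRunner(
  apiKey: string | undefined,
  makeLive: (key: string) => ModelRunner,
): ModelRunner {
  if (apiKey === undefined || apiKey.trim() === "") {
    return new MockRunner();
  }
  return makeLive(apiKey);
}
```

Passing the key in as an argument, rather than reading it inside the runner, keeps the selection rule easy to unit test without touching environment variables.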
Resume bullets · AI PM
  • Defined ICP, MVP scope, and a release-gate metric framework (north star, activation, retention, quality, guardrail) for an LLM eval product.
  • Reframed model evaluation as a shippable release gate, turning a vibes-based decision into a measured one.