Evaluation / LLMOps · Mock demo

AgentEval Studio

An evaluation and observability workbench that compares AI prompt / RAG / agent variants on quality, cost, latency, and failure modes, and recommends a release gate.

Next.js · TypeScript · Anthropic API · LLM-as-judge · Supabase (optional) · Vercel

At a glance

ICP: Small AI product teams and solo builders shipping LLM features without enterprise-grade eval tooling.
Features:
Create eval suites from test cases
Register prompt / model variants
Run batch evals (deterministic checks + LLM-as-judge)
Score on relevance, faithfulness, safety, latency, and cost
Compare experiments side by side
Release-gate recommendation and exportable report

AI architecture

1. Prompt / version registry: Register variants under test with metadata.
2. Test dataset: 10–25 curated cases per suite, versioned.
3. Evaluator runners: Deterministic checks (schema, regex, must-include) run first.
4. LLM-as-judge: claude-haiku-4-5 scores each output against a rubric.
5. Scorecards: Per-dimension quality, cost, and latency aggregates.
6. Release gate: Pass/block recommendation against a configurable threshold.
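
The stages above could be wired together roughly as follows. This is a minimal TypeScript sketch, not the project's actual code: the type names (`Check`, `CaseResult`, `Scorecard`, `GateThresholds`), the 0 to 1 score scale, and the threshold fields are all assumptions made for illustration. It covers the deterministic checks (stage 3), scorecard aggregation (stage 5), and the release gate (stage 6). Schema checks and the judge call itself (stage 4) are left out; judge results appear only as scores, and skipping the judge for outputs that fail the checks is likewise an assumption.

```typescript
// Illustrative sketch only: every name, scale, and threshold here is hypothetical.

type Check =
  | { kind: "regex"; pattern: RegExp }
  | { kind: "mustInclude"; phrases: string[] };

interface JudgeScores {
  relevance: number; // assumed 0..1 scale
  faithfulness: number; // assumed 0..1 scale
  safety: number; // assumed 0..1 scale
}

interface CaseResult {
  checksPassed: boolean;
  scores: JudgeScores | null; // null when the judge was skipped (assumption)
  latencyMs: number;
  costUsd: number;
}

// Stage 3: cheap deterministic checks. True only if every check passes.
function passesChecks(output: string, checks: Check[]): boolean {
  return checks.every((check) => {
    if (check.kind === "regex") {
      // Assumes a non-global pattern; a /g regex keeps lastIndex state across test() calls.
      return check.pattern.test(output);
    }
    const lower = output.toLowerCase();
    return check.phrases.every((p) => lower.includes(p.toLowerCase()));
  });
}

interface Scorecard {
  checkPassRate: number;
  meanScores: JudgeScores;
  meanLatencyMs: number;
  totalCostUsd: number;
}

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

// Stage 5: per-dimension aggregates over all cases in a run.
function buildScorecard(results: CaseResult[]): Scorecard {
  const judged = results
    .map((r) => r.scores)
    .filter((s): s is JudgeScores => s !== null);
  return {
    checkPassRate: mean(results.map((r) => (r.checksPassed ? 1 : 0))),
    meanScores: {
      relevance: mean(judged.map((s) => s.relevance)),
      faithfulness: mean(judged.map((s) => s.faithfulness)),
      safety: mean(judged.map((s) => s.safety)),
    },
    meanLatencyMs: mean(results.map((r) => r.latencyMs)),
    totalCostUsd: results.reduce((sum, r) => sum + r.costUsd, 0),
  };
}

interface GateThresholds {
  minCheckPassRate: number;
  minScore: number; // applied to each judged dimension
  maxMeanLatencyMs: number;
}

interface GateResult {
  decision: "pass" | "block";
  reasons: string[];
}

// Stage 6: block if any aggregate misses its threshold, and say why.
function releaseGate(card: Scorecard, t: GateThresholds): GateResult {
  const reasons: string[] = [];
  if (card.checkPassRate < t.minCheckPassRate) {
    reasons.push("deterministic check pass rate below threshold");
  }
  const dims: (keyof JudgeScores)[] = ["relevance", "faithfulness", "safety"];
  for (const dim of dims) {
    if (card.meanScores[dim] < t.minScore) reasons.push(`${dim} below threshold`);
  }
  if (card.meanLatencyMs > t.maxMeanLatencyMs) {
    reasons.push("mean latency above threshold");
  }
  return { decision: reasons.length === 0 ? "pass" : "block", reasons };
}
```

Returning the list of reasons alongside the decision is one way to keep a "block" recommendation explainable in an exported report.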

Case study

Product problem

AI PMs and eng leads need a defensible answer to 'is this good enough to ship?' Today that decision is vibes-based. AgentEval turns it into a measured, repeatable gate.

ICP & MVP scope

ICP: a small AI product team or solo builder deploying LLM features. In scope for MVP: suite creation, batch eval, scorecards, comparison, and a release-gate recommendation. Out of scope: org RBAC, dataset labelling workflows, and live production tracing.

Metrics & experiments

The north-star metric is the share of eval suites that clear the release threshold. A natural first experiment: does showing a release-gate recommendation (rather than raw scores alone) increase the rate at which users act on a failing eval?
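
Both quantities in this section reduce to simple ratios. The sketch below shows one way they might be computed; the event shape, the arm names, and the choice to count one decision per suite are assumptions, since the page does not specify how runs or user actions are logged.

```typescript
// Hypothetical shapes: the source does not define how decisions or actions are recorded.
type GateDecision = "pass" | "block";
type Arm = "recommendation" | "rawScores";

interface FailingEvalEvent {
  arm: Arm; // which experiment arm the user saw
  userActed: boolean; // e.g. edited the prompt or re-ran the suite (assumed definition)
}

// North star: share of suites whose gate decision is "pass".
// Assumes one decision per suite, such as the latest run.
function shareClearingGate(decisions: GateDecision[]): number {
  if (decisions.length === 0) return 0;
  return decisions.filter((d) => d === "pass").length / decisions.length;
}

// Experiment readout: fraction of failing evals that users acted on, per arm.
function actionRateByArm(events: FailingEvalEvent[]): Record<Arm, number> {
  const rate = (arm: Arm): number => {
    const inArm = events.filter((e) => e.arm === arm);
    if (inArm.length === 0) return 0;
    return inArm.filter((e) => e.userActed).length / inArm.length;
  };
  return { recommendation: rate("recommendation"), rawScores: rate("rawScores") };
}
```

A real readout would also need sample sizes and a significance test before concluding that one arm outperforms the other.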

Resume bullets · AI Engineering

Built an LLM evaluation harness combining deterministic checks with an LLM-as-judge (Claude Haiku) and per-run cost/latency tracking.
Designed a mock-first runner so the hosted demo works with zero API keys and CI never depends on a paid model call.
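
A mock-first runner of the kind described in these bullets might look like the sketch below. It is an assumption-laden illustration rather than the project's code: `Judge`, `JudgeVerdict`, `mockScore`, and `selectJudge` are hypothetical names, the live judge is injected as a factory instead of calling any real SDK, and the API key is passed in as a parameter (a caller might read it from an environment variable).

```typescript
// Hypothetical interfaces; none of these names come from the project.
interface JudgeVerdict {
  score: number; // assumed 0..1 scale
  rationale: string;
  source: "mock" | "live";
}

type Judge = (output: string, rubric: string) => Promise<JudgeVerdict>;

// Deterministic pseudo-score derived from the inputs, so demo and CI runs are repeatable.
// This is a simple rolling hash, not a quality signal.
function mockScore(output: string, rubric: string): number {
  const text = `${rubric}\n${output}`;
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) % 1000;
  }
  return hash / 1000; // always in [0, 0.999]
}

// Mock judge: never makes a network or model call.
const mockJudge: Judge = (output, rubric) =>
  Promise.resolve<JudgeVerdict>({
    score: mockScore(output, rubric),
    rationale: "mock verdict (no model call)",
    source: "mock",
  });

// Mock-first selection: use the live judge only when a non-empty key is supplied.
// The factory is not invoked at all in the mock path.
function selectJudge(
  apiKey: string | undefined,
  makeLiveJudge: (key: string) => Judge,
): Judge {
  if (apiKey === undefined || apiKey.trim() === "") return mockJudge;
  return makeLiveJudge(apiKey);
}
```

Defaulting to the mock and requiring an explicit key to go live means a missing or misconfigured secret degrades to a free, deterministic run instead of a failed build.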

Resume bullets · AI PM

Defined ICP, MVP scope, and a release-gate metric framework (north star, activation, retention, quality, guardrail) for an LLM eval product.
Reframed model evaluation as a shippable release gate, turning a vibes-based decision into a measured one.