Did the chair come back beige? In frame? Brand-safe?
batch_137.png · pink fabric on a beige product
color_fidelity: 0.12 · subject_match: 0.41
image-gen → eval → cdn
AI pipelines aren’t deterministic. The same flow that nailed your last batch will quietly fail when an input drifts off-distribution. Eval scores every output against your rubric - automatically - and flags the ones that shouldn’t ship.
AZ Design Std
Each square is one output, scored against a 5-dimension rubric. The reds didn’t fail because the model crashed - they failed because the chair came back wrong.
3 outputs failed silently - including batch_137.png (fabric mismatch, score 34%)
// would have shipped to the CDN without Eval. anton@ notified by email.
A profile is a rubric: name your dimensions, weight them, set a pass threshold. The judge does the rest.
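How a verdict falls out of a rubric, as a minimal sketch - the function name and the weighted-average combining rule are assumptions, not Eval’s internals; the weights mirror profile.yaml below:
# scoring_sketch.py - hypothetical illustration, not Eval's code
def verdict(dim_scores: dict[str, float], weights: dict[str, float],
            threshold: float) -> str:
    """Weighted average of per-dimension scores (0-1) vs. a 0-100 threshold."""
    total = 100 * sum(weights[d] * dim_scores[d] for d in weights) / sum(weights.values())
    return "PASS" if total >= threshold else "FAIL"

# batch_137.png's two judged dimensions from the hero example:
print(verdict({"subject_match": 0.41, "color_fidelity": 0.12},
              {"subject_match": 0.30, "color_fidelity": 0.15},
              threshold=70))  # -> "FAIL" (≈31 on these two dims alone)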
no_hallucinations down 2pt
github actions on push
new posts only
Point Eval at a URL or RSS feed. It re-scores on every change, every day, every push - and emails you the moment something dips below threshold. No cron jobs to wire up. No CI pipeline to maintain.
target                status    score
/products/chair-117   PASS      100%
/products/chair-118   PASS       95%
/products/chair-119   FAIL       34%
blog/feed.xml         SKIPPED      -
  └ 3 new posts       PASS       93%
anton@                SENT         -
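Under the hood, a watch is just a diff-and-score loop. A minimal sketch of that idea, assuming a "url" form field on /eval/run - only the endpoint and the "profile" field appear in the curl example below; everything else here is hypothetical:
# watcher_sketch.py - hypothetical, not the hosted watcher
import feedparser, requests   # pip install feedparser requests

seen: set[str] = set()

def rescore(feed_url: str, api_key: str) -> None:
    for entry in feedparser.parse(feed_url).entries:
        if entry.id in seen:                  # "new posts only"
            continue
        seen.add(entry.id)
        r = requests.post(
            "https://api.apiai.me/eval/run",
            headers={"Authorization": f"Bearer {api_key}"},
            data={"profile": "az_std", "url": entry.link},  # "url" field is assumed
        ).json()
        if r["verdict"] == "FAIL":
            print(f"below threshold: {entry.link} ({r['score']}%)")  # Eval emails you instead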
Pick dimensions, weights, and a pass threshold. Add good/bad criteria so the LLM judge has examples to anchor on.
# profile.yaml
name: "AZ Design Std"
threshold: 70            # pass mark, 0-100
dims:                    # two of the rubric's five dimensions shown
  - subject_match: .30   # weight
  - color_fidelity: .15
Attach to a pipeline as a step, watch a URL or RSS feed, or hit the eval endpoint directly from CI.
curl -X POST \
  https://api.apiai.me/eval/run \
  -H "Authorization: Bearer $KEY" \
  -F "profile=az_std"
Every output gets a verdict and a stored audit trail. Failures email the team. Pipelines auto-block on FAIL.
# response
verdict: "PASS"
score: 95
latency_ms: 2099
audit_url: ".../r/9d2"
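That response is what makes CI gating a one-liner. A sketch of a gate step, assuming the same endpoint and fields as above - the exit-code convention here is an illustration, not a documented behavior:
# ci_gate_sketch.py - hypothetical gate step
import os, sys, requests

r = requests.post(
    "https://api.apiai.me/eval/run",
    headers={"Authorization": f"Bearer {os.environ['KEY']}"},
    data={"profile": "az_std"},
).json()

print(f"{r['verdict']} ({r['score']}%) · {r['audit_url']}")
sys.exit(0 if r["verdict"] == "PASS" else 1)   # non-zero exit blocks the pipeline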
Same API key. Same dashboard. New gate on every output.