EVAL · GATE BEFORE, MONITOR AFTER

Five great outputs in a row. The sixth ships broken.

AI pipelines aren’t deterministic. The same flow that nailed your last batch will quietly fail when an input drifts off-distribution. Eval scores every output against your rubric - automatically - and flags the ones that shouldn’t ship.

// the hidden problem: 8% of pipeline outputs drift off-spec without anyone noticing
// silent failures don't trigger errors
avg score · 92% · across 69 evals
pass rate · 97% · 3 profiles, last 7d
caught before ship · 2 · flagged for human review

Last batch · 200 generated chair photos · profile: AZ Design Std

Each square is one output, scored against a 5-dimension rubric. The reds didn’t fail because the model crashed - they failed because the chair came back wrong.

Pass (≥ 70) · Review · Fail

3 outputs failed silently - including batch_137.png (fabric mismatch, score 34%)

// would have shipped to the CDN without Eval. anton@ notified by email.

Use cases

Three teams. Same gate.

A profile is a rubric - name dimensions, weight them, set a threshold. The judge does the rest.

e-commerce / brand ops

Did the chair come back beige? In frame? Brand-safe?

// AZ Design Std rubric
subject_match      .30
background_clean   .25
in_frame           .20
color_fidelity     .15
resolution         .10

× batch_137.png · pink fabric on a beige product
  color_fidelity: 0.12 · subject_match: 0.41 · score 34%

Pipeline: image-gen → eval → cdn
ai engineers

Eval-as-CI for the LLM and image features you’re shipping.

// LLM output quality rubric
accuracy           .30
no_hallucinations  .25
format_valid       .20
latency_ok         .15
tone_match         .10

PR #482 regression test · 1,200 prompts
avg score 91% · no_hallucinations down 2pt

Triggered by GitHub Actions on push (sketched below)
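A minimal sketch of that trigger, assuming a GitHub Actions workflow; the secret name and profile id are illustrative, and only the /eval/run endpoint is taken from this page:

# .github/workflows/eval.yml (sketch)
name: eval-gate
on: push
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # call the eval endpoint on every push; -sf fails the step on HTTP errors
      - run: |
          curl -sf -X POST https://api.apiai.me/eval/run \
            -H "Authorization: Bearer ${{ secrets.APIAI_KEY }}" \
            -F "profile=llm_output_quality"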
editorial / content

Catch a thin AI-written paragraph before it goes live.

// blog posts apiai rubric
clarity      .25
substance    .25
structure    .15
voice_tone   .15
accuracy     .20

RSS watch · /blog/feed.xml · daily 09:00
15 posts scored · 2 flagged for editor review · 88%

Auto-runs on new posts only
THE DIFFERENTIATOR

Most eval tools run when you ask. Eval runs on a schedule.

Point Eval at a URL or RSS feed. It re-scores on every change, every day, every push - and emails you the moment something dips below threshold. No cron jobs to wire up. No CI pipeline to maintain.

  • Watch any URL or RSS feed - single, recurring, or change-based (sketched below).
  • Pipe pipeline outputs straight in via API - score before publish.
  • Email recipients on FAIL or REVIEW. Webhooks coming soon.
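A sketch of the wiring, assuming a watch endpoint alongside /eval/run; the /eval/watch path and all parameter names here are assumptions, not documented API:

# create a recurring watch (endpoint and param names assumed)
curl -X POST https://api.apiai.me/eval/watch \
  -H "Authorization: Bearer $KEY" \
  -F "profile=blog_posts_apiai" \
  -F "url=/blog/feed.xml" \
  -F "schedule=daily 09:00" \
  -F "notify=$EMAIL"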
// monitor activity · last 24h · Live
11:38 scored /products/chair-117 PASS 100%
11:32 scored /products/chair-118 PASS 95%
11:15 scored /products/chair-119 FAIL 34%
10:43 RSS poll · /blog/feed.xml SKIPPED -
09:00 scored 3 new posts PASS 93%
06:27 notified anton@ SENT -

Three steps from rubric to running gate.

Needs an apiai.me API key
1 DEFINE

Create a Profile

Pick dimensions, weights, pass threshold. Add good/bad criteria so the LLM judge has examples to anchor on.

# profile.yaml
name: "AZ Design Std"
threshold: 70
dims:
  - subject_match: .30
  - background_clean: .25
  - in_frame: .20
  - color_fidelity: .15
  - resolution: .10
2 CONNECT

Watch a feed or POST

Attach to a pipeline as a step, watch a URL or RSS feed, or hit the eval endpoint directly from CI.

curl -X POST \
  https://api.apiai.me/eval/run \
  -H "Authorization: Bearer $KEY" \
  -F "profile=az_std"
3 TRUST

Ship with confidence

Every output gets a verdict and a stored audit trail. Failures email the team. Pipelines auto-block on FAIL, as sketched below.

# response
verdict: "PASS"
score: 95
latency_ms: 2099
audit_url: ".../r/9d2"
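
For the auto-block, a pipeline step can branch on that verdict; a minimal sketch, assuming the response is available as JSON and jq is on the box:

# gate a pipeline step on the verdict (JSON response assumed)
verdict=$(curl -sf -X POST https://api.apiai.me/eval/run \
  -H "Authorization: Bearer $KEY" \
  -F "profile=az_std" | jq -r .verdict)
# auto-block on FAIL; REVIEW results go to email, not the gate
[ "$verdict" != "FAIL" ] || exit 1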

Already on apiai.me? Eval is one POST away.

Same API key. Same dashboard. New gate on every output.