Verification scenarios
Promote a failed episode to a verification scenario and any future candidate policy has to replay it without re-failing before it's eligible for deploy. Same model the AV industry shipped a decade ago, made native to general robotics.
Built on the existing replay regression harness - the customer-side runner, the metric math, and the cross-tenant guards are unchanged. The only new surface is the "these scenarios are mandatory" bookkeeping plus a CI-callable check.
Compare the two: the replay harness measures how much better a candidate is. Verification measures whether it's allowed to ship. Use both.
When to reach for it
| You want to… | Reach for |
|---|---|
| Lock in a known failure mode the next candidate must pass | Promote on the episode detail page |
| Gate a CI deploy on every critical scenario | robotrace verify check --candidate ... |
| Trigger the gate when a replay sweep finalizes | Happens automatically on eval finalize |
| Programmatically check the gate from a release script | rt.verify.check_gate(...) |
Severity, and what it means
Every scenario carries one of three severities:
| Severity | Effect on robotrace verify check |
|---|---|
| critical | Failing any critical scenario exits non-zero. Blocks the deploy. |
| warning | Reported in the CI summary; exit code stays 0. |
| info | Tracked for visibility only. |
The default when promoting from the portal is warning - moving
to critical is a deliberate "this must never regress" call. The
CLI surfaces the count of each in its exit summary so you can see
what got upgraded over time without a portal trip.
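The severity rules can be sketched as a small pure function. This is an illustration only: `summarize` and its `(severity, status)` input shape are hypothetical, not part of the SDK. It assumes a critical scenario that is still pending blocks the gate just like a failed one, which matches the "pass or blocked" behavior described elsewhere on this page.

```python
from collections import Counter

# Severity names taken from the table above.
SEVERITIES = ("critical", "warning", "info")


def summarize(results):
    """Map (severity, status) pairs to an exit code and per-severity counts.

    `status` is assumed to be one of "pass", "fail", or "pending".
    Hypothetical helper for illustration; not the CLI's actual code.
    """
    not_passed = Counter()
    blocked = False
    for severity, status in results:
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        if status != "pass":
            not_passed[severity] += 1
            # Only critical scenarios can change the exit code.
            if severity == "critical":
                blocked = True
    counts = {s: not_passed[s] for s in SEVERITIES}
    return (1 if blocked else 0), counts
```

A failing warning or info scenario shows up in the counts but leaves the exit code at 0; any critical scenario without a pass flips it to 1.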
Promoting an episode
Two paths to the same row:
# 1. From the portal: open the episode, click "Promote to verification set".
# Choose name, severity (defaults to "critical" for failed episodes),
# and an optional "why this matters" note.
# 2. From the SDK / CI:
import robotrace as rt

scenario = rt.verify.promote(
    baseline_episode_id="ep-d3e1...",
    name="Teal mug grasp failure",
    severity="critical",
    description="Camera highlight on the rim throws off the grasp.",
)
# {"scenario_id": "...", "promoted": True}

Both paths are idempotent: re-promoting the same episode returns
the existing scenario_id and promoted=False. The portal route
detail page renders both responses identically.
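To make the idempotency contract concrete, here is a minimal in-memory sketch. `ScenarioStore` is a hypothetical stand-in for the server-side table, and keying on the baseline episode ID is an assumption; only the response shape (`scenario_id`, `promoted`) comes from the example above.

```python
import uuid


class ScenarioStore:
    """In-memory stand-in for the scenario table (hypothetical)."""

    def __init__(self):
        # Assumption: one scenario per baseline episode.
        self._by_episode = {}

    def promote(self, baseline_episode_id, name, severity, description=""):
        existing = self._by_episode.get(baseline_episode_id)
        if existing is not None:
            # Re-promotion: return the existing row and change nothing.
            return {"scenario_id": existing["scenario_id"], "promoted": False}
        row = {
            "scenario_id": str(uuid.uuid4()),
            "name": name,
            "severity": severity,
            "description": description,
        }
        self._by_episode[baseline_episode_id] = row
        return {"scenario_id": row["scenario_id"], "promoted": True}
```

In this sketch a second call for the same episode ignores the new name and severity. Whether the real service updates those fields on re-promotion is not stated here.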
Running the gate from CI
The CI entry point is one CLI call:
robotrace verify check \
  --candidate pap-v4.0.0-rc1 \
  --policy my_pkg.policies:v4_rc1

It does three things:
- Calls POST /api/verify/check to read the current gate state for this candidate version.
- For every active critical scenario that hasn't already passed for this candidate, it opens a small replay run using the same harness as robotrace replay run and records a verification result per scenario.
- Re-evaluates the gate and exits 0 (pass) or 1 (blocked).
Verify check: candidate=pap-v4.0.0-rc1 policy=my_pkg.policies:v4_rc1
✓ Eval run created: a17c9d2e-...
✓ [1/2] ep-d3e1 → c1f2a4b8
✓ [2/2] ep-9a4f → 7e3d1f2c
✗ BLOCKED: 1/2 critical scenarios pass (1 failed, 0 pending)
· Critical scenario "Teal mug grasp failure" failed for pap-v4.0.0-rc1.
View scenarios: https://app.robotrace.dev/portal/verify

Call this from your CI job script: the command exits non-zero on BLOCKED, which fails the job.
The portal link in the trailing line is a real hyperlink in
terminals that support OSC 8 - the same affordance robotrace login
already uses.
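For readers unfamiliar with OSC 8, the escape sequence wraps the visible text between an "open link" and a "close link" marker. The helper below is a generic sketch of that format, not the CLI's own code; `osc8_link` is a hypothetical name.

```python
def osc8_link(url: str, text: str) -> str:
    """Wrap `text` in an OSC 8 hyperlink pointing at `url`."""
    esc = "\x1b"
    st = esc + "\\"  # String Terminator: ESC followed by a backslash
    # Open the link, emit the visible text, then close with an empty URL.
    return f"{esc}]8;;{url}{st}{text}{esc}]8;;{st}"


url = "https://app.robotrace.dev/portal/verify"
print("View scenarios: " + osc8_link(url, url))
```

Terminals without OSC 8 support typically ignore the escape sequence and show only the visible text, which is why using the URL itself as the link text is a safe default.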
Flags
| Flag | Required | What |
|---|---|---|
| --candidate | yes | Stable identifier for the candidate (e.g. pap-v4.0.0-rc1) |
| --policy | no | module:fn policy callable - needed for scenarios that haven't passed yet |
| --dry-run | no | Run the policy locally, skip uploading verification results |
| --profile | no | Pick a non-default profile from ~/.robotrace/credentials |
Omitting --policy is fine when you just want to read the
gate state (e.g. a release script that gates on a candidate the
nightly sweep already covered). The command exits 0 if every
critical scenario already has a recent pass, 1 otherwise.
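The difference between a read-only check and the full loop can be sketched as below. Everything here is a hypothetical stand-in: `verify_check`, `fetch_gate`, `replay`, and `record` are illustrative names wired to in-memory fakes, not the CLI's internals. Only the gate fields (`passed`, `scenarios`, `severity`, `latest_status`, `baseline_episode_id`) follow the gate body documented on this page.

```python
def verify_check(fetch_gate, replay, record, candidate, policy=None):
    """Return (exit_code, gate). With no policy, only read the gate state."""
    gate = fetch_gate(candidate)  # step 1: read the current gate state
    if policy is not None:
        for s in gate["scenarios"]:
            if s["severity"] == "critical" and s["latest_status"] != "pass":
                # step 2: replay the baseline and record the result
                status = replay(policy, s["baseline_episode_id"])
                record(candidate, s["scenario_id"], status)
        gate = fetch_gate(candidate)  # step 3: re-evaluate the gate
    return (0 if gate["passed"] else 1), gate


# In-memory fakes to exercise the flow (all hypothetical).
statuses = {"s1": "fail", "s2": "pass"}


def fetch_gate(candidate):
    scenarios = [
        {"scenario_id": sid, "severity": "critical",
         "baseline_episode_id": f"ep-{sid}", "latest_status": status}
        for sid, status in statuses.items()
    ]
    passed = all(s["latest_status"] == "pass" for s in scenarios)
    return {"passed": passed, "scenarios": scenarios}


def replay(policy, episode_id):
    return "pass" if policy(episode_id) else "fail"


def record(candidate, scenario_id, status):
    statuses[scenario_id] = status


# Read-only: no policy, so the stale failure still blocks.
read_only_code, _ = verify_check(fetch_gate, replay, record, "pap-v4.0.0-rc1")
# Full loop: the candidate policy now passes the failing scenario.
full_code, _ = verify_check(fetch_gate, replay, record, "pap-v4.0.0-rc1",
                            policy=lambda episode_id: True)
```

The read-only call never touches the replay harness, which is what makes it cheap enough to run from a release script.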
Auto-sync from eval runs
When robotrace replay run finalizes a campaign, the server
mirrors every per-episode result onto any matching active scenario.
If your nightly sweep's baseline set overlaps your verification
set, you get the verification rows for free - no second CLI call.
The eval-run detail page renders a Verification scenarios card right under the DiffCard with the pass/fail counts so a CI engineer checking last night's sweep can answer "did this candidate clear the gate?" without leaving the eval view.
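A rough sketch of the mirroring step, assuming scenarios are matched to eval results by baseline episode ID and that inactive scenarios are skipped. The function name and row shapes are hypothetical; the server's actual matching rules are not documented here.

```python
def mirror_results(episode_results, scenarios):
    """Copy per-episode results onto matching active scenarios.

    episode_results: {baseline_episode_id: "pass" | "fail"}
    scenarios: list of dicts with "baseline_episode_id", "active",
               and "latest_status" keys (hypothetical row shape).
    Returns the number of scenarios updated.
    """
    updated = 0
    for scenario in scenarios:
        if not scenario.get("active", True):
            continue  # inactive scenarios never receive results
        status = episode_results.get(scenario["baseline_episode_id"])
        if status is not None:
            scenario["latest_status"] = status
            updated += 1
    return updated
```

Scenarios whose baseline episode was not part of the sweep keep their previous status, so a partial overlap only refreshes the rows it covers.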
Programmatic API
robotrace.verify exposes the same surface as the CLI:
import robotrace as rt

# Check the gate without running new replays.
gate = rt.verify.check_gate(candidate_policy_version="pap-v4.0.0-rc1")
if not gate["passed"]:
    for blocker in gate["blockers"]:
        print(blocker)
    raise SystemExit(1)

# Run the full loop (replay + record + re-check).
# my_policy is your policy callable, defined elsewhere.
exit_code, gate = rt.verify.run_check(
    candidate_policy_version="pap-v4.0.0-rc1",
    policy_callable=my_policy,
)
raise SystemExit(exit_code)

The gate body shape:
{
  "candidate_policy_version": "pap-v4.0.0-rc1",
  "passed": false,
  "critical_total": 3,
  "critical_passed": 2,
  "critical_failed": 1,
  "critical_pending": 0,
  "warning_total": 1,
  "warning_failed": 0,
  "scenarios": [
    {
      "scenario_id": "...",
      "name": "Teal mug grasp failure",
      "severity": "critical",
      "baseline_episode_id": "ep-d3e1...",
      "latest_status": "fail"
    },
    ...
  ],
  "blockers": ["Critical scenario \"Teal mug grasp failure\" failed for pap-v4.0.0-rc1."]
}

HTTP 422 is the blocked response (the JSON body has the same shape as
the 200 response). The CLI normalizes both to exit codes; the Python helper
returns the parsed body either way.
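If you call the endpoint directly rather than through the SDK, the status-code handling might look like the sketch below. `parse_gate_response` is a hypothetical helper; it assumes 200 and 422 are the only statuses that carry a gate body and treats anything else as an error.

```python
import json


def parse_gate_response(status_code, body):
    """Turn an HTTP status and JSON body into (exit_code, gate)."""
    if status_code not in (200, 422):
        # Auth failures, 5xx, and so on are not gate verdicts.
        raise RuntimeError(f"unexpected status {status_code}")
    gate = json.loads(body)
    # 200 means the gate passed; 422 means it is blocked.
    return (0 if status_code == 200 else 1), gate


code, gate = parse_gate_response(
    422,
    '{"passed": false, "blockers": ["Critical scenario failed."]}',
)
for blocker in gate["blockers"]:
    print(blocker)
```

Keeping the parse step identical for both statuses means the caller can print `blockers` without branching on the HTTP code first.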
Errors
robotrace.verify raises the same typed exceptions as the rest
of the SDK. See Errors for the full hierarchy.
| Exception | When |
|---|---|
| ConfigurationError | Missing candidate version, or critical scenarios need a policy_callable you didn't pass |
| AuthError | API key bad, revoked, or doesn't own a scenario / baseline |
| NotFoundError | Scenario / baseline / candidate episode doesn't exist |
| ValidationError | Server rejected the result payload |
| TransportError | Network failure during artifact download or result POST |
What we explicitly don't ship (V0)
Mirrors the replay harness' deliberate scope cuts - see the project canvas:
- Webhooks for external CI providers (GitHub Actions / GitLab / Buildkite integrations) - V1.
- Hosted runner - Phase 2+. The schema already has the runner_kind plumbing.
- Severity rollups across teams (org-wide deploy gate) - V1.
- Auto-promote from a failed replay (replay run notices a regression and offers to lock it in) - V1.
Don'ts
- Don't mark every scenario critical. Reserve it for failure modes you genuinely want to block deploys. A 30-row critical set with one pending check blocks every CI build.
- Don't delete an episode that's the baseline for an active scenario - the schema rejects the delete with ON DELETE RESTRICT so you don't accidentally orphan a gate.
- Don't run the verify check at the end of your CI pipeline. Front-load it: deploys are cheaper to abort than to roll back.