Verification scenarios

Promote a failed episode to a verification scenario and any future candidate policy has to replay it without re-failing before it's eligible for deploy. Same model the AV industry shipped a decade ago, made native to general robotics.

Built on the existing replay regression harness - the customer-side runner, the metric math, and the cross-tenant guards are unchanged. The only new surface is the "these scenarios are mandatory" bookkeeping plus a CI-callable check.

Compare the two: the replay harness measures how much better a candidate is. Verification measures whether it's allowed to ship. Use both.

When to reach for it

| You want to… | Reach for |
| --- | --- |
| Lock in a known failure mode the next candidate must pass | Promote on the episode detail page |
| Gate a CI deploy on every critical scenario | robotrace verify check --candidate ... |
| Trigger the gate when a replay sweep finalizes | Happens automatically on eval finalize |
| Programmatically check the gate from a release script | rt.verify.check_gate(...) |

Severity, and what it means

Every scenario carries one of three severities:

| Severity | Effect on robotrace verify check |
| --- | --- |
| critical | Failing any critical scenario exits non-zero. Blocks the deploy. |
| warning | Reported in the CI summary; exit code stays 0. |
| info | Tracked for visibility only. |

The default when promoting from the portal is warning - moving to critical is a deliberate "this must never regress" call. The CLI surfaces the count of each in its exit summary so you can see what got upgraded over time without a portal trip.
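To make the severity rule concrete, here is a minimal sketch of how per-scenario results could fold into an exit code. It is an illustration, not the CLI's implementation: the function name and the (severity, status) pair shape are assumptions, and it treats a pending critical scenario as blocking, in line with the note under Don'ts.

```python
from collections import Counter

BLOCKING_SEVERITY = "critical"  # only this severity affects the exit code


def gate_exit_code(results):
    """Fold (severity, status) pairs into (exit_code, not_passed_counts).

    status is one of "pass", "fail", "pending" (assumed vocabulary).
    """
    # Count every scenario that has not passed, grouped by severity.
    not_passed = Counter(sev for sev, status in results if status != "pass")
    # Warnings and info are reported but never change the exit code.
    exit_code = 1 if not_passed[BLOCKING_SEVERITY] > 0 else 0
    return exit_code, dict(not_passed)


code, counts = gate_exit_code([
    ("critical", "pass"),
    ("warning", "fail"),
    ("info", "fail"),
])
print(code, counts)
```

A failing warning shows up in the counts but leaves the exit code at 0; flip the first pair to ("critical", "fail") and the code becomes 1.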

Promoting an episode

Two paths to the same row:

# 1. From the portal: open the episode, click "Promote to verification set".
#    Choose name, severity (defaults to "warning"),
#    and an optional "why this matters" note.
 
# 2. From the SDK / CI:
import robotrace as rt
 
scenario = rt.verify.promote(
    baseline_episode_id="ep-d3e1...",
    name="Teal mug grasp failure",
    severity="critical",
    description="Camera highlight on the rim throws off the grasp.",
)
# {"scenario_id": "...", "promoted": True}

Both paths are idempotent: re-promoting the same episode returns the existing scenario_id and promoted=False. The portal route detail page renders both responses identically.
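The idempotency contract is easy to picture with an in-memory stand-in. The sketch below is hypothetical (ScenarioStore is not part of the SDK); it only mirrors the response shape described above, keyed on the baseline episode.

```python
import uuid


class ScenarioStore:
    """Hypothetical in-memory stand-in for the server-side scenario table."""

    def __init__(self):
        self._by_episode = {}  # baseline_episode_id -> scenario record

    def promote(self, baseline_episode_id, name, severity, description=""):
        existing = self._by_episode.get(baseline_episode_id)
        if existing is not None:
            # Re-promotion: hand back the existing row, create nothing new.
            return {"scenario_id": existing["scenario_id"], "promoted": False}
        record = {
            "scenario_id": str(uuid.uuid4()),
            "name": name,
            "severity": severity,
            "description": description,
        }
        self._by_episode[baseline_episode_id] = record
        return {"scenario_id": record["scenario_id"], "promoted": True}


store = ScenarioStore()
first = store.promote("ep-example", "Teal mug grasp failure", "critical")
second = store.promote("ep-example", "Teal mug grasp failure", "critical")
print(first["promoted"], second["promoted"])
```

The practical upshot: a CI job can call promote unconditionally and branch on promoted if it cares whether the row is new.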

Running the gate from CI

The CI entry point is one CLI call:

robotrace verify check \
  --candidate pap-v4.0.0-rc1 \
  --policy my_pkg.policies:v4_rc1

It does three things:

  1. Calls POST /api/verify/check to read the current gate state for this candidate version.
  2. For every active critical scenario that hasn't already passed for this candidate, it opens a small replay run using the same harness as robotrace replay run and records a verification result per scenario.
  3. Re-evaluates the gate and exits 0 (pass) or 1 (blocked).

Example output for a blocked candidate:

Verify check: candidate=pap-v4.0.0-rc1 policy=my_pkg.policies:v4_rc1
✓ Eval run created: a17c9d2e-...
  ✓ [1/2] ep-d3e1 → c1f2a4b8
  ✓ [2/2] ep-9a4f → 7e3d1f2c
 
✗ BLOCKED: 1/2 critical scenarios pass (1 failed, 0 pending)
  · Critical scenario "Teal mug grasp failure" failed for pap-v4.0.0-rc1.
 
View scenarios: https://app.robotrace.dev/portal/verify

Call this from your CI job script; the command already exits non-zero on BLOCKED, so the job fails without extra parsing. The portal link in the trailing line is a real hyperlink in terminals that support OSC 8 - the same affordance robotrace login already uses.
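The three steps the command performs can be sketched as a small loop. This is an illustration under assumptions, not the CLI source: fetch_gate, replay_scenario, and record_result are hypothetical callables standing in for the HTTP call, the replay harness, and the result upload, and the gate is assumed to list only active scenarios.

```python
def run_check(fetch_gate, replay_scenario, record_result):
    """Read the gate, replay unpassed critical scenarios, re-read the gate."""
    gate = fetch_gate()  # step 1: current gate state for this candidate
    for scenario in gate["scenarios"]:
        needs_replay = (
            scenario["severity"] == "critical"
            and scenario["latest_status"] != "pass"
        )
        if needs_replay:
            # step 2: replay the baseline episode and record the outcome
            status = replay_scenario(scenario)
            record_result(scenario["scenario_id"], status)
    gate = fetch_gate()  # step 3: re-evaluate after new results landed
    return (0 if gate["passed"] else 1), gate


# Tiny in-memory backend so the loop can be exercised end to end.
scenarios = [
    {"scenario_id": "s1", "severity": "critical", "latest_status": "pending"},
    {"scenario_id": "s2", "severity": "warning", "latest_status": "fail"},
]


def fetch_gate():
    criticals = [s for s in scenarios if s["severity"] == "critical"]
    passed = all(s["latest_status"] == "pass" for s in criticals)
    return {"passed": passed, "scenarios": [dict(s) for s in scenarios]}


def record_result(scenario_id, status):
    for s in scenarios:
        if s["scenario_id"] == scenario_id:
            s["latest_status"] = status


exit_code, gate = run_check(fetch_gate, lambda scenario: "pass", record_result)
print(exit_code)
```

Note that the failing warning scenario is never replayed and never blocks; only critical scenarios without a pass trigger a replay.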

Flags

| Flag | Required | What |
| --- | --- | --- |
| --candidate | yes | Stable identifier for the candidate (e.g. pap-v4.0.0-rc1) |
| --policy | no | module:fn policy callable - needed for scenarios that haven't passed yet |
| --dry-run | no | Run the policy locally, skip uploading verification results |
| --profile | no | Pick a non-default profile from ~/.robotrace/credentials |

Omitting --policy is fine when you just want to read the gate state (e.g. a release script that gates on a candidate the nightly sweep already covered). The command exits 0 if every critical scenario already has a recent pass, 1 otherwise.
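A release script that only reads the gate might wrap the CLI like this. It is a sketch: the helper names are made up, and it assumes --profile takes the profile name as its value. Only flags from the table above are used.

```python
import subprocess


def build_verify_argv(candidate, policy=None, profile=None, dry_run=False):
    """Assemble the argv for `robotrace verify check` from the documented flags."""
    argv = ["robotrace", "verify", "check", "--candidate", candidate]
    if policy:
        argv += ["--policy", policy]
    if profile:
        argv += ["--profile", profile]  # assumption: flag takes a profile name
    if dry_run:
        argv.append("--dry-run")
    return argv


def gate_allows_deploy(candidate, policy=None, profile=None, dry_run=False,
                       run=subprocess.run):
    """Return True when the CLI exits 0. `run` is injectable for testing."""
    completed = run(build_verify_argv(candidate, policy, profile, dry_run),
                    check=False)
    return completed.returncode == 0


print(build_verify_argv("pap-v4.0.0-rc1"))
```

Without policy, the argv carries just --candidate, which matches the read-only use described above. Passing check=False keeps the non-zero exit from raising, so the script decides what a blocked gate means.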

Auto-sync from eval runs

When robotrace replay run finalizes a campaign, the server mirrors every per-episode result onto any matching active scenario. If your nightly sweep's baseline set overlaps your verification set, you get the verification rows for free - no second CLI call.

The eval-run detail page renders a Verification scenarios card right under the DiffCard with the pass/fail counts so a CI engineer checking last night's sweep can answer "did this candidate clear the gate?" without leaving the eval view.
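Conceptually, the auto-sync is a join between the sweep's per-episode results and the scenario set on baseline episode ID. The sketch below shows that matching on plain dicts; it is a guess at the shape of the logic, not the server's code, and the active field and function name are assumptions.

```python
def mirror_results(episode_results, scenarios):
    """Copy per-episode statuses onto active scenarios with a matching baseline.

    episode_results: {baseline_episode_id: "pass" | "fail"}
    scenarios: list of dicts with scenario_id, baseline_episode_id, active
    Returns the scenario_ids that received a result.
    """
    by_baseline = {
        s["baseline_episode_id"]: s for s in scenarios if s.get("active", True)
    }
    updated = []
    for episode_id, status in episode_results.items():
        scenario = by_baseline.get(episode_id)
        if scenario is None:
            continue  # episode is in the sweep but not in the verification set
        scenario["latest_status"] = status
        updated.append(scenario["scenario_id"])
    return updated


scenarios = [
    {"scenario_id": "s1", "baseline_episode_id": "ep-a", "active": True,
     "latest_status": "pending"},
    {"scenario_id": "s2", "baseline_episode_id": "ep-b", "active": False,
     "latest_status": "pending"},
]
print(mirror_results({"ep-a": "pass", "ep-b": "fail", "ep-c": "pass"}, scenarios))
```

Only the overlap gets a verification row: ep-c has no scenario, and the inactive scenario on ep-b is left untouched.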

Programmatic API

robotrace.verify exposes the same surface as the CLI:

import robotrace as rt
 
# Check the gate without running new replays.
gate = rt.verify.check_gate(candidate_policy_version="pap-v4.0.0-rc1")
if not gate["passed"]:
    for blocker in gate["blockers"]:
        print(blocker)
    raise SystemExit(1)
 
# Run the full loop (replay + record + re-check).
# my_policy is your own policy callable (the object --policy module:fn points at).
exit_code, gate = rt.verify.run_check(
    candidate_policy_version="pap-v4.0.0-rc1",
    policy_callable=my_policy,
)
raise SystemExit(exit_code)

The gate body shape:

{
  "candidate_policy_version": "pap-v4.0.0-rc1",
  "passed": false,
  "critical_total": 3,
  "critical_passed": 2,
  "critical_failed": 1,
  "critical_pending": 0,
  "warning_total": 1,
  "warning_failed": 0,
  "scenarios": [
    {
      "scenario_id": "...",
      "name": "Teal mug grasp failure",
      "severity": "critical",
      "baseline_episode_id": "ep-d3e1...",
      "latest_status": "fail"
    },
    ...
  ],
  "blockers": ["Critical scenario \"Teal mug grasp failure\" failed for pap-v4.0.0-rc1."]
}

HTTP 422 is the blocked response; its JSON body has the same shape as the 200 body. The CLI normalizes both to exit codes; the Python helper returns the parsed body either way.
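If you call the endpoint directly rather than through the SDK, the status handling might look like the sketch below. The function names are hypothetical and the summary string only imitates the CLI's line; treating any status other than 200 or 422 as an error is an assumption.

```python
import json


def parse_gate_response(status_code, body_text):
    """Map a gate response to (exit_code, gate): 200 passes, 422 is blocked."""
    if status_code not in (200, 422):
        raise RuntimeError(f"unexpected status from gate check: {status_code}")
    gate = json.loads(body_text)  # same body shape for both statuses
    return (0 if status_code == 200 else 1), gate


def summarize(gate):
    """One-line summary built from the counters in the gate body."""
    return (
        f"{gate['critical_passed']}/{gate['critical_total']} critical scenarios pass "
        f"({gate['critical_failed']} failed, {gate['critical_pending']} pending)"
    )


body = json.dumps({
    "candidate_policy_version": "pap-v4.0.0-rc1",
    "passed": False,
    "critical_total": 3,
    "critical_passed": 2,
    "critical_failed": 1,
    "critical_pending": 0,
})
exit_code, gate = parse_gate_response(422, body)
print(exit_code, summarize(gate))
```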

Errors

robotrace.verify raises the same typed exceptions as the rest of the SDK. See Errors for the full hierarchy.

| Exception | When |
| --- | --- |
| ConfigurationError | Missing candidate version, or critical scenarios need a policy_callable you didn't pass |
| AuthError | API key bad, revoked, or doesn't own a scenario / baseline |
| NotFoundError | Scenario / baseline / candidate episode doesn't exist |
| ValidationError | Server rejected the result payload |
| TransportError | Network failure during artifact download or result POST |
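Of the exceptions above, TransportError is the one a release script would plausibly retry; the rest point at configuration or data problems that a retry won't fix. Here is a sketch of that policy. The import path for the SDK's exceptions isn't shown in this section, so a local stand-in class is defined, and check_with_retry is a hypothetical helper.

```python
import time


class TransportError(Exception):
    """Local stand-in for the SDK's TransportError (real import path not shown here)."""


def check_with_retry(check, attempts=3, delay_s=1.0):
    """Call check(); retry only on TransportError, up to `attempts` tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except TransportError:
            if attempt == attempts:
                raise  # out of retries: surface the network failure
            time.sleep(delay_s)


# Usage with a fake check that fails twice, then succeeds.
calls = {"n": 0}


def flaky_check():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransportError("connection reset")
    return {"passed": True}


print(check_with_retry(flaky_check, attempts=3, delay_s=0))
```

Any other exception propagates on the first attempt, which is the point: a ConfigurationError or AuthError should fail the job immediately.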

What we explicitly don't ship (V0)

Mirrors the replay harness' deliberate scope cuts - see the project canvas:

  • Webhooks for external CI providers (GitHub Actions / GitLab / Buildkite integrations) - V1.
  • Hosted runner - Phase 2+. The schema already has the runner_kind plumbing.
  • Severity rollups across teams (org-wide deploy gate) - V1.
  • Auto-promote from a failed replay (replay run notices a regression and offers to lock it in) - V1.

Don'ts

  • Don't mark every scenario critical. Reserve it for failure modes you genuinely want to block deploys. A 30-row critical set with one pending check blocks every CI build.
  • Don't try to delete an episode that's the baseline for an active scenario - the schema rejects the delete with ON DELETE RESTRICT so you can't accidentally orphan a gate.
  • Don't run the verify check at the end of your CI pipeline. Front-load it: deploys are cheaper to abort than to roll back.
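The ON DELETE RESTRICT behavior mentioned in the list above is easy to reproduce locally. The sketch uses SQLite as a stand-in (the product's actual database and table names aren't stated here, so episodes and scenarios are illustrative) to show the delete being rejected while a scenario still references the episode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE episodes (id TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE scenarios (
        id TEXT PRIMARY KEY,
        baseline_episode_id TEXT NOT NULL
            REFERENCES episodes(id) ON DELETE RESTRICT
    )
    """
)
conn.execute("INSERT INTO episodes (id) VALUES ('ep-1')")
conn.execute("INSERT INTO scenarios (id, baseline_episode_id) VALUES ('s1', 'ep-1')")

try:
    conn.execute("DELETE FROM episodes WHERE id = 'ep-1'")
    deleted = True
except sqlite3.IntegrityError:
    # The scenario still points at the episode, so the delete is refused.
    deleted = False

remaining = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
print(deleted, remaining)
```

Removing the scenario row first lets the same DELETE go through, which is the order of operations the constraint forces.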