Verification scenarios

Promote a failed episode to a verification scenario and any future candidate policy has to replay it without re-failing before it's eligible for deploy. Same model the AV industry shipped a decade ago, made native to general robotics.

Built on the existing replay regression harness - the customer-side runner, the metric math, and the cross-tenant guards are unchanged. The only new surface is the "these scenarios are mandatory" bookkeeping plus a CI-callable check.

Compare the two: the replay harness measures how much better a candidate is. Verification measures whether it's allowed to ship. Use both.

When to reach for it

You want to…	Reach for
Lock in a known failure mode the next candidate must pass	Promote on the episode detail page
Gate a CI deploy on every critical scenario	`robotrace verify check --candidate ...`
Trigger the gate when a replay sweep finalizes	Happens automatically on eval finalize
Programmatically check the gate from a release script	`rt.verify.check_gate(...)`

Severity, and what it means

Every scenario carries one of three severities:

Severity	Effect on `robotrace verify check`
`critical`	Failing any critical scenario exits non-zero. Blocks the deploy.
`warning`	Reported in the CI summary; exit code stays `0`.
`info`	Tracked for visibility only.

The default when promoting from the portal is warning - moving to critical is a deliberate "this must never regress" call. The CLI surfaces the count of each in its exit summary so you can see what got upgraded over time without a portal trip.
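The severity rules above reduce to a small decision function. The following is a minimal sketch, not the CLI's actual implementation: `evaluate_gate` and the `(severity, status)` tuple shape are hypothetical names chosen for illustration, and it assumes that a critical scenario with no pass yet (failed or pending) blocks the gate.

```python
from collections import Counter


def evaluate_gate(scenarios):
    """Return (exit_code, totals_per_severity).

    `scenarios` is an iterable of (severity, status) pairs, where severity is
    "critical", "warning", or "info" and status is "pass", "fail", or "pending".
    These shapes are assumptions made for this sketch.
    """
    scenarios = list(scenarios)
    totals = Counter(severity for severity, _ in scenarios)
    # Only critical scenarios can block: anything short of a pass blocks.
    blocked = any(
        severity == "critical" and status != "pass"
        for severity, status in scenarios
    )
    return (1 if blocked else 0), dict(totals)
```

A failing warning or info scenario leaves the exit code at 0; a single failing or pending critical one flips it to 1.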

Promoting an episode

Two paths to the same row:

# 1. From the portal: open the episode, click "Promote to verification set".
#    Choose name, severity (defaults to "critical" for failed episodes),
#    and an optional "why this matters" note.
 
# 2. From the SDK / CI:
import robotrace as rt
 
scenario = rt.verify.promote(
    baseline_episode_id="ep-d3e1...",
    name="Teal mug grasp failure",
    severity="critical",
    description="Camera highlight on the rim throws off the grasp.",
)
# {"scenario_id": "...", "promoted": True}

Both paths are idempotent: re-promoting the same episode returns the existing scenario_id and promoted=False. The portal route detail page renders both responses identically.
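To make the idempotency contract concrete, here is an in-memory model of it. This is a sketch under stated assumptions, not the server's code: `ScenarioStore` is a hypothetical stand-in, and it assumes scenarios are keyed by `baseline_episode_id`, which is what "re-promoting the same episode" suggests.

```python
import uuid


class ScenarioStore:
    """Hypothetical in-memory stand-in for the scenario table."""

    def __init__(self):
        self._by_episode = {}

    def promote(self, baseline_episode_id, name, severity, description=""):
        existing = self._by_episode.get(baseline_episode_id)
        if existing is not None:
            # Re-promotion: hand back the existing row, flag promoted=False.
            return {"scenario_id": existing["scenario_id"], "promoted": False}
        scenario = {
            "scenario_id": str(uuid.uuid4()),
            "name": name,
            "severity": severity,
            "description": description,
        }
        self._by_episode[baseline_episode_id] = scenario
        return {"scenario_id": scenario["scenario_id"], "promoted": True}
```

The practical consequence for CI is that calling promote on every run is safe: the second and later calls return the same `scenario_id` and create nothing new.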

Running the gate from CI

The CI entry point is one CLI call:

robotrace verify check \
  --candidate pap-v4.0.0-rc1 \
  --policy my_pkg.policies:v4_rc1

It does three things:

1. Calls POST /api/verify/check to read the current gate state for this candidate version.
2. For every active critical scenario that hasn't already passed for this candidate, it opens a small replay run using the same harness as robotrace replay run and records a verification result per scenario.
3. Re-evaluates the gate and exits 0 (pass) or 1 (blocked).
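The three steps can be sketched as a plain function with the network and replay pieces injected. This is an illustration only: `run_verify_check` and its three callable parameters are hypothetical, and the gate dict is assumed to carry the `passed`, `scenarios`, `severity`, and `latest_status` fields shown in the gate body elsewhere on this page.

```python
def run_verify_check(read_gate, replay_scenario, record_result, candidate):
    """Sketch of the check loop; returns (exit_code, final_gate).

    read_gate(candidate) -> gate dict
    replay_scenario(scenario, candidate) -> "pass" or "fail"
    record_result(scenario_id, candidate, status) -> None
    All three callables are placeholders supplied by the caller.
    """
    # Step 1: read the current gate state for this candidate.
    gate = read_gate(candidate)

    # Step 2: replay only critical scenarios that have not passed yet.
    for scenario in gate["scenarios"]:
        if scenario["severity"] != "critical":
            continue
        if scenario.get("latest_status") == "pass":
            continue
        status = replay_scenario(scenario, candidate)
        record_result(scenario["scenario_id"], candidate, status)

    # Step 3: re-evaluate and map the result to an exit code.
    gate = read_gate(candidate)
    return (0 if gate["passed"] else 1), gate
```

Skipping scenarios that already passed is what keeps a second invocation for the same candidate cheap.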

Verify check: candidate=pap-v4.0.0-rc1 policy=my_pkg.policies:v4_rc1
✓ Eval run created: a17c9d2e-...
  ✓ [1/2] ep-d3e1 → c1f2a4b8
  ✓ [2/2] ep-9a4f → 7e3d1f2c
 
✗ BLOCKED: 1/2 critical scenarios pass (1 failed, 0 pending)
  · Critical scenario "Teal mug grasp failure" failed for pap-v4.0.0-rc1.
 
View scenarios: https://app.robotrace.dev/portal/verify

Run it as a step in your CI job: the command itself exits non-zero on BLOCKED, so the step fails without any output parsing. The portal link in the trailing line is a real hyperlink in terminals that support OSC 8 - the same affordance robotrace login already uses.

Flags

Flag	Required	What
`--candidate`	yes	Stable identifier for the candidate (e.g. `pap-v4.0.0-rc1`)
`--policy`	no	`module:fn` policy callable - needed for scenarios that haven't passed yet
`--dry-run`	no	Run the policy locally, skip uploading verification results
`--profile`	no	Pick a non-default profile from `~/.robotrace/credentials`

Omitting --policy is fine when you just want to read the gate state (e.g. a release script that gates on a candidate the nightly sweep already covered). The command exits 0 if every critical scenario already has a recent pass, 1 otherwise.
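For release scripts written in Python, the flags above can be assembled into an argv list and the exit code read back with `subprocess`. A minimal sketch: `build_verify_command` and `gate_is_open` are hypothetical helpers, and it assumes `--dry-run` is a bare switch while `--profile` takes a profile name, which the flag table implies but does not spell out.

```python
import subprocess


def build_verify_command(candidate, policy=None, dry_run=False, profile=None):
    """Assemble argv for `robotrace verify check` from the documented flags."""
    cmd = ["robotrace", "verify", "check", "--candidate", candidate]
    if policy is not None:
        cmd += ["--policy", policy]       # module:fn policy callable
    if dry_run:
        cmd.append("--dry-run")           # assumed to be a bare switch
    if profile is not None:
        cmd += ["--profile", profile]     # assumed to take a profile name
    return cmd


def gate_is_open(candidate, **flags):
    """Run the CLI and treat exit code 0 as pass, anything else as blocked."""
    completed = subprocess.run(build_verify_command(candidate, **flags))
    return completed.returncode == 0
```

Calling `gate_is_open("pap-v4.0.0-rc1")` with no policy corresponds to the read-only use described above.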

Auto-sync from eval runs

When robotrace replay run finalizes a campaign, the server mirrors every per-episode result onto any matching active scenario. If your nightly sweep's baseline set overlaps your verification set, you get the verification rows for free - no second CLI call.

The eval-run detail page renders a Verification scenarios card right under the DiffCard with the pass/fail counts so a CI engineer checking last night's sweep can answer "did this candidate clear the gate?" without leaving the eval view.
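The auto-sync amounts to a join between the sweep's per-episode results and the active scenarios. The sketch below shows one plausible shape for it, and it assumes the join key is `baseline_episode_id`; the text only says "matching", so treat that key, the `active` field, and the `mirror_results` name as assumptions.

```python
def mirror_results(episode_results, scenarios, candidate):
    """Return verification rows for every sweep result that matches a scenario.

    episode_results: dicts with "baseline_episode_id" and "status".
    scenarios: dicts with "scenario_id", "baseline_episode_id", and an
               optional "active" flag (treated as True when missing).
    """
    by_baseline = {
        s["baseline_episode_id"]: s
        for s in scenarios
        if s.get("active", True)
    }
    rows = []
    for result in episode_results:
        scenario = by_baseline.get(result["baseline_episode_id"])
        if scenario is None:
            continue  # episode is in the sweep but not in the verification set
        rows.append({
            "scenario_id": scenario["scenario_id"],
            "candidate_policy_version": candidate,
            "status": result["status"],
        })
    return rows
```

Episodes outside the verification set are ignored, and inactive scenarios never receive rows.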

Programmatic API

robotrace.verify exposes the same surface as the CLI:

import robotrace as rt
 
# Check the gate without running new replays.
gate = rt.verify.check_gate(candidate_policy_version="pap-v4.0.0-rc1")
if not gate["passed"]:
    for blocker in gate["blockers"]:
        print(blocker)
    raise SystemExit(1)
 
# Run the full loop (replay + record + re-check).
exit_code, gate = rt.verify.run_check(
    candidate_policy_version="pap-v4.0.0-rc1",
    policy_callable=my_policy,
)
raise SystemExit(exit_code)

When you already have a candidate result from your own runner and just want to file it against a scenario (without re-running the replay), reach for record_result directly:

rt.verify.record_result(
    scenario_id="...",
    candidate_policy_version="pap-v4.0.0-rc1",
    status="pass",                # omit to let the server derive from metrics
    metrics={"success": True, "reward_total": 14.1},
    candidate_episode_id="ep-c1f2a4b8",
    eval_run_id="a17c9d2e-...",   # optional - link this result to a sweep
)

record_result is what run_check(...) calls internally for each scenario it replays. It is exposed publicly so that customer-side CI with its own replay infra (different runner, exotic policy boot, etc.) can still feed results into the gate without going through our runner.
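If your own runner produces outcomes in bulk, a small adapter keeps the call sites tidy. The helper below is a hypothetical sketch: `build_record_kwargs` is not part of the SDK, it only uses the parameter names shown in the record_result example, and it assumes that leaving out `status` and `eval_run_id` is how you "omit" them.

```python
def build_record_kwargs(scenario_id, candidate, metrics, candidate_episode_id,
                        status=None, eval_run_id=None):
    """Assemble keyword arguments for a record_result call."""
    kwargs = {
        "scenario_id": scenario_id,
        "candidate_policy_version": candidate,
        "metrics": dict(metrics),  # copy so later mutation can't leak in
        "candidate_episode_id": candidate_episode_id,
    }
    if status is not None:
        kwargs["status"] = status            # otherwise the server derives it
    if eval_run_id is not None:
        kwargs["eval_run_id"] = eval_run_id  # optional link to a sweep
    return kwargs
```

The result would then be passed along as `rt.verify.record_result(**kwargs)`.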

The gate body shape:

{
  "candidate_policy_version": "pap-v4.0.0-rc1",
  "passed": false,
  "critical_total": 3,
  "critical_passed": 2,
  "critical_failed": 1,
  "critical_pending": 0,
  "warning_total": 1,
  "warning_failed": 0,
  "scenarios": [
    {
      "scenario_id": "...",
      "name": "Teal mug grasp failure",
      "severity": "critical",
      "baseline_episode_id": "ep-d3e1...",
      "latest_status": "fail",
    },
    ...
  ],
  "blockers": ["Critical scenario \"Teal mug grasp failure\" failed for pap-v4.0.0-rc1."],
}

A blocked gate returns HTTP 422; its JSON body has the same shape as the 200 (pass) response. The CLI normalizes both to exit codes; the Python helper returns the parsed body either way.
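If you call the endpoint without the SDK, the status-code handling looks roughly like this. A sketch only: `parse_gate_response` and `summarize` are hypothetical names, and treating every status other than 200 and 422 as an error is an assumption rather than documented behavior.

```python
import json


def parse_gate_response(status_code, body_text):
    """Map an HTTP response from the gate endpoint to (exit_code, gate)."""
    if status_code not in (200, 422):
        # Auth, validation, and server errors are not gate verdicts.
        raise RuntimeError(f"unexpected status from gate check: {status_code}")
    gate = json.loads(body_text)
    return (0 if status_code == 200 else 1), gate


def summarize(gate):
    """One-line summary built from the documented counter fields."""
    return (
        f"{gate['critical_passed']}/{gate['critical_total']} critical scenarios pass "
        f"({gate['critical_failed']} failed, {gate['critical_pending']} pending)"
    )
```

Both verdicts parse the same way; only the exit code differs.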

Errors

robotrace.verify raises the same typed exceptions as the rest of the SDK. See Errors for the full hierarchy.

Exception	When
`ConfigurationError`	Missing candidate version, or critical scenarios need a `policy_callable` you didn't pass
`AuthError`	API key bad, revoked, or doesn't own a scenario / baseline
`NotFoundError`	Scenario / baseline / candidate episode doesn't exist
`ValidationError`	Server rejected the result payload
`TransportError`	Network failure during artifact download or result POST
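Of the exceptions in the table, only `TransportError` describes a failure that may clear up on its own, so it is the natural candidate for a retry. The sketch below defines a local stand-in class so it runs by itself; in real code you would catch the SDK's exception instead. `with_retries` is a hypothetical helper, and whether re-sending a result POST is safe depends on server-side deduplication, which this page does not specify.

```python
import time


class TransportError(Exception):
    """Local stand-in for the SDK's TransportError (assumption for this sketch)."""


def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(); retry on TransportError with exponential backoff.

    Any other exception propagates immediately, since configuration, auth,
    not-found, and validation failures will not fix themselves on retry.
    """
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransportError:
            if attempt == attempts:
                raise  # out of attempts: surface the network failure
            sleep(base_delay * 2 ** (attempt - 1))
```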

Auto-promote from a failed replay

When a replay sweep finalizes and lands a candidate row in the "Worse" column of the DiffCard, the eval-run finalize route can auto-promote that baseline episode to a warning verification scenario - no portal click required. The scenario gets stamped with a provenance trail (eval run id, candidate version, success-flip metric) so the next reviewer can audit exactly why it was promoted. The feature is off by default. An owner can flip it on per-client from Portal → Settings → Auto-promote regressions, and the global AUTO_PROMOTE_REGRESSIONS_ENABLED env flag has to be on as well - either knob alone is a no-op. Auto-promoted scenarios always land as warning, never critical, so a CI gate is never silently raised by an unattended sweep.
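The auto-promote rules compress into a short guard. This is a sketch of the decision only, with hypothetical names throughout: `maybe_auto_promote`, the `row` dict, and its `column` and `metric` fields are illustrative and do not describe the finalize route's real data model.

```python
def maybe_auto_promote(client_setting_on, global_flag_on, row):
    """Return a scenario draft for a regressed row, or None.

    row: dict with "eval_run_id", "candidate_policy_version",
         "baseline_episode_id", "column" (DiffCard column), and "metric".
    """
    # Both knobs must be on; either one alone is a no-op.
    if not (client_setting_on and global_flag_on):
        return None
    # Only rows that landed in the "Worse" column qualify.
    if row["column"] != "Worse":
        return None
    return {
        "baseline_episode_id": row["baseline_episode_id"],
        "severity": "warning",  # never critical for unattended promotion
        "provenance": {
            "eval_run_id": row["eval_run_id"],
            "candidate_policy_version": row["candidate_policy_version"],
            "metric": row["metric"],
        },
    }
```

Hard-coding the severity is the point of the design: raising a scenario to critical stays a human decision.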

What we explicitly don't ship (V0)

Mirrors the replay harness' deliberate scope cuts - see the project canvas:

Webhooks for external CI providers (GitHub Actions / GitLab / Buildkite integrations) - V1.
Hosted runner - Phase 2+. The schema already has the runner_kind plumbing.
Severity rollups across teams (org-wide deploy gate) - V1.

Don'ts

Don't mark every scenario critical. Reserve it for failure modes you genuinely want to block deploys. A 30-row critical set with one pending check blocks every CI build.
Don't delete an episode that's the baseline for an active scenario - the schema rejects the delete with ON DELETE RESTRICT so you don't accidentally orphan a gate.
Don't run the verify check at the end of your CI pipeline. Front-load it: deploys are cheaper to abort than to roll back.