Failure intelligence
The fourth pillar of the product is Explain. Record gives you the bytes. Replay makes them playable. Verify gates deploys on them. Explain tells you why a run failed - without making you read a metadata blob.
The moment an episode lands with status="failed", RoboTrace runs
a chain of heuristic rules over its metadata, the replay against
its baseline (if any), and the verification scenarios this
candidate is part of (if any). The output is a ranked list of
findings - structured objects with a title, a description,
confidence, evidence, and a suggested next step.
You see them as a "Failure insights" card on the episode detail page in the portal. Highest-confidence findings first. No button to click - the analyzer runs at finalize time.
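The "highest-confidence findings first" ordering can be sketched as a stable sort over a confidence rank. This is a minimal illustration, not the portal's actual code: the `CONFIDENCE_RANK` mapping and the `rank_findings` helper are hypothetical names, and only the three confidence levels themselves come from the rule table.

```python
# Hypothetical rank table: lower number sorts first.
CONFIDENCE_RANK = {"high": 0, "medium": 1, "low": 2}


def rank_findings(findings):
    """Return findings sorted highest-confidence first.

    sorted() is stable, so findings within the same confidence tier
    keep the order in which the rules produced them.
    """
    return sorted(findings, key=lambda f: CONFIDENCE_RANK[f["confidence"]])


findings = [
    {"code": "duration_anomaly", "confidence": "low"},
    {"code": "battery_low", "confidence": "medium"},
    {"code": "explicit_outcome", "confidence": "high"},
]
ranked = rank_findings(findings)
print([f["code"] for f in ranked])
```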
What triggers it
| When | What happens |
|---|---|
| SDK calls POST /api/ingest/episode/<id>/finalize with status="failed" | Analyzer runs synchronously inside the finalize handler. Failure of the analyzer never fails the finalize. |
| Admin flips an episode to failed from the actions menu | Same analyzer, same result, audited as failure_analysis.completed. |
| Admin clicks "Re-run analysis" on the Failure insights card | Forces a fresh run. Useful after a rule deploy or when new metadata is merged in. |
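The guarantee that an analyzer failure never fails the finalize amounts to running the analyzer inside a guard that logs and swallows exceptions. The sketch below assumes a hypothetical `finalize_episode` function and a dict-shaped episode. It is written in Python for illustration only and does not reflect the real handler.

```python
import logging

logger = logging.getLogger("finalize")


def finalize_episode(episode, status, analyze):
    """Hypothetical finalize handler: record the status, then analyze best-effort."""
    episode["status"] = status
    analysis = None
    if status == "failed":
        try:
            analysis = analyze(episode)
        except Exception:
            # Analyzer errors are logged and swallowed so finalize still succeeds.
            logger.exception("failure analysis errored for episode %s", episode["id"])
    return {"ok": True, "episode_id": episode["id"], "analysis": analysis}


def broken_analyzer(episode):
    """Stand-in analyzer that always crashes, to exercise the guard."""
    raise RuntimeError("rule crashed")


result = finalize_episode({"id": "ep_1"}, "failed", broken_analyzer)
```

Even with a crashing analyzer, `result["ok"]` stays true and the episode keeps its `failed` status.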
The analyzer is idempotent - re-running just upserts the row in
failure_analyses. There's exactly one analysis per episode.
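One way to get "exactly one analysis per episode" is a unique key on the episode plus an upsert. The sketch below uses an in-memory SQLite database as a stand-in for the real store. Only the table name `failure_analyses` comes from this page; the column names and the `save_analysis` helper are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE failure_analyses (
        episode_id       TEXT PRIMARY KEY,  -- enforces one analysis per episode
        analyzer_version INTEGER NOT NULL,
        findings         TEXT NOT NULL      -- JSON-encoded list of findings
    )
    """
)


def save_analysis(conn, episode_id, analyzer_version, findings):
    """Insert the analysis, or overwrite the existing row for this episode."""
    conn.execute(
        """
        INSERT INTO failure_analyses (episode_id, analyzer_version, findings)
        VALUES (?, ?, ?)
        ON CONFLICT(episode_id) DO UPDATE SET
            analyzer_version = excluded.analyzer_version,
            findings = excluded.findings
        """,
        (episode_id, analyzer_version, json.dumps(findings)),
    )
    conn.commit()


# Running the analyzer twice for the same episode leaves a single row.
save_analysis(conn, "ep_1", 1, [{"code": "status_failed_no_reason"}])
save_analysis(conn, "ep_1", 1, [{"code": "battery_low"}])
row_count = conn.execute("SELECT COUNT(*) FROM failure_analyses").fetchone()[0]
stored = json.loads(
    conn.execute(
        "SELECT findings FROM failure_analyses WHERE episode_id = ?", ("ep_1",)
    ).fetchone()[0]
)
```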
The V1 rule set
These are the heuristic rules we ship today. Each one reads the
episode row, its metadata jsonb, and at most three joined
tables (no NPZ download in V1 - the heavy bytes stay in R2).
| Code | Confidence | What it looks for |
|---|---|---|
| explicit_outcome | High | The SDK's EpisodeOutcome typed-metadata payload reports success=False. The user's own ground-truth label - whatever check ran inside the rollout decided the run failed. |
| failure_reason_in_metadata | High | An exception was caught by Episode.__exit__ or the ROS 2 live-record context manager and stamped into metadata.failure_reason. We surface the first line of the traceback verbatim. |
| adapter_upload_error | High | The robot run completed, but an adapter (ros2 / lerobot / gymnasium / generic) couldn't ship the bytes to object storage. Distinct from a policy bug - check R2 connectivity first, not the policy. |
| replay_regression | High | The episode is a replay candidate (source="replay" with metadata.eval_run_id), and its eval_results metrics show it regressed against the baseline. The deploy gate would block this candidate. |
| verification_failed | High | A verification_results row references this episode as a candidate with status="fail". The CI gate (robotrace verify check) will exit non-zero until the scenario passes. |
| battery_low | Medium | Any Battery typed-metadata payload reports percent < 15. Robots in brownout regularly show degraded actuation, IMU drift, and dropped comms - any of which read as policy failures. |
| gymnasium_truncated | Medium | metadata.terminated === false && metadata.truncated === true. The env hit its time limit without termination - the classic "policy got stuck or wanders" RL failure mode. |
| duration_anomaly | Low | duration_s < 50% of the median of the last successful runs from the same robot (≥5 samples). Suggests e-stop, manual abort, env timeout, or an exception we couldn't capture explicitly. |
| status_failed_no_reason | Low | Catch-all - status is failed but nothing above fired. Tells you "we couldn't pin it down" so the analyzer always says something on a failed episode. |
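The duration_anomaly arithmetic from the table can be sketched as follows. The 50% ratio and the 5-sample minimum come from the rule description. The function name, its arguments, and the evidence keys are hypothetical, and how the "last successful runs" are selected is left to the caller.

```python
from statistics import median

MIN_SAMPLES = 5        # rule description: at least 5 successful samples
RATIO_THRESHOLD = 0.5  # rule description: shorter than 50% of the median


def duration_anomaly(duration_s, recent_success_durations):
    """Return an evidence dict if the run is anomalously short, else None."""
    if len(recent_success_durations) < MIN_SAMPLES:
        return None  # not enough history to judge
    med = median(recent_success_durations)
    if duration_s < RATIO_THRESHOLD * med:
        return {"duration_s": duration_s, "median_s": med, "ratio": duration_s / med}
    return None


history = [40.0, 42.0, 38.0, 41.0, 39.0]  # median is 40.0 s
hit = duration_anomaly(12.0, history)      # 12.0 < 20.0, so the rule fires
miss = duration_anomaly(30.0, history)     # 30.0 >= 20.0, so it does not
too_few = duration_anomaly(12.0, history[:3])
```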
Rules are pure functions. New ones slot in without touching the
harness; existing codes never change meaning. When the rule set
changes meaningfully we bump analyzer_version, and stale
analyses show a "v1 available" pill in the card prompting a
re-run.
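A harness built on pure-function rules can be sketched as a list of callables that each return a finding or nothing, with the catch-all appended only when no rule fires. Everything here is illustrative: the metadata layout (a `battery` list with `percent` entries), the `RULES` list, and the `analyze` and `is_stale` helpers are assumptions. Only the rule codes, thresholds, and the `analyzer_version` name come from this page.

```python
ANALYZER_VERSION = 1  # hypothetical constant, bumped when the rule set changes


def battery_low(episode):
    # Assumed layout: metadata["battery"] is a list of {"percent": ...} payloads.
    readings = episode["metadata"].get("battery", [])
    low = [r["percent"] for r in readings if r["percent"] < 15]
    if low:
        return {"code": "battery_low", "confidence": "medium",
                "evidence": {"min_percent": min(low)}}
    return None


def gymnasium_truncated(episode):
    md = episode["metadata"]
    if md.get("terminated") is False and md.get("truncated") is True:
        return {"code": "gymnasium_truncated", "confidence": "medium", "evidence": {}}
    return None


# Adding a rule means appending a function here; the harness is unchanged.
RULES = [battery_low, gymnasium_truncated]


def analyze(episode, rules=RULES):
    """Run every rule over the episode and collect the findings that fired."""
    findings = [f for f in (rule(episode) for rule in rules) if f is not None]
    if not findings:
        # Catch-all so a failed episode always gets at least one finding.
        findings.append({"code": "status_failed_no_reason", "confidence": "low",
                         "evidence": {}})
    return {"analyzer_version": ANALYZER_VERSION, "findings": findings}


def is_stale(analysis, current_version=ANALYZER_VERSION):
    """True when the stored analysis predates the current rule set."""
    return analysis["analyzer_version"] < current_version


episode = {"metadata": {"battery": [{"percent": 62}, {"percent": 11}],
                        "terminated": False, "truncated": True}}
analysis = analyze(episode)
```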
Anatomy of a finding
Each finding is a structured object with these fields:
{
  "code": "replay_regression",
  "title": "Candidate failed where baseline succeeded",
  "description": "This episode is a replay candidate and its eval-results metrics show worse performance than the baseline. The deploy gate (verify check) will block any candidate carrying this regression.",
  "confidence": "high",
  "evidence": {
    "eval_run_id": "e2…",
    "baseline_episode_id": "b1…",
    "success_delta": -1.0,
    "reward_delta": -0.42
  },
  "suggested_action": "Open the eval run, side-by-side the baseline vs candidate timelines, and look for the moment where the action trajectories diverge."
}

The portal renders that as a card row with a confidence pill, the title, the description, an evidence grid, and a "next step" suggestion. Same shape regardless of which rule fired.
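If you consume findings from Python, the shape above maps naturally onto a small typed record. The dataclass below is a sketch, not an SDK type. The field names mirror the JSON keys, while the class name `Finding` and the example title, description, and evidence values are made up for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Literal

Confidence = Literal["high", "medium", "low"]


@dataclass(frozen=True)
class Finding:
    """Hypothetical typed mirror of one finding object."""
    code: str
    title: str
    description: str
    confidence: Confidence
    suggested_action: str
    evidence: dict[str, Any] = field(default_factory=dict)


# Illustrative values only; real titles and evidence keys are set by each rule.
finding = Finding(
    code="battery_low",
    title="Battery below 15%",
    description="A Battery payload reported a charge under the threshold.",
    confidence="medium",
    suggested_action="Recharge the robot and re-run before debugging the policy.",
    evidence={"min_percent": 11},
)
payload = asdict(finding)  # plain dict with the same keys as the JSON shape
```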
What it won't tell you (yet)
V1 is metadata-only by design. Things the analyzer doesn't do today:
- Read NPZ sensor / action data. It only inspects what's in the jsonb metadata bag plus joined eval/verify rows. The next iteration will sign GET URLs for the sensor blob and compute joint-effort spikes, IMU shocks, and OOD share over the observation distribution.
- Write a narrative summary. The findings are structured; a future LLM layer will read the same rows and produce a paragraph-length explanation on top.
- Backfill historical failures. Episodes that failed before the analyzer shipped don't have an analysis row. Re-run from the admin actions menu (per-episode) - a backfill cron is not in V1.
Audit + visibility
Every analyzer run writes a row to audit_log:
- failure_analysis.completed - normal path, with finding_codes and analyzer_version in the metadata column.
- failure_analysis.errored - the harness caught an exception but persisted whatever findings it could.
Filter to Failure analysis on /admin/audit to see them in
context.
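The two audit actions can be sketched as a single helper that picks the action name based on whether the harness caught an exception. The action names and the `finding_codes` / `analyzer_version` metadata keys come from this page. The `audit_entry` function, the row layout, and the `error` key on the errored path are assumptions.

```python
def audit_entry(episode_id, analyzer_version, findings, error=None):
    """Build a hypothetical audit_log row for one analyzer run."""
    action = (
        "failure_analysis.errored" if error is not None
        else "failure_analysis.completed"
    )
    metadata = {
        "finding_codes": [f["code"] for f in findings],
        "analyzer_version": analyzer_version,
    }
    if error is not None:
        # Assumed field: the page does not say what the errored row carries.
        metadata["error"] = str(error)
    return {"action": action, "episode_id": episode_id, "metadata": metadata}


completed = audit_entry("ep_1", 1, [{"code": "battery_low"}])
errored = audit_entry("ep_2", 1, [{"code": "explicit_outcome"}],
                      error=RuntimeError("rule crashed"))
```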
Privacy
The analyzer runs server-side inside the existing Vercel
deployment. No external services, no LLM calls in V1. Findings
are written to the failure_analyses table with the same RLS
shape as eval_results - org members see their own client's
analyses, admins see all of them via the service role.