Failure intelligence

The fourth pillar of the product is Explain. Record gives you the bytes. Replay makes them playable. Verify gates deploys on them. Explain tells you why a run failed - without making you read a metadata blob.

The moment an episode lands with status="failed", RoboTrace runs a chain of heuristic rules over three inputs: the episode's metadata, its replay comparison against a baseline (if any), and the verification scenarios the episode is a candidate in (if any). The output is a ranked list of findings - structured objects with a code, a title, a description, a confidence level, evidence, and a suggested next step.

You see them as a "Failure insights" card on the episode detail page in the portal. Highest-confidence findings first. No button to click - the analyzer runs at finalize time.
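
As a rough illustration of "highest-confidence findings first", the ordering could look like the sketch below. The rank table and the rank_findings helper are hypothetical; the docs only state that higher confidence sorts first, so the tie-breaking shown here (keep rule order within a tier) is an assumption.

```python
# Hypothetical rank table: lower number = shown earlier on the card.
CONFIDENCE_RANK = {"high": 0, "medium": 1, "low": 2}


def rank_findings(findings):
    """Sort findings high -> medium -> low.

    sorted() is stable, so findings with equal confidence keep the
    order in which the rules produced them (an assumed tie-break).
    """
    return sorted(findings, key=lambda f: CONFIDENCE_RANK[f["confidence"]])


findings = [
    {"code": "duration_anomaly", "confidence": "low"},
    {"code": "battery_low", "confidence": "medium"},
    {"code": "explicit_outcome", "confidence": "high"},
]
ranked = rank_findings(findings)
```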

What triggers it

| When | What happens |
| --- | --- |
| SDK calls POST /api/ingest/episode/<id>/finalize with status="failed" | Analyzer runs synchronously inside the finalize handler. Failure of the analyzer never fails the finalize. |
| Admin flips an episode to failed from the actions menu | Same analyzer, same result, audited as failure_analysis.completed. |
| Admin clicks "Re-run analysis" on the Failure insights card | Forces a fresh run. Useful after a rule deploy or when new metadata is merged in. |
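
The "failure of the analyzer never fails the finalize" guarantee is easiest to picture as a try/except around the analyzer call. The sketch below is not the real handler: finalize_episode, its arguments, and the response shape are all made up for illustration.

```python
import logging

logger = logging.getLogger("finalize")


def finalize_episode(episode, status, analyze):
    """Hypothetical finalize handler: record the status, then analyze failures."""
    episode["status"] = status
    if status == "failed":
        try:
            # Runs synchronously, before the response is returned.
            analyze(episode)
        except Exception:
            # An analyzer crash is logged and swallowed; finalize still succeeds.
            logger.exception("failure analysis crashed for episode %s", episode["id"])
    return {"id": episode["id"], "status": status}


def broken_analyzer(episode):
    raise RuntimeError("rule blew up")


# Even with an analyzer that always raises, the caller gets a normal response.
response = finalize_episode({"id": "ep_1"}, "failed", broken_analyzer)
```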

The analyzer is idempotent - re-running just upserts the row in failure_analyses. There's exactly one analysis per episode.
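
A minimal sketch of that upsert, using an in-memory SQLite database as a stand-in for whatever store backs failure_analyses. Only the table name comes from the text above; the column names and the save_analysis helper are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema. The primary key on episode_id is what enforces
# "exactly one analysis per episode".
conn.execute(
    "CREATE TABLE failure_analyses ("
    " episode_id TEXT PRIMARY KEY,"
    " analyzer_version INTEGER NOT NULL,"
    " findings TEXT NOT NULL)"
)


def save_analysis(conn, episode_id, analyzer_version, findings):
    """Insert the analysis, or overwrite the existing row for this episode."""
    conn.execute(
        "INSERT INTO failure_analyses (episode_id, analyzer_version, findings)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT(episode_id) DO UPDATE SET"
        " analyzer_version = excluded.analyzer_version,"
        " findings = excluded.findings",
        (episode_id, analyzer_version, json.dumps(findings)),
    )
    conn.commit()


# Running twice for the same episode leaves one row holding the latest result.
save_analysis(conn, "ep_1", 1, [{"code": "status_failed_no_reason"}])
save_analysis(conn, "ep_1", 1, [{"code": "battery_low"}])

row_count = conn.execute("SELECT COUNT(*) FROM failure_analyses").fetchone()[0]
stored = json.loads(
    conn.execute(
        "SELECT findings FROM failure_analyses WHERE episode_id = ?", ("ep_1",)
    ).fetchone()[0]
)
```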

The V1 rule set

These are the heuristic rules we ship today. Each one reads the episode row, its metadata jsonb, and at most three joined tables (no NPZ download in V1 - the heavy bytes stay in R2).

| Code | Confidence | What it looks for |
| --- | --- | --- |
| explicit_outcome | High | The SDK's EpisodeOutcome typed-metadata payload reports success=False. The user's own ground-truth label - whatever check ran inside the rollout decided the run failed. |
| failure_reason_in_metadata | High | An exception was caught by Episode.__exit__ or the ROS 2 live-record context manager and stamped into metadata.failure_reason. We surface the first line of the traceback verbatim. |
| adapter_upload_error | High | The robot run completed, but an adapter (ros2 / lerobot / gymnasium / generic) couldn't ship the bytes to object storage. Distinct from a policy bug - check R2 connectivity first, not the policy. |
| replay_regression | High | The episode is a replay candidate (source="replay" with metadata.eval_run_id), and its eval_results metrics show it regressed against the baseline. The deploy gate would block this candidate. |
| verification_failed | High | A verification_results row references this episode as a candidate with status="fail". The CI gate (robotrace verify check) will exit non-zero until the scenario passes. |
| battery_low | Medium | Any Battery typed-metadata payload reports percent < 15. Robots in brownout regularly show degraded actuation, IMU drift, and dropped comms - any of which read as policy failures. |
| gymnasium_truncated | Medium | metadata.terminated === false && metadata.truncated === true. The env hit its time limit without termination - the classic "policy got stuck or wandered" RL failure mode. |
| duration_anomaly | Low | duration_s < 50% of the median of the last successful runs from the same robot (≥5 samples). Suggests e-stop, manual abort, env timeout, or an exception we couldn't capture explicitly. |
| status_failed_no_reason | Low | Catch-all - status is failed but nothing above fired. Tells you "we couldn't pin it down" so the analyzer always says something on a failed episode. |
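
To make the thresholds concrete, here is a sketch of three of these rules as plain functions. The conditions (percent < 15, terminated false with truncated true, under 50% of the median with at least 5 samples) come from the table. The metadata layout (a "battery" list), the function signatures, and the evidence keys are assumptions, not the shipped implementation.

```python
from statistics import median


def battery_low(metadata):
    # Assumed layout: Battery payloads stored as a list under metadata["battery"].
    readings = [b["percent"] for b in metadata.get("battery", [])]
    low = [p for p in readings if p < 15]
    if not low:
        return None
    return {"code": "battery_low", "confidence": "medium",
            "evidence": {"min_percent": min(low)}}


def gymnasium_truncated(metadata):
    # Fires only on explicit booleans; missing keys do not count.
    if metadata.get("terminated") is False and metadata.get("truncated") is True:
        return {"code": "gymnasium_truncated", "confidence": "medium", "evidence": {}}
    return None


def duration_anomaly(duration_s, recent_success_durations):
    # Needs at least 5 successful runs from the same robot to form a baseline.
    if len(recent_success_durations) < 5:
        return None
    baseline = median(recent_success_durations)
    if duration_s < 0.5 * baseline:
        return {"code": "duration_anomaly", "confidence": "low",
                "evidence": {"duration_s": duration_s, "median_s": baseline}}
    return None
```

For example, a 10-second failed run against successful runs of 28 to 32 seconds falls under half of the 30-second median, so duration_anomaly fires. With only two prior successes it stays silent.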

Rules are pure functions. New ones slot in without touching the harness; existing codes never change meaning. When the rule set changes meaningfully we bump analyzer_version, and stale analyses show a "v1 available" pill in the card prompting a re-run.
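
One way to picture "rules slot in without touching the harness" is a plain list of functions that the harness iterates over, with the catch-all applied when nothing fires. Everything below is a sketch: RULES, analyze, is_stale, the "outcome" metadata key, and the integer ANALYZER_VERSION are hypothetical names and shapes, and the findings are trimmed to a few fields.

```python
ANALYZER_VERSION = 1  # hypothetical constant, bumped when the rule set changes


def explicit_outcome(episode):
    outcome = episode["metadata"].get("outcome")  # assumed key for EpisodeOutcome
    if outcome and outcome.get("success") is False:
        return {"code": "explicit_outcome", "confidence": "high"}
    return None


def failure_reason_in_metadata(episode):
    reason = episode["metadata"].get("failure_reason")
    if reason:
        # Surface only the first line of the stored traceback text.
        return {"code": "failure_reason_in_metadata", "confidence": "high",
                "evidence": {"first_line": reason.splitlines()[0]}}
    return None


# Adding a rule means appending a function here; analyze() stays unchanged.
RULES = [explicit_outcome, failure_reason_in_metadata]


def analyze(episode):
    findings = [f for rule in RULES if (f := rule(episode)) is not None]
    if not findings:
        # Catch-all so a failed episode always gets at least one finding.
        findings.append({"code": "status_failed_no_reason", "confidence": "low"})
    return {"analyzer_version": ANALYZER_VERSION, "findings": findings}


def is_stale(analysis):
    """True when the stored analysis predates the current rule set."""
    return analysis["analyzer_version"] < ANALYZER_VERSION
```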

Anatomy of a finding

Each finding is a structured object with these fields:

{
  "code": "replay_regression",
  "title": "Candidate failed where baseline succeeded",
  "description": "This episode is a replay candidate and its eval-results metrics show worse performance than the baseline. The deploy gate (verify check) will block any candidate carrying this regression.",
  "confidence": "high",
  "evidence": {
    "eval_run_id": "e2…",
    "baseline_episode_id": "b1…",
    "success_delta": -1.0,
    "reward_delta": -0.42
  },
  "suggested_action": "Open the eval run, side-by-side the baseline vs candidate timelines, and look for the moment where the action trajectories diverge."
}

The portal renders that as a card row with a confidence pill, the title, the description, an evidence grid, and a "next step" suggestion. Same shape regardless of which rule fired.
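
If you consume findings in your own tooling, a small typed container keeps that shape honest. The Finding dataclass below is a sketch that mirrors the fields in the example above; it is not a class shipped by the SDK, and the defaults for evidence and suggested_action are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Literal


@dataclass(frozen=True)
class Finding:
    """Hypothetical container mirroring the documented finding fields."""
    code: str
    title: str
    description: str
    confidence: Literal["high", "medium", "low"]
    evidence: Dict[str, Any] = field(default_factory=dict)
    suggested_action: str = ""


finding = Finding(
    code="replay_regression",
    title="Candidate failed where baseline succeeded",
    description="This episode is a replay candidate and its eval-results "
                "metrics show worse performance than the baseline.",
    confidence="high",
    evidence={"success_delta": -1.0, "reward_delta": -0.42},
    suggested_action="Open the eval run and compare the baseline and "
                     "candidate timelines.",
)

# asdict() yields a plain dict with the same keys as the JSON example.
payload = asdict(finding)
```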

What it won't tell you (yet)

V1 is metadata-only by design. Things the analyzer doesn't do today:

  • Read NPZ sensor / action data. It only inspects what's in the jsonb metadata bag plus joined eval/verify rows. The next iteration will sign GET URLs for the sensor blob and compute joint-effort spikes, IMU shocks, and OOD share over the observation distribution.
  • Write a narrative summary. The findings are structured; a future LLM layer will read the same rows and produce a paragraph-length explanation on top.
  • Backfill historical failures. Episodes that failed before the analyzer shipped don't have an analysis row. To analyze one, re-run it per episode from the admin actions menu - a backfill cron is not in V1.

Audit + visibility

Every analyzer run writes a row to audit_log:

  • failure_analysis.completed - normal path, with finding_codes and analyzer_version in the metadata column.
  • failure_analysis.errored - the harness caught an exception but persisted whatever findings it could.
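
The two audit actions can be sketched as a harness that keeps whatever findings it collected before an exception. The action names, finding_codes, and analyzer_version come from the list above. The run_with_audit helper, the row layout, and the extra "error" field on the errored path are assumptions.

```python
def run_with_audit(episode, rules, analyzer_version=1):
    """Run rules in order; return (findings, audit_row). Hypothetical harness."""
    findings = []
    action = "failure_analysis.completed"
    error = None
    try:
        for rule in rules:
            finding = rule(episode)
            if finding is not None:
                findings.append(finding)
    except Exception as exc:
        # Keep the findings gathered so far and record the errored action.
        action = "failure_analysis.errored"
        error = repr(exc)

    metadata = {
        "finding_codes": [f["code"] for f in findings],
        "analyzer_version": analyzer_version,
    }
    if error is not None:
        metadata["error"] = error  # assumed field, not documented
    audit_row = {"action": action, "episode_id": episode["id"], "metadata": metadata}
    return findings, audit_row


def ok_rule(episode):
    return {"code": "explicit_outcome", "confidence": "high"}


def crashing_rule(episode):
    raise KeyError("missing column")


findings, audit_row = run_with_audit({"id": "ep_1"}, [ok_rule, crashing_rule])
```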

Filter to Failure analysis on /admin/audit to see them in context.

Privacy

The analyzer runs server-side inside the existing Vercel deployment. No external services, no LLM calls in V1. Findings are written to the failure_analyses table with the same RLS shape as eval_results - org members see their own client's analyses, admins see all of them via the service role.