Failure intelligence

The fourth pillar of the product is Explain. Record gives you the bytes. Replay makes them playable. Verify gates deploys on them. Explain tells you why a run failed - without making you read a metadata blob.

The moment an episode lands with status="failed", RoboTrace runs a chain of heuristic rules over three inputs: the episode's metadata, its replay results against the baseline (if any), and the verification scenarios that include it as a candidate (if any). The output is a ranked list of findings - structured objects with a title, a description, confidence, evidence, and a suggested next step.

You see them as a "Failure insights" card on the episode detail page in the portal. Highest-confidence findings first. No button to click - the analyzer runs at finalize time.

What triggers it

When	What happens
SDK calls `POST /api/ingest/episode/<id>/finalize` with `status="failed"`	Analyzer runs synchronously inside the finalize handler. An analyzer failure never fails the finalize call.
Admin flips an episode to `failed` from the actions menu	Same analyzer, same result, audited as `failure_analysis.completed`.
Admin clicks "Re-run analysis" on the Failure insights card	Forces a fresh run. Useful after a rule deploy or when new metadata is merged in.

The analyzer is idempotent - re-running just upserts the row in failure_analyses. There's exactly one analysis per episode.
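As a rough illustration of those two guarantees (an analyzer error never fails finalize, and re-running overwrites the single row per episode), here is a minimal Python sketch. The names `finalize_episode`, `upsert_analysis`, and the in-memory `failure_analyses` dict are hypothetical stand-ins, not RoboTrace's actual handler or schema.

```python
from typing import Callable

# Hypothetical in-memory stand-in for the failure_analyses table,
# keyed by episode id so there is at most one row per episode.
failure_analyses: dict[str, dict] = {}


def upsert_analysis(episode_id: str, findings: list[dict], analyzer_version: int) -> None:
    """Insert or overwrite the single analysis row for this episode."""
    failure_analyses[episode_id] = {
        "episode_id": episode_id,
        "findings": findings,
        "analyzer_version": analyzer_version,
    }


def finalize_episode(
    episode_id: str,
    status: str,
    analyze: Callable[[str], list[dict]],
    analyzer_version: int = 1,
) -> dict:
    """Finalize an episode; run the analyzer only when it failed."""
    response = {"episode_id": episode_id, "status": status, "finalized": True}
    if status == "failed":
        try:
            upsert_analysis(episode_id, analyze(episode_id), analyzer_version)
        except Exception:
            # An analyzer error must not fail the finalize call.
            pass
    return response
```

Calling `finalize_episode` twice for the same failed episode leaves one row behind, which is the behavior "Re-run analysis" relies on.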

The rule set

These are the rules we ship today. Each one is a pure function - new ones slot in without touching the harness; existing codes never change meaning. When the rule set changes meaningfully we bump analyzer_version, and stale analyses show a "v2 available" pill in the card prompting a re-run.
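To make the "pure function plus harness" idea concrete, here is a small Python sketch. It is an assumption about the shape, not the shipped analyzer: the `Finding` dataclass is trimmed down, the `metadata["battery"]` list layout is invented for illustration, and `run_rules` is a hypothetical harness name.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Lower rank sorts first, so high-confidence findings lead the list.
CONFIDENCE_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass(frozen=True)
class Finding:
    code: str
    title: str
    confidence: str
    evidence: dict = field(default_factory=dict)


# A rule is a pure function: episode in, optional finding out.
Rule = Callable[[dict], Optional[Finding]]


def failure_reason_in_metadata(episode: dict) -> Optional[Finding]:
    reason = episode.get("metadata", {}).get("failure_reason")
    if not reason:
        return None
    first_line = reason.splitlines()[0]  # surface only the first line
    return Finding("failure_reason_in_metadata", "Exception recorded in metadata",
                   "high", {"failure_reason": first_line})


def battery_low(episode: dict) -> Optional[Finding]:
    # Assumed layout: a list of {"percent": ...} readings under metadata["battery"].
    readings = episode.get("metadata", {}).get("battery", [])
    low = [r["percent"] for r in readings if r.get("percent", 100) < 15]
    if not low:
        return None
    return Finding("battery_low", "Battery dropped below 15%", "medium",
                   {"min_percent": min(low)})


# Adding a rule means appending to this list; the harness stays untouched.
RULES: list[Rule] = [battery_low, failure_reason_in_metadata]


def run_rules(episode: dict, rules: list[Rule] = RULES) -> list[Finding]:
    findings = [f for rule in rules if (f := rule(episode)) is not None]
    if not findings:
        # Catch-all so a failed episode always gets at least one finding.
        findings.append(Finding("status_failed_no_reason",
                                "No specific cause identified", "low"))
    return sorted(findings, key=lambda f: CONFIDENCE_RANK[f.confidence])
```

Because each rule only reads its input and returns a value, rules can be unit-tested in isolation and reordered freely; the sort at the end decides display order.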

V1 - metadata + joined rows (always on)

The V1 layer reads the episode row, its metadata jsonb, and at most three joined tables. No NPZ download, no external services - fits inside the finalize request's serverless budget.

Code	Confidence	What it looks for
`explicit_outcome`	High	The SDK's `EpisodeOutcome` typed-metadata payload reports `success=False`. The user's own ground-truth label - whatever check ran inside the rollout decided the run failed.
`failure_reason_in_metadata`	High	An exception was caught by `Episode.__exit__` or the ROS 2 live-record context manager and stamped into `metadata.failure_reason`. We surface the first line of the traceback verbatim.
`adapter_upload_error`	High	The robot run completed, but an adapter (ros2 / lerobot / gymnasium / generic) couldn't ship the bytes to object storage. Distinct from a policy bug - check R2 connectivity first, not the policy.
`replay_regression`	High	The episode is a replay candidate (`source="replay"` with `metadata.eval_run_id`), and its eval_results metrics show it regressed against the baseline. The deploy gate would block this candidate.
`verification_failed`	High	A `verification_results` row references this episode as a candidate with `status="fail"`. The CI gate (`robotrace verify check`) will exit non-zero until the scenario passes.
`battery_low`	Medium	Any `Battery` typed-metadata payload reports `percent < 15`. Robots in brownout regularly show degraded actuation, IMU drift, and dropped comms - any of which can read as a policy failure.
`gymnasium_truncated`	Medium	`metadata.terminated === false && metadata.truncated === true`. The env hit its time limit without termination - the classic "policy gets stuck or wanders" RL failure mode.
`duration_anomaly`	Low	`duration_s` is below 50% of the median duration of the most recent successful runs from the same robot (needs ≥5 samples). Suggests e-stop, manual abort, env timeout, or an exception we couldn't capture explicitly.
`status_failed_no_reason`	Low	Catch-all - status is `failed` but nothing above fired. Tells you "we couldn't pin it down" so the analyzer always says something on a failed episode.
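
Two of the V1 checks above are simple enough to sketch in a few lines of Python. This is an illustrative approximation under stated assumptions: the function names and the returned evidence keys are hypothetical, and how many "recent" successful runs are sampled is left to the caller because the table does not specify it.

```python
from statistics import median
from typing import Optional, Sequence


def is_gymnasium_truncated(metadata: dict) -> bool:
    """True when the env hit its time limit without terminating."""
    return metadata.get("terminated") is False and metadata.get("truncated") is True


def duration_anomaly(
    duration_s: float,
    recent_success_durations: Sequence[float],
    min_samples: int = 5,
    ratio: float = 0.5,
) -> Optional[dict]:
    """Return evidence when the run is much shorter than recent successes."""
    if len(recent_success_durations) < min_samples:
        return None  # not enough history to judge
    baseline = median(recent_success_durations)
    if baseline <= 0:
        return None  # degenerate history; avoid dividing by zero
    if duration_s < ratio * baseline:
        return {
            "duration_s": duration_s,
            "median_success_duration_s": baseline,
            "ratio": duration_s / baseline,
        }
    return None
```

Using the median rather than the mean keeps one unusually long successful run from widening the window.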

V2 - NPZ-aware (on when R2 is configured)

V2 adds rules that read the actual sensor / action trajectories from R2. The analyzer caps every dimension (max 16 MB per artifact, max 4096 samples per series, max 8 series per artifact) so a long ROS bag can't drag the analyzer over its request budget. If R2 is not configured or the artifact is too large, the V2 rules skip silently and the V1 rules still fire on their own.

Code	Confidence	What it looks for
`action_saturation`	Medium	An action channel sat at its observed min or max for ≥40% of the run. Classic bang-bang policy or control-authority saturation - the policy lost smooth control of that DOF.
`sensor_flatline_pre_failure`	Medium	A sensor stream stopped changing for ≥2s right before the run ended. Usually a dropped topic, a stuck driver, or comms blackout - the policy keeps acting on a stale observation.
`joint_limit_breach`	Medium	A joint left a conservative envelope: position beyond 270° or velocity beyond 6 rad/s on any joint. Often a unit-mismatch bug (degrees vs radians) or a real physical-stop event.

The V2 thresholds are deliberately wide because we don't yet ship per-robot calibration data. As real pilot episodes flow through R2 we'll replace these heuristics with model-fit thresholds.
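
The caps and two of the V2 checks can be sketched in plain Python as follows. Several details here are assumptions rather than documented behavior: capping by even striding (instead of truncating), counting time at min and max together toward the 40% threshold, treating joint positions as radians, and comparing absolute values. All function names are hypothetical.

```python
import math
from typing import Sequence

MAX_SAMPLES_PER_SERIES = 4096
MAX_SERIES_PER_ARTIFACT = 8


def cap_series(series: Sequence[Sequence[float]]) -> list[list[float]]:
    """Keep at most 8 series and at most 4096 evenly strided samples each."""
    capped = []
    for s in series[:MAX_SERIES_PER_ARTIFACT]:
        stride = max(1, math.ceil(len(s) / MAX_SAMPLES_PER_SERIES))
        capped.append(list(s[::stride]))
    return capped


def saturation_fraction(channel: Sequence[float], tol: float = 1e-6) -> float:
    """Fraction of samples sitting at the channel's observed min or max."""
    if not channel:
        return 0.0
    lo, hi = min(channel), max(channel)
    if hi - lo <= tol:
        return 0.0  # constant channel: no range to saturate against (assumption)
    at_limit = sum(1 for v in channel if v - lo <= tol or hi - v <= tol)
    return at_limit / len(channel)


def action_saturation(channel: Sequence[float], threshold: float = 0.40) -> bool:
    return saturation_fraction(channel) >= threshold


def joint_limit_breach(
    positions_rad: Sequence[float],
    velocities_rad_s: Sequence[float],
    max_pos_deg: float = 270.0,
    max_vel_rad_s: float = 6.0,
) -> bool:
    """True when any sample leaves the conservative envelope."""
    pos_limit = math.radians(max_pos_deg)
    return any(abs(p) > pos_limit for p in positions_rad) or any(
        abs(v) > max_vel_rad_s for v in velocities_rad_s
    )
```

The unit-mismatch case falls out naturally: a position logged in degrees (say 300) but read as radians is far beyond the roughly 4.71 rad limit.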

V2 - LLM narrative (opt-in)

Off by default. When FAILURE_INTEL_LLM_ENABLED=true and OPENAI_API_KEY is set, the analyzer asks gpt-4o-mini to braid the structured findings into a 2-3 sentence Slack-ready paragraph and stores it alongside the findings. The structured findings stay the source of truth - the narrative is decoration, never a replacement. Every failure mode (timeout, rate limit, malformed response) is swallowed; the analyzer still persists the structured findings.
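
A minimal Python sketch of that gating and error-swallowing behavior is below. The `summarize` callable is a hypothetical wrapper around the model call, injected so the example makes no real network request; `llm_enabled` and `analyze_with_narrative` are illustrative names, not the analyzer's actual functions.

```python
import os
from typing import Callable, Mapping


def llm_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Narrative is opt-in: both the flag and the API key must be set."""
    return env.get("FAILURE_INTEL_LLM_ENABLED") == "true" and bool(
        env.get("OPENAI_API_KEY")
    )


def analyze_with_narrative(
    findings: list[dict],
    summarize: Callable[[list[dict]], str],
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Attach an optional narrative; structured findings are always kept."""
    result = {"findings": findings, "narrative": None}
    if not llm_enabled(env):
        return result
    try:
        narrative = summarize(findings)
        # Treat an empty or non-string reply as a malformed response.
        if isinstance(narrative, str) and narrative.strip():
            result["narrative"] = narrative.strip()
    except Exception:
        # Timeouts, rate limits, and similar errors are swallowed.
        pass
    return result
```

Whatever happens inside `summarize`, the caller gets the same `findings` list back, which is what keeps the narrative decorative.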

Anatomy of a finding

Each finding is a structured object with these fields:

{
  "code": "replay_regression",
  "title": "Candidate failed where baseline succeeded",
  "description": "This episode is a replay candidate and its eval-results metrics show worse performance than the baseline. The deploy gate (verify check) will block any candidate carrying this regression.",
  "confidence": "high",
  "evidence": {
    "eval_run_id": "e2…",
    "baseline_episode_id": "b1…",
    "success_delta": -1.0,
    "reward_delta": -0.42
  },
  "suggested_action": "Open the eval run, side-by-side the baseline vs candidate timelines, and look for the moment where the action trajectories diverge."
}

The portal renders that as a card row with a confidence pill, the title, the description, an evidence grid, and a "next step" suggestion. Same shape regardless of which rule fired.
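
If you consume findings outside the portal, a small validator helps catch shape drift early. The Python sketch below assumes all six fields shown above are required and that confidence takes the lowercase values "high", "medium", and "low" (only "high" appears in the example; the other two are inferred from the rule tables). `parse_finding` is a hypothetical helper.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = (
    "code", "title", "description", "confidence", "evidence", "suggested_action",
)
CONFIDENCE_LEVELS = ("high", "medium", "low")  # lowercase values are an assumption


@dataclass(frozen=True)
class Finding:
    code: str
    title: str
    description: str
    confidence: str
    evidence: dict
    suggested_action: str


def parse_finding(raw: str) -> Finding:
    """Parse one finding from JSON and check its basic shape."""
    data = json.loads(raw)
    missing = [k for k in REQUIRED_FIELDS if k not in data]
    if missing:
        raise ValueError(f"finding is missing fields: {missing}")
    if data["confidence"] not in CONFIDENCE_LEVELS:
        raise ValueError(f"unknown confidence: {data['confidence']!r}")
    if not isinstance(data["evidence"], dict):
        raise ValueError("evidence must be an object")
    # Ignore unknown keys so newer analyzer versions can add fields.
    return Finding(**{k: data[k] for k in REQUIRED_FIELDS})
```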

What it won't tell you (yet)

Things the analyzer still doesn't do:

Compute per-robot calibration. V2's NPZ rules use conservative hard-coded thresholds because we don't yet have a fleet of real pilot episodes to fit against. Expect a joint_limit_breach false positive on robots with unusually wide ranges - the evidence panel shows the raw value so you can decide.
Read camera frames. Video bytes stay in R2 untouched. Visual-anomaly rules (collision flash detection, target out of frame, etc.) are still on the roadmap and require an inference hop we don't have today.
Backfill historical failures. Episodes that failed before the analyzer shipped don't have an analysis row. Re-run them one at a time from the admin actions menu; no backfill cron is shipped.

Audit + visibility

Every analyzer run writes a row to audit_log:

failure_analysis.completed - normal path, with finding_codes and analyzer_version in the metadata column.
failure_analysis.errored - the harness caught an exception but persisted whatever findings it could.

Filter to Failure analysis on /admin/audit to see them in context.
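
For a sense of what such a row might carry, here is a hedged Python sketch. Only the two action names and the `finding_codes` / `analyzer_version` metadata keys come from the description above; the `action`, `episode_id`, and `error` keys and the `build_audit_event` helper are hypothetical.

```python
from typing import Optional


def build_audit_event(
    episode_id: str,
    findings: list[dict],
    analyzer_version: int,
    error: Optional[Exception] = None,
) -> dict:
    """Build one audit_log row for an analyzer run."""
    action = (
        "failure_analysis.errored" if error is not None
        else "failure_analysis.completed"
    )
    metadata = {
        # Even on the errored path, record whatever findings were persisted.
        "finding_codes": [f["code"] for f in findings],
        "analyzer_version": analyzer_version,
    }
    if error is not None:
        metadata["error"] = str(error)  # hypothetical extra key
    return {"action": action, "episode_id": episode_id, "metadata": metadata}
```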

Privacy

The V1 + V2 NPZ rules run server-side inside the existing Vercel deployment - no external services. NPZ artifacts are read directly from R2 via the AWS SDK (same credentials the ingest path uses to mint upload URLs); the bytes leave Cloudflare only to be held in the analyzer's memory, and they are never persisted.

The optional V2 LLM narrative sends a compact JSON summary (episode name, robot, policy version, the structured findings) to OpenAI when both FAILURE_INTEL_LLM_ENABLED=true and OPENAI_API_KEY are set. Raw sensor / action samples are never included in the prompt. Disable the flag if your client policy forbids any third-party calls; the structured findings stand on their own.
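
One way to guarantee that raw samples stay out of the prompt is to build the summary from an explicit allowlist rather than by stripping fields. The Python sketch below illustrates that idea only; the key names and the `build_llm_summary` helper are assumptions, not the real payload format.

```python
import json


def build_llm_summary(episode: dict, findings: list[dict]) -> str:
    """Build the compact JSON summary from an explicit allowlist of fields."""
    summary = {
        # Only these episode fields are copied; anything else on the
        # episode (sensor or action arrays included) is never touched.
        "episode_name": episode.get("name"),
        "robot": episode.get("robot"),
        "policy_version": episode.get("policy_version"),
        "findings": findings,
    }
    return json.dumps(summary, separators=(",", ":"))
```

An allowlist fails closed: a new field added to the episode row later stays out of the prompt until someone deliberately adds it here.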

Findings are written to the failure_analyses table with the same RLS shape as eval_results - org members see their own client's analyses, admins see all of them via the service role.