diff --git a/.ai/HANDOFF.md b/.ai/HANDOFF.md index 47ba7648..fe528b98 100644 --- a/.ai/HANDOFF.md +++ b/.ai/HANDOFF.md @@ -9,17 +9,20 @@ branch `feat/l1-ai-tree-builder-phase-2a` (branched from `main` @ `87236b5`), pu Gitea, **PR #193 open** (`main` ← `feat/l1-ai-tree-builder-phase-2a`, mergeable): . -## Resume point — review & merge PR #193 +## Resume point — FIX REVIEW FINDINGS on PR #193, do NOT merge yet -Nothing left to build. Next session: -1. Check Gitea CI on PR #193 (`gitea.resolutionflow.com/chihlasm/resolutionflow/actions` - — `gh` cannot read Gitea CI). If green, review + merge. -2. After merge: `alembic upgrade head` on prod (3 new migrations, head `1fd88a68b145`), - update CURRENT-STATE.md + roadmap. -3. **Before wide enablement (spec §5.3):** run a live constrained-decoding smoke test for - `ai_tree_builder.generate_next_node` and benchmark Sonnet vs Opus for the - `l1_realtime_build` action key. All model calls are mocked in tests — AI *quality* is - unverified against a live model. +A 2026-06-09 multi-agent review (7 finder angles, every finding code-verified) found +**10 confirmed defects** — including a showstopper (AI-generated nodes carry no `id`, +so ai_build walks can never advance past the first question) and proof that Tasks 16–17 +(ProposalDetail L1-source block, L1EscalationsSection mount) were recorded as done here +but were **never committed**. Full findings, evidence (file:line), fixes, and execution +order: [`docs/plans/2026-06-09-pr193-phase2a-review-findings.md`](../docs/plans/2026-06-09-pr193-phase2a-review-findings.md). + +Next session: work that doc top-to-bottom (findings 1–7 are merge blockers), re-run the +Phase 2A test gate + tsc/lint/build + migration roundtrip, then resume the old plan: +merge PR #193, prod `alembic upgrade head` (3 migrations, head `1fd88a68b145`), and the +live AI-quality smoke test before wide enablement (spec §5.3 — all model calls are +mocked in tests). ## What shipped (all verified this session) diff --git a/docs/plans/2026-06-09-pr193-phase2a-review-findings.md b/docs/plans/2026-06-09-pr193-phase2a-review-findings.md new file mode 100644 index 00000000..18f9eb46 --- /dev/null +++ b/docs/plans/2026-06-09-pr193-phase2a-review-findings.md @@ -0,0 +1,143 @@ +# PR #193 (Phase 2A — L1 AI Tree Builder) Review Findings + +**Date:** 2026-06-09 +**Reviewed:** `feat/l1-ai-tree-builder-phase-2a` vs `main` (42 files, +2,326/−154) +**Process:** 7 independent finder angles, every candidate independently verified against actual code (quoted lines confirmed, not speculation). +**Verdict: DO NOT MERGE as-is.** The headline feature (AI-guided walkthrough) is non-functional end-to-end, two tasks recorded as complete in `.ai/HANDOFF.md` were never actually committed, and one DB constraint is a deletion time bomb. + +How to use this file: work the findings in order. Findings 1–7 are merge blockers; 8–10 can be fast-follows. Each finding lists the verified evidence (file:line) and a suggested fix. Several findings share two root causes — fix those at the root rather than patching symptoms: + +- **Root cause A:** AI-generated nodes have no `id`, but the advance protocol keys on `node_id`. (Finding 1; touches 8.) +- **Root cause B:** The intake category is smuggled into `walked_path` as a fake `{"node_type":"meta"}` entry that every consumer must know to skip — and most don't. (Findings 2b, 9; the deeper fix is a real `category` column on `l1_walk_sessions`, plus `problem_text` while you're there — see Finding 8's note.) + +--- + +## MERGE BLOCKERS + +### 1. AI walkthrough can never advance past the first question (showstopper) + +**Evidence:** +- `backend/app/services/ai_tree_builder.py:37-41` — SYSTEM_PROMPT's JSON output shapes (`{"node_type":"question","text":...}` etc.) define **no `id` field**; `validate_node` (lines 71-79) returns the node unchanged; nothing anywhere assigns an id. +- `backend/app/services/l1_session_service.py:156` — `advance_ai_build` only appends to `walked_path` `if node_id is not None`; docstring (line 139) says "On the first call (node_id is None) nothing is appended." +- `frontend/src/components/l1/L1WalkTreeVariant.tsx:52-54` — sends `node_id: node.id`, which is `undefined` at runtime (server never sends an id; `JSON.stringify` drops undefined keys) → backend always receives `node_id=None`. +- `l1_session_service.py:174` — `session.current_node_id = next_node.get("id")` is always `None`. + +**User impact:** Tech answers the first question → answer is discarded, the same (or a re-rolled) first question regenerates forever. `walked_path` never grows past the meta entry, the depth cap never fires, and resolve captures an empty tree. + +**Why tests missed it:** `test_l1_api_ai_build` and friends mock `advance_ai_build` / hand-craft nodes **with** `id` keys — a shape the real model is never instructed to produce. + +**Fix:** Assign a server-side id to every generated node before returning it (e.g., `uuid4().hex[:8]` in `generate_next_node` after `validate_node`), persist it as `session.current_node_id`, and add a test that runs the real (unmocked-shape) prompt contract: generate → assert node has id → advance with that id → assert walked_path grew. Do NOT ask the LLM to invent ids. + +### 2. Escalations from AI sessions go nowhere (two linked defects) + +**2a — Component never mounted.** +- `grep -rn "L1EscalationsSection" frontend/src` → exactly one hit: its own definition (`frontend/src/components/l1/L1EscalationsSection.tsx:10`). It is imported nowhere. +- `frontend/src/pages/EscalationQueuePage.tsx:3` imports only `EscalationQueue, EscalationMetricCard` from `@/components/flowpilot`. +- `backend/app/services/notification_service.py:449` — `"l1.session.escalated": "/escalations"` deep-link → `frontend/src/router.tsx:299` renders `EscalationQueuePage` → engineer sees only FlowPilot escalations; the L1 handoff is invisible. `GET /l1/escalations` (`backend/app/api/endpoints/l1.py:330`) has no UI surface. +- **Note:** `.ai/HANDOFF.md:38` claims "L1EscalationsSection on EscalationQueuePage" — that claim is false (documentation drift; see Finding 3). + +**2b — Component renders wrong fields once mounted.** +- `backend/app/services/l1_session_service.py:162-168` — ai_build entries are `{"node_type", "id", "text", "answer", "l1_note"}` (key is `text`); legacy `record_step` (lines 199-204) uses `question`/`node_id`. +- `L1EscalationsSection.tsx:61` renders `{step.question}` → blank for every ai_build entry. +- `L1EscalationsSection.tsx:46` — `{s.walked_path.length} steps walked` counts the hidden meta entry → "N+1 steps walked". +- `backend/app/api/endpoints/l1.py:41` — `_to_response` returns `walked_path` raw (meta entry included). + +**Fix:** Mount `L1EscalationsSection` on `EscalationQueuePage` (or fold L1 rows into the existing queue). Render `step.question ?? step.text`. Filter `node_type === 'meta'` entries — ideally server-side in `_to_response` (or eliminate the meta entry entirely per Root cause B). Also use the shared `timeAgo` util (`frontend/src/lib/timeAgo.ts`) instead of `new Date(...).toLocaleString()` at line 50, to match every sibling queue. + +### 3. Two tasks recorded as complete were never committed + +- Task 16 ("ProposalDetail L1-source block") and Task 17 (mounting the escalations section) appear in `.ai/HANDOFF.md` / `SESSION_LOG.md` as done, but **no hunk for `ProposalDetail.tsx` exists in the diff**, and Finding 2a proves the mount never happened. +- Concrete user impact today: `frontend/src/components/flowpilot/ProposalDetail.tsx:91-101` renders the "Source Session" card unconditionally; line 95 is `` to={`/pilot/${proposal.source_session_id}`} `` with no null guard. L1-sourced proposals (created with `source_session_id=None`, `l1_session_id=`) reach the review queue as `pending` → engineers get a broken **`/pilot/null`** link. + +**Fix:** Implement the missing work: in `ProposalDetail.tsx`, gate the `/pilot/` link on `source_session_id != null` and render an L1-source block (problem statement, category, link to the L1 session / escalations view) when `l1_session_id` is set. The backend already serves `l1_session_id` via `backend/app/schemas/flow_proposal.py`. Then correct `.ai/HANDOFF.md`. + +### 4. "Use this flow" button silently does nothing + +- `frontend/src/pages/l1/L1Dashboard.tsx:77-86` — `useSuggestedFlow` re-POSTs `/l1/intake` with the same text, no `flow_id`. The in-code comment ("it matches again and returns a `matched` outcome") is factually wrong: the same text scores in the same 0.60–0.75 suggest band (`backend/app/services/match_or_build.py:66-72`, `MATCH_THRESHOLD = 0.75` line 21) → `suggest` again, no `session_id` → handler falls to `resetPrompts()` and the card vanishes. The suggested flow can never be started. +- `backend/app/api/endpoints/l1.py` — the rewritten intake **never reads `payload.flow_id`** (old branch deleted, diff confirms); `IntakeRequest.flow_id` (`backend/app/schemas/l1.py:13`) is now dead. + +**Fix:** Make intake honor an explicit `flow_id` (bypass the matcher, call `start_flow_session` directly — restores the deleted behavior), and have the suggest card pass `near_miss.flow_id`. This also kills the wasteful re-run of the embedding + pgvector + keyword pipeline just to rediscover a flow_id the client already holds. + +### 5. Out-of-scope problems lost the ad-hoc walk fallback + +- Old intake had `else: start_adhoc_session(...)`; the rewrite (`backend/app/api/endpoints/l1.py:88-102`) dispatches only matched/build/suggest/out_of_scope. `start_adhoc_session` (`l1_session_service.py:82`) now has **zero callers** — ad-hoc sessions are unreachable product-wide (the only remaining `session_kind="adhoc"` creation is `escalate_without_walk`, an audit record, not walkable). +- `L1Dashboard.tsx:269-292` — out_of_scope prompt offers only "Escalate to engineering" / "Cancel". +- Stale copy: `frontend/src/pages/account/L1CategoriesPage.tsx:57-58` still promises "Disabled categories fall back to an ad-hoc walk or escalation." A diff test comment also claims adhoc "is offered from the out_of_scope prompt" — it is not. + +**Fix (decide deliberately, don't drift):** Either (a) add a "Walk it ad-hoc" option to the out_of_scope prompt that hits a path creating an adhoc session (restore the capability), or (b) if dropping ad-hoc is intentional, fix the L1CategoriesPage copy and the test comment, and note the decision in `.ai/DECISIONS.md`. Option (a) preserves pre-existing user capability; recommend (a). + +### 6. DB constraint makes L1-session deletion always fail (time bomb) + +- `backend/app/models/flow_proposal.py:60-63` — `CheckConstraint("(source_session_id IS NOT NULL) <> (l1_session_id IS NOT NULL)")` (XOR). +- Line 87-92 — `l1_session_id` FK is `ondelete="SET NULL"`; `source_session_id` (line 83) is `ondelete="CASCADE"`. +- Migration `backend/alembic/versions/1fd88a68b145_flow_proposal_l1_source_linkage.py:32-45` ships the same DDL. +- Postgres CHECK constraints are non-deferrable and ARE evaluated on the UPDATE produced by `ON DELETE SET NULL`. So deleting any `l1_walk_sessions` row referenced by a proposal (whose `source_session_id` is NULL by construction) → both columns NULL → CHECK violation → the DELETE fails. The `SET NULL` action can literally never fire successfully. +- Reachable today via `backend/app/api/endpoints/admin.py:1336` (`hard_delete_user` → `db.delete(account)`, DB-side cascades with unspecified ordering), and via any future GDPR/retention purge. + +**Fix:** Change `l1_session_id` to `ondelete="CASCADE"` (matching `source_session_id`'s behavior — proposal dies with its source), in both the model and a new migration. Keep the XOR check. Alternative (`num_nonnulls(...) >= 1` style relaxation) is weaker; prefer CASCADE. + +### 7. Account admins locked out of L1 category settings (3-layer inconsistency) + +- Frontend route: `frontend/src/router.tsx:368-372` — `requiredRole="owner"`; `frontend/src/hooks/usePermissions.ts:21-28` (`getEffectiveRole`) has **no admin branch** → `account_role='admin'` maps to `viewer` → bounced to /trees. +- Backend GET: `backend/app/api/endpoints/accounts.py:175-178` uses `require_l1_or_above` (`deps.py:235-242`: `l1_tech/engineer/owner` only) → admin gets **403 on read**. +- Backend PATCH: `accounts.py:193-197` uses `require_account_owner_or_admin` (`deps.py:279-289`) → admin **can write**. +- `admin` is a real role: `backend/app/models/user.py:25` CHECK constraint; `user.py:132` treats admin as account-manager. + +**Fix:** Pick one rule — owner+admin manage L1 categories — and apply it at all three layers: GET should use `require_account_owner_or_admin` too (or a combined dep), and the route guard needs admins to pass (either add an admin branch to `getEffectiveRole` — check blast radius on other `requiredRole` uses first — or a dedicated `canManageAccount`-style guard for this route). Also note `require_account_owner_or_admin` duplicates `User.can_manage_account` (`user.py:130-132`); delegate to it. + +--- + +## FAST-FOLLOWS (real bugs, lower urgency) + +### 8. Every walk-view mount fires a fresh paid LLM call (and may swap the question) + +- The served-but-unanswered node is never persisted: `l1_session_service.py:156-174` — `node_id is None` path goes straight to `generate_next_node`; only `current_node_id` (always None today, see Finding 1) is stored. No replay branch. +- `L1WalkTreeVariant.tsx:26-44` — mount effect unconditionally POSTs `/next-node {}`; `frontend/src/main.tsx:4,34` — StrictMode is on, so dev double-mounts double-generate. + +**Impact:** Refresh/back-forward = duplicate Sonnet spend, multi-second stall, and possibly a *different* question than the one the tech was answering. + +**Fix:** Persist the pending node (e.g., a `pending_node` JSONB column on `l1_walk_sessions`, or reuse `current_node_id` + stored payload) and replay it when `node_id is None` and a pending node exists. Note: if adding columns, this is the moment to also add `category` and `problem_text` columns and delete the meta-entry convention (Root cause B) — `/next-node` currently re-fetches the internal ticket and re-scans walked_path on every step (`l1.py:302-310`) just to recover these immutable values. + +### 9. Hidden meta entry: junk proposals + depth cap off-by-one + +- Junk proposal: `l1_session_service.py:270` — `if helpful and session.session_kind == "ai_build" and session.walked_path:` — a meta-only walked_path (seeded at intake, `l1.py:132-134`) is truthy. `normalize_walked_path` (`ai_tree_builder.py:131-137`) strips meta → empty → returns the `"Empty walk — needs authoring."` stub, which "passes the proposal approval guard" per its own docstring → a `status="pending"`, `validated_by_outcome=True` junk proposal reaches the review queue when a tech resolves immediately after intake. +- Depth cap: `l1_session_service.py:172-173` passes the **raw** walked_path; `ai_tree_builder.py:82-83,96-98` — `len(walked_path) >= MAX_DEPTH` (12) counts the meta entry → cap fires after 11 real steps. `_strip_meta` is applied only downstream. + +**Fix (symptom-level):** strip meta before both the truthiness guard and `escalate_if_depth_exceeded`. **Fix (root):** real `category` column, delete the meta convention (see Finding 8 note). Root fix preferred per project principle (correct architecture over minimal diff). + +### 10. Escalation notification silently dropped when recipient query is empty + +- `notification_service.py:180` changed `if target_user_ids:` → `if target_user_ids is not None:` (intentional, documented at lines 176-178). +- `l1_session_service.py:371-381` — `escalate()` passes its computed `target_ids` unconditionally; if all owners/admins/engineers are inactive, `[]` → zero in-app notifications, no log, no fallback. (Existing callers are safe — they all use `[x] if x else None` patterns.) +- Bonus divergence: escalate's hand-rolled query filters only `is_active`, while `handoff_manager.py:323-333` also filters `deleted_at IS NULL` — soft-deleted engineers would be notified. + +**Fix:** In `escalate()`: `target_user_ids=target_ids or None` (falls back to default recipients) plus a warning log when empty; add the `deleted_at` filter. Longer-term: give `_resolve_recipients` a roles parameter so callers stop hand-rolling recipient queries. + +--- + +## Cleanups (optional, do alongside adjacent fixes) + +- `L1Dashboard.tsx:47-110` — `handleStart` / `useSuggestedFlow` / `buildNew` are three near-identical intake calls; collapse to one `runIntake(opts)` switching on `response.outcome` (this also prevents Finding-4-class drift). +- `backend/app/schemas/l1.py` — `IntakeRequest.flow_id` is dead unless Finding 4 revives it; `NextNodeRequest.acknowledged` is sent by the frontend but never read by the backend (advance infers ack from `answer is None`) — wire it or drop it. `IntakeResponse` lost its per-outcome guarantees (all fields Optional, `ticket_kind` no longer `Literal["psa","internal"]`); add a `model_validator(mode="after")` requiring `session_id`/`ticket_id` when outcome is matched/build, and add a `session_id` null-guard before `navigate()` in `handleStart` (`L1Dashboard.tsx:58-59` — currently navigates to `/l1/walk/undefined` on a regression). +- `backend/app/services/match_or_build.py:55` — unused positional `ticket_ref` param (only caller passes `""`); delete it. Also note `classify()` is a second bespoke LLM intake classifier alongside `flowpilot_engine._classify_intake`; `l1.py` passes `problem_domain=None` to matching, losing the domain signal the existing classifier provides — consider unifying in Phase 2B. +- `backend/app/services/l1_category_service.py:17` + `models/account.py:75-81` — DEFAULT_L1_CATEGORIES duplicated as a hand-escaped JSON `server_default`; derive one from the other (migration copy stays frozen). +- `frontend/src/pages/account/L1CategoriesPage.tsx` — local `prettify()` duplicates `humanizeFeatureKey` (`UpgradePrompt.tsx:62`); page skips shared `PageHeader`/`Spinner` used by sibling settings pages. +- Missing index for `GET /l1/escalations` (`l1.py:338`): consider `CREATE INDEX ... ON l1_walk_sessions (account_id, last_step_at DESC) WHERE status = 'escalated'`. +- `backend/tests/test_l1_session_service.py` — the `escalation_reason_category == "no_kb_content"` assertion was deleted from `test_escalate_without_walk_writes_audit_log`, weakening audit coverage; restore it. +- Per-step `walked_path` rewrite is O(n²) cumulative bytes (`session.walked_path = [*session.walked_path, entry]`); bounded by MAX_DEPTH=12 so fine today — note for Phase 2B if depth grows. + +--- + +## Suggested execution order + +1. Finding 1 (node ids) — unblocks everything; add the contract test. +2. Finding 6 (FK/constraint) — new migration; do early so it ships in the same release. +3. Findings 2 + 3 together (mount section, fix field names/meta filter, ProposalDetail L1 block + null-guard the /pilot link). +4. Finding 4 (intake honors flow_id; suggest card passes it). +5. Finding 5 (decide adhoc: restore option (a) recommended, or fix copy + DECISIONS.md). +6. Finding 7 (align all three permission layers). +7. Findings 8 + 9 via the root fix: add `category`/`problem_text` (+ optionally `pending_node`) columns, delete the meta-entry convention, strip-meta fixes become moot. +8. Finding 10 (one-line guard + deleted_at filter). +9. Cleanups opportunistically alongside the file they touch. + +After fixes: run the 11 Phase 2A backend test files together (authoritative gate per HANDOFF — do NOT trust a full local serial `pytest tests/`; use `--override-ini="addopts="`), frontend `tsc -b` + lint + build, and migration downgrade/upgrade roundtrip. Update `.ai/HANDOFF.md` to correct the Task 16/17 record.