resolutionflow/.ai/DECISIONS.md
Michael Chihlas db446e1fd6 docs(handoff): PR #193 all 10 review findings resolved + 2 decisions
Findings doc gets a per-finding RESOLUTION section; HANDOFF resume point moves to
"re-push + merge" and corrects the false Task 16/17 "done" record; CURRENT_TASK
updated; two architectural decisions logged (real ai_build columns replacing the
meta convention; ad-hoc walk restored); SESSION_LOG entry added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 15:56:03 -04:00


# DECISIONS.md
> Append-only architectural decision log. Newest entries at the top.
> Entry format:
>
> ```
> ## YYYY-MM-DD — <short title>
> **Context:** why this came up
> **Decision:** what we chose
> **Rejected:** what we didn't choose and why
> **Consequences:** what this means going forward
> ```
---
## 2026-06-09 — L1 ai_build context lives in columns, not a hidden `meta` walked_path entry
**Context:** PR #193 review found that the intake category was smuggled into the
ai_build session's `walked_path` as a fake `{"node_type":"meta","category":...}`
entry that every consumer had to remember to skip. Most didn't: it made an
otherwise-empty walk truthy (junk `pending` proposals reached the review queue),
pushed the depth cap off by one (counted as a real step), and rendered as a blank
row in the escalations UI. Compounding it, AI-generated nodes carried no `id`, but
the advance protocol keys on `node_id` — so the walk could never advance past the
first question (the headline feature was non-functional end-to-end).
**Decision:** Add real `category`, `problem_text`, and `pending_node` columns to
`l1_walk_sessions` (migration `61dda4f615c6`) and **delete the meta-entry convention
entirely**. Intake stores `category`/`problem_text` on the session; `/next-node`
reads them off the row (no ticket re-fetch, no walked_path scan). The server assigns
every node a `uuid4().hex[:8]` id (`ai_tree_builder._assign_id`) — never the model.
`pending_node` persists the served-but-unanswered node so a refresh / StrictMode
double-mount replays it instead of firing a fresh paid LLM call.
**Rejected:** Symptom-level strip-meta fixes (filter the meta entry at each consumer).
Smaller diff, but leaves the landmine convention in place for the next consumer to
trip over — contrary to the project principle (correct architecture over minimal diff).
Asking the LLM to invent node ids: not stable, not trustworthy.
**Consequences:** `walked_path` now holds only real steps. Adding a new consumer no
longer requires knowing about a hidden entry. `WalkSessionResponse` exposes
`category`/`problem_text` (escalations UI shows the real problem). The `meta`
node_type and `_strip_meta` are gone.
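
A minimal sketch of the two server-side rules in this decision, assuming an in-memory session object rather than the real ORM row. `WalkSession`, `serve_next_node`, `answer`, and the `generate` callable are hypothetical stand-ins; only the `uuid4().hex[:8]` id shape and the `pending_node` replay rule come from the entry above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional
from uuid import uuid4


@dataclass
class WalkSession:
    """Hypothetical stand-in for an l1_walk_sessions row."""
    category: str
    problem_text: str
    walked_path: list = field(default_factory=list)  # real steps only
    pending_node: Optional[dict] = None  # served but not yet answered


def assign_id(node: dict) -> dict:
    """Overwrite any model-supplied id with a server-generated one."""
    return {**node, "id": uuid4().hex[:8]}


def serve_next_node(session: WalkSession,
                    generate: Callable[[WalkSession], dict]) -> dict:
    """Replay the pending node if one exists; otherwise generate and persist one."""
    if session.pending_node is not None:
        return session.pending_node  # refresh or double-mount: no new LLM call
    node = assign_id(generate(session))
    session.pending_node = node
    return node


def answer(session: WalkSession, node_id: str, choice: str) -> None:
    """Advance the walk; the answer must key on the pending node's id."""
    pending = session.pending_node
    if pending is None or pending["id"] != node_id:
        raise ValueError("node_id does not match the pending node")
    session.walked_path.append({"node_id": node_id, "choice": choice})
    session.pending_node = None
```

Because the id is assigned before the node is persisted, a replayed node carries the same id the client will later send back.
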
---
## 2026-06-09 — Keep the L1 ad-hoc walk fallback (don't drop it)
**Context:** The Phase 2A intake rewrite dropped the `else: start_adhoc_session(...)`
branch, leaving `start_adhoc_session` with zero callers and the out_of_scope prompt
offering only Escalate/Cancel — while `L1CategoriesPage` copy still promised "Disabled
categories fall back to an ad-hoc walk or escalation." A capability silently regressed.
**Decision:** Restore it (review Finding 5 option a). Intake honors `adhoc=True`
(a new `IntakeRequest` field → `"adhoc"` outcome) and the out_of_scope prompt gained a
"Walk it ad-hoc" button. This preserves the pre-existing free-form-walk capability and
keeps the settings copy honest.
**Rejected:** Dropping ad-hoc and fixing the copy. It removes a capability techs had,
for a problem class (out-of-scope) where a free-form walk is the natural fallback before
escalation. Cheaper, but a product regression dressed as cleanup.
**Consequences:** `start_adhoc_session` has a caller again. The walker renders adhoc
sessions via its existing non-ai_build branch (free-form notes, no AI tree).
---
## 2026-05-29 — Single source of truth for plan-tier taxonomy (derive admin UI + validation from `plan_limits`)
**Context:** A prod report ("AI sessions aren't working") traced to the owner account having no paid plan (AI is plan-gated), compounded by a real bug: the admin "Change Plan" dropdown ([`AccountDetailPage.tsx:443-445`](../frontend/src/pages/admin/AccountDetailPage.tsx)) still offered the dead `team` slug (renamed to `enterprise` in migration `4ce3e594cb87`, 2026-05-07) and omitted `starter`/`enterprise`. Selecting "Team" 400s against the hardcoded allow-list in [`admin.py:994`](../backend/app/api/endpoints/admin.py#L994). The dropdown was missed during the 2026-05-07 taxonomy reconciliation because the allowed-plan list is hand-duplicated across ≥6 backend + frontend sites. Second taxonomy-drift incident.
**Decision:** Option B — make `plan_limits` the single source of truth: admin dropdown + pricing/checkout derive plan options from a plans endpoint (filter `is_public`, order by `sort_order`, label from `display_name`), and backend validation checks against actual `plan_limits` rows rather than a hardcoded tuple. Implementation deferred (active work is on another branch); fully specced in [TODO.md](TODO.md). A trivial dropdown-options fix may land first to unblock the admin tool.
**Rejected:** Option A (patch only the `AccountDetailPage` dropdown). Fixes the symptom but leaves the duplication that has now caused two drift incidents — and there is no outage forcing a minimal diff (bug is admin-only and was already worked around via direct Pro assignment). Conflicts with the repo principle "prefer correct architecture over minimal diff."
**Consequences:** New plan tiers become a data change (a `plan_limits` row) instead of a multi-file code edit; UI and validation can no longer drift from the catalog. Requires a public-plans read endpoint (or extending billing state) consumed by the admin UI + pricing page. The `'team'` visibility string (`Tree.visibility` / `StepLibrary.visibility`) is a separate domain and is explicitly out of scope.
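
As a rough illustration of Option B, the sketch below derives both the dropdown options and the validation set from the same rows. `PlanLimit`, `public_plan_options`, and `validate_plan` are hypothetical names; the real implementation is deferred and specced in TODO.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanLimit:
    """Hypothetical projection of a plan_limits row."""
    plan: str
    display_name: str
    is_public: bool
    sort_order: int


def public_plan_options(rows: list[PlanLimit]) -> list[dict]:
    """Dropdown and pricing options: public rows only, ordered by sort_order."""
    visible = sorted((r for r in rows if r.is_public), key=lambda r: r.sort_order)
    return [{"value": r.plan, "label": r.display_name} for r in visible]


def validate_plan(slug: str, rows: list[PlanLimit]) -> str:
    """Accept a slug only if a plan_limits row exists for it."""
    if slug not in {r.plan for r in rows}:
        raise ValueError(f"unknown plan: {slug!r}")
    return slug
```

With this shape a retired slug such as `team` fails validation as soon as its row is gone, and it cannot appear in the dropdown either, since both read the same data.
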
---
## 2026-05-28 — Scope Anthropic structured outputs to flat-array JSON only
**Context:** Optimizing the existing Claude API usage (no model change). The Anthropic path in `generate_json` (`ai_provider.py`) had no equivalent to the Gemini path's `response_mime_type="application/json"` — it prompted for JSON and relied on downstream defenses: `_strip_markdown_fences` (ai_fix), `parse_llm_json` (knowledge_flywheel), and `_try_repair_json` (kb_conversion, which balances unclosed braces on truncated output). Anthropic structured outputs (`output_config.format` with a JSON schema) guarantee valid, parseable JSON and would eliminate those band-aids. The question was which of the four `generate_json` call sites can adopt it.
Structured outputs has hard schema limits: **no recursive schemas**, and **every object must set `additionalProperties: false`** (so the schema must enumerate exactly the fields the model emits — a superset is impossible, an omission makes a field unproducible). Tracing the call sites against those limits:
- **kb_conversion** → output is `{title, description, nodes: [...]}` / `{...steps[], intake_form[]}`: **flat arrays**, references by `next_node_id`/id, no nesting. Expressible.
- **ai_fix** → returns a fixed *node that is itself a subtree*; `_find_node_by_id` recurses `node["children"]` and the prompt requires decision nodes to have ≥2 children. **Recursive, arbitrary depth.**
- **knowledge_flywheel flow-gen** → emits `tree_structure`, a decision-tree root with nested `children`/`options`, persisted as an opaque blob.
- **knowledge_flywheel enhancement** → flat `new_nodes[] + modified_options[]`; expressible but low-frequency and only fence-stripped.
**Decision:** Apply structured outputs to **flat-array outputs only** — i.e. `kb_conversion`. Wired via an optional `schema=` param on `AIProvider.generate_json` (`None` = legacy prompt-only behavior; Anthropic maps it to `output_config.format`, Gemini ignores it), with the two KB schemas + `_schema_for_target_type()` in `kb_conversion_service.py`, gated behind `settings.AI_KB_CONVERT_STRUCTURED_OUTPUT` (default **False**) pending a live constrained-decoding smoke-test in staging. The robustness fixes that motivated the work — `_extract_text_from_response` (skip non-text blocks, log `max_tokens`/`refusal`, raise on no-text) — live in the shared provider, so **all four** callers already benefit regardless of schema adoption.
**Rejected:**
- **Forcing schemas on ai_fix / flow-gen.** Their outputs are recursive/nested decision trees; a bounded-depth schema would reject valid deeper trees and break generation. Wrong architecture for marginal/zero benefit (flow-gen's tree is stored as a blob, never schema-validated downstream).
- **Wiring the flywheel enhancement site.** Flat and technically expressible, but low call frequency and only fence-stripping today — marginal benefit against the risk of a blind (un-live-tested) `additionalProperties: false` schema.
- **Deleting the fence-strip / repair helpers now.** `_strip_markdown_fences` / `parse_llm_json` must stay — they protect the recursive paths that can't use schemas. Only `_try_repair_json` (kb-only) becomes removable, and only *after* the flag is validated in staging.
**Consequences:**
- Structured outputs is the tool for flat JSON; recursive decision-tree outputs are excluded by design. New flat-JSON `generate_json` callers can opt in via `schema=`; recursive ones should not.
- `AI_KB_CONVERT_STRUCTURED_OUTPUT` must be smoke-tested against the live model (both target types) before production enablement. Open risk: whether Anthropic accepts optional (non-`required`) fields — if not, the schemas need every field in `required` with nullable types. The flag makes this fully reversible.
- Deferred cleanup: once the flag is validated, remove only `_try_repair_json` from the kb_conversion Anthropic path; leave the fence-strippers.
- Work lives on branch `feat/ai-structured-outputs` (commits `84a02a5`, `1388357`), based on `design/l1-workspace`.
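
The two schema limits can be checked mechanically before a schema is wired to a call site. The lint below is a sketch, not the provider's own validation: `schema_violations` is a hypothetical helper, it conservatively flags every `$ref` as possible recursion, and the per-node fields in `FLAT_KB_SCHEMA` (`id`, `text`, `next_node_id`) are illustrative rather than the real KB schema.

```python
def schema_violations(schema: dict, path: str = "$") -> list[str]:
    """List where a JSON schema breaks the two limits named in this entry."""
    problems = []
    if "$ref" in schema:
        problems.append(f"{path}: $ref (possible recursion)")
    if schema.get("type") == "object":
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        for name, sub in schema.get("properties", {}).items():
            problems.extend(schema_violations(sub, f"{path}.{name}"))
    if schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(schema_violations(schema["items"], f"{path}[]"))
    return problems


# Illustrative flat-array schema: nodes reference each other by id, never by nesting.
FLAT_KB_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["title", "description", "nodes"],
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "nodes": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "required": ["id", "text", "next_node_id"],
                "properties": {
                    "id": {"type": "string"},
                    "text": {"type": "string"},
                    "next_node_id": {"type": ["string", "null"]},
                },
            },
        },
    },
}
```

A tree-shaped schema whose `children` items point back at the root is flagged by the same lint, which is the property that excludes ai_fix and flow-gen.
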
---
## 2026-05-13 — Session expiration policy: 3d idle / 14d absolute defaults + per-account override
**Context:** User report: "I login to ResolutionFlow and never have to log back in." Investigation found refresh tokens at `REFRESH_TOKEN_EXPIRE_DAYS=7` with JTI rotation (`security.py:36`) — every `/auth/refresh` minted a fresh 7-day window. Net effect: a sliding 7-day session with no absolute cap. Visit once a week, logged in forever. Acceptable for pilot but not for MSP buyers whose SOC2 / cyber-insurance auditors require enforced session timeouts. Required for the same Phase O launch readiness as the other gates already in flight.
**Decision:** Two-window model snapshotted into the refresh JWT at login. Defaults to Strict (3-day idle, 14-day absolute), bounded by env-var system min/max. Per-account override via two new `accounts` columns (NULL = use system default). Owner-only `GET/PATCH /accounts/me/security` endpoint with effective-value validation (partial-override case caught at the app layer because the DB CHECK can't see Settings). Sibling `POST /accounts/me/security/revoke-sessions` for `all|others`-scoped bulk revocation. Frontend: Strict/Standard/Custom presets, active-users list (name + email + last-login-ago), differentiated SessionExpiryToast (idle = warning amber with "Stay signed in" → `/auth/refresh`; absolute = info cyan, informational only), cyan info-tone banner on `/login?reason=session_expired`, auto-redirect after scope=all bulk-revoke. Error-detail taxonomy on the wire: `session_expired_idle`, `session_expired_absolute`, `invalid_refresh_token`. Grandfather path: legacy refresh tokens (no `auth_time` claim) get one free rotation under the new policy. Atomic-revoke-then-check on `/auth/refresh` so absolute-expired tokens can't be replayed.
8 commits on the `feat/session-expiration-policy` branch (`92fa3bc` to `c7cd711`), ~1300 LoC backend + frontend including 28 backend tests. Plan + design review at `docs/plans/2026-05-13-session-expiration-policy.md` (initial design score 4/10 → final 9/10 via `/plan-design-review`; 7 design decisions locked).
**Rejected:**
- **Idle-only or absolute-only enforcement.** Idle without absolute is the current broken state (sliding forever). Absolute without idle is too strict — kicks users out daily.
- **Hard cutover on deploy (SECRET_KEY rotation).** Forces every pilot to log in again immediately; high support cost. Grandfather path is friendlier and adds ~50 lines of code.
- **Distinguish `session_revoked_by_admin` from `invalid_refresh_token` on the wire** for users whose sessions were killed via bulk-revoke. Requires tracking revocation reason per `refresh_tokens` row. Not worth the complexity for v1 — affected users see they're logged out, same as any other revoke.
- **Per-user device list with per-device revoke.** Refresh tokens don't carry device/user-agent metadata today. Account-wide bulk revoke covers the breach-response use case; per-device is a follow-up if pilots ask.
- **"Loose" preset (90d).** Strict default suggests we shouldn't ship a one-click loose option. Owners who want a loose policy can use Custom and own the choice explicitly.
- **Always-required `idle_minutes`+`absolute_minutes` (XOR-NULL invariant).** Forces owners who only want to override idle to also re-declare the absolute window, leaking the system default into account data. Partial overrides allowed; validated at the app layer against current defaults.
- **Reveal-on-Custom UI for the minute inputs.** Hidden-by-default-reveal-on-radio shifts page layout when Custom is selected. Always-visible-but-disabled is more stable and previews the Custom interaction.
- **Modal-stays-open-success-state for scope=all bulk-revoke.** User preferred auto-redirect-with-toast (more standard SaaS pattern); the toast acts as the success acknowledgment before /login loads.
**Consequences:**
- "Logged in forever" is fixed. Every user sees a hard 14-day re-auth at minimum (3-day idle in practice for typical usage).
- Account owners get a complete self-service surface for policy + bulk session control. New `/account/security` route, owner-gated.
- Audit-log entries on both mutations: `account.session_policy_update` and `account.sessions_revoked_bulk`. SOC2-ready.
- Frontend `idle_expires_at` + `absolute_expires_at` flow through the entire auth surface (`Token`, `OAuthCallbackResponse`, `authStore`, persistence). `useAuthSessionExpiry` hook is the single source for "is the session about to end."
- Future improvements (filed as follow-ups in plan §9): per-user device list (requires `refresh_tokens.last_used_at` column), super-admin global ceiling UI, per-user policy. None block current shipping.
- Cyan info-tone banner on `/login` is the first of its kind in the app; sets precedent for future neutral system messages.
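
The two-window rule can be sketched as below, under stated assumptions: `effective_policy` and `check_refresh` are hypothetical helpers, the "idle must not exceed absolute" rule is a guess at what the effective-value validation enforces, and the system min/max bounds are omitted. The error-detail strings are the ones listed in the decision.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# System defaults from the Strict preset: 3-day idle, 14-day absolute.
DEFAULT_IDLE_MINUTES = 3 * 24 * 60
DEFAULT_ABSOLUTE_MINUTES = 14 * 24 * 60


def effective_policy(idle_override: Optional[int],
                     absolute_override: Optional[int]) -> tuple[int, int]:
    """A None (NULL) override falls back to the system default, per window."""
    idle = DEFAULT_IDLE_MINUTES if idle_override is None else idle_override
    absolute = (DEFAULT_ABSOLUTE_MINUTES if absolute_override is None
                else absolute_override)
    # Assumed rule: validate the effective pair, so a partial override
    # is checked against the default it is combined with.
    if idle > absolute:
        raise ValueError("idle window cannot exceed the absolute window")
    return idle, absolute


def check_refresh(auth_time: datetime, last_refresh: datetime, now: datetime,
                  idle_minutes: int, absolute_minutes: int) -> Optional[str]:
    """Return an error detail if the session has expired, else None."""
    if now - auth_time >= timedelta(minutes=absolute_minutes):
        return "session_expired_absolute"
    if now - last_refresh >= timedelta(minutes=idle_minutes):
        return "session_expired_idle"
    return None
```

The partial-override case is visible here: overriding only the absolute window to one day leaves the 3-day default idle window larger than it, which a DB CHECK on the two columns alone could not detect.
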
---
## 2026-05-07 — Per-email allowlist (`INTERNAL_TESTER_EMAILS`) for self-serve soft cutover
**Context:** Phase O Task 46 ("internal validation pass") needed a way to exercise the full self-serve flow against the prod backend before flipping `SELF_SERVE_ENABLED=true` for everyone. The plan doc described the mechanism but the backend support was never built — flagged in `SESSION_LOG.md` as a code blocker. Stripe live-mode setup is also gated on having a working internal-tester path in prod test mode.
**Decision:** Comma-separated allowlist `INTERNAL_TESTER_EMAILS` parsed by a Pydantic field_validator into a normalized lowercase list. Two helpers on `Settings`: `is_internal_tester(email)` (case-insensitive membership check) and `is_self_serve_active_for(email)` (returns `SELF_SERVE_ENABLED OR is_internal_tester(email)`). Both endpoints that gate on the global flag now call the helper:
- `/config/public` accepts optional auth via new `get_current_user_optional` dep; returns `self_serve_enabled=true` for allowlisted authenticated callers; anonymous calls always see the global flag.
- `/auth/register` allows allowlisted emails to register without an invite code.
**Rejected:**
- **Custom header `X-Internal-Tester-Email` for anonymous flows.** Spoofable. The auth/register-payload checks are sufficient because the user has to OWN the email to register or log in.
- **Separate allowlists per surface (`INTERNAL_PRICING_TESTERS`, `INTERNAL_OAUTH_TESTERS`).** Premature splitting. The Phase O use case is "this small set of people can see the new flow"; one variable handles it. If finer granularity emerges, split then.
- **Database table for the allowlist.** Env var matches the spec from the plan doc and fits the soft-cutover lifecycle — list is small, changes infrequently, lives alongside other deployment-time config.
**Consequences:**
- Stripe internal validation can run end-to-end in prod test mode without flipping the global flag.
- Anonymous callers always see the global flag — the allowlist never leaks via unauthenticated request content. Three regression tests in `test_config_public.py` enforce this.
- `INTERNAL_TESTER_EMAILS` plumbed through `docker-compose.dev.yml` and documented in `backend/.env.example`. Railway prod env will need the same var set during Phase O cutover.
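
For reference, the allowlist logic reduces to the sketch below. It uses a plain dataclass instead of the real Pydantic `Settings`, so `SelfServeSettings` and `parse_allowlist` are hypothetical; the two helper names match the ones in the decision.

```python
from dataclasses import dataclass, field
from typing import Optional


def parse_allowlist(raw: str) -> list[str]:
    """Split a comma-separated env value into trimmed, lowercased emails."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


@dataclass
class SelfServeSettings:
    """Hypothetical stand-in for the relevant fields on Settings."""
    self_serve_enabled: bool = False
    internal_tester_emails: list[str] = field(default_factory=list)

    def is_internal_tester(self, email: Optional[str]) -> bool:
        """Case-insensitive membership check against the normalized list."""
        if email is None:
            return False
        return email.strip().lower() in self.internal_tester_emails

    def is_self_serve_active_for(self, email: Optional[str]) -> bool:
        # Anonymous callers pass None and only ever see the global flag.
        return self.self_serve_enabled or self.is_internal_tester(email)
```

Passing `None` for anonymous callers keeps the "allowlist never leaks" property in one place: the result for an unauthenticated request depends only on the global flag.
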
---
## 2026-05-07 — Reconcile plan tier taxonomy (rename `team` → `enterprise`, add `starter`)
**Context:** PR #162 left a real architectural gap. Marketing surface (PricingPage, Stripe products) was wired for `Starter / Pro / Enterprise` while backend was on `free / pro / team`. `plan_billing.plan` FK referenced `plan_limits.plan` so the `BillingPlan` schema's `Literal["pro", "starter", "team", "enterprise"]` could accept values that violated the FK. `plan_billing` was unseeded in dev, so no checkout could complete. `Subscription.plan.in_(["pro", "team"])` paid-plan checks wouldn't recognize `enterprise`. Self-serve cutover was blocked at the data layer.
**Decision:** Reconcile to a single taxonomy — backend slugs become `free / pro / starter / enterprise`, matching the marketing surface and Stripe products. Migration `4ce3e594cb87`:
1. Defensive `UPDATE subscriptions SET plan='enterprise' WHERE plan='team'` (dev had zero such rows; safety for any prod stragglers).
2. Rename the `plan_limits.plan='team'` row to `'enterprise'`.
3. Insert a `starter` row with caps interpolated between free and pro: `max_trees=10`, `max_sessions=75`, `max_users=1`, `max_ai_builds_per_month=15`, no KB Accelerator, no custom branding, no priority support.
Code rename across schemas, `Subscription` paid-plan/`has_pro_entitlement` checks, admin endpoints, frontend `useSubscription.isPaidPlan`. Resource visibility (`Tree.visibility='team'`, `StepLibrary.visibility='team'`) is a separate domain and intentionally untouched — that string means "shared with my account" and has nothing to do with the subscription tier.
New `backend/scripts/sync_stripe_plan_ids.py` — idempotent upsert of `plan_billing` rows from Stripe products by exact name match (`ResolutionFlow Starter / Pro / Enterprise`). Picks the active monthly recurring price for tiers that have one. Annual fields stay NULL by design — annual pricing is intentionally out of scope for the soft cutover ("want to be able to exit if necessary without breaching any terms").
**Rejected:**
- **Map marketing names to existing slugs (Option A from the discussion).** Smallest diff but means PricingPage cards have to translate `enterprise` → `team` at render time, and "Starter" can't exist as a real backend tier; it'd have to be hidden or dropped. Kicks the can.
- **Add `starter` only, keep `team` slug as cosmetic enterprise (Option C).** Mixed taxonomy across layers — slug-vs-display-name divergence guarantees confusion in 6 months. Compromise that's worse than either pure choice.
- **Annual pricing in this iteration.** User's explicit constraint: skip annual to keep exit-flexibility. Schema columns (`annual_price_cents`, `stripe_annual_price_id`) preserved as nullable for future re-enable.
- **Auto-archive the existing Enterprise `$500/mo` test-mode price.** Done manually via Stripe MCP after un-setting the product's `default_price` first. Spec says Enterprise is sales-led with no catalog price.
**Consequences:**
- `plan_billing` table is now seedable and seeded. Test-mode `plan_billing` populated for all 3 tiers via `sync_stripe_plan_ids.py`. Live mode runs the same script after manual Dashboard setup of products + prices.
- New consumers of `Subscription.plan` literal must use `("free", "pro", "starter", "enterprise")`. Three call sites already updated. Backend-wide grep is the safety net for new ones.
- `Subscription.is_paid` and `has_pro_entitlement` now include `starter` — Starter is a paid tier with a real $19.99/mo price.
- 86/86 passing across the subscription/billing/plan/invite/admin sweep after the rename.
- Test fixtures: `conftest.py` plan_limits seed updated to the new taxonomy. `_seed_plan_limits` helper in `test_plans_public.py` is now a true upsert so tests can override `max_users` even when conftest seeded the canonical value.
---
## 2026-05-07 — Standardize backend Python on 3.12
**Context:** Runtime facts had drifted from docs. The backend Dockerfiles and running dev container were already on Python 3.12, GitHub CI had just been updated to 3.12, but project docs still said Python 3.11 and Gitea CI relied on the runner's ambient Python.
**Decision:** Treat Python 3.12 as the backend standard. Pin local pyenv via `.python-version` to 3.12.13, matching the current `python:3.12-slim` container patch level. Add explicit Python 3.12 setup to Gitea CI and keep GitHub CI on Python 3.12.
**Rejected:** Moving Docker/runtime back to Python 3.11. The application was already building and running on 3.12, so reverting the runtime would add churn without a product or dependency reason.
**Consequences:** Native backend work should use `backend/venv` created from Python 3.12.13. Future docs/CI/runtime changes should preserve Python 3.12 unless a deliberate upgrade decision is recorded.
---
## 2026-04-30 — Add `applied_pending` non-terminal status to suggested fixes
**Context:** The verifying banner forces a synchronous verdict — worked / didn't / partial — but a lot of real MSP fixes are async. Engineer ran the script but is waiting on the client to power-cycle, AD replication, an O365 license sync. With only the existing outcomes, the engineer either leaves the banner stale (eroding the verifying signal) or guesses wrong (corrupting outcome data). User flagged the gap directly. Today's `NudgeBanner` "Still checking" button just silences the nudge — it doesn't tell the system anything.
**Decision:** Add a fourth, non-terminal outcome `applied_pending`, parallel to `applied_partial`. Required `pending_reason` Text column stores the "what are you waiting on?" reason. Outcome endpoint allows pending → {success, failed, partial, dismissed} transitions; pending stamps `applied_at` but NOT `verified_at` (it's parked, not verified). Resolution-note generator frames the fix as provisional (no closure language); escalation-package generator surfaces pending verification as the leading hypothesis with a reference to what's being waited on. Frontend exposes the state via a new `PendingBanner` component (info-tone, mirrors `PartialBanner`) plus a "Waiting to verify…" overflow option in the verifying banner. `NudgeBanner` "Still checking" now records pending with a reason instead of just silencing.
**Rejected:**
- **Reuse `applied_partial`.** Semantically wrong — partial means "I did some of it." Pending means "I did all of it, just can't tell if it worked." Generators write different prose for each, and conflating them would lose the distinction in the customer-facing resolution note and the next-engineer escalation handoff.
- **Add a `pending_reason` column without a new status.** The status field is what the dashboard, banner, and generators all branch on. Hiding pending state in a separate column would proliferate `IF pending_reason IS NOT NULL` checks across every consumer.
- **Cross-session "Follow-ups" dashboard rollup in v1.** Per-session `PendingBanner` is the chat-anchored reminder. Add the dashboard surface only if engineers report losing track across multiple pending sessions in pilot use.
- **Optional follow-up timer ("remind me in 30m").** Out of scope; nice-to-have but not the wedge.
**Consequences:**
- Engineers can park a fix honestly without losing the verifying signal. The state survives across sessions because it's persisted server-side.
- `pending_reason` is preserved as audit trail when the engineer advances pending → success/failed/dismissed; it is not auto-cleared. Intentional — it tells the next reader "we waited for X, then it worked."
- New consumers of `FixStatus` must handle the `applied_pending` case. Currently three: the banner derivation in `AssistantChatPage`, the resolution-note generator, and the escalation-package generator. All three updated in this change.
- Migration `c0f3a4b7e91d` is reversible — downgrade rewrites pending rows back to `applied_partial` and copies `pending_reason` into `partial_notes` if the partial slot was empty, then drops the column.
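
The pending rules can be sketched as a small transition function. This is an illustration, not the outcome endpoint: `record_outcome` is hypothetical, and apart from `applied_pending` and `applied_partial` the status strings are invented for the example. Stamping `verified_at` on exit is left out because this entry does not specify it.

```python
from datetime import datetime, timezone
from typing import Optional

PENDING = "applied_pending"
# Hypothetical status strings for the four exits named in the decision.
EXITS_FROM_PENDING = {"applied_success", "applied_failed", "applied_partial", "dismissed"}


def record_outcome(fix: dict, new_status: str, now: datetime,
                   pending_reason: Optional[str] = None) -> dict:
    """Return an updated copy of a suggested-fix record."""
    updated = dict(fix)
    if new_status == PENDING:
        reason = (pending_reason or "").strip()
        if not reason:
            raise ValueError("pending_reason is required when parking a fix")
        updated["pending_reason"] = reason
        updated["applied_at"] = now    # the fix was applied...
        updated["verified_at"] = None  # ...but is parked, not verified
    elif fix.get("status") == PENDING and new_status not in EXITS_FROM_PENDING:
        raise ValueError(f"cannot move from {PENDING} to {new_status!r}")
    # pending_reason is never cleared here: it stays as the audit trail.
    updated["status"] = new_status
    return updated
```
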
---
## 2026-04-30 — Allow `escalated_to_id` to send chat messages in claimed sessions
**Context:** During browser QA, clicking "Get AI analysis" on the magic-moment screen returned `POST /ai-sessions/{id}/chat → 400`. The senior tech who claimed the session is stored as `escalated_to_id` on `AISession`, not `user_id` (which remains the junior who created the session). `unified_chat_service.send_chat_message` queried `WHERE ai_sessions.user_id = :user_id`, so the senior's ID never matched and the endpoint rejected the request.
**Decision:** Extend the ownership check in `send_chat_message` to `OR ai_sessions.escalated_to_id = :user_id` using SQLAlchemy `or_()`. This is the minimal, correct fix: the session model already has a semantically valid "also owns" field for the claiming senior; extending the WHERE clause makes that ownership real.
**Rejected:**
- **Transfer `user_id` to the senior on claim.** Breaks the audit trail — `user_id` is the originating engineer throughout the session lifecycle. Any query scoped to "sessions this engineer worked on" would silently lose the junior's history.
- **A separate `can_send_message` service method.** Adds indirection with no benefit for v1. One `or_()` line in the existing query is sufficient.
- **Checking a role/permission flag instead.** Role gating (engineer/admin) already happens at the claim endpoint. The chat-send check is about session ownership, not role. Mixing the two concerns would be confusing.
**Consequences:**
- Seniors can send AI briefings and continue chat work in sessions they have claimed. Core escalation pickup flow unblocked.
- Any future caller of `send_chat_message` should be aware that "user_id or escalated_to_id" is the ownership rule. The service-level check is the single enforcement point.
- `user_id` remains the originating engineer for all audit, history, and analytics queries. No data migration needed.
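
The ownership rule is easiest to see as the WHERE clause itself. The sketch below uses the standard-library `sqlite3` module in place of SQLAlchemy, with a cut-down hypothetical `ai_sessions` table; `can_send_chat` is an illustrative name, not the service function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ai_sessions ("
    "id TEXT PRIMARY KEY, user_id TEXT NOT NULL, escalated_to_id TEXT)"
)
# s1 was created by the junior and claimed by the senior; s2 is unclaimed.
conn.execute("INSERT INTO ai_sessions VALUES ('s1', 'junior', 'senior')")
conn.execute("INSERT INTO ai_sessions VALUES ('s2', 'junior', NULL)")


def can_send_chat(db: sqlite3.Connection, session_id: str, user_id: str) -> bool:
    """True if the caller created the session or has claimed it."""
    row = db.execute(
        "SELECT 1 FROM ai_sessions "
        "WHERE id = :sid AND (user_id = :uid OR escalated_to_id = :uid)",
        {"sid": session_id, "uid": user_id},
    ).fetchone()
    return row is not None
```

Note that `user_id` is never rewritten: the senior gains access through the second column, so queries scoped to the originating engineer are unaffected.
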
---
## 2026-04-29 — Consolidate the three per-escalation AI calls into one structured generation
**Context:** A single user-initiated escalation currently triggers three separate Sonnet calls, all summarizing the same source material (session state, steps taken, "what we know") from slightly different angles:
1. `_build_escalation_package_enhanced` — runs in the background `enrich_escalation_async` task, builds a rich JSON payload that's saved to `ai_session.escalation_package`.
2. `_generate_ai_assessment` — also background, returns the magic-moment screen fields (`likely_cause`, `suggested_steps[]`, `confidence`).
3. `generate_status_update` — engineer-triggered when they click "Ticket Notes" / "Client Update" / "Email Draft" in the conclude modal, generates audience-specific PSA prose.
The user surfaced the smell: the engineer is *typically* generating a status update during the escalate flow, so the AI assessment work is being done twice with overlapping context and the engineer's PSA prose is being thrown away. Live test on 2026-04-29 also showed that bumping the assessment timeout 15s → 45s did NOT fix the empty-placeholder bug — meaning the architectural smell is also a demo blocker.
**Decision:** ONE structured AI call per escalation that produces a single payload covering both the magic-moment screen's diagnostic fields AND the PSA-ready prose. Persist to `SessionHandoff`. The conclude modal's "Ticket Notes" button reads from the saved prose instead of calling the model. "Client Update" and "Email Draft" buttons trigger a cheap Haiku transformation over the saved prose (tone shift only, not a re-summarization).
Proposed payload shape (final form decided during implementation):
```json
{
  "summary_prose": "<PSA-flavored ticket-notes paragraph>",
  "what_we_know": ["<one-liner>"],
  "likely_cause": "<one sentence>",
  "suggested_steps": ["<short step>"],
  "confidence": "low | medium | high",
  "audience_variants": {"client_update": null, "email_draft": null}
}
```
`audience_variants` filled lazily on first user request, cached.
**Rejected:**
- **Just bumping the timeout further.** Already tried 5s → 15s → 45s. The architectural redundancy is the real cost — even if Sonnet completed reliably, three calls per escalation is wasteful and creates three places where state can diverge.
- **Reusing the engineer's status update content as the AI assessment.** User's first instinct, but: status updates aren't always generated (engineer has to click), they're audience-specific (so you'd pick which one to copy), and they're prose without the structured fields the magic-moment screen needs. The right consolidation is the OTHER direction — generate ONE structured payload that the status-update buttons consume.
- **Switching the assessment to Haiku for speed.** Faster but solves only the latency symptom, not the redundancy. Doesn't help the conclude modal's status-update buttons.
**Consequences:**
- Magic-moment screen populates in ~5s instead of 25s+ (work happens in the foreground escalate path, not in a background task that races with the senior's pickup).
- Token spend per escalation drops by ~60% — one Sonnet call replaces two; the third (audience variants) becomes Haiku.
- Engineer's "Ticket Notes" button is instant — no model round-trip.
- Schema enforcement matters. The current `_generate_ai_assessment` returns freeform prose that the frontend stuffs into `assessment_text` because the structured fields aren't reliably parseable. The new call must use Anthropic's structured output / tool-use to enforce the schema.
- Migration concern: `ai_session.escalation_package` JSON column has live data on existing sessions. Keep it READABLE for backward compatibility; just stop *writing* the enhanced payload from `enrich_escalation_async`. If downstream queue summaries depend on it, dual-write the basic snapshot.
- Test fixtures (`test_handoff_manager.py`, `test_session_handoffs_api.py`) currently stub `_generate_ai_assessment` via `AsyncMock`. Updating the stubs is part of the rename.
- The frontend SSE assessment-ready subscription (added in `0f00ee5`) stays as-is — it just listens for the new event payload.
---
## 2026-04-28 — Tag the task-lane state with an owner chatId
**Context:** A recurring bug — every time the user returned to test escalation work, creating a new session would flash the previous session's task-lane data (questions, actions, "Tasks" pill counts) before the new session's AI response landed. The first attempt to fix it (`8914391`) added initializer-time guards (`incomingPrefill || isPickup`) that skipped the sessionStorage restore on mount. That covered exactly two entry paths and missed every other case: in-place URL navigation, mid-flight pickup, HMR re-runs, and the gap between `setActiveChatId(B)` and the AI response that finally populates B's questions/actions. The persistence effect made it worse by writing `{chatId: activeChatId, questions: activeQuestions}` — at any moment where activeChatId had flipped before the questions were updated, sessionStorage was stamped with `{chatId: B, questions: [A's data]}` and a subsequent restore would happily render A's data for B.
The root cause was that `activeQuestions` / `activeActions` / `showTaskLane` were three independent state slices implicitly assumed to be in sync with `activeChatId`. The synchronization was by convention, not by structure. Every code path that mutated them had to remember to call `resetSessionDerivedState` first; missing one created stale UI.
**Decision:** Add a `taskLaneOwnerChatId` state that records *which chatId the in-memory questions/actions belong to*. It is set at every site that populates them (sendPrefill, selectChat, handleSend, handleTaskSubmit, handleResumeNew, refreshFacts, handleApplyFix) and cleared in `resetSessionDerivedState`. The persistence effect writes `taskLaneOwnerChatId`, not `activeChatId`, as the snapshot's chatId tag. Rendering is gated on `taskLaneOwnerChatId === activeChatId`, exposed as the derived value `taskLaneIsForActiveChat` and ANDed into all three render conditions (toolbar Tasks button, narrow-viewport floating drawer, main side panel). The mount-time `skipTaskLaneRestore` guard stays as belt-and-braces for the prefill/pickup entry-flash window, which the owner-gate alone doesn't cover.
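
The owner gate can be sketched as a pure function over the state. This is a minimal illustration, not the component's actual code: the `TaskLaneState` shape, the `populate`, `switchChat`, and `shouldRenderLane` helpers, and the use of plain strings for questions and actions are assumptions made for the sketch. Only the state names and `taskLaneIsForActiveChat` come from this entry.

```typescript
// Hypothetical shape: the real component keeps these as separate state slices.
interface TaskLaneState {
  activeChatId: string | null;
  taskLaneOwnerChatId: string | null;
  activeQuestions: string[];
  activeActions: string[];
  showTaskLane: boolean;
}

// Derived gate: true only when the in-memory lane data belongs to the active chat.
function taskLaneIsForActiveChat(s: TaskLaneState): boolean {
  return s.taskLaneOwnerChatId === s.activeChatId;
}

// Hypothetical populate helper: the data and its owner tag are written as a pair.
function populate(
  s: TaskLaneState,
  chatId: string,
  questions: string[],
  actions: string[],
): TaskLaneState {
  return {
    ...s,
    taskLaneOwnerChatId: chatId,
    activeQuestions: questions,
    activeActions: actions,
    showTaskLane: true,
  };
}

// Hypothetical navigation helper: flips the active chat and leaves the lane data alone,
// which is the window in which stale data used to flash.
function switchChat(s: TaskLaneState, chatId: string): TaskLaneState {
  return { ...s, activeChatId: chatId };
}

// Every render condition ANDs the gate in, so a mismatch hides the lane.
function shouldRenderLane(s: TaskLaneState): boolean {
  return s.showTaskLane && taskLaneIsForActiveChat(s);
}
```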
**Rejected:**
- **More entry-path guards.** That's whack-a-mole — the next path nobody anticipated will reproduce the bug. The owner-gate makes the bug structurally impossible regardless of which path triggers it.
- **Combining the four state slices into a single tagged object.** Cleaner long-term but a bigger refactor with more touch points. The owner-tracking approach gets the structural guarantee with a minimal diff and keeps the existing setState patterns.
- **Inlining the comparison at every render site.** Works but proliferates the comparison; one named derived value (`taskLaneIsForActiveChat`) reads better and groups the gate with the persistence-effect / state declarations as a named concept.
**Consequences:**
- Stale task-lane data is structurally unable to display. The lane is hidden during any window where `taskLaneOwnerChatId !== activeChatId`, no matter which mutation path got you there.
- Adding new sites that populate `activeQuestions` / `activeActions` requires also setting `taskLaneOwnerChatId`. The pattern is documented in the commit message and visible in every existing populate site as a paired call.
- The mount-time `skipTaskLaneRestore` guard is now redundant in steady state but is kept for the few-hundred-ms flash window between component mount and the first sendPrefill / selectChat effect. Deleting it would reintroduce a (smaller) flash for no real gain.
- Future task-lane state slices (e.g. `facts`, `activeFix`) follow the same pattern: gate their visibility on the owner check via the existing render conditions. Tagging more slices with their own `*OwnerChatId` is a future refactor if the slices diverge.
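
The persistence half of this entry can be sketched in the same spirit. This is an illustration under stated assumptions: the `Snapshot` shape, the `snapshotFor` and `restoreFor` helpers, and the `"taskLane"` storage key are hypothetical, and a plain `Map` stands in for sessionStorage so the sketch runs outside a browser.

```typescript
// Hypothetical snapshot shape written by the persistence effect.
interface Snapshot {
  chatId: string | null;
  questions: string[];
}

// Tag the snapshot with the owner of the data, not with whichever chat is active.
function snapshotFor(
  taskLaneOwnerChatId: string | null,
  activeQuestions: string[],
): Snapshot {
  return { chatId: taskLaneOwnerChatId, questions: activeQuestions };
}

// Restore only when the stored tag matches the chat being opened.
function restoreFor(store: Map<string, string>, key: string, chatId: string): string[] {
  const raw = store.get(key);
  if (raw === undefined) return [];
  const snap = JSON.parse(raw) as Snapshot;
  return snap.chatId === chatId ? snap.questions : [];
}

// Chat A's questions are still in memory at the moment the active chat flips to B.
// Because the tag is the owner ("A"), a restore for B finds nothing to render.
const store = new Map<string, string>();
store.set("taskLane", JSON.stringify(snapshotFor("A", ["A: which VLAN?"])));
const restoredForB = restoreFor(store, "taskLane", "B");
const restoredForA = restoreFor(store, "taskLane", "A");
```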
---
## 2026-04-24 — Adopt dual-agent handoff system (`.ai/` + `CLAUDE.md` + `AGENTS.md`)
**Context:** Claude Code hits session and weekly usage limits. Work stalls when the primary agent is locked out. Needed a structured way for OpenAI Codex to resume where Claude left off without losing architectural truth or drifting across sessions.
**Decision:** Split the old CLAUDE.md into `.ai/PROJECT_CONTEXT.md` (stable repo truth), agent-specific root files (`CLAUDE.md`, `AGENTS.md`) with a shared protocol block, and a small handoff toolkit (`CURRENT_TASK.md`, `HANDOFF.md`, `TODO.md`, `DECISIONS.md`, `SESSION_LOG.md`, `README.md`). The previous CLAUDE.md was snapshotted in commit `e110fed` before the migration.
**Rejected:**
- Single symlinked CLAUDE.md/AGENTS.md — diverges silently, hides agent-specific tooling differences.
- Putting GitNexus/gstack content in AGENTS.md — Codex doesn't have those tools; would mislead the resume agent.
- Keeping the old CLAUDE.md as-is and adding AGENTS.md alongside it — duplicated truth, drift guaranteed.
**Consequences:**
- First read for either agent: `.ai/PROJECT_CONTEXT.md` + `.ai/CURRENT_TASK.md` + `.ai/HANDOFF.md`.
- Architectural changes in the repo require updating PROJECT_CONTEXT.md, not the root agent files.
- Git trailers differ per agent (`Claude Opus 4.7` vs `Codex`) — preserved in each root file.
- Legacy `SESSION-HANDOFF.md` deleted in the same commit; superseded by `.ai/HANDOFF.md`.