
Michael Chihlas db446e1fd6 docs(handoff): PR #193 all 10 review findings resolved + 2 decisions

Findings doc gets a per-finding RESOLUTION section; HANDOFF resume point moves to
"re-push + merge" and corrects the false Task 16/17 "done" record; CURRENT_TASK
updated; two architectural decisions logged (real ai_build columns replacing the
meta convention; ad-hoc walk restored); SESSION_LOG entry added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-09 15:56:03 -04:00


DECISIONS.md

Append-only architectural decision log. Newest entries at the top. Entry format:
## YYYY-MM-DD — <short title>
**Context:** why this came up
**Decision:** what we chose
**Rejected:** what we didn't choose and why
**Consequences:** what this means going forward

2026-06-09 — L1 ai_build context lives in columns, not a hidden `meta` walked_path entry

Context: PR #193 review found that the intake category was smuggled into the ai_build session's walked_path as a fake {"node_type":"meta","category":...} entry that every consumer had to remember to skip. Most didn't: it made an otherwise-empty walk truthy (junk pending proposals reached the review queue), pushed the depth cap off by one (counted as a real step), and rendered as a blank row in the escalations UI. Compounding it, AI-generated nodes carried no id, but the advance protocol keys on node_id — so the walk could never advance past the first question (the headline feature was non-functional end-to-end).

Decision: Add real category, problem_text, and pending_node columns to l1_walk_sessions (migration 61dda4f615c6) and delete the meta-entry convention entirely. Intake stores category/problem_text on the session; /next-node reads them off the row (no ticket re-fetch, no walked_path scan). The server assigns every node a uuid4().hex[:8] id (ai_tree_builder._assign_id) — never the model. pending_node persists the served-but-unanswered node so a refresh / StrictMode double-mount replays it instead of firing a fresh paid LLM call.
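A minimal sketch of the two server-side behaviors described above, assuming a session is a plain dict and that `generate` stands in for the paid LLM call. The names `assign_id` and `next_node` are hypothetical illustrations, not the actual `ai_tree_builder` API.

```python
import uuid
from typing import Callable


def assign_id(node: dict) -> dict:
    """Return a copy of the node with a server-generated 8-hex-char id.

    Any id the model supplied is overwritten rather than trusted.
    """
    stamped = dict(node)
    stamped["id"] = uuid.uuid4().hex[:8]
    return stamped


def next_node(session: dict, generate: Callable[[], dict]) -> dict:
    """Replay the served-but-unanswered node if one exists.

    Only when nothing is pending do we call `generate` (the stand-in for
    the LLM call), stamp an id, and persist the result as pending_node.
    """
    pending = session.get("pending_node")
    if pending is not None:
        return pending
    node = assign_id(generate())
    session["pending_node"] = node
    return node
```

Under these assumptions, a page refresh or a double mount calls `next_node` twice but reaches `generate` only once, because the second call finds `pending_node` already set.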

Rejected: Symptom-level strip-meta fixes (filter the meta entry at each consumer). Smaller diff, but leaves the landmine convention in place for the next consumer to trip over — contrary to the project principle (correct architecture over minimal diff). Asking the LLM to invent node ids: not stable, not trustworthy.

Consequences: walked_path now holds only real steps. Adding a new consumer no longer requires knowing about a hidden entry. WalkSessionResponse exposes category/problem_text (escalations UI shows the real problem). The meta node_type and _strip_meta are gone.

2026-06-09 — Keep the L1 ad-hoc walk fallback (don't drop it)

Context: The Phase 2A intake rewrite dropped the else: start_adhoc_session(...) branch, leaving start_adhoc_session with zero callers and the out_of_scope prompt offering only Escalate/Cancel — while L1CategoriesPage copy still promised "Disabled categories fall back to an ad-hoc walk or escalation." A capability silently regressed.

Decision: Restore it (review Finding 5 option a). Intake honors adhoc=True (a new IntakeRequest field → "adhoc" outcome) and the out_of_scope prompt gained a "Walk it ad-hoc" button. This preserves the pre-existing free-form-walk capability and keeps the settings copy honest.

Rejected: Dropping ad-hoc and fixing the copy. It removes a capability techs had, for a problem class (out-of-scope) where a free-form walk is the natural fallback before escalation. Cheaper, but a product regression dressed as cleanup.

Consequences: start_adhoc_session has a caller again. The walker renders adhoc sessions via its existing non-ai_build branch (free-form notes, no AI tree).

2026-05-29 — Single source of truth for plan-tier taxonomy (derive admin UI + validation from `plan_limits`)

Context: A prod report ("AI sessions aren't working") traced to the owner account having no paid plan (AI is plan-gated), compounded by a real bug: the admin "Change Plan" dropdown (AccountDetailPage.tsx:443-445) still offered the dead team slug (renamed to enterprise in migration 4ce3e594cb87, 2026-05-07) and omitted starter/enterprise. Selecting "Team" 400s against the hardcoded allow-list in admin.py:994. The dropdown was missed during the 2026-05-07 taxonomy reconciliation because the allowed-plan list is hand-duplicated across ≥6 backend + frontend sites. Second taxonomy-drift incident.

Decision: Option B — make plan_limits the single source of truth: admin dropdown + pricing/checkout derive plan options from a plans endpoint (filter is_public, order by sort_order, label from display_name), and backend validation checks against actual plan_limits rows rather than a hardcoded tuple. Implementation deferred (active work is on another branch); fully specced in TODO.md. A trivial dropdown-options fix may land first to unblock the admin tool.
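As an illustration of Option B, the sketch below derives both the dropdown options and the validation check from the same rows. It assumes `plan_limits` rows can be read into simple objects with `plan`, `display_name`, `is_public`, and `sort_order` attributes; the `PlanRow` class, the function names, and the sample rows (including the non-public `internal` row) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanRow:
    plan: str
    display_name: str
    is_public: bool
    sort_order: int


def public_plan_options(rows: list[PlanRow]) -> list[dict]:
    """Options for a dropdown: public rows only, ordered by sort_order."""
    visible = sorted((r for r in rows if r.is_public), key=lambda r: r.sort_order)
    return [{"value": r.plan, "label": r.display_name} for r in visible]


def is_valid_plan(plan: str, rows: list[PlanRow]) -> bool:
    """Validate against the actual rows instead of a hardcoded tuple."""
    return any(r.plan == plan for r in rows)


# Hypothetical sample data, deliberately out of order.
SAMPLE_ROWS = [
    PlanRow("enterprise", "Enterprise", True, 3),
    PlanRow("free", "Free", True, 0),
    PlanRow("internal", "Internal", False, 9),
    PlanRow("pro", "Pro", True, 2),
    PlanRow("starter", "Starter", True, 1),
]
```

With rows like these, the dead `team` slug fails validation and never appears in the dropdown, because both paths read the same catalog.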

Rejected: Option A (patch only the AccountDetailPage dropdown). Fixes the symptom but leaves the duplication that has now caused two drift incidents — and there is no outage forcing a minimal diff (bug is admin-only and was already worked around via direct Pro assignment). Conflicts with the repo principle "prefer correct architecture over minimal diff."

Consequences: New plan tiers become a data change (a plan_limits row) instead of a multi-file code edit; UI and validation can no longer drift from the catalog. Requires a public-plans read endpoint (or extending billing state) consumed by the admin UI + pricing page. The 'team' visibility string (Tree.visibility / StepLibrary.visibility) is a separate domain and is explicitly out of scope.

2026-05-28 — Scope Anthropic structured outputs to flat-array JSON only

Context: Optimizing the existing Claude API usage (no model change). The Anthropic path in generate_json (ai_provider.py) had no equivalent to the Gemini path's response_mime_type="application/json" — it prompted for JSON and relied on downstream defenses: _strip_markdown_fences (ai_fix), parse_llm_json (knowledge_flywheel), and _try_repair_json (kb_conversion, which balances unclosed braces on truncated output). Anthropic structured outputs (output_config.format with a JSON schema) guarantee valid, parseable JSON and would eliminate those band-aids. The question was which of the four generate_json call sites can adopt it.

Structured outputs has hard schema limits: no recursive schemas, and every object must set additionalProperties: false (so the schema must enumerate exactly the fields the model emits — a superset is impossible, an omission makes a field unproducible). Tracing the call sites against those limits:

kb_conversion → output is {title, description, nodes: [...]} / {...steps[], intake_form[]} — flat arrays, references by next_node_id/id, no nesting. Expressible.
ai_fix → returns a fixed node that is itself a subtree; _find_node_by_id recurses node["children"] and the prompt requires decision nodes to have ≥2 children. Recursive, arbitrary depth.
knowledge_flywheel flow-gen → emits tree_structure, a decision-tree root with nested children/options, persisted as an opaque blob.
knowledge_flywheel enhancement → flat new_nodes[] + modified_options[]; expressible but low-frequency and only fence-stripped.
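The schema limits above can be sketched as a small pre-flight checker. This is a conservative heuristic, not Anthropic's validator: it flags every object that does not set `additionalProperties: false`, and it treats any `$ref` as suspect, which is stricter than the "no recursive schemas" limit as stated. The function name and the field types in the example schema are assumptions.

```python
def schema_problems(schema: dict, path: str = "$") -> list[str]:
    """Collect reasons a JSON schema may not fit the limits described above."""
    problems: list[str] = []
    if "$ref" in schema:
        # Conservative: a $ref is how a recursive schema would be expressed.
        problems.append(f"{path}: uses $ref (possible recursion)")
    if schema.get("type") == "object":
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: object must set additionalProperties: false")
        for name, sub in schema.get("properties", {}).items():
            problems.extend(schema_problems(sub, f"{path}.{name}"))
    if schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(schema_problems(schema["items"], f"{path}[]"))
    return problems


# Flat shape in the spirit of the kb_conversion output (types are assumed).
FLAT_KB_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["title", "description", "nodes"],
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "nodes": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": False,
                "required": ["id", "next_node_id"],
                "properties": {
                    "id": {"type": "string"},
                    "next_node_id": {"type": "string"},
                },
            },
        },
    },
}
```

A flat schema like `FLAT_KB_SCHEMA` passes with no problems, while a tree whose `children` items point back at the root via `$ref` is flagged.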

Decision: Apply structured outputs to flat-array outputs only — i.e. kb_conversion. Wired via an optional schema= param on AIProvider.generate_json (None = legacy prompt-only behavior; Anthropic maps it to output_config.format, Gemini ignores it), with the two KB schemas + _schema_for_target_type() in kb_conversion_service.py, gated behind settings.AI_KB_CONVERT_STRUCTURED_OUTPUT (default False) pending a live constrained-decoding smoke-test in staging. The robustness fixes that motivated the work — _extract_text_from_response (skip non-text blocks, log max_tokens/refusal, raise on no-text) — live in the shared provider, so all four callers already benefit regardless of schema adoption.

Rejected:

Forcing schemas on ai_fix / flow-gen. Their outputs are recursive/nested decision trees; a bounded-depth schema would reject valid deeper trees and break generation. Wrong architecture for marginal/zero benefit (flow-gen's tree is stored as a blob, never schema-validated downstream).
Wiring the flywheel enhancement site. Flat and technically expressible, but low call frequency and only fence-stripping today — marginal benefit against the risk of a blind (un-live-tested) additionalProperties: false schema.
Deleting the fence-strip / repair helpers now. _strip_markdown_fences / parse_llm_json must stay — they protect the recursive paths that can't use schemas. Only _try_repair_json (kb-only) becomes removable, and only after the flag is validated in staging.

Consequences:

Structured outputs is the tool for flat JSON; recursive decision-tree outputs are excluded by design. New flat-JSON generate_json callers can opt in via schema=; recursive ones should not.
AI_KB_CONVERT_STRUCTURED_OUTPUT must be smoke-tested against the live model (both target types) before production enablement. Open risk: whether Anthropic accepts optional (non-required) fields — if not, the schemas need every field in required with nullable types. The flag makes this fully reversible.
Deferred cleanup: once the flag is validated, remove only _try_repair_json from the kb_conversion Anthropic path; leave the fence-strippers.
Work lives on branch feat/ai-structured-outputs (commits 84a02a5, 1388357), based on design/l1-workspace.

2026-05-13 — Session expiration policy: 3d idle / 14d absolute defaults + per-account override

Context: User report: "I login to ResolutionFlow and never have to log back in." Investigation found refresh tokens at REFRESH_TOKEN_EXPIRE_DAYS=7 with JTI rotation (security.py:36) — every /auth/refresh minted a fresh 7-day window. Net effect: a sliding 7-day session with no absolute cap. Visit once a week, logged in forever. Acceptable for pilot but not for MSP buyers whose SOC2 / cyber-insurance auditors require enforced session timeouts. Required for the same Phase O launch readiness as the other gates already in flight.

Decision: Two-window model snapshotted into the refresh JWT at login. Defaults to Strict (3-day idle, 14-day absolute), bounded by env-var system min/max. Per-account override via two new accounts columns (NULL = use system default). Owner-only GET/PATCH /accounts/me/security endpoint with effective-value validation (partial-override case caught at the app layer because the DB CHECK can't see Settings). Sibling POST /accounts/me/security/revoke-sessions for all|others-scoped bulk revocation. Frontend: Strict/Standard/Custom presets, active-users list (name + email + last-login-ago), differentiated SessionExpiryToast (idle = warning amber with "Stay signed in" → /auth/refresh; absolute = info cyan, informational only), cyan info-tone banner on /login?reason=session_expired, auto-redirect after scope=all bulk-revoke. Error-detail taxonomy on the wire: session_expired_idle, session_expired_absolute, invalid_refresh_token. Grandfather path: legacy refresh tokens (no auth_time claim) get one free rotation under the new policy. Atomic-revoke-then-check on /auth/refresh so absolute-expired tokens can't be replayed.
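The two-window check can be sketched as follows. This is a simplified model under stated assumptions: `auth_time` is the original login, `last_refresh` is the most recent token rotation, and the absolute window is checked before the idle window (the ordering is an assumption). The function names are hypothetical; only the error strings come from the taxonomy above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Strict defaults from the decision: 3-day idle, 14-day absolute.
DEFAULT_IDLE_MINUTES = 3 * 24 * 60
DEFAULT_ABSOLUTE_MINUTES = 14 * 24 * 60


def effective_minutes(override: Optional[int], system_default: int) -> int:
    """A NULL (None) account override means 'use the system default'."""
    return system_default if override is None else override


def refresh_rejection(
    auth_time: datetime,
    last_refresh: datetime,
    now: datetime,
    idle_minutes: int = DEFAULT_IDLE_MINUTES,
    absolute_minutes: int = DEFAULT_ABSOLUTE_MINUTES,
) -> Optional[str]:
    """Return the error detail for a refresh attempt, or None if allowed."""
    if now - auth_time >= timedelta(minutes=absolute_minutes):
        return "session_expired_absolute"
    if now - last_refresh >= timedelta(minutes=idle_minutes):
        return "session_expired_idle"
    return None
```

The key difference from the old behavior is that refreshing resets only the idle clock. `auth_time` never moves, so a user who visits daily is still signed out once the absolute window elapses.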

8 commits on feat/session-expiration-policy branch (92fa3bc → c7cd711), ~1300 LoC backend + frontend including 28 backend tests. Plan + design review at docs/plans/2026-05-13-session-expiration-policy.md (initial design score 4/10 → final 9/10 via /plan-design-review; 7 design decisions locked).

Rejected:

Idle-only or absolute-only enforcement. Idle without absolute is the current broken state (sliding forever). Absolute without idle is too strict — kicks users out daily.
Hard cutover on deploy (SECRET_KEY rotation). Forces every pilot to log in again immediately; high support cost. Grandfather path is friendlier and adds ~50 lines of code.
Distinguish session_revoked_by_admin from invalid_refresh_token on the wire for users whose sessions were killed via bulk-revoke. Requires tracking revocation reason per refresh_tokens row. Not worth the complexity for v1 — affected users see they're logged out, same as any other revoke.
Per-user device list with per-device revoke. Refresh tokens don't carry device/user-agent metadata today. Account-wide bulk revoke covers the breach-response use case; per-device is a follow-up if pilots ask.
"Loose" preset (90d). Strict default suggests we shouldn't ship a one-click loose option. Owners who want a loose policy can use Custom and own the choice explicitly.
Always-required idle_minutes+absolute_minutes (XOR-NULL invariant). Forces owners who only want to override idle to also re-declare the absolute window, leaking the system default into account data. Partial overrides allowed; validated at the app layer against current defaults.
Reveal-on-Custom UI for the minute inputs. Hidden-by-default-reveal-on-radio shifts page layout when Custom is selected. Always-visible-but-disabled is more stable and previews the Custom interaction.
Modal-stays-open-success-state for scope=all bulk-revoke. User preferred auto-redirect-with-toast (more standard SaaS pattern); the toast acts as the success acknowledgment before /login loads.

Consequences:

"Logged in forever" is fixed. Under the default policy every user must re-authenticate at least once every 14 days, and sooner (after 3 idle days) if they stop visiting.
Account owners get a complete self-service surface for policy + bulk session control. New /account/security route, owner-gated.
Audit-log entries on both mutations: account.session_policy_update and account.sessions_revoked_bulk. SOC2-ready.
Frontend idle_expires_at + absolute_expires_at flow through the entire auth surface (Token, OAuthCallbackResponse, authStore, persistence). useAuthSessionExpiry hook is the single source for "is the session about to end."
Future improvements (filed as follow-ups in plan §9): per-user device list (requires refresh_tokens.last_used_at column), super-admin global ceiling UI, per-user policy. None block current shipping.
Cyan info-tone banner on /login is the first of its kind in the app; sets precedent for future neutral system messages.

2026-05-07 — Per-email allowlist (`INTERNAL_TESTER_EMAILS`) for self-serve soft cutover

Context: Phase O Task 46 ("internal validation pass") needed a way to exercise the full self-serve flow against the prod backend before flipping SELF_SERVE_ENABLED=true for everyone. The plan doc described the mechanism but the backend support was never built — flagged in SESSION_LOG.md as a code blocker. Stripe live-mode setup is also gated on having a working internal-tester path in prod test mode.

Decision: Comma-separated allowlist INTERNAL_TESTER_EMAILS parsed by a Pydantic field_validator into a normalized lowercase list. Two helpers on Settings: is_internal_tester(email) (case-insensitive membership check) and is_self_serve_active_for(email) (returns SELF_SERVE_ENABLED OR is_internal_tester(email)). Both endpoints that gate on the global flag now call the helper:

/config/public accepts optional auth via new get_current_user_optional dep; returns self_serve_enabled=true for allowlisted authenticated callers; anonymous calls always see the global flag.
/auth/register allows allowlisted emails to register without an invite code.
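A dependency-free sketch of the allowlist helpers described above. The decision uses a Pydantic `field_validator` on the real `Settings` class; here a plain dataclass and a standalone `parse_allowlist` function (a hypothetical name) stand in for it so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Optional


def parse_allowlist(raw: str) -> list[str]:
    """Split a comma-separated env value into normalized lowercase emails."""
    return [item.strip().lower() for item in raw.split(",") if item.strip()]


@dataclass
class Settings:
    SELF_SERVE_ENABLED: bool = False
    INTERNAL_TESTER_EMAILS: list[str] = field(default_factory=list)

    def is_internal_tester(self, email: Optional[str]) -> bool:
        """Case-insensitive membership check; anonymous callers never match."""
        if email is None:
            return False
        return email.strip().lower() in self.INTERNAL_TESTER_EMAILS

    def is_self_serve_active_for(self, email: Optional[str]) -> bool:
        """Global flag OR allowlist membership."""
        return self.SELF_SERVE_ENABLED or self.is_internal_tester(email)


settings = Settings(
    SELF_SERVE_ENABLED=False,
    INTERNAL_TESTER_EMAILS=parse_allowlist(" Alice@Example.com, bob@example.com ,"),
)
```

With the global flag off, `settings.is_self_serve_active_for("ALICE@example.com")` is true while an anonymous call (`None`) only ever sees the global flag.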

Rejected:

Custom header X-Internal-Tester-Email for anonymous flows. Spoofable. The auth/register-payload checks are sufficient because the user has to OWN the email to register or log in.
Separate allowlists per surface (INTERNAL_PRICING_TESTERS, INTERNAL_OAUTH_TESTERS). Premature splitting. The Phase O use case is "this small set of people can see the new flow"; one variable handles it. If finer granularity emerges, split then.
Database table for the allowlist. Env var matches the spec from the plan doc and fits the soft-cutover lifecycle — list is small, changes infrequently, lives alongside other deployment-time config.

Consequences:

Stripe internal validation can run end-to-end in prod test mode without flipping the global flag.
Anonymous callers always see the global flag — the allowlist never leaks via unauthenticated request content. Three regression tests in test_config_public.py enforce this.
INTERNAL_TESTER_EMAILS plumbed through docker-compose.dev.yml and documented in backend/.env.example. Railway prod env will need the same var set during Phase O cutover.

2026-05-07 — Reconcile plan tier taxonomy (rename `team` → `enterprise`, add `starter`)

Context: PR #162 left a real architectural gap. Marketing surface (PricingPage, Stripe products) was wired for Starter / Pro / Enterprise while backend was on free / pro / team. plan_billing.plan FK referenced plan_limits.plan so the BillingPlan schema's Literal["pro", "starter", "team", "enterprise"] could accept values that violated the FK. plan_billing was unseeded in dev, so no checkout could complete. Subscription.plan.in_(["pro", "team"]) paid-plan checks wouldn't recognize enterprise. Self-serve cutover was blocked at the data layer.

Decision: Reconcile to a single taxonomy — backend slugs become free / pro / starter / enterprise, matching the marketing surface and Stripe products. Migration 4ce3e594cb87:

Defensive UPDATE subscriptions SET plan='enterprise' WHERE plan='team' (dev had zero such rows; safety for any prod stragglers).
Rename the plan_limits.plan='team' row to 'enterprise'.
Insert a starter row with caps interpolated between free and pro: max_trees=10, max_sessions=75, max_users=1, max_ai_builds_per_month=15, no KB Accelerator, no custom branding, no priority support.

Code rename across schemas, Subscription paid-plan/has_pro_entitlement checks, admin endpoints, frontend useSubscription.isPaidPlan. Resource visibility (Tree.visibility='team', StepLibrary.visibility='team') is a separate domain and intentionally untouched — that string means "shared with my account" and has nothing to do with the subscription tier.

New backend/scripts/sync_stripe_plan_ids.py — idempotent upsert of plan_billing rows from Stripe products by exact name match (ResolutionFlow Starter / Pro / Enterprise). Picks the active monthly recurring price for tiers that have one. Annual fields stay NULL by design — annual pricing is intentionally out of scope for the soft cutover ("want to be able to exit if necessary without breaching any terms").

Rejected:

Map marketing names to existing slugs (Option A from the discussion). Smallest diff but means PricingPage cards have to translate enterprise → team at render time, and "Starter" can't exist as a real backend tier — it'd have to be hidden or dropped. Kicks the can.
Add starter only, keep team slug as cosmetic enterprise (Option C). Mixed taxonomy across layers — slug-vs-display-name divergence guarantees confusion in 6 months. Compromise that's worse than either pure choice.
Annual pricing in this iteration. User's explicit constraint: skip annual to keep exit-flexibility. Schema columns (annual_price_cents, stripe_annual_price_id) preserved as nullable for future re-enable.
Auto-archive the existing Enterprise $500/mo test-mode price. Done manually via Stripe MCP after un-setting the product's default_price first. Spec says Enterprise is sales-led with no catalog price.

Consequences:

plan_billing table is now seedable and seeded. Test-mode plan_billing populated for all 3 tiers via sync_stripe_plan_ids.py. Live mode runs the same script after manual Dashboard setup of products + prices.
New consumers of Subscription.plan literal must use ("free", "pro", "starter", "enterprise"). Three call sites already updated. Backend-wide grep is the safety net for new ones.
Subscription.is_paid and has_pro_entitlement now include starter — Starter is a paid tier with a real $19.99/mo price.
86/86 tests passing across the subscription/billing/plan/invite/admin sweep after the rename.
Test fixtures: conftest.py plan_limits seed updated to the new taxonomy. _seed_plan_limits helper in test_plans_public.py is now a true upsert so tests can override max_users even when conftest seeded the canonical value.

2026-05-07 — Standardize backend Python on 3.12

Context: Runtime facts had drifted from docs. The backend Dockerfiles and running dev container were already on Python 3.12, GitHub CI had just been updated to 3.12, but project docs still said Python 3.11 and Gitea CI relied on the runner's ambient Python.

Decision: Treat Python 3.12 as the backend standard. Pin local pyenv via .python-version to 3.12.13, matching the current python:3.12-slim container patch level. Add explicit Python 3.12 setup to Gitea CI and keep GitHub CI on Python 3.12.

Rejected: Moving Docker/runtime back to Python 3.11. The application was already building and running on 3.12, so reverting the runtime would add churn without a product or dependency reason.

Consequences: Native backend work should use backend/venv created from Python 3.12.13. Future docs/CI/runtime changes should preserve Python 3.12 unless a deliberate upgrade decision is recorded.

2026-04-30 — Add `applied_pending` non-terminal status to suggested fixes

Context: The verifying banner forces a synchronous verdict (worked / didn't / partial), but many real MSP fixes are async: the engineer ran the script but is still waiting on a client power-cycle, AD replication, or an O365 license sync. With only the existing outcomes, the engineer either leaves the banner stale (eroding the verifying signal) or guesses wrong (corrupting outcome data). User flagged the gap directly. Today's NudgeBanner "Still checking" button just silences the nudge; it doesn't tell the system anything.

Decision: Add a fourth, non-terminal outcome applied_pending, parallel to applied_partial. Required pending_reason Text column stores the "what are you waiting on?" reason. Outcome endpoint allows pending → {success, failed, partial, dismissed} transitions; pending stamps applied_at but NOT verified_at (it's parked, not verified). Resolution-note generator frames the fix as provisional (no closure language); escalation-package generator surfaces pending verification as the leading hypothesis with a reference to what's being waited on. Frontend exposes the state via a new PendingBanner component (info-tone, mirrors PartialBanner) plus a "Waiting to verify…" overflow option in the verifying banner. NudgeBanner "Still checking" now records pending with a reason instead of just silencing.
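The pending state's rules can be sketched as two small functions over a dict-shaped fix record. The outcome names in `PENDING_EXITS` are the short forms used in the text above; the real `FixStatus` values, the function names, and the record shape are assumptions.

```python
from datetime import datetime, timezone

PENDING = "applied_pending"
# Outcomes a parked fix may advance to, per the decision above.
PENDING_EXITS = {"success", "failed", "partial", "dismissed"}


def park_as_pending(fix: dict, reason: str, now: datetime) -> dict:
    """Park a fix: a reason is required, applied_at is stamped,
    and verified_at is deliberately left untouched."""
    if not reason.strip():
        raise ValueError("pending_reason is required")
    return {**fix, "status": PENDING, "pending_reason": reason, "applied_at": now}


def resolve_pending(fix: dict, outcome: str) -> dict:
    """Advance a parked fix. pending_reason is kept as an audit trail."""
    if fix.get("status") != PENDING:
        raise ValueError("fix is not pending")
    if outcome not in PENDING_EXITS:
        raise ValueError(f"cannot move from pending to {outcome!r}")
    return {**fix, "status": outcome}


now = datetime(2026, 4, 30, tzinfo=timezone.utc)
parked = park_as_pending({"id": 1}, "waiting on AD replication", now)
done = resolve_pending(parked, "success")
```

After `resolve_pending`, the record still carries `pending_reason`, so a later reader can see what was being waited on before the fix succeeded.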

Rejected:

Reuse applied_partial. Semantically wrong — partial means "I did some of it." Pending means "I did all of it, just can't tell if it worked." Generators write different prose for each, and conflating them would lose the distinction in the customer-facing resolution note and the next-engineer escalation handoff.
Add a pending_reason column without a new status. The status field is what the dashboard, banner, and generators all branch on. Hiding pending state in a separate column would proliferate IF pending_reason IS NOT NULL checks across every consumer.
Cross-session "Follow-ups" dashboard rollup in v1. Per-session PendingBanner is the chat-anchored reminder. Add the dashboard surface only if engineers report losing track across multiple pending sessions in pilot use.
Optional follow-up timer ("remind me in 30m"). Out of scope; nice-to-have but not the wedge.

Consequences:

Engineers can park a fix honestly without losing the verifying signal. The state survives across sessions because it's persisted server-side.
pending_reason is preserved as audit trail when the engineer advances pending → success/failed/dismissed; it is not auto-cleared. Intentional — it tells the next reader "we waited for X, then it worked."
New consumers of FixStatus must handle the applied_pending case. Currently three: the banner derivation in AssistantChatPage, the resolution-note generator, and the escalation-package generator. All three updated in this change.
Migration c0f3a4b7e91d is reversible — downgrade rewrites pending rows back to applied_partial and copies pending_reason into partial_notes if the partial slot was empty, then drops the column.

2026-04-30 — Allow `escalated_to_id` to send chat messages in claimed sessions

Context: During browser QA, clicking "Get AI analysis" on the magic-moment screen returned POST /ai-sessions/{id}/chat → 400. The senior tech who claimed the session is stored as escalated_to_id on AISession, not user_id (which remains the junior who created the session). unified_chat_service.send_chat_message queried WHERE ai_sessions.user_id = :user_id, so the senior's ID never matched and the endpoint rejected the request.

Decision: Extend the ownership check in send_chat_message to OR ai_sessions.escalated_to_id = :user_id using SQLAlchemy or_(). This is the minimal, correct fix: the session model already has a semantically valid "also owns" field for the claiming senior; extending the WHERE clause makes that ownership real.
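The ownership rule reads as follows when expressed as a plain Python predicate. The actual change lives in the SQL WHERE clause via `or_()`; this sketch only mirrors its logic, and the function name and integer ids are assumptions.

```python
from typing import Optional


def owns_session(user_id: int, escalated_to_id: Optional[int], caller_id: int) -> bool:
    """True if the caller created the session or claimed it on escalation."""
    if caller_id == user_id:
        return True
    # An unclaimed session has no escalated_to_id, so nobody matches it.
    return escalated_to_id is not None and caller_id == escalated_to_id


junior, senior, stranger = 10, 20, 30
```

Here `owns_session(junior, senior, senior)` is true after the claim, while `user_id` stays pointed at the junior for audit and history queries.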

Rejected:

Transfer user_id to the senior on claim. Breaks the audit trail — user_id is the originating engineer throughout the session lifecycle. Any query scoped to "sessions this engineer worked on" would silently lose the junior's history.
A separate can_send_message service method. Adds indirection with no benefit for v1. One or_() line in the existing query is sufficient.
Checking a role/permission flag instead. Role gating (engineer/admin) already happens at the claim endpoint. The chat-send check is about session ownership, not role. Mixing the two concerns would be confusing.

Consequences:

Seniors can send AI briefings and continue chat work in sessions they have claimed. Core escalation pickup flow unblocked.
Any future caller of send_chat_message should be aware that "user_id or escalated_to_id" is the ownership rule. The service-level check is the single enforcement point.
user_id remains the originating engineer for all audit, history, and analytics queries. No data migration needed.

2026-04-29 — Consolidate the three per-escalation AI calls into one structured generation

Context: A single user-initiated escalation currently triggers three separate Sonnet calls, all summarizing the same source material (session state, steps taken, "what we know") from slightly different angles:

_build_escalation_package_enhanced — runs in the background enrich_escalation_async task, builds a rich JSON payload that's saved to ai_session.escalation_package.
_generate_ai_assessment — also background, returns the magic-moment screen fields (likely_cause, suggested_steps[], confidence).
generate_status_update — engineer-triggered when they click "Ticket Notes" / "Client Update" / "Email Draft" in the conclude modal, generates audience-specific PSA prose.

The user surfaced the smell: the engineer is typically generating a status update during the escalate flow, so the AI assessment work is being done twice with overlapping context and the engineer's PSA prose is being thrown away. Live test on 2026-04-29 also showed that bumping the assessment timeout 15s → 45s did NOT fix the empty-placeholder bug — meaning the architectural smell is also a demo blocker.

Decision: ONE structured AI call per escalation that produces a single payload covering both the magic-moment screen's diagnostic fields AND the PSA-ready prose. Persist to SessionHandoff. The conclude modal's "Ticket Notes" button reads from the saved prose instead of calling the model. "Client Update" and "Email Draft" buttons trigger a cheap Haiku transformation over the saved prose (tone shift only, not a re-summarization).

Proposed payload shape (final form decided during implementation):

{
  "summary_prose": "<PSA-flavored ticket-notes paragraph>",
  "what_we_know": ["<one-liner>"],
  "likely_cause": "<one sentence>",
  "suggested_steps": ["<short step>"],
  "confidence": "low | medium | high",
  "audience_variants": {"client_update": null, "email_draft": null}
}

audience_variants entries are filled lazily on the first user request for that audience, then cached.
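A sketch of the lazy fill, assuming the payload is held as a dict in the proposed shape and that `transform` stands in for the cheap tone-shift model call. Both the function name and the `transform` signature are hypothetical.

```python
from typing import Callable


def get_audience_variant(
    payload: dict,
    audience: str,
    transform: Callable[[str, str], str],
) -> str:
    """Return the cached variant, generating it from summary_prose on first use."""
    variants = payload["audience_variants"]
    if audience not in variants:
        raise KeyError(f"unknown audience: {audience}")
    if variants[audience] is None:
        # First request: tone-shift the saved prose, then cache the result.
        variants[audience] = transform(payload["summary_prose"], audience)
    return variants[audience]


payload = {
    "summary_prose": "Printer spooler restarted; queue cleared.",
    "audience_variants": {"client_update": None, "email_draft": None},
}
```

"Ticket Notes" would read `summary_prose` directly with no model call, while each of the other buttons would pay for at most one transformation per escalation.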

Rejected:

Just bumping the timeout further. Already tried 5s → 15s → 45s. The architectural redundancy is the real cost — even if Sonnet completed reliably, three calls per escalation is wasteful and creates three places where state can diverge.
Reusing the engineer's status update content as the AI assessment. User's first instinct, but: status updates aren't always generated (engineer has to click), they're audience-specific (so you'd pick which one to copy), and they're prose without the structured fields the magic-moment screen needs. The right consolidation is the OTHER direction — generate ONE structured payload that the status-update buttons consume.
Switching the assessment to Haiku for speed. Faster but solves only the latency symptom, not the redundancy. Doesn't help the conclude modal's status-update buttons.

Consequences:

Magic-moment screen populates in ~5s instead of 25s+ (work happens in the foreground escalate path, not in a background task that races with the senior's pickup).
Token spend per escalation drops by ~60% — one Sonnet call replaces two; the third (audience variants) becomes Haiku.
Engineer's "Ticket Notes" button is instant — no model round-trip.
Schema enforcement matters. The current _generate_ai_assessment returns freeform prose that the frontend stuffs into assessment_text because the structured fields aren't reliably parseable. The new call must use Anthropic's structured output / tool-use to enforce the schema.
Migration concern: ai_session.escalation_package JSON column has live data on existing sessions. Keep it READABLE for backward compatibility; just stop writing the enhanced payload from enrich_escalation_async. If downstream queue summaries depend on it, dual-write the basic snapshot.
Test fixtures (test_handoff_manager.py, test_session_handoffs_api.py) currently stub _generate_ai_assessment via AsyncMock. Updating the stubs is part of the rename.
The frontend SSE assessment-ready subscription (added in 0f00ee5) stays as-is — it just listens for the new event payload.

2026-04-28 — Tag the task-lane state with an owner chatId

Context: A recurring bug — every time the user returned to test escalation work, creating a new session would flash the previous session's task-lane data (questions, actions, "Tasks" pill counts) before the new session's AI response landed. The first attempt to fix it (8914391) added initializer-time guards (incomingPrefill || isPickup) that skipped the sessionStorage restore on mount. That covered exactly two entry paths and missed every other case: in-place URL navigation, mid-flight pickup, HMR re-runs, and the gap between setActiveChatId(B) and the AI response that finally populates B's questions/actions. The persistence effect made it worse by writing {chatId: activeChatId, questions: activeQuestions} — at any moment where activeChatId had flipped before the questions were updated, sessionStorage was stamped with {chatId: B, questions: [A's data]} and a subsequent restore would happily render A's data for B.

The root cause was that activeQuestions / activeActions / showTaskLane were three independent state slices implicitly assumed to be in sync with activeChatId. The synchronization was by convention, not by structure. Every code path that mutated them had to remember to call resetSessionDerivedState first; missing one created stale UI.

Decision: Add a taskLaneOwnerChatId state that records which chatId the in-memory questions/actions belong to, set at every site that populates them (sendPrefill, selectChat, handleSend, handleTaskSubmit, handleResumeNew, refreshFacts, handleApplyFix), cleared in resetSessionDerivedState. The persistence effect writes ownerChatId as the chatId tag. Render is gated on taskLaneOwnerChatId === activeChatId and ANDed into all three render conditions (toolbar Tasks button, narrow-viewport floating drawer, main side panel). The mount-time skipTaskLaneRestore guard stays as belt-and-braces for the prefill/pickup entry-flash window, which the owner-gate alone doesn't cover.

Rejected:

More entry-path guards. That's whack-a-mole — the next path nobody anticipated will reproduce the bug. The owner-gate makes the bug structurally impossible regardless of which path triggers it.
Combining the four state slices into a single tagged object. Cleaner long-term but a bigger refactor with more touch points. The owner-tracking approach gets the structural guarantee with a minimal diff and keeps the existing setState patterns.
Inlining the comparison at every render site. Works but proliferates the comparison; one named derived value (taskLaneIsForActiveChat) reads better and groups the gate with the persistence-effect / state declarations as a named concept.

Consequences:

Stale task-lane data structurally cannot display. The lane is hidden during any window where taskLaneOwnerChatId !== activeChatId, no matter which mutation path produced that state.
Adding new sites that populate activeQuestions / activeActions requires also setting taskLaneOwnerChatId. The pattern is documented in the commit message and visible in every existing populate site as a paired call.
The mount-time skipTaskLaneRestore guard is now redundant in steady-state but kept for the few-hundred-ms flash window between component mount and the first sendPrefill / selectChat effect. Deleting it would re-introduce a (smaller) flash without strong reason.
Future task-lane state slices (e.g. facts, activeFix) follow the same pattern: gate their visibility on the owner check via the existing render conditions. Tagging more slices with their own *OwnerChatId is a future refactor if the slices diverge.

2026-04-24 — Adopt dual-agent handoff system (`.ai/` + `CLAUDE.md` + `AGENTS.md`)

Context: Claude Code hits session and weekly usage limits. Work stalls when the primary agent is locked out. Needed a structured way for OpenAI Codex to resume where Claude left off without losing architectural truth or drifting across sessions.

Decision: Split the old CLAUDE.md into .ai/PROJECT_CONTEXT.md (stable repo truth), agent-specific root files (CLAUDE.md, AGENTS.md) with a shared protocol block, and a small handoff toolkit (CURRENT_TASK.md, HANDOFF.md, TODO.md, DECISIONS.md, SESSION_LOG.md, README.md). Previous CLAUDE.md snapshotted in commit e110fed before the migration.

Rejected:

Single symlinked CLAUDE.md/AGENTS.md — diverges silently, hides agent-specific tooling differences.
Putting GitNexus/gstack content in AGENTS.md — Codex doesn't have those tools; would mislead the resume agent.
Keeping the old CLAUDE.md as-is and adding AGENTS.md alongside it — duplicated truth, drift guaranteed.

Consequences:

First read for either agent: .ai/PROJECT_CONTEXT.md + .ai/CURRENT_TASK.md + .ai/HANDOFF.md.
Architectural changes in the repo require updating PROJECT_CONTEXT.md, not the root agent files.
Git trailers differ per agent (Claude Opus 4.7 vs Codex) — preserved in each root file.
Legacy SESSION-HANDOFF.md deleted in the same commit; superseded by .ai/HANDOFF.md.
