Files
resolutionflow/.ai/DECISIONS.md
Michael Chihlas 3f04911070
All checks were successful
CI / frontend (push) Successful in 6m40s
Mirror to GitHub / mirror (push) Successful in 7s
CI / e2e (push) Successful in 10m7s
CI / backend (push) Successful in 10m34s
feat(billing): plan taxonomy reconciliation + Stripe sync + internal-tester allowlist (#164)
Co-authored-by: Michael Chihlas <michael@resolutionflow.com>
Co-committed-by: Michael Chihlas <michael@resolutionflow.com>
2026-05-11 05:07:07 +00:00

21 KiB

DECISIONS.md

Append-only architectural decision log. Newest entries at the top. Entry format:

## YYYY-MM-DD — <short title>
**Context:** why this came up
**Decision:** what we chose
**Rejected:** what we didn't choose and why
**Consequences:** what this means going forward

2026-05-07 — Per-email allowlist (INTERNAL_TESTER_EMAILS) for self-serve soft cutover

Context: Phase O Task 46 ("internal validation pass") needed a way to exercise the full self-serve flow against the prod backend before flipping SELF_SERVE_ENABLED=true for everyone. The plan doc described the mechanism but the backend support was never built — flagged in SESSION_LOG.md as a code blocker. Stripe live-mode setup is also gated on having a working internal-tester path in prod test mode.

Decision: Comma-separated allowlist INTERNAL_TESTER_EMAILS parsed by a Pydantic field_validator into a normalized lowercase list. Two helpers on Settings: is_internal_tester(email) (case-insensitive membership check) and is_self_serve_active_for(email) (returns SELF_SERVE_ENABLED OR is_internal_tester(email)). Both endpoints that gate on the global flag now call the helper:

  • /config/public accepts optional auth via new get_current_user_optional dep; returns self_serve_enabled=true for allowlisted authenticated callers; anonymous calls always see the global flag.
  • /auth/register allows allowlisted emails to register without an invite code.

Rejected:

  • Custom header X-Internal-Tester-Email for anonymous flows. Spoofable. The auth/register-payload checks are sufficient because the user has to OWN the email to register or log in.
  • Separate allowlists per surface (INTERNAL_PRICING_TESTERS, INTERNAL_OAUTH_TESTERS). Premature splitting. The Phase O use case is "this small set of people can see the new flow"; one variable handles it. If finer granularity emerges, split then.
  • Database table for the allowlist. Env var matches the spec from the plan doc and fits the soft-cutover lifecycle — list is small, changes infrequently, lives alongside other deployment-time config.

Consequences:

  • Stripe internal validation can run end-to-end in prod test mode without flipping the global flag.
  • Anonymous callers always see the global flag — the allowlist never leaks via unauthenticated request content. Three regression tests in test_config_public.py enforce this.
  • INTERNAL_TESTER_EMAILS plumbed through docker-compose.dev.yml and documented in backend/.env.example. Railway prod env will need the same var set during Phase O cutover.

2026-05-07 — Reconcile plan tier taxonomy (rename teamenterprise, add starter)

Context: PR #162 left a real architectural gap. Marketing surface (PricingPage, Stripe products) was wired for Starter / Pro / Enterprise while backend was on free / pro / team. plan_billing.plan FK referenced plan_limits.plan so the BillingPlan schema's Literal["pro", "starter", "team", "enterprise"] could accept values that violated the FK. plan_billing was unseeded in dev, so no checkout could complete. Subscription.plan.in_(["pro", "team"]) paid-plan checks wouldn't recognize enterprise. Self-serve cutover was blocked at the data layer.

Decision: Reconcile to a single taxonomy — backend slugs become free / pro / starter / enterprise, matching the marketing surface and Stripe products. Migration 4ce3e594cb87:

  1. Defensive UPDATE subscriptions SET plan='enterprise' WHERE plan='team' (dev had zero such rows; safety for any prod stragglers).
  2. Rename the plan_limits.plan='team' row to 'enterprise'.
  3. Insert a starter row with caps interpolated between free and pro: max_trees=10, max_sessions=75, max_users=1, max_ai_builds_per_month=15, no KB Accelerator, no custom branding, no priority support.

Code rename across schemas, Subscription paid-plan/has_pro_entitlement checks, admin endpoints, frontend useSubscription.isPaidPlan. Resource visibility (Tree.visibility='team', StepLibrary.visibility='team') is a separate domain and intentionally untouched — that string means "shared with my account" and has nothing to do with the subscription tier.

New backend/scripts/sync_stripe_plan_ids.py — idempotent upsert of plan_billing rows from Stripe products by exact name match (ResolutionFlow Starter / Pro / Enterprise). Picks the active monthly recurring price for tiers that have one. Annual fields stay NULL by design — annual pricing is intentionally out of scope for the soft cutover ("want to be able to exit if necessary without breaching any terms").

Rejected:

  • Map marketing names to existing slugs (Option A from the discussion). Smallest diff but means PricingPage cards have to translate enterpriseteam at render time, and "Starter" can't exist as a real backend tier — it'd have to be hidden or dropped. Kicks the can.
  • Add starter only, keep team slug as cosmetic enterprise (Option C). Mixed taxonomy across layers — slug-vs-display-name divergence guarantees confusion in 6 months. Compromise that's worse than either pure choice.
  • Annual pricing in this iteration. User's explicit constraint: skip annual to keep exit-flexibility. Schema columns (annual_price_cents, stripe_annual_price_id) preserved as nullable for future re-enable.
  • Auto-archive the existing Enterprise $500/mo test-mode price. Done manually via Stripe MCP after un-setting the product's default_price first. Spec says Enterprise is sales-led with no catalog price.

Consequences:

  • plan_billing table is now seedable and seeded. Test-mode plan_billing populated for all 3 tiers via sync_stripe_plan_ids.py. Live mode runs the same script after manual Dashboard setup of products + prices.
  • New consumers of Subscription.plan literal must use ("free", "pro", "starter", "enterprise"). Three call sites already updated. Backend-wide grep is the safety net for new ones.
  • Subscription.is_paid and has_pro_entitlement now include starter — Starter is a paid tier with a real $19.99/mo price.
  • 86/86 passing across the subscription/billing/plan/invite/admin sweep after the rename.
  • Test fixtures: conftest.py plan_limits seed updated to the new taxonomy. _seed_plan_limits helper in test_plans_public.py is now a true upsert so tests can override max_users even when conftest seeded the canonical value.

2026-05-07 — Standardize backend Python on 3.12

Context: Runtime facts had drifted from docs. The backend Dockerfiles and running dev container were already on Python 3.12, GitHub CI had just been updated to 3.12, but project docs still said Python 3.11 and Gitea CI relied on the runner's ambient Python.

Decision: Treat Python 3.12 as the backend standard. Pin local pyenv via .python-version to 3.12.13, matching the current python:3.12-slim container patch level. Add explicit Python 3.12 setup to Gitea CI and keep GitHub CI on Python 3.12.

Rejected: Moving Docker/runtime back to Python 3.11. The application was already building and running on 3.12, so reverting the runtime would add churn without a product or dependency reason.

Consequences: Native backend work should use backend/venv created from Python 3.12.13. Future docs/CI/runtime changes should preserve Python 3.12 unless a deliberate upgrade decision is recorded.

2026-04-30 — Add applied_pending non-terminal status to suggested fixes

Context: The verifying banner forces a synchronous verdict — worked / didn't / partial — but a lot of real MSP fixes are async. Engineer ran the script but is waiting on the client to power-cycle, AD replication, an O365 license sync. With only the existing outcomes, the engineer either leaves the banner stale (eroding the verifying signal) or guesses wrong (corrupting outcome data). User flagged the gap directly. Today's NudgeBanner "Still checking" button just silences the nudge — it doesn't tell the system anything.

Decision: Add a fourth, non-terminal outcome applied_pending, parallel to applied_partial. Required pending_reason Text column stores the "what are you waiting on?" reason. Outcome endpoint allows pending → {success, failed, partial, dismissed} transitions; pending stamps applied_at but NOT verified_at (it's parked, not verified). Resolution-note generator frames the fix as provisional (no closure language); escalation-package generator surfaces pending verification as the leading hypothesis with a reference to what's being waited on. Frontend exposes the state via a new PendingBanner component (info-tone, mirrors PartialBanner) plus a "Waiting to verify…" overflow option in the verifying banner. NudgeBanner "Still checking" now records pending with a reason instead of just silencing.

Rejected:

  • Reuse applied_partial. Semantically wrong — partial means "I did some of it." Pending means "I did all of it, just can't tell if it worked." Generators write different prose for each, and conflating them would lose the distinction in the customer-facing resolution note and the next-engineer escalation handoff.
  • Add a pending_reason column without a new status. The status field is what the dashboard, banner, and generators all branch on. Hiding pending state in a separate column would proliferate IF pending_reason IS NOT NULL checks across every consumer.
  • Cross-session "Follow-ups" dashboard rollup in v1. Per-session PendingBanner is the chat-anchored reminder. Add the dashboard surface only if engineers report losing track across multiple pending sessions in pilot use.
  • Optional follow-up timer ("remind me in 30m"). Out of scope; nice-to-have but not the wedge.

Consequences:

  • Engineers can park a fix honestly without losing the verifying signal. The state survives across sessions because it's persisted server-side.
  • pending_reason is preserved as audit trail when the engineer advances pending → success/failed/dismissed; it is not auto-cleared. Intentional — it tells the next reader "we waited for X, then it worked."
  • New consumers of FixStatus must handle the applied_pending case. Currently three: the banner derivation in AssistantChatPage, the resolution-note generator, and the escalation-package generator. All three updated in this change.
  • Migration c0f3a4b7e91d is reversible — downgrade rewrites pending rows back to applied_partial and copies pending_reason into partial_notes if the partial slot was empty, then drops the column.

2026-04-30 — Allow escalated_to_id to send chat messages in claimed sessions

Context: During browser QA, clicking "Get AI analysis" on the magic-moment screen returned POST /ai-sessions/{id}/chat → 400. The senior tech who claimed the session is stored as escalated_to_id on AISession, not user_id (which remains the junior who created the session). unified_chat_service.send_chat_message queried WHERE ai_sessions.user_id = :user_id, so the senior's ID never matched and the endpoint rejected the request.

Decision: Extend the ownership check in send_chat_message to OR ai_sessions.escalated_to_id = :user_id using SQLAlchemy or_(). This is the minimal, correct fix: the session model already has a semantically valid "also owns" field for the claiming senior; extending the WHERE clause makes that ownership real.

Rejected:

  • Transfer user_id to the senior on claim. Breaks the audit trail — user_id is the originating engineer throughout the session lifecycle. Any query scoped to "sessions this engineer worked on" would silently lose the junior's history.
  • A separate can_send_message service method. Adds indirection with no benefit for v1. One or_() line in the existing query is sufficient.
  • Checking a role/permission flag instead. Role gating (engineer/admin) already happens at the claim endpoint. The chat-send check is about session ownership, not role. Mixing the two concerns would be confusing.

Consequences:

  • Seniors can send AI briefings and continue chat work in sessions they have claimed. Core escalation pickup flow unblocked.
  • Any future caller of send_chat_message should be aware that "user_id or escalated_to_id" is the ownership rule. The service-level check is the single enforcement point.
  • user_id remains the originating engineer for all audit, history, and analytics queries. No data migration needed.

2026-04-29 — Consolidate the three per-escalation AI calls into one structured generation

Context: A single user-initiated escalation currently triggers three separate Sonnet calls, all summarizing the same source material (session state, steps taken, "what we know") from slightly different angles:

  1. _build_escalation_package_enhanced — runs in the background enrich_escalation_async task, builds a rich JSON payload that's saved to ai_session.escalation_package.
  2. _generate_ai_assessment — also background, returns the magic-moment screen fields (likely_cause, suggested_steps[], confidence).
  3. generate_status_update — engineer-triggered when they click "Ticket Notes" / "Client Update" / "Email Draft" in the conclude modal, generates audience-specific PSA prose.

The user surfaced the smell: the engineer is typically generating a status update during the escalate flow, so the AI assessment work is being done twice with overlapping context and the engineer's PSA prose is being thrown away. Live test on 2026-04-29 also showed that bumping the assessment timeout 15s → 45s did NOT fix the empty-placeholder bug — meaning the architectural smell is also a demo blocker.

Decision: ONE structured AI call per escalation that produces a single payload covering both the magic-moment screen's diagnostic fields AND the PSA-ready prose. Persist to SessionHandoff. The conclude modal's "Ticket Notes" button reads from the saved prose instead of calling the model. "Client Update" and "Email Draft" buttons trigger a cheap Haiku transformation over the saved prose (tone shift only, not a re-summarization).

Proposed payload shape (final form decided during implementation):

{
  "summary_prose": "<PSA-flavored ticket-notes paragraph>",
  "what_we_know": ["<one-liner>"],
  "likely_cause": "<one sentence>",
  "suggested_steps": ["<short step>"],
  "confidence": "low | medium | high",
  "audience_variants": {"client_update": null, "email_draft": null}
}

audience_variants filled lazily on first user request, cached.

Rejected:

  • Just bumping the timeout further. Already tried 5s → 15s → 45s. The architectural redundancy is the real cost — even if Sonnet completed reliably, three calls per escalation is wasteful and creates three places where state can diverge.
  • Reusing the engineer's status update content as the AI assessment. User's first instinct, but: status updates aren't always generated (engineer has to click), they're audience-specific (so you'd pick which one to copy), and they're prose without the structured fields the magic-moment screen needs. The right consolidation is the OTHER direction — generate ONE structured payload that the status-update buttons consume.
  • Switching the assessment to Haiku for speed. Faster but solves only the latency symptom, not the redundancy. Doesn't help the conclude modal's status-update buttons.

Consequences:

  • Magic-moment screen populates in ~5s instead of 25s+ (work happens in the foreground escalate path, not in a background task that races with the senior's pickup).
  • Token spend per escalation drops by ~60% — one Sonnet call replaces two; the third (audience variants) becomes Haiku.
  • Engineer's "Ticket Notes" button is instant — no model round-trip.
  • Schema enforcement matters. The current _generate_ai_assessment returns freeform prose that the frontend stuffs into assessment_text because the structured fields aren't reliably parseable. The new call must use Anthropic's structured output / tool-use to enforce the schema.
  • Migration concern: ai_session.escalation_package JSON column has live data on existing sessions. Keep it READABLE for backward compatibility; just stop writing the enhanced payload from enrich_escalation_async. If downstream queue summaries depend on it, dual-write the basic snapshot.
  • Test fixtures (test_handoff_manager.py, test_session_handoffs_api.py) currently stub _generate_ai_assessment via AsyncMock. Updating the stubs is part of the rename.
  • The frontend SSE assessment-ready subscription (added in 0f00ee5) stays as-is — it just listens for the new event payload.

2026-04-28 — Tag the task-lane state with an owner chatId

Context: A recurring bug — every time the user returned to test escalation work, creating a new session would flash the previous session's task-lane data (questions, actions, "Tasks" pill counts) before the new session's AI response landed. The first attempt to fix it (8914391) added initializer-time guards (incomingPrefill || isPickup) that skipped the sessionStorage restore on mount. That covered exactly two entry paths and missed every other case: in-place URL navigation, mid-flight pickup, HMR re-runs, and the gap between setActiveChatId(B) and the AI response that finally populates B's questions/actions. The persistence effect made it worse by writing {chatId: activeChatId, questions: activeQuestions} — at any moment where activeChatId had flipped before the questions were updated, sessionStorage was stamped with {chatId: B, questions: [A's data]} and a subsequent restore would happily render A's data for B.

The root cause was that activeQuestions / activeActions / showTaskLane were three independent state slices implicitly assumed to be in sync with activeChatId. The synchronization was by convention, not by structure. Every code path that mutated them had to remember to call resetSessionDerivedState first; missing one created stale UI.

Decision: Add a taskLaneOwnerChatId state that records which chatId the in-memory questions/actions belong to, set at every site that populates them (sendPrefill, selectChat, handleSend, handleTaskSubmit, handleResumeNew, refreshFacts, handleApplyFix), cleared in resetSessionDerivedState. The persistence effect writes ownerChatId as the chatId tag. Render is gated on taskLaneOwnerChatId === activeChatId and ANDed into all three render conditions (toolbar Tasks button, narrow-viewport floating drawer, main side panel). The mount-time skipTaskLaneRestore guard stays as belt-and-braces for the prefill/pickup entry-flash window, which the owner-gate alone doesn't cover.

Rejected:

  • More entry-path guards. That's whack-a-mole — the next path nobody anticipated will reproduce the bug. The owner-gate makes the bug structurally impossible regardless of which path triggers it.
  • Combining the four state slices into a single tagged object. Cleaner long-term but a bigger refactor with more touch points. The owner-tracking approach gets the structural guarantee with a minimal diff and keeps the existing setState patterns.
  • Inlining the comparison at every render site. Works but proliferates the comparison; one named derived value (taskLaneIsForActiveChat) reads better and groups the gate with the persistence-effect / state declarations as a named concept.

Consequences:

  • Stale task-lane data is structurally unable to display. The lane is hidden during any window where ownerChatId !== activeChatId, no matter what mutation path got you there.
  • Adding new sites that populate activeQuestions / activeActions requires also setting taskLaneOwnerChatId. The pattern is documented in the commit message and visible in every existing populate site as a paired call.
  • The mount-time skipTaskLaneRestore guard is now redundant in steady-state but kept for the few-hundred-ms flash window between component mount and the first sendPrefill / selectChat effect. Deleting it would re-introduce a (smaller) flash without strong reason.
  • Future task-lane state slices (e.g. facts, activeFix) follow the same pattern: gate their visibility on the owner check via the existing render conditions. Tagging more slices with their own *OwnerChatId is a future refactor if the slices diverge.

2026-04-24 — Adopt dual-agent handoff system (.ai/ + CLAUDE.md + AGENTS.md)

Context: Claude Code hits session and weekly usage limits. Work stalls when the primary agent is locked out. Needed a structured way for OpenAI Codex to resume where Claude left off without losing architectural truth or drifting across sessions.

Decision: Split the old CLAUDE.md into .ai/PROJECT_CONTEXT.md (stable repo truth), agent-specific root files (CLAUDE.md, AGENTS.md) with a shared protocol block, and a small handoff toolkit (CURRENT_TASK.md, HANDOFF.md, TODO.md, DECISIONS.md, SESSION_LOG.md, README.md). Previous CLAUDE.md snapshotted in commit e110fed before the migration.

Rejected:

  • Single symlinked CLAUDE.md/AGENTS.md — diverges silently, hides agent-specific tooling differences.
  • Putting GitNexus/gstack content in AGENTS.md — Codex doesn't have those tools; would mislead the resume agent.
  • Keeping the old CLAUDE.md as-is and adding AGENTS.md alongside it — duplicated truth, drift guaranteed.

Consequences:

  • First read for either agent: .ai/PROJECT_CONTEXT.md + .ai/CURRENT_TASK.md + .ai/HANDOFF.md.
  • Architectural changes in the repo require updating PROJECT_CONTEXT.md, not the root agent files.
  • Git trailers differ per agent (Claude Opus 4.7 vs Codex) — preserved in each root file.
  • Legacy SESSION-HANDOFF.md deleted in the same commit; superseded by .ai/HANDOFF.md.