Files

Michael Chihlas d51e95cdfa docs(plans): add escalation-mode wedge design + test plan

Captures the GTM thesis, premises, reduced-scope engineering plan, locked UI
specs, and embedded review report for the Escalation Mode wedge — output of
/office-hours, /plan-eng-review, /plan-design-review, and /codex review.

Codex review surfaced two corrections we applied:
- two-metric framing (manual baseline vs in-product time-to-first-action)
- claim role gate moved in-scope (was deferred TODO)

TODO updates: peer-tech escalation + claim role gate captured (the latter then
moved in-scope by the codex pass).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-27 15:18:46 -04:00

26 KiB

Raw Blame History

Design: ResolutionFlow GTM — Escalation-Mode-First Wedge

Generated by /office-hours on 2026-04-26 Branch: main Repo: chihlasm/resolutionflow Status: APPROVED Mode: Startup

Problem Statement

ResolutionFlow is a multi-tenant SaaS troubleshooting platform for MSPs, currently in Go-to-Market Validation (pre-PMF). The backend is feature-complete (55+ endpoints, 100+ tests, FlowPilot telemetry baseline accruing). The product has users but no paying customers.

The blocker is not engineering completeness. The blocker is the absence of a sharp GTM story tied to a number a buyer can verify. The session reframed the wedge twice before landing on the real one.

What ResolutionFlow actually is: the structuring layer between conversational AI and the way MSP techs work tickets. AI is great at producing answers; it is bad at producing workflow-shaped output. ResolutionFlow gives the tech the AI they already trust (Claude/GPT) but organizes the output into actionable structured steps, records the session, captures customer-specific context, and turns the result into PSA-formatted ticket notes — and optionally a runbook — without the tech writing anything.

Positioning line: "the senior engineer looking over your shoulder."

Demand Evidence

The founder is the first user. Senior Systems Engineer at an MSP, losing ~20 hours/week to cross-domain interruptions (systems engineer pulled into networking problems and vice versa). At least 4 interruptions per day, with the time cost concentrated in the gap between AI-conversation output and MSP-ticket workflow.

This is solving-your-own-problem demand evidence — strongest possible signal at this stage. The 20 hrs/week figure is the founder's own time, not a hypothetical. Every MSP shop with a senior tech and a junior tech has a version of this problem.

Telemetry signal (Phase 0.5 baseline accruing): captured flows pile up but are not being re-used. This says capture works, retrieval doesn't — which means the "hours-saved-via-re-use" number isn't yet generatable from existing data. The GTM-grade ROI story needs a different metric until re-use lands: minutes recovered per escalation, generated by Approach A below.

Status Quo

MSP techs today resolve tickets via three workarounds:

AI in a tab. Junior tech opens Claude or ChatGPT, pastes the problem, gets a wall of prose, parses it into action items in their head, executes, repeats. AI does the diagnostic work. The tech does all the structure-extraction and ticket-note-writing afterward.
Tribal knowledge. Junior tech pings senior in Slack. Senior tech is interrupted (4+ times/day per the founder's own data). Context handoff is verbal and lossy.
Stale runbooks. Half-maintained Notion / IT Glue / SharePoint pages that nobody trusts because they're 18 months out of date and don't match the current customer environment.

The cost of these workarounds for the founder personally: ~20 hours per week of senior-tech time lost. For a 5-tech MSP, the equivalent is 1 full FTE worth of senior-engineer hours leaking into context-switching and tab-hopping.

Target User & Narrowest Wedge

Target user: Senior Systems Engineer at a small-to-mid MSP (5-20 techs). The founder is exemplar #1. Buying authority is shared between senior tech (champion) and MSP owner (signs the check).

Narrowest paid wedge: Escalation Mode. Single sharp feature. When a junior tech escalates a ticket they were working in FlowPilot, the senior tech opens the ticket and sees the entire structured session state — every step the junior tried, every dead end, every command output — instead of starting with "tell me what you tried" for five minutes.

Why this is the wedge:

Two metrics, not one (revised after /codex review 2026-04-27):
- Manual baseline (the Assignment, weeks 0-2): senior tech stopwatches the next 5 escalations. T1 (first diagnostic action) − T0 (open ticket) under today's verbal-handoff workflow. This is the "what you currently lose" number.
- In-product metric (telemetry, week 3+): time-to-first-action after claim, derived from ai_session_step rows where created_at > SessionHandoff.claimed_at AND user_id = SessionHandoff.claimed_by. This is the "what it is now with structured handoff" number.
- The savings claim = manual baseline − in-product metric. Quote both explicitly in pilot conversations. Do NOT roll the in-product number alone into "minutes recovered" — that's an apples-to-oranges miscount Codex caught in the cross-model review.
Single-feature demo: a 2-minute Loom shows the magic moment — junior hits escalate, senior window opens with full structured context. No theory required.
Cross-buyer story: sells to senior tech (less interruption) AND owner (junior techs resolve faster, take more accounts).
Hours-saved math is simple: 4-5 minutes per escalation × 15-30 escalations per week per senior tech = 1-2 hours/week recovered per senior. At $80-150/hr fully-loaded senior tech cost, the tool pays for itself with one customer.

Constraints

One-founder shop. Cannot run three concurrent product narratives. Sequence matters more than scope.
Pre-PMF runway implied. 4-8 week build cycles before talking to a buyer are expensive. Approach A's 1-2 week timeline is the binding constraint.
Existing architecture is mostly aligned. FlowPilot, unified_chat_service, FlowProposal, ConnectWise PSA integration — most of the pieces exist. Risk is positioning and UX, not capability.
PSA copilot competition is real. ConnectWise / Autotask / Halo are racing to ship AI features. The wedge has to be sharp because we lose on distribution.

Premises

The five load-bearing claims this design rests on, all confirmed in session:

Diagnostic AI is commoditized. ResolutionFlow does not compete on "AI solves the ticket faster." That race is over. ChatGPT/Claude already won.
The structuring layer is the wedge. AI conversational output is too dense and unstructured for active troubleshooting. ResolutionFlow's value is organizing that output into actionable, separable, recorded steps.
Escalation context is the killer feature. "Junior hits escalate, senior gets full structured context in 30 seconds instead of 5 minutes" is the sharpest demoable moment in the entire product surface.
First paying customer is bottom-up, prosumer-flavored. Senior tech at a small MSP, $20-50/seat/month, monthly billing. Owner-targeted enterprise pricing waits until 5+ paying shops establish baseline ROI numbers.
Distribution is MSP communities, not paid SaaS ads. r/msp, MSPGeek, RocketMSP, PSA marketplace listings. The channel matches the buyer.

Approaches Considered

Approach A: Escalation Mode first (REDUCED SCOPE per /plan-eng-review)

Lead the GTM with the killer feature. Polish the escalate-with-context handoff: junior tech mid-session hits escalate, senior tech window opens with full structured session state. 2-min demo Loom. Pilot with 3 MSPs in the founder's network (capped at 3 to preserve build capacity for B). Metric: minutes recovered per escalation.

SCOPE REDUCTION (2026-04-27 eng review): ~80% of Approach A is already built. The original 2-3 week estimate assumed greenfield. Codebase audit confirms:

What the doc said "build"	What actually exists
Session-state serialization	`ai_session.escalation_package` (JSONB), `SessionHandoff.snapshot`
Senior-tech inbox	EscalationQueuePage.tsx + EscalationQueue.tsx
Claim workflow	handoff_manager.py:123 claim_session()
API surface	session_handoffs.py — POST /handoff, /claim, GET queue
AI assessment for senior	`_generate_ai_assessment()` in handoff_manager
PSA round-trip	`escalation_package_markdown`, `escalation_package_external_id`

Real engineering scope (~6-9 days):

Notification dual-path (4-5 days). notification_sent flag is a dead column — never written. Wire two channels in handoff_manager.create_handoff:
- Email (existing EmailService.send_notification_email) — handles offline seniors.
- WebSocket / SSE push to the EscalationQueue for live demo magic moment.
- Set notification_sent=true after dispatch confirmation.
- Graceful degradation: handoff still created if notification raises (regression test required).
Hero metric endpoint (~2 hours). New GET /api/v1/analytics/escalation-metrics, account-scoped, role-gated to require_engineer_or_admin. Computes minutes recovered per escalation by querying:
```
ai_session_step.created_at (first row by senior_tech_user_id where created_at > SessionHandoff.claimed_at)
minus
SessionHandoff.claimed_at
```
Returns a rolling-30-day average per account. No schema change.
UX polish on EscalationQueue + receiving-engineer view (2-3 days). Confirm the magic-moment screen lands when senior clicks claim. Add an unread indicator on the queue. Wire optimistic insert when SSE event arrives.
Loom + landing page copy (1-2 days). Non-engineering. Outside this plan's scope but required for the GTM in week 3.

Test plan: 100% coverage of new paths — 13 tests including 4 e2e and 1 regression (graceful-degradation when notification dispatch raises). Test plan artifact at ~/.gstack/projects/chihlasm-resolutionflow/abc-main-eng-review-test-plan-20260427-000000.md.

Risk: Low. Single feature, single metric, architecture-aligned. The dual-path notification is the only mildly novel surface; both halves use existing infra.

Reuses: services/handoff_manager.py, services/escalation_package_generator.py, models/session_handoff.py, models/ai_session.py, services/notification_service.py, models/notification_log.py, EmailService, EscalationQueuePage + EscalationQueue.

UI Specifications (locked by /plan-design-review 2026-04-27)

Magic-moment screen (new, after Pick Up click): dedicated handoff-context view that loads BEFORE the regular FlowPilot session view, then dissolves on first senior action. Four sections, single frame:

Problem summary (top, 2-3 lines): junior's framing. Bricolage Grotesque h2.
What's been tried (left or middle column): structured list of dead_ends_flagged[] and steps_attempted[] from escalation_package JSONB. Card-flat surface, IBM Plex.
AI assessment (right column): ai_assessment_data rendered as 3 fields — likely_cause, suggested_steps[], confidence. accent-dim badge for confidence.
Start here (primary CTA, electric-blue, ≥44px touch target): opens FlowPilot session at the most-likely-next-step. Senior typing or clicking anywhere triggers 200ms fade-out and FlowPilot view fades in. Re-openable via "Show handoff context" ghost button in FlowPilot toolbar.

Hero metric ("minutes recovered per escalation"): lives in TWO places:

Queue stat-card (above EscalationQueue list on /escalations): compact, "X.X hrs saved this month" + "click for details" affordance. Refreshes on queue load.
Dedicated /analytics/escalations page (owner-facing): trend chart (4-week rolling), per-tech breakdown, per-problem-domain segmentation. Engineer-or-admin role-gated.

Real-time arrival visual (when WebSocket pushes a new escalation):

New card slides in from above the list, 200ms ease-out CSS transition.
Browser tab title prefixes with " (1) " / " (N) " when tab is backgrounded; clears on focus.
No sound. MUST respect prefers-reduced-motion: reduce (slide-in collapses to instant fade-in).

Unread state: subtle 6px dot in top-right corner of card for escalations the current senior has never opened. Dot fades on first hover or click.

Race-condition (two seniors click Pick Up simultaneously): loser sees a toast "Already claimed by [name] 2s ago" via existing @/lib/toast; the card flashes the winner's name in the meta row for 1s, then dissolves from the loser's view via optimistic update + WebSocket reconciliation.

Unread state (Codex correction 2026-04-27): dot indicator clears on open, claim, or explicit dismiss — NOT on hover. Hover-to-clear is a bad proxy for acknowledgment because incidental mouse movement creates false clears.

Notification routing (Codex finding 2026-04-27): v1 fans out the email + push to all engineer-or-admin role users in the same account_id as the SessionHandoff. No on-call/round-robin logic in v1. If pilots ask for routing, capture as v2 TODO. The first senior to claim wins; everyone else's notification self-resolves on WebSocket reconciliation.

Notification delivery model (Codex correction 2026-04-27): drop the notification_sent: bool flag from v1. Replace with per-channel delivery rows in a new notification_log table (already exists — reuse, don't add a new model) keyed by (handoff_id, channel, recipient_user_id, status) where status ∈ {queued, sent, failed, suppressed}. This makes partial-success and per-channel retry visible. If the existing notification_log schema doesn't match, defer the per-channel persistence to a v2 TODO and v1 logs delivery attempts to the existing telemetry stream instead. Do NOT keep the dead boolean.

"Start here" CTA (Codex correction 2026-04-27): opens the FlowPilot session at the latest known state (the AI's most recent agent_message + the current pending_task_lane). Surface ai_assessment_data.suggested_steps[] as a list of chips below the chat input — clicking a chip prefills the input. Do NOT invent a "jump to most-likely-next-step" capability that doesn't exist in the session model.

/claim role gate (Codex correction 2026-04-27, IN-SCOPE for v1): add require_engineer_or_admin dep on POST /handoffs/{id}/claim. Originally deferred to TODO during eng review; Codex correctly flagged it as wedge-relevant because the race-condition story depends on auth gating. ~30 min change. Removed from TODO.md.

A11y requirements (mandatory before pilot ship):

Keyboard: Tab order through queue cards; Enter on focused card opens it; Pick Up button is a reachable target; Esc closes the handoff-context overlay.
ARIA: role="region" + aria-live="polite" on the queue list (announces arrivals); aria-label="N escalations awaiting pickup" on the heading; the slide-in animation must not announce twice (debounce live-region updates).
Pick Up button: bump from py-2 to py-2.5 to clear the 44px touch-target floor.
Color contrast: confidence-badge text on accent-dim background must be ≥4.5:1 (verify against DESIGN-SYSTEM.md tokens).

DS token discipline: every new piece must use card-flat, accent-dim/accent-text, text-muted-foreground, bg-card/bg-elevated, IBM Plex / Bricolage / JetBrains, explicit transition property lists (never transition: all). No glass, no blur, no gradient surfaces. Electric-blue accent reserved for interactive elements only.

Mobile responsive: deferred to post-pilot TODO. Pre-PMF wedge target is desktop; MSP techs work on laptops/desktops in shop environments.

Deferred to TODO.md (out of scope for v1 wedge):

Peer-tech escalates colleague's session (currently session-owner-only)
Role gate on POST /claim (currently any authenticated user in account)

Approach B: Full Structured Resolution loop (split B1 + B2)

End-to-end demo: tech opens FlowPilot, structure appears in side panel as AI responds, ticket notes auto-populate at end, optional runbook capture for reusable patterns. Tells the full "senior engineer over your shoulder" story.

B1 — Side panel + PSA-formatted ticket notes (ships first):

Structured side panel that surfaces parsed AI markers as live actionable steps while the conversation runs.
PSA-formatted ticket-notes exporter (ConnectWise first; Autotask/Halo later).
Effort: M (~3 weeks).

B2 — Runbook offer-and-save (gated on pilot demand):

"Save this resolution as a flow?" prompt at session end, with auto-drafted runbook from the structured session state.
Effort: S (~1 week). Don't build until at least 2 pilot customers explicitly ask for it.
Risk: Medium. The structured-output panel quality is the whole demo. If it looks dumb, the demo dies.
Reuses: FlowPilot, unified_chat_service, FlowProposal, ConnectWise PSA integration.

Approach C: Senior-Tech Time-Saved Counter

Continuous measurement layer underneath A and B. Every session contributes an estimated minutes-saved number. Owner-facing dashboard quotes "this month your shop saved N hours of senior-tech time." Sells to MSP owner with verifiable ROI.

Effort: S (~1 week + ongoing measurement methodology refinement).
Risk: Medium-low. Methodology has to be defensible. If numbers look made-up, trust dies fast.
Reuses: FlowPilot telemetry, session metadata, account-scoped analytics.

Recommended Approach

A first (1-2 weeks), then B (3-4 weeks after A ships), with C running underneath both as a continuous backdrop.

Sequence rationale:

A is the sharpest possible 2-minute demo. Single feature, single metric, buyer-verifiable in their own data. Get it in front of 5 MSPs in week 3.
B is the depth play. Once Approach A has produced first-pilot signal, Approach B's full structured-resolution loop becomes the "what we ship next" that retains pilots and converts them to paid.
C compounds across both. Every session under A or B contributes to the time-saved counter. By week 6 there are real numbers to put in front of an MSP owner — turning a senior-tech-led pilot into an owner-signed contract.

This sequence is non-negotiable. Building B before A is the classic pre-PMF trap of perfecting product before validating GTM. Building C alone is measurement without a demo to anchor it.

Pricing

Pilot pricing (first 3-5 customers): $39/seat/month, monthly billing, month-to-month. Anchored against IT Glue (~~$29/tech), Hudu (~~$25/tech), Liongard (~$3/endpoint). The premium over IT Glue/Hudu reflects the active-session value (vs. their static-runbook value) — 30% above the runbook-only category.

Customer #6+ pricing is an Open Question (revisit after 3 pilots produce real hours-saved data; price up if the per-seat ROI is over $200/seat/mo).

Open Questions

Free-tier shape. Should the time-saved counter be free forever as a distribution lever, with paid for the structuring + escalation? Land-and-expand pattern. Decide after 3 pilot conversions.
PSA-marketplace timing. ConnectWise Marketplace listing requires partnership onboarding (~6-week cycle). Submit application week 5; expect listing live by week 11. Don't gate launch on it.
Customer #6+ pricing. Revisit after 3 pilot customers produce verifiable hours-saved numbers.

Deferred (YAGNI until 10 paying customers)

HIPAA / SOC2 audit positioning. Pre-PMF is too early; revisit when a regulated- vertical MSP asks for it explicitly.
Multi-PSA depth (Autotask, Halo). ConnectWise alone covers ~40% of the SMB MSP market and is sufficient for first 5-10 customers.
Cross-tenant pattern detection. The data-flywheel-across-shops play is at least 6 months out; building it before single-shop ROI is proven is premature.

Success Criteria (revised for realism)

Week 3: Approach A shipped. 3 MSPs in active free pilot (cap at 3 to preserve B1 build capacity).
Weeks 3-6: Pilot management dominates. B1 build is paused; founder runs pilot calls, captures bug reports, iterates UX. Stripe seat-based billing is set up in week 5.
Week 6: First verbal commit from a pilot customer. Verified minutes-recovered-per-escalation number from at least 2 pilots.
Week 8: First paid customer (procurement cycles run 4-6 weeks even at small MSPs; 2 weeks from verbal commit to signed contract is realistic). Time-saved counter (Approach C) producing dashboard-quality data.
Week 11: B1 (side panel + PSA notes) shipped. 3-5 paying customers. First MSP-owner-led conversation. ConnectWise Marketplace listing live.
Quarter end: $5K MRR or 10 paying customers, whichever comes first. Loom demos posted publicly to r/msp and MSPGeek.

Distribution Plan (week-by-week cadence)

Week 3: Escalation Mode demo Loom posted. r/msp launch post.
Week 4: MSPGeek Discord AMA scheduled. RocketMSP newsletter pitch sent.
Week 5: ConnectWise Marketplace listing application submitted. Stripe billing live for paid conversion.
Week 6: First "guest on Inside MSP podcast" outreach. Second r/msp post (case study from a pilot, anonymized).
Week 7-8: Pilot conversion calls. First paying customer.
Week 9-11: B1 ships. Owner-targeted demo Loom. Second podcast outreach.

Founder-led pilot: The first 3-5 customers come from the founder's existing MSP network. Treat them as design partners; expect to ship feature requests weekly during pilot. Cap at 3 active pilots until B1 ships.

Tech audience channels: r/msp, r/sysadmin, MSPGeek Discord, RocketMSP newsletter, Inside MSP podcast. Owner audience channels: ConnectWise Marketplace, MSP-focused Substacks, RIA Vendor Roundup.

CI/CD: existing Railway auto-deploy via GitHub mirror. No new pipeline needed.

Dependencies

Session-state serialization (Approach A blocker). Schema design + migration is the longest-lead engineering task. 3-5 days budget. Do this first.
Stripe seat-based billing (week 5 task). No billing infrastructure exists today. ~3-5 days of work for monthly subscriptions + invoice flow. Block on this before week-8 first-paid milestone.
ConnectWise PSA integration depth. Sufficient for ticket-notes auto-export (Approach B1). Autotask and Halo wait until first 5 paying ConnectWise customers.
Authentication. Existing JWT + role hierarchy is sufficient for senior-tech inbox view; no new auth work needed.

Risks and Kill-Switch

Risk: Session-state serialization design churn. If the schema needs to change after pilot feedback, every saved session has to migrate. Mitigation: keep schema versioned and forward-compatible from day 1.
Risk: Pilot-to-paid conversion slower than 4-6 weeks. MSP procurement is notoriously slow. Mitigation: get verbal commits in writing; price as month-to-month with no annual contract to lower the buying friction.
Risk: ConnectWise ships an equivalent feature in their 2026.x release. Mitigation: lead the marketing on "we're independent of your PSA" — works with any PSA, not just ConnectWise. The founder's PSA-agnostic FlowPilot is an asset here.
Kill-switch criterion: if 0 of 3 pilots produce a verifiable hours-saved-per-week number above 1.0 by week 8, revisit the wedge. The product may need to pivot to deterministic-ops territory (Read 1 from the session) or be repositioned. Don't sink another quarter into the current GTM story without this number.

The Assignment

This week, before any code:

Time-track the next 5 escalations in your shop manually. For each, capture:

Time the senior tech opens the ticket
Time the senior tech takes their first diagnostic action (not counting the verbal "tell me what you tried" warm-up)
The delta — that's the wasted time per escalation today

Average those 5 numbers. That's the hero stat in your first sales conversation: "Senior techs at our shop wasted N minutes per escalation just getting up to speed. We built the thing that takes that to zero."

Don't try to pull this from telemetry — the doc itself notes that retrieval/re-use data isn't queryable yet. Manual stopwatch on the next 5 escalations is the fastest path to a defensible number.

This is the assignment because it forces the GTM story into the same time-zone as the build, and it's a one-day effort that compounds for every conversation afterward.

What I noticed about how you think

You contradicted my framing twice in the same session and the second contradiction was sharper than the first. Most founders agree with the diagnostic and walk out with a polished version of what they came in with. You said "I'm just questioning if flows are even the way to go" — and that sentence reset the entire wedge. That's craft.
"The senior engineer looking over your shoulder" came out of you spontaneously, not as a prepared pitch. That's the line. Use it. It survives because it's emotional truth (every junior tech has had this, every senior tech has been this), not constructed marketing copy.
You're solving your own problem with your own time. 20 hrs/week isn't a hypothetical user pain — it's your Tuesday. Founders who solve their own pain ship sharper products because the feedback loop is instant.
The escalation feature emerged from your description, not mine. I was busy cataloging documentation pains. You said "junior to senior escalation? no worries there either" almost as an afterthought. That afterthought is the wedge. Pay attention to which features you describe casually versus which you push hard on — the casual ones are sometimes where the truth lives.

GSTACK REVIEW REPORT

Review	Trigger	Why	Runs	Status	Findings
CEO Review	`/plan-ceo-review`	Scope & strategy	0	—	not run
Codex Review	`/codex review`	Independent 2nd opinion	1	INFO	12 findings, 6 applied, 1 partial, 5 rejected
Eng Review	`/plan-eng-review`	Architecture & tests (required)	1	CLEAR (PLAN)	2 issues, 0 critical gaps, scope reduced
Design Review	`/plan-design-review`	UI/UX gaps	1	CLEAR (FULL)	score 6/10 → 9/10, 8 decisions
DX Review	`/plan-devex-review`	Developer experience gaps	0	—	not run

CODEX: 12 findings reviewed. Applied: 2-metric framing (#2), notification routing spec (#3), per-channel delivery model (#4), unread-state fix (#11), Start-here CTA reframe (#9), claim role gate moved in-scope (#8). Rejected: full scope reduction to PSA-brief-only (#6/7/12 — user kept queue UI as demo hero). Partial: scope concern (#5) acknowledged in eng review's email-first/polling-fallback. Misread: #1, #10.
CROSS-MODEL: Claude (eng + design reviews) and Codex agree on 6/12 findings. The major disagreement was scope — Codex argued for cutting the queue UI, user rejected. Both agree on metric definition, notification routing, claim auth gating.
UNRESOLVED: 0
VERDICT: ENG + DESIGN CLEARED, CODEX REVIEWED — ready to implement.

26 KiB Raw Blame History Unescape Escape