resolutionflow/docs/cockpit/2026-04-01-msp-assistant-harness-plan-claude.md
chihlasm 81f8aa0074 docs: add MSP assistant harness super plan (claude synthesis)
Merges MSP_Assistant_Harness_Implementation_Plan.docx with the
brainstorming design spec into a single executable plan. Resolves
all open questions from the original docx, expands scope to include
backend changes, and adds a 35-step phased execution order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:11:26 +00:00

MSP Assistant Harness — Super Plan

Date: 2026-04-01
Status: Approved, ready to execute
Sources: MSP_Assistant_Harness_Implementation_Plan.docx (v2.0) + 2026-04-01-msp-assistant-harness-design.md (brainstorming session)


Goal

Reframe /assistant from a generic AI chat surface into a live MSP triage cockpit. An engineer arrives with an open ticket; the page immediately reads as their operational tool — not an AI chatbot that's been adapted for IT work.

The change is a UI and data layer reframe. The existing session, branching, PSA, and conclude architecture is preserved and extended, not rebuilt.


What Phase 0 Resolved

The brainstorming session (2026-04-01) locked these decisions. They are not open questions.

| Question | Decision |
|---|---|
| Layout structure | Stacked zones: incident header → work zone → (drag handle) → conversation log → compose |
| Incident header style | Single row, explicit micro-labels above each field, per-field edit |
| Work zone left panel | Ordered step checklist (✓ / → / ○) |
| Work zone right panel | Two stacked mini-panels: FlowPilot Asks (top) + What We Know (bottom) |
| Chat zone treatment | Drag-resizable split, compact you: / fp: prefix style, darker background |
| Chat collapsibility | Not collapsible; the drag handle gives control |
| Scope | Includes all required backend changes, not UI-layer only |
| Conclude modal | Fully redesigned as structured handoff artifact |
| Page label | "FlowPilot" (not "AI Assistant") |
| "New Chat" label | "New Case" |
| "Conclude" label | "Close Case" |
| Hypothesis language | "Hypothesis" (direct, not softened to "working theory") |
| What We Know editability | Engineer-editable + AI-appended |
| Header field population | Intake form + AI-inferred mid-session + manual engineer override |

Cockpit Layout

┌─────────────────────────────────────────────────────────────┐
│  [Left sidebar — Case History, unchanged]                   │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  INCIDENT HEADER (single row, labelled fields)        │  │
│  │  CLIENT      DEVICE         CATEGORY    HYPOTHESIS    │  │
│  │  Contoso ✏   jsmith-04 ✏   DNS/Net ✏   Cache fail ✏ │  │
│  │                                   [CW #48291][Resolve⋯]│  │
│  ├───────────────────────┬───────────────────────────────┤  │
│  │                       │  ▸ FLOWPILOT ASKS (amber)     │  │
│  │  STEPS (~55%)         │  Did nslookup time out?       │  │
│  │  ✓ Ping 8.8.8.8       │  [Time out] [Wrong IP] [Both] │  │
│  │  → nslookup ←active   ├───────────────────────────────┤  │
│  │  ○ Flush DNS          │  WHAT WE KNOW                 │  │
│  │  ○ Check NIC          │  ✓ Gateway reachable          │  │
│  │                       │  ✗ DNS 1.1.1.1 — timeout      │  │
│  │  [⚡ Generate Script]  │  ? DNS 8.8.8.8 — pending      │  │
│  ├───────────────────────┴───── ≡ drag handle ───────────┤  │
│  │  CONVERSATION LOG (compact, darker bg)                │  │
│  │  you:  Can't resolve external DNS, internal fine       │  │
│  │  fp:   Ping test passed. Run nslookup google.com.      │  │
│  │  you:  Timed out on 1.1.1.1 too.                       │  │
│  ├───────────────────────────────────────────────────────┤  │
│  │  Describe next finding or ask FlowPilot...    [Send]  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Non-Goals

  • No redesign of /pilot (FlowPilot session page) — separate page, untouched
  • No rebuild of session, branching, or PSA architecture
  • No new data model for conversations — conversation_messages JSONB unchanged
  • No mobile-first redesign — mobile degrades cleanly, desktop is primary
  • No generic "assistant polish" that does not tighten the harness

Backend Changes

B1 — Alembic migration 071

File: backend/alembic/versions/071_add_triage_fields_to_ai_sessions.py

Add to ai_sessions:

| Column | Type | Notes |
|---|---|---|
| client_name | VARCHAR(255) | MSP client for incident header |
| asset_name | VARCHAR(255) | Device / user being worked on |
| issue_category | VARCHAR(100) | Human-readable category ("DNS / Networking") |
| triage_hypothesis | TEXT | Working hypothesis, AI-updated + editable |
| evidence_items | JSONB | What We Know list, persisted for resume |

evidence_items schema: [{ "text": str, "status": "confirmed" | "ruled_out" | "pending" }]

Note: existing problem_domain is an internal classifier slug and is unchanged. issue_category is the human-readable display label. Both coexist.
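
The evidence_items shape above can be checked before it is written to the JSONB column. The following is a minimal sketch using only the standard library; the function name `validate_evidence_items` and the choice to trim whitespace are assumptions for illustration, not part of the plan.

```python
from typing import Any

# Status values taken from the evidence_items schema in B1.
ALLOWED_STATUSES = ("confirmed", "ruled_out", "pending")


def validate_evidence_items(items: Any) -> list[dict[str, str]]:
    """Return a cleaned copy of items, or raise ValueError on a bad entry."""
    if not isinstance(items, list):
        raise ValueError("evidence_items must be a list")
    cleaned: list[dict[str, str]] = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not an object")
        text = item.get("text")
        status = item.get("status")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"item {index} has no text")
        if status not in ALLOWED_STATUSES:
            raise ValueError(f"item {index} has unknown status {status!r}")
        # Assumption: surrounding whitespace is not meaningful, so trim it.
        cleaned.append({"text": text.strip(), "status": status})
    return cleaned
```

Rejecting unknown statuses at write time keeps the What We Know panel from having to handle values it cannot render.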

B2 — Updated schemas (backend/app/schemas/ai_session.py)

New TriageUpdate:

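from pydantic import BaseModel

class TriageUpdate(BaseModel):
    # Every field is optional; None means "no change" for that field.
    client_name: str | None = None
    asset_name: str | None = None
    issue_category: str | None = None
    triage_hypothesis: str | None = None
    evidence_items: list[dict] | None = None  # appended to the existing list, not replaced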

Updated ChatMessageResponse:

class ChatMessageResponse(BaseModel):
    # ... existing fields unchanged ...
    triage_update: TriageUpdate | None = None

Updated QuestionItem — add quick-reply options:

class QuestionItem(BaseModel):
    text: str
    context: str = ""
    options: list[str] | None = None  # quick-reply labels; null → free-text input

Updated ResolveSessionRequest / EscalateSessionRequest:

root_cause: str | None = None
steps_taken: list[str] | None = None
recommendations: str | None = None

B3 — New PATCH /ai-sessions/{id}/triage endpoint

PATCH /ai-sessions/{session_id}/triage
Auth: require_engineer_or_admin
Body: { client_name?, asset_name?, issue_category?, triage_hypothesis?, evidence_items? }
Response: { id, client_name, asset_name, issue_category, triage_hypothesis, evidence_items }

Called on every manual header field edit. Partial update — only supplied fields are written.
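
The partial-update rule can be sketched as a pure function over dictionaries. This is an illustration only: `apply_triage_patch` is a hypothetical helper, and it assumes evidence_items follows the append semantics stated in the B2 comment. Editing an existing evidence item would need a replace path that the plan does not specify.

```python
TRIAGE_FIELDS = ("client_name", "asset_name", "issue_category", "triage_hypothesis")


def apply_triage_patch(session: dict, patch: dict) -> dict:
    """Return a new session dict with only the supplied triage fields written."""
    updated = dict(session)
    for field in TRIAGE_FIELDS:
        if field in patch:  # key absent from the body: column left untouched
            updated[field] = patch[field]
    if patch.get("evidence_items"):
        # Assumption: new items are appended, per the B2 schema comment.
        existing = list(session.get("evidence_items") or [])
        updated["evidence_items"] = existing + list(patch["evidence_items"])
    return updated
```

In a Pydantic v2 endpoint, one way to obtain a dict containing only the supplied keys is `body.model_dump(exclude_unset=True)`; whether the real endpoint does this is not stated in the plan.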

B4 — New POST /ai-sessions/{id}/handoff-draft endpoint

POST /ai-sessions/{session_id}/handoff-draft
Auth: require_engineer_or_admin
Response: StreamingResponse (text/event-stream)

Streams structured handoff JSON built from session context:

{ "root_cause": "...", "resolution": "...", "steps_taken": ["..."], "recommendations": "..." }

Uses: problem_summary, triage_hypothesis, evidence_items, last 20 conversation_messages, saved task lane state.

Called immediately on conclude modal open — engineer can edit while stream fills in.
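
Because the response is text/event-stream, each piece of the draft has to be framed as a server-sent event. The sketch below shows that framing only. The chunk shape (`field` plus `delta`) and the closing `done` event are assumptions, since the plan does not define HandoffDraftChunk.

```python
import json
from collections.abc import Iterable, Iterator


def sse_frames(chunks: Iterable[dict]) -> Iterator[str]:
    """Encode each chunk as one server-sent event frame."""
    for chunk in chunks:
        # json.dumps emits a single line, so one "data:" line per event is enough.
        yield f"data: {json.dumps(chunk)}\n\n"
    # Hypothetical end-of-stream event so the client knows to stop reading.
    yield "event: done\ndata: {}\n\n"
```

A generator like this could be handed to a streaming response object; the modal would then apply each delta to the matching field while the engineer keeps editing.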

B5 — unified_chat_service.py — triage extraction

After each AI response, extract triage signals and return as triage_update.

Recommended approach: Add a [TRIAGE_UPDATE] structured marker to the system prompt, following the existing [QUESTIONS] / [ACTIONS] / [FORK] marker pattern. The AI emits the block only when it has new signal:

[TRIAGE_UPDATE]
client_name: Contoso Ltd
issue_category: DNS / Networking
triage_hypothesis: Corrupted DNS cache on NIC
evidence_items:
  - confirmed: Gateway 192.168.1.1 reachable
  - ruled_out: DNS 1.1.1.1 — timeout
[/TRIAGE_UPDATE]

Service parses this, strips it from display_content, auto-PATCHes the session record, and returns triage_update in the response.

B6 — resolution_output_generator.py — accept structured fields

Update _build_session_context() to incorporate root_cause, steps_taken, and recommendations when supplied, producing richer psa_ticket_notes and client_summary outputs.

B7 — Session detail response — expose new triage fields

GET /ai-sessions/{id} (and the session list item) must return the 5 new fields so the frontend can restore header state on session load and resume.


Frontend Changes

F1 — AssistantChatPage.tsx — cockpit layout refactor

Replace current layout (sidebar + chat column + TaskLane right rail) with the stacked cockpit structure.

New state:

  • triageMeta: TriageMeta with fields { client_name, asset_name, issue_category, triage_hypothesis, evidence_items }
  • workZoneHeight: number — persisted to localStorage('rf-assistant-work-zone-height')

On session load / resume: populate triageMeta from session response new fields.

On AI response: if response.triage_update is non-null, merge into triageMeta (partial — preserve existing non-null values unless AI explicitly overwrites).

Work zone layout: left StepsPanel + right column with FlowPilotAsks stacked above WhatWeKnow.

Chat zone layout: compact ConversationLog below drag handle, independent scroll.

F2 — New IncidentHeader.tsx

frontend/src/components/assistant/IncidentHeader.tsx

Props: triageMeta: TriageMeta, psaTicketId: string | null, sessionId: string, onFieldSave(field, value), onResolve(), onOverflow()

  • Single-row bar with micro-labels (CLIENT / DEVICE / CATEGORY / HYPOTHESIS)
  • Each field: ✏ icon visible on hover → opens inline EditPopover (text input + Save/Cancel)
  • On Save: calls aiSessionsApi.updateTriage(sessionId, { [field]: value })
  • Empty fields: muted placeholder ("Unknown client", "No device specified", etc.)
  • Right side: PSA ticket badge (if linked) + Resolve button + overflow menu

F3 — Refactored StepsPanel.tsx (from TaskLane)

frontend/src/components/assistant/StepsPanel.tsx

Preserves all TaskLane data logic and persistence. Changes rendering only:

| State | Icon | Style |
|---|---|---|
| Completed | ✓ | Strikethrough, muted, green icon |
| Active | → | Blue left border, white text, full opacity |
| Pending | ○ | Muted text |

Script generation CTA: shown at bottom when active step command references "script" or AI has flagged it.

TaskLane.tsx can remain for now (no renames required in this phase) — StepsPanel is a new component that consumes the same activeActions prop.

F4 — New FlowPilotAsks.tsx

frontend/src/components/assistant/FlowPilotAsks.tsx

Props: questions: QuestionItem[], onAnswer(answer: string)

  • Renders first unanswered question
  • question.options non-null → button row; clicking calls onAnswer(option)
  • question.options null → compact text input + Send
  • onAnswer calls parent's handleSend with the answer string
  • Hidden entirely when questions is empty

F5 — New WhatWeKnow.tsx

frontend/src/components/assistant/WhatWeKnow.tsx

Props: items: EvidenceItem[], onAdd(text, status), onEdit(index, text, status)

  • Evidence list: ✓ confirmed (green) / ✗ ruled out (red) / ? pending (muted)
  • "+ Add finding" inline entry at bottom
  • Click any item to edit inline
  • State lives in AssistantChatPage (triageMeta.evidence_items), synced to backend via PATCH /triage

F6 — Drag-resizable split

Thin handle bar between work zone and conversation log. On drag: update workZoneHeight in state, persist to localStorage. On mount: restore, default 55%.

F7 — Compact ConversationLog rendering

Replace current full <ChatMessage> bubbles in the log zone with a compact list: you: ... / fp: ... prefix style, tighter line height, no avatars. ChatMessage can still be used for rich content (forks, suggested flows) in a compact variant.

F8 — Redesigned ConcludeSessionModal.tsx

On open:

  1. Call aiSessionsApi.getHandoffDraft(sessionId) (streaming) — fields fill in as stream arrives
  2. Render: outcome selector (Resolved / Escalated / Parked)
  3. Render 4 structured editable fields: Root Cause, Resolution, Steps Taken, Recommendations
  4. Render output destination checkboxes: Post to CW note / Save to KB / Send client summary
  5. Confirm → call resolve/escalate/pause with enriched request body including structured fields

F9 — MSP-native language pass

| Old | New |
|---|---|
| "AI Assistant" (page title, meta) | "FlowPilot" |
| "New Chat" | "New Case" |
| "Messages" | "Conversation Log" |
| "Task Lane" (panel label) | "Steps" |
| "Conclude" | "Close Case" |
| "Chat history" (sidebar label) | "Case History" |
| Compose placeholder | "Describe finding, paste log output, or ask FlowPilot..." |

F10 — New API methods (aiSessions.ts)

updateTriage(sessionId: string, fields: Partial<TriageMeta>): Promise<TriageMeta>
getHandoffDraft(sessionId: string): AsyncGenerator<HandoffDraftChunk>

F11 — New types (types/ai-session.ts)

interface TriageMeta {
  client_name: string | null
  asset_name: string | null
  issue_category: string | null
  triage_hypothesis: string | null
  evidence_items: EvidenceItem[]
}

interface EvidenceItem {
  text: string
  status: 'confirmed' | 'ruled_out' | 'pending'
}

interface TriageUpdate extends Partial<TriageMeta> {}

// Extend existing:
interface QuestionItem {
  text: string
  context: string
  options?: string[]  // new
}

Phased Execution Order

Phase 1 — Backend Foundation

  1. Write migration 071 — add 5 columns to ai_sessions
  2. Run alembic upgrade head, verify columns
  3. Update AISession model with new mapped columns
  4. Add TriageUpdate schema, extend QuestionItem, extend ChatMessageResponse
  5. Extend ResolveSessionRequest / EscalateSessionRequest with structured fields
  6. Add PATCH /{id}/triage endpoint
  7. Add POST /{id}/handoff-draft streaming endpoint
  8. Update GET /ai-sessions/{id} response to include new triage fields
  9. Update resolution_output_generator._build_session_context() to use structured fields
  10. Run backend tests — pytest --override-ini="addopts="

Phase 2 — Triage Extraction (AI layer)

  1. Add [TRIAGE_UPDATE] marker to unified_chat_service.py system prompt
  2. Implement _parse_triage_update_marker() in the service
  3. Auto-PATCH session on non-null triage_update
  4. Add options generation instructions to [QUESTIONS] system prompt section
  5. Verify extraction in a live session

Phase 3 — New Frontend Types + API

  1. Add TriageMeta, EvidenceItem, TriageUpdate to types/ai-session.ts
  2. Extend QuestionItem type
  3. Add updateTriage() and getHandoffDraft() to aiSessions.ts

Phase 4 — New Work Zone Components

  1. Build IncidentHeader.tsx with EditPopover
  2. Build StepsPanel.tsx
  3. Build FlowPilotAsks.tsx
  4. Build WhatWeKnow.tsx

Phase 5 — Page Layout Refactor

  1. Refactor AssistantChatPage.tsx — implement stacked cockpit layout
  2. Wire triageMeta state, session load population, triage_update merge
  3. Implement drag-resizable split with localStorage persistence
  4. Compact ConversationLog rendering

Phase 6 — Handoff Modal + Language Pass

  1. Redesign ConcludeSessionModal.tsx — structured handoff form
  2. MSP-native language pass across all assistant components
  3. Update <PageMeta> title

Phase 7 — QA + Hardening

  1. npx tsc -b — fix any TypeScript errors
  2. npm run build — production build clean
  3. Functional regression: all chat flows, session switching, conclude/resume
  4. Harness feel test: does the page read as a triage cockpit within 3 seconds of first load?
  5. Mobile viewport check
  6. Stress test: 50+ messages, 10+ steps, long outputs

Risks and Mitigations

| Risk | Mitigation |
|---|---|
| [TRIAGE_UPDATE] marker extraction is unreliable: the AI doesn't emit it consistently | Gate Phase 2 on a pass/fail test with 5 real sessions before wiring it to the header. Fall back to Option B (post-response Haiku pass) if needed. |
| Header fields feel fabricated: the AI guesses the wrong client or hypothesis | Show confidence-aware placeholder copy ("FlowPilot is building context…") until a field has real data. Never invent. |
| Task lane visual promotion breaks established chat patterns | Keep all send/respond behavior intact. Change hierarchy only. Verify every task-lane state transition manually. |
| Handoff modal exposes weak underlying summaries | Reuse existing ResolutionOutputGenerator output where possible. Add guardrail copy for empty fields. |
| Mobile loses compose or step access | Test responsive layout as a first-class deliverable in Phase 7, not a final sweep. Enforce scroll isolation between all zones. |
| tsc -b errors after component refactor | Run npx tsc -b after every phase. Trace unused imports/props immediately; don't batch (lesson #92). |

Test Plan

Harness Feel (primary, subjective)

  • Does the page read as an MSP triage cockpit within 3 seconds on first load?
  • Is the active step obvious without reading chat?
  • Do FlowPilot Asks quick-reply buttons work and update the step list?
  • Does the incident header update mid-session as AI learns context?
  • Drag the handle, then refresh: does the split height restore?
  • Does the conclude modal look like a case handoff or a chat closure?

Functional Regression

  • New session (no PSA) — header degrades gracefully
  • New session (with CW ticket) — header populates from ticket data
  • Send message → triage_update updates header
  • Click quick-reply button → answer submitted, step advances
  • Add finding to What We Know → persisted via PATCH
  • Edit header field via ✏ → saved and survives refresh
  • Conclude as Resolved → handoff draft fills modal → post to CW note
  • Conclude as Escalated → same
  • Pause and resume → triage header restores from saved session fields
  • Session switching (currentChatRef guard) — no stale state
  • Image paste, forks, suggested flows — all still work

MSP Scenarios (from docx)

  1. Single-user endpoint issue (basic triage flow, script generation)
  2. M365 / tenant-wide issue (multi-user context, issue category)
  3. Network / VPN outage (asset targeting, hypothesis tracking)
  4. Escalation and resume (session persistence, structured handoff)

Edge Cases

  • 50+ messages — layout hierarchy stays intact
  • 10+ steps — step panel scrolls, compose remains accessible
  • Long issue titles / hypothesis text — header truncates gracefully
  • Missing PSA context — placeholder copy, not blank fields
  • Narrow mobile viewport — all zones reachable

Backend Checks

# Migration
alembic upgrade head
psql -U postgres -d resolutionflow -c "\d ai_sessions" | grep -E "client_name|asset_name|issue_category|triage_hypothesis|evidence_items"

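# Triage PATCH (SESSION_ID holds the id of the session under test)
curl -X PATCH "http://localhost:8000/ai-sessions/$SESSION_ID/triage" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"client_name":"Test Client","triage_hypothesis":"Cache corruption"}'

# Handoff draft stream (-N disables output buffering so events print as they arrive)
curl -N -X POST "http://localhost:8000/ai-sessions/$SESSION_ID/handoff-draft" \
  -H "Authorization: Bearer $TOKEN"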

Critical Files

| File | Change |
|---|---|
| backend/alembic/versions/071_add_triage_fields_to_ai_sessions.py | New migration |
| backend/app/models/ai_session.py | Add 5 new mapped columns |
| backend/app/schemas/ai_session.py | TriageUpdate, QuestionItem.options, extended request/response schemas |
| backend/app/api/endpoints/ai_sessions.py | PATCH /triage, POST /handoff-draft |
| backend/app/services/unified_chat_service.py | [TRIAGE_UPDATE] marker extraction, auto-PATCH |
| backend/app/services/resolution_output_generator.py | Structured fields in context builder |
| frontend/src/types/ai-session.ts | TriageMeta, EvidenceItem, TriageUpdate; extend QuestionItem |
| frontend/src/api/aiSessions.ts | updateTriage(), getHandoffDraft() |
| frontend/src/pages/AssistantChatPage.tsx | Full cockpit layout refactor |
| frontend/src/components/assistant/IncidentHeader.tsx | New |
| frontend/src/components/assistant/StepsPanel.tsx | New (from TaskLane logic) |
| frontend/src/components/assistant/FlowPilotAsks.tsx | New |
| frontend/src/components/assistant/WhatWeKnow.tsx | New |
| frontend/src/components/assistant/ConcludeSessionModal.tsx | Redesigned |