diff --git a/docs/cockpit/2026-04-01-msp-assistant-harness-plan-claude.md b/docs/cockpit/2026-04-01-msp-assistant-harness-plan-claude.md new file mode 100644 index 00000000..821fe5c7 --- /dev/null +++ b/docs/cockpit/2026-04-01-msp-assistant-harness-plan-claude.md @@ -0,0 +1,470 @@ +# MSP Assistant Harness — Super Plan +**Date:** 2026-04-01 +**Status:** Approved — ready to execute +**Sources:** `MSP_Assistant_Harness_Implementation_Plan.docx` (v2.0) + `2026-04-01-msp-assistant-harness-design.md` (brainstorming session) + +--- + +## Goal + +Reframe `/assistant` from a generic AI chat surface into a **live MSP triage cockpit**. An engineer arrives with an open ticket; the page immediately reads as their operational tool — not an AI chatbot that's been adapted for IT work. + +The change is a UI and data layer reframe. The existing session, branching, PSA, and conclude architecture is preserved and extended, not rebuilt. + +--- + +## What Phase 0 Resolved + +The brainstorming session (2026-04-01) locked these decisions. They are not open questions. 
+ +| Question | Decision | +|----------|----------| +| Layout structure | Stacked zones: incident header → work zone → (drag handle) → conversation log → compose | +| Incident header style | Single row, explicit micro-labels above each field, per-field `✏` edit | +| Work zone left panel | Ordered step checklist (✓ / → / ○) | +| Work zone right panel | Two stacked mini-panels: FlowPilot Asks (top) + What We Know (bottom) | +| Chat zone treatment | Drag-resizable split, compact `you:` / `fp:` prefix style, darker background | +| Chat collapsibility | Not collapsible — drag handle gives control | +| Scope | Includes all required backend changes, not UI-layer only | +| Conclude modal | Fully redesigned as structured handoff artifact | +| Page label | "FlowPilot" (not "AI Assistant") | +| "New Chat" label | "New Case" | +| "Conclude" label | "Close Case" | +| Hypothesis language | "Hypothesis" (direct, not softened to "working theory") | +| What We Know editability | Engineer-editable + AI-appended | +| Header field population | Intake form + AI-inferred mid-session + manual engineer override | + +--- + +## Cockpit Layout + +``` +┌─────────────────────────────────────────────────────────────┐ +│ [Left sidebar — Case History, unchanged] │ +│ ┌───────────────────────────────────────────────────────┐ │ +│ │ INCIDENT HEADER (single row, labelled fields) │ │ +│ │ CLIENT DEVICE CATEGORY HYPOTHESIS │ │ +│ │ Contoso ✏ jsmith-04 ✏ DNS/Net ✏ Cache fail ✏ │ │ +│ │ [CW #48291][Resolve⋯]│ │ +│ ├───────────────────────┬───────────────────────────────┤ │ +│ │ │ ▸ FLOWPILOT ASKS (amber) │ │ +│ │ STEPS (~55%) │ Did nslookup time out? │ │ +│ │ ✓ Ping 8.8.8.8 │ [Time out] [Wrong IP] [Both] │ │ +│ │ → nslookup ←active ├───────────────────────────────┤ │ +│ │ ○ Flush DNS │ WHAT WE KNOW │ │ +│ │ ○ Check NIC │ ✓ Gateway reachable │ │ +│ │ │ ✗ DNS 1.1.1.1 — timeout │ │ +│ │ [⚡ Generate Script] │ ? 
DNS 8.8.8.8 — pending │ │ +│ ├───────────────────────┴───── ≡ drag handle ───────────┤ │ +│ │ CONVERSATION LOG (compact, darker bg) │ │ +│ │ you: Can't resolve external DNS, internal fine │ │ +│ │ fp: Ping test passed. Run nslookup google.com. │ │ +│ │ you: Timed out on 1.1.1.1 too. │ │ +│ ├───────────────────────────────────────────────────────┤ │ +│ │ Describe next finding or ask FlowPilot... [Send] │ │ +│ └───────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ +``` + +--- + +## Non-Goals + +- No redesign of `/pilot` (FlowPilot session page) — separate page, untouched +- No rebuild of session, branching, or PSA architecture +- No new data model for conversations — `conversation_messages` JSONB unchanged +- No mobile-first redesign — mobile degrades cleanly, desktop is primary +- No generic "assistant polish" that does not tighten the harness + +--- + +## Backend Changes + +### B1 — Alembic migration `071` + +File: `backend/alembic/versions/071_add_triage_fields_to_ai_sessions.py` + +Add to `ai_sessions`: + +| Column | Type | Notes | +|--------|------|-------| +| `client_name` | `VARCHAR(255)` | MSP client for incident header | +| `asset_name` | `VARCHAR(255)` | Device / user being worked on | +| `issue_category` | `VARCHAR(100)` | Human-readable category ("DNS / Networking") | +| `triage_hypothesis` | `TEXT` | Working hypothesis — AI-updated + editable | +| `evidence_items` | `JSONB` | What We Know list — persisted for resume | + +`evidence_items` schema: `[{ "text": str, "status": "confirmed" | "ruled_out" | "pending" }]` + +Note: existing `problem_domain` is an internal classifier slug and is unchanged. `issue_category` is the human-readable display label. Both coexist. 
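Because `evidence_items` is stored as free-form JSONB, it may be worth checking its shape before it is written. The following is a minimal sketch, assuming a standalone helper; the name `validate_evidence_items` is hypothetical and not part of the plan, and only the two keys and three status values come from the schema above.

```python
from __future__ import annotations

from typing import Any

# Status values taken from the `evidence_items` schema in B1.
ALLOWED_STATUSES = {"confirmed", "ruled_out", "pending"}


def validate_evidence_items(items: Any) -> list[dict[str, str]]:
    """Return a cleaned copy of `items`, or raise ValueError if the shape is wrong.

    Hypothetical helper: the plan does not name a validator.
    """
    if not isinstance(items, list):
        raise ValueError("evidence_items must be a list")

    cleaned: list[dict[str, str]] = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"evidence_items[{index}] must be an object")
        text = item.get("text")
        status = item.get("status")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"evidence_items[{index}].text must be a non-empty string")
        if not isinstance(status, str) or status not in ALLOWED_STATUSES:
            raise ValueError(f"evidence_items[{index}].status is not an allowed value")
        # Keep only the two documented keys so stray fields never reach the JSONB column.
        cleaned.append({"text": text.strip(), "status": status})
    return cleaned
```

A helper like this could run in both the `PATCH /triage` path and the AI extraction path, so a malformed AI-emitted item would be rejected the same way as a malformed manual edit.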
+ +### B2 — Updated schemas (`backend/app/schemas/ai_session.py`) + +**New `TriageUpdate`:** +```python +class TriageUpdate(BaseModel): + client_name: str | None = None + asset_name: str | None = None + issue_category: str | None = None + triage_hypothesis: str | None = None + evidence_items: list[dict] | None = None # appends to existing list +``` + +**Updated `ChatMessageResponse`:** +```python +class ChatMessageResponse(BaseModel): + # ... existing fields unchanged ... + triage_update: TriageUpdate | None = None +``` + +**Updated `QuestionItem`** — add quick-reply options: +```python +class QuestionItem(BaseModel): + text: str + context: str = "" + options: list[str] | None = None # quick-reply labels; null → free-text input +``` + +**Updated `ResolveSessionRequest` / `EscalateSessionRequest`:** +```python +root_cause: str | None = None +steps_taken: list[str] | None = None +recommendations: str | None = None +``` + +### B3 — New `PATCH /ai-sessions/{id}/triage` endpoint + +``` +PATCH /ai-sessions/{session_id}/triage +Auth: require_engineer_or_admin +Body: { client_name?, asset_name?, issue_category?, triage_hypothesis?, evidence_items? } +Response: { id, client_name, asset_name, issue_category, triage_hypothesis, evidence_items } +``` + +Called on every manual header field edit. Partial update — only supplied fields are written. + +### B4 — New `POST /ai-sessions/{id}/handoff-draft` endpoint + +``` +POST /ai-sessions/{session_id}/handoff-draft +Auth: require_engineer_or_admin +Response: StreamingResponse (text/event-stream) +``` + +Streams structured handoff JSON built from session context: +```json +{ "root_cause": "...", "resolution": "...", "steps_taken": ["..."], "recommendations": "..." } +``` + +Uses: `problem_summary`, `triage_hypothesis`, `evidence_items`, last 20 `conversation_messages`, saved task lane state. + +Called immediately on conclude modal open — engineer can edit while stream fills in. 
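The context assembly and event framing for the handoff draft could look roughly like the sketch below. It assumes the session is available as a plain dict; `build_handoff_context`, `sse_event`, and the `task_lane_state` key are hypothetical names used for illustration, and the model call that would sit between the two functions is left out.

```python
from __future__ import annotations

import json
from typing import Any

MAX_MESSAGES = 20  # B4: only the last 20 conversation messages are used


def build_handoff_context(session: dict[str, Any]) -> dict[str, Any]:
    """Collect the session fields that B4 lists as inputs to the handoff draft."""
    messages = session.get("conversation_messages") or []
    return {
        "problem_summary": session.get("problem_summary"),
        "triage_hypothesis": session.get("triage_hypothesis"),
        "evidence_items": session.get("evidence_items") or [],
        "recent_messages": messages[-MAX_MESSAGES:],
        # Assumption: the saved task lane state lives under this key.
        "task_lane": session.get("task_lane_state"),
    }


def sse_event(payload: dict[str, Any]) -> str:
    """Frame one JSON payload as a single Server-Sent Events `data:` message."""
    # json.dumps without `indent` emits no newlines, so one `data:` line is enough.
    return f"data: {json.dumps(payload)}\n\n"
```

Each partial draft would be passed through `sse_event` before being yielded into the streaming response, which lets the modal update fields as events arrive.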
+ +### B5 — `unified_chat_service.py` — triage extraction + +After each AI response, extract triage signals and return as `triage_update`. + +**Recommended approach:** Add a `[TRIAGE_UPDATE]` structured marker to the system prompt, following the existing `[QUESTIONS]` / `[ACTIONS]` / `[FORK]` marker pattern. The AI emits the block only when it has new signal: + +``` +[TRIAGE_UPDATE] +client_name: Contoso Ltd +issue_category: DNS / Networking +triage_hypothesis: Corrupted DNS cache on NIC +evidence_items: + - confirmed: Gateway 192.168.1.1 reachable + - ruled_out: DNS 1.1.1.1 — timeout +[/TRIAGE_UPDATE] +``` + +Service parses this, strips it from `display_content`, auto-PATCHes the session record, and returns `triage_update` in the response. + +### B6 — `resolution_output_generator.py` — accept structured fields + +Update `_build_session_context()` to incorporate `root_cause`, `steps_taken`, and `recommendations` when supplied, producing richer `psa_ticket_notes` and `client_summary` outputs. + +### B7 — Session detail response — expose new triage fields + +`GET /ai-sessions/{id}` (and the session list item) must return the 5 new fields so the frontend can restore header state on session load and resume. + +--- + +## Frontend Changes + +### F1 — `AssistantChatPage.tsx` — cockpit layout refactor + +Replace current layout (sidebar + chat column + TaskLane right rail) with the stacked cockpit structure. + +**New state:** +- `triageMeta: TriageMeta` — `{ client_name, asset_name, issue_category, triage_hypothesis, evidence_items }` +- `workZoneHeight: number` — persisted to `localStorage('rf-assistant-work-zone-height')` + +**On session load / resume:** populate `triageMeta` from session response new fields. + +**On AI response:** if `response.triage_update` is non-null, merge into `triageMeta` (partial — preserve existing non-null values unless AI explicitly overwrites). 
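The B5 marker extraction and the merge rule above fit together as one flow: parse the block, strip it from the reply, then apply the result as a partial update. The sketch below shows that flow in Python for reference; the frontend merge would follow the same rule in TypeScript. The function names are hypothetical (the plan's own parser is `_parse_triage_update_marker()`), and the line format is assumed to match the example block in B5 exactly.

```python
from __future__ import annotations

import re
from typing import Any

MARKER_RE = re.compile(r"\[TRIAGE_UPDATE\](.*?)\[/TRIAGE_UPDATE\]", re.DOTALL)
SCALAR_FIELDS = {"client_name", "asset_name", "issue_category", "triage_hypothesis"}
STATUSES = {"confirmed", "ruled_out", "pending"}


def parse_triage_update(raw: str) -> tuple[str, dict[str, Any] | None]:
    """Split an AI reply into (display_content, triage_update or None)."""
    match = MARKER_RE.search(raw)
    if match is None:
        return raw, None

    update: dict[str, Any] = {}
    in_evidence = False
    for line in match.group(1).splitlines():
        stripped = line.strip()
        key, sep, value = stripped.lstrip("- ").partition(":")
        key, value = key.strip(), value.strip()
        if not sep:
            continue  # blank or malformed line
        if stripped.startswith("-") and in_evidence:
            # Evidence lines look like "- <status>: <text>".
            if key in STATUSES and value:
                update.setdefault("evidence_items", []).append(
                    {"text": value, "status": key}
                )
        elif key == "evidence_items":
            in_evidence = True
        elif key in SCALAR_FIELDS and value:
            in_evidence = False
            update[key] = value

    display = (raw[: match.start()] + raw[match.end():]).strip()
    return display, update or None


def merge_triage(current: dict[str, Any], update: dict[str, Any] | None) -> dict[str, Any]:
    """Partial merge: missing or None keys keep the current value; evidence is appended."""
    merged = dict(current)
    for key, value in (update or {}).items():
        if value is None:
            continue
        if key == "evidence_items":
            merged[key] = list(current.get(key) or []) + list(value)
        else:
            merged[key] = value
    return merged
```

Appending evidence rather than replacing it follows the `# appends to existing list` note on `TriageUpdate.evidence_items`; whether duplicates should be collapsed is left open here.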
+ +**Work zone layout:** left `StepsPanel` + right column with `FlowPilotAsks` stacked above `WhatWeKnow`. + +**Chat zone layout:** compact `ConversationLog` below drag handle, independent scroll. + +### F2 — New `IncidentHeader.tsx` + +``` +frontend/src/components/assistant/IncidentHeader.tsx +``` + +Props: `triageMeta: TriageMeta`, `psaTicketId: string | null`, `sessionId: string`, `onFieldSave(field, value)`, `onResolve()`, `onOverflow()` + +- Single-row bar with micro-labels (CLIENT / DEVICE / CATEGORY / HYPOTHESIS) +- Each field: `✏` icon visible on hover → opens inline `EditPopover` (text input + Save/Cancel) +- On Save: calls `aiSessionsApi.updateTriage(sessionId, { [field]: value })` +- Empty fields: muted placeholder ("Unknown client", "No device specified", etc.) +- Right side: PSA ticket badge (if linked) + Resolve button + `⋯` overflow menu + +### F3 — Refactored `StepsPanel.tsx` (from `TaskLane`) + +``` +frontend/src/components/assistant/StepsPanel.tsx +``` + +Preserves all `TaskLane` data logic and persistence. Changes rendering only: + +| State | Icon | Style | +|-------|------|-------| +| Completed | `✓` | Strikethrough, muted, green icon | +| Active | `→` | Blue left border, white text, full opacity | +| Pending | `○` | Muted text | + +Script generation CTA: shown at bottom when active step `command` references "script" or AI has flagged it. + +`TaskLane.tsx` can remain for now (no renames required in this phase) — `StepsPanel` is a new component that consumes the same `activeActions` prop. 
+ +### F4 — New `FlowPilotAsks.tsx` + +``` +frontend/src/components/assistant/FlowPilotAsks.tsx +``` + +Props: `questions: QuestionItem[]`, `onAnswer(answer: string)` + +- Renders first unanswered question +- `question.options` non-null → button row; clicking calls `onAnswer(option)` +- `question.options` null → compact text input + Send +- `onAnswer` calls parent's `handleSend` with the answer string +- Hidden entirely when `questions` is empty + +### F5 — New `WhatWeKnow.tsx` + +``` +frontend/src/components/assistant/WhatWeKnow.tsx +``` + +Props: `items: EvidenceItem[]`, `onAdd(text, status)`, `onEdit(index, text, status)` + +- Evidence list: `✓` confirmed (green) / `✗` ruled out (red) / `?` pending (muted) +- "+ Add finding" inline entry at bottom +- Click any item to edit inline +- State lives in `AssistantChatPage` (`triageMeta.evidence_items`), synced to backend via `PATCH /triage` + +### F6 — Drag-resizable split + +Thin handle bar between work zone and conversation log. On drag: update `workZoneHeight` in state, persist to `localStorage`. On mount: restore, default `55%`. + +### F7 — Compact `ConversationLog` rendering + +Replace current full `<ChatMessage>` bubbles in the log zone with a compact list: `you: ...` / `fp: ...` prefix style, tighter line height, no avatars. `ChatMessage` can still be used for rich content (forks, suggested flows) in a compact variant. + +### F8 — Redesigned `ConcludeSessionModal.tsx` + +On open: +1. Call `aiSessionsApi.getHandoffDraft(sessionId)` (streaming) — fields fill in as stream arrives +2. Render: outcome selector (Resolved / Escalated / Parked) +3. Render 4 structured editable fields: Root Cause, Resolution, Steps Taken, Recommendations +4. Render output destination checkboxes: Post to CW note / Save to KB / Send client summary +5. 
Confirm → call resolve/escalate/pause with enriched request body including structured fields + +### F9 — MSP-native language pass + +| Old | New | +|-----|-----| +| "AI Assistant" (page title, meta) | "FlowPilot" | +| "New Chat" | "New Case" | +| "Messages" | "Conversation Log" | +| "Task Lane" (panel label) | "Steps" | +| "Conclude" | "Close Case" | +| "Chat history" (sidebar label) | "Case History" | +| Compose placeholder | "Describe finding, paste log output, or ask FlowPilot..." | + +### F10 — New API methods (`aiSessions.ts`) + +```typescript +updateTriage(sessionId: string, fields: Partial<TriageMeta>): Promise +getHandoffDraft(sessionId: string): AsyncGenerator +``` + +### F11 — New types (`types/ai-session.ts`) + +```typescript +interface TriageMeta { + client_name: string | null + asset_name: string | null + issue_category: string | null + triage_hypothesis: string | null + evidence_items: EvidenceItem[] +} + +interface EvidenceItem { + text: string + status: 'confirmed' | 'ruled_out' | 'pending' +} + +interface TriageUpdate extends Partial<TriageMeta> {} + +// Extend existing: +interface QuestionItem { + text: string + context: string + options?: string[] // new +} +``` + +--- + +## Phased Execution Order + +### Phase 1 — Backend Foundation +1. Write migration `071` — add 5 columns to `ai_sessions` +2. Run `alembic upgrade head`, verify columns +3. Update `AISession` model with new mapped columns +4. Add `TriageUpdate` schema, extend `QuestionItem`, extend `ChatMessageResponse` +5. Extend `ResolveSessionRequest` / `EscalateSessionRequest` with structured fields +6. Add `PATCH /{id}/triage` endpoint +7. Add `POST /{id}/handoff-draft` streaming endpoint +8. Update `GET /ai-sessions/{id}` response to include new triage fields +9. Update `resolution_output_generator._build_session_context()` to use structured fields +10. Run backend tests — `pytest --override-ini="addopts="` + +### Phase 2 — Triage Extraction (AI layer) +11. 
Add `[TRIAGE_UPDATE]` marker to `unified_chat_service.py` system prompt +12. Implement `_parse_triage_update_marker()` in the service +13. Auto-PATCH session on non-null `triage_update` +14. Add `options` generation instructions to `[QUESTIONS]` system prompt section +15. Verify extraction in a live session + +### Phase 3 — New Frontend Types + API +16. Add `TriageMeta`, `EvidenceItem`, `TriageUpdate` to `types/ai-session.ts` +17. Extend `QuestionItem` type +18. Add `updateTriage()` and `getHandoffDraft()` to `aiSessions.ts` + +### Phase 4 — New Work Zone Components +19. Build `IncidentHeader.tsx` with `EditPopover` +20. Build `StepsPanel.tsx` +21. Build `FlowPilotAsks.tsx` +22. Build `WhatWeKnow.tsx` + +### Phase 5 — Page Layout Refactor +23. Refactor `AssistantChatPage.tsx` — implement stacked cockpit layout +24. Wire `triageMeta` state, session load population, `triage_update` merge +25. Implement drag-resizable split with `localStorage` persistence +26. Compact `ConversationLog` rendering + +### Phase 6 — Handoff Modal + Language Pass +27. Redesign `ConcludeSessionModal.tsx` — structured handoff form +28. MSP-native language pass across all assistant components +29. Update `` title + +### Phase 7 — QA + Hardening +30. `npx tsc -b` — fix any TypeScript errors +31. `npm run build` — production build clean +32. Functional regression: all chat flows, session switching, conclude/resume +33. Harness feel test: cockpit within 3 seconds? +34. Mobile viewport check +35. Stress test: 50+ messages, 10+ steps, long outputs + +--- + +## Risks and Mitigations + +| Risk | Mitigation | +|------|-----------| +| `[TRIAGE_UPDATE]` marker extraction is unreliable — AI doesn't emit it consistently | Gate Phase 2 on a pass/fail test with 5 real sessions before wiring it to the header. Fall back to Option B (post-response Haiku pass) if needed. 
| +| Header fields feel fabricated — AI guesses wrong client or hypothesis | Show confidence-aware placeholder copy ("FlowPilot is building context…") until a field has real data. Never invent. | +| Task lane visual promotion breaks established chat patterns | Keep all send/respond behavior intact. Change hierarchy only. Verify every task-lane state transition manually. | +| Handoff modal exposes weak underlying summaries | Reuse existing `ResolutionOutputGenerator` output where possible. Add guardrail copy for empty fields. | +| Mobile loses compose or step access | Test responsive layout as a first-class deliverable in Phase 7, not a final sweep. Enforce scroll isolation between all zones. | +| `tsc -b` errors after component refactor | Run `npx tsc -b` after every phase. Trace unused imports/props immediately — don't batch (lesson #92). | + +--- + +## Test Plan + +### Harness Feel (primary, subjective) +- Does the page read as an MSP triage cockpit within 3 seconds on first load? +- Is the active step obvious without reading chat? +- Do FlowPilot Asks quick-reply buttons work and update the step list? +- Does the incident header update mid-session as AI learns context? +- Drag handle, refresh — does split restore? +- Does the conclude modal look like a case handoff or a chat closure? 
+ +### Functional Regression +- New session (no PSA) — header degrades gracefully +- New session (with CW ticket) — header populates from ticket data +- Send message → `triage_update` updates header +- Click quick-reply button → answer submitted, step advances +- Add finding to What We Know → persisted via PATCH +- Edit header field via `✏` → saved and survives refresh +- Conclude as Resolved → handoff draft fills modal → post to CW note +- Conclude as Escalated → same +- Pause and resume → triage header restores from saved session fields +- Session switching (currentChatRef guard) — no stale state +- Image paste, forks, suggested flows — all still work + +### MSP Scenarios (from docx) +1. Single-user endpoint issue (basic triage flow, script generation) +2. M365 / tenant-wide issue (multi-user context, issue category) +3. Network / VPN outage (asset targeting, hypothesis tracking) +4. Escalation and resume (session persistence, structured handoff) + +### Edge Cases +- 50+ messages — layout hierarchy stays intact +- 10+ steps — step panel scrolls, compose remains accessible +- Long issue titles / hypothesis text — header truncates gracefully +- Missing PSA context — placeholder copy, not blank fields +- Narrow mobile viewport — all zones reachable + +### Backend Checks +```bash +# Migration +alembic upgrade head +psql -U postgres -d resolutionflow -c "\d ai_sessions" | grep -E "client_name|asset_name|issue_category|triage_hypothesis|evidence_items" + +# Triage PATCH (JSON body, so the Content-Type header is required) +curl -X PATCH http://localhost:8000/ai-sessions/{id}/triage \ + -H "Authorization: Bearer $TOKEN" \ + -H "Content-Type: application/json" \ + -d '{"client_name":"Test Client","triage_hypothesis":"Cache corruption"}' + +# Handoff draft stream +curl -X POST http://localhost:8000/ai-sessions/{id}/handoff-draft \ + -H "Authorization: Bearer $TOKEN" +``` + +--- + +## Critical Files + +| File | Change | +|------|--------| +| `backend/alembic/versions/071_add_triage_fields_to_ai_sessions.py` | New migration | +| `backend/app/models/ai_session.py` | 
Add 5 new mapped columns | +| `backend/app/schemas/ai_session.py` | `TriageUpdate`, `QuestionItem.options`, extended request/response schemas | +| `backend/app/api/endpoints/ai_sessions.py` | `PATCH /triage`, `POST /handoff-draft` | +| `backend/app/services/unified_chat_service.py` | `[TRIAGE_UPDATE]` marker extraction, auto-PATCH | +| `backend/app/services/resolution_output_generator.py` | Structured fields in context builder | +| `frontend/src/types/ai-session.ts` | `TriageMeta`, `EvidenceItem`, `TriageUpdate`; extend `QuestionItem` | +| `frontend/src/api/aiSessions.ts` | `updateTriage()`, `getHandoffDraft()` | +| `frontend/src/pages/AssistantChatPage.tsx` | Full cockpit layout refactor | +| `frontend/src/components/assistant/IncidentHeader.tsx` | New | +| `frontend/src/components/assistant/StepsPanel.tsx` | New (from TaskLane logic) | +| `frontend/src/components/assistant/FlowPilotAsks.tsx` | New | +| `frontend/src/components/assistant/WhatWeKnow.tsx` | New | +| `frontend/src/components/assistant/ConcludeSessionModal.tsx` | Redesigned |