Files
resolutionflow/docs/superpowers/specs/2026-05-28-l1-workspace-design.md
Michael Chihlas 07a29f630a docs(design): revise L1 spec after review (sessions, adhoc, OAuth, seat enforcement)
Restructure walked_path off FlowProposal onto new l1_walk_sessions table
(each L1 walk has its own path; proposal carries only the validation bit).
Add adhoc walk variant for live calls when no KB content exists, with a
dedicated BuildAbortedNoKB screen offering ad-hoc/escalate/near-miss
options. Introduce SUGGEST_THRESHOLD below MATCH_THRESHOLD so near-miss
flows surface as suggestions instead of triggering a 10s build. Define
empty-state dashboard mode for first-run accounts. Spec the Microsoft
Graph OAuth flow concretely (multi-tenant app, redirect callback, token
refresh). Add seat enforcement for both L1 and engineer tracks via shared
helper (engineer enforcement was missing in current code). Make audit
policy explicit (resolve/escalate only, not per-step). Add session
lifecycle (concurrent sessions, browser-close recovery, 24h abandonment).
Clarify KB doc visibility is owner/engineer only (L1s see citations in
walker, not /account/kb directly). Acknowledge escalation notification
noise as v1 limitation with targeted notification deferred to v2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 10:51:57 -04:00

68 KiB
Raw Blame History

L1 Workspace — Design Spec

Date: 2026-05-28 Status: Draft (pending implementation plan) Audience for this doc: engineers + reviewers building the L1 workspace feature


1. Summary

Introduce a dedicated L1 helpdesk workspace as a new seat tier in ResolutionFlow. L1 techs walk customers through yes/no decision trees on inbound tickets and phone calls. The platform either matches an existing authored flow, reuses an outcome-validated AI draft, or builds a fresh decision tree in real time from the MSP's ingested knowledge base. Drafts that resolve a call become "outcome-validated" and surface first in the engineer review queue for promotion to authored flows. KB ingestion supports manual upload plus three MSP-native connectors: IT Glue, Hudu, and Microsoft SharePoint/OneDrive.

This re-introduces the original deterministic tree-walker UX — which had been deprecated in favor of chat-primary FlowPilot — and repositions it as a frontline-tier product surface distinct from the engineer chat surface.


2. Motivation

The current ResolutionFlow product funnels every user — regardless of skill tier — into a single chat-primary surface (AssistantChatPage mounted at /pilot). The chat is excellent for engineers but is the wrong primitive for L1 helpdesk staff who:

  • Take inbound phone calls and need a fast, deterministic click-through UX
  • Resolve simple, recurring problems (password resets, mailbox connection issues, VPN disconnects, printer queue clears, etc.)
  • Are not authorized to escalate complex issues themselves; they hand off to engineers

A tree-walker UX serves this audience natively. The substrate already exists in the codebase — decision-tree data model, authoring tools, RAG, KB Accelerator, escalation packaging — but no first-class L1 surface ties it together. This spec defines that surface and the supporting AI/KB pipeline.


3. Users & roles

3.1 Role hierarchy

super_admin > owner > engineer > l1_tech > viewer

l1_tech is added to the account_role enum. Permissions enforced via app/core/permissions.py and app/api/deps.py.

3.2 What L1 can do

  • Use the /l1/* surface
  • Open tickets from their queue (PSA-fed or internal)
  • Intake walk-in/phone-call problems (creates a ticket as a side effect)
  • Walk authored flows and AI-built FlowProposal drafts
  • Resolve or escalate a session
  • View their own AI drafts list (read-only — outcome tags shown)

3.3 What L1 cannot do

  • See the chat surface (/pilot) — sidebar hidden, route 403s
  • Author or edit flows
  • See /review-queue or /escalations (engineer inboxes)
  • See team analytics (only /analytics/me)
  • Promote AI drafts (engineers/owners only, via existing review queue)
  • Configure KB connectors (owner-only)

3.4 Engineer L1 coverage

Engineers do NOT see the L1 surface by default. Owners can toggle users.can_cover_l1 = true on individual engineer users. Engineers with that flag (and all owners/super_admins) see an "L1 Workspace" entry in their sidebar. Clicking it puts them in /l1/* with a sticky banner: "Covering L1 — actions logged as coverage." Coverage actions are audit-logged with acting_as = 'l1_coverage'.

Backend dep: require_l1_or_coverage = l1_tech | (engineer AND can_cover_l1) | owner | super_admin.

This mirrors the existing orthogonal-flag pattern (is_team_admin) — no new architectural concept.

3.5 Billing data model

  • accounts.l1_seats_purchased INTEGER NOT NULL DEFAULT 0 (new column)
  • Existing accounts.seats_purchased continues to represent engineer seats
  • New Stripe SKU placeholder for L1 seat; actual pricing set in Stripe dashboard out-of-band

3.6 Seat enforcement (L1 + engineer together)

Important context surfaced during spec review: there is currently no seat-limit enforcement in the codebase. subscription.seat_limit is stored from Stripe webhook payloads and surfaced in API responses, but no endpoint blocks invites when the limit is reached. To avoid shipping L1 with enforcement while engineer seats remain unbounded (inconsistent SKU story), this spec adds enforcement for both tracks as part of v1.

Shared helper: services/seat_enforcement.py:

def check_seat_available(
    account: Account,
    subscription: Subscription,
    role: Literal['engineer', 'l1_tech'],
    db: AsyncSession,
) -> SeatCheckResult

Counts active users in the account at the given role, compares against the subscription's role-specific limit (seat_limit for engineer, l1_seat_limit for L1). Returns {available: bool, current: int, limit: int}.

Enforcement points:

  • POST /api/v1/invites (invite create) — blocks with 402 Payment Required (or 422 with code seat_limit_exceeded) when the target role's seats are full. Body: {current, limit, role, upgrade_url: <stripe customer portal link>}.
  • Invite accept (/api/v1/accept-invite) — re-checks at acceptance time (race-condition guard).
  • Role change on existing user (e.g., promoting viewer to engineer) — same check before commit.
  • Admin "assign role" UI — pre-checks seat availability and disables the option when full.

Grandfathering: any account currently over-seated (existing inviting beyond the limit was technically allowed before) is not retroactively kicked. The enforcement applies from migration-time forward — existing over-seated accounts get a banner prompting upgrade or seat removal but functionality is preserved until they invite a new user.

Frontend: /admin/users and /account/users show a seat counter widget for each role (3 / 5 engineer seats used · 2 / 5 L1 seats used). When a count exceeds the limit, the widget renders amber with a tooltip explaining grandfathering.


4. Architecture overview

4.1 New components

Frontend:

  • pages/l1/L1Dashboard.tsx — landing page; ticket queue + describe-the-problem intake. Two modes (empty-state + active).
  • pages/l1/L1WalkPage.tsx — purpose-built walker with two internal variants: tree (flow/proposal) and adhoc (note-taking).
  • pages/l1/L1NoKBScreen.tsx — BuildAbortedNoKB screen with three CTAs (adhoc / escalate / use near-miss).
  • pages/l1/L1DraftsPage.tsx — read-only list of the L1's AI drafts and promotion status.
  • pages/l1/L1TicketsPage.tsx — full-page queue (PSA + internal merged).
  • components/l1/L1CoverageBanner.tsx — slim banner shown to engineer-coverers.
  • components/l1/SuggestPrompt.tsx — inline near-miss suggestion ("Use this flow / Build new").
  • components/admin/SeatCounterWidget.tsx — engineer + L1 seat usage counts on /admin/users and /account/users.

Backend:

  • services/match_or_build.py — orchestrator (RAG match → fallback to AI build)
  • services/ai_tree_builder.py — real-time AI tree generation via Anthropic
  • services/kb_connectors/ package — base, registry, encryption, plus itglue.py, hudu.py, microsoft_graph.py
  • services/kb_ingestion_writer.py — shared writer used by manual upload + all connectors
  • services/kb_ingestion_scheduler.py — APScheduler job, max_instances=1, per-connector sync
  • services/internal_ticket_service.py — CRUD + status transitions for the no-PSA fallback
  • services/l1_session_service.py — walking-session lifecycle
  • api/endpoints/l1.py — L1-role endpoints
  • api/endpoints/kb_connectors.py — KB connector config endpoints (owner-only for write)

Reused / extended:

  • services/rag_service.py — flow & KB matching (existing)
  • services/flow_matching_engine.py — existing
  • services/escalation_package_generator.py — extended to include walked path, AI draft pointer, KB citations
  • models/FlowProposal — new columns (see §5)
  • New models/L1WalkSession — per-session state for tree walks and adhoc walks (see §5.3)
  • services/psa/ — already supports ticket create + reassign across CW/Autotask/HaloPSA
  • services/embedding_service.py — used by KB ingestion writer
  • New kb_documents + kb_document_chunks tables for RAG-retrievable document storage, separate from the existing kb_imports (which is a document→tree conversion record, not a persistent KB store — see §5)
  • Audit log writer — gains acting_as field

4.2 Data flow — walk-in / phone-call intake

L1 types: "User can't connect Outlook after password reset"
  POST /api/v1/l1/intake
    body: { problem_statement, customer_name?, customer_contact? }
    → create ticket
        - PSA if configured: psa_provider.create_ticket(...)
        - else: internal_tickets row
    → match_or_build(account_id, problem_text, ticket_ref)
        → rag_service.match_flows(...) → top hit; if score ≥ threshold return as 'flow'
        → rag_service.match_proposals(... where validated_by_outcome=true)
                                           → top hit; if score ≥ threshold return as 'proposal'
        → ai_tree_builder.build(problem_text, kb_chunks, nearest_flows)
                                           → persist FlowProposal(source='ai_realtime_l1',
                                                                  linked_ticket_id,
                                                                  linked_ticket_kind,
                                                                  validated_by_outcome=false)
                                           → return as 'proposal'
    → l1_session_service.start(...)
    → return { session_id, target_kind, target_id, intake_type }
  → navigate to /l1/walk/{session_id}

4.3 Data flow — PSA-queue intake

The L1 dashboard polls the L1's PSA queue plus their internal tickets. Clicking a ticket row calls POST /api/v1/l1/tickets/{ticket_ref}/start which is the same match_or_build path (the problem_statement is the ticket subject + description) followed by walker navigation.


5. Data model

All new tenant-isolated tables get RLS policies (account-scoped, WITH CHECK). All TIMESTAMPs are TIMESTAMPTZ. No --rev-id on Alembic; no --autogenerate for enum/RLS work.

5.1 FlowProposal — extended

Existing AI-draft model. Add columns:

Column Type Notes
source VARCHAR(30) NOT NULL 'ai_realtime_l1' | 'kb_accelerator' | 'manual_draft'. Backfill existing rows to 'manual_draft'.
linked_ticket_id VARCHAR(64) NULL PSA id or internal_tickets UUID (stored as text)
linked_ticket_kind VARCHAR(10) NULL 'psa' | 'internal'
validated_by_outcome BOOLEAN NOT NULL DEFAULT FALSE Flipped to true when any L1 walks this proposal to a helpful resolve

Note (revised after spec review): the walked path lives on the session (l1_walk_sessions, §5.3), not the proposal. A single proposal may be walked by multiple L1s over time — each walk has its own path. The proposal carries only the boolean validation signal; engineer review queries the latest validated session's path for context.

Engineer review queue sort:

ORDER BY validated_by_outcome DESC, created_at DESC

5.2 internal_tickets — new

id                        UUID PRIMARY KEY
account_id                UUID NOT NULL  (RLS-scoped)
created_by_user_id        UUID NOT NULL  (the L1 who took the call)
customer_name             VARCHAR(120)
customer_contact          VARCHAR(200) NULL    (email or phone, free text)
problem_statement         TEXT NOT NULL
status                    VARCHAR(30) NOT NULL  -- 'open' | 'walking' | 'resolved' | 'escalated'
flow_id                   UUID NULL FK trees
flow_proposal_id          UUID NULL FK flow_proposals
ai_session_id             UUID NULL FK ai_sessions (set when engineer picks up in chat post-escalation)
assigned_user_id          UUID NULL    (engineer post-escalation)
resolution_notes          TEXT NULL
psa_promoted_ticket_id    VARCHAR(64) NULL   (set if later promoted to PSA)
created_at                TIMESTAMPTZ NOT NULL
updated_at                TIMESTAMPTZ NOT NULL
resolved_at               TIMESTAMPTZ NULL

RLS: account-scoped, WITH CHECK on insert/update.

5.3 l1_walk_sessions — new

Per-session state for an L1 walking a ticket. Supports three session kinds: walking an authored flow, walking an AI-built proposal, or an adhoc walk with no tree (used when no KB content exists and the L1 needs to handle the call manually but still wants the session/ticket/escalation framework).

id                              UUID PRIMARY KEY
account_id                      UUID NOT NULL  (RLS-scoped)
created_by_user_id              UUID NOT NULL  (the L1, or coverage engineer)
acting_as                       VARCHAR(30) NULL  -- 'l1_coverage' when engineer covers; null for native L1
ticket_id                       VARCHAR(64) NOT NULL  -- PSA id or internal_tickets UUID as text
ticket_kind                     VARCHAR(10) NOT NULL  -- 'psa' | 'internal'
session_kind                    VARCHAR(20) NOT NULL  -- 'flow' | 'proposal' | 'adhoc'
flow_id                         UUID NULL FK trees
flow_proposal_id                UUID NULL FK flow_proposals
current_node_id                 VARCHAR(100) NULL  -- node within the tree; null for adhoc
walked_path                     JSONB NOT NULL DEFAULT '[]'::jsonb  -- [{node_id, question, answer, l1_note}]; [] for adhoc
walk_notes                      JSONB NOT NULL DEFAULT '[]'::jsonb  -- free-form notes (adhoc) or supplementary notes (tree walks)
status                          VARCHAR(20) NOT NULL DEFAULT 'active'  -- 'active' | 'resolved' | 'escalated' | 'abandoned'
resolution_notes                TEXT NULL
helpful                         BOOLEAN NULL       -- the "did this work?" answer at resolve time
escalation_reason               TEXT NULL
escalation_reason_category      VARCHAR(30) NULL
started_at                      TIMESTAMPTZ NOT NULL
last_step_at                    TIMESTAMPTZ NOT NULL
resolved_at                     TIMESTAMPTZ NULL

Constraints:

  • CHECK (session_kind = 'flow' AND flow_id IS NOT NULL AND flow_proposal_id IS NULL) OR (session_kind = 'proposal' AND flow_proposal_id IS NOT NULL AND flow_id IS NULL) OR (session_kind = 'adhoc' AND flow_id IS NULL AND flow_proposal_id IS NULL)
  • Soft "abandoned" status: if last_step_at is older than 24h and status is still 'active', a cleanup task flips it to 'abandoned' (preserves data; just gets it off the L1's "Resume in progress" widget).

RLS: account-scoped, WITH CHECK on insert/update.

Why a new table (rather than reusing ai_sessions): ai_sessions is the chat-conversation model — flat message list, no node-state, no flow/proposal linkage. An L1 walk has different state (current node, walked path, walk-kind constraint). Forcing it into ai_sessions would require multiple new nullable columns on a heavily-used model and overload its semantics. Separate table = cleaner separation and lower regression risk.

5.4 kb_connector_configs — new

id                        UUID PRIMARY KEY
account_id                UUID NOT NULL  (RLS-scoped)
provider                  VARCHAR(20) NOT NULL  -- 'itglue' | 'hudu' | 'microsoft_graph'
display_name              VARCHAR(80) NOT NULL
credentials_encrypted     BYTEA NOT NULL        -- Fernet, same pattern as services/psa/encryption.py
is_active                 BOOLEAN NOT NULL DEFAULT TRUE
sync_interval_minutes     INTEGER NOT NULL DEFAULT 360
last_sync_at              TIMESTAMPTZ NULL
last_sync_status          VARCHAR(20) NULL      -- 'success' | 'error' | 'running'
last_sync_error           TEXT NULL
created_by_user_id        UUID NOT NULL
created_at                TIMESTAMPTZ NOT NULL
updated_at                TIMESTAMPTZ NOT NULL
UNIQUE (account_id, provider, display_name)

RLS: account-scoped, WITH CHECK.

5.5 New tables: kb_documents + kb_document_chunks

The existing kb_imports table is a document→tree conversion record (status lifecycle processing | ready | committed | failed, target tree_id) — designed to turn one document into one authored flow. It is NOT a persistent KB document store and does not power RAG retrieval.

The L1 feature needs a separate pair of tables that store ingested docs in RAG-retrievable form:

kb_documents — one row per ingested document:

id                        UUID PRIMARY KEY
account_id                UUID NOT NULL  (RLS-scoped)
source_kind               VARCHAR(20) NOT NULL  -- 'upload' | 'paste' | 'itglue' | 'hudu' | 'microsoft_graph'
source_ref                VARCHAR(200) NULL     -- provider-side document ID for re-sync
connector_config_id       UUID NULL FK kb_connector_configs
title                     VARCHAR(500) NOT NULL
content                   TEXT NOT NULL          -- full post-extraction text
content_hash              VARCHAR(64) NOT NULL   -- sha256 for change-detection
metadata                  JSONB NULL             -- provider-specific (org_id, drive_id, etc.)
last_synced_at            TIMESTAMPTZ NULL
deleted_at                TIMESTAMPTZ NULL       -- soft-delete on connector removal
created_at                TIMESTAMPTZ NOT NULL
updated_at                TIMESTAMPTZ NOT NULL

Unique partial index: (connector_config_id, source_ref) WHERE source_ref IS NOT NULL.

kb_document_chunks — chunks with embeddings, used by rag_service.match_kb_chunks:

id                        UUID PRIMARY KEY
document_id               UUID NOT NULL FK kb_documents ON DELETE CASCADE
account_id                UUID NOT NULL  -- denormalized for RLS
chunk_index               INTEGER NOT NULL
content                   TEXT NOT NULL
embedding                 VECTOR(<dim>) NOT NULL  -- dim matches embedding_service
metadata                  JSONB NULL              -- section title, page number, etc.
created_at                TIMESTAMPTZ NOT NULL
UNIQUE (document_id, chunk_index)

Pgvector index (ivfflat or hnsw) on embedding; choice tuned during implementation.

RLS on both tables: account-scoped, WITH CHECK on insert.

Coexistence with kb_imports: when an L1 (or owner) uploads a doc, the system can populate both — the existing KBImport pipeline produces a draft tree, and the new ingestion writer additionally chunks+embeds the doc into kb_documents for RAG. Both paths share the upload endpoint but write to independent tables. Connectors only write to kb_documents (no auto-tree-conversion from synced docs in v1).

5.6 Other column additions

  • users.can_cover_l1 BOOLEAN NOT NULL DEFAULT FALSE
  • accounts.l1_seats_purchased INTEGER NOT NULL DEFAULT 0
  • audit_logs.acting_as VARCHAR(30) NULL'l1_coverage' when engineer is in coverage mode; null otherwise
  • account_role enum: add 'l1_tech'
  • subscriptions.l1_seat_limit INTEGER NULL (mirrors existing seat_limit which is treated as the engineer limit going forward)

5.6.1 Audit log policy (explicit)

Audit rows are written only at session terminal events — resolve and escalate — not on each step. The walked path is recorded incrementally on l1_walk_sessions.walked_path as it accumulates; the audit row at resolve/escalate captures the frozen final snapshot inline. Mid-walk step-by-step audit logging is not v1 because:

  • MSP IT troubleshooting actions taken via an L1 walk are rarely high-stakes enough to justify the row-volume cost (~520 audit rows per call vs. 1).
  • The walked_path on the session is itself the auditable record for the L1's path through the tree; the session table is account-scoped and retained.
  • If a customer-impacting incident traces back to an L1 walk, the path is recoverable from the session row even when the session is abandoned (cleanup task preserves the row, just flips status).

If higher granularity is needed later (e.g., for compliance-heavy verticals), it's an additive change: subscribe to step events, emit an audit row per step. Not blocking v1.

5.7 Migration ordering

Eight manual Alembic revisions (no --rev-id, no --autogenerate):

  1. Add 'l1_tech' to account_role enum.
  2. Add users.can_cover_l1, accounts.l1_seats_purchased, audit_logs.acting_as.
  3. Extend flow_proposals with new columns + backfill existing rows to source='manual_draft'. Do not add walked_path_snapshot — that column lives on the new sessions table.
  4. Create l1_walk_sessions + RLS policies (account-scoped, WITH CHECK) + check constraint on session_kind combinations.
  5. Create internal_tickets + RLS policies.
  6. Create kb_connector_configs + RLS policies.
  7. Create kb_documents + kb_document_chunks tables + RLS policies + pgvector index on chunks.
  8. Add seat-enforcement support: subscriptions.l1_seat_limit INTEGER NULL (already have seat_limit for engineers — kept as-is and treated as the engineer limit going forward).

Per Lesson on tenant-isolated tables: any service-construction site that creates rows on these tables must pass account_id= explicitly. Grep all Model( sites before merge.


6. Backend services & endpoints

6.1 New services

Module Purpose
services/match_or_build.py Orchestrator. Single async entrypoint match_or_build(account_id, problem_text, ticket_ref) -> MatchOrBuildResult.
services/ai_tree_builder.py Real-time AI tree generation. Anthropic via existing _call_anthropic_cached pattern. Model tier via settings.get_model_for_action('l1_realtime_build'). Output validated against the flow node schema with Pydantic; rejects malformed output.
services/kb_connectors/base.py Abstract KBConnector with test_credentials, list_documents, fetch_content, subscribe_to_changes (optional).
services/kb_connectors/itglue.py IT Glue REST client.
services/kb_connectors/hudu.py Hudu REST client.
services/kb_connectors/microsoft_graph.py Microsoft Graph (SharePoint/OneDrive) client.
services/kb_connectors/registry.py KBConnectorRegistry (mirrors PsaProviderRegistry).
services/kb_connectors/encryption.py Fernet wrapper (or reuse the PSA one if generic).
services/kb_ingestion_writer.py Shared writer: chunk → embed → upsert. Used by manual upload AND connector sync.
services/kb_ingestion_scheduler.py APScheduler interval job, max_instances=1. Sequential per account; concurrency cap = 4 accounts simultaneously.
services/internal_ticket_service.py CRUD + status transitions for internal_tickets.
services/l1_session_service.py Walking-session lifecycle: start (flow/proposal/adhoc), step, notes, resolve, escalate, escalate-without-walk. Owns l1_walk_sessions writes.
services/l1_session_cleanup.py APScheduler job (hourly, max_instances=1) flipping stale active sessions to abandoned after 24h of inactivity.
services/seat_enforcement.py Shared helper used by invite, accept-invite, and role-change paths. Returns SeatCheckResult for engineer + L1 roles consistently.

6.2 Extended services

  • services/escalation_package_generator.py — adds inputs: walked_path, ai_draft_proposal_id, kb_citations. New caller path from l1_session_service.escalate(...).
  • KB Accelerator endpoint — accepts ingested content via the shared kb_ingestion_writer. Manual upload and connector sync share the same persistence path.

6.3 New endpoints

All under require_l1_or_coverage unless noted. Mounted under /api/v1/l1.

Method Path Purpose Auth
GET /l1/queue Merged ticket queue (PSA + internal). Pagination + status filter. require_l1_or_coverage
POST /l1/intake Walk-in intake. Body {problem_statement, customer_name?, customer_contact?, force_build?}. Creates ticket, runs match_or_build. Response is one of: {outcome: 'matched', session_id, session_kind, target_id} · {outcome: 'suggest', suggestion, can_build} (frontend prompts user) · {outcome: 'aborted_no_kb', near_miss?, ticket_ref} (frontend renders BuildAbortedNoKB screen §8.4). require_l1_or_coverage
POST /l1/tickets/{ticket_ref}/start Start walker from an existing ticket. Internally same as intake but skips ticket creation. require_l1_or_coverage
POST /l1/sessions/{id}/step Record an answer (tree walks only). Body {node_id, answer, note?}. Appends to l1_walk_sessions.walked_path. require_l1_or_coverage
POST /l1/sessions/{id}/notes Update walk notes (adhoc walks only). Body {notes: JSONB}. Replaces l1_walk_sessions.walk_notes. Debounced auto-save from frontend. require_l1_or_coverage
POST /l1/sessions/{id}/resolve Close as resolved. Body {resolution_notes, helpful: bool}. Sets validated_by_outcome=true on the proposal when helpful=true AND session_kind='proposal'. Closes the ticket. require_l1_or_coverage
POST /l1/sessions/{id}/escalate Generate escalation package + reassign ticket. Body {reason, reason_category}. require_l1_or_coverage
POST /l1/sessions/adhoc Start an adhoc walk. Body {ticket_ref?, ticket_kind?, problem_statement, customer_name?, customer_contact?}. If ticket_ref omitted, creates a ticket first (PSA or internal). Returns {session_id}. require_l1_or_coverage
POST /l1/escalate-without-walk Escalate immediately without a walking session (used from the BuildAbortedNoKB screen). Body {problem_statement, customer_name?, customer_contact?, reason_category}. Creates ticket + escalated l1_walk_sessions row + escalation package. require_l1_or_coverage
GET /l1/drafts List current user's AI drafts with promotion status. require_l1_or_coverage

KB connector endpoints (/api/v1/kb-connectors):

Method Path Purpose Auth
GET /kb-connectors List configured connectors for account. require_l1_or_above
POST /kb-connectors Create. OAuth handoff for Microsoft Graph; API token entry for IT Glue/Hudu. require_account_owner
DELETE /kb-connectors/{id} Remove (soft-disable). require_account_owner
POST /kb-connectors/{id}/sync Trigger immediate sync (enqueued). require_account_owner
GET /kb-connectors/{id}/status Sync status + doc count + last error. require_l1_or_above

Internal ticket endpoints (/api/v1/internal-tickets):

Method Path Purpose Auth
GET /internal-tickets List (account-scoped). require_l1_or_coverage
GET /internal-tickets/{id} Detail. require_l1_or_coverage
POST /internal-tickets/{id}/promote-to-psa Push to configured PSA, set psa_promoted_ticket_id. require_account_owner

User management additions:

Method Path Purpose Auth
PATCH /users/{id}/coverage Set can_cover_l1 flag. Body {can_cover_l1: bool}. require_account_owner
GET /accounts/me/seats Returns seat usage {engineer: {current, limit}, l1_tech: {current, limit}}. Used by admin/users UIs to render the counter widget. require_engineer_or_admin

Seat-enforcement integration points (no new endpoints — enforcement is inserted into existing flows):

  • POST /api/v1/invites (invite create) — returns 402 Payment Required (or 422 with code: seat_limit_exceeded) when target role has no remaining seats. Body includes {current, limit, role, upgrade_url}.
  • POST /api/v1/accept-invite — race-condition re-check at acceptance time.
  • Role-change endpoints — same check.

7. Frontend surface

7.1 Sidebar — L1 view

LOGO
─────────────
Workspace      /l1
Tickets        /l1/tickets
My Drafts      /l1/drafts
─────────────
Guides         /guides
Account        /account     (filtered — no integrations, no categories)

No /pilot, no /trees, no /flows, no /review-queue, no /escalations, no team analytics. Sidebar.tsx picks the nav array by role.

7.2 Sidebar — engineer coverage view

Engineer's existing sidebar plus a single appended entry "L1 Workspace" → /l1. Shown when canCoverL1 || isOwner || isSuperAdmin.

7.3 /l1 dashboard layout

The dashboard has two modes determined on load: empty-state (account has no flows AND no KB documents) or active (normal state).

Active mode — four vertical zones, single column, max width ~1100px:

  1. Greeting — uppercase tracking date label + Bricolage 700 hero ("Good morning, {firstName}.")
  2. Describe the problem card — large textarea (autofocus on load), optional customer_name + customer_contact fields, single primary CTA "Start walk →" (the only electric-blue element on the page)
  3. Open tickets — section label, count, table rows (merged PSA + internal with origin badges), row hover bg-elevated
  4. Resume in progress — shown when L1 has any session with status='active'. Lists ALL active sessions, not just one, sorted by last_step_at DESC. Each row shows ticket #, customer name, current node summary, "Step N · estimated M" or "Adhoc walk · {len(walk_notes)} notes".

Empty-state mode (first-run experience) — shown when count(flows) == 0 AND count(kb_documents) == 0 for the account:

┌──────────────────────────────────────────────────┐
│  Good morning, {firstName}.                       │
│                                                   │
│  ╔══════════════════════════════════════════════╗ │
│  ║  Your knowledge base is empty                ║ │
│  ║                                              ║ │
│  ║  L1 Workspace works best when your account  ║ │
│  ║  has KB content or authored flows. Right    ║ │
│  ║  now there's nothing to match against.      ║ │
│  ║                                              ║ │
│  ║  [for L1 role:]                              ║ │
│  ║  Ask your admin to:                          ║ │
│  ║  • Upload KB documents                       ║ │
│  ║  • Configure a KB connector (IT Glue, etc.)  ║ │
│  ║  • Or author a flow                          ║ │
│  ║                                              ║ │
│  ║  [for owner/coverage engineer:]              ║ │
│  ║  [ Upload KB content ]  [ Configure connector ]│ │
│  ║                                              ║ │
│  ║  You can still take calls — they'll start    ║ │
│  ║  as ad-hoc walks.                            ║ │
│  ╚══════════════════════════════════════════════╝ │
│                                                   │
│  Describe the problem (still works — will start   │
│  as ad-hoc walk):                                 │
│  [ ... textarea ... ]                             │
│  [ Start ad-hoc walk → ]                          │
└──────────────────────────────────────────────────┘

The empty-state card never blocks intake — an L1 can still take a call and the system gracefully starts an ad-hoc walk (since match_or_build will return aborted_no_kb).

Tailwind v4 tokens: bg-page base, bg-card zones, bg-elevated row hover, electric-blue accent only on primary CTA. No text-secondary. All borders border-default.

7.4 /l1/walk/{sessionId} walker

The walker renders one of two variants based on l1_walk_sessions.session_kind:

  • Tree variant (§7.4.A) — for session_kind in ('flow', 'proposal')
  • Adhoc variant (§7.4.B) — for session_kind = 'adhoc'

Both share the sticky header, persistent Escalate + Resolve buttons, customer info, and the resolve/escalate modals.

7.4.A Tree variant (flow + proposal walks)

Sticky header + two-pane body, full-height (flex chain per Lesson — every ancestor needs flex + flex-1 + min-h-0).

Header:

  • Back arrow + ticket ref + customer name + AI-built badge (when session_kind='proposal')
  • Problem statement line
  • Persistent action buttons: [ Escalate ] [ Resolve ✓ ]

Left pane (main):

  • "Step N · estimated M" label
  • Current node card — large yes/no/answer buttons (min 44px tap target)
  • Optional note textarea below the card (appended to walked_path as l1_note)
  • On a fresh proposal that's still building: shimmer placeholder + "Building from KB… ~10s"

Right pane (transcript):

  • Walked-so-far list (node title + answer chosen)
  • Current step highlight
  • "Source:" section listing KB citations for the current node (proposal walks only)

7.4.B Adhoc variant (no tree)

Same sticky header (no AI-built badge since there's no tree). Single-pane body instead of two-pane:

Header:

  • Back arrow + ticket ref + customer name + "Ad-hoc walk" pill
  • Problem statement line
  • Persistent action buttons: [ Escalate ] [ Resolve ✓ ]

Body:

  • Large notes editor (rich-text-lite — paragraph breaks, bullet lists, no formatting toolbar bloat)
  • Auto-save on debounce (300ms) to l1_walk_sessions.walk_notes via POST /l1/sessions/{id}/notes
  • Subtle saved-state indicator ("Saved 2s ago")
  • Optional "Add a step" button — appends a structured entry {timestamp, content} to walk_notes rather than free prose. Useful for recording sequential actions taken.

Why a separate variant rather than blank tree: the tree pane is built around the question/answer/transcript trio. Forcing an adhoc session through that frame produces a confusing UX (empty transcript pane, no current node). A dedicated note-taking surface respects the L1's actual job in this mode.

7.4.C Resolve modal (both variants)

  • "Did this resolve it?" [ Yes ] [ No ]
  • Resolution notes textarea (pre-filled with the most recent adhoc walk_notes entry if adhoc)
  • Yes + target was proposal → sets validated_by_outcome=true on the proposal
  • Yes + target was flow → no proposal change; flow's hit_count increments (telemetry only)
  • Yes + adhoc → no proposal/flow change; resolution_notes saved on session and ticket
  • No → prompt to escalate instead

7.4.D Escalate modal (both variants)

  • Reason category dropdown: Out of L1 scope · Customer demanding senior · Tree dead-ended · AI tree wrong · No KB available · Other
  • Free-text reason
  • Confirm

7.5 /l1/drafts page

Read-only list, columns: created · problem (truncated) · ticket # · status (pending review / outcome-validated / promoted / retired). Click → read-only detail view showing tree + walked path. No edit affordances.

7.6 /l1/tickets page

Full-page version of the dashboard queue widget. Filter by status, origin (PSA/internal), assigned-to-me.

7.7 Coverage banner

<L1CoverageBanner /> — slim ~32px band, info-cyan-dim background, mounted at the top of all /l1/* pages when !isL1Tech && (canCoverL1 || isOwner || isSuperAdmin):

You're covering L1. Actions logged as coverage. [Switch back →]

The "Switch back" link returns to /.

7.8 Routing

const L1Dashboard = lazyWithRetry(() => import('@/pages/l1/L1Dashboard'))
const L1WalkPage = lazyWithRetry(() => import('@/pages/l1/L1WalkPage'))
const L1DraftsPage = lazyWithRetry(() => import('@/pages/l1/L1DraftsPage'))
const L1TicketsPage = lazyWithRetry(() => import('@/pages/l1/L1TicketsPage'))

Mounted under the / ProtectedRoute branch at:

  • /l1L1Dashboard
  • /l1/walk/:sessionIdL1WalkPage
  • /l1/draftsL1DraftsPage
  • /l1/ticketsL1TicketsPage

Wrapped in L1RouteGuard (403 if not l1_tech AND not coverage-flagged). ProtectedRoute.tsx post-login redirect: L1 users land on /l1 instead of /.

lazyWithRetry, not React.lazy (per existing convention).

7.9 Session lifecycle, concurrency, and recovery

Concurrent sessions: an L1 may have multiple l1_walk_sessions rows with status='active' at the same time. The model imposes no single-session constraint — call patterns vary (one tech juggling two calls; one call drops and is resumed while another comes in; coverage engineer handling overflow). The dashboard's "Resume in progress" widget lists all active sessions ordered by last_step_at DESC.

Browser-close recovery: every POST /l1/sessions/{id}/step and adhoc POST /l1/sessions/{id}/notes writes the incremental state to the server. If the browser closes mid-walk (crash, reload, accidental tab close), revisiting /l1/walk/{sessionId} reloads the session from l1_walk_sessions — current node, walked path so far, notes, customer info — and resumes exactly where the L1 left off. No client-side persistence required.

Abandoned sessions: an APScheduler job (max_instances=1, hourly) flips sessions to status='abandoned' when status='active' AND last_step_at < now() - interval '24 hours'. Preserves the row for audit but removes it from the L1's "Resume in progress" widget. Abandoned sessions still appear in /l1/drafts filtered views if they walked a proposal.

No multi-tab guardrail in v1: if the same L1 opens the same session in two tabs, last-write-wins on walked_path. Acceptable for v1 — multi-tab is rare in helpdesk workflows. v2 could add optimistic-locking on the session row.


8. AI match-or-build pipeline

8.1 Match-or-build algorithm

match_or_build(account_id, problem_text, ticket_ref):
  embedding = embedding_service.embed(problem_text)

  # 1. Match authored flows
  flow_hits = rag_service.match_flows(account_id, embedding, k=5)
  if flow_hits and flow_hits[0].score >= MATCH_THRESHOLD:
      return {kind: 'flow', id: flow_hits[0].flow_id, score: ...}

  # 2. Match outcome-validated proposals only
  proposal_hits = rag_service.match_proposals(
      account_id, embedding, k=5,
      where=validated_by_outcome=true,
  )
  if proposal_hits and proposal_hits[0].score >= MATCH_THRESHOLD:
      return {kind: 'proposal', id: proposal_hits[0].proposal_id, score: ...}

  # 3. Near-miss zone: surface as suggestion, do NOT auto-build
  near_miss = max(
    (h for h in (flow_hits + proposal_hits) if h.score >= SUGGEST_THRESHOLD),
    key=lambda h: h.score,
    default=None,
  )

  # 4. Try to build fresh
  kb_chunks = rag_service.match_kb_chunks(account_id, embedding, k=8)
  if not kb_chunks:
      return {
          kind: 'aborted_no_kb',
          near_miss: near_miss,   # might still be useful as a starting point
      }
  nearest_flows = flow_hits[:3]
  if near_miss:
      # Frontend prompts: "Found a similar flow — use it, or build new?"
      return {kind: 'suggest', suggestion: near_miss, can_build: True}
  proposal = ai_tree_builder.build(
      problem_text, kb_chunks, nearest_flows, account_id, ticket_ref
  )
  return {kind: 'proposal', id: proposal.id, score: None}

Thresholds (per-account configurable):

  • MATCH_THRESHOLD default 0.75 (cosine) — auto-use without asking
  • SUGGEST_THRESHOLD default 0.60 (cosine) — surface as suggestion ("Found a similar flow — use it, or build new?")

Near-miss handling rationale: if a flow scores 0.74 against a 0.75 match threshold, building a fresh AI tree means a 515s wait when there's likely a directly usable flow already authored. Surfacing it as an L1 choice saves the build time and gives the L1 agency. Below SUGGEST_THRESHOLD (0.60), the match is too weak to be worth offering and we fall through to build (or abort).

The "no empty KB build" rule is enforced because an AI tree built on the model's general knowledge — without MSP-specific grounding — risks suggesting unsafe or hallucinated fixes. When this aborts, the frontend renders the BuildAbortedNoKB UX (§8.4).

8.2 AI tree-build details

Model: settings.get_model_for_action('l1_realtime_build'). Recommend Sonnet for v1 (latency-sensitive).

Schema: output validated against the existing flow node schema (matches tree_editor output). Validation failure aborts the build rather than persisting malformed data.

Prompt strategy (per Lesson on prompt anti-parrot — critical):

  • System prompt: role definition + output schema using <placeholder> notation only. Never literal field values.
  • Few-shot examples loaded as user/assistant messages from a separate file, never inline in the system prompt.
  • User message: {problem_statement} + {kb_context: [doc_title, section, content]} + {nearest_flow_summaries} + instruction to cite KB chunks per node.
  • Output includes kb_citations: [{node_id, kb_doc_id, snippet}] for walker's "Source:" pane and engineer review.

Latency: whole-tree-then-return (~515s typical). UX is a shimmer "Building from KB…" placeholder. Streaming node-by-node deferred to v2.

Anthropic SDK config (per Lesson): max_retries=1. Prompt caching enabled on the stable system+few-shot bundle (high cache hit rate expected per account).

Telemetry:

  • l1.match_or_build.duration_ms, l1.match_or_build.outcome (flow_match/proposal_match/built/aborted_no_kb)
  • anthropic.cache events (existing pattern) tagged action=l1_realtime_build
  • l1.tree_build.tokens_in, tokens_out

Anti-parrot guardrail: the existing tests/test_prompt_anti_parrot.py auto-discovers new prompt constants via pattern match on *_PROMPT / *_SCHEMA / *_PROTOCOL / *_FORMAT. No new test required.

8.3 Hallucinated-citation defense

After build, the writer verifies every kb_doc_id in kb_citations exists in the account's KB. Unverified citations are stripped from the walker's "Source:" pane (the node still renders, just without a source). Engineer review surfaces stripped citations as a warning.

8.4 BuildAbortedNoKB UX (live-call graceful degradation)

The L1 is on a phone call when this fires. A generic "error" toast is unacceptable. The frontend renders a dedicated screen instead of navigating into a walker:

┌────────────────────────────────────────────────────┐
│  No knowledge base content yet                     │
│                                                    │
│  We couldn't match an existing flow and there's    │
│  nothing in your KB to build a new one from.       │
│                                                    │
│  You have three options for this call:             │
│                                                    │
│  ┌──────────────────────────────────────────┐    │
│  │  Start an ad-hoc walk                    │ →  │   ← primary CTA
│  │  Take notes, capture the resolution      │    │
│  └──────────────────────────────────────────┘    │
│                                                    │
│  ┌──────────────────────────────────────────┐    │
│  │  Escalate to engineering                 │ →  │
│  │  Reason pre-filled: "No KB available"    │    │
│  └──────────────────────────────────────────┘    │
│                                                    │
│  [ (near_miss present?) ]                         │
│  ┌──────────────────────────────────────────┐    │
│  │  Try this similar flow instead           │ →  │
│  │  "{near_miss.title}" · {score} match     │    │
│  └──────────────────────────────────────────┘    │
│                                                    │
│  ─────────────────────────────────────────         │
│  Tip: ask your admin to upload KB content or       │
│  configure a connector under Account → KB.         │
└────────────────────────────────────────────────────┘

Each option triggers a distinct backend path:

  • Start an ad-hoc walkPOST /l1/sessions/adhoc → creates l1_walk_sessions row with session_kind='adhoc', no flow/proposal. Navigates to /l1/walk/{id} rendering the adhoc walker variant (§7.4.B).
  • EscalatePOST /l1/escalate-without-walk (a thin variant of the session-escalate endpoint that takes no session id; creates an immediately-escalated session record and reassigns the ticket). Pre-fills reason_category='No KB available'.
  • Try similar flow (only when near_miss was returned) → starts a flow session against the suggested flow, same as if matched.

This is the graceful degradation contract: no L1 should ever hit a dead end on a live call.

8.5 Near-miss "Suggest" UX

When match_or_build returns {kind: 'suggest', suggestion: ..., can_build: true}, the intake response triggers an inline prompt on the dashboard (no full-page transition):

┌────────────────────────────────────────────────────┐
│  Found a similar flow                              │
│                                                    │
│  "Outlook can't connect after password reset"      │
│  Match: 67% · last updated 2 weeks ago             │
│                                                    │
│  [ Use this flow ]  [ Build new tree ]             │
└────────────────────────────────────────────────────┘
  • Use this flow → starts a flow session against the suggestion.
  • Build new tree → re-calls match_or_build with force_build=true parameter, bypasses the suggest pass, goes directly to build.

This keeps the L1 in control while saving the 515s build time when there's an obvious starting point.


9. KB ingestion

9.1 Connector interface

class KBConnector(ABC):
    async def test_credentials(self) -> bool
    async def list_documents(self, since: datetime | None) -> AsyncIterator[KBDocRef]
    async def fetch_content(self, ref: KBDocRef) -> KBDocContent
    async def subscribe_to_changes(self) -> AsyncIterator[ChangeEvent]   # optional, no-op v1

Registry dispatches by provider string. Credentials encrypted at rest via Fernet (reuse services/psa/encryption.py pattern).

9.2 Per-connector specifics

IT Glue Hudu Microsoft Graph (SharePoint/OneDrive)
Auth API token (header) API key (header) OAuth 2.0
Ingested types Documents, KB Articles Articles docx, pdf, md, txt
Never ingested Passwords, Configurations, sensitive flex assets Passwords, sensitive items Files in folders matching (secret|confidential|private) heuristic; files with a tenant Sensitivity Label
Filtering Per-org (techs see all client orgs they have permission to) Per-folder Per-site / per-drive (owner picks at config time)
Rate limits ~100/min token bucket ~250/min token bucket Built-in Graph throttling backoff

All three deliver content to kb_ingestion_writer which:

  1. Chunks (paragraph-aware, configurable size with overlap)
  2. Embeds via embedding_service
  3. Upserts into kb_documents keyed on (connector_config_id, source_ref); chunks into kb_document_chunks

Cross-connector conflicts: same doc text appearing in two connectors yields two rows (provider-scoped source_ref). Engineers can dedup manually if needed.

9.2.1 Microsoft Graph OAuth flow (called out — non-trivial)

Unlike IT Glue and Hudu (simple API token entry), Microsoft Graph requires a full OAuth 2.0 flow. This is materially more complex and worth specifying:

Prerequisites:

  • Register a Microsoft Entra ID app for ResolutionFlow. Single-tenant or multi-tenant: multi-tenant so MSPs can authorize against their own M365 tenants.
  • Configured redirect URI: https://resolutionflow.com/api/v1/kb-connectors/microsoft_graph/oauth/callback (plus a localhost variant for dev).
  • Scopes (least privilege): Files.Read.All + Sites.Read.All + offline_access (for refresh token). User must consent at the tenant level (admin consent required if the tenant has restricted user-consent).

Flow:

  1. Owner clicks "Connect SharePoint/OneDrive" on /account/kb-connectors. Frontend calls POST /api/v1/kb-connectors with provider='microsoft_graph' and minimal body (no credentials yet) → backend returns {authorize_url} with state token (signed JWT containing account_id + nonce, ~10min TTL).
  2. Frontend opens authorize_url in a popup (preferred) or full-page redirect. User signs into Microsoft, consents.
  3. Microsoft redirects to ResolutionFlow callback /api/v1/kb-connectors/microsoft_graph/oauth/callback?code=...&state=....
  4. Backend validates state JWT (extracts account_id, verifies nonce). Exchanges code for {access_token, refresh_token, expires_in} via Microsoft token endpoint.
  5. Backend stores both tokens encrypted (Fernet) into kb_connector_configs.credentials_encrypted as a JSON blob {access_token, refresh_token, expires_at, tenant_id}. Sets display_name from the user's M365 tenant name.
  6. Backend returns {success: true} to the popup window which postMessage's the parent and closes.

Site/drive selection: After the initial OAuth, owner picks which SharePoint sites and OneDrive drives to ingest. The connector exposes a discovery endpoint that lists available sites; owner picks. Selection persists in kb_connector_configs.metadata JSONB: {site_ids: [...], drive_ids: [...]}.

Access token refresh: The connector client (services/kb_connectors/microsoft_graph.py) wraps every API call: check expires_at, if within 5min of expiry call refresh endpoint, update stored tokens. Refresh failures (refresh_token expired or revoked) flip kb_connector_configs.last_sync_status='auth_expired' and surface in the connector status UI prompting owner to re-authorize.

Scope creep risk: keep to Files.Read.All + Sites.Read.All. Do not request write scopes, mailbox scopes, or directory scopes even if convenient — read-only KB is the entire value prop.

9.2.2 KB document visibility

Clarification (was ambiguous in initial spec): /account/kb is owner + engineer accessible only. L1s do NOT see KB documents directly — they only see KB content surfaced via walker citations during a walk. This matches the principle that L1 staff are downstream consumers of the knowledge curated by their account's owner/engineers.

Frontend route: /account/kb gated by require_engineer_or_admin. L1 hitting it → redirect to /l1 with toast "KB management is owner/engineer only."

9.3 Sync scheduling

kb_ingestion_scheduler.py runs as APScheduler interval job, max_instances=1. Per cycle:

  1. Query active kb_connector_configs where last_sync_at is older than sync_interval_minutes (default 360 = 6h).
  2. Dispatch per account; concurrency cap = 4 simultaneous accounts.
  3. For each connector: list_documents(since=last_sync_at) → for each ref, fetch_content → write.
  4. Compute the diff between current refs and existing rows (same connector_config_id); soft-delete missing ones via deleted_at.
  5. Update last_sync_at, last_sync_status, last_sync_error.

Must use _admin_session_factory() not get_db() for startup-side and scheduler-side queries (per Lesson on RLS at startup — no app.current_account_id set).

Immediate sync via POST /api/v1/kb-connectors/{id}/sync enqueues a job; scheduler picks it up within ~30s.


10. Escalation flow

  1. L1 clicks Escalate → modal (reason category + optional free text).
  2. POST /api/v1/l1/sessions/{id}/escalate → backend:
    • Calls extended escalation_package_generator.generate(session_id, include_l1_walk=true). Package contents:
      problem_statement, customer_name, customer_contact,
      ticket_ref (PSA id or internal id),
      target_kind ('flow' | 'proposal'), target_id,
      walked_path,
      ai_draft_proposal_id,
      kb_citations,
      escalation_reason, reason_category, l1_user_id
      
    • Creates an ai_session with the package serialized into system context for the chat surface.
    • If PSA-backed: psa_provider.reassign_ticket(ticket_id, to=account.engineer_queue_name). Default 'Tier 2'. Owner configurable in /account/integrations.
    • If internal-backed: internal_tickets.status='escalated', assigned_user_id=null (round-robin assignment is out of scope).
    • Writes notification via existing notification_service — bell badge to all engineers in account.
    • Audit log entry; acting_as reflects whether L1 or coverage-engineer escalated.
  3. Toast on L1 side, return to /l1.
  4. Engineer clicks notification → /pilot/{sessionId} → chat surface renders the package as a sticky "Escalation context" card; engineer continues in chat.

Un-escalate is out of scope. If engineer wants to bounce back, they reassign in PSA manually.

Known limitation — escalation notification noise: "notify all engineers" is intentionally simple for v1 but does not scale. A 20-engineer account will get 20 bell badges per escalation, which trains everyone to ignore them. v2 work (§13) covers targeted notification — on-duty engineer presence, round-robin assignment, or an owner-designated escalation recipients list. Acknowledged as a real product issue, not a hidden one.


11. Internal ticket fallback

When the account has no active PSA provider:

  • Intake creates internal_tickets row instead of a PSA ticket.
  • Queue surface merges PSA + internal with Internal / PSA origin badge.
  • Escalation flips internal_tickets.status='escalated' and assigns engineer (or leaves null for any engineer to claim — v1 behavior).
  • Engineer post-escalation sees the internal ticket as a session; no PSA roundtrip.

Promote to PSA: owner-only action on any internal ticket. Pushes the ticket into the configured PSA provider, sets psa_promoted_ticket_id. Manual; not automatic on PSA-install. Lets MSPs adopt PSA mid-flight without orphaning prior internal tickets.


12. Outcome-validation lifecycle

1. L1 intake → match_or_build → FlowProposal(source='ai_realtime_l1',
                                              validated_by_outcome=false,
                                              linked_ticket_id=...)
                              → L1WalkSession(session_kind='proposal',
                                              flow_proposal_id=...,
                                              status='active')
2. L1 walks → POST /l1/sessions/{id}/step appends to l1_walk_sessions.walked_path
              (NOTE: walked_path lives on the session, not the proposal — multiple L1s
               may walk the same proposal independently)
3. L1 hits Resolve:
     modal: "Did this resolve it?" [Yes] [No] + resolution_notes
4. helpful=true → flow_proposal.validated_by_outcome = true   (set if not already)
                 → l1_walk_sessions.status = 'resolved', helpful = true
                 → ticket closed (PSA or internal)
   helpful=false → flow_proposal.validated_by_outcome unchanged
                  → l1_walk_sessions.status = 'resolved', helpful = false
                  → L1 prompted: "Escalate instead?"
5. Engineer review queue:
     ORDER BY validated_by_outcome DESC, created_at DESC
     - Outcome-validated drafts surface first
     - Review pane shows the most recent helpful=true walk's walked_path as evidence
     - Promote / edit-and-promote / retire
6. Promote → new flow with source='ai_promoted'; original proposal kept with status='promoted'
           → future match_or_build matches the new flow on the flow-match pass

Why validated_by_outcome on the proposal but walked_path on the session: validated_by_outcome is a one-bit signal that aggregates across all walks of a proposal (one L1 saying "this worked" is enough to flag the proposal as worth engineer attention). walked_path is the per-walk evidence and must be kept per-session — multiple paths through the same tree by different L1s tell different stories. Engineer review pulls the LATEST helpful=true session's path as the canonical "this is how it worked" record.


13. Out of scope (v1 non-goals)

  • End-user / self-service portal ("L0" tier).
  • Engineer warm-transfer / live take-over during a call.
  • L1 ↔ engineer real-time chat during a call.
  • Multi-language UI / customer-language toggle in walker.
  • Auto-promote internal tickets to PSA on integration install.
  • AI tree streaming (node-by-node).
  • KB write-back to IT Glue/Hudu/SharePoint (read-only ingestion).
  • Confluence connector.
  • Per-step KB citation editing in engineer review (engineers edit the tree, not citations).
  • Final Stripe pricing SKU (data model supports differential pricing; price set in Stripe dashboard).
  • "Switch to L1 mode" persistent toggle for engineers (coverage flag + banner only).
  • Cancel/un-escalate flow.
  • Round-robin engineer assignment on internal-ticket escalations.
  • Targeted escalation notification (on-duty presence, round-robin, owner-designated recipients) — v1 notifies all engineers; this will not scale past mid-size accounts. v2 work.
  • Quick-select problem shortcuts on the L1 dashboard (top-N common problems as one-click intake buttons). Worth doing in v2 once telemetry reveals which problems dominate. Reduces typing on calls.
  • Rich-text resolution notes with formatting toolbar. v1 is plain text + paragraph breaks only.
  • Multi-tab session locking — last-write-wins on concurrent same-session edits in v1.
  • Step-by-step audit log rows — v1 audits only at resolve/escalate (§5.6.1). Higher granularity is additive later.
  • Bulk KB document delete in /account/kb — per-row delete only in v1.

14. Testing strategy

14.1 Backend (pytest)

  • Unit: match_or_build covers all five paths (flow-match, proposal-match, suggest, built, aborted_no_kb). Assert thresholds work at boundaries (score = MATCH_THRESHOLD, score = SUGGEST_THRESHOLD, etc.).
  • Unit: ai_tree_builder schema validation — assert rejection of malformed Anthropic output before persistence.
  • Unit: each connector's list_documents + fetch_content against recorded HTTP fixtures.
  • Unit: Microsoft Graph OAuth flow — state JWT validation, token exchange, refresh, auth-expired surfacing.
  • Unit: seat_enforcement.check_seat_available — engineer + L1 paths, grandfathered case.
  • Integration: intake → walk(flow) → resolve(helpful=true) → assert flow's hit_count incremented, ticket closed (no proposal change).
  • Integration: intake → walk(proposal) → resolve(helpful=true) → assert FlowProposal.validated_by_outcome=true, l1_walk_sessions.helpful=true, ticket closed.
  • Integration: intake → walk → escalate → assert PSA reassign_ticket invoked, ai_session created with package, audit log entry written ONLY at escalate (not steps), notification dispatched.
  • Integration: intake on empty-KB account → assert outcome='aborted_no_kb' returned, no proposal created.
  • Integration: /l1/sessions/adhoc → walker variant flag set → resolve → ticket closed, no proposal/flow touched.
  • Integration: /l1/escalate-without-walk → escalated session row created, no walked_path, package generated.
  • Integration: KB scheduler — max_instances=1, sequential per-account, soft-delete on removal.
  • Integration: Microsoft Graph refresh-token expiry → last_sync_status='auth_expired' surfaced.
  • Integration: invite past seat limit → 402 returned; accept-invite at limit → 422; role-change at limit → blocked.
  • Integration: grandfathered over-seated account → existing users keep access, new invite blocks.
  • Integration: concurrent session creation by same L1 → both rows persist, dashboard returns both in "Resume in progress" sorted by last_step_at DESC.
  • Integration: session abandonment job — flips status='active' rows with last_step_at < now() - 24h to 'abandoned'.
  • RLS regression (highest priority): l1_tech user in account A cannot read account B's tickets, drafts, KB docs, connector configs, or walk sessions. Added to existing RLS test suite.
  • Anti-parrot: existing CI test auto-discovers new prompt module.

14.2 Frontend

  • Unit: usePermissions — L1 sees L1 paths, blocked from engineer paths. Coverage flag opens L1 paths.
  • Unit: L1WalkPage tree variant — node advance, escalate modal, resolve modal flips validated_by_outcome correctly.
  • Unit: L1WalkPage adhoc variant — notes auto-save (debounced), no node card rendered, resolve uses notes as resolution_notes pre-fill.
  • Unit: L1Dashboard empty-state — renders empty card when flows+KB are both zero; intake still works.
  • Unit: L1Dashboard resume-in-progress — lists multiple active sessions ordered by last_step_at DESC.
  • Unit: L1CoverageBanner — visible for engineer-with-flag on /l1/*, hidden for L1 users.
  • Unit: BuildAbortedNoKB screen — renders three CTAs (with/without near_miss), routes correctly to adhoc/escalate/use-suggestion.
  • Unit: SuggestPrompt component — accepts a suggestion, "Build new tree" re-calls intake with force_build=true.
  • E2E (Playwright, scoped selectors per Lesson):
    • L1 sign-in → dashboard → intake → walker → resolve → verify ticket closed + proposal flagged.
    • L1 on empty-KB account → intake → BuildAbortedNoKB screen → "Start ad-hoc walk" → adhoc walker → resolve.
    • L1 with near-miss → intake → suggest prompt → "Use this flow" → flow walker.
    • L1 browser-close mid-walk → re-open /l1/walk/{id} → state restored.
    • Engineer with can_cover_l1 → sidebar entry visible → click → coverage banner shows → walks a session → audit log records acting_as='l1_coverage'.
    • Owner invites past seat limit → blocked with upgrade prompt.
    • L1 hitting /pilot, /trees/new, /escalations, /account/kb → 403 or redirect.

15. Acceptance criteria (v1 ships when…)

  • L1 role assignable; assigned L1 sees L1 sidebar only; no engineer route reachable.
  • L1 intake creates a ticket (PSA or internal) and lands in walker session — OR renders the BuildAbortedNoKB screen when KB is empty, OR renders the suggest prompt when near-miss exists.
  • Walker handles flow walks, proposal walks, AND adhoc walks (single-pane note-taking variant). All three resolve and escalate correctly.
  • Concurrent sessions supported; browser-close mid-walk recoverable; abandoned sessions auto-flipped after 24h inactivity.
  • First-run empty-state card renders on dashboard when account has no flows AND no KB docs; intake still works (degrades to adhoc).
  • Escalate generates package, reassigns ticket, notifies engineers. Escalate from BuildAbortedNoKB pre-fills reason category.
  • Resolve flips validated_by_outcome on proposals; review queue prioritizes outcome-validated drafts and surfaces the latest helpful walk's path as evidence.
  • All three KB connectors configurable; initial sync + periodic re-sync + soft-delete on removal. Microsoft Graph OAuth flow completes end-to-end including refresh token rotation.
  • AI build refuses cleanly when account KB is empty (returns aborted_no_kb, not an exception).
  • Coverage flag works end-to-end with audit-log tagging (acting_as='l1_coverage').
  • Seat enforcement: invite blocks with structured 402/422 when target-role seats are exhausted, for BOTH L1 and engineer roles.
  • RLS blocks cross-tenant reads on every new table (l1_walk_sessions, internal_tickets, kb_connector_configs, kb_documents, kb_document_chunks).
  • L1 seat count tracked separately from engineer seats; seat counter widget visible in admin/users UI.
  • L1s cannot access /account/kb (owner+engineer only) — confirmed by route guard test.

16. Risks & mitigations

Risk Mitigation
AI builds an unsafe tree Schema validation rejects malformed output. Engineer review is the gate before draft becomes "real" flow. v1 refuses to build when KB is empty.
Hallucinated KB citations Post-build verification that each kb_doc_id exists; unverified citations stripped from walker, surfaced as warning in engineer review.
Duplicate proposals for same problem Validated-proposal match pass deduplicates after one L1 validates; pre-validation dups are tolerated and dedup'd during engineer review.
KB ingestion captures sensitive content Per-connector deny-lists (passwords, sensitive flex assets, MS Graph Sensitivity Labels). Owners exclude specific folders/sites at config. Ingested docs visible only to owners + engineers (NOT L1s) in /account/kb for manual deletion.
AI build latency frustrates customer on call Build-progress UI sets expectation. Escalate button visible from page load. Future: pre-warm builds on PSA-ticket-landed event.
Three connectors is more scope than originally proposed Acknowledged. Each connector is ~12 weeks of work; Microsoft Graph OAuth is the heaviest (§9.2.1). Plan should sequence them and allow shipping with IT Glue + Hudu first if SharePoint slips.
Engineer review queue backlog stalls library growth Validated-proposal match pass means good drafts get reused without engineer review. Backlog only delays the move from 'proposal' to 'flow', not the L1's ability to use validated content.
walked_path JSONB grows unboundedly on long calls with many notes Per-call paths are bounded by tree depth (typically <20 nodes); per-L1 notes are typically short. Real risk only emerges for adhoc walks with verbose note-taking on multi-hour calls. v1 caps walk_notes JSONB at 256 KB at the API layer with a 400 error and "notes too long — consider escalating." Future v2: normalize notes into a separate l1_walk_notes table if size becomes a real issue.
Engineer notification overload at scale Acknowledged — see §10 "Known limitation." v1 notifies all engineers; v2 work covers targeted notification. Mid-size accounts (10+ engineers) will feel this first; flag in onboarding docs.
L1 seat enforcement breaks for accounts grandfathered over their seat count §3.6 specifies non-retroactive enforcement: existing over-seated accounts get a banner but functionality is preserved until next invite. Confirm test coverage for grandfathered state.

17. Naming reference

Layer Value
DB enum (account_role) l1_tech
UI display "L1 Tech" / "L1"
Sidebar entry "L1 Workspace"
URL prefix /l1
Coverage flag column users.can_cover_l1
Coverage audit tag acting_as = 'l1_coverage'
Pricing label "L1 seat"
Stripe SKU Set in Stripe dashboard at launch — data model supports differential pricing now

18. Open implementation decisions (deferred to plan, not blocking design)

  • Specific MATCH_THRESHOLD default value validation (initial 0.75, tune from telemetry post-launch).
  • Specific Anthropic model choice for l1_realtime_build (Sonnet vs Opus — pick based on quality benchmark during plan).
  • Chunk size + overlap for KB ingestion writer (tune in implementation).
  • Engineer queue label default ('Tier 2' vs 'Engineering') — owner-configurable anyway.
  • Exact look of the build-progress shimmer animation — design-system handoff.

These are tuning/UX-polish details, not architectural forks. They land during the writing-plans phase, not here.

Note on scope and phasing

This is a substantive feature: new role, four frontend pages, ~12 endpoints, AI tree-builder, three KB connectors, escalation extensions, and six migrations. The implementation plan will almost certainly phase the work — a reasonable cut is:

  • Phase 1: role + L1 surface against existing authored flows (no AI build, no connectors yet). Validates the seat model, walker UX, escalation, internal ticket fallback, and coverage flag end-to-end.
  • Phase 2: kb_documents schema + AI tree-builder + match-or-build pipeline. Enables real-time AI flows grounded on manually-uploaded KB.
  • Phase 3: the three KB connectors (IT Glue, Hudu, SharePoint/OneDrive). Each is roughly self-contained — can ship one at a time and reorder if a connector blocks.

Phasing is a plan-level decision; the spec captures the full feature.


End of spec.