resolutionflow/docs/superpowers/specs/2026-05-28-l1-workspace-design.md
Michael Chihlas d1cf77cd41 docs(design): L1 workspace feature spec
New seat tier between engineer and viewer. Dedicated /l1 surface
(dashboard + walker + drafts) for first-call helpdesk staff. Walk-in
intake + PSA queue both produce tickets. Match-or-build pipeline
prefers authored flows, then outcome-validated AI drafts, then builds
fresh from KB. Three KB connectors: IT Glue, Hudu, SharePoint/OneDrive.
Escalation via package + PSA reassign, picked up in chat. Engineer
coverage via per-user can_cover_l1 flag with audit-log tagging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:33:32 -04:00


L1 Workspace — Design Spec

Date: 2026-05-28
Status: Draft (pending implementation plan)
Audience for this doc: engineers + reviewers building the L1 workspace feature


1. Summary

Introduce a dedicated L1 helpdesk workspace as a new seat tier in ResolutionFlow. L1 techs walk customers through yes/no decision trees on inbound tickets and phone calls. The platform either matches an existing authored flow, reuses an outcome-validated AI draft, or builds a fresh decision tree in real time from the MSP's ingested knowledge base. Drafts that resolve a call become "outcome-validated" and surface first in the engineer review queue for promotion to authored flows. KB ingestion supports manual upload plus three MSP-native connectors: IT Glue, Hudu, and Microsoft SharePoint/OneDrive.

This re-introduces the original deterministic tree-walker UX — which had been deprecated in favor of chat-primary FlowPilot — and repositions it as a frontline-tier product surface distinct from the engineer chat surface.


2. Motivation

The current ResolutionFlow product funnels every user — regardless of skill tier — into a single chat-primary surface (AssistantChatPage mounted at /pilot). The chat is excellent for engineers but is the wrong primitive for L1 helpdesk staff who:

  • Take inbound phone calls and need a fast, deterministic click-through UX
  • Resolve simple, recurring problems (password resets, mailbox connection issues, VPN disconnects, printer queue clears, etc.)
  • Are not authorized to resolve complex issues themselves; they hand off to engineers

A tree-walker UX serves this audience natively. The substrate already exists in the codebase — decision-tree data model, authoring tools, RAG, KB Accelerator, escalation packaging — but no first-class L1 surface ties it together. This spec defines that surface and the supporting AI/KB pipeline.


3. Users & roles

3.1 Role hierarchy

super_admin > owner > engineer > l1_tech > viewer

l1_tech is added to the account_role enum. Permissions enforced via app/core/permissions.py and app/api/deps.py.

3.2 What L1 can do

  • Use the /l1/* surface
  • Open tickets from their queue (PSA-fed or internal)
  • Intake walk-in/phone-call problems (creates a ticket as a side effect)
  • Walk authored flows and AI-built FlowProposal drafts
  • Resolve or escalate a session
  • View their own AI drafts list (read-only — outcome tags shown)

3.3 What L1 cannot do

  • See the chat surface (/pilot) — sidebar hidden, route 403s
  • Author or edit flows
  • See /review-queue or /escalations (engineer inboxes)
  • See team analytics (only /analytics/me)
  • Promote AI drafts (engineers/owners only, via existing review queue)
  • Configure KB connectors (owner-only)

3.4 Engineer L1 coverage

Engineers do NOT see the L1 surface by default. Owners can toggle users.can_cover_l1 = true on individual engineer users. Engineers with that flag (and all owners/super_admins) see an "L1 Workspace" entry in their sidebar. Clicking it puts them in /l1/* with a sticky banner: "Covering L1 — actions logged as coverage." Coverage actions are audit-logged with acting_as = 'l1_coverage'.

Backend dep: require_l1_or_coverage = l1_tech | (engineer AND can_cover_l1) | owner | super_admin.

This mirrors the existing orthogonal-flag pattern (is_team_admin) — no new architectural concept.
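
The access rule above can be sketched as a plain predicate. This is a minimal illustration, not the actual dependency in app/api/deps.py: the `User` dataclass and the function names `may_use_l1_surface` and `acting_as` are hypothetical. Tagging owners and super_admins as coverage is an assumption drawn from the banner rule in §7.7.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    """Hypothetical stand-in for the real user model."""
    role: str                  # one of the account_role values
    can_cover_l1: bool = False


def may_use_l1_surface(user: User) -> bool:
    """l1_tech | (engineer AND can_cover_l1) | owner | super_admin."""
    if user.role in ("l1_tech", "owner", "super_admin"):
        return True
    return user.role == "engineer" and user.can_cover_l1


def acting_as(user: User) -> Optional[str]:
    """Audit tag for an action taken on the L1 surface.

    Assumption: every non-L1 user who is allowed onto the surface is
    treated as coverage, matching the banner condition in section 7.7.
    """
    if user.role != "l1_tech" and may_use_l1_surface(user):
        return "l1_coverage"
    return None
```

Note that the flag alone grants nothing: a viewer with can_cover_l1 set is still rejected, because the flag is only consulted for the engineer role.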

3.5 Billing data model

  • accounts.l1_seats_purchased INTEGER NOT NULL DEFAULT 0 (new column)
  • Existing accounts.seats_purchased continues to represent engineer seats
  • New Stripe SKU placeholder for L1 seat; actual pricing set in Stripe dashboard out-of-band

4. Architecture overview

4.1 New components

Frontend:

  • pages/l1/L1Dashboard.tsx — landing page; ticket queue + describe-the-problem intake
  • pages/l1/L1WalkPage.tsx — purpose-built walker; yes/no cards, transcript, persistent escalate/resolve
  • pages/l1/L1DraftsPage.tsx — read-only list of the L1's AI drafts and promotion status
  • pages/l1/L1TicketsPage.tsx — full-page queue (PSA + internal merged)
  • components/l1/L1CoverageBanner.tsx — slim banner shown to engineer-coverers

Backend:

  • services/match_or_build.py — orchestrator (RAG match → fallback to AI build)
  • services/ai_tree_builder.py — real-time AI tree generation via Anthropic
  • services/kb_connectors/ package — base, registry, encryption, plus itglue.py, hudu.py, microsoft_graph.py
  • services/kb_ingestion_writer.py — shared writer used by manual upload + all connectors
  • services/kb_ingestion_scheduler.py — APScheduler job, max_instances=1, per-connector sync
  • services/internal_ticket_service.py — CRUD + status transitions for the no-PSA fallback
  • services/l1_session_service.py — walking-session lifecycle
  • api/endpoints/l1.py — L1-role endpoints
  • api/endpoints/kb_connectors.py — KB connector config endpoints (owner-only for write)

Reused / extended:

  • services/rag_service.py — flow & KB matching (existing)
  • services/flow_matching_engine.py — existing
  • services/escalation_package_generator.py — extended to include walked path, AI draft pointer, KB citations
  • models/FlowProposal — new columns (see §5)
  • services/psa/ — already supports ticket create + reassign across CW/Autotask/HaloPSA
  • services/embedding_service.py — used by KB ingestion writer
  • New kb_documents + kb_document_chunks tables for RAG-retrievable document storage, separate from the existing kb_imports (which is a document→tree conversion record, not a persistent KB store — see §5)
  • Audit log writer — gains acting_as field

4.2 Data flow — walk-in / phone-call intake

L1 types: "User can't connect Outlook after password reset"
  POST /api/v1/l1/intake
    body: { problem_statement, customer_name?, customer_contact? }
    → create ticket
        - PSA if configured: psa_provider.create_ticket(...)
        - else: internal_tickets row
    → match_or_build(account_id, problem_text, ticket_ref)
        → rag_service.match_flows(...) → top hit; if score ≥ threshold return as 'flow'
        → rag_service.match_proposals(... where validated_by_outcome=true)
                                           → top hit; if score ≥ threshold return as 'proposal'
        → ai_tree_builder.build(problem_text, kb_chunks, nearest_flows)
                                           → persist FlowProposal(source='ai_realtime_l1',
                                                                  linked_ticket_id,
                                                                  linked_ticket_kind,
                                                                  validated_by_outcome=false)
                                           → return as 'proposal'
    → l1_session_service.start(...)
    → return { session_id, target_kind, target_id, intake_type }
  → navigate to /l1/walk/{session_id}

4.3 Data flow — PSA-queue intake

The L1 dashboard polls the L1's PSA queue plus their internal tickets. Clicking a ticket row calls POST /api/v1/l1/tickets/{ticket_ref}/start which is the same match_or_build path (the problem_statement is the ticket subject + description) followed by walker navigation.


5. Data model

All new tenant-isolated tables get RLS policies (account-scoped, WITH CHECK). All TIMESTAMPs are TIMESTAMPTZ. No --rev-id on Alembic; no --autogenerate for enum/RLS work.

5.1 FlowProposal — extended

Existing AI-draft model. Add columns:

| Column | Type | Notes |
| --- | --- | --- |
| source | VARCHAR(30) NOT NULL | 'ai_realtime_l1' \| 'kb_accelerator' \| 'manual_draft'. Backfill existing rows to 'manual_draft'. |
| linked_ticket_id | VARCHAR(64) NULL | PSA id or internal_tickets UUID (stored as text) |
| linked_ticket_kind | VARCHAR(10) NULL | 'psa' \| 'internal' |
| validated_by_outcome | BOOLEAN NOT NULL DEFAULT FALSE | Flipped to true when L1 resolves and marks helpful=true |
| walked_path_snapshot | JSONB NULL | Frozen at resolve/escalate; shape [{node_id, question, answer, l1_note}] |

Engineer review queue sort:

ORDER BY validated_by_outcome DESC, created_at DESC
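
The same ordering can be expressed in application code, for example when sorting an already-fetched page of drafts in a test. This is a sketch only; the `Draft` dataclass and `review_queue_order` are hypothetical names, and the real queue presumably sorts in SQL as shown above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Draft:
    """Hypothetical projection of a FlowProposal row."""
    id: str
    validated_by_outcome: bool
    created_at: datetime


def review_queue_order(drafts: List[Draft]) -> List[Draft]:
    # Both keys are descending. True > False in Python, so reverse=True
    # puts outcome-validated drafts first and newest first within a group.
    return sorted(
        drafts,
        key=lambda d: (d.validated_by_outcome, d.created_at),
        reverse=True,
    )


def _day(n: int) -> datetime:
    return datetime(2026, 5, n, tzinfo=timezone.utc)


example = [
    Draft("a", False, _day(3)),
    Draft("b", True, _day(1)),
    Draft("c", True, _day(2)),
]
ordered_ids = [d.id for d in review_queue_order(example)]
```

Here the unvalidated draft "a" sorts last even though it is the newest, which is the intended effect of putting validated_by_outcome ahead of created_at.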

5.2 internal_tickets — new

id                        UUID PRIMARY KEY
account_id                UUID NOT NULL  (RLS-scoped)
created_by_user_id        UUID NOT NULL  (the L1 who took the call)
customer_name             VARCHAR(120)
customer_contact          VARCHAR(200) NULL    (email or phone, free text)
problem_statement         TEXT NOT NULL
status                    VARCHAR(30) NOT NULL  -- 'open' | 'walking' | 'resolved' | 'escalated'
flow_id                   UUID NULL FK trees
flow_proposal_id          UUID NULL FK flow_proposals
ai_session_id             UUID NULL FK ai_sessions (set when engineer picks up in chat post-escalation)
assigned_user_id          UUID NULL    (engineer post-escalation)
resolution_notes          TEXT NULL
psa_promoted_ticket_id    VARCHAR(64) NULL   (set if later promoted to PSA)
created_at                TIMESTAMPTZ NOT NULL
updated_at                TIMESTAMPTZ NOT NULL
resolved_at               TIMESTAMPTZ NULL

RLS: account-scoped, WITH CHECK on insert/update.
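
The status column implies a small state machine. The spec names the four statuses but does not define the transition graph, so the `ALLOWED` map below is an assumption for illustration (open → walking → resolved or escalated, with both end states terminal). `InvalidTransition` and `transition` are hypothetical names.

```python
# Assumed transition graph; the spec lists the statuses only.
ALLOWED = {
    "open": {"walking"},
    "walking": {"resolved", "escalated"},
    "resolved": set(),
    "escalated": set(),
}


class InvalidTransition(ValueError):
    """Raised when a status change is not in the allowed graph."""


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new
```

Keeping the graph in one table makes it easy to extend later, for instance if engineers need an escalated → resolved move.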

5.3 kb_connector_configs — new

id                        UUID PRIMARY KEY
account_id                UUID NOT NULL  (RLS-scoped)
provider                  VARCHAR(20) NOT NULL  -- 'itglue' | 'hudu' | 'microsoft_graph'
display_name              VARCHAR(80) NOT NULL
credentials_encrypted     BYTEA NOT NULL        -- Fernet, same pattern as services/psa/encryption.py
is_active                 BOOLEAN NOT NULL DEFAULT TRUE
sync_interval_minutes     INTEGER NOT NULL DEFAULT 360
last_sync_at              TIMESTAMPTZ NULL
last_sync_status          VARCHAR(20) NULL      -- 'success' | 'error' | 'running'
last_sync_error           TEXT NULL
created_by_user_id        UUID NOT NULL
created_at                TIMESTAMPTZ NOT NULL
updated_at                TIMESTAMPTZ NOT NULL
UNIQUE (account_id, provider, display_name)

RLS: account-scoped, WITH CHECK.

5.4 New tables: kb_documents + kb_document_chunks

The existing kb_imports table is a document→tree conversion record (status lifecycle processing | ready | committed | failed, target tree_id) — designed to turn one document into one authored flow. It is NOT a persistent KB document store and does not power RAG retrieval.

The L1 feature needs a separate pair of tables that store ingested docs in RAG-retrievable form:

kb_documents — one row per ingested document:

id                        UUID PRIMARY KEY
account_id                UUID NOT NULL  (RLS-scoped)
source_kind               VARCHAR(20) NOT NULL  -- 'upload' | 'paste' | 'itglue' | 'hudu' | 'microsoft_graph'
source_ref                VARCHAR(200) NULL     -- provider-side document ID for re-sync
connector_config_id       UUID NULL FK kb_connector_configs
title                     VARCHAR(500) NOT NULL
content                   TEXT NOT NULL          -- full post-extraction text
content_hash              VARCHAR(64) NOT NULL   -- sha256 for change-detection
metadata                  JSONB NULL             -- provider-specific (org_id, drive_id, etc.)
last_synced_at            TIMESTAMPTZ NULL
deleted_at                TIMESTAMPTZ NULL       -- soft-delete on connector removal
created_at                TIMESTAMPTZ NOT NULL
updated_at                TIMESTAMPTZ NOT NULL

Unique partial index: (connector_config_id, source_ref) WHERE source_ref IS NOT NULL.
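
The content_hash column lets a re-sync skip documents whose text has not changed. A minimal sketch of that check, using only the standard library; the function names are hypothetical. A SHA-256 hex digest is 64 characters, which is why VARCHAR(64) fits.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the post-extraction text (64 hex chars)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembedding(stored_hash: Optional[str], new_text: str) -> bool:
    """True when the document is new or its text changed since last sync."""
    return stored_hash != content_hash(new_text)
```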

kb_document_chunks — chunks with embeddings, used by rag_service.match_kb_chunks:

id                        UUID PRIMARY KEY
document_id               UUID NOT NULL FK kb_documents ON DELETE CASCADE
account_id                UUID NOT NULL  -- denormalized for RLS
chunk_index               INTEGER NOT NULL
content                   TEXT NOT NULL
embedding                 VECTOR(<dim>) NOT NULL  -- dim matches embedding_service
metadata                  JSONB NULL              -- section title, page number, etc.
created_at                TIMESTAMPTZ NOT NULL
UNIQUE (document_id, chunk_index)

Pgvector index (ivfflat or hnsw) on embedding; choice tuned during implementation.

RLS on both tables: account-scoped, WITH CHECK on insert.

Coexistence with kb_imports: when an L1 (or owner) uploads a doc, the system can populate both — the existing KBImport pipeline produces a draft tree, and the new ingestion writer additionally chunks+embeds the doc into kb_documents for RAG. Both paths share the upload endpoint but write to independent tables. Connectors only write to kb_documents (no auto-tree-conversion from synced docs in v1).

5.5 Other column additions

  • users.can_cover_l1 BOOLEAN NOT NULL DEFAULT FALSE
  • accounts.l1_seats_purchased INTEGER NOT NULL DEFAULT 0
  • audit_logs.acting_as VARCHAR(30) NULL: 'l1_coverage' when engineer is in coverage mode; null otherwise
  • account_role enum: add 'l1_tech'

5.6 Migration ordering

Six manual Alembic revisions (no --rev-id, no --autogenerate):

  1. Add 'l1_tech' to account_role enum.
  2. Add users.can_cover_l1, accounts.l1_seats_purchased, audit_logs.acting_as.
  3. Extend flow_proposals with new columns + backfill existing rows to source='manual_draft'.
  4. Create internal_tickets + RLS policies (account-scoped, WITH CHECK).
  5. Create kb_connector_configs + RLS policies.
  6. Create kb_documents + kb_document_chunks tables + RLS policies + pgvector index on chunks.

Per Lesson on tenant-isolated tables: any service-construction site that creates rows on these tables must pass account_id= explicitly. Grep all Model( sites before merge.


6. Backend services & endpoints

6.1 New services

| Module | Purpose |
| --- | --- |
| services/match_or_build.py | Orchestrator. Single async entrypoint match_or_build(account_id, problem_text, ticket_ref) -> MatchOrBuildResult. |
| services/ai_tree_builder.py | Real-time AI tree generation. Anthropic via existing _call_anthropic_cached pattern. Model tier via settings.get_model_for_action('l1_realtime_build'). Output validated against the flow node schema with Pydantic; rejects malformed output. |
| services/kb_connectors/base.py | Abstract KBConnector with test_credentials, list_documents, fetch_content, subscribe_to_changes (optional). |
| services/kb_connectors/itglue.py | IT Glue REST client. |
| services/kb_connectors/hudu.py | Hudu REST client. |
| services/kb_connectors/microsoft_graph.py | Microsoft Graph (SharePoint/OneDrive) client. |
| services/kb_connectors/registry.py | KBConnectorRegistry (mirrors PsaProviderRegistry). |
| services/kb_connectors/encryption.py | Fernet wrapper (or reuse the PSA one if generic). |
| services/kb_ingestion_writer.py | Shared writer: chunk → embed → upsert. Used by manual upload AND connector sync. |
| services/kb_ingestion_scheduler.py | APScheduler interval job, max_instances=1. Sequential per account; concurrency cap = 4 accounts simultaneously. |
| services/internal_ticket_service.py | CRUD + status transitions for internal_tickets. |
| services/l1_session_service.py | Walking-session lifecycle: start, step, resolve, escalate. Bridges ai_sessions and the walked target. |

6.2 Extended services

  • services/escalation_package_generator.py — adds inputs: walked_path, ai_draft_proposal_id, kb_citations. New caller path from l1_session_service.escalate(...).
  • KB Accelerator endpoint — accepts ingested content via the shared kb_ingestion_writer. Manual upload and connector sync share the same persistence path.

6.3 New endpoints

All under require_l1_or_coverage unless noted. Mounted under /api/v1/l1.

| Method | Path | Purpose | Auth |
| --- | --- | --- | --- |
| GET | /l1/queue | Merged ticket queue (PSA + internal). Pagination + status filter. | require_l1_or_coverage |
| POST | /l1/intake | Walk-in intake. Body {problem_statement, customer_name?, customer_contact?}. Creates ticket, returns {session_id, target_kind, target_id, intake_type}. | require_l1_or_coverage |
| POST | /l1/tickets/{ticket_ref}/start | Start walker from an existing ticket. Internally same as intake but skips ticket creation. | require_l1_or_coverage |
| POST | /l1/sessions/{id}/step | Record an answer. Body {node_id, answer, note?}. Appends to walked_path_snapshot. | require_l1_or_coverage |
| POST | /l1/sessions/{id}/resolve | Close as resolved. Body {resolution_notes, helpful: bool}. Sets validated_by_outcome=true on the proposal when helpful=true AND target was a proposal. Closes the ticket. | require_l1_or_coverage |
| POST | /l1/sessions/{id}/escalate | Generate escalation package + reassign ticket. Body {reason, reason_category}. | require_l1_or_coverage |
| GET | /l1/drafts | List current user's AI drafts with promotion status. | require_l1_or_coverage |

KB connector endpoints (/api/v1/kb-connectors):

| Method | Path | Purpose | Auth |
| --- | --- | --- | --- |
| GET | /kb-connectors | List configured connectors for account. | require_l1_or_above |
| POST | /kb-connectors | Create. OAuth handoff for Microsoft Graph; API token entry for IT Glue/Hudu. | require_account_owner |
| DELETE | /kb-connectors/{id} | Remove (soft-disable). | require_account_owner |
| POST | /kb-connectors/{id}/sync | Trigger immediate sync (enqueued). | require_account_owner |
| GET | /kb-connectors/{id}/status | Sync status + doc count + last error. | require_l1_or_above |

Internal ticket endpoints (/api/v1/internal-tickets):

| Method | Path | Purpose | Auth |
| --- | --- | --- | --- |
| GET | /internal-tickets | List (account-scoped). | require_l1_or_coverage |
| GET | /internal-tickets/{id} | Detail. | require_l1_or_coverage |
| POST | /internal-tickets/{id}/promote-to-psa | Push to configured PSA, set psa_promoted_ticket_id. | require_account_owner |

User management addition:

| Method | Path | Purpose | Auth |
| --- | --- | --- | --- |
| PATCH | /users/{id}/coverage | Set can_cover_l1 flag. Body {can_cover_l1: bool}. | require_account_owner |

7. Frontend surface

7.1 Sidebar — L1 view

LOGO
─────────────
Workspace      /l1
Tickets        /l1/tickets
My Drafts      /l1/drafts
─────────────
Guides         /guides
Account        /account     (filtered — no integrations, no categories)

No /pilot, no /trees, no /flows, no /review-queue, no /escalations, no team analytics. Sidebar.tsx picks the nav array by role.

7.2 Sidebar — engineer coverage view

Engineer's existing sidebar plus a single appended entry "L1 Workspace" → /l1. Shown when canCoverL1 || isOwner || isSuperAdmin.

7.3 /l1 dashboard layout

Up to four vertical zones (the fourth appears only when applicable), single column, max width ~1100px:

  1. Greeting — uppercase tracking date label + Bricolage 700 hero ("Good morning, {firstName}.")
  2. Describe the problem card — large textarea (autofocus on load), optional customer_name + customer_contact fields, single primary CTA "Start walk →" (the only electric-blue element on the page)
  3. Open tickets — section label, count, table rows (merged PSA + internal with origin badges), row hover bg-elevated
  4. Resume in progress — shown only when L1 has a half-walked session

Tailwind v4 tokens: bg-page base, bg-card zones, bg-elevated row hover, electric-blue accent only on primary CTA. No text-secondary. All borders border-default.

7.4 /l1/walk/{sessionId} walker

Sticky header + two-pane body, full-height (flex chain per Lesson — every ancestor needs flex + flex-1 + min-h-0).

Header:

  • Back arrow + ticket ref + customer name + AI-built badge (when target is proposal)
  • Problem statement line
  • Persistent action buttons: [ Escalate ] [ Resolve ✓ ]

Left pane (main):

  • "Step N · estimated M" label
  • Current node card — large yes/no/answer buttons (min 44px tap target)
  • Optional note textarea below the card (appended to walked_path_snapshot)
  • On a fresh proposal that's still building: shimmer placeholder + "Building from KB… ~10s"

Right pane (transcript):

  • Walked-so-far list (node title + answer chosen)
  • Current step highlight
  • "Source:" section listing KB citations for the current node (proposal walks only)

Resolve modal:

  • "Did this resolve it?" [ Yes ] [ No ]
  • Resolution notes textarea
  • Yes + target was proposal → sets validated_by_outcome=true
  • No → prompt to escalate instead

Escalate modal:

  • Reason category dropdown: Out of L1 scope · Customer demanding senior · Tree dead-ended · AI tree wrong · Other
  • Free-text reason
  • Confirm

7.5 /l1/drafts page

Read-only list, columns: created · problem (truncated) · ticket # · status (pending review / outcome-validated / promoted / retired). Click → read-only detail view showing tree + walked path. No edit affordances.

7.6 /l1/tickets page

Full-page version of the dashboard queue widget. Filter by status, origin (PSA/internal), assigned-to-me.

7.7 Coverage banner

<L1CoverageBanner /> — slim ~32px band, info-cyan-dim background, mounted at the top of all /l1/* pages when !isL1Tech && (canCoverL1 || isOwner || isSuperAdmin):

You're covering L1. Actions logged as coverage. [Switch back →]

The "Switch back" link returns to /.

7.8 Routing

const L1Dashboard = lazyWithRetry(() => import('@/pages/l1/L1Dashboard'))
const L1WalkPage = lazyWithRetry(() => import('@/pages/l1/L1WalkPage'))
const L1DraftsPage = lazyWithRetry(() => import('@/pages/l1/L1DraftsPage'))
const L1TicketsPage = lazyWithRetry(() => import('@/pages/l1/L1TicketsPage'))

Mounted under the / ProtectedRoute branch at:

  • /l1 → L1Dashboard
  • /l1/walk/:sessionId → L1WalkPage
  • /l1/drafts → L1DraftsPage
  • /l1/tickets → L1TicketsPage

Wrapped in L1RouteGuard (403 if not l1_tech AND not coverage-flagged). ProtectedRoute.tsx post-login redirect: L1 users land on /l1 instead of /.

lazyWithRetry, not React.lazy (per existing convention).


8. AI match-or-build pipeline

8.1 Match-or-build algorithm

# Shown synchronously for brevity; §6.1 specifies an async entrypoint.
def match_or_build(account_id, problem_text, ticket_ref):
    embedding = embedding_service.embed(problem_text)

    # 1. Match authored flows
    flow_hits = rag_service.match_flows(account_id, embedding, k=5)
    if flow_hits and flow_hits[0].score >= MATCH_THRESHOLD:
        return {"kind": "flow", "id": flow_hits[0].flow_id,
                "score": flow_hits[0].score}

    # 2. Match outcome-validated proposals only
    #    (the shape of the `where` filter is illustrative)
    proposal_hits = rag_service.match_proposals(
        account_id, embedding, k=5,
        where={"validated_by_outcome": True},
    )
    if proposal_hits and proposal_hits[0].score >= MATCH_THRESHOLD:
        return {"kind": "proposal", "id": proposal_hits[0].proposal_id,
                "score": proposal_hits[0].score}

    # 3. Build fresh; refuse when there is no KB content to ground the tree
    kb_chunks = rag_service.match_kb_chunks(account_id, embedding, k=8)
    if not kb_chunks:
        raise BuildAbortedNoKB(
            "Cannot build a tree with no KB content. "
            "Upload docs or wait for a connector sync."
        )
    nearest_flows = flow_hits[:3]  # context for the builder, even if below threshold
    proposal = ai_tree_builder.build(
        problem_text, kb_chunks, nearest_flows, account_id, ticket_ref
    )
    return {"kind": "proposal", "id": proposal.id, "score": None}

MATCH_THRESHOLD — per-account configurable; default 0.75 (cosine).
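
For intuition on the threshold, here is a small pure-Python sketch of cosine similarity and the comparison against the default. It assumes the score is a similarity where higher is better; if the vector store reports a cosine distance instead, the comparison would need to be inverted. `cosine_similarity` and `is_match` are hypothetical helper names.

```python
import math
from typing import Sequence

MATCH_THRESHOLD = 0.75  # default from the spec; configurable per account


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_match(score: float, threshold: float = MATCH_THRESHOLD) -> bool:
    """A hit counts as a match when its score reaches the threshold."""
    return score >= threshold
```

Two vectors at 45 degrees score about 0.707, which falls just under the default, so only fairly close embeddings short-circuit the build step.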

The "no empty KB build" rule is enforced because an AI tree built on the model's general knowledge — without MSP-specific grounding — risks suggesting unsafe or hallucinated fixes.

8.2 AI tree-build details

Model: settings.get_model_for_action('l1_realtime_build'). Recommend Sonnet for v1 (latency-sensitive).

Schema: output validated against the existing flow node schema (matches tree_editor output). Validation failure aborts the build rather than persisting malformed data.

Prompt strategy (per Lesson on prompt anti-parrot — critical):

  • System prompt: role definition + output schema using <placeholder> notation only. Never literal field values.
  • Few-shot examples loaded as user/assistant messages from a separate file, never inline in the system prompt.
  • User message: {problem_statement} + {kb_context: [doc_title, section, content]} + {nearest_flow_summaries} + instruction to cite KB chunks per node.
  • Output includes kb_citations: [{node_id, kb_doc_id, snippet}] for walker's "Source:" pane and engineer review.

Latency: whole-tree-then-return (~5-15s typical). UX is a shimmer "Building from KB…" placeholder. Streaming node-by-node deferred to v2.

Anthropic SDK config (per Lesson): max_retries=1. Prompt caching enabled on the stable system+few-shot bundle (high cache hit rate expected per account).

Telemetry:

  • l1.match_or_build.duration_ms, l1.match_or_build.outcome (flow_match/proposal_match/built/aborted_no_kb)
  • anthropic.cache events (existing pattern) tagged action=l1_realtime_build
  • l1.tree_build.tokens_in, tokens_out

Anti-parrot guardrail: the existing tests/test_prompt_anti_parrot.py auto-discovers new prompt constants via pattern match on *_PROMPT / *_SCHEMA / *_PROTOCOL / *_FORMAT. No new test required.

8.3 Hallucinated-citation defense

After build, the writer verifies every kb_doc_id in kb_citations exists in the account's KB. Unverified citations are stripped from the walker's "Source:" pane (the node still renders, just without a source). Engineer review surfaces stripped citations as a warning.
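
The verification step can be sketched as a simple partition of the citation list. This assumes citations arrive as dicts shaped like the kb_citations entries in §8.2 and that the set of valid document IDs for the account has already been loaded; `split_citations` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Set, Tuple

Citation = Dict[str, str]  # {node_id, kb_doc_id, snippet}


def split_citations(
    citations: Iterable[Citation], known_doc_ids: Set[str]
) -> Tuple[List[Citation], List[Citation]]:
    """Partition citations into (verified, stripped).

    A citation is verified only if its kb_doc_id exists in the
    account's KB; everything else is withheld from the walker and
    kept aside for the engineer-review warning.
    """
    verified: List[Citation] = []
    stripped: List[Citation] = []
    for citation in citations:
        if citation.get("kb_doc_id") in known_doc_ids:
            verified.append(citation)
        else:
            stripped.append(citation)
    return verified, stripped
```

Returning the stripped list rather than discarding it keeps the data needed for the review-queue warning.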


9. KB ingestion

9.1 Connector interface

class KBConnector(ABC):
    async def test_credentials(self) -> bool
    async def list_documents(self, since: datetime | None) -> AsyncIterator[KBDocRef]
    async def fetch_content(self, ref: KBDocRef) -> KBDocContent
    async def subscribe_to_changes(self) -> AsyncIterator[ChangeEvent]   # optional, no-op v1

Registry dispatches by provider string. Credentials encrypted at rest via Fernet (reuse services/psa/encryption.py pattern).
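
As an illustration of dispatch by provider string, the sketch below pairs a trimmed-down version of the interface with an in-memory test double. It is not one of the real provider clients: `InMemoryConnector`, the `register`/`create` methods, and the `"memory"` provider key are hypothetical, and only two of the four interface methods are shown.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncIterator, Callable, Dict, List


@dataclass(frozen=True)
class KBDocRef:
    source_ref: str  # provider-side document ID
    title: str


class KBConnector(ABC):
    @abstractmethod
    async def test_credentials(self) -> bool: ...

    @abstractmethod
    def list_documents(self, since=None) -> AsyncIterator[KBDocRef]: ...


class InMemoryConnector(KBConnector):
    """Hypothetical test double that serves a fixed list of refs."""

    def __init__(self, docs: List[KBDocRef]):
        self._docs = docs

    async def test_credentials(self) -> bool:
        return True

    async def list_documents(self, since=None):
        for ref in self._docs:  # async generator: yields one ref at a time
            yield ref


class KBConnectorRegistry:
    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., KBConnector]] = {}

    def register(self, provider: str, factory: Callable[..., KBConnector]) -> None:
        self._factories[provider] = factory

    def create(self, provider: str, **kwargs) -> KBConnector:
        if provider not in self._factories:
            raise ValueError(f"unknown KB provider: {provider}")
        return self._factories[provider](**kwargs)


async def collect_titles(connector: KBConnector) -> List[str]:
    return [ref.title async for ref in connector.list_documents()]
```

A test double like this lets the ingestion writer and scheduler be exercised without recorded HTTP fixtures.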

9.2 Per-connector specifics

| | IT Glue | Hudu | Microsoft Graph (SharePoint/OneDrive) |
| --- | --- | --- | --- |
| Auth | API token (header) | API key (header) | OAuth 2.0 |
| Ingested types | Documents, KB Articles | Articles | docx, pdf, md, txt |
| Never ingested | Passwords, Configurations, sensitive flex assets | Passwords, sensitive items | Files in folders matching (secret\|confidential\|private) heuristic; files with a tenant Sensitivity Label |
| Filtering | Per-org (techs see all client orgs they have permission to) | Per-folder | Per-site / per-drive (owner picks at config time) |
| Rate limits | ~100/min token bucket | ~250/min token bucket | Built-in Graph throttling backoff |
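
The token-bucket limits in the table can be sketched as follows. This is a generic token bucket, not the connectors' actual limiter: the class name, the `try_acquire` method, and the injectable `clock` parameter (added so the behaviour can be tested without sleeping) are all assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills continuously at rate_per_minute, up to capacity tokens."""

    def __init__(
        self,
        rate_per_minute: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; never blocks."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller that gets False would typically wait briefly and retry; for IT Glue the bucket would be created with rate_per_minute=100.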

All three deliver content to kb_ingestion_writer which:

  1. Chunks (paragraph-aware, configurable size with overlap)
  2. Embeds via embedding_service
  3. Upserts into kb_documents keyed on (connector_config_id, source_ref); chunks into kb_document_chunks

Cross-connector conflicts: same doc text appearing in two connectors yields two rows (provider-scoped source_ref). Engineers can dedup manually if needed.
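
Step 1 of the writer (paragraph-aware chunking with overlap) might look like the sketch below. It is one possible reading, not the writer's actual algorithm: it treats blank lines as paragraph breaks, uses a character budget rather than tokens, and carries whole paragraphs forward as overlap. The function name and parameters are hypothetical.

```python
from typing import List


def chunk_paragraphs(
    text: str, max_chars: int = 1200, overlap_paragraphs: int = 1
) -> List[str]:
    """Group paragraphs into chunks of roughly max_chars.

    max_chars is a soft limit: a single long paragraph, or overlap
    plus the next paragraph, can still exceed it.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps each step of a procedure intact inside a chunk, which should help the per-node citations stay meaningful.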

9.3 Sync scheduling

kb_ingestion_scheduler.py runs as APScheduler interval job, max_instances=1. Per cycle:

  1. Query active kb_connector_configs where last_sync_at is older than sync_interval_minutes (default 360 = 6h).
  2. Dispatch per account; concurrency cap = 4 simultaneous accounts.
  3. For each connector: list_documents(since=last_sync_at) → for each ref, fetch_content → write.
  4. Compute the diff between current refs and existing rows (same connector_config_id); soft-delete missing ones via deleted_at.
  5. Update last_sync_at, last_sync_status, last_sync_error.
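
Steps 1 and 4 of the cycle reduce to two small pure functions, sketched here with hypothetical names. One assumption is worth flagging: the soft-delete diff needs the provider's complete current listing, whereas a listing filtered by `since` would contain only changed documents and would make every unchanged document look missing.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Set


def is_due(
    last_sync_at: Optional[datetime], interval_minutes: int, now: datetime
) -> bool:
    """A connector is due if it never synced or its interval has elapsed."""
    if last_sync_at is None:
        return True
    return now - last_sync_at >= timedelta(minutes=interval_minutes)


def refs_to_soft_delete(
    existing_refs: Iterable[str], current_refs: Iterable[str]
) -> Set[str]:
    """source_ref values stored for this connector but gone upstream.

    Assumes current_refs is the provider's full listing, not a
    since-filtered delta.
    """
    return set(existing_refs) - set(current_refs)
```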

Must use _admin_session_factory() not get_db() for startup-side and scheduler-side queries (per Lesson on RLS at startup — no app.current_account_id set).

Immediate sync via POST /api/v1/kb-connectors/{id}/sync enqueues a job; scheduler picks it up within ~30s.
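
The cap of four simultaneous accounts could be enforced with an asyncio semaphore, as in this sketch. It assumes the scheduler job runs account syncs as coroutines; `sync_accounts` and the `sync_one` callback are hypothetical names, and the real job may use a different mechanism.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def sync_accounts(
    account_ids: Sequence[str],
    sync_one: Callable[[str], Awaitable[str]],
    max_concurrent: int = 4,
) -> List[str]:
    """Run sync_one for every account, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(account_id: str) -> str:
        async with semaphore:  # waits here while the cap is reached
            return await sync_one(account_id)

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(guarded(a) for a in account_ids)))


# Demonstration with a fake sync that records peak concurrency.
state = {"active": 0, "peak": 0}


async def fake_sync(account_id: str) -> str:
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)
    state["active"] -= 1
    return account_id


results = asyncio.run(sync_accounts([f"acct-{i}" for i in range(10)], fake_sync))
```

Within one account the connectors would still run sequentially inside `sync_one`, which matches the "sequential per account" note in §6.1.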


10. Escalation flow

  1. L1 clicks Escalate → modal (reason category + optional free text).
  2. POST /api/v1/l1/sessions/{id}/escalate → backend:
    • Calls extended escalation_package_generator.generate(session_id, include_l1_walk=true). Package contents:
      problem_statement, customer_name, customer_contact,
      ticket_ref (PSA id or internal id),
      target_kind ('flow' | 'proposal'), target_id,
      walked_path,
      ai_draft_proposal_id,
      kb_citations,
      escalation_reason, reason_category, l1_user_id
      
    • Creates an ai_session with the package serialized into system context for the chat surface.
    • If PSA-backed: psa_provider.reassign_ticket(ticket_id, to=account.engineer_queue_name). Default 'Tier 2'. Owner configurable in /account/integrations.
    • If internal-backed: internal_tickets.status='escalated', assigned_user_id=null (round-robin assignment is out of scope).
    • Writes notification via existing notification_service — bell badge to all engineers in account.
    • Audit log entry; acting_as reflects whether L1 or coverage-engineer escalated.
  3. Toast on L1 side, return to /l1.
  4. Engineer clicks notification → /pilot/{sessionId} → chat surface renders the package as a sticky "Escalation context" card; engineer continues in chat.

Un-escalate is out of scope. If engineer wants to bounce back, they reassign in PSA manually.


11. Internal ticket fallback

When the account has no active PSA provider:

  • Intake creates internal_tickets row instead of a PSA ticket.
  • Queue surface merges PSA + internal with Internal / PSA origin badge.
  • Escalation flips internal_tickets.status='escalated' and assigns engineer (or leaves null for any engineer to claim — v1 behavior).
  • Engineer post-escalation sees the internal ticket as a session; no PSA roundtrip.

Promote to PSA: owner-only action on any internal ticket. Pushes the ticket into the configured PSA provider, sets psa_promoted_ticket_id. Manual; not automatic on PSA-install. Lets MSPs adopt PSA mid-flight without orphaning prior internal tickets.


12. Outcome-validation lifecycle

1. L1 intake → match_or_build → FlowProposal(source='ai_realtime_l1',
                                              validated_by_outcome=false,
                                              linked_ticket_id=...)
2. L1 walks → POST /l1/sessions/{id}/step appends to walked_path_snapshot
3. L1 hits Resolve:
     modal: "Did this resolve it?" [Yes] [No] + resolution_notes
4. helpful=true → flow_proposal.validated_by_outcome = true
                 → walked_path_snapshot frozen
                 → ticket closed (PSA or internal)
   helpful=false → validated_by_outcome stays false
                  → L1 prompted: "Escalate instead?"
5. Engineer review queue:
     ORDER BY validated_by_outcome DESC, created_at DESC
     - Outcome-validated drafts surface first
     - Promote / edit-and-promote / retire
6. Promote → new flow with source='ai_promoted'; original proposal kept with status='promoted'
           → future match_or_build matches the new flow on the flow-match pass
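
Steps 3 and 4 of the lifecycle can be sketched as a single function. This is an illustration with hypothetical types (`Proposal`, `ResolveResult`) and one assumption the spec leaves open: that a helpful=false answer keeps the ticket open while the L1 is prompted to escalate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proposal:
    """Hypothetical slice of FlowProposal."""
    validated_by_outcome: bool = False
    walked_path_snapshot: Optional[List[dict]] = None


@dataclass
class ResolveResult:
    ticket_closed: bool
    prompt_escalate: bool


def resolve_session(
    target_kind: str,
    helpful: bool,
    walked_path: List[dict],
    proposal: Optional[Proposal] = None,
) -> ResolveResult:
    if not helpful:
        # Assumption: the ticket stays open until the L1 escalates.
        return ResolveResult(ticket_closed=False, prompt_escalate=True)
    if target_kind == "proposal" and proposal is not None:
        proposal.validated_by_outcome = True
        proposal.walked_path_snapshot = list(walked_path)  # frozen copy
    return ResolveResult(ticket_closed=True, prompt_escalate=False)
```

Authored flows pass through the same path but never touch a proposal, so only AI drafts can earn the outcome-validated flag.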

13. Out of scope (v1 non-goals)

  • End-user / self-service portal ("L0" tier).
  • Engineer warm-transfer / live take-over during a call.
  • L1 ↔ engineer real-time chat during a call.
  • Multi-language UI / customer-language toggle in walker.
  • Auto-promote internal tickets to PSA on integration install.
  • AI tree streaming (node-by-node).
  • KB write-back to IT Glue/Hudu/SharePoint (read-only ingestion).
  • Confluence connector.
  • Per-step KB citation editing in engineer review (engineers edit the tree, not citations).
  • Final Stripe pricing SKU (data model supports differential pricing; price set in Stripe dashboard).
  • "Switch to L1 mode" persistent toggle for engineers (coverage flag + banner only).
  • Cancel/un-escalate flow.
  • Round-robin engineer assignment on internal-ticket escalations.

14. Testing strategy

14.1 Backend (pytest)

  • Unit: match_or_build covers all four paths (flow-match, proposal-match, built, aborted_no_kb).
  • Unit: ai_tree_builder schema validation — assert rejection of malformed Anthropic output before persistence.
  • Unit: each connector's list_documents + fetch_content against recorded HTTP fixtures.
  • Integration: intake → walk → resolve(helpful=true) → assert FlowProposal.validated_by_outcome=true, ticket closed.
  • Integration: intake → walk → escalate → assert PSA reassign_ticket invoked, ai_session created with package, audit log entry, notification dispatched.
  • Integration: KB scheduler — max_instances=1, sequential per-account, soft-delete on removal.
  • RLS regression (highest priority): l1_tech user in account A cannot read account B's tickets, drafts, KB docs, or connector configs. Added to existing RLS test suite.
  • Anti-parrot: existing CI test auto-discovers new prompt module.
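
As one illustration of the ai_tree_builder schema-validation test, the sketch below rejects malformed model output before anything would be persisted. The tree shape (a `root` id plus a `nodes` mapping of question nodes with `yes`/`no` branches and leaf nodes with an `outcome`) is an assumption made for this example; the real schema is defined elsewhere in the spec. `validate_tree` and `MalformedTreeError` are hypothetical names.

```python
import json


class MalformedTreeError(ValueError):
    """Raised when model output is not a usable yes/no decision tree."""


def validate_tree(raw: str) -> dict:
    """Parse and validate raw model output; raise before persistence."""
    try:
        tree = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedTreeError(f"not valid JSON: {exc}") from exc
    if not isinstance(tree, dict) or not isinstance(tree.get("nodes"), dict):
        raise MalformedTreeError("missing 'nodes' mapping")
    nodes = tree["nodes"]
    root = tree.get("root")
    if not isinstance(root, str) or root not in nodes:
        raise MalformedTreeError("'root' does not name a node")
    for node_id, node in nodes.items():
        if not isinstance(node, dict):
            raise MalformedTreeError(f"{node_id}: node must be an object")
        if "outcome" in node:
            continue  # leaf node (assumed shape)
        if "question" not in node:
            raise MalformedTreeError(f"{node_id}: neither question nor outcome")
        for branch in ("yes", "no"):
            target = node.get(branch)
            if not isinstance(target, str) or target not in nodes:
                raise MalformedTreeError(f"{node_id}: dangling '{branch}' branch")
    return tree


# Illustrative malformed payloads; none of these should reach the database.
MALFORMED = [
    "not json at all",
    '{"root": "n1"}',
    '{"root": "n9", "nodes": {"n1": {"outcome": "done"}}}',
    '{"root": "n1", "nodes": {"n1": {"question": "Is it on?", "yes": "n2"}}}',
]


def test_rejects_malformed_output():
    persisted = []  # stands in for the persistence layer
    for raw in MALFORMED:
        try:
            persisted.append(validate_tree(raw))
        except MalformedTreeError:
            pass
    assert persisted == []
```

The test function uses plain asserts, so pytest can collect it as written; in the real suite the payloads would more likely be recorded model responses than hand-written strings.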

14.2 Frontend

  • Unit: usePermissions — L1 sees L1 paths, blocked from engineer paths. Coverage flag opens L1 paths.
  • Unit: L1WalkPage — node advance, escalate modal, resolve modal flips validated_by_outcome correctly.
  • Unit: L1CoverageBanner — visible for engineer-with-flag on /l1/*, hidden for L1 users.
  • E2E (Playwright, scoped selectors per Lesson):
    • L1 sign-in → dashboard → intake → walker → resolve → verify ticket closed + proposal flagged.
    • Engineer with can_cover_l1 → sidebar entry visible → click → coverage banner shows → walks a session → audit log records acting_as='l1_coverage'.
    • L1 hitting /pilot, /trees/new, /escalations → 403 or redirect.

15. Acceptance criteria (v1 ships when…)

  • L1 role assignable; assigned L1 sees L1 sidebar only; no engineer route reachable.
  • L1 intake creates a ticket (PSA or internal) and lands in walker session.
  • Walker handles both flows and proposals; AI-built badge + sources shown for proposals.
  • Escalate generates package, reassigns ticket, notifies engineers.
  • Resolve flips validated_by_outcome; review queue prioritizes outcome-validated drafts.
  • All three KB connectors configurable; initial sync + periodic re-sync + soft-delete on removal.
  • AI build refuses with informative error when account KB is empty.
  • Coverage flag works end-to-end with audit-log tagging.
  • RLS blocks cross-tenant reads on every new table.
  • L1 seat count tracked separately from engineer seats in admin/billing UI.

16. Risks & mitigations

| Risk | Mitigation |
| --- | --- |
| AI builds an unsafe tree | Schema validation rejects malformed output. Engineer review is the gate before draft becomes "real" flow. v1 refuses to build when KB is empty. |
| Hallucinated KB citations | Post-build verification that each kb_doc_id exists; unverified citations stripped from walker, surfaced as warning in engineer review. |
| Duplicate proposals for same problem | Validated-proposal match pass deduplicates after one L1 validates; pre-validation dups are tolerated and dedup'd during engineer review. |
| KB ingestion captures sensitive content | Per-connector deny-lists (passwords, sensitive flex assets, MS Graph Sensitivity Labels). Owners exclude specific folders/sites at config. All ingested docs visible in /account/kb for manual deletion. |
| AI build latency frustrates customer on call | Build-progress UI sets expectation. Escalate button visible from page load. Future: pre-warm builds on PSA-ticket-landed event. |
| Three connectors is more scope than originally proposed | Acknowledged. Each connector is ~12 weeks of work. Plan should sequence them and allow shipping with IT Glue + Hudu first if SharePoint slips. |
| Engineer review queue backlog stalls library growth | Validated-proposal match pass means good drafts get reused without engineer review. Backlog only delays the move from 'proposal' to 'flow', not the L1's ability to use validated content. |
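
The hallucinated-citation mitigation can be sketched as a post-build pass over the tree. This is a minimal illustration under stated assumptions: each node is assumed to carry a `kb_doc_ids` list, the set of existing document ids is assumed to be loaded up front for the account, and `strip_unverified_citations` is a hypothetical name.

```python
from copy import deepcopy


def strip_unverified_citations(nodes, existing_doc_ids):
    """Return (cleaned_nodes, warnings).

    cleaned_nodes keeps only citations whose kb_doc_id exists; warnings lists
    (node_id, kb_doc_id) pairs to surface in engineer review.
    """
    cleaned = deepcopy(nodes)  # leave the builder's original output untouched
    warnings = []
    for node_id, node in cleaned.items():
        kept = []
        for doc_id in node.get("kb_doc_ids", []):
            if doc_id in existing_doc_ids:
                kept.append(doc_id)
            else:
                warnings.append((node_id, doc_id))
        node["kb_doc_ids"] = kept
    return cleaned, warnings
```

Returning the warnings separately lets the walker show only the cleaned tree while the review queue still sees which citations were dropped.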

17. Naming reference

| Layer | Value |
| --- | --- |
| DB enum (account_role) | l1_tech |
| UI display | "L1 Tech" / "L1" |
| Sidebar entry | "L1 Workspace" |
| URL prefix | /l1 |
| Coverage flag column | users.can_cover_l1 |
| Coverage audit tag | acting_as = 'l1_coverage' |
| Pricing label | "L1 seat" |
| Stripe SKU | Set in Stripe dashboard at launch; data model supports differential pricing now |

18. Open implementation decisions (deferred to plan, not blocking design)

  • Validation of the MATCH_THRESHOLD default value (initial 0.75, tune from telemetry post-launch).
  • Specific Anthropic model choice for l1_realtime_build (Sonnet vs Opus — pick based on quality benchmark during plan).
  • Chunk size + overlap for KB ingestion writer (tune in implementation).
  • Engineer queue label default ('Tier 2' vs 'Engineering') — owner-configurable anyway.
  • Exact look of the build-progress shimmer animation — design-system handoff.

These are tuning/UX-polish details, not architectural forks. They land during the writing-plans phase, not here.

Note on scope and phasing

This is a substantive feature: new role, four frontend pages, ~12 endpoints, AI tree-builder, three KB connectors, escalation extensions, and six migrations. The implementation plan will almost certainly phase the work — a reasonable cut is:

  • Phase 1: role + L1 surface against existing authored flows (no AI build, no connectors yet). Validates the seat model, walker UX, escalation, internal ticket fallback, and coverage flag end-to-end.
  • Phase 2: kb_documents schema + AI tree-builder + match-or-build pipeline. Enables real-time AI flows grounded on manually-uploaded KB.
  • Phase 3: the three KB connectors (IT Glue, Hudu, SharePoint/OneDrive). Each is roughly self-contained — can ship one at a time and reorder if a connector blocks.

Phasing is a plan-level decision; the spec captures the full feature.


End of spec.