fix(tests): stabilize escalation SSE backend tests

Co-Authored-By: Codex <noreply@openai.com>
2026-04-27 19:47:43 -04:00
parent ba46fc5644
commit bc15952857
7 changed files with 153 additions and 73 deletions
--- a/.ai/HANDOFF.md
+++ b/.ai/HANDOFF.md
@@ -2,64 +2,50 @@

 # HANDOFF.md

-**Last updated:** 2026-04-27 EDT (paused mid-build for Codex review)
+**Last updated:** 2026-04-27 EDT

 **Active task:** **Escalation Mode** wedge build. See [`CURRENT_TASK.md`](CURRENT_TASK.md) for the full status; this file holds the resume point only.

-**Branch:** `feat/escalation-metric-endpoint` — six commits stacked on `main` (`c0ed6d9`). Working tree has UNCOMMITTED WIP for the SSE push.
+**Branch:** `feat/escalation-metric-endpoint` — SSE backend WIP is now test-stabilized locally. Working tree should be clean after the handoff commit.

-## Status — paused for Codex review
+## Status

-Build is paused mid-flight on the SSE push. Hand the branch (and the WIP) to Codex for an outside-voice pass before stacking more commits, fixing tests, or pushing. Reasons: local backend test loop got tangled (multiple stale pytest processes contended on the same Postgres test schema; the suite design rebuilds the schema per test which doesn't tolerate concurrent runs well), and the SSE work is the kind of cross-layer surface a second pair of eyes is most valuable on.
+Previous session diagnosed the slow-test issue and fixed the backend test loop.

-What Codex should look at:
-1. The new SSE endpoint at [`backend/app/api/endpoints/session_handoffs.py`](../backend/app/api/endpoints/session_handoffs.py) — `stream_escalations` — and the in-memory pub/sub bus at [`backend/app/core/escalation_bus.py`](../backend/app/core/escalation_bus.py).
-2. Whether the bus's single-process / non-durable design is acceptable for the v1 pilot (Railway single-replica) and what the swap-to-Redis story should look like.
-3. The dispatch wiring in [`backend/app/services/handoff_manager.py`](../backend/app/services/handoff_manager.py) — `dispatch_escalation_notifications` now publishes to the bus before the email fan-out. Race / ordering / failure-mode review.
-4. Auth on the SSE stream — same `require_engineer_or_admin` dep as `/queue` and `/claim`. Browsers can't send custom headers via the native `EventSource` API; the planned frontend uses a fetch-based `ReadableStream` reader (matching the existing `streamDocumentation` pattern in [`frontend/src/api/aiSessions.ts`](../frontend/src/api/aiSessions.ts)). Verify that's the right call vs. a query-token scheme.
-5. Whether the bus's "drop-on-full-queue" semantic is acceptable, given a stuck subscriber would silently miss live-arrival cards (they'd still see them on next page load via REST `/queue`).
+Root causes:
+- Multiple stale pytest processes were still alive inside `resolutionflow_backend`, despite the prior handoff saying they were dead. They held `resolutionflow_test` transactions open and caused later tests to block on `DROP SCHEMA public CASCADE`.
+- `test_escalations_stream_returns_sse_content_type` used HTTPX `ASGITransport` against an infinite SSE stream. That transport buffers the entire response body before returning, so the test waited forever and held the auth DB dependency transaction open.
+- Escalation handoff tests created `intent="escalate"` handoffs without stubbing `_generate_ai_assessment()`, so they waited on the real AI path instead of testing handoff behavior.
+- The bus keyed subscribers by raw `account_id`; string UUIDs and `UUID` objects for the same account did not match.

-## Resume point (after Codex review)
+Fixes made:
+- `stream_escalations` now uses `Depends(require_engineer_or_admin, scope="function")` so auth DB dependencies are released before the long-lived stream body.
+- The SSE handshake test now calls `stream_escalations()` directly and consumes only the first generator yield, avoiding HTTPX's infinite-stream buffering behavior.
+- Handoff manager/API tests stub `_generate_ai_assessment()` with an `AsyncMock`.
+- `EscalationBus` normalizes string/UUID account IDs at subscribe/publish/unsubscribe/subscriber_count boundaries, with a regression test.

-1. **Get the test suite back to green.** Stale pytest zombies in the container were cleared (PIDs 1790034, 1844996, 1883167, 1916565, 1935830, 2009437, 2009449 — all dead, parent uvicorn-reload didn't reap them; PID slots remain but no live processes). Re-run with `pytest -n auto` to keep wall-clock manageable. Files: `tests/test_escalation_bus.py` (7 tests), the 4 new dispatch + SSE tests in `tests/test_handoff_manager.py` and `tests/test_session_handoffs_api.py`.
-2. **Frontend SSE subscription** in `EscalationQueue.tsx` — fetch-based reader, prepend new cards with the locked 200ms slide-in, reconnect with backoff, tab-title flash when backgrounded, respect `prefers-reduced-motion`. Then ship the magic-moment handoff-context screen (4 sections, dissolves into FlowPilot session view).
-3. Push the branch + open a draft PR.
+Verified:
+- `pytest tests/test_escalation_bus.py tests/test_handoff_manager.py tests/test_session_handoffs_api.py tests/test_flowpilot_analytics_escalations.py --override-ini=addopts= -q --durations=20` → `31 passed in 46.95s`
+- Same subset with `-n auto` → `31 passed in 17.80s`
+- No remaining pytest processes or `resolutionflow%test%` Postgres sessions after the run.

-## Stack
+## Resume point

-```
-WIP   (uncommitted): SSE bus + endpoint + dispatcher publish + 7 bus tests + 1 dispatcher test + 2 SSE endpoint tests
-a283d0d  docs(ai): refresh handoff state mid-flight on Escalation Mode build
-9f0bfd4  feat(escalations): mount time-to-first-action stat-card on /escalations
-07d0db9  feat(handoff): email engineer-or-admin teammates on escalation
-7a5b853  feat(api): role-gate handoff claim to engineer-or-admin
-52f6d03  feat(analytics): add escalation time-to-first-action metric endpoint
-d51e95c  docs(plans): add escalation-mode wedge design + test plan
-```
-
-## Where things stand
-
- CI on `main` still healthy. Branch protection: `CI / frontend (pull_request)` required, `CI / backend (pull_request)` required, `CI / e2e (pull_request)` not yet required.
- The 20 tests passing as of `9f0bfd4` are still passing (last green run logged before the SSE work). The newly added SSE tests (7 bus + 1 dispatcher integration + 2 endpoint) HAVE NOT been verified end-to-end this session — they ran clean on the bus suite alone (7/7 in 0.14s) but the DB-backed integration tests were aborted before completing.
- The plan doc at [`docs/plans/2026-04-27-escalation-mode-wedge-design.md`](../docs/plans/2026-04-27-escalation-mode-wedge-design.md) is the source of truth for every UI / metric / scope decision. The embedded **GSTACK REVIEW REPORT** at the bottom shows Eng + Design CLEARED and Codex INFO from the design-stage pass.
+1. Continue the **Frontend SSE subscription** in `EscalationQueue.tsx`: fetch-based reader, prepend new cards with the locked 200ms slide-in, reconnect with backoff, tab-title flash when backgrounded, respect `prefers-reduced-motion`.
+2. Then ship the **magic-moment handoff-context screen**: 4 sections (problem summary / what's been tried / AI assessment / Start here CTA), loads on Pick Up, then dissolves into regular FlowPilot session view.
+3. Push the branch and open a draft PR when the frontend/live-arrival slice is ready.

 ## Useful breadcrumbs

- Metric endpoint: [`backend/app/api/endpoints/flowpilot_analytics.py`](../backend/app/api/endpoints/flowpilot_analytics.py) — `get_escalation_metrics` at the bottom.
- Notification dispatch (email + bus publish): [`backend/app/services/handoff_manager.py`](../backend/app/services/handoff_manager.py) — `dispatch_escalation_notifications`. Wired in [`backend/app/api/endpoints/session_handoffs.py`](../backend/app/api/endpoints/session_handoffs.py) **after** `db.commit()` so a rolled-back handoff never emails or fans out.
- SSE endpoint (WIP): [`backend/app/api/endpoints/session_handoffs.py`](../backend/app/api/endpoints/session_handoffs.py) — `stream_escalations`. Heartbeat every 25s, account-scoped subscribe, role-gated to engineer-or-admin.
- Pub/sub bus (WIP): [`backend/app/core/escalation_bus.py`](../backend/app/core/escalation_bus.py). Module-level singleton, in-memory, `asyncio.Queue` per subscriber with 64-event maxsize and drop-on-full semantics.
- Frontend stat-card: [`frontend/src/components/flowpilot/EscalationMetricCard.tsx`](../frontend/src/components/flowpilot/EscalationMetricCard.tsx). Renders `n_with_action / n_claimed`, avg + median, and the metric_definition disclaimer.
- Two-metric framing — required reading before quoting any number to a pilot. In-product endpoint measures *post-claim time-to-first-action*; the savings claim is `manual_baseline − in_product`. Manual baseline comes from the founder's stopwatch on the next 5 escalations (The Assignment in the design doc).
- The `notification_sent` boolean is intentionally NOT being written. Per Codex's design-stage correction it should be replaced by per-channel delivery records; v1.x story. For now application logs are the audit trail.
+- SSE endpoint: [`backend/app/api/endpoints/session_handoffs.py`](../backend/app/api/endpoints/session_handoffs.py) — `stream_escalations`.
+- Pub/sub bus: [`backend/app/core/escalation_bus.py`](../backend/app/core/escalation_bus.py). In-memory, account-scoped, non-durable, 64-event per-subscriber queue, drop-on-full.
+- Notification dispatch: [`backend/app/services/handoff_manager.py`](../backend/app/services/handoff_manager.py) — `dispatch_escalation_notifications`, called after `db.commit()` in the handoff endpoint.
+- Frontend streaming reference: [`frontend/src/api/aiSessions.ts`](../frontend/src/api/aiSessions.ts) — `streamDocumentation` uses fetch + `ReadableStream`, which remains the right pattern because native `EventSource` cannot send auth headers.
+- Metric endpoint: [`backend/app/api/endpoints/flowpilot_analytics.py`](../backend/app/api/endpoints/flowpilot_analytics.py) — `get_escalation_metrics`.

 ## Watch-outs

- `ai_session_step` has NO `user_id` column — the metric query keys "first action by senior" off `session_id + created_at > claimed_at`. Fine for v1 because session activity post-claim IS the senior's activity (session reactivates under `escalated_to_id`).
- `account_id` is denormalized on `ai_session_step` (Phase 4 RLS pattern). Use it directly; don't join through `ai_sessions`.
- POST `/handoff` still requires the session owner to be the escalator (`AISession.user_id == current_user.id`). Peer-tech escalation is a v2 TODO.
- The test suite uses `DROP SCHEMA public CASCADE` + `CREATE SCHEMA public` per test (see [`backend/tests/conftest.py:144`](../backend/tests/conftest.py#L144)). Concurrent pytest runs against the same test DB collide. Always run one suite at a time, or via `-n auto` xdist with the per-worker-DB isolation already in conftest.
-
-## Kill-switch (week 8)
-
-If 0 of 3 pilots produce a verifiable hours-saved-per-week number above 1.0, revisit the wedge. The design doc names the alternative direction (deterministic-ops territory) but data lands first.
+- Do not reintroduce `client.stream()`/ASGITransport tests for infinite SSE responses; test the generator directly or use a real server-level test.
+- `DROP SCHEMA public CASCADE` per test is still the dominant cost: DB-backed tests spend ~1.7-2.8s in setup. Use `-n auto` for focused backend loops.
+- The bus is acceptable for v1 pilot scale only because Railway is single-replica. Redis pub/sub is the obvious swap when horizontal scaling appears.
+- Synchronous `_generate_ai_assessment()` during escalation creation remains product-latency risk; tests are now isolated from it, but the UX path should be watched as the magic-moment screen is built.
--- a/.ai/SESSION_LOG.md
+++ b/.ai/SESSION_LOG.md
@@ -12,6 +12,18 @@

 ---

+## 2026-04-27 19:50 EDT — Codex — Stabilize Escalation Mode SSE backend tests
+
+- Diagnosed slow backend tests on `feat/escalation-metric-endpoint`. Multiple stale pytest processes were still alive inside `resolutionflow_backend` and held `resolutionflow_test` transactions open, blocking later per-test schema resets on `DROP SCHEMA public CASCADE`.
+- Reproduced a deterministic hang in `test_escalations_stream_returns_sse_content_type`: HTTPX `ASGITransport` buffers the full response body before returning, so an infinite SSE response never yielded the initial chunk and kept the auth DB dependency transaction open.
+- Fixed `stream_escalations` to release auth dependencies before the long-lived stream body with `Depends(..., scope="function")`.
+- Reworked the SSE handshake test to call `stream_escalations()` directly and consume one generator yield, then close it; kept viewer role-gate coverage through the API client.
+- Stubbed `_generate_ai_assessment()` in handoff manager/API tests so escalation handoff tests no longer wait on the real AI path.
+- Normalized account IDs inside `EscalationBus` so string UUIDs and `UUID` objects hit the same subscriber bucket; added a regression test.
+- Verified focused backend subset: serial `31 passed in 46.95s`; xdist `31 passed in 17.80s`. Confirmed no lingering pytest processes or test DB sessions afterward.
+- Left for next session: continue frontend SSE subscription in `EscalationQueue.tsx`, then the magic-moment handoff-context screen.
+- Files touched: `backend/app/api/endpoints/session_handoffs.py`, `backend/app/core/escalation_bus.py`, `backend/tests/test_escalation_bus.py`, `backend/tests/test_handoff_manager.py`, `backend/tests/test_session_handoffs_api.py`, `.ai/HANDOFF.md`, `.ai/SESSION_LOG.md`.
+
 ## 2026-04-26 03:50 EDT — Claude Code — Ship AssistantChatPage prefill `currentChatRef` fix; close out PR #150

 - User reported a troubleshooting-session bug: after answering a subset of task-lane questions and clicking *Send N of M Responses*, no AI response appeared. Traced to `AssistantChatPage`: the dashboard prefill effect set `activeChatId` after creating a new chat session but never updated `currentChatRef.current`. The `currentChatRef.current !== sentForChatId` guard in `handleSend` and `handleTaskSubmit` then bailed silently on every later request and discarded the AI's reply. The user message was already pushed to the chat before the await, so the user saw their answers but nothing else.