WIP: SSE pub/sub for live escalation arrivals (paused for Codex review)
First half of the WebSocket/SSE push slice. Paused mid-flight to hand
the branch to Codex for outside-voice review before stacking more
commits on top. See .ai/HANDOFF.md for the full pause context + what
to look at.
What's here:
- backend/app/core/escalation_bus.py — module-level singleton in-memory
pub/sub keyed by account_id. asyncio.Queue per subscriber with
64-event maxsize and drop-on-full semantics. Designed to be swappable
for Redis pub/sub when Railway scales past single-replica.
- backend/app/api/endpoints/session_handoffs.py — GET
/api/v1/ai-sessions/escalations/stream SSE endpoint. Auth via
require_engineer_or_admin. 25s heartbeat. Account-scoped subscribe
bound to current_user.account_id.
- backend/app/services/handoff_manager.py — dispatch_escalation_notifications
now publishes a `handoff_created` event to the bus BEFORE the email
fan-out, in a try/except so a bus failure can't block email delivery.
- backend/tests/test_escalation_bus.py — 7 unit tests, all green
standalone (0.14s). Cross-tenant isolation, drop-on-full, no-subscribers.
- backend/tests/test_handoff_manager.py — +1 dispatcher integration test
(publishes to bus, payload shape).
- backend/tests/test_session_handoffs_api.py — +2 endpoint tests (viewer
blocked, ready event handshake).
[gstack-context]
Decisions:
- SSE over WebSocket (one-way, browser EventSource semantics, fewer
moving parts behind Railway proxy)
- In-memory bus over Redis for v1 pilot (3 MSPs, single replica)
- Drop-on-full subscriber queue rather than back-pressure publishers
- Bus publish ahead of email send, both wrapped in try/except so
neither can break handoff creation
- Frontend will be a fetch-based ReadableStream reader matching the
existing streamDocumentation pattern, not native EventSource
(custom-header auth)
Remaining (post-Codex):
- Frontend SSE subscription in EscalationQueue.tsx (slide-in,
reconnect, tab-title flash, prefers-reduced-motion)
- Magic-moment handoff-context screen
- Re-run the full backend test suite to verify the SSE +
dispatcher integration tests (bus units already green standalone)
Tried:
- Running the full test suite repeatedly without xdist; the per-test
DROP SCHEMA + recreate fixture made wall-clock prohibitive when
multiple stale runs collided on the same Postgres test schema.
Resolution: -n auto next time.
[/gstack-context]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
97
backend/app/core/escalation_bus.py
Normal file
97
backend/app/core/escalation_bus.py
Normal file
@@ -0,0 +1,97 @@
|
||||
"""In-memory pub/sub bus for live escalation events.
|
||||
|
||||
Single-process, non-durable. When a handoff fires, every connected SSE
|
||||
subscriber for the same `account_id` receives the event. Subscribers come
|
||||
and go as senior techs open and close the EscalationQueue page.
|
||||
|
||||
Pre-PMF scale (3 pilots × 5-20 techs/MSP = ~15-60 concurrent subscribers
|
||||
total, single Railway replica) makes in-memory the right call. When the
|
||||
deployment scales horizontally, swap this for Redis pub/sub or similar —
|
||||
the public surface (`publish` / `subscribe`) is intentionally narrow so
|
||||
the swap is local.
|
||||
|
||||
Events are JSON-serializable dicts. `publish()` is non-blocking (drops the
|
||||
event if a subscriber's queue is full rather than back-pressuring the
|
||||
caller). `subscribe()` MUST be paired with `unsubscribe()` in a finally
|
||||
block, or you leak queues.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
from typing import Any
|
||||
from uuid import UUID
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# Bound how many unconsumed events can sit in a subscriber's queue before
|
||||
# we start dropping. 64 is generous for the queue-page use case; if a
|
||||
# subscriber is that far behind, they're probably gone or stuck.
|
||||
_QUEUE_MAXSIZE = 64
|
||||
|
||||
|
||||
class EscalationBus:
|
||||
"""Account-scoped pub/sub for escalation arrival events."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
self._subscribers: dict[UUID, set[asyncio.Queue[dict[str, Any]]]] = {}
|
||||
self._lock = asyncio.Lock()
|
||||
|
||||
async def subscribe(self, account_id: UUID) -> asyncio.Queue[dict[str, Any]]:
|
||||
"""Register a new subscriber queue for an account.
|
||||
|
||||
Caller must invoke `unsubscribe(account_id, queue)` when the
|
||||
consumer disconnects.
|
||||
"""
|
||||
queue: asyncio.Queue[dict[str, Any]] = asyncio.Queue(
|
||||
maxsize=_QUEUE_MAXSIZE
|
||||
)
|
||||
async with self._lock:
|
||||
self._subscribers.setdefault(account_id, set()).add(queue)
|
||||
return queue
|
||||
|
||||
async def unsubscribe(
|
||||
self, account_id: UUID, queue: asyncio.Queue[dict[str, Any]]
|
||||
) -> None:
|
||||
async with self._lock:
|
||||
subs = self._subscribers.get(account_id)
|
||||
if subs is None:
|
||||
return
|
||||
subs.discard(queue)
|
||||
if not subs:
|
||||
self._subscribers.pop(account_id, None)
|
||||
|
||||
async def publish(self, account_id: UUID, event: dict[str, Any]) -> int:
|
||||
"""Fan event out to every subscriber for `account_id`.
|
||||
|
||||
Returns the number of subscribers that successfully received the
|
||||
event. Drops the event for any subscriber whose queue is full
|
||||
(logs at warning level).
|
||||
"""
|
||||
async with self._lock:
|
||||
subs = list(self._subscribers.get(account_id, ()))
|
||||
if not subs:
|
||||
return 0
|
||||
delivered = 0
|
||||
for queue in subs:
|
||||
try:
|
||||
queue.put_nowait(event)
|
||||
delivered += 1
|
||||
except asyncio.QueueFull:
|
||||
logger.warning(
|
||||
"EscalationBus: dropped event for full subscriber queue "
|
||||
"(account_id=%s, event=%s)",
|
||||
account_id,
|
||||
event.get("type", "?"),
|
||||
)
|
||||
return delivered
|
||||
|
||||
def subscriber_count(self, account_id: UUID) -> int:
|
||||
"""Diagnostic — number of active subscribers for an account."""
|
||||
return len(self._subscribers.get(account_id, ()))
|
||||
|
||||
|
||||
# Module-level singleton. FastAPI imports this; `subscribe()` and `publish()`
|
||||
# are coroutine-safe via the internal Lock.
|
||||
bus = EscalationBus()
|
||||
Reference in New Issue
Block a user