fix: CRITICAL — scope copilot tree query to current account (#131)

* docs: add tenant data isolation design spec

Complete architecture plan for multi-tenant data isolation across
all layers (PostgreSQL RLS, application-layer filtering, schema
migration, testing strategy, and phased rollout checklist).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add background job isolation policy to tenant isolation spec

Documents policy for all 5 existing background jobs:
- Knowledge Flywheel and PSA Retry flagged for account_id threading
- Chat Retention already follows correct pattern (model for others)
- Maintenance Schedule Firing needs account_id in queries + Session creation
- AI Conversation Expiry approved as cross-tenant with justification

Adds approved cross-tenant query registry and Phase 2 checklist items.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add tenant isolation Phase 0 implementation plan

8 tasks covering: CRITICAL copilot hotfix, tenant_filter() helper,
get_tenant_context dependency, analytics/category/AI session gap fixes,
full UUID endpoint audit, TargetList dead code audit, teams orphan
check, and CI grep check for missing tenant filters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: CRITICAL — scope copilot tree query to current account

A user who knew another account's tree UUID could start a copilot
conversation, causing the tree's full node structure, names, and
descriptions to be sent to the AI as part of the system prompt.

Fix: add account_id (or is_default / visibility='public') filter to
the tree SELECT in copilot_service.start_conversation(). Returns 404
for inaccessible trees. Test added in test_tenant_isolation_p0.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit was merged in pull request #131.
This commit is contained in:
chihlasm
2026-04-09 00:41:30 -04:00
committed by GitHub
parent ba0680ce06
commit 29a9573d6e
4 changed files with 1911 additions and 4 deletions

View File

@@ -8,7 +8,7 @@ from datetime import datetime, timezone, timedelta
from typing import Optional, Any
from uuid import UUID
from sqlalchemy import select
from sqlalchemy import select, or_
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload
@@ -103,13 +103,23 @@ async def start_conversation(
Returns (conversation, greeting_message).
"""
# Load tree
# Load tree — must be accessible to this account.
# Allows own account's trees, default trees, and public trees.
# Raises ValueError (caught by endpoint as 404) if not found or not accessible.
result = await db.execute(
select(Tree).options(selectinload(Tree.tags)).where(Tree.id == tree_id)
select(Tree).options(selectinload(Tree.tags)).where(
Tree.id == tree_id,
or_(
Tree.account_id == account_id,
Tree.author_id == user_id,
Tree.is_default == True,
Tree.is_public == True,
),
)
)
tree = result.scalar_one_or_none()
if not tree:
raise ValueError(f"Tree {tree_id} not found")
raise ValueError(f"Tree {tree_id} not found or not accessible")
conversation = CopilotConversation(
user_id=user_id,

View File

@@ -0,0 +1,107 @@
"""Cross-tenant isolation tests for Phase 0 gap fixes."""
import pytest
from httpx import AsyncClient
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
import uuid
from app.models.account import Account
from app.models.user import User
from app.models.tree import Tree
from app.core.security import get_password_hash
# ── Helpers ──────────────────────────────────────────────────────────────────
async def _create_account_and_user(db: AsyncSession, suffix: str) -> tuple[Account, User, str]:
"""Create an account + owner user. Returns (account, user, plain_password)."""
account = Account(
name=f"Test Corp {suffix}",
display_code=uuid.uuid4().hex[:8],
)
db.add(account)
await db.flush()
password = "TestPass123!"
user = User(
email=f"user-{suffix}-{uuid.uuid4().hex[:6]}@example.com",
name=f"User {suffix}",
password_hash=get_password_hash(password),
is_active=True,
account_id=account.id,
account_role="owner",
)
db.add(user)
await db.flush()
return account, user, password
async def _login(client: AsyncClient, email: str, password: str) -> dict:
"""Return auth headers for a user."""
resp = await client.post(
"/api/v1/auth/login/json",
json={"email": email, "password": password},
)
assert resp.status_code == 200, resp.text
return {"Authorization": f"Bearer {resp.json()['access_token']}"}
async def _create_private_tree(db: AsyncSession, account: Account, user: User) -> Tree:
"""Create a private tree owned by account."""
tree = Tree(
name=f"Private Tree {uuid.uuid4().hex[:6]}",
account_id=account.id,
author_id=user.id,
visibility="private",
tree_type="troubleshooting",
tree_structure={"id": "root", "type": "start", "children": []},
is_active=True,
status="published",
)
db.add(tree)
await db.flush()
return tree
# ── Task 1: Copilot bypass ────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_copilot_cannot_start_conversation_with_other_account_tree(
client: AsyncClient, test_db: AsyncSession
):
"""Account A cannot start a copilot conversation using Account B's private tree UUID."""
acct_a, user_a, pass_a = await _create_account_and_user(test_db, "a")
acct_b, user_b, pass_b = await _create_account_and_user(test_db, "b")
tree_b = await _create_private_tree(test_db, acct_b, user_b)
await test_db.commit()
headers_a = await _login(client, user_a.email, pass_a)
resp = await client.post(
"/api/v1/copilot/conversations",
json={"tree_id": str(tree_b.id), "session_id": None, "current_node_id": None},
headers=headers_a,
)
# Must be 404 (not 200, not 403 — never confirm existence)
assert resp.status_code == 404, f"Expected 404, got {resp.status_code}: {resp.text}"
@pytest.mark.asyncio
async def test_copilot_service_rejects_cross_tenant_tree(test_db: AsyncSession):
"""copilot_service.start_conversation raises ValueError for cross-tenant tree."""
from app.services import copilot_service
acct_a, user_a, pass_a = await _create_account_and_user(test_db, "svc-a")
acct_b, user_b, pass_b = await _create_account_and_user(test_db, "svc-b")
tree_b = await _create_private_tree(test_db, acct_b, user_b)
await test_db.commit()
with pytest.raises(ValueError, match="not found"):
await copilot_service.start_conversation(
user_id=user_a.id,
account_id=user_a.account_id,
tree_id=tree_b.id,
session_id=None,
current_node_id=None,
db=test_db,
)

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,654 @@
# Tenant Data Isolation — Design Spec
> **Date:** 2026-04-09
> **Status:** Approved — ready for implementation planning
> **Approach:** PostgreSQL RLS as safety net; application-layer filtering as primary enforcement
---
## Overview
ResolutionFlow is a multi-tenant SaaS. The tenant boundary is `account_id` (UUID foreign key to `accounts.id`). Two accounts must be completely isolated at every layer: one client must never be able to see, access, query, or receive data belonging to another client — even if there is a bug in application code.
This spec establishes the complete foundation that must be in place before further feature development proceeds.
### Design Principle
**Application code is the primary enforcement mechanism. PostgreSQL Row-Level Security (RLS) is the safety net.**
Both layers must always be present. Developers must never rely on RLS to do filtering that application code should be doing. A missing `tenant_filter()` in application code is a code review failure even if RLS would have caught it.
---
## Current State
- Tenant boundary: `account_id` (not `team_id` — that column is legacy)
- Isolation today: 100% application-layer, manual per-endpoint `.where(account_id == ...)`
- No RLS, no DB session variables, no middleware tenant injection
- Nullable `account_id` on ~8 core models (User, Tree, TreeCategory, etc.)
- ~20 tables with no direct `account_id` (scoped only through join chains)
- 4 models with `team_id` only, no `account_id`: TargetList, ScriptBuilderSession, ScriptTemplate, ScriptGeneration
- Known existing gaps: see Section 4
---
## Section 1: Schema Changes
### 1a. Denormalize `account_id` onto all tenant-relevant tables
Add `account_id UUID NOT NULL` (with FK to `accounts.id` and index) to each of the following tables that currently lack it:
| Table | Backfill path |
|---|---|
| `sessions` | `sessions.user_id → users.account_id` |
| `attachments` | `attachments.session_id → sessions.user_id → users.account_id` |
| `session_supporting_data` | `session_supporting_data.session_id → sessions.user_id → users.account_id` |
| `audit_logs` | `audit_logs.user_id → users.account_id` |
| `maintenance_schedules` | `maintenance_schedules.tree_id → trees.account_id` |
| `user_folders` | `user_folders.user_id → users.account_id` |
| `user_pinned_trees` | `user_pinned_trees.user_id → users.account_id` |
| `session_branches` | `session_branches.ai_session_id → ai_sessions.account_id` (verify column name — may be `session_id`) |
| `session_handoffs` | `session_handoffs.ai_session_id → ai_sessions.account_id` (verify column name — may be `session_id`) |
| `session_resolution_outputs` | `session_resolution_outputs.session_id → sessions.user_id → users.account_id` |
| `fork_points` | `fork_points.session_id → ai_sessions.account_id` |
| `ai_session_steps` | `ai_session_steps.session_id → ai_sessions.account_id` |
| `ai_suggestions` | `ai_suggestions.tree_id → trees.account_id` |
| `step_ratings` | `step_ratings.user_id → users.account_id` (backfill from rater, not the step) |
| `step_usage_logs` | `step_usage_logs.user_id → users.account_id` (backfill from user, not the step) |
| `psa_post_logs` | `psa_post_logs.psa_connection_id → psa_connections.account_id` (psa_connections already has account_id NOT NULL) |
| `psa_member_mappings` | `psa_member_mappings.psa_connection_id → psa_connections.account_id` |
| `notification_logs` | `notification_logs.notification_config_id → notification_configs.account_id` |
**Deferred:** `tree_shares` — backfill strategy and RLS policy depend on whether sharing is intra-tenant only or cross-tenant. Must be resolved before RLS is enabled on this table. See Section 7 (Open Questions).
### 1b. Make existing nullable `account_id` columns NOT NULL
These tables have `account_id` already but it is nullable:
- `users`
- `trees`
- `tree_categories`
- `tree_tags`
- `step_categories`
- `step_library`
- `tree_embeddings`
- `feedback`
Action: Backfill any NULL rows (assign to correct account or delete if orphaned test/seed data), then `ALTER COLUMN account_id SET NOT NULL`.
### 1c. Global / template content
Move global content out of tenant tables into dedicated tables with no RLS:
- **`template_trees`** — default and publicly visible trees (currently `is_default=True` or `visibility='public'` in `trees`). All tenants can read; no RLS.
- **`platform_steps`** — public steps from step library (currently `visibility='public'` in `step_library`). All tenants can read; no RLS.
- Tables already global (no change needed): `script_categories`, `platform_settings`, `plan_limits`, `feature_flags`, `plan_feature_defaults`.
### 1d. Migration sequence (per table)
Every table that gains `account_id` follows this exact sequence. Each step must succeed before the next begins:
1. `ADD COLUMN account_id UUID` (nullable)
2. Backfill via `UPDATE ... JOIN ...`
3. `SELECT COUNT(*) WHERE account_id IS NULL` — must be zero before proceeding
4. `ALTER COLUMN account_id SET NOT NULL`
5. `CREATE INDEX ON <table>(account_id)`
6. Enable RLS (in Phase 3, not here)
Any migration that cannot reach step 3 (zero NULLs) must roll back completely. No partial state.
---
## Section 2: PostgreSQL Roles & RLS Infrastructure
### 2a. Database roles
Two PostgreSQL roles:
**`resolutionflow_app`**
- Used by all application requests
- Subject to all RLS policies
- Standard table privileges: `SELECT, INSERT, UPDATE, DELETE` on all tenant tables
**`resolutionflow_admin`**
- Used by: Alembic migrations, seed scripts, super admin API endpoints, scheduled background jobs requiring cross-tenant access
- Has `BYPASSRLS` attribute — not subject to RLS policies
- Same table privileges as `resolutionflow_app`
- Connection string exposed as `DATABASE_ADMIN_URL` env var
The current `postgres` superuser is replaced in the application by these two roles. The connection string in `DATABASE_URL` transitions to `resolutionflow_app`. Alembic uses `DATABASE_URL_SYNC` pointing to `resolutionflow_admin`.
### 2b. Per-request tenant context injection
Every request that passes through `get_db()` must execute, inside a transaction boundary:
```sql
SET LOCAL app.current_account_id = '<account_uuid>';
```
`SET LOCAL` is transaction-scoped — it resets automatically when the transaction ends. No cleanup needed. This is implemented in a modified `get_db()` dependency that receives `current_user` and executes the SET before yielding the session.
**Fail-closed behavior:** If `app.current_account_id` is not set (e.g., a bug where `SET LOCAL` was skipped), `current_setting('app.current_account_id', false)::uuid` returns NULL. `NULL = NULL` is false in SQL — the RLS policy matches zero rows. This is the correct fail-closed behavior.
Public endpoints (no authenticated user) do not call `SET LOCAL`. RLS matches zero rows for all tenant tables. This is correct.
### 2c. RLS policy pattern
Every tenant table (from Section 1 + all existing tables with account_id) gets:
```sql
ALTER TABLE <table> ENABLE ROW LEVEL SECURITY;
ALTER TABLE <table> FORCE ROW LEVEL SECURITY;
-- SELECT
CREATE POLICY tenant_select ON <table> FOR SELECT
USING (account_id = current_setting('app.current_account_id', false)::uuid);
-- INSERT
CREATE POLICY tenant_insert ON <table> FOR INSERT
WITH CHECK (account_id = current_setting('app.current_account_id', false)::uuid);
-- UPDATE
CREATE POLICY tenant_update ON <table> FOR UPDATE
USING (account_id = current_setting('app.current_account_id', false)::uuid)
WITH CHECK (account_id = current_setting('app.current_account_id', false)::uuid);
-- DELETE
CREATE POLICY tenant_delete ON <table> FOR DELETE
USING (account_id = current_setting('app.current_account_id', false)::uuid);
```
**`FORCE ROW LEVEL SECURITY`** ensures the table owner is also subject to policies. The `resolutionflow_admin` role bypasses via its `BYPASSRLS` attribute, not via ownership.
**`audit_logs` exception:** SELECT policy only. No `WITH CHECK` on INSERT (app inserts audit logs freely). No UPDATE or DELETE policies ever. These constraints are permanent and must be documented in the migration comment.
**Global tables** (`platform_settings`, `plan_limits`, `feature_flags`, `plan_feature_defaults`, `template_trees`, `platform_steps`): No RLS.
### 2d. Connection pool reuse safety
The `SET LOCAL` approach is transaction-scoped. A connection pool reuse test (see Section 6) must verify that a connection returned to the pool after tenant A's request does not carry tenant A's `account_id` into tenant B's request. This is guaranteed by `SET LOCAL` (resets on transaction end) but must be explicitly verified.
---
## Section 3: Application-Layer Enforcement Patterns
### 3a. `tenant_filter()` helper
Add to `backend/app/core/filters.py`:
```python
def tenant_filter(model, account_id: uuid.UUID):
"""
Primary app-layer tenant filter.
MUST be used in every SELECT/UPDATE/DELETE on tenant tables.
RLS is the safety net — this is the primary enforcement.
"""
return model.account_id == account_id
```
All existing filter helpers (`build_tree_access_filter`, `build_step_visibility_filter`) must internally call `tenant_filter()` as their base constraint.
### 3b. Fetch-and-verify pattern
For ID-based lookups, filter by **both** `id` AND `account_id` in the query — not fetch-then-check:
```python
# Correct
stmt = select(Tree).where(
Tree.id == tree_id,
tenant_filter(Tree, current_user.account_id)
)
tree = (await db.execute(stmt)).scalar_one_or_none()
if not tree:
raise HTTPException(status_code=404)
# Prohibited — fetch first, then check
tree = await db.get(Tree, tree_id)
if tree.account_id != current_user.account_id:
raise HTTPException(status_code=403) # Reveals existence — also wrong
```
Endpoints must return **404, not 403**, for cross-tenant ID lookups. Never confirm that a resource exists.
### 3c. `get_tenant_context` dependency
Add to `backend/app/api/deps.py`:
```python
async def get_tenant_context(
current_user: User = Depends(get_current_active_user)
) -> uuid.UUID:
"""
Returns the current user's account_id.
Raises 403 if the user has no account association.
Inject this instead of accessing current_user.account_id directly.
"""
if current_user.account_id is None:
raise HTTPException(status_code=403, detail="User not associated with any account")
return current_user.account_id
```
### 3d. Insert pattern
All inserts on tenant tables must explicitly set `account_id`. RLS `WITH CHECK` rejects inserts where `account_id` doesn't match the session variable, but application code must also set it:
```python
new_record = Tree(
account_id=tenant_account_id, # Required — never omit
...
)
```
### 3e. Code review checklist rule
Every PR that touches a tenant table must be verified against:
1. Every SELECT/UPDATE/DELETE includes `tenant_filter()` or a wrapper that calls it
2. ID lookups filter by both `id` and `account_id` in the same query
3. Inserts explicitly set `account_id`
4. 404 (not 403) returned for cross-tenant ID lookups
5. A cross-tenant isolation test is included (Phase 2 onwards)
### 3f. CI grep check
A grep-based CI check is active from the end of Phase 0 on all PRs. It flags:
- Queries on known tenant tables that don't include `tenant_filter` or `account_id` as a filter term
- Initial implementation: warn (not block) to allow calibration; switch to block after 2 weeks of false-positive tuning
Pattern to be defined during Phase 0 implementation.
---
## Section 4: Existing Gap Fixes
### Immediate Hotfix (ships before Phase 0)
**CRITICAL: Copilot tree access bypass**
- **Files:** `backend/app/api/endpoints/copilot.py`, `backend/app/services/copilot_service.py`
- **Issue:** `start_conversation()` loads a tree by UUID without verifying the requesting user has access. An attacker who knows a tree UUID from another account can extract its full structure, node names, and descriptions via the AI system prompt.
- **Fix:** Add `can_access_tree(current_user, tree)` check in the endpoint after loading the tree. Also add `tenant_filter(Tree, account_id)` to the tree query in `copilot_service.py`. Raise 404 (not 403) if the tree is not found or not accessible.
- Ships as an independent hotfix PR, merged immediately.
### Phase 0 Fixes
**LOW: Analytics flow endpoint** (`backend/app/api/endpoints/analytics.py`)
- `GET /analytics/flows/{tree_id}` returns analytics for any tree by UUID with no access check
- Fix: Add `tenant_filter(Tree, current_user.account_id)` to the tree fetch query. 404 if not found.
**LOW: Category tree count** (`backend/app/api/endpoints/categories.py`)
- Tree count per category includes trees from all accounts
- Fix: Add `tenant_filter(Tree, current_user.account_id)` to the count subquery
**LOW: AI session scope inconsistency** (`backend/app/api/endpoints/ai_sessions.py`)
- List endpoint is user-scoped (`user_id == current_user.id`); search endpoint uses `OR(user_id, account_id)` exposing other users' session summaries within the same account
- Sessions are user-scoped only. Cross-user access permitted only via explicit escalation or session sharing
- Fix: Restrict search endpoint to `user_id == current_user.id`. List and search must behave consistently.
**Phase 0 UUID endpoint audit**
- Systematically review every endpoint with a `{resource_id}` URL parameter
- For each: verify that the ID lookup either (a) filters by `id AND account_id` in the query, or (b) calls `can_access_<resource>(current_user, resource)` on the fetched object
- Document every instance found, classify by severity, fix all before Phase 0 closes
---
## Section 5: Legacy `team_id` Migration
### 5a. TargetList — audit-gated
Before any migration work on `TargetList`:
1. Run a full codebase reference audit: grep for all references to `TargetList`, `target_list`, `target-list`
2. Query the production and staging databases for row count
3. Decision tree:
- Zero code references AND zero rows → drop the table entirely
- Zero rows but code references exist → deprecate code, then drop
- Rows exist → migrate (see sequence below)
4. Migration only proceeds after audit result is documented and approved
If migration proceeds:
- Backfill path: `team_id → teams → users WHERE is_team_admin → account_id`
- If any row cannot be backfilled to a valid `account_id`**full rollback**, no exceptions, no manual review queues
### 5b. Pre-migration: teams-to-accounts orphan check
Before any backfill using `team_id → account_id` chains, run:
```sql
SELECT COUNT(*) FROM teams t
LEFT JOIN users u ON u.team_id = t.id AND u.account_id IS NOT NULL
WHERE u.id IS NULL;
```
Count must be zero before any backfill proceeds. Report and resolve orphaned teams first.
### 5c. Approved migrations (no audit gate)
**ScriptBuilderSession:** Add `account_id`, backfill from `user_id → users.account_id`, set NOT NULL.
**ScriptTemplate:** Add `account_id`, backfill from `created_by_id → users.account_id`, set NOT NULL.
**ScriptGeneration:** Add `account_id`, backfill from `user_id → users.account_id`, set NOT NULL.
### 5d. team_id cleanup
Do NOT drop `team_id` columns during migration. Keep until all application code is updated to use `account_id` exclusively. Drop `team_id` columns in a later cleanup migration after verification.
---
## Section 6: Testing Strategy
### Phase 1: RLS validation tests
**File:** `backend/tests/test_rls_isolation.py`
This test suite validates RLS policies at the database layer, independent of any application code. It connects to the test database using the `resolutionflow_app` role.
**Setup fixture:**
- Creates two accounts (`account_a`, `account_b`) with seed rows in all tenant tables
- Creates an async DB connection using `resolutionflow_app` role
- Sets `SET LOCAL app.current_account_id = '<account_a_uuid>'` before each test
**Test cases per table (~5 cases per table, ~32 tables ≈ ~160 total):**
1. **SELECT isolation** — querying as account_a returns zero rows for account_b's data
2. **INSERT enforcement** — inserting with `account_id = account_b_uuid` raises a PostgreSQL exception (rejected by `WITH CHECK`)
3. **INSERT cross-tenant FK** — inserting with correct `account_id` but a FK value (e.g., `tree_id`) that belongs to account_b is also rejected
4. **UPDATE enforcement** — updating account_b's rows as account_a affects zero rows
5. **DELETE enforcement** — deleting account_b's rows as account_a affects zero rows
**Fail-closed test:**
```python
# Unset app.current_account_id — must raise a database exception
# Acceptable: psycopg2.errors.InvalidTextRepresentation (NULL::uuid cast fails)
# NOT acceptable: query returning zero rows silently
with pytest.raises(asyncpg.PostgresError):
await conn.execute("SELECT * FROM trees")
```
The exact exception type must be documented based on the behavior of `current_setting('app.current_account_id', false)::uuid` when unset. The test asserts on that specific exception.
**audit_logs special cases:**
- No INSERT `WITH CHECK` test — audit logs must be insertable freely
- No UPDATE or DELETE tests — those policies must not exist
- Verify via `pg_policies` that only a SELECT policy exists on `audit_logs`
**tree_shares:** Intentionally deferred. A TODO comment is included:
```python
# TODO: tree_shares RLS tests deferred pending sharing model decision.
# Must be added before RLS is enabled on the tree_shares table.
# See: docs/superpowers/specs/2026-04-09-tenant-data-isolation-design.md Section 7
```
**Tables covered in Phase 1:**
`trees`, `ai_sessions`, `ai_chat_sessions`, `ai_conversations`, `sessions`, `step_library`, `tree_categories`, `tree_tags`, `step_categories`, `flow_proposals`, `attachments`, `audit_logs`, `psa_connections`, `copilot_conversations`, `file_uploads`, `kb_imports`, `subscriptions`, `account_invites`, `ai_usage`, `notifications`, `session_shares`, `script_builder_sessions`, `script_templates`, `script_generations`, `maintenance_schedules`, `user_folders`, `user_pinned_trees`, `session_supporting_data`, `session_branches`, `session_resolution_outputs`, `psa_post_logs`, `notification_logs`
**Connection pool reuse test:**
```python
async def test_set_local_does_not_leak_between_connections():
"""SET LOCAL must not leak account_id when a connection is returned to the pool."""
conn = await pool.acquire()
async with conn.transaction():
await conn.execute("SET LOCAL app.current_account_id = $1", str(account_a_id))
# transaction ends, SET LOCAL resets
# Return conn to pool, re-acquire — verify account_id is not set
conn2 = await pool.acquire()
result = await conn2.fetchval("SELECT current_setting('app.current_account_id', true)")
assert result is None or result == ""
```
### Phase 2: Per-endpoint cross-tenant tests
As each endpoint is touched in any PR from Phase 1 onward, a cross-tenant isolation test is added in the same PR:
```python
async def test_cannot_access_other_account_<resource>(
client_account_a, account_b_resource
):
"""Account A cannot access Account B's resource by UUID."""
response = await client_account_a.get(f"/<resource>/{account_b_resource.id}")
assert response.status_code == 404 # Not 403 — never reveal existence
```
This is a hard requirement per the code review checklist.
---
## Section 7: Phased Rollout Plan
### Immediate Hotfix (before everything else)
- Fix CRITICAL: Copilot tree access bypass (see Section 4)
- Ships as independent PR, merged immediately, does not wait for Phase 0
---
### Phase 0 — Foundation
Goal: Fix all existing gaps; establish patterns and tooling; gate future PRs.
1. Fix LOW: Analytics flow endpoint missing ownership check
2. Fix LOW: Category tree count scope
3. Fix LOW: AI session search/list inconsistency (restrict both to `user_id`)
4. **Full UUID endpoint audit** — every `{resource_id}` URL param checked. All gaps documented and fixed before Phase 0 closes.
5. Add `get_tenant_context` dependency to `deps.py`
6. Add `tenant_filter()` helper to `filters.py`. Update existing filter helpers to call it.
7. **Dead code audit:** TargetList references + database row count. Report result.
8. **Orphan check:** Teams without a resolvable account_id. Report result.
9. **Define and activate CI grep check** for missing `tenant_filter()` on tenant tables. Active on all PRs from Phase 1 forward.
Gate: All gaps patched. CI grep check active. TargetList and team orphan audit results documented.
---
### Phase 1 — Schema Migration
Goal: Every tenant-relevant table has a direct, NOT NULL `account_id` column.
10. Add `account_id` to all tables in Section 1a (migration per logical domain group: core sessions, PSA, AI, steps, notifications)
11. Make nullable `account_id` NOT NULL on models from Section 1b
12. Migrate `ScriptBuilderSession`, `ScriptTemplate`, `ScriptGeneration` from `team_id` to `account_id`
13. `TargetList`: execute result of Phase 0 audit (drop or migrate)
14. Create `template_trees` and `platform_steps` tables; migrate global content
Gate: Zero NULL `account_id` values in any tenant table. All backfills verified. Database passes zero-NULL assertion query for every table in scope.
---
### Phase 2 — PostgreSQL Infrastructure
Goal: Database roles established; `SET LOCAL` wired into every request; RLS test suite green.
15. Create `resolutionflow_app` and `resolutionflow_admin` PostgreSQL roles; grant privileges
16. Update `DATABASE_URL` to `resolutionflow_app`; add `DATABASE_ADMIN_URL` for admin connections
17. Update Alembic to use `DATABASE_ADMIN_URL`
18. Modify `get_db()` to execute `SET LOCAL app.current_account_id` per request, inside transaction boundary
19. Update super admin endpoints to use `resolutionflow_admin` connection
20. Write `test_rls_isolation.py` — all ~160 RLS tests. Must be 100% green before Phase 3. Connection pool reuse test included.
21. Measure `SET LOCAL` per-request overhead baseline
Gate: All RLS tests green. All existing integration tests green. Connection pool reuse test green. Performance baseline documented.
---
### Phase 3 — Enable RLS
Goal: RLS policies active on all tenant tables. Phased by domain, not all at once.
Enable RLS in these batches. Run full test suite after each batch before proceeding:
**Batch A: Core data**
`trees`, `tree_categories`, `tree_tags`, `sessions`, `attachments`, `session_supporting_data`, `maintenance_schedules`
**Batch B: AI & sessions**
`ai_sessions`, `ai_chat_sessions`, `ai_conversations`, `ai_usage`, `ai_session_steps`, `session_branches`, `session_handoffs`, `session_resolution_outputs`, `fork_points`, `copilot_conversations`, `flow_proposals`
**Batch C: Steps & library**
`step_library`, `step_categories`, `step_ratings`, `step_usage_logs`, `ai_suggestions`, `template_trees` (no RLS), `platform_steps` (no RLS)
**Batch D: Integrations & users**
`psa_connections`, `psa_post_logs`, `psa_member_mappings`, `subscriptions`, `account_invites`, `file_uploads`, `kb_imports`, `notifications`, `notification_logs`, `session_shares`, `user_folders`, `user_pinned_trees`
**Batch E: Scripts & auth**
`script_builder_sessions`, `script_templates`, `script_generations`, `audit_logs`
For each batch migration: `ENABLE ROW LEVEL SECURITY` + `FORCE ROW LEVEL SECURITY` + create all applicable policies.
Gate: All RLS tests green after each batch. All integration tests green. Staging smoke test after Batch E. No performance regressions.
---
### Phase 4 — Ongoing
- Every PR that touches a tenant endpoint includes a cross-tenant isolation test
- CI grep check blocks PRs with missing `tenant_filter()` (warn → block after 2-week calibration)
- `tree_shares` RLS: design sharing model, implement tests, enable RLS before any sharing feature ships
- `team_id` columns: drop in a cleanup migration after all application code is fully migrated
---
## Section 8: Background Job Policy
Background jobs and scheduled tasks that process tenant data are a distinct isolation surface. Unlike request-scoped endpoints, they do not have a `current_user` and must manage tenant context explicitly.
### Policy
Every background job that touches tenant tables must comply with one of two patterns:
**Pattern A — `resolutionflow_admin` with explicit per-query `account_id` filtering**
Use when: the job is inherently cross-tenant (e.g., processes all pending work across all accounts in a single pass). The admin role bypasses RLS, so explicit `account_id` filters in every query are mandatory — RLS is not the safety net here.
```python
# Allowed: cross-tenant batch SELECT for IDs only, then loop per-account
result = await db.execute(
select(Model.id, Model.account_id).where(Model.status == "pending")
)
for row in result.all():
await _process_one(row.id, row.account_id, db) # account_id threaded through
# In the processing function: all queries must filter by account_id
async def _process_one(record_id, account_id, db):
result = await db.execute(
select(Model).where(Model.id == record_id, Model.account_id == account_id)
)
```
**Pattern B — `resolutionflow_app` with `SET LOCAL` per tenant loop iteration**
Use when: the job processes tenants one at a time and it is practical to set the tenant context per iteration.
```python
for account_id in account_ids:
async with async_session_maker() as db:
await db.execute(
text("SET LOCAL app.current_account_id = :id"),
{"id": str(account_id)}
)
# All queries in this block are RLS-enforced to this tenant
```
### No cross-tenant queries without justification
No background job may issue a SELECT, UPDATE, or DELETE that spans multiple tenants' data in a single query without explicit written justification in the code comment and in this spec. Approved cross-tenant operations are documented below.
### Inventory: Current Background Jobs
| Job | File | Pattern | Status | Notes |
|---|---|---|---|---|
| Knowledge Flywheel | `knowledge_flywheel_scheduler.py` | Admin + explicit filter | **Needs update** | Batch SELECT across all accounts with no `account_id` filter. Must thread `account_id` from the session into all processing calls. |
| PSA Retry | `psa_retry_scheduler.py` | Admin + explicit filter | **Needs update** | Batch SELECT across all accounts with no `account_id` filter. Must add `account_id` to batch query and thread through to `retry_failed_push`. |
| Chat Retention Cleanup | `retention_cleanup.py` | Admin + explicit filter | **Correct pattern** | Already loops per-account with explicit `account_id` in all queries. Model for other jobs. Needs role update to `resolutionflow_admin`. |
| Maintenance Schedule Firing | `scheduler.py` (`_fire_maintenance_schedule`) | Admin + explicit filter | **Needs update** | Fetches `MaintenanceSchedule` and `Tree` by ID without `account_id` filter. After Phase 1, `Session` creation must set `account_id`. Add `account_id` to all queries. |
| AI Conversation Expiry | `scheduler.py` (`_cleanup_expired_ai_conversations`) | Admin, cross-tenant approved | **Correct pattern** | Cross-tenant DELETE by `expires_at` is explicitly justified: rows are expired regardless of tenant. Needs role update to `resolutionflow_admin`. Document this approval in code comment. |
### Approved cross-tenant queries
The following cross-tenant operations are explicitly approved. All others require a new entry here before shipping:
| Job | Query | Justification |
|---|---|---|
| AI Conversation Expiry | `DELETE FROM ai_conversations WHERE expires_at < NOW()` | Time-based expiry is tenant-independent. Deleting expired rows does not expose data across tenants. |
| Chat Retention Cleanup | `SELECT id FROM accounts` | Required to iterate per-account. Read of account IDs only; no tenant data accessed. |
| Knowledge Flywheel (after fix) | `SELECT id, account_id FROM ai_sessions WHERE analysis_status='pending'` | Cross-tenant ID harvest only. No tenant data in the SELECT. `account_id` is threaded into per-tenant processing. |
| PSA Retry (after fix) | `SELECT id, account_id FROM psa_post_logs WHERE status='pending_retry'` | Cross-tenant ID harvest only. `account_id` threaded into per-tenant processing. |
### Checklist additions (Phase 2)
- [ ] All background jobs updated to use `resolutionflow_admin` role via `DATABASE_ADMIN_URL`
- [ ] Knowledge Flywheel: `account_id` added to batch query and threaded through to `analyze_session`
- [ ] PSA Retry: `account_id` added to batch query and threaded through to `retry_failed_push`
- [ ] Maintenance Schedule Firing: all queries include `account_id`; `Session` creation sets `account_id` after Phase 1 migration
- [ ] AI Conversation Expiry: cross-tenant approval comment added to code
- [ ] All jobs reviewed against this policy before Phase 3 (RLS enable)
---
## Section 9: Open Questions
| Question | Impact | Owner |
|---|---|---|
| Is tree sharing intra-tenant only, or can trees be shared across accounts? | Determines `tree_shares` schema, backfill strategy, and RLS policy. Tree_shares table deferred until resolved. | Product |
| What is the exact PostgreSQL exception raised when `current_setting('app.current_account_id', false)::uuid` is evaluated with no value set? | Determines the fail-closed test assertion. Must be tested in Phase 2. | Engineering |
| TargetList: zero references + zero rows? Or does data exist? | Determines whether table is dropped or migrated. | Phase 0 audit |
---
## Appendix: Pre-Implementation Checklist
Everything that must be in place before writing feature code on any tenant-data endpoint:
### Foundation (Phase 0)
- [ ] Copilot tree access bypass hotfix shipped
- [ ] Analytics flow endpoint ownership check added
- [ ] Category tree count scoped to account
- [ ] AI session list and search both restricted to `user_id`
- [ ] Full UUID endpoint audit completed; all gaps documented and fixed
- [ ] `get_tenant_context` dependency added to `deps.py`
- [ ] `tenant_filter()` helper added to `filters.py`
- [ ] Existing filter helpers updated to use `tenant_filter()` internally
- [ ] TargetList dead code audit result documented
- [ ] Teams orphan count query run and result documented
- [ ] CI grep check defined and active
### Schema (Phase 1)
- [ ] `account_id NOT NULL` on all tables in Section 1a denormalization list
- [ ] `account_id NOT NULL` on all existing nullable models from Section 1b
- [ ] ScriptBuilderSession, ScriptTemplate, ScriptGeneration migrated from team_id
- [ ] TargetList: dropped or migrated per audit result
- [ ] template_trees and platform_steps tables created
- [ ] Zero NULL assertion passes for every tenant table
- [ ] Migration sequence (add nullable → backfill → verify → NOT NULL → index) followed for each table
### Infrastructure (Phase 2)
- [ ] `resolutionflow_app` role created with correct privileges
- [ ] `resolutionflow_admin` role created with `BYPASSRLS`
- [ ] `DATABASE_URL` updated to `resolutionflow_app`
- [ ] `DATABASE_ADMIN_URL` added for admin connections
- [ ] Alembic uses `DATABASE_ADMIN_URL`
- [ ] `get_db()` executes `SET LOCAL app.current_account_id` per request inside transaction
- [ ] Super admin endpoints use admin connection
- [ ] All background jobs updated to use `resolutionflow_admin` role (see Section 8)
- [ ] Knowledge Flywheel: `account_id` threaded through batch query and `analyze_session`
- [ ] PSA Retry: `account_id` threaded through batch query and `retry_failed_push`
- [ ] Maintenance Schedule: all queries include `account_id`; `Session` creation sets `account_id`
- [ ] AI Conversation Expiry: cross-tenant approval comment added to code
- [ ] `test_rls_isolation.py` written with ~160 test cases
- [ ] All RLS tests pass (100%)
- [ ] Connection pool reuse test passes
- [ ] Fail-closed exception documented and asserted
- [ ] Performance baseline for `SET LOCAL` overhead documented
### RLS Active (Phase 3)
- [ ] Batch A RLS policies applied and all tests green
- [ ] Batch B RLS policies applied and all tests green
- [ ] Batch C RLS policies applied and all tests green
- [ ] Batch D RLS policies applied and all tests green
- [ ] Batch E RLS policies applied and all tests green
- [ ] Staging smoke test passed
- [ ] All existing integration tests green with RLS active
### Ongoing Standards
- [ ] Every new endpoint PR includes cross-tenant isolation test
- [ ] CI grep check blocks PRs with missing `tenant_filter()`
- [ ] `tree_shares` deferred — not enabled until sharing model documented
- [ ] `team_id` cleanup migration scheduled after full migration complete