fix: CRITICAL — scope copilot tree query to current account (#131)
* docs: add tenant data isolation design spec Complete architecture plan for multi-tenant data isolation across all layers (PostgreSQL RLS, application-layer filtering, schema migration, testing strategy, and phased rollout checklist). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: add background job isolation policy to tenant isolation spec Documents policy for all 5 existing background jobs: - Knowledge Flywheel and PSA Retry flagged for account_id threading - Chat Retention already follows correct pattern (model for others) - Maintenance Schedule Firing needs account_id in queries + Session creation - AI Conversation Expiry approved as cross-tenant with justification Adds approved cross-tenant query registry and Phase 2 checklist items. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: add tenant isolation Phase 0 implementation plan 8 tasks covering: CRITICAL copilot hotfix, tenant_filter() helper, get_tenant_context dependency, analytics/category/AI session gap fixes, full UUID endpoint audit, TargetList dead code audit, teams orphan check, and CI grep check for missing tenant filters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: CRITICAL — scope copilot tree query to current account A user who knew another account's tree UUID could start a copilot conversation, causing the tree's full node structure, names, and descriptions to be sent to the AI as part of the system prompt. Fix: add account_id (or is_default / visibility='public') filter to the tree SELECT in copilot_service.start_conversation(). Returns 404 for inaccessible trees. Test added in test_tenant_isolation_p0.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit was merged in pull request #131.
This commit is contained in:
1136
docs/superpowers/plans/2026-04-09-tenant-isolation-phase-0.md
Normal file
1136
docs/superpowers/plans/2026-04-09-tenant-isolation-phase-0.md
Normal file
File diff suppressed because it is too large
Load Diff
@@ -0,0 +1,654 @@
|
||||
# Tenant Data Isolation — Design Spec
|
||||
|
||||
> **Date:** 2026-04-09
|
||||
> **Status:** Approved — ready for implementation planning
|
||||
> **Approach:** PostgreSQL RLS as safety net; application-layer filtering as primary enforcement
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
ResolutionFlow is a multi-tenant SaaS. The tenant boundary is `account_id` (UUID foreign key to `accounts.id`). Two accounts must be completely isolated at every layer: one client must never be able to see, access, query, or receive data belonging to another client — even if there is a bug in application code.
|
||||
|
||||
This spec establishes the complete foundation that must be in place before further feature development proceeds.
|
||||
|
||||
### Design Principle
|
||||
|
||||
**Application code is the primary enforcement mechanism. PostgreSQL Row-Level Security (RLS) is the safety net.**
|
||||
|
||||
Both layers must always be present. Developers must never rely on RLS to do filtering that application code should be doing. A missing `tenant_filter()` in application code is a code review failure even if RLS would have caught it.
|
||||
|
||||
---
|
||||
|
||||
## Current State
|
||||
|
||||
- Tenant boundary: `account_id` (not `team_id` — that column is legacy)
|
||||
- Isolation today: 100% application-layer, manual per-endpoint `.where(account_id == ...)`
|
||||
- No RLS, no DB session variables, no middleware tenant injection
|
||||
- Nullable `account_id` on ~8 core models (User, Tree, TreeCategory, etc.)
|
||||
- ~20 tables with no direct `account_id` (scoped only through join chains)
|
||||
- 4 models with `team_id` only, no `account_id`: TargetList, ScriptBuilderSession, ScriptTemplate, ScriptGeneration
|
||||
- Known existing gaps: see Section 4
|
||||
|
||||
---
|
||||
|
||||
## Section 1: Schema Changes
|
||||
|
||||
### 1a. Denormalize `account_id` onto all tenant-relevant tables
|
||||
|
||||
Add `account_id UUID NOT NULL` (with FK to `accounts.id` and index) to each of the following tables that currently lack it:
|
||||
|
||||
| Table | Backfill path |
|
||||
|---|---|
|
||||
| `sessions` | `sessions.user_id → users.account_id` |
|
||||
| `attachments` | `attachments.session_id → sessions.user_id → users.account_id` |
|
||||
| `session_supporting_data` | `session_supporting_data.session_id → sessions.user_id → users.account_id` |
|
||||
| `audit_logs` | `audit_logs.user_id → users.account_id` |
|
||||
| `maintenance_schedules` | `maintenance_schedules.tree_id → trees.account_id` |
|
||||
| `user_folders` | `user_folders.user_id → users.account_id` |
|
||||
| `user_pinned_trees` | `user_pinned_trees.user_id → users.account_id` |
|
||||
| `session_branches` | `session_branches.ai_session_id → ai_sessions.account_id` (verify column name — may be `session_id`) |
|
||||
| `session_handoffs` | `session_handoffs.ai_session_id → ai_sessions.account_id` (verify column name — may be `session_id`) |
|
||||
| `session_resolution_outputs` | `session_resolution_outputs.session_id → sessions.user_id → users.account_id` |
|
||||
| `fork_points` | `fork_points.session_id → ai_sessions.account_id` |
|
||||
| `ai_session_steps` | `ai_session_steps.session_id → ai_sessions.account_id` |
|
||||
| `ai_suggestions` | `ai_suggestions.tree_id → trees.account_id` |
|
||||
| `step_ratings` | `step_ratings.user_id → users.account_id` (backfill from rater, not the step) |
|
||||
| `step_usage_logs` | `step_usage_logs.user_id → users.account_id` (backfill from user, not the step) |
|
||||
| `psa_post_logs` | `psa_post_logs.psa_connection_id → psa_connections.account_id` (psa_connections already has account_id NOT NULL) |
|
||||
| `psa_member_mappings` | `psa_member_mappings.psa_connection_id → psa_connections.account_id` |
|
||||
| `notification_logs` | `notification_logs.notification_config_id → notification_configs.account_id` |
|
||||
|
||||
**Deferred:** `tree_shares` — backfill strategy and RLS policy depend on whether sharing is intra-tenant only or cross-tenant. Must be resolved before RLS is enabled on this table. See Section 7 (Open Questions).
|
||||
|
||||
### 1b. Make existing nullable `account_id` columns NOT NULL
|
||||
|
||||
These tables have `account_id` already but it is nullable:
|
||||
|
||||
- `users`
|
||||
- `trees`
|
||||
- `tree_categories`
|
||||
- `tree_tags`
|
||||
- `step_categories`
|
||||
- `step_library`
|
||||
- `tree_embeddings`
|
||||
- `feedback`
|
||||
|
||||
Action: Backfill any NULL rows (assign to correct account or delete if orphaned test/seed data), then `ALTER COLUMN account_id SET NOT NULL`.
|
||||
|
||||
### 1c. Global / template content
|
||||
|
||||
Move global content out of tenant tables into dedicated tables with no RLS:
|
||||
|
||||
- **`template_trees`** — default and publicly visible trees (currently `is_default=True` or `visibility='public'` in `trees`). All tenants can read; no RLS.
|
||||
- **`platform_steps`** — public steps from step library (currently `visibility='public'` in `step_library`). All tenants can read; no RLS.
|
||||
- Tables already global (no change needed): `script_categories`, `platform_settings`, `plan_limits`, `feature_flags`, `plan_feature_defaults`.
|
||||
|
||||
### 1d. Migration sequence (per table)
|
||||
|
||||
Every table that gains `account_id` follows this exact sequence. Each step must succeed before the next begins:
|
||||
|
||||
1. `ADD COLUMN account_id UUID` (nullable)
|
||||
2. Backfill via `UPDATE ... JOIN ...`
|
||||
3. `SELECT COUNT(*) WHERE account_id IS NULL` — must be zero before proceeding
|
||||
4. `ALTER COLUMN account_id SET NOT NULL`
|
||||
5. `CREATE INDEX ON <table>(account_id)`
|
||||
6. Enable RLS (in Phase 3, not here)
|
||||
|
||||
Any migration that cannot reach step 3 (zero NULLs) must roll back completely. No partial state.
|
||||
|
||||
---
|
||||
|
||||
## Section 2: PostgreSQL Roles & RLS Infrastructure
|
||||
|
||||
### 2a. Database roles
|
||||
|
||||
Two PostgreSQL roles:
|
||||
|
||||
**`resolutionflow_app`**
|
||||
- Used by all application requests
|
||||
- Subject to all RLS policies
|
||||
- Standard table privileges: `SELECT, INSERT, UPDATE, DELETE` on all tenant tables
|
||||
|
||||
**`resolutionflow_admin`**
|
||||
- Used by: Alembic migrations, seed scripts, super admin API endpoints, scheduled background jobs requiring cross-tenant access
|
||||
- Has `BYPASSRLS` attribute — not subject to RLS policies
|
||||
- Same table privileges as `resolutionflow_app`
|
||||
- Connection string exposed as `DATABASE_ADMIN_URL` env var
|
||||
|
||||
The current `postgres` superuser is replaced in the application by these two roles. The connection string in `DATABASE_URL` transitions to `resolutionflow_app`. Alembic uses `DATABASE_URL_SYNC` pointing to `resolutionflow_admin`.
|
||||
|
||||
### 2b. Per-request tenant context injection
|
||||
|
||||
Every request that passes through `get_db()` must execute, inside a transaction boundary:
|
||||
|
||||
```sql
|
||||
SET LOCAL app.current_account_id = '<account_uuid>';
|
||||
```
|
||||
|
||||
`SET LOCAL` is transaction-scoped — it resets automatically when the transaction ends. No cleanup needed. This is implemented in a modified `get_db()` dependency that receives `current_user` and executes the SET before yielding the session.
|
||||
|
||||
**Fail-closed behavior:** If `app.current_account_id` is not set (e.g., a bug where `SET LOCAL` was skipped), `current_setting('app.current_account_id', false)::uuid` returns NULL. `NULL = NULL` is false in SQL — the RLS policy matches zero rows. This is the correct fail-closed behavior.
|
||||
|
||||
Public endpoints (no authenticated user) do not call `SET LOCAL`. RLS matches zero rows for all tenant tables. This is correct.
|
||||
|
||||
### 2c. RLS policy pattern
|
||||
|
||||
Every tenant table (from Section 1 + all existing tables with account_id) gets:
|
||||
|
||||
```sql
|
||||
ALTER TABLE <table> ENABLE ROW LEVEL SECURITY;
|
||||
ALTER TABLE <table> FORCE ROW LEVEL SECURITY;
|
||||
|
||||
-- SELECT
|
||||
CREATE POLICY tenant_select ON <table> FOR SELECT
|
||||
USING (account_id = current_setting('app.current_account_id', false)::uuid);
|
||||
|
||||
-- INSERT
|
||||
CREATE POLICY tenant_insert ON <table> FOR INSERT
|
||||
WITH CHECK (account_id = current_setting('app.current_account_id', false)::uuid);
|
||||
|
||||
-- UPDATE
|
||||
CREATE POLICY tenant_update ON <table> FOR UPDATE
|
||||
USING (account_id = current_setting('app.current_account_id', false)::uuid)
|
||||
WITH CHECK (account_id = current_setting('app.current_account_id', false)::uuid);
|
||||
|
||||
-- DELETE
|
||||
CREATE POLICY tenant_delete ON <table> FOR DELETE
|
||||
USING (account_id = current_setting('app.current_account_id', false)::uuid);
|
||||
```
|
||||
|
||||
**`FORCE ROW LEVEL SECURITY`** ensures the table owner is also subject to policies. The `resolutionflow_admin` role bypasses via its `BYPASSRLS` attribute, not via ownership.
|
||||
|
||||
**`audit_logs` exception:** SELECT policy only. No `WITH CHECK` on INSERT (app inserts audit logs freely). No UPDATE or DELETE policies ever. These constraints are permanent and must be documented in the migration comment.
|
||||
|
||||
**Global tables** (`platform_settings`, `plan_limits`, `feature_flags`, `plan_feature_defaults`, `template_trees`, `platform_steps`): No RLS.
|
||||
|
||||
### 2d. Connection pool reuse safety
|
||||
|
||||
The `SET LOCAL` approach is transaction-scoped. A connection pool reuse test (see Section 6) must verify that a connection returned to the pool after tenant A's request does not carry tenant A's `account_id` into tenant B's request. This is guaranteed by `SET LOCAL` (resets on transaction end) but must be explicitly verified.
|
||||
|
||||
---
|
||||
|
||||
## Section 3: Application-Layer Enforcement Patterns
|
||||
|
||||
### 3a. `tenant_filter()` helper
|
||||
|
||||
Add to `backend/app/core/filters.py`:
|
||||
|
||||
```python
|
||||
def tenant_filter(model, account_id: uuid.UUID):
|
||||
"""
|
||||
Primary app-layer tenant filter.
|
||||
MUST be used in every SELECT/UPDATE/DELETE on tenant tables.
|
||||
RLS is the safety net — this is the primary enforcement.
|
||||
"""
|
||||
return model.account_id == account_id
|
||||
```
|
||||
|
||||
All existing filter helpers (`build_tree_access_filter`, `build_step_visibility_filter`) must internally call `tenant_filter()` as their base constraint.
|
||||
|
||||
### 3b. Fetch-and-verify pattern
|
||||
|
||||
For ID-based lookups, filter by **both** `id` AND `account_id` in the query — not fetch-then-check:
|
||||
|
||||
```python
|
||||
# Correct
|
||||
stmt = select(Tree).where(
|
||||
Tree.id == tree_id,
|
||||
tenant_filter(Tree, current_user.account_id)
|
||||
)
|
||||
tree = (await db.execute(stmt)).scalar_one_or_none()
|
||||
if not tree:
|
||||
raise HTTPException(status_code=404)
|
||||
|
||||
# Prohibited — fetch first, then check
|
||||
tree = await db.get(Tree, tree_id)
|
||||
if tree.account_id != current_user.account_id:
|
||||
raise HTTPException(status_code=403) # Reveals existence — also wrong
|
||||
```
|
||||
|
||||
Endpoints must return **404, not 403**, for cross-tenant ID lookups. Never confirm that a resource exists.
|
||||
|
||||
### 3c. `get_tenant_context` dependency
|
||||
|
||||
Add to `backend/app/api/deps.py`:
|
||||
|
||||
```python
|
||||
async def get_tenant_context(
|
||||
current_user: User = Depends(get_current_active_user)
|
||||
) -> uuid.UUID:
|
||||
"""
|
||||
Returns the current user's account_id.
|
||||
Raises 403 if the user has no account association.
|
||||
Inject this instead of accessing current_user.account_id directly.
|
||||
"""
|
||||
if current_user.account_id is None:
|
||||
raise HTTPException(status_code=403, detail="User not associated with any account")
|
||||
return current_user.account_id
|
||||
```
|
||||
|
||||
### 3d. Insert pattern
|
||||
|
||||
All inserts on tenant tables must explicitly set `account_id`. RLS `WITH CHECK` rejects inserts where `account_id` doesn't match the session variable, but application code must also set it:
|
||||
|
||||
```python
|
||||
new_record = Tree(
|
||||
account_id=tenant_account_id, # Required — never omit
|
||||
...
|
||||
)
|
||||
```
|
||||
|
||||
### 3e. Code review checklist rule
|
||||
|
||||
Every PR that touches a tenant table must be verified against:
|
||||
|
||||
1. Every SELECT/UPDATE/DELETE includes `tenant_filter()` or a wrapper that calls it
|
||||
2. ID lookups filter by both `id` and `account_id` in the same query
|
||||
3. Inserts explicitly set `account_id`
|
||||
4. 404 (not 403) returned for cross-tenant ID lookups
|
||||
5. A cross-tenant isolation test is included (Phase 2 onwards)
|
||||
|
||||
### 3f. CI grep check
|
||||
|
||||
A grep-based CI check is active from the end of Phase 0 on all PRs. It flags:
|
||||
- Queries on known tenant tables that don't include `tenant_filter` or `account_id` as a filter term
|
||||
- Initial implementation: warn (not block) to allow calibration; switch to block after 2 weeks of false-positive tuning
|
||||
|
||||
Pattern to be defined during Phase 0 implementation.
|
||||
|
||||
---
|
||||
|
||||
## Section 4: Existing Gap Fixes
|
||||
|
||||
### Immediate Hotfix (ships before Phase 0)
|
||||
|
||||
**CRITICAL: Copilot tree access bypass**
|
||||
- **Files:** `backend/app/api/endpoints/copilot.py`, `backend/app/services/copilot_service.py`
|
||||
- **Issue:** `start_conversation()` loads a tree by UUID without verifying the requesting user has access. An attacker who knows a tree UUID from another account can extract its full structure, node names, and descriptions via the AI system prompt.
|
||||
- **Fix:** Add `can_access_tree(current_user, tree)` check in the endpoint after loading the tree. Also add `tenant_filter(Tree, account_id)` to the tree query in `copilot_service.py`. Raise 404 (not 403) if the tree is not found or not accessible.
|
||||
- Ships as an independent hotfix PR, merged immediately.
|
||||
|
||||
### Phase 0 Fixes
|
||||
|
||||
**LOW: Analytics flow endpoint** (`backend/app/api/endpoints/analytics.py`)
|
||||
- `GET /analytics/flows/{tree_id}` returns analytics for any tree by UUID with no access check
|
||||
- Fix: Add `tenant_filter(Tree, current_user.account_id)` to the tree fetch query. 404 if not found.
|
||||
|
||||
**LOW: Category tree count** (`backend/app/api/endpoints/categories.py`)
|
||||
- Tree count per category includes trees from all accounts
|
||||
- Fix: Add `tenant_filter(Tree, current_user.account_id)` to the count subquery
|
||||
|
||||
**LOW: AI session scope inconsistency** (`backend/app/api/endpoints/ai_sessions.py`)
|
||||
- List endpoint is user-scoped (`user_id == current_user.id`); search endpoint uses `OR(user_id, account_id)` exposing other users' session summaries within the same account
|
||||
- Sessions are user-scoped only. Cross-user access permitted only via explicit escalation or session sharing
|
||||
- Fix: Restrict search endpoint to `user_id == current_user.id`. List and search must behave consistently.
|
||||
|
||||
**Phase 0 UUID endpoint audit**
|
||||
- Systematically review every endpoint with a `{resource_id}` URL parameter
|
||||
- For each: verify that the ID lookup either (a) filters by `id AND account_id` in the query, or (b) calls `can_access_<resource>(current_user, resource)` on the fetched object
|
||||
- Document every instance found, classify by severity, fix all before Phase 0 closes
|
||||
|
||||
---
|
||||
|
||||
## Section 5: Legacy `team_id` Migration
|
||||
|
||||
### 5a. TargetList — audit-gated
|
||||
|
||||
Before any migration work on `TargetList`:
|
||||
|
||||
1. Run a full codebase reference audit: grep for all references to `TargetList`, `target_list`, `target-list`
|
||||
2. Query the production and staging databases for row count
|
||||
3. Decision tree:
|
||||
- Zero code references AND zero rows → drop the table entirely
|
||||
- Zero rows but code references exist → deprecate code, then drop
|
||||
- Rows exist → migrate (see sequence below)
|
||||
4. Migration only proceeds after audit result is documented and approved
|
||||
|
||||
If migration proceeds:
|
||||
- Backfill path: `team_id → teams → users WHERE is_team_admin → account_id`
|
||||
- If any row cannot be backfilled to a valid `account_id` → **full rollback**, no exceptions, no manual review queues
|
||||
|
||||
### 5b. Pre-migration: teams-to-accounts orphan check
|
||||
|
||||
Before any backfill using `team_id → account_id` chains, run:
|
||||
|
||||
```sql
|
||||
SELECT COUNT(*) FROM teams t
|
||||
LEFT JOIN users u ON u.team_id = t.id AND u.account_id IS NOT NULL
|
||||
WHERE u.id IS NULL;
|
||||
```
|
||||
|
||||
Count must be zero before any backfill proceeds. Report and resolve orphaned teams first.
|
||||
|
||||
### 5c. Approved migrations (no audit gate)
|
||||
|
||||
**ScriptBuilderSession:** Add `account_id`, backfill from `user_id → users.account_id`, set NOT NULL.
|
||||
|
||||
**ScriptTemplate:** Add `account_id`, backfill from `created_by_id → users.account_id`, set NOT NULL.
|
||||
|
||||
**ScriptGeneration:** Add `account_id`, backfill from `user_id → users.account_id`, set NOT NULL.
|
||||
|
||||
### 5d. team_id cleanup
|
||||
|
||||
Do NOT drop `team_id` columns during migration. Keep until all application code is updated to use `account_id` exclusively. Drop `team_id` columns in a later cleanup migration after verification.
|
||||
|
||||
---
|
||||
|
||||
## Section 6: Testing Strategy
|
||||
|
||||
### Phase 1: RLS validation tests
|
||||
|
||||
**File:** `backend/tests/test_rls_isolation.py`
|
||||
|
||||
This test suite validates RLS policies at the database layer, independent of any application code. It connects to the test database using the `resolutionflow_app` role.
|
||||
|
||||
**Setup fixture:**
|
||||
- Creates two accounts (`account_a`, `account_b`) with seed rows in all tenant tables
|
||||
- Creates an async DB connection using `resolutionflow_app` role
|
||||
- Sets `SET LOCAL app.current_account_id = '<account_a_uuid>'` before each test
|
||||
|
||||
**Test cases per table (~5 cases per table, ~32 tables ≈ ~160 total):**
|
||||
|
||||
1. **SELECT isolation** — querying as account_a returns zero rows for account_b's data
|
||||
2. **INSERT enforcement** — inserting with `account_id = account_b_uuid` raises a PostgreSQL exception (rejected by `WITH CHECK`)
|
||||
3. **INSERT cross-tenant FK** — inserting with correct `account_id` but a FK value (e.g., `tree_id`) that belongs to account_b is also rejected
|
||||
4. **UPDATE enforcement** — updating account_b's rows as account_a affects zero rows
|
||||
5. **DELETE enforcement** — deleting account_b's rows as account_a affects zero rows
|
||||
|
||||
**Fail-closed test:**
|
||||
```python
|
||||
# Unset app.current_account_id — must raise a database exception
|
||||
# Acceptable: psycopg2.errors.InvalidTextRepresentation (NULL::uuid cast fails)
|
||||
# NOT acceptable: query returning zero rows silently
|
||||
with pytest.raises(asyncpg.PostgresError):
|
||||
await conn.execute("SELECT * FROM trees")
|
||||
```
|
||||
|
||||
The exact exception type must be documented based on the behavior of `current_setting('app.current_account_id', false)::uuid` when unset. The test asserts on that specific exception.
|
||||
|
||||
**audit_logs special cases:**
|
||||
- No INSERT `WITH CHECK` test — audit logs must be insertable freely
|
||||
- No UPDATE or DELETE tests — those policies must not exist
|
||||
- Verify via `pg_policies` that only a SELECT policy exists on `audit_logs`
|
||||
|
||||
**tree_shares:** Intentionally deferred. A TODO comment is included:
|
||||
```python
|
||||
# TODO: tree_shares RLS tests deferred pending sharing model decision.
|
||||
# Must be added before RLS is enabled on the tree_shares table.
|
||||
# See: docs/superpowers/specs/2026-04-09-tenant-data-isolation-design.md Section 7
|
||||
```
|
||||
|
||||
**Tables covered in Phase 1:**
|
||||
`trees`, `ai_sessions`, `ai_chat_sessions`, `ai_conversations`, `sessions`, `step_library`, `tree_categories`, `tree_tags`, `step_categories`, `flow_proposals`, `attachments`, `audit_logs`, `psa_connections`, `copilot_conversations`, `file_uploads`, `kb_imports`, `subscriptions`, `account_invites`, `ai_usage`, `notifications`, `session_shares`, `script_builder_sessions`, `script_templates`, `script_generations`, `maintenance_schedules`, `user_folders`, `user_pinned_trees`, `session_supporting_data`, `session_branches`, `session_resolution_outputs`, `psa_post_logs`, `notification_logs`
|
||||
|
||||
**Connection pool reuse test:**
|
||||
```python
|
||||
async def test_set_local_does_not_leak_between_connections():
|
||||
"""SET LOCAL must not leak account_id when a connection is returned to the pool."""
|
||||
conn = await pool.acquire()
|
||||
async with conn.transaction():
|
||||
await conn.execute("SET LOCAL app.current_account_id = $1", str(account_a_id))
|
||||
# transaction ends, SET LOCAL resets
|
||||
# Return conn to pool, re-acquire — verify account_id is not set
|
||||
conn2 = await pool.acquire()
|
||||
result = await conn2.fetchval("SELECT current_setting('app.current_account_id', true)")
|
||||
assert result is None or result == ""
|
||||
```
|
||||
|
||||
### Phase 2: Per-endpoint cross-tenant tests
|
||||
|
||||
As each endpoint is touched in any PR from Phase 1 onward, a cross-tenant isolation test is added in the same PR:
|
||||
|
||||
```python
|
||||
async def test_cannot_access_other_account_<resource>(
|
||||
client_account_a, account_b_resource
|
||||
):
|
||||
"""Account A cannot access Account B's resource by UUID."""
|
||||
response = await client_account_a.get(f"/<resource>/{account_b_resource.id}")
|
||||
assert response.status_code == 404 # Not 403 — never reveal existence
|
||||
```
|
||||
|
||||
This is a hard requirement per the code review checklist.
|
||||
|
||||
---
|
||||
|
||||
## Section 7: Phased Rollout Plan
|
||||
|
||||
### Immediate Hotfix (before everything else)
|
||||
|
||||
- Fix CRITICAL: Copilot tree access bypass (see Section 4)
|
||||
- Ships as independent PR, merged immediately, does not wait for Phase 0
|
||||
|
||||
---
|
||||
|
||||
### Phase 0 — Foundation
|
||||
|
||||
Goal: Fix all existing gaps; establish patterns and tooling; gate future PRs.
|
||||
|
||||
1. Fix LOW: Analytics flow endpoint missing ownership check
|
||||
2. Fix LOW: Category tree count scope
|
||||
3. Fix LOW: AI session search/list inconsistency (restrict both to `user_id`)
|
||||
4. **Full UUID endpoint audit** — every `{resource_id}` URL param checked. All gaps documented and fixed before Phase 0 closes.
|
||||
5. Add `get_tenant_context` dependency to `deps.py`
|
||||
6. Add `tenant_filter()` helper to `filters.py`. Update existing filter helpers to call it.
|
||||
7. **Dead code audit:** TargetList references + database row count. Report result.
|
||||
8. **Orphan check:** Teams without a resolvable account_id. Report result.
|
||||
9. **Define and activate CI grep check** for missing `tenant_filter()` on tenant tables. Active on all PRs from Phase 1 forward.
|
||||
|
||||
Gate: All gaps patched. CI grep check active. TargetList and team orphan audit results documented.
|
||||
|
||||
---
|
||||
|
||||
### Phase 1 — Schema Migration
|
||||
|
||||
Goal: Every tenant-relevant table has a direct, NOT NULL `account_id` column.
|
||||
|
||||
10. Add `account_id` to all tables in Section 1a (migration per logical domain group: core sessions, PSA, AI, steps, notifications)
|
||||
11. Make nullable `account_id` NOT NULL on models from Section 1b
|
||||
12. Migrate `ScriptBuilderSession`, `ScriptTemplate`, `ScriptGeneration` from `team_id` to `account_id`
|
||||
13. `TargetList`: execute result of Phase 0 audit (drop or migrate)
|
||||
14. Create `template_trees` and `platform_steps` tables; migrate global content
|
||||
|
||||
Gate: Zero NULL `account_id` values in any tenant table. All backfills verified. Database passes zero-NULL assertion query for every table in scope.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — PostgreSQL Infrastructure
|
||||
|
||||
Goal: Database roles established; `SET LOCAL` wired into every request; RLS test suite green.
|
||||
|
||||
15. Create `resolutionflow_app` and `resolutionflow_admin` PostgreSQL roles; grant privileges
|
||||
16. Update `DATABASE_URL` to `resolutionflow_app`; add `DATABASE_ADMIN_URL` for admin connections
|
||||
17. Update Alembic to use `DATABASE_ADMIN_URL`
|
||||
18. Modify `get_db()` to execute `SET LOCAL app.current_account_id` per request, inside transaction boundary
|
||||
19. Update super admin endpoints to use `resolutionflow_admin` connection
|
||||
20. Write `test_rls_isolation.py` — all ~160 RLS tests. Must be 100% green before Phase 3. Connection pool reuse test included.
|
||||
21. Measure `SET LOCAL` per-request overhead baseline
|
||||
|
||||
Gate: All RLS tests green. All existing integration tests green. Connection pool reuse test green. Performance baseline documented.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — Enable RLS
|
||||
|
||||
Goal: RLS policies active on all tenant tables. Phased by domain, not all at once.
|
||||
|
||||
Enable RLS in these batches. Run full test suite after each batch before proceeding:
|
||||
|
||||
**Batch A: Core data**
|
||||
`trees`, `tree_categories`, `tree_tags`, `sessions`, `attachments`, `session_supporting_data`, `maintenance_schedules`
|
||||
|
||||
**Batch B: AI & sessions**
|
||||
`ai_sessions`, `ai_chat_sessions`, `ai_conversations`, `ai_usage`, `ai_session_steps`, `session_branches`, `session_handoffs`, `session_resolution_outputs`, `fork_points`, `copilot_conversations`, `flow_proposals`
|
||||
|
||||
**Batch C: Steps & library**
|
||||
`step_library`, `step_categories`, `step_ratings`, `step_usage_logs`, `ai_suggestions`, `template_trees` (no RLS), `platform_steps` (no RLS)
|
||||
|
||||
**Batch D: Integrations & users**
|
||||
`psa_connections`, `psa_post_logs`, `psa_member_mappings`, `subscriptions`, `account_invites`, `file_uploads`, `kb_imports`, `notifications`, `notification_logs`, `session_shares`, `user_folders`, `user_pinned_trees`
|
||||
|
||||
**Batch E: Scripts & auth**
|
||||
`script_builder_sessions`, `script_templates`, `script_generations`, `audit_logs`
|
||||
|
||||
For each batch migration: `ENABLE ROW LEVEL SECURITY` + `FORCE ROW LEVEL SECURITY` + create all applicable policies.
|
||||
|
||||
Gate: All RLS tests green after each batch. All integration tests green. Staging smoke test after Batch E. No performance regressions.
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Ongoing
|
||||
|
||||
- Every PR that touches a tenant endpoint includes a cross-tenant isolation test
|
||||
- CI grep check blocks PRs with missing `tenant_filter()` (warn → block after 2-week calibration)
|
||||
- `tree_shares` RLS: design sharing model, implement tests, enable RLS before any sharing feature ships
|
||||
- `team_id` columns: drop in a cleanup migration after all application code is fully migrated
|
||||
|
||||
---
|
||||
|
||||
## Section 8: Background Job Policy
|
||||
|
||||
Background jobs and scheduled tasks that process tenant data are a distinct isolation surface. Unlike request-scoped endpoints, they do not have a `current_user` and must manage tenant context explicitly.
|
||||
|
||||
### Policy
|
||||
|
||||
Every background job that touches tenant tables must comply with one of two patterns:
|
||||
|
||||
**Pattern A — `resolutionflow_admin` with explicit per-query `account_id` filtering**
|
||||
|
||||
Use when: the job is inherently cross-tenant (e.g., processes all pending work across all accounts in a single pass). The admin role bypasses RLS, so explicit `account_id` filters in every query are mandatory — RLS is not the safety net here.
|
||||
|
||||
```python
|
||||
# Allowed: cross-tenant batch SELECT for IDs only, then loop per-account
|
||||
result = await db.execute(
|
||||
select(Model.id, Model.account_id).where(Model.status == "pending")
|
||||
)
|
||||
for row in result.all():
|
||||
await _process_one(row.id, row.account_id, db) # account_id threaded through
|
||||
|
||||
# In the processing function: all queries must filter by account_id
|
||||
async def _process_one(record_id, account_id, db):
|
||||
result = await db.execute(
|
||||
select(Model).where(Model.id == record_id, Model.account_id == account_id)
|
||||
)
|
||||
```
|
||||
|
||||
**Pattern B — `resolutionflow_app` with `SET LOCAL` per tenant loop iteration**
|
||||
|
||||
Use when: the job processes tenants one at a time and it is practical to set the tenant context per iteration.
|
||||
|
||||
```python
|
||||
for account_id in account_ids:
|
||||
async with async_session_maker() as db:
|
||||
await db.execute(
|
||||
text("SET LOCAL app.current_account_id = :id"),
|
||||
{"id": str(account_id)}
|
||||
)
|
||||
# All queries in this block are RLS-enforced to this tenant
|
||||
```
|
||||
|
||||
### No cross-tenant queries without justification
|
||||
|
||||
No background job may issue a SELECT, UPDATE, or DELETE that spans multiple tenants' data in a single query without explicit written justification in the code comment and in this spec. Approved cross-tenant operations are documented below.
|
||||
|
||||
### Inventory: Current Background Jobs
|
||||
|
||||
| Job | File | Pattern | Status | Notes |
|
||||
|---|---|---|---|---|
|
||||
| Knowledge Flywheel | `knowledge_flywheel_scheduler.py` | Admin + explicit filter | **Needs update** | Batch SELECT across all accounts with no `account_id` filter. Must thread `account_id` from the session into all processing calls. |
|
||||
| PSA Retry | `psa_retry_scheduler.py` | Admin + explicit filter | **Needs update** | Batch SELECT across all accounts with no `account_id` filter. Must add `account_id` to batch query and thread through to `retry_failed_push`. |
|
||||
| Chat Retention Cleanup | `retention_cleanup.py` | Admin + explicit filter | **Correct pattern** | Already loops per-account with explicit `account_id` in all queries. Model for other jobs. Needs role update to `resolutionflow_admin`. |
|
||||
| Maintenance Schedule Firing | `scheduler.py` (`_fire_maintenance_schedule`) | Admin + explicit filter | **Needs update** | Fetches `MaintenanceSchedule` and `Tree` by ID without `account_id` filter. After Phase 1, `Session` creation must set `account_id`. Add `account_id` to all queries. |
|
||||
| AI Conversation Expiry | `scheduler.py` (`_cleanup_expired_ai_conversations`) | Admin, cross-tenant approved | **Correct pattern** | Cross-tenant DELETE by `expires_at` is explicitly justified: rows are expired regardless of tenant. Needs role update to `resolutionflow_admin`. Document this approval in code comment. |
|
||||
|
||||
### Approved cross-tenant queries
|
||||
|
||||
The following cross-tenant operations are explicitly approved. All others require a new entry here before shipping:
|
||||
|
||||
| Job | Query | Justification |
|
||||
|---|---|---|
|
||||
| AI Conversation Expiry | `DELETE FROM ai_conversations WHERE expires_at < NOW()` | Time-based expiry is tenant-independent. Deleting expired rows does not expose data across tenants. |
|
||||
| Chat Retention Cleanup | `SELECT id FROM accounts` | Required to iterate per-account. Read of account IDs only; no tenant data accessed. |
|
||||
| Knowledge Flywheel (after fix) | `SELECT id, account_id FROM ai_sessions WHERE analysis_status='pending'` | Cross-tenant ID harvest only. No tenant data in the SELECT. `account_id` is threaded into per-tenant processing. |
|
||||
| PSA Retry (after fix) | `SELECT id, account_id FROM psa_post_logs WHERE status='pending_retry'` | Cross-tenant ID harvest only. `account_id` threaded into per-tenant processing. |
|
||||
|
||||
### Checklist additions (Phase 2)
|
||||
|
||||
- [ ] All background jobs updated to use `resolutionflow_admin` role via `DATABASE_ADMIN_URL`
|
||||
- [ ] Knowledge Flywheel: `account_id` added to batch query and threaded through to `analyze_session`
|
||||
- [ ] PSA Retry: `account_id` added to batch query and threaded through to `retry_failed_push`
|
||||
- [ ] Maintenance Schedule Firing: all queries include `account_id`; `Session` creation sets `account_id` after Phase 1 migration
|
||||
- [ ] AI Conversation Expiry: cross-tenant approval comment added to code
|
||||
- [ ] All jobs reviewed against this policy before Phase 3 (RLS enable)
|
||||
|
||||
---
|
||||
|
||||
## Section 9: Open Questions
|
||||
|
||||
| Question | Impact | Owner |
|
||||
|---|---|---|
|
||||
| Is tree sharing intra-tenant only, or can trees be shared across accounts? | Determines `tree_shares` schema, backfill strategy, and RLS policy. Tree_shares table deferred until resolved. | Product |
|
||||
| What is the exact PostgreSQL exception raised when `current_setting('app.current_account_id', false)::uuid` is evaluated with no value set? | Determines the fail-closed test assertion. Must be tested in Phase 2. | Engineering |
|
||||
| TargetList: zero references + zero rows? Or does data exist? | Determines whether table is dropped or migrated. | Phase 0 audit |
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Pre-Implementation Checklist
|
||||
|
||||
Everything that must be in place before writing feature code on any tenant-data endpoint:
|
||||
|
||||
### Foundation (Phase 0)
|
||||
- [ ] Copilot tree access bypass hotfix shipped
|
||||
- [ ] Analytics flow endpoint ownership check added
|
||||
- [ ] Category tree count scoped to account
|
||||
- [ ] AI session list and search both restricted to `user_id`
|
||||
- [ ] Full UUID endpoint audit completed; all gaps documented and fixed
|
||||
- [ ] `get_tenant_context` dependency added to `deps.py`
|
||||
- [ ] `tenant_filter()` helper added to `filters.py`
|
||||
- [ ] Existing filter helpers updated to use `tenant_filter()` internally
|
||||
- [ ] TargetList dead code audit result documented
|
||||
- [ ] Teams orphan count query run and result documented
|
||||
- [ ] CI grep check defined and active
|
||||
|
||||
### Schema (Phase 1)
|
||||
- [ ] `account_id NOT NULL` on all tables in Section 1a denormalization list
|
||||
- [ ] `account_id NOT NULL` on all existing nullable models from Section 1b
|
||||
- [ ] ScriptBuilderSession, ScriptTemplate, ScriptGeneration migrated from team_id
|
||||
- [ ] TargetList: dropped or migrated per audit result
|
||||
- [ ] template_trees and platform_steps tables created
|
||||
- [ ] Zero NULL assertion passes for every tenant table
|
||||
- [ ] Migration sequence (add nullable → backfill → verify → NOT NULL → index) followed for each table
|
||||
|
||||
### Infrastructure (Phase 2)
|
||||
- [ ] `resolutionflow_app` role created with correct privileges
|
||||
- [ ] `resolutionflow_admin` role created with `BYPASSRLS`
|
||||
- [ ] `DATABASE_URL` updated to `resolutionflow_app`
|
||||
- [ ] `DATABASE_ADMIN_URL` added for admin connections
|
||||
- [ ] Alembic uses `DATABASE_ADMIN_URL`
|
||||
- [ ] `get_db()` executes `SET LOCAL app.current_account_id` per request inside transaction
|
||||
- [ ] Super admin endpoints use admin connection
|
||||
- [ ] All background jobs updated to use `resolutionflow_admin` role (see Section 8)
|
||||
- [ ] Knowledge Flywheel: `account_id` threaded through batch query and `analyze_session`
|
||||
- [ ] PSA Retry: `account_id` threaded through batch query and `retry_failed_push`
|
||||
- [ ] Maintenance Schedule: all queries include `account_id`; `Session` creation sets `account_id`
|
||||
- [ ] AI Conversation Expiry: cross-tenant approval comment added to code
|
||||
- [ ] `test_rls_isolation.py` written with ~160 test cases
|
||||
- [ ] All RLS tests pass (100%)
|
||||
- [ ] Connection pool reuse test passes
|
||||
- [ ] Fail-closed exception documented and asserted
|
||||
- [ ] Performance baseline for `SET LOCAL` overhead documented
|
||||
|
||||
### RLS Active (Phase 3)
|
||||
- [ ] Batch A RLS policies applied and all tests green
|
||||
- [ ] Batch B RLS policies applied and all tests green
|
||||
- [ ] Batch C RLS policies applied and all tests green
|
||||
- [ ] Batch D RLS policies applied and all tests green
|
||||
- [ ] Batch E RLS policies applied and all tests green
|
||||
- [ ] Staging smoke test passed
|
||||
- [ ] All existing integration tests green with RLS active
|
||||
|
||||
### Ongoing Standards
|
||||
- [ ] Every new endpoint PR includes cross-tenant isolation test
|
||||
- [ ] CI grep check blocks PRs with missing `tenant_filter()`
|
||||
- [ ] `tree_shares` deferred — not enabled until sharing model documented
|
||||
- [ ] `team_id` cleanup migration scheduled after full migration complete
|
||||
Reference in New Issue
Block a user