Files
resolutionflow/docs/superpowers/specs/2026-04-09-tenant-data-isolation-design.md
chihlasm b3dba57bc5 feat: tenant isolation Phase 0 — app-layer filters, UUID audit, CI gate (#132)
* docs: add tenant data isolation design spec

Complete architecture plan for multi-tenant data isolation across
all layers (PostgreSQL RLS, application-layer filtering, schema
migration, testing strategy, and phased rollout checklist).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add background job isolation policy to tenant isolation spec

Documents policy for all 5 existing background jobs:
- Knowledge Flywheel and PSA Retry flagged for account_id threading
- Chat Retention already follows correct pattern (model for others)
- Maintenance Schedule Firing needs account_id in queries + Session creation
- AI Conversation Expiry approved as cross-tenant with justification

Adds approved cross-tenant query registry and Phase 2 checklist items.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: add tenant isolation Phase 0 implementation plan

8 tasks covering: CRITICAL copilot hotfix, tenant_filter() helper,
get_tenant_context dependency, analytics/category/AI session gap fixes,
full UUID endpoint audit, TargetList dead code audit, teams orphan
check, and CI grep check for missing tenant filters.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add tenant_filter() helper and get_tenant_context dependency

tenant_filter(model, account_id) is the canonical app-layer tenant
scoping expression. Every query on a tenant table must use it.
build_tree_access_filter and build_step_visibility_filter updated
to call tenant_filter() internally for the account_id match.

get_tenant_context is a FastAPI dependency that returns account_id
or raises 403 if the user has no account — prevents raw access to
current_user.account_id and centralises the null check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: scope analytics/flows/{tree_id} to requesting account

Any authenticated user could read flow analytics (session counts,
completion rates, CSAT) for any tree UUID. Now returns 404 if the
tree doesn't belong to the requesting account.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: scope category tree_count to requesting account

tree_count on GET /categories/{id} was including trees from all
accounts, leaking cross-tenant row counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: restrict AI session search to current user only

Search endpoint used OR(user_id, account_id), exposing other users'
problem_summary and problem_domain within the same account. Sessions
are user-scoped only — cross-user access requires explicit escalation
or sharing. List and search endpoints now behave consistently.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add ownership check and 404 responses to ai-sessions endpoints

Cross-tenant isolation audit found:
- retry-psa-push had NO ownership check (CRITICAL) — any user could retry any session's PSA push
- save_task_lane used db.get() without ownership filter, returned 403 revealing existence
- get_session returned 403 instead of 404 for unauthorized access
- stream_documentation returned 403 instead of 404

All now use query-level user_id filtering and return 404 to avoid revealing existence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-tenant session access

All session endpoints (get, update, complete, scratchpad, variables, export,
ticket-link) now return 404 instead of 403 when a user tries to access
another user's session. This prevents confirming existence of resources
across tenant boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-tenant tree access

get_tree and update_tree now return 404 when a user cannot access a tree
(private tree from another account). Prevents confirming resource existence
across tenant boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-tenant step access

get_step_or_404 now returns 404 when can_view_step or can_edit_step fails,
preventing confirmation of step existence across tenant boundaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-tenant upload access

get_upload_url and delete_upload now return 404 when the upload belongs to
a different account/user, preventing resource existence confirmation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-tenant share access

revoke_share and create_share now return 404 when the caller is not the
owner, preventing resource existence confirmation across users.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-team tree access in maintenance schedules

_get_tree_or_403 now returns 404 when the user's team does not match,
preventing confirmation of tree existence across teams.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-account tag access

get_tag now returns 404 for account-specific tags that belong to another
account, preventing resource existence confirmation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 instead of 403 for cross-account step category access

get_step_category now returns 404 for account-specific categories that
belong to another account, preventing resource existence confirmation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: add cross-tenant isolation tests for Task 6 UUID audit

Tests cover:
- Tree GET/PUT returns 404 for cross-account access
- Session GET returns 404 for cross-user access
- AI session GET returns 404 for cross-user access
- AI session retry-psa-push requires ownership
- Upload URL returns 404 for cross-account access
- Share revoke returns 404 for cross-user access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: return 404 (not 403) for get_documentation cross-user access; add missing Task 6 tests

get_documentation was revealing session existence via 403. Added pre-check
query filtering by session_id AND user_id before calling the engine.

Also add cross-tenant isolation tests for steps, tags, step_categories,
and maintenance_schedules endpoints fixed in Task 6 (TDD was skipped).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address Task 6 quality review — rename helper, restore 403 for intra-account, add docs test

- Rename _get_tree_or_403 → _get_tree_or_404 in maintenance_schedules.py
  (function now raises 404, old name was misleading)
- Restore HTTP 403 for intra-account permission failures in update_tree:
  same-account users who can see a tree but can't edit it got 404 (wrong);
  only cross-account lookups should return 404 to avoid confirming existence
- Apply same 403/404 distinction to update_tree_visibility
- Add test: get_documentation must return 404 for cross-user session access
- Add comment documenting owner-only design for documentation endpoints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: Task 7+8 — TargetList audit, CI tenant-filter grep check

Task 7: TargetList dead code audit
- Found active code references in 12+ files across backend and frontend
  (full CRUD API + frontend page + MaintenanceScheduleSection + BatchLaunchModal)
- Decision: migrate to account_id in Phase 1 (cannot drop)
- DB row count not available from code-server — must verify from VPS SSH
  before Phase 1 migration
- Teams orphan check query documented; must run from VPS SSH before Phase 1
- Results documented in spec Section 9

Task 8: CI tenant-filter enforcement check (warn mode)
- Create backend/scripts/check_tenant_filters.py
  Scans endpoint and service files for select() on tenant tables without
  tenant_filter/account_id/user_id in surrounding context. Currently
  reports 109 warnings (Phase 1 backlog). Exits 0 (warn mode).
- Add Check tenant filter enforcement step to backend CI job
  Add --fail flag after Phase 1 backlog clears to make it blocking.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: record Phase 0 audit results — 0 orphaned teams, 0 target_list rows

Both checks confirmed 2026-04-09 from production DB.
Phase 1 migration is safe to proceed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 00:42:19 -04:00

33 KiB

Tenant Data Isolation — Design Spec

Date: 2026-04-09 Status: Approved — ready for implementation planning Approach: PostgreSQL RLS as safety net; application-layer filtering as primary enforcement


Overview

ResolutionFlow is a multi-tenant SaaS. The tenant boundary is account_id (UUID foreign key to accounts.id). Two accounts must be completely isolated at every layer: one client must never be able to see, access, query, or receive data belonging to another client — even if there is a bug in application code.

This spec establishes the complete foundation that must be in place before further feature development proceeds.

Design Principle

Application code is the primary enforcement mechanism. PostgreSQL Row-Level Security (RLS) is the safety net.

Both layers must always be present. Developers must never rely on RLS to do filtering that application code should be doing. A missing tenant_filter() in application code is a code review failure even if RLS would have caught it.


Current State

  • Tenant boundary: account_id (not team_id — that column is legacy)
  • Isolation today: 100% application-layer, manual per-endpoint .where(account_id == ...)
  • No RLS, no DB session variables, no middleware tenant injection
  • Nullable account_id on ~8 core models (User, Tree, TreeCategory, etc.)
  • ~20 tables with no direct account_id (scoped only through join chains)
  • 4 models with team_id only, no account_id: TargetList, ScriptBuilderSession, ScriptTemplate, ScriptGeneration
  • Known existing gaps: see Section 4

Section 1: Schema Changes

1a. Denormalize account_id onto all tenant-relevant tables

Add account_id UUID NOT NULL (with FK to accounts.id and index) to each of the following tables that currently lack it:

Table Backfill path
sessions sessions.user_id → users.account_id
attachments attachments.session_id → sessions.user_id → users.account_id
session_supporting_data session_supporting_data.session_id → sessions.user_id → users.account_id
audit_logs audit_logs.user_id → users.account_id
maintenance_schedules maintenance_schedules.tree_id → trees.account_id
user_folders user_folders.user_id → users.account_id
user_pinned_trees user_pinned_trees.user_id → users.account_id
session_branches session_branches.ai_session_id → ai_sessions.account_id (verify column name — may be session_id)
session_handoffs session_handoffs.ai_session_id → ai_sessions.account_id (verify column name — may be session_id)
session_resolution_outputs session_resolution_outputs.session_id → sessions.user_id → users.account_id
fork_points fork_points.session_id → ai_sessions.account_id
ai_session_steps ai_session_steps.session_id → ai_sessions.account_id
ai_suggestions ai_suggestions.tree_id → trees.account_id
step_ratings step_ratings.user_id → users.account_id (backfill from rater, not the step)
step_usage_logs step_usage_logs.user_id → users.account_id (backfill from user, not the step)
psa_post_logs psa_post_logs.psa_connection_id → psa_connections.account_id (psa_connections already has account_id NOT NULL)
psa_member_mappings psa_member_mappings.psa_connection_id → psa_connections.account_id
notification_logs notification_logs.notification_config_id → notification_configs.account_id

Deferred: tree_shares — backfill strategy and RLS policy depend on whether sharing is intra-tenant only or cross-tenant. Must be resolved before RLS is enabled on this table. See Section 7 (Open Questions).

1b. Make existing nullable account_id columns NOT NULL

These tables have account_id already but it is nullable:

  • users
  • trees
  • tree_categories
  • tree_tags
  • step_categories
  • step_library
  • tree_embeddings
  • feedback

Action: Backfill any NULL rows (assign to correct account or delete if orphaned test/seed data), then ALTER COLUMN account_id SET NOT NULL.

1c. Global / template content

Move global content out of tenant tables into dedicated tables with no RLS:

  • template_trees — default and publicly visible trees (currently is_default=True or visibility='public' in trees). All tenants can read; no RLS.
  • platform_steps — public steps from step library (currently visibility='public' in step_library). All tenants can read; no RLS.
  • Tables already global (no change needed): script_categories, platform_settings, plan_limits, feature_flags, plan_feature_defaults.

1d. Migration sequence (per table)

Every table that gains account_id follows this exact sequence. Each step must succeed before the next begins:

  1. ADD COLUMN account_id UUID (nullable)
  2. Backfill via UPDATE ... JOIN ...
  3. SELECT COUNT(*) WHERE account_id IS NULL — must be zero before proceeding
  4. ALTER COLUMN account_id SET NOT NULL
  5. CREATE INDEX ON <table>(account_id)
  6. Enable RLS (in Phase 3, not here)

Any migration that cannot reach step 3 (zero NULLs) must roll back completely. No partial state.


Section 2: PostgreSQL Roles & RLS Infrastructure

2a. Database roles

Two PostgreSQL roles:

resolutionflow_app

  • Used by all application requests
  • Subject to all RLS policies
  • Standard table privileges: SELECT, INSERT, UPDATE, DELETE on all tenant tables

resolutionflow_admin

  • Used by: Alembic migrations, seed scripts, super admin API endpoints, scheduled background jobs requiring cross-tenant access
  • Has BYPASSRLS attribute — not subject to RLS policies
  • Same table privileges as resolutionflow_app
  • Connection string exposed as DATABASE_ADMIN_URL env var

The current postgres superuser is replaced in the application by these two roles. The connection string in DATABASE_URL transitions to resolutionflow_app. Alembic uses DATABASE_URL_SYNC pointing to resolutionflow_admin.

2b. Per-request tenant context injection

Every request that passes through get_db() must execute, inside a transaction boundary:

SET LOCAL app.current_account_id = '<account_uuid>';

SET LOCAL is transaction-scoped — it resets automatically when the transaction ends. No cleanup needed. This is implemented in a modified get_db() dependency that receives current_user and executes the SET before yielding the session.

Fail-closed behavior: If app.current_account_id is not set (e.g., a bug where SET LOCAL was skipped), current_setting('app.current_account_id', false)::uuid returns NULL. NULL = NULL is false in SQL — the RLS policy matches zero rows. This is the correct fail-closed behavior.

Public endpoints (no authenticated user) do not call SET LOCAL. RLS matches zero rows for all tenant tables. This is correct.

2c. RLS policy pattern

Every tenant table (from Section 1 + all existing tables with account_id) gets:

ALTER TABLE <table> ENABLE ROW LEVEL SECURITY;
ALTER TABLE <table> FORCE ROW LEVEL SECURITY;

-- SELECT
CREATE POLICY tenant_select ON <table> FOR SELECT
  USING (account_id = current_setting('app.current_account_id', false)::uuid);

-- INSERT
CREATE POLICY tenant_insert ON <table> FOR INSERT
  WITH CHECK (account_id = current_setting('app.current_account_id', false)::uuid);

-- UPDATE
CREATE POLICY tenant_update ON <table> FOR UPDATE
  USING (account_id = current_setting('app.current_account_id', false)::uuid)
  WITH CHECK (account_id = current_setting('app.current_account_id', false)::uuid);

-- DELETE
CREATE POLICY tenant_delete ON <table> FOR DELETE
  USING (account_id = current_setting('app.current_account_id', false)::uuid);

FORCE ROW LEVEL SECURITY ensures the table owner is also subject to policies. The resolutionflow_admin role bypasses via its BYPASSRLS attribute, not via ownership.

audit_logs exception: SELECT policy only. No WITH CHECK on INSERT (app inserts audit logs freely). No UPDATE or DELETE policies ever. These constraints are permanent and must be documented in the migration comment.

Global tables (platform_settings, plan_limits, feature_flags, plan_feature_defaults, template_trees, platform_steps): No RLS.

2d. Connection pool reuse safety

The SET LOCAL approach is transaction-scoped. A connection pool reuse test (see Section 6) must verify that a connection returned to the pool after tenant A's request does not carry tenant A's account_id into tenant B's request. This is guaranteed by SET LOCAL (resets on transaction end) but must be explicitly verified.


Section 3: Application-Layer Enforcement Patterns

3a. tenant_filter() helper

Add to backend/app/core/filters.py:

def tenant_filter(model, account_id: uuid.UUID):
    """
    Primary app-layer tenant filter.
    MUST be used in every SELECT/UPDATE/DELETE on tenant tables.
    RLS is the safety net — this is the primary enforcement.
    """
    return model.account_id == account_id

All existing filter helpers (build_tree_access_filter, build_step_visibility_filter) must internally call tenant_filter() as their base constraint.

3b. Fetch-and-verify pattern

For ID-based lookups, filter by both id AND account_id in the query — not fetch-then-check:

# Correct
stmt = select(Tree).where(
    Tree.id == tree_id,
    tenant_filter(Tree, current_user.account_id)
)
tree = (await db.execute(stmt)).scalar_one_or_none()
if not tree:
    raise HTTPException(status_code=404)

# Prohibited — fetch first, then check
tree = await db.get(Tree, tree_id)
if tree.account_id != current_user.account_id:
    raise HTTPException(status_code=403)  # Reveals existence — also wrong

Endpoints must return 404, not 403, for cross-tenant ID lookups. Never confirm that a resource exists.

3c. get_tenant_context dependency

Add to backend/app/api/deps.py:

async def get_tenant_context(
    current_user: User = Depends(get_current_active_user)
) -> uuid.UUID:
    """
    Returns the current user's account_id.
    Raises 403 if the user has no account association.
    Inject this instead of accessing current_user.account_id directly.
    """
    if current_user.account_id is None:
        raise HTTPException(status_code=403, detail="User not associated with any account")
    return current_user.account_id

3d. Insert pattern

All inserts on tenant tables must explicitly set account_id. RLS WITH CHECK rejects inserts where account_id doesn't match the session variable, but application code must also set it:

new_record = Tree(
    account_id=tenant_account_id,  # Required — never omit
    ...
)

3e. Code review checklist rule

Every PR that touches a tenant table must be verified against:

  1. Every SELECT/UPDATE/DELETE includes tenant_filter() or a wrapper that calls it
  2. ID lookups filter by both id and account_id in the same query
  3. Inserts explicitly set account_id
  4. 404 (not 403) returned for cross-tenant ID lookups
  5. A cross-tenant isolation test is included (Phase 2 onwards)

3f. CI grep check

A grep-based CI check is active from the end of Phase 0 on all PRs. It flags:

  • Queries on known tenant tables that don't include tenant_filter or account_id as a filter term
  • Initial implementation: warn (not block) to allow calibration; switch to block after 2 weeks of false-positive tuning

Pattern to be defined during Phase 0 implementation.


Section 4: Existing Gap Fixes

Immediate Hotfix (ships before Phase 0)

CRITICAL: Copilot tree access bypass

  • Files: backend/app/api/endpoints/copilot.py, backend/app/services/copilot_service.py
  • Issue: start_conversation() loads a tree by UUID without verifying the requesting user has access. An attacker who knows a tree UUID from another account can extract its full structure, node names, and descriptions via the AI system prompt.
  • Fix: Add can_access_tree(current_user, tree) check in the endpoint after loading the tree. Also add tenant_filter(Tree, account_id) to the tree query in copilot_service.py. Raise 404 (not 403) if the tree is not found or not accessible.
  • Ships as an independent hotfix PR, merged immediately.

Phase 0 Fixes

LOW: Analytics flow endpoint (backend/app/api/endpoints/analytics.py)

  • GET /analytics/flows/{tree_id} returns analytics for any tree by UUID with no access check
  • Fix: Add tenant_filter(Tree, current_user.account_id) to the tree fetch query. 404 if not found.

LOW: Category tree count (backend/app/api/endpoints/categories.py)

  • Tree count per category includes trees from all accounts
  • Fix: Add tenant_filter(Tree, current_user.account_id) to the count subquery

LOW: AI session scope inconsistency (backend/app/api/endpoints/ai_sessions.py)

  • List endpoint is user-scoped (user_id == current_user.id); search endpoint uses OR(user_id, account_id) exposing other users' session summaries within the same account
  • Sessions are user-scoped only. Cross-user access permitted only via explicit escalation or session sharing
  • Fix: Restrict search endpoint to user_id == current_user.id. List and search must behave consistently.

Phase 0 UUID endpoint audit

  • Systematically review every endpoint with a {resource_id} URL parameter
  • For each: verify that the ID lookup either (a) filters by id AND account_id in the query, or (b) calls can_access_<resource>(current_user, resource) on the fetched object
  • Document every instance found, classify by severity, fix all before Phase 0 closes

Section 5: Legacy team_id Migration

5a. TargetList — audit-gated

Before any migration work on TargetList:

  1. Run a full codebase reference audit: grep for all references to TargetList, target_list, target-list
  2. Query the production and staging databases for row count
  3. Decision tree:
    • Zero code references AND zero rows → drop the table entirely
    • Zero rows but code references exist → deprecate code, then drop
    • Rows exist → migrate (see sequence below)
  4. Migration only proceeds after audit result is documented and approved

If migration proceeds:

  • Backfill path: team_id → teams → users WHERE is_team_admin → account_id
  • If any row cannot be backfilled to a valid account_idfull rollback, no exceptions, no manual review queues

5b. Pre-migration: teams-to-accounts orphan check

Before any backfill using team_id → account_id chains, run:

SELECT COUNT(*) FROM teams t
LEFT JOIN users u ON u.team_id = t.id AND u.account_id IS NOT NULL
WHERE u.id IS NULL;

Count must be zero before any backfill proceeds. Report and resolve orphaned teams first.

5c. Approved migrations (no audit gate)

ScriptBuilderSession: Add account_id, backfill from user_id → users.account_id, set NOT NULL.

ScriptTemplate: Add account_id, backfill from created_by_id → users.account_id, set NOT NULL.

ScriptGeneration: Add account_id, backfill from user_id → users.account_id, set NOT NULL.

5d. team_id cleanup

Do NOT drop team_id columns during migration. Keep until all application code is updated to use account_id exclusively. Drop team_id columns in a later cleanup migration after verification.


Section 6: Testing Strategy

Phase 1: RLS validation tests

File: backend/tests/test_rls_isolation.py

This test suite validates RLS policies at the database layer, independent of any application code. It connects to the test database using the resolutionflow_app role.

Setup fixture:

  • Creates two accounts (account_a, account_b) with seed rows in all tenant tables
  • Creates an async DB connection using resolutionflow_app role
  • Sets SET LOCAL app.current_account_id = '<account_a_uuid>' before each test

Test cases per table (~5 cases per table, ~32 tables ≈ ~160 total):

  1. SELECT isolation — querying as account_a returns zero rows for account_b's data
  2. INSERT enforcement — inserting with account_id = account_b_uuid raises a PostgreSQL exception (rejected by WITH CHECK)
  3. INSERT cross-tenant FK — inserting with correct account_id but a FK value (e.g., tree_id) that belongs to account_b is also rejected
  4. UPDATE enforcement — updating account_b's rows as account_a affects zero rows
  5. DELETE enforcement — deleting account_b's rows as account_a affects zero rows

Fail-closed test:

# Unset app.current_account_id — must raise a database exception
# Acceptable: psycopg2.errors.InvalidTextRepresentation (NULL::uuid cast fails)
# NOT acceptable: query returning zero rows silently
with pytest.raises(asyncpg.PostgresError):
    await conn.execute("SELECT * FROM trees")

The exact exception type must be documented based on the behavior of current_setting('app.current_account_id', false)::uuid when unset. The test asserts on that specific exception.

audit_logs special cases:

  • No INSERT WITH CHECK test — audit logs must be insertable freely
  • No UPDATE or DELETE tests — those policies must not exist
  • Verify via pg_policies that only a SELECT policy exists on audit_logs

tree_shares: Intentionally deferred. A TODO comment is included:

# TODO: tree_shares RLS tests deferred pending sharing model decision.
# Must be added before RLS is enabled on the tree_shares table.
# See: docs/superpowers/specs/2026-04-09-tenant-data-isolation-design.md Section 7

Tables covered in Phase 1: trees, ai_sessions, ai_chat_sessions, ai_conversations, sessions, step_library, tree_categories, tree_tags, step_categories, flow_proposals, attachments, audit_logs, psa_connections, copilot_conversations, file_uploads, kb_imports, subscriptions, account_invites, ai_usage, notifications, session_shares, script_builder_sessions, script_templates, script_generations, maintenance_schedules, user_folders, user_pinned_trees, session_supporting_data, session_branches, session_resolution_outputs, psa_post_logs, notification_logs

Connection pool reuse test:

async def test_set_local_does_not_leak_between_connections():
    """SET LOCAL must not leak account_id when a connection is returned to the pool."""
    conn = await pool.acquire()
    async with conn.transaction():
        await conn.execute("SET LOCAL app.current_account_id = $1", str(account_a_id))
        # transaction ends, SET LOCAL resets
    # Return conn to pool, re-acquire — verify account_id is not set
    conn2 = await pool.acquire()
    result = await conn2.fetchval("SELECT current_setting('app.current_account_id', true)")
    assert result is None or result == ""

Phase 2: Per-endpoint cross-tenant tests

As each endpoint is touched in any PR from Phase 1 onward, a cross-tenant isolation test is added in the same PR:

async def test_cannot_access_other_account_<resource>(
    client_account_a, account_b_resource
):
    """Account A cannot access Account B's resource by UUID."""
    response = await client_account_a.get(f"/<resource>/{account_b_resource.id}")
    assert response.status_code == 404  # Not 403 — never reveal existence

This is a hard requirement per the code review checklist.


Section 7: Phased Rollout Plan

Immediate Hotfix (before everything else)

  • Fix CRITICAL: Copilot tree access bypass (see Section 4)
  • Ships as independent PR, merged immediately, does not wait for Phase 0

Phase 0 — Foundation

Goal: Fix all existing gaps; establish patterns and tooling; gate future PRs.

  1. Fix LOW: Analytics flow endpoint missing ownership check
  2. Fix LOW: Category tree count scope
  3. Fix LOW: AI session search/list inconsistency (restrict both to user_id)
  4. Full UUID endpoint audit — every {resource_id} URL param checked. All gaps documented and fixed before Phase 0 closes.
  5. Add get_tenant_context dependency to deps.py
  6. Add tenant_filter() helper to filters.py. Update existing filter helpers to call it.
  7. Dead code audit: TargetList references + database row count. Report result.
  8. Orphan check: Teams without a resolvable account_id. Report result.
  9. Define and activate CI grep check for missing tenant_filter() on tenant tables. Active on all PRs from Phase 1 forward.

Gate: All gaps patched. CI grep check active. TargetList and team orphan audit results documented.


Phase 1 — Schema Migration

Goal: Every tenant-relevant table has a direct, NOT NULL account_id column.

  1. Add account_id to all tables in Section 1a (migration per logical domain group: core sessions, PSA, AI, steps, notifications)
  2. Make nullable account_id NOT NULL on models from Section 1b
  3. Migrate ScriptBuilderSession, ScriptTemplate, ScriptGeneration from team_id to account_id
  4. TargetList: execute result of Phase 0 audit (drop or migrate)
  5. Create template_trees and platform_steps tables; migrate global content

Gate: Zero NULL account_id values in any tenant table. All backfills verified. Database passes zero-NULL assertion query for every table in scope.


Phase 2 — PostgreSQL Infrastructure

Goal: Database roles established; SET LOCAL wired into every request; RLS test suite green.

  1. Create resolutionflow_app and resolutionflow_admin PostgreSQL roles; grant privileges
  2. Update DATABASE_URL to resolutionflow_app; add DATABASE_ADMIN_URL for admin connections
  3. Update Alembic to use DATABASE_ADMIN_URL
  4. Modify get_db() to execute SET LOCAL app.current_account_id per request, inside transaction boundary
  5. Update super admin endpoints to use resolutionflow_admin connection
  6. Write test_rls_isolation.py — all ~160 RLS tests. Must be 100% green before Phase 3. Connection pool reuse test included.
  7. Measure SET LOCAL per-request overhead baseline

Gate: All RLS tests green. All existing integration tests green. Connection pool reuse test green. Performance baseline documented.


Phase 3 — Enable RLS

Goal: RLS policies active on all tenant tables. Phased by domain, not all at once.

Enable RLS in these batches. Run full test suite after each batch before proceeding:

Batch A: Core data trees, tree_categories, tree_tags, sessions, attachments, session_supporting_data, maintenance_schedules

Batch B: AI & sessions ai_sessions, ai_chat_sessions, ai_conversations, ai_usage, ai_session_steps, session_branches, session_handoffs, session_resolution_outputs, fork_points, copilot_conversations, flow_proposals

Batch C: Steps & library step_library, step_categories, step_ratings, step_usage_logs, ai_suggestions, template_trees (no RLS), platform_steps (no RLS)

Batch D: Integrations & users psa_connections, psa_post_logs, psa_member_mappings, subscriptions, account_invites, file_uploads, kb_imports, notifications, notification_logs, session_shares, user_folders, user_pinned_trees

Batch E: Scripts & auth script_builder_sessions, script_templates, script_generations, audit_logs

For each batch migration: ENABLE ROW LEVEL SECURITY + FORCE ROW LEVEL SECURITY + create all applicable policies.

Gate: All RLS tests green after each batch. All integration tests green. Staging smoke test after Batch E. No performance regressions.


Phase 4 — Ongoing

  • Every PR that touches a tenant endpoint includes a cross-tenant isolation test
  • CI grep check blocks PRs with missing tenant_filter() (warn → block after 2-week calibration)
  • tree_shares RLS: design sharing model, implement tests, enable RLS before any sharing feature ships
  • team_id columns: drop in a cleanup migration after all application code is fully migrated

Section 8: Background Job Policy

Background jobs and scheduled tasks that process tenant data are a distinct isolation surface. Unlike request-scoped endpoints, they do not have a current_user and must manage tenant context explicitly.

Policy

Every background job that touches tenant tables must comply with one of two patterns:

Pattern A — resolutionflow_admin with explicit per-query account_id filtering

Use when: the job is inherently cross-tenant (e.g., processes all pending work across all accounts in a single pass). The admin role bypasses RLS, so explicit account_id filters in every query are mandatory — RLS is not the safety net here.

# Allowed: cross-tenant batch SELECT for IDs only, then loop per-account
result = await db.execute(
    select(Model.id, Model.account_id).where(Model.status == "pending")
)
for row in result.all():
    await _process_one(row.id, row.account_id, db)  # account_id threaded through

# In the processing function: all queries must filter by account_id
async def _process_one(record_id, account_id, db):
    result = await db.execute(
        select(Model).where(Model.id == record_id, Model.account_id == account_id)
    )

Pattern B — resolutionflow_app with SET LOCAL per tenant loop iteration

Use when: the job processes tenants one at a time and it is practical to set the tenant context per iteration.

for account_id in account_ids:
    async with async_session_maker() as db:
        await db.execute(
            text("SET LOCAL app.current_account_id = :id"),
            {"id": str(account_id)}
        )
        # All queries in this block are RLS-enforced to this tenant

No cross-tenant queries without justification

No background job may issue a SELECT, UPDATE, or DELETE that spans multiple tenants' data in a single query without explicit written justification in the code comment and in this spec. Approved cross-tenant operations are documented below.

Inventory: Current Background Jobs

Job File Pattern Status Notes
Knowledge Flywheel knowledge_flywheel_scheduler.py Admin + explicit filter Needs update Batch SELECT across all accounts with no account_id filter. Must thread account_id from the session into all processing calls.
PSA Retry psa_retry_scheduler.py Admin + explicit filter Needs update Batch SELECT across all accounts with no account_id filter. Must add account_id to batch query and thread through to retry_failed_push.
Chat Retention Cleanup retention_cleanup.py Admin + explicit filter Correct pattern Already loops per-account with explicit account_id in all queries. Model for other jobs. Needs role update to resolutionflow_admin.
Maintenance Schedule Firing scheduler.py (_fire_maintenance_schedule) Admin + explicit filter Needs update Fetches MaintenanceSchedule and Tree by ID without account_id filter. After Phase 1, Session creation must set account_id. Add account_id to all queries.
AI Conversation Expiry scheduler.py (_cleanup_expired_ai_conversations) Admin, cross-tenant approved Correct pattern Cross-tenant DELETE by expires_at is explicitly justified: rows are expired regardless of tenant. Needs role update to resolutionflow_admin. Document this approval in code comment.

Approved cross-tenant queries

The following cross-tenant operations are explicitly approved. All others require a new entry here before shipping:

Job Query Justification
AI Conversation Expiry DELETE FROM ai_conversations WHERE expires_at < NOW() Time-based expiry is tenant-independent. Deleting expired rows does not expose data across tenants.
Chat Retention Cleanup SELECT id FROM accounts Required to iterate per-account. Read of account IDs only; no tenant data accessed.
Knowledge Flywheel (after fix) SELECT id, account_id FROM ai_sessions WHERE analysis_status='pending' Cross-tenant ID harvest only. No tenant data in the SELECT. account_id is threaded into per-tenant processing.
PSA Retry (after fix) SELECT id, account_id FROM psa_post_logs WHERE status='pending_retry' Cross-tenant ID harvest only. account_id threaded into per-tenant processing.

Checklist additions (Phase 2)

  • All background jobs updated to use resolutionflow_admin role via DATABASE_ADMIN_URL
  • Knowledge Flywheel: account_id added to batch query and threaded through to analyze_session
  • PSA Retry: account_id added to batch query and threaded through to retry_failed_push
  • Maintenance Schedule Firing: all queries include account_id; Session creation sets account_id after Phase 1 migration
  • AI Conversation Expiry: cross-tenant approval comment added to code
  • All jobs reviewed against this policy before Phase 3 (RLS enable)

Section 9: Open Questions

Question Impact Owner
Is tree sharing intra-tenant only, or can trees be shared across accounts? Determines tree_shares schema, backfill strategy, and RLS policy. Tree_shares table deferred until resolved. Product
What is the exact PostgreSQL exception raised when current_setting('app.current_account_id', false)::uuid is evaluated with no value set? Determines the fail-closed test assertion. Must be tested in Phase 2. Engineering
TargetList audit complete: active code references found in 12+ files across backend and frontend (full CRUD API, frontend page, used in MaintenanceScheduleSection and BatchLaunchModal). Cannot be dropped. Row count confirmed: 0 rows in production. Decision: migrate to account_id in Phase 1 via backfill from team_id → accounts. Zero rows means backfill is trivial (no data to migrate, just schema change). ✓ Resolved — migrate in Phase 1. Zero rows confirmed 2026-04-09. ✓ Done
Teams orphan check: 0 orphaned teams confirmed 2026-04-09. Phase 1 backfill using team_id → account_id chain is safe to proceed. ✓ Resolved — Phase 1 can proceed without team cleanup. ✓ Done

Appendix: Pre-Implementation Checklist

Everything that must be in place before writing feature code on any tenant-data endpoint:

Foundation (Phase 0)

  • Copilot tree access bypass hotfix shipped
  • Analytics flow endpoint ownership check added
  • Category tree count scoped to account
  • AI session list and search both restricted to user_id
  • Full UUID endpoint audit completed; all gaps documented and fixed
  • get_tenant_context dependency added to deps.py
  • tenant_filter() helper added to filters.py
  • Existing filter helpers updated to use tenant_filter() internally
  • TargetList dead code audit result documented
  • Teams orphan count query run and result documented
  • CI grep check defined and active

Schema (Phase 1)

  • account_id NOT NULL on all tables in Section 1a denormalization list
  • account_id NOT NULL on all existing nullable models from Section 1b
  • ScriptBuilderSession, ScriptTemplate, ScriptGeneration migrated from team_id
  • TargetList: dropped or migrated per audit result
  • template_trees and platform_steps tables created
  • Zero NULL assertion passes for every tenant table
  • Migration sequence (add nullable → backfill → verify → NOT NULL → index) followed for each table

Infrastructure (Phase 2)

  • resolutionflow_app role created with correct privileges
  • resolutionflow_admin role created with BYPASSRLS
  • DATABASE_URL updated to resolutionflow_app
  • DATABASE_ADMIN_URL added for admin connections
  • Alembic uses DATABASE_ADMIN_URL
  • get_db() executes SET LOCAL app.current_account_id per request inside transaction
  • Super admin endpoints use admin connection
  • All background jobs updated to use resolutionflow_admin role (see Section 8)
  • Knowledge Flywheel: account_id threaded through batch query and analyze_session
  • PSA Retry: account_id threaded through batch query and retry_failed_push
  • Maintenance Schedule: all queries include account_id; Session creation sets account_id
  • AI Conversation Expiry: cross-tenant approval comment added to code
  • test_rls_isolation.py written with ~160 test cases
  • All RLS tests pass (100%)
  • Connection pool reuse test passes
  • Fail-closed exception documented and asserted
  • Performance baseline for SET LOCAL overhead documented

RLS Active (Phase 3)

  • Batch A RLS policies applied and all tests green
  • Batch B RLS policies applied and all tests green
  • Batch C RLS policies applied and all tests green
  • Batch D RLS policies applied and all tests green
  • Batch E RLS policies applied and all tests green
  • Staging smoke test passed
  • All existing integration tests green with RLS active

Ongoing Standards

  • Every new endpoint PR includes cross-tenant isolation test
  • CI grep check blocks PRs with missing tenant_filter()
  • tree_shares deferred — not enabled until sharing model documented
  • team_id cleanup migration scheduled after full migration complete