Updated documentation; added PERFORMANCE-HEALTH-CHECK.md

2026-02-04 21:46:32 -05:00
parent 2733a00253
commit d7c5c8c9ce
2 changed files with 697 additions and 3 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -91,6 +91,20 @@ When adding new frontend pages or components, use "ResolutionFlow" for any user-
  - Purple gradient theme, custom fonts (Plus Jakarta Sans, Inter, Outfit)
  - Custom SVG logo in header and auth pages
  - Updated favicon and browser tab title
+- **Token Refresh Fix:**
+  - Silent refresh with single-flight queue (prevents concurrent 401 race conditions)
+  - Backend `get_refresh_token_payload` dependency extracts refresh token from Authorization header
+  - Frontend Axios interceptor queues failed requests during refresh, retries after success
+  - Auth store synced after silent refresh via `setTokens` action
+- **Session Scratchpad (Floating Overlay):**
+  - Fixed-position overlay panel (420px wide, 55vh tall) on right edge
+  - Floating button when collapsed, slide-in panel when expanded
+  - Ctrl+/ keyboard shortcut to toggle
+  - Auto-save with 1s debounce, markdown preview, localStorage persistence
+  - Main content adjusts width via padding transition when panel opens
+- **Global Thin Scrollbar Styling:**
+  - 6px thin scrollbars site-wide (Firefox `scrollbar-width: thin` + WebKit pseudo-elements)
+  - Theme-aware colors using CSS variables (`--border`, `--muted-foreground`)

 ### What's In Progress

@@ -180,7 +194,7 @@ patherly/
 │   │   ├── router.tsx
 │   │   ├── assets/brand/           # Brand logos (SVG)
 │   │   ├── api/                    # Axios API client
-│   │   │   ├── client.ts           # Axios instance with interceptors
+│   │   │   ├── client.ts           # Axios instance with refresh queue interceptor
 │   │   │   ├── auth.ts
 │   │   │   ├── trees.ts
 │   │   │   └── sessions.ts
@@ -196,7 +210,7 @@ patherly/
 │   │   │   ├── tree-editor/        # Tree editor components
 │   │   │   ├── tree-preview/       # Visual tree preview
 │   │   │   ├── step-library/       # Step library browser, forms, modals
-│   │   │   ├── session/            # Session modals, scratchpad sidebar
+│   │   │   ├── session/            # Session modals, scratchpad floating overlay
 │   │   │   └── ui/                 # MarkdownContent
 │   │   ├── pages/
 │   │   │   ├── LoginPage.tsx
@@ -508,6 +522,33 @@ Key state: `pendingStep`, `pendingContinuationNodeId`, `customBranchMode`, `bran
 Custom steps are stored in session JSONB (`custom_steps` field) and referenced by UUID in `pathTaken`.
 `findNode()` only searches tree structure -- use `findCustomStep()` for custom step UUIDs.

+### Token Refresh: Match Frontend/Backend Contract
+
+The refresh endpoint must accept tokens the same way the frontend sends them.
+
+```python
+# WRONG - A bare `str` parameter is read from the query string, but the frontend sends an Authorization header
+@router.post("/refresh")
+async def refresh_token(refresh_token: str):
+    payload = decode_token(refresh_token)
+
+# CORRECT - Use dependency that reads from Authorization header
+@router.post("/refresh")
+async def refresh_token(
+    payload: Annotated[dict, Depends(get_refresh_token_payload)],
+):
+```
+
+The frontend Axios interceptor sends `Authorization: Bearer <refresh_token>`. The backend must extract it from the header, not expect it as a query/body parameter.
+
+### CORS Errors Can Mask Server 500s
+
+When an unhandled exception in the backend produces a 500 Internal Server Error, the response bypasses the CORS middleware, so no CORS headers are added. The browser reports this as a CORS error, hiding the real cause. Always check backend logs first when debugging CORS issues locally.
+
+### Run Migrations Before Local Testing
+
+After cloning or pulling new changes, always run `alembic upgrade head` before starting the backend. Missing migrations cause 500 errors (e.g., `column does not exist`) that manifest as CORS errors in the browser.
+
 ---

 ## API Endpoints Reference
@@ -586,7 +627,7 @@ interface Decision {

 ### State Management

- **Auth:** `useAuthStore` - Zustand with localStorage persistence
+- **Auth:** `useAuthStore` - Zustand with localStorage persistence (includes `setTokens` for silent refresh sync)
 - **Theme:** `useThemeStore` - Dark/light/system preference
 - **Tree Editor:** `useTreeEditorStore` - Zustand + immer + zundo (undo/redo)
 - **User Preferences:** `useUserPreferencesStore` - Zustand with localStorage persistence (export format default)
@@ -612,9 +653,28 @@ interface Decision {
 import api from '@/api/client'

 // Token refresh handled automatically by interceptor
+// Concurrent 401s are queued — only one refresh request fires at a time
+// On refresh failure, user is logged out and redirected to /login
 const response = await api.get('/api/v1/trees')
 ```

+### Floating Overlay Pattern (Scratchpad)
+
+The scratchpad uses `position: fixed` with an `onOpenChange` callback so the parent page can adjust layout:
+
+```tsx
+// Child: ScratchpadSidebar.tsx
+onOpenChange?: (isOpen: boolean) => void
+// Fires when collapsed state changes, parent uses it to add/remove padding
+
+// Parent: TreeNavigationPage.tsx
+const [scratchpadOpen, setScratchpadOpen] = useState(...)
+<div className={cn('...', scratchpadOpen && 'pr-[440px]')}>
+  <div className="mx-auto max-w-4xl">  {/* centers in available space */}
+```
+
+Position overlay at `right-2` (not `right-0`) so it sits inside the page scrollbar, and use full `rounded-lg` (not `rounded-l-lg`).
+
 ---

 ## Common Tasks
--- a/docs/PERFORMANCE-HEALTH-CHECK.md
+++ b/docs/PERFORMANCE-HEALTH-CHECK.md
@@ -0,0 +1,634 @@
+# ResolutionFlow Performance Health Check
+
+**Purpose:** Verify application performance and scalability before/during beta testing  
+**When to run:** Before beta launch, then monthly during growth phase  
+**Time required:** 2-3 hours first time, 30-60 minutes for routine checks
+
+---
+
+## Prerequisites
+
+- [ ] Docker Desktop running
+- [ ] Access to Railway dashboard
+- [ ] VS Code open with ResolutionFlow project
+- [ ] Python virtual environment activated
+- [ ] Node.js installed (for the frontend build; k6 is a standalone binary and does not need it)
+
+---
+
+## 1. Database Performance Check
+
+### 1.1 Verify Indexes Exist
+
+**Why:** Indexes work like the index in a book. Without them, PostgreSQL scans every row, which gets slower as tables grow; with them, it jumps straight to the matching rows.
+
+**Commands to run:**
+```bash
+# Connect to your Railway PostgreSQL database
+# Get connection string from Railway dashboard → PostgreSQL service → Variables → DATABASE_URL
+
+# Option 1: Use Railway CLI
+railway connect PostgreSQL
+
+# Option 2: Use psql directly
+psql "your-database-url-here"
+```
+
+**Once connected, run:**
+```sql
+-- Check what indexes exist
+SELECT 
+    tablename, 
+    indexname, 
+    indexdef 
+FROM pg_indexes 
+WHERE schemaname = 'public' 
+ORDER BY tablename, indexname;
+```
+
+**What you're looking for:**
+
+✅ **GOOD:** You should see indexes on:
+- `users.email` (for login lookups)
+- `users.username` (for login lookups)
+- `trees.created_by` (for "my trees" queries)
+- `tree_nodes.tree_id` (for loading tree structure)
+- `sessions.tree_id` (for session lookups)
+
+❌ **BAD:** If these are missing, queries will slow down as data grows
+
+**Fix if needed:**
+```sql
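+-- Example: add missing indexes (CONCURRENTLY avoids blocking writes while the index builds)
+CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_trees_created_by ON trees(created_by);
+CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tree_nodes_tree_id ON tree_nodes(tree_id);
+CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_sessions_tree_id ON sessions(tree_id);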
+```
+
+### 1.2 Test Query Performance
+
+**Run realistic queries and time them:**
+```sql
+-- Enable timing
+\timing
+
+-- Test: Full-text search on trees (simulates search bar). If description can be NULL, wrap it in coalesce(description, '') or the row never matches.
+SELECT * FROM trees 
+WHERE to_tsvector('english', name || ' ' || description) @@ to_tsquery('english', 'password');
+
+-- Test: Load tree with all nodes (simulates opening tree editor)
+SELECT tn.* 
+FROM tree_nodes tn 
+WHERE tn.tree_id = 1  -- Replace with actual tree ID
+ORDER BY tn.position;
+
+-- Test: User's tree list (simulates dashboard)
+SELECT * FROM trees 
+WHERE created_by = 1  -- Replace with actual user ID
+ORDER BY updated_at DESC 
+LIMIT 20;
+```
+
+**Benchmarks:**
+- ✅ **GOOD:** All queries < 50ms
+- ⚠️ **WARNING:** Any query 50-200ms (optimize later)
+- ❌ **BAD:** Any query > 200ms (optimize NOW)
+
+### 1.3 Check Database Size
+```sql
+-- See how much data you have
+SELECT 
+    schemaname,
+    tablename,
+    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
+FROM pg_tables
+WHERE schemaname = 'public'
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
+```
+
+**What this tells you:** If tables are growing unexpectedly large, you might have data bloat or missing cleanup logic.
+
+---
+
+## 2. Frontend Performance Check
+
+### 2.1 Test Large Tree Rendering
+
+**Create a "stress test" tree:**
+
+1. Log into ResolutionFlow frontend
+2. Create a new tree called "Performance Test - Large Tree"
+3. Add 50-100 nodes (use copy/paste to speed this up)
+4. Save the tree
+
+**What to watch:**
+
+- Does the editor lag when adding nodes?
+- Does scrolling feel smooth?
+- Does saving take more than 2-3 seconds?
+
+**Tools to use:**
+
+Open Chrome DevTools (F12):
+```
+1. Go to Performance tab
+2. Click Record (red circle)
+3. Interact with large tree (scroll, add nodes, expand/collapse)
+4. Stop recording
+5. Look for long tasks flagged in red on the main thread (blocking/slow operations)
+```
+
+**Benchmarks:**
+- ✅ **GOOD:** No operations block for > 100ms
+- ⚠️ **WARNING:** Some operations 100-300ms
+- ❌ **BAD:** Operations > 300ms (users will notice lag)
+
+### 2.2 Check Bundle Size
+
+**Why:** Large JavaScript bundles = slow initial page load
+```bash
+# From your React frontend directory
+cd frontend
+npm run build
+
+# Look at the output - it will show bundle sizes
+```
+
+**Benchmarks:**
+- ✅ **GOOD:** Main bundle < 500KB gzipped
+- ⚠️ **WARNING:** 500KB - 1MB
+- ❌ **BAD:** > 1MB (investigate what's bloating it)
+
+### 2.3 Lighthouse Audit
+
+**Chrome has this built-in:**
+```
+1. Open ResolutionFlow in Chrome
+2. F12 → Lighthouse tab
+3. Select "Desktop" + "Performance"
+4. Click "Analyze page load"
+```
+
+**Benchmarks:**
+- ✅ **GOOD:** Performance score > 80
+- ⚠️ **WARNING:** 60-80
+- ❌ **BAD:** < 60
+
+**Common issues and fixes:**
+- "Eliminate render-blocking resources" → defer or async-load non-critical scripts and styles
+- "Reduce unused JavaScript" → code splitting needed
+- "Serve images in next-gen formats" → use WebP instead of PNG
+
+---
+
+## 3. API Response Time Check
+
+### 3.1 Manual Timing Test
+
+**Use Railway logs:**
+```
+1. Go to Railway dashboard → API service → Deployments
+2. Click "View Logs"
+3. Perform actions in ResolutionFlow frontend
+4. Watch logs for response times
+```
+
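+FastAPI (Uvicorn) access logs look like the line below. Uvicorn does not log response times by default, so the bracketed timing only appears if you add your own timing middleware or log format: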
+```
+INFO:     127.0.0.1 - "GET /api/trees HTTP/1.1" 200 OK [0.023s]
+```
+
+**Benchmarks:**
+- ✅ **GOOD:** Most endpoints < 100ms
+- ⚠️ **WARNING:** Some endpoints 100-500ms
+- ❌ **BAD:** Any endpoint > 500ms
+
+### 3.2 Automated API Testing
+
+**Create a simple test script:**
+```python
+# File: tests/performance_test.py
+
+import httpx
+import time
+from statistics import mean
+
+API_BASE = "https://api.resolutionflow.com"  # Your Railway API URL
+TOKEN = "your-jwt-token-here"  # Get from browser DevTools after login
+
+headers = {"Authorization": f"Bearer {TOKEN}"}
+
+def time_endpoint(method, path, **kwargs):
+    """Time a single API request and return (elapsed_ms, status_code)."""
+    start = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()
+    response = httpx.request(method, f"{API_BASE}{path}", headers=headers, **kwargs)
+    elapsed = (time.perf_counter() - start) * 1000  # Convert to milliseconds
+    return elapsed, response.status_code
+
+# Test critical endpoints as (method, path, extra request kwargs); adjust paths to your API prefix
+tests = [
+    ("GET", "/api/trees", {}),
+    ("GET", "/api/trees/1", {}),  # Replace with actual tree ID
+    ("GET", "/api/trees/1/nodes", {}),
+    ("POST", "/api/trees/search", {"json": {"query": "password"}}),
+]
+
+print("API Performance Test Results:")
+print("-" * 50)
+
+for method, path, kwargs in tests:
+    times = []
+    for _ in range(5):  # Run each test 5 times
+        elapsed, status = time_endpoint(method, path, **kwargs)
+        times.append(elapsed)
+    
+    avg_time = mean(times)
+    print(f"{method} {path}")
+    print(f"  Average: {avg_time:.2f}ms")
+    print(f"  Min: {min(times):.2f}ms, Max: {max(times):.2f}ms")
+    print()
+```
+
+**Run it:**
+```bash
+python tests/performance_test.py
+```
+
+---
+
+## 4. Monitoring Setup
+
+### 4.1 Railway Built-in Monitoring
+
+**What Railway gives you for free:**
+```
+1. Go to Railway dashboard
+2. Click each service (API, Frontend, PostgreSQL)
+3. Go to "Metrics" tab
+```
+
+**Watch for:**
+- CPU usage spikes (should stay < 50% normally)
+- Memory usage growing over time (memory leak indicator)
+- Request rate (see usage patterns)
+
+**Set up alerts:**
+```
+1. Railway dashboard → Project Settings → Notifications
+2. Add your email
+3. Enable "Deployment Failed" and "Service Crashed"
+```
+
+### 4.2 Sentry Error Tracking (Recommended)
+
+**Why add Sentry:**
+- Free tier = 5,000 errors/month
+- Email alerts when things break
+- See exact user actions before crash
+- Industry standard (your future dev team will expect this)
+
+**Setup (5 minutes):**
+
+**Backend (FastAPI):**
+```bash
+pip install "sentry-sdk[fastapi]"
+```
+```python
+# File: main.py (add at the top)
+import os
+import sentry_sdk
+
+sentry_sdk.init(
+    dsn=os.environ.get("SENTRY_DSN"),  # Set in Railway variables; get the DSN from sentry.io after signup
+    traces_sample_rate=0.1,  # 10% of requests (free tier friendly)
+    environment="production",
+)
+```
+
+**Frontend (React):**
+```bash
+npm install @sentry/react
+```
+```javascript
+// File: src/index.js (add at the top)
+
+import * as Sentry from "@sentry/react";
+
+Sentry.init({
+  dsn: "your-sentry-dsn-here",
+  integrations: [Sentry.browserTracingIntegration()],  // @sentry/react v8+; v7 used new Sentry.BrowserTracing()
+  tracesSampleRate: 0.1,
+  environment: "production",
+});
+```
+
+**Get your DSN:**
+```
+1. Sign up at sentry.io (free)
+2. Create two projects: one with the "FastAPI" platform, one with "React"
+3. Copy each project's DSN (looks like: https://abc123@o123.ingest.sentry.io/456)
+4. Add to Railway environment variables:
+   - SENTRY_DSN=your-dsn-here
+```
+
+**What you get:**
+
+- Email when errors occur
+- Stack traces showing exactly what broke
+- User session replay (see what they clicked before crash; requires adding the Replay integration to the frontend config)
+- Performance monitoring (slow API calls flagged automatically)
+
+---
+
+## 5. Load Testing with k6
+
+**Why k6:**
+- Industry standard (Grafana Labs maintains it)
+- Shows you EXACTLY how many concurrent users your app can handle
+- Simple JavaScript syntax
+- Free and open source
+
+### 5.1 Install k6
+
+**Windows (using Chocolatey):**
+```powershell
+choco install k6
+```
+
+**Or download directly:**
+- Go to: https://k6.io/docs/get-started/installation/
+- Download Windows installer
+- Run installer
+
+**Verify:**
+```bash
+k6 version
+```
+
+### 5.2 Create Load Test Script
+
+**File: `tests/load_test.js`**
+```javascript
+import http from 'k6/http';
+import { check, sleep } from 'k6';
+
+// Test configuration
+export const options = {
+  stages: [
+    { duration: '30s', target: 10 },  // Ramp up to 10 users over 30s
+    { duration: '1m', target: 10 },   // Stay at 10 users for 1 minute
+    { duration: '30s', target: 20 },  // Ramp up to 20 users
+    { duration: '1m', target: 20 },   // Stay at 20 users for 1 minute
+    { duration: '30s', target: 0 },   // Ramp down to 0
+  ],
+  thresholds: {
+    http_req_duration: ['p(95)<500'], // 95% of requests must complete in 500ms
+    http_req_failed: ['rate<0.01'],   // Less than 1% of requests can fail
+  },
+};
+
+const BASE_URL = 'https://api.resolutionflow.com';
+
+// Setup: runs once before the test starts (not once per virtual user);
+// the returned object is passed to every VU iteration as `data`.
+export function setup() {
+  const loginRes = http.post(`${BASE_URL}/api/auth/login`, 
+    JSON.stringify({
+      username: 'test_user',  // Replace with test account
+      password: 'test_password',
+    }),
+    { headers: { 'Content-Type': 'application/json' } }
+  );
+  
+  return { token: loginRes.json('access_token') };
+}
+
+// Main test: Simulate realistic user behavior
+export default function (data) {
+  const headers = {
+    'Authorization': `Bearer ${data.token}`,
+    'Content-Type': 'application/json',
+  };
+
+  // Scenario 1: Load dashboard (get trees list)
+  let res = http.get(`${BASE_URL}/api/trees`, { headers });
+  check(res, {
+    'dashboard loaded': (r) => r.status === 200,
+    'dashboard fast': (r) => r.timings.duration < 300,
+  });
+  sleep(1);  // User reads for 1 second
+
+  // Scenario 2: Open a tree
+  res = http.get(`${BASE_URL}/api/trees/1`, { headers });  // Replace with real tree ID
+  check(res, {
+    'tree loaded': (r) => r.status === 200,
+    'tree load fast': (r) => r.timings.duration < 500,
+  });
+  sleep(2);  // User reads tree for 2 seconds
+
+  // Scenario 3: Load tree nodes
+  res = http.get(`${BASE_URL}/api/trees/1/nodes`, { headers });
+  check(res, {
+    'nodes loaded': (r) => r.status === 200,
+    'nodes fast': (r) => r.timings.duration < 500,
+  });
+  sleep(1);
+
+  // Scenario 4: Search trees
+  res = http.post(
+    `${BASE_URL}/api/trees/search`,
+    JSON.stringify({ query: 'password reset' }),
+    { headers }
+  );
+  check(res, {
+    'search worked': (r) => r.status === 200,
+    'search fast': (r) => r.timings.duration < 400,
+  });
+  sleep(2);
+}
+```
+
+### 5.3 Run Load Test
+
+**Basic test (uses the script's stages: ramps to 10, then 20 users):**
+```bash
+k6 run tests/load_test.js
+```
+
+**Aggressive test (constant 50 users for 2 minutes; the CLI flags override the script's `stages`):**
+```bash
+k6 run --vus 50 --duration 2m tests/load_test.js
+```
+
+**What the output means:**
+```
+     ✓ dashboard loaded
+     ✓ dashboard fast
+     
+     checks.........................: 95.23% ✓ 1234  ✗ 78
+     data_received..................: 1.2 MB  20 kB/s
+     data_sent......................: 456 kB  7.6 kB/s
+     http_req_blocked...............: avg=1.2ms   min=0s   med=0s   max=45ms  p(90)=0s   p(95)=0s  
+     http_req_duration..............: avg=142ms   min=23ms med=98ms max=1.2s  p(90)=245ms p(95)=387ms
+     http_reqs......................: 1234   20.5/s
+```
+
+**How to read this:**
+
+- `checks`: % of `check()` assertions that passed (want > 95%)
+- `http_req_duration p(95)`: 95% of requests completed faster than this (want < 500ms)
+- `http_reqs`: Total requests sent, followed by the average requests per second
+- `http_req_failed`: % of requests that errored (want < 1%)
+
+### 5.4 Interpret Results
+
+**✅ GOOD (Ready for beta):**
+```
+http_req_duration p(95) < 500ms
+http_req_failed < 1%
+All checks passing > 95%
+```
+
+**⚠️ WARNING (Watch closely during beta):**
+```
+http_req_duration p(95) 500-1000ms
+http_req_failed 1-5%
+Some checks failing
+```
+
+**❌ BAD (Fix before beta launch):**
+```
+http_req_duration p(95) > 1000ms
+http_req_failed > 5%
+Lots of timeouts or 500 errors
+```
+
+---
+
+## 6. Pre-Launch Checklist
+
+Run this checklist **before** inviting beta testers:
+
+### Database
+- [ ] All critical indexes exist (Section 1.1)
+- [ ] Query performance < 200ms (Section 1.2)
+- [ ] No unexplained table bloat (Section 1.3)
+
+### Frontend
+- [ ] Large tree (100 nodes) renders without lag (Section 2.1)
+- [ ] Bundle size < 1MB (Section 2.2)
+- [ ] Lighthouse score > 70 (Section 2.3)
+
+### API
+- [ ] All endpoints < 500ms under load (Section 3)
+- [ ] Railway logs show no errors (Section 3.1)
+
+### Monitoring
+- [ ] Railway alerts configured (Section 4.1)
+- [ ] Sentry installed (optional but recommended) (Section 4.2)
+
+### Load Testing
+- [ ] k6 test passes with 20 concurrent users (Section 5.3)
+- [ ] Request failure rate < 1% during load test (Section 5.4)
+
+---
+
+## 7. Monthly Health Check (After Launch)
+
+Once live with beta testers, run this monthly:
+
+**Quick version (30 minutes):**
+```bash
+# 1. Check Railway metrics
+# Look for: CPU/memory trends, error rate spikes
+
+# 2. Review Sentry errors (if installed)
+# Look for: New error patterns, increasing error rates
+
+# 3. Run quick load test
+k6 run tests/load_test.js
+
+# 4. Check database query times
+# Run queries from Section 1.2, watch for slowdowns
+```
+
+**When to do deep dive:**
+- After adding major new features
+- If users report slowness
+- Before scaling to new MSP clients
+- Every 3 months minimum
+
+---
+
+## 8. Common Performance Issues & Fixes
+
+### Issue: "Search is slow"
+
+**Diagnosis:**
+```sql
+EXPLAIN ANALYZE 
+SELECT * FROM trees 
+WHERE to_tsvector('english', name || ' ' || description) @@ to_tsquery('english', 'password');
+```
+
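+**Fix:** Add a GIN index on the same expression the query uses (the planner only uses an expression index when the expressions match):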
+```sql
+CREATE INDEX idx_trees_fts ON trees USING GIN (to_tsvector('english', name || ' ' || description));
+```
+
+### Issue: "Loading tree nodes is slow"
+
+**Diagnosis:** Missing index on foreign key
+
+**Fix:**
+```sql
+CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tree_nodes_tree_id ON tree_nodes(tree_id);
+```
+
+### Issue: "Dashboard takes forever to load"
+
+**Diagnosis:** Fetching too much data
+
+**Fix:** Add pagination to API:
+```sql
+-- Instead of: SELECT * FROM trees
+SELECT * FROM trees ORDER BY updated_at DESC LIMIT 20 OFFSET 0;  -- stable ORDER BY keeps pages consistent
+```
+
+### Issue: "Frontend feels sluggish"
+
+**Diagnosis:** Re-rendering too often
+
+**Fix:** Add React.memo() to components, use proper dependency arrays in useEffect
+
+### Issue: "API crashes under load"
+
+**Diagnosis:** Possibly not enough Railway resources (confirm in the Metrics tab that memory or CPU is hitting its limit)
+
+**Fix:** 
+```
+1. Railway dashboard → API service → Settings
+2. Increase memory limit (default is 512MB, try 1GB)
+3. Enable auto-scaling if needed
+```
+
+---
+
+## Resources
+
+**Tools mentioned:**
+- k6: https://k6.io/docs/
+- Sentry: https://sentry.io/
+- PostgreSQL EXPLAIN: https://www.postgresql.org/docs/current/using-explain.html
+- Chrome Lighthouse: Built into Chrome DevTools (F12)
+
+**When to get help:**
+- k6 test failing badly (> 10% error rate)
+- Database queries consistently > 1 second
+- Sentry showing critical errors
+- Railway CPU/memory maxing out
+
+**Next steps after this checklist:**
+- If all checks pass → Launch beta confidently
+- If warnings found → Document them, monitor during beta
+- If critical issues → Fix before launch, re-run tests