resolutionflow/docs/PERFORMANCE-HEALTH-CHECK.md

# ResolutionFlow Performance Health Check

**Purpose:** Verify application performance and scalability before/during beta testing
**When to run:** Before beta launch, then monthly during growth phase
**Time required:** 2-3 hours first time, 30-60 minutes for routine checks

---

## Prerequisites

- [ ] Docker Desktop running
- [ ] Access to Railway dashboard
- [ ] VS Code open with ResolutionFlow project
- [ ] Python virtual environment activated
- [ ] k6 installed (a standalone binary, see Section 5.1); Node.js is needed only for the frontend build

---

## 1. Database Performance Check

### 1.1 Verify Indexes Exist

**Why:** Indexes are like the index in a book. Without them, PostgreSQL has to scan every row of a table, which gets slower as data grows. With them, lookups on the indexed column stay fast.

**Commands to run:**
```bash
# Connect to your Railway PostgreSQL database
# Get connection string from Railway dashboard → PostgreSQL service → Variables → DATABASE_URL

# Option 1: Use Railway CLI
railway connect PostgreSQL

# Option 2: Use psql directly
psql "your-database-url-here"
```

**Once connected, run:**
```sql
-- Check what indexes exist
SELECT
    tablename,
    indexname,
    indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexname;
```

**What you're looking for:**

✅ **GOOD:** You should see indexes on:
- `users.email` (for login lookups)
- `users.username` (for login lookups)
- `trees.created_by` (for "my trees" queries)
- `tree_nodes.tree_id` (for loading tree structure)
- `sessions.tree_id` (for session lookups)

❌ **BAD:** If these are missing, queries will slow down as data grows

**Fix if needed:**
```sql
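-- Example: add missing indexes
-- CONCURRENTLY avoids blocking writes on a live database
-- (it cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_trees_created_by ON trees(created_by);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tree_nodes_tree_id ON tree_nodes(tree_id);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_sessions_tree_id ON sessions(tree_id);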
```
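The visual comparison under "What you're looking for" can also be scripted. The following is a minimal sketch, assuming you have already fetched `(tablename, indexdef)` rows from `pg_indexes` with whatever database driver you use. The helper names `leading_column` and `missing_indexes` are illustrative and not part of ResolutionFlow, and the parsing only looks at the first column of each index.

```python
import re

# (table, column) pairs this checklist expects to be indexed
REQUIRED = [
    ("users", "email"),
    ("users", "username"),
    ("trees", "created_by"),
    ("tree_nodes", "tree_id"),
    ("sessions", "tree_id"),
]


def leading_column(indexdef):
    """Return the first indexed column named in a pg_indexes.indexdef string."""
    match = re.search(r"\(([^)]*)\)", indexdef)
    if not match:
        return None
    first = match.group(1).split(",")[0].strip()
    if not first:
        return None
    # Drop quoting and trailing modifiers such as DESC
    return first.split()[0].strip('"')


def missing_indexes(rows, required=REQUIRED):
    """Return the required (table, column) pairs that no index starts with."""
    covered = {(table, leading_column(indexdef)) for table, indexdef in rows}
    return [pair for pair in required if pair not in covered]
```

Calling `missing_indexes(rows)` returns an empty list when every expected column leads at least one index. Expression indexes (such as a full-text GIN index) are simply ignored by this sketch.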

### 1.2 Test Query Performance

**Run realistic queries and time them:**
```sql
-- Enable timing
\timing

-- Test: Full-text search on trees (simulates search bar)
SELECT * FROM trees
WHERE to_tsvector('english', name || ' ' || description) @@ to_tsquery('english', 'password');

-- Test: Load tree with all nodes (simulates opening tree editor)
SELECT tn.*
FROM tree_nodes tn
WHERE tn.tree_id = 1  -- Replace with actual tree ID
ORDER BY tn.position;

-- Test: User's tree list (simulates dashboard)
SELECT * FROM trees
WHERE created_by = 1  -- Replace with actual user ID
ORDER BY updated_at DESC
LIMIT 20;
```

**Benchmarks:**
- ✅ **GOOD:** All queries < 50ms
- ⚠️ **WARNING:** Any query 50-200ms (optimize later)
- ❌ **BAD:** Any query > 200ms (optimize NOW)
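If you save the psql session output to a file, the `Time: ... ms` lines that `\timing` prints can be graded against the benchmarks above automatically. This is a rough sketch; the function names are made up for illustration, and the thresholds are copied from the list above.

```python
import re

# psql prints e.g. "Time: 12.345 ms" after each statement when \timing is on
TIME_LINE = re.compile(r"^Time: ([0-9]+(?:\.[0-9]+)?) ms")


def classify_ms(ms):
    """Grade one query time using the benchmarks in this section."""
    if ms < 50:
        return "GOOD"
    if ms <= 200:
        return "WARNING"
    return "BAD"


def classify_timings(psql_output):
    """Return a (milliseconds, grade) tuple for every timing line found."""
    results = []
    for line in psql_output.splitlines():
        match = TIME_LINE.match(line.strip())
        if match:
            ms = float(match.group(1))
            results.append((ms, classify_ms(ms)))
    return results
```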

### 1.3 Check Database Size
```sql
-- See how much data you have
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

**What this tells you:** If tables are growing unexpectedly large, you might have data bloat or missing cleanup logic.

---

## 2. Frontend Performance Check

### 2.1 Test Large Tree Rendering

**Create a "stress test" tree:**

1. Log into ResolutionFlow frontend
2. Create a new tree called "Performance Test - Large Tree"
3. Add 50-100 nodes (use copy/paste to speed this up)
4. Save the tree

**What to watch:**

- Does the editor lag when adding nodes?
- Does scrolling feel smooth?
- Does saving take more than 2-3 seconds?

**Tools to use:**

Open Chrome DevTools (F12):
```
1. Go to Performance tab
2. Click Record (red circle)
3. Interact with large tree (scroll, add nodes, expand/collapse)
4. Stop recording
5. Look for long tasks flagged in red (blocking/slow operations)
```

**Benchmarks:**
- ✅ **GOOD:** No operations block for > 100ms
- ⚠️ **WARNING:** Some operations 100-300ms
- ❌ **BAD:** Operations > 300ms (users will notice lag)

### 2.2 Check Bundle Size

**Why:** Large JavaScript bundles = slow initial page load
```bash
# From your React frontend directory
cd frontend
npm run build

# Look at the output - it will show bundle sizes
```

**Benchmarks:**
- ✅ **GOOD:** Main bundle < 500KB gzipped
- ⚠️ **WARNING:** 500KB - 1MB
- ❌ **BAD:** > 1MB (investigate what's bloating it)
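Not every build tool prints gzipped sizes, so it can help to measure them directly. The sketch below assumes the build output is a directory of `.js` files; the directory you pass in (for example `frontend/build` or `frontend/dist`) depends on your tooling and is an assumption here, as is the helper name `bundle_report`.

```python
import gzip
from pathlib import Path


def gzipped_size(path):
    """Return the gzip-compressed size of one file, in bytes."""
    return len(gzip.compress(Path(path).read_bytes()))


def bundle_report(build_dir):
    """Map each .js file under build_dir to its gzipped size, largest first."""
    root = Path(build_dir)
    sizes = {
        js_file.relative_to(root).as_posix(): gzipped_size(js_file)
        for js_file in root.rglob("*.js")
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))
```

Compare the largest entry against the 500KB and 1MB benchmarks above. Sizes from Python's `gzip` module may differ slightly from what your hosting platform's compression produces.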

### 2.3 Lighthouse Audit

**Chrome has this built-in:**
```
1. Open ResolutionFlow in Chrome
2. F12 → Lighthouse tab
3. Select "Desktop" + "Performance"
4. Click "Analyze page load"
```

**Benchmarks:**
- ✅ **GOOD:** Performance score > 80
- ⚠️ **WARNING:** 60-80
- ❌ **BAD:** < 60

**Common issues and fixes:**
- "Eliminate render-blocking resources" → lazy load components
- "Reduce unused JavaScript" → code splitting needed
- "Serve images in next-gen formats" → use WebP instead of PNG

---

## 3. API Response Time Check

### 3.1 Manual Timing Test

**Use Railway logs:**
```
1. Go to Railway dashboard → API service → Deployments
2. Click "View Logs"
3. Perform actions in ResolutionFlow frontend
4. Watch logs for response times
```

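With request-timing middleware in place, FastAPI logs look something like the line below (Uvicorn's default access log shows the request and status code but no duration):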
```
INFO:     127.0.0.1 - "GET /api/trees HTTP/1.1" 200 OK [0.023s]
```

**Benchmarks:**
- ✅ **GOOD:** Most endpoints < 100ms
- ⚠️ **WARNING:** Some endpoints 100-500ms
- ❌ **BAD:** Any endpoint > 500ms
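If your logs show no duration per request, one way to add it is a small ASGI middleware. This is a sketch, not ResolutionFlow's actual logging setup: the class name `TimingMiddleware`, the logger name `request_timing`, and the log format are all assumptions. It uses only the standard library and the plain ASGI interface, so it should work with FastAPI via `app.add_middleware(TimingMiddleware)`.

```python
import logging
import time

logger = logging.getLogger("request_timing")


class TimingMiddleware:
    """Log method, path, status code, and elapsed time for each HTTP request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request (e.g. websockets)
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        status = {"code": None}

        async def send_wrapper(message):
            # The status code arrives in the first response message
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info(
                "%s %s -> %s [%.1fms]",
                scope["method"], scope["path"], status["code"], elapsed_ms,
            )
```

The logger needs to be at `INFO` level with a handler attached for these lines to reach Railway's log view.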

### 3.2 Automated API Testing

**Create a simple test script:**
```python
# File: tests/performance_test.py

import time
from statistics import mean

import httpx

API_BASE = "https://api.resolutionflow.com"  # Your Railway API URL
TOKEN = "your-jwt-token-here"  # Get from browser DevTools after login

headers = {"Authorization": f"Bearer {TOKEN}"}


def time_endpoint(client, method, path, **kwargs):
    """Time a single API request; returns (elapsed_ms, status_code)."""
    start = time.perf_counter()
    response = client.request(method, f"{API_BASE}{path}", **kwargs)
    elapsed = (time.perf_counter() - start) * 1000  # Convert to milliseconds
    return elapsed, response.status_code


# Test critical endpoints: (method, path, extra request arguments)
tests = [
    ("GET", "/api/trees", {}),
    ("GET", "/api/trees/1", {}),  # Replace with actual tree ID
    ("GET", "/api/trees/1/nodes", {}),
    ("POST", "/api/trees/search", {"json": {"query": "password"}}),
]

print("API Performance Test Results:")
print("-" * 50)

# Reuse one client so connection setup is not repeated for every request
with httpx.Client(headers=headers, timeout=30.0) as client:
    for method, path, kwargs in tests:
        times = []
        statuses = set()
        for _ in range(5):  # Run each test 5 times
            elapsed, status = time_endpoint(client, method, path, **kwargs)
            times.append(elapsed)
            statuses.add(status)

        # Show status codes too: a fast 401 or 500 is not a passing result
        print(f"{method} {path} (status: {sorted(statuses)})")
        print(f"  Average: {mean(times):.2f}ms")
        print(f"  Min: {min(times):.2f}ms, Max: {max(times):.2f}ms")
        print()
```

**Run it:**
```bash
python tests/performance_test.py
```

---

## 4. Monitoring Setup

### 4.1 Railway Built-in Monitoring

**What Railway gives you for free:**
```
1. Go to Railway dashboard
2. Click each service (API, Frontend, PostgreSQL)
3. Go to "Metrics" tab
```

**Watch for:**
- CPU usage spikes (should stay < 50% normally)
- Memory usage growing over time (memory leak indicator)
- Request rate (see usage patterns)

**Set up alerts:**
```
1. Railway dashboard → Project Settings → Notifications
2. Add your email
3. Enable "Deployment Failed" and "Service Crashed"
```

### 4.2 Sentry Error Tracking (Recommended)

**Why add Sentry:**
- Free tier = 5,000 errors/month
- Email alerts when things break
- See exact user actions before crash
- Industry standard (your future dev team will expect this)

**Setup (5 minutes):**

**Backend (FastAPI):**
```bash
pip install "sentry-sdk[fastapi]"
```
```python
# File: main.py (add at the top)

import os

import sentry_sdk

sentry_sdk.init(
    # Read the DSN from the SENTRY_DSN variable set in Railway; Sentry sends nothing if it is unset
    dsn=os.environ.get("SENTRY_DSN"),
    traces_sample_rate=0.1,  # 10% of requests (free tier friendly)
    environment="production",
)
```

**Frontend (React):**
```bash
npm install @sentry/react
```
```javascript
// File: src/index.js (add at the top)

import * as Sentry from "@sentry/react";

Sentry.init({
  dsn: "your-sentry-dsn-here",
  integrations: [Sentry.browserTracingIntegration()],  // SDK v8+; older 7.x releases used new Sentry.BrowserTracing()
  tracesSampleRate: 0.1,
  environment: "production",
});
```

**Get your DSN:**
```
1. Sign up at sentry.io (free)
2. Create two projects, one with the "FastAPI" platform and one with "React" (each project has its own DSN)
3. Copy each DSN (looks like: https://abc123@o123.ingest.sentry.io/456)
4. Add the backend DSN to the API service's Railway environment variables:
   - SENTRY_DSN=your-dsn-here

**What you get:**

- Email when errors occur
- Stack traces showing exactly what broke
- User session replay (see what they clicked before the crash; this requires enabling Sentry's Replay integration, which the setup above does not do)
- Performance monitoring (slow API calls flagged automatically)

---

## 5. Load Testing with k6

**Why k6:**
- Industry standard (Grafana Labs maintains it)
- Shows how your app behaves at a given number of concurrent users, and where it starts to degrade
- Simple JavaScript syntax
- Free and open source

### 5.1 Install k6

**Windows (using Chocolatey):**
```powershell
choco install k6
```

**Or download directly:**
- Go to: https://k6.io/docs/get-started/installation/
- Download Windows installer
- Run installer

**Verify:**
```bash
k6 version
```

### 5.2 Create Load Test Script

**File: `tests/load_test.js`**
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

// Test configuration
export const options = {
  stages: [
    { duration: '30s', target: 10 },  // Ramp up to 10 users over 30s
    { duration: '1m', target: 10 },   // Stay at 10 users for 1 minute
    { duration: '30s', target: 20 },  // Ramp up to 20 users
    { duration: '1m', target: 20 },   // Stay at 20 users for 1 minute
    { duration: '30s', target: 0 },   // Ramp down to 0
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests must complete in 500ms
    http_req_failed: ['rate<0.01'],   // Less than 1% of requests can fail
  },
};

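const BASE_URL = 'https://api.resolutionflow.com';

// Setup: runs once before the test starts; the returned data is shared by all virtual users
export function setup() {
  const loginRes = http.post(`${BASE_URL}/api/auth/login`,
    JSON.stringify({
      username: 'test_user',  // Replace with test account
      password: 'test_password',
    }),
    { headers: { 'Content-Type': 'application/json' } }
  );

  // Abort early if login fails; otherwise every request below would fail on auth
  if (loginRes.status !== 200) {
    throw new Error(`Login failed with status ${loginRes.status}`);
  }

  return { token: loginRes.json('access_token') };
}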

// Main test: Simulate realistic user behavior
export default function (data) {
  const headers = {
    'Authorization': `Bearer ${data.token}`,
    'Content-Type': 'application/json',
  };

  // Scenario 1: Load dashboard (get trees list)
  let res = http.get(`${BASE_URL}/api/trees`, { headers });
  check(res, {
    'dashboard loaded': (r) => r.status === 200,
    'dashboard fast': (r) => r.timings.duration < 300,
  });
  sleep(1);  // User reads for 1 second

  // Scenario 2: Open a tree
  res = http.get(`${BASE_URL}/api/trees/1`, { headers });  // Replace with real tree ID
  check(res, {
    'tree loaded': (r) => r.status === 200,
    'tree load fast': (r) => r.timings.duration < 500,
  });
  sleep(2);  // User reads tree for 2 seconds

  // Scenario 3: Load tree nodes
  res = http.get(`${BASE_URL}/api/trees/1/nodes`, { headers });
  check(res, {
    'nodes loaded': (r) => r.status === 200,
    'nodes fast': (r) => r.timings.duration < 500,
  });
  sleep(1);

  // Scenario 4: Search trees
  res = http.post(
    `${BASE_URL}/api/trees/search`,
    JSON.stringify({ query: 'password reset' }),
    { headers }
  );
  check(res, {
    'search worked': (r) => r.status === 200,
    'search fast': (r) => r.timings.duration < 400,
  });
  sleep(2);
}
```

### 5.3 Run Load Test

**Basic test (ramps to 20 users, as defined in `stages`):**
```bash
k6 run tests/load_test.js
```

**Aggressive test (50 constant users for 2 minutes; the CLI flags take precedence over the script's `stages`):**
```bash
k6 run --vus 50 --duration 2m tests/load_test.js
```

**What the output means:**
```
     ✓ dashboard loaded
     ✓ dashboard fast

     checks.........................: 95.23% ✓ 1234  ✗ 78
     data_received..................: 1.2 MB  20 kB/s
     data_sent......................: 456 kB  7.6 kB/s
     http_req_blocked...............: avg=1.2ms   min=0s   med=0s   max=45ms  p(90)=0s   p(95)=0s
     http_req_duration..............: avg=142ms   min=23ms med=98ms max=1.2s  p(90)=245ms p(95)=387ms
     http_reqs......................: 1234   20.5/s
```

**How to read this:**

- `checks`: % of `check()` assertions that passed (want > 95%)
- `http_req_duration p(95)`: 95% of requests completed faster than this (want < 500ms)
- `http_reqs`: Total requests sent, followed by the average rate per second
- `http_req_failed`: % of requests that errored (want < 1%); it appears in the full k6 summary, which is abbreviated above
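To see why the thresholds use `p(95)` rather than the average, it helps to compute a percentile by hand. The sketch below uses the simple nearest-rank method on made-up durations; k6 may interpolate between samples, so its reported values can differ slightly.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative request durations in milliseconds: nine normal requests and one slow outlier
durations = [80, 90, 100, 110, 120, 130, 140, 150, 160, 1200]

median = percentile(durations, 50)   # the typical request
p95 = percentile(durations, 95)      # the slow tail that some users actually hit
average = sum(durations) / len(durations)
```

Here the median is 120ms and the average is 228ms, but the 95th percentile is 1200ms. The average hides the outlier that `p(95)` exposes.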

### 5.4 Interpret Results

**✅ GOOD (Ready for beta):**
```
http_req_duration p(95) < 500ms
http_req_failed < 1%
checks > 95% passing
```

**⚠️ WARNING (Watch closely during beta):**
```
http_req_duration p(95) 500-1000ms
http_req_failed 1-5%
Some checks failing
```

**❌ BAD (Fix before beta launch):**
```
http_req_duration p(95) > 1000ms
http_req_failed > 5%
Lots of timeouts or 500 errors
```
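The three tiers above can be folded into one function if you want to grade results in a script. This is only a sketch: the function name is hypothetical, the inputs are the `p(95)` duration in milliseconds and the failure rate as a fraction, and a result that is GOOD on one metric but worse on the other takes the worse grade.

```python
def load_test_verdict(p95_ms, failed_rate):
    """Grade a k6 run using the GOOD / WARNING / BAD tiers in this section."""
    if p95_ms > 1000 or failed_rate > 0.05:
        return "BAD"
    if p95_ms >= 500 or failed_rate >= 0.01:
        return "WARNING"
    return "GOOD"
```

For example, `load_test_verdict(387, 0.0)` grades the sample output shown in Section 5.3, assuming no failed requests.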

---

## 6. Pre-Launch Checklist

Run this checklist **before** inviting beta testers:

### Database
- [ ] All critical indexes exist (Section 1.1)
- [ ] Query performance < 200ms (Section 1.2)
- [ ] No unexplained table bloat (Section 1.3)

### Frontend
- [ ] Large tree (100 nodes) renders without lag (Section 2.1)
- [ ] Bundle size < 1MB (Section 2.2)
- [ ] Lighthouse score > 70 (Section 2.3)

### API
- [ ] All endpoints < 500ms under load (Section 3)
- [ ] Railway logs show no errors (Section 4.1)

### Monitoring
- [ ] Railway alerts configured (Section 4.1)
- [ ] Sentry installed (optional but recommended) (Section 4.2)

### Load Testing
- [ ] k6 test passes with 20 concurrent users (Section 5.3)
- [ ] Request failure rate < 1% during load test (Section 5.4)

---

## 7. Monthly Health Check (After Launch)

Once live with beta testers, run this monthly:

**Quick version (30 minutes):**
```bash
# 1. Check Railway metrics
# Look for: CPU/memory trends, error rate spikes

# 2. Review Sentry errors (if installed)
# Look for: New error patterns, increasing error rates

# 3. Run quick load test
k6 run tests/load_test.js

# 4. Check database query times
# Run queries from Section 1.2, watch for slowdowns
```

**When to do deep dive:**
- After adding major new features
- If users report slowness
- Before scaling to new MSP clients
- Every 3 months minimum

---

## 8. Common Performance Issues & Fixes

### Issue: "Search is slow"

**Diagnosis:**
```sql
EXPLAIN ANALYZE
SELECT * FROM trees
WHERE to_tsvector('english', name || ' ' || description) @@ to_tsquery('english', 'password');
```

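**Fix:** Add a GIN index. The index expression must match the expression in the query's `WHERE` clause exactly, or PostgreSQL will not use it. Note that `name || ' ' || description` is NULL whenever either column is NULL, so such rows never match; if that matters, wrap nullable columns in `coalesce(column, '')` in both the query and the index.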
```sql
CREATE INDEX idx_trees_fts ON trees USING GIN (to_tsvector('english', name || ' ' || description));
```

### Issue: "Loading tree nodes is slow"

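**Diagnosis:** Likely a missing index on the `tree_id` foreign key. Confirm with `EXPLAIN ANALYZE`: a `Seq Scan on tree_nodes` in the plan means no index is being used.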

**Fix:**
```sql
CREATE INDEX idx_tree_nodes_tree_id ON tree_nodes(tree_id);
```

### Issue: "Dashboard takes forever to load"

**Diagnosis:** Fetching too much data

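**Fix:** Add pagination to the API query, with a stable `ORDER BY` so pages do not shuffle between requests:
```sql
-- Instead of: SELECT * FROM trees
-- Use:
SELECT * FROM trees ORDER BY updated_at DESC LIMIT 20 OFFSET 0;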
```
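On the API side, the page number from the client has to be turned into `LIMIT` and `OFFSET` values. A minimal sketch, assuming 1-based page numbers and a default page size of 20; the function name is illustrative, not an existing ResolutionFlow helper.

```python
def page_to_limit_offset(page, per_page=20):
    """Convert a 1-based page number into (limit, offset) for a SQL query."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be at least 1")
    return per_page, (page - 1) * per_page
```

Page 1 maps to `LIMIT 20 OFFSET 0` and page 3 to `LIMIT 20 OFFSET 40`. Pass both values as query parameters rather than formatting them into the SQL string.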

### Issue: "Frontend feels sluggish"

**Diagnosis:** Re-rendering too often

**Fix:** Wrap expensive components in `React.memo()`, and give `useEffect`, `useMemo`, and `useCallback` accurate dependency arrays so they do not re-run on every render

### Issue: "API crashes under load"

**Diagnosis:** Not enough Railway resources

**Fix:**
```
1. Railway dashboard → API service → Settings
2. Increase memory limit (default is 512MB, try 1GB)
3. Enable auto-scaling if needed
```

---

## Resources

**Tools mentioned:**
- k6: https://k6.io/docs/
- Sentry: https://sentry.io/
- PostgreSQL EXPLAIN: https://www.postgresql.org/docs/current/using-explain.html
- Chrome Lighthouse: Built into Chrome DevTools (F12)

**When to get help:**
- k6 test failing badly (> 10% error rate)
- Database queries consistently > 1 second
- Sentry showing critical errors
- Railway CPU/memory maxing out

**Next steps after this checklist:**
- If all checks pass → Launch beta confidently
- If warnings found → Document them, monitor during beta
- If critical issues → Fix before launch, re-run tests