resolutionflow/docs/PERFORMANCE-HEALTH-CHECK.md
2026-02-04 21:46:32 -05:00

ResolutionFlow Performance Health Check

Purpose: Verify application performance and scalability before/during beta testing
When to run: Before beta launch, then monthly during growth phase
Time required: 2-3 hours first time, 30-60 minutes for routine checks


Prerequisites

  • Docker Desktop running
  • Access to Railway dashboard
  • VS Code open with ResolutionFlow project
  • Python virtual environment activated
  • Node.js installed (for the frontend build; k6 is a standalone binary and does not require it)

1. Database Performance Check

1.1 Verify Indexes Exist

Why: Indexes are like the index in a book. Without them, PostgreSQL scans every row, which gets slower as tables grow. With them, lookups on the indexed column stay fast.

Commands to run:

# Connect to your Railway PostgreSQL database
# Get connection string from Railway dashboard → PostgreSQL service → Variables → DATABASE_URL

# Option 1: Use Railway CLI
railway connect PostgreSQL

# Option 2: Use psql directly
psql "your-database-url-here"

Once connected, run:

-- Check what indexes exist
SELECT 
    tablename, 
    indexname, 
    indexdef 
FROM pg_indexes 
WHERE schemaname = 'public' 
ORDER BY tablename, indexname;

What you're looking for:

GOOD: You should see indexes on:

  • users.email (for login lookups)
  • users.username (for login lookups)
  • trees.created_by (for "my trees" queries)
  • tree_nodes.tree_id (for loading tree structure)
  • sessions.tree_id (for session lookups)

BAD: If these are missing, queries will slow down as data grows

Fix if needed:

-- Example: Add missing index
CREATE INDEX idx_trees_created_by ON trees(created_by);
CREATE INDEX idx_tree_nodes_tree_id ON tree_nodes(tree_id);
CREATE INDEX idx_sessions_tree_id ON sessions(tree_id);
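
Comparing the query output against the expected list by eye is error-prone, so the check can be scripted. The following is a minimal sketch, assuming the rows from the pg_indexes query have been fetched as (tablename, indexdef) tuples; the helper names and the sample rows are hypothetical. It only looks at the leading column of each index, because a multi-column index speeds up a lookup mainly when the column comes first.

```python
import re

# Columns the checklist above expects to be indexed: (table, column).
REQUIRED = [
    ("users", "email"),
    ("users", "username"),
    ("trees", "created_by"),
    ("tree_nodes", "tree_id"),
    ("sessions", "tree_id"),
]


def leading_column(indexdef):
    """Return the first indexed column of a pg_indexes.indexdef string, or None."""
    match = re.search(r"\(([^)]*)\)", indexdef)
    if not match:
        return None
    first = match.group(1).split(",")[0].strip()
    # Drop ordering keywords such as "email DESC" and surrounding quotes.
    return first.split()[0].strip('"') if first else None


def missing_indexes(rows, required=REQUIRED):
    """rows: iterable of (tablename, indexdef) tuples from the pg_indexes query."""
    covered = {(table, leading_column(indexdef)) for table, indexdef in rows}
    return [pair for pair in required if pair not in covered]


# Hypothetical query output, for illustration only.
sample_rows = [
    ("users", "CREATE UNIQUE INDEX users_email_key ON public.users USING btree (email)"),
    ("users", "CREATE UNIQUE INDEX users_username_key ON public.users USING btree (username)"),
    ("trees", "CREATE INDEX idx_trees_created_by ON public.trees USING btree (created_by)"),
    ("tree_nodes", "CREATE UNIQUE INDEX tree_nodes_pkey ON public.tree_nodes USING btree (id)"),
]

for table, column in missing_indexes(sample_rows):
    print(f"Missing index on {table}.{column}")
```

With the sample rows, the script reports tree_nodes.tree_id and sessions.tree_id as missing, since only the primary key exists on tree_nodes and no sessions index is listed.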

1.2 Test Query Performance

Run realistic queries and time them:

-- Enable timing
\timing

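-- Test: Full-text search on trees (simulates search bar)
-- Note: if description can be NULL, the concatenation is NULL and the row never matches.
-- coalesce(description, '') avoids that; use the same expression in any matching index.
SELECT * FROM trees 
WHERE to_tsvector('english', name || ' ' || description) @@ to_tsquery('english', 'password');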

-- Test: Load tree with all nodes (simulates opening tree editor)
SELECT tn.* 
FROM tree_nodes tn 
WHERE tn.tree_id = 1  -- Replace with actual tree ID
ORDER BY tn.position;

-- Test: User's tree list (simulates dashboard)
SELECT * FROM trees 
WHERE created_by = 1  -- Replace with actual user ID
ORDER BY updated_at DESC 
LIMIT 20;

Benchmarks:

  • GOOD: All queries < 50ms
  • ⚠️ WARNING: Any query 50-200ms (optimize later)
  • BAD: Any query > 200ms (optimize NOW)

1.3 Check Database Size

-- See how much data you have
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

What this tells you: If tables are growing unexpectedly large, you might have data bloat or missing cleanup logic.


2. Frontend Performance Check

2.1 Test Large Tree Rendering

Create a "stress test" tree:

  1. Log into ResolutionFlow frontend
  2. Create a new tree called "Performance Test - Large Tree"
  3. Add 50-100 nodes (use copy/paste to speed this up)
  4. Save the tree

What to watch:

  • Does the editor lag when adding nodes?
  • Does scrolling feel smooth?
  • Does saving take more than 2-3 seconds?

Tools to use:

Open Chrome DevTools (F12):

1. Go to Performance tab
2. Click Record (red circle)
3. Interact with large tree (scroll, add nodes, expand/collapse)
4. Stop recording
5. Look for long tasks, which DevTools flags in red (blocking/slow operations)

Benchmarks:

  • GOOD: No operations block for > 100ms
  • ⚠️ WARNING: Some operations 100-300ms
  • BAD: Operations > 300ms (users will notice lag)

2.2 Check Bundle Size

Why: Large JavaScript bundles = slow initial page load

# From your React frontend directory
cd frontend
npm run build

# Look at the output - it will show bundle sizes

Benchmarks:

  • GOOD: Main bundle < 500KB gzipped
  • ⚠️ WARNING: 500KB - 1MB
  • BAD: > 1MB (investigate what's bloating it)
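
If the build output does not print gzipped sizes, they can be approximated with a short script. This is a rough sketch under stated assumptions: the output directory name ("frontend/build") is a guess that depends on the build tool (Vite writes to "dist" by default), the function names are hypothetical, and Python's gzip settings may not match what the hosting layer uses, so treat the numbers as estimates.

```python
import gzip
from pathlib import Path


def gzipped_size(path):
    """Size in bytes of the file's contents after gzip compression."""
    return len(gzip.compress(Path(path).read_bytes()))


def bundle_report(build_dir):
    """Return (file name, gzipped bytes) for every .js file, largest first."""
    root = Path(build_dir)
    if not root.is_dir():
        return []
    sizes = [(p.name, gzipped_size(p)) for p in root.rglob("*.js")]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def verdict(size_bytes):
    """Apply the benchmarks above, treating 1 KB as 1024 bytes."""
    kb = size_bytes / 1024
    if kb < 500:
        return "GOOD"
    if kb <= 1024:
        return "WARNING"
    return "BAD"


# "frontend/build" is an assumed output directory; adjust to your project.
for name, size in bundle_report("frontend/build"):
    print(f"{name}: {size / 1024:.1f} KB gzipped -> {verdict(size)}")
```

The largest file in the report is usually the main bundle, which is the one the benchmarks apply to.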

2.3 Lighthouse Audit

Chrome has this built-in:

1. Open ResolutionFlow in Chrome
2. F12 → Lighthouse tab
3. Select "Desktop" + "Performance"
4. Click "Analyze page load"

Benchmarks:

  • GOOD: Performance score > 80
  • ⚠️ WARNING: 60-80
  • BAD: < 60

Common issues and fixes:

  • "Eliminate render-blocking resources" → defer non-critical scripts and styles, lazy load components
  • "Reduce unused JavaScript" → code splitting needed
  • "Serve images in next-gen formats" → use WebP instead of PNG

3. API Response Time Check

3.1 Manual Timing Test

Use Railway logs:

1. Go to Railway dashboard → API service → Deployments
2. Click "View Logs"
3. Perform actions in ResolutionFlow frontend
4. Watch logs for response times

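Note: Uvicorn's default access log shows the method, path, and status code, but not the response time. To see timings, add timing middleware or a custom log format. With timing added, a log line might look like:

INFO:     127.0.0.1 - "GET /api/trees HTTP/1.1" 200 OK [0.023s]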

Benchmarks:

  • GOOD: Most endpoints < 100ms
  • ⚠️ WARNING: Some endpoints 100-500ms
  • BAD: Any endpoint > 500ms
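
Response times only appear in the logs if something records them. One way to do that is a small ASGI middleware, sketched below using only the standard library. The class name, logger name, and the `report` hook are hypothetical, not part of FastAPI or ResolutionFlow.

```python
import logging
import time

logger = logging.getLogger("request_timing")


class TimingMiddleware:
    """Pure ASGI middleware that reports how long each HTTP request takes."""

    def __init__(self, app, report=None):
        self.app = app
        # `report` is a hypothetical hook; by default timings go to the logger.
        self.report = report or self._log

    @staticmethod
    def _log(method, path, status, elapsed_ms):
        logger.info("%s %s -> %s in %.1fms", method, path, status, elapsed_ms)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and WebSocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        status = None

        async def send_wrapper(message):
            # Capture the status code from the first response message.
            nonlocal status
            if message["type"] == "http.response.start":
                status = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.report(scope["method"], scope["path"], status, elapsed_ms)
```

In a FastAPI app this would typically be registered with app.add_middleware(TimingMiddleware). The log lines are emitted at INFO level, so they only show up in Railway if logging is configured to include INFO, for example with logging.basicConfig(level=logging.INFO).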

3.2 Automated API Testing

Create a simple test script:

# File: tests/performance_test.py

import time
from statistics import mean

import httpx

API_BASE = "https://api.resolutionflow.com"  # Your Railway API URL
TOKEN = "your-jwt-token-here"  # Get from browser DevTools after login

headers = {"Authorization": f"Bearer {TOKEN}"}

def time_endpoint(client, method, path, **kwargs):
    """Time a single API request; returns (milliseconds, status code)"""
    start = time.perf_counter()  # Monotonic clock, suited to short intervals
    response = client.request(method, f"{API_BASE}{path}", **kwargs)
    elapsed = (time.perf_counter() - start) * 1000  # Convert to milliseconds
    return elapsed, response.status_code

# Test critical endpoints: (method, path, extra request arguments)
tests = [
    ("GET", "/api/trees", {}),
    ("GET", "/api/trees/1", {}),  # Replace with actual tree ID
    ("GET", "/api/trees/1/nodes", {}),
    ("POST", "/api/trees/search", {"json": {"query": "password"}}),
]

print("API Performance Test Results:")
print("-" * 50)

# Reuse one client so connection setup is not counted on every request
with httpx.Client(headers=headers, timeout=30.0) as client:
    for method, path, kwargs in tests:
        times = []
        statuses = set()
        for _ in range(5):  # Run each test 5 times
            elapsed, status = time_endpoint(client, method, path, **kwargs)
            times.append(elapsed)
            statuses.add(status)

        print(f"{method} {path}")
        print(f"  Status codes: {sorted(statuses)}")  # Anything but 200 means the timing is not meaningful
        print(f"  Average: {mean(times):.2f}ms")
        print(f"  Min: {min(times):.2f}ms, Max: {max(times):.2f}ms")
        print()

Run it:

python tests/performance_test.py
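
An average can hide a single slow request, which is why the load test later in this document uses the 95th percentile. The helper below is a minimal sketch of a nearest-rank percentile that could be applied to the collected timings; the function names and sample numbers are hypothetical.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(times_ms):
    """Summary statistics for a list of response times in milliseconds."""
    return {
        "avg": mean(times_ms),
        "p95": percentile(times_ms, 95),
        "max": max(times_ms),
    }


# Hypothetical timings: nine fast requests and one slow outlier.
sample = [22, 25, 21, 24, 23, 26, 22, 25, 24, 480]
stats = summarize(sample)
print(f"avg={stats['avg']:.1f}ms p95={stats['p95']}ms max={stats['max']}ms")
```

For the sample data the average is about 69 ms while the p95 is 480 ms, so the outlier is visible only in the percentile. With just 5 runs per endpoint the p95 is simply the slowest run, so increase the run count (20 or more) before relying on it.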

4. Monitoring Setup

4.1 Railway Built-in Monitoring

What Railway gives you for free:

1. Go to Railway dashboard
2. Click each service (API, Frontend, PostgreSQL)
3. Go to "Metrics" tab

Watch for:

  • CPU usage spikes (should stay < 50% normally)
  • Memory usage growing over time (memory leak indicator)
  • Request rate (see usage patterns)

Set up alerts:

1. Railway dashboard → Project Settings → Notifications
2. Add your email
3. Enable "Deployment Failed" and "Service Crashed"

4.2 Sentry Error Tracking (Optional but Recommended)

Why add Sentry:

  • Free tier = 5,000 errors/month
  • Email alerts when things break
  • See exact user actions before crash
  • Industry standard (your future dev team will expect this)

Setup (5 minutes):

Backend (FastAPI):

pip install "sentry-sdk[fastapi]"

# File: main.py (add at the top)

import os

import sentry_sdk

sentry_sdk.init(
    dsn=os.environ.get("SENTRY_DSN"),  # Set in Railway variables; get the value from sentry.io after signup
    traces_sample_rate=0.1,  # Trace 10% of requests (free tier friendly)
    environment="production",
)

Frontend (React):

npm install @sentry/react

// File: src/index.js (add at the top)

import * as Sentry from "@sentry/react";

Sentry.init({
  dsn: "your-sentry-dsn-here",
  // browserTracingIntegration() replaces `new Sentry.BrowserTracing()`, which was removed in SDK v8
  integrations: [Sentry.browserTracingIntegration()],
  tracesSampleRate: 0.1,
  environment: "production",
});

Get your DSN:

1. Sign up at sentry.io (free)
2. Create two projects: one with the "FastAPI" platform and one with "React" (each project gets its own DSN)
3. Copy each DSN (looks like: https://abc123@o123.ingest.sentry.io/456)
4. Add to Railway environment variables:
   - SENTRY_DSN=your-dsn-here

What you get:

  • Email when errors occur
  • Stack traces showing exactly what broke
  • User session replay (see what they clicked before crash; requires enabling Sentry's Replay integration, which the setup above does not include)
  • Performance monitoring (slow API calls flagged automatically)

5. Load Testing with k6

Why k6:

  • Industry standard (Grafana Labs maintains it)
  • Shows you EXACTLY how many concurrent users your app can handle
  • Simple JavaScript syntax
  • Free and open source

5.1 Install k6

Windows (using Chocolatey):

choco install k6

Or download directly:

Verify:

k6 version

5.2 Create Load Test Script

File: tests/load_test.js

import http from 'k6/http';
import { check, sleep } from 'k6';

// Test configuration
export const options = {
  stages: [
    { duration: '30s', target: 10 },  // Ramp up to 10 users over 30s
    { duration: '1m', target: 10 },   // Stay at 10 users for 1 minute
    { duration: '30s', target: 20 },  // Ramp up to 20 users
    { duration: '1m', target: 20 },   // Stay at 20 users for 1 minute
    { duration: '30s', target: 0 },   // Ramp down to 0
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests must complete in 500ms
    http_req_failed: ['rate<0.01'],   // Less than 1% of requests can fail
  },
};

const BASE_URL = 'https://api.resolutionflow.com';

// Setup: runs once before the test starts; the returned data is passed to every virtual user
export function setup() {
  const loginRes = http.post(`${BASE_URL}/api/auth/login`, 
    JSON.stringify({
      username: 'test_user',  // Replace with test account
      password: 'test_password',
    }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  
  return { token: loginRes.json('access_token') };
}

// Main test: Simulate realistic user behavior
export default function (data) {
  const headers = {
    'Authorization': `Bearer ${data.token}`,
    'Content-Type': 'application/json',
  };

  // Scenario 1: Load dashboard (get trees list)
  let res = http.get(`${BASE_URL}/api/trees`, { headers });
  check(res, {
    'dashboard loaded': (r) => r.status === 200,
    'dashboard fast': (r) => r.timings.duration < 300,
  });
  sleep(1);  // User reads for 1 second

  // Scenario 2: Open a tree
  res = http.get(`${BASE_URL}/api/trees/1`, { headers });  // Replace with real tree ID
  check(res, {
    'tree loaded': (r) => r.status === 200,
    'tree load fast': (r) => r.timings.duration < 500,
  });
  sleep(2);  // User reads tree for 2 seconds

  // Scenario 3: Load tree nodes
  res = http.get(`${BASE_URL}/api/trees/1/nodes`, { headers });
  check(res, {
    'nodes loaded': (r) => r.status === 200,
    'nodes fast': (r) => r.timings.duration < 500,
  });
  sleep(1);

  // Scenario 4: Search trees
  res = http.post(
    `${BASE_URL}/api/trees/search`,
    JSON.stringify({ query: 'password reset' }),
    { headers }
  );
  check(res, {
    'search worked': (r) => r.status === 200,
    'search fast': (r) => r.timings.duration < 400,
  });
  sleep(2);
}

5.3 Run Load Test

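Basic test (uses the stages defined in the script: 10 users, then 20):

k6 run tests/load_test.js

Aggressive test (50 constant users for 2 minutes; these flags override the stages in the script):

k6 run --vus 50 --duration 2m tests/load_test.js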

What the output means:

     ✓ dashboard loaded
     ✓ dashboard fast
     
     checks.........................: 95.23% ✓ 1234  ✗ 78
     data_received..................: 1.2 MB  20 kB/s
     data_sent......................: 456 kB  7.6 kB/s
     http_req_blocked...............: avg=1.2ms   min=0s   med=0s   max=45ms  p(90)=0s   p(95)=0s  
     http_req_duration..............: avg=142ms   min=23ms med=98ms max=1.2s  p(90)=245ms p(95)=387ms
     http_reqs......................: 1234   20.5/s

How to read this:

  • checks: % of check() assertions that passed (want > 95%)
  • http_req_duration p(95): 95% of requests completed faster than this (want < 500ms)
  • http_reqs: Total requests sent, and the average requests per second your app handled
  • http_req_failed: % of requests that errored (want < 1%)

5.4 Interpret Results

GOOD (Ready for beta):

http_req_duration p(95) < 500ms
http_req_failed < 1%
Check pass rate > 95%

⚠️ WARNING (Watch closely during beta):

http_req_duration p(95) 500-1000ms
http_req_failed 1-5%
Some checks failing

BAD (Fix before beta launch):

http_req_duration p(95) > 1000ms
http_req_failed > 5%
Lots of timeouts or 500 errors
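
The three tiers above can be written down as a small function so that the same rules are applied on every run. This is a sketch only: the function name is hypothetical, the inputs are numbers read off the k6 summary by hand, and any combination that is neither clearly GOOD nor clearly BAD is treated as WARNING.

```python
def load_test_verdict(p95_ms, failed_rate, checks_rate):
    """Classify a k6 run using the thresholds listed above.

    p95_ms: http_req_duration p(95) in milliseconds
    failed_rate: http_req_failed as a fraction (0.01 == 1%)
    checks_rate: check pass rate as a fraction (0.95 == 95%)
    """
    if p95_ms > 1000 or failed_rate > 0.05:
        return "BAD"
    if p95_ms < 500 and failed_rate < 0.01 and checks_rate > 0.95:
        return "GOOD"
    return "WARNING"


# p95 and check rate are taken from the sample output in Section 5.3.
# http_req_failed is not shown there, so 0.0 is an assumed value.
print(load_test_verdict(p95_ms=387, failed_rate=0.0, checks_rate=0.9523))
```

Checking the BAD conditions first means a single severe metric is enough to block the launch, even if the other numbers look healthy.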

6. Pre-Launch Checklist

Run this checklist before inviting beta testers:

Database

  • All critical indexes exist (Section 1.1)
  • Query performance < 200ms (Section 1.2)
  • No unexplained table bloat (Section 1.3)

Frontend

  • Large tree (100 nodes) renders without lag (Section 2.1)
  • Bundle size < 1MB (Section 2.2)
  • Lighthouse score > 70 (Section 2.3)

API

  • All endpoints < 500ms under load (Section 3)
  • Railway logs show no errors (Section 4.1)

Monitoring

  • Railway alerts configured (Section 4.1)
  • Sentry installed (optional but recommended) (Section 4.2)

Load Testing

  • k6 test passes with 20 concurrent users (Section 5.3)
  • Request failure rate < 1% during load test (Section 5.4)

7. Monthly Health Check (After Launch)

Once live with beta testers, run this monthly:

Quick version (30 minutes):

# 1. Check Railway metrics
# Look for: CPU/memory trends, error rate spikes

# 2. Review Sentry errors (if installed)
# Look for: New error patterns, increasing error rates

# 3. Run quick load test
k6 run tests/load_test.js

# 4. Check database query times
# Run queries from Section 1.2, watch for slowdowns

When to do deep dive:

  • After adding major new features
  • If users report slowness
  • Before scaling to new MSP clients
  • Every 3 months minimum

8. Common Performance Issues & Fixes

Issue: "Search is slow"

Diagnosis:

EXPLAIN ANALYZE 
SELECT * FROM trees 
WHERE to_tsvector('english', name || ' ' || description) @@ to_tsquery('english', 'password');

Fix: Add GIN index:

CREATE INDEX idx_trees_fts ON trees USING GIN (to_tsvector('english', name || ' ' || description));

Issue: "Loading tree nodes is slow"

Diagnosis: Missing index on foreign key

Fix:

CREATE INDEX idx_tree_nodes_tree_id ON tree_nodes(tree_id);

Issue: "Dashboard takes forever to load"

Diagnosis: Fetching too much data

Fix: Add pagination to API:

# Instead of: SELECT * FROM trees
# Use: SELECT * FROM trees ORDER BY updated_at DESC LIMIT 20 OFFSET 0
# (ORDER BY keeps pages stable; without it, row order is not guaranteed)
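
The page number a client sends has to be turned into LIMIT and OFFSET values somewhere in the API. Below is a minimal sketch of that translation, using an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere. The function names, the MAX_PAGE_SIZE cap, and the simplified table are assumptions for illustration, not the actual ResolutionFlow schema.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # Assumed upper bound so a client cannot request everything at once


def page_params(page, page_size=20):
    """Translate a 1-based page number into (limit, offset)."""
    page = max(1, page)
    page_size = max(1, min(page_size, MAX_PAGE_SIZE))
    return page_size, (page - 1) * page_size


def fetch_trees(conn, page, page_size=20):
    """Return one page of trees, newest first, with id as a stable tie-breaker."""
    limit, offset = page_params(page, page_size)
    return conn.execute(
        "SELECT id, name FROM trees ORDER BY updated_at DESC, id DESC LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


# In-memory SQLite stands in for PostgreSQL; the query has the same shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO trees (id, name, updated_at) VALUES (?, ?, ?)",
    [(i, f"Tree {i}", i) for i in range(1, 46)],  # 45 hypothetical rows
)

print(len(fetch_trees(conn, page=1)))  # 20 rows on the first page
print(len(fetch_trees(conn, page=3)))  # 5 rows on the last page
```

OFFSET still makes the database skip over earlier rows, so very deep pages get slower; for a dashboard that shows the first few pages this is usually acceptable.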

Issue: "Frontend feels sluggish"

Diagnosis: Re-rendering too often

Fix: Add React.memo() to components, use proper dependency arrays in useEffect

Issue: "API crashes under load"

Diagnosis: Not enough Railway resources

Fix:

1. Railway dashboard → API service → Settings
2. Increase memory limit (default is 512MB, try 1GB)
3. Enable auto-scaling if needed

Resources

Tools mentioned:

When to get help:

  • k6 test failing badly (> 10% error rate)
  • Database queries consistently > 1 second
  • Sentry showing critical errors
  • Railway CPU/memory maxing out

Next steps after this checklist:

  • If all checks pass → Launch beta confidently
  • If warnings found → Document them, monitor during beta
  • If critical issues → Fix before launch, re-run tests