Symptom → likely cause → diagnostic command → fix. Every command runs against a deployed instance.
Status: Implemented. Every diagnostic below has been verified against the canonical Railway+Vercel+Neon topology.
Boot failures
| Symptom | Likely cause | Diagnostic | Fix |
|---|---|---|---|
| Process exits immediately; log shows `Production environment validation failed` | A required env var is missing or too short | Inspect the log; the message names the variable | Set the variable in Railway (or your platform); redeploy |
| `[CONFIG ERROR] loadConfig failed` | `JWT_SECRET` missing or `PORT` non-numeric | Same log line | Set the variable correctly |
| `[CONFIG ERROR] loadDeploymentConfig failed` | Bad `DEPLOYMENT_MODE` value or invalid CORS origin format | Inspect log | Fix the value; redeploy |
| Boot loop, > 3 restarts in 10 min | One of the above; platform keeps retrying | Railway → service → Logs | Set the missing var; the loop ends on next start |
| Server starts and `/health` returns 200, but `/api/v1/health` returns 503 with `db:"down"` | Postgres unreachable | `psql "$DATABASE_URL"` from your laptop | Check Neon status, `DATABASE_URL`, and the IP allowlist |
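
When scanning a long boot log, the first three rows can be matched mechanically. The sketch below is illustrative only: the log markers are copied from the table, while `classify_boot_log` and `BOOT_FAILURES` are hypothetical names, not part of the backend.

```python
from typing import Optional

# Log markers copied from the boot-failure table; the mapping itself is an
# illustrative sketch, not the backend's own code.
BOOT_FAILURES = [
    ("Production environment validation failed",
     "A required env var is missing or too short; the message names it."),
    ("[CONFIG ERROR] loadConfig failed",
     "JWT_SECRET is missing or PORT is non-numeric."),
    ("[CONFIG ERROR] loadDeploymentConfig failed",
     "DEPLOYMENT_MODE is invalid or a CORS origin is malformed."),
]


def classify_boot_log(line: str) -> Optional[str]:
    """Return the likely cause for a boot log line, or None if unrecognized."""
    for marker, cause in BOOT_FAILURES:
        if marker in line:
            return cause
    return None
```

Feed it each line of the Railway log; the first non-`None` result usually points at the variable to fix.
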
Live failures
| Symptom | Likely cause | Diagnostic | Fix |
|---|---|---|---|
| `/api/v1/health` reports `rls != "ok"` | RLS policy drift detected | `curl -s $BACKEND/api/v1/health \| jq` | P1: investigate immediately. Cross-tenant isolation may be compromised. Roll back the most recent schema change; check the `RLS drift detected` log line |
| 5xx rate spikes | Recent deploy introduced an exception | Search logs for [ERROR] / Sentry release filter | Roll back per ../rollback-plan.md |
| Persistent 403 on a known-good origin | Origin not in CORS_ALLOWED_ORIGINS | curl -i -H 'Origin: https://your-domain' $BACKEND/health then look for the CORS reject log | Add origin to env, redeploy. DO NOT roll back. |
| `/health` responds, but `/api/v1/health` returns 503 with boot envelope | Boot integrity gate failed at startup | Search the log for the boot envelope block; look at the first failed precondition | Fix the precondition (often DB or schema); restart |
| Founder alert email storm (same session ≥ 2 alerts) | One-shot UPDATE claim regressed | `SELECT alerted_at, count(*) FROM investor_sessions WHERE alerted_at > now() - interval '1h' GROUP BY 1 HAVING count(*) > 1` | P1: roll back backend immediately |
| Founder alerts silent for high-intent session | Resend transport unset | Search log for founder_alert: no transport configured; check RESEND_API_KEY, ALERT_EMAIL_TO, EMAIL_FROM | Set all three; engine releases its claim so a config fix heals retroactively |
| Frontend can't reach backend after deploy | Wrong NEXT_PUBLIC_API_URL (build-time!) | Browser network tab → check API request URL | Update env on Vercel; rebuild (not just redeploy) |
| Hydration warnings on `/`, `/investor`, or `/enterprise/pilot` | SSR/CSR mismatch | Browser console | Roll back frontend per ../rollback-plan.md |
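
The health-endpoint rows can be folded into one triage step. This is a minimal sketch that assumes the `/api/v1/health` JSON body exposes top-level `db` and `rls` keys; the real payload may nest or name them differently, and `triage_health` is a hypothetical helper.

```python
from typing import Any, Dict, List


def triage_health(status: int, body: Dict[str, Any]) -> List[str]:
    """Map an /api/v1/health response to findings from the tables above.

    Assumes top-level "db" and "rls" keys (an assumption, not a documented
    schema).
    """
    findings: List[str] = []
    if body.get("db") == "down":
        findings.append("Postgres unreachable: check Neon status and DATABASE_URL")
    if "rls" in body and body["rls"] != "ok":
        findings.append("P1: RLS policy drift, tenant isolation may be compromised")
    if status == 503 and not findings:
        # Neither db nor rls explains the 503, so suspect the boot gate.
        findings.append("503 without db/rls cause: look for the boot envelope in the log")
    return findings
```

An empty list means none of the health-related rows apply; anything starting with `P1` should be escalated before further digging.
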
Performance
| Symptom | Likely cause | Diagnostic | Fix |
|---|---|---|---|
| p95 latency creeping up | DB pool saturation | SELECT count(*) FROM pg_stat_activity WHERE datname=current_database() | Increase DB_MAX_CONNECTIONS if Postgres has headroom; otherwise vertical scale Postgres |
| OOM kills on backend | Memory leak or large response body | Railway memory graph | Inspect recent endpoints with large payloads; cap body size; vertical scale RAM |
| Frequent rate-limit 429s for one tenant | Legit traffic above default limit | Per-route limiter logs | Tune the per-route limiter; consider per-tenant override |
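
To turn the `pg_stat_activity` count into a decision, compare it with the configured pool ceiling. A rough sketch, assuming a single backend instance whose ceiling is `DB_MAX_CONNECTIONS`; the 0.8 threshold is an arbitrary example and both function names are hypothetical.

```python
def pool_utilization(active: int, max_connections: int) -> float:
    """Fraction of the configured pool in use (can exceed 1.0)."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def pool_is_saturated(active: int, max_connections: int,
                      threshold: float = 0.8) -> bool:
    """True when usage reaches the threshold (0.8 is an arbitrary example)."""
    return pool_utilization(active, max_connections) >= threshold
```

With several backend replicas the count covers all of them, so divide by the combined ceiling instead.
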
Database
| Symptom | Likely cause | Diagnostic | Fix |
|---|---|---|---|
| `relation "<x>" does not exist` after deploy | Schema bootstrap didn't run | Check boot log for `ensureSchema` lines | Restart backend (idempotent); if persistent, run `ensureSchema` manually |
| `column "<x>" does not exist` in production | Dev added a column; production schema has drifted | Compare `\d <table>` between dev and prod | Restart backend; `ensureSchema` adds missing columns idempotently |
| `connection terminated` during bursts | Pool exhausted | `pg_stat_activity` count | Increase pool, add backend replicas, or move to the PgBouncer-pooled Neon URL |
| Slow `audit_log` queries | Table growth without index | Check query plan | The migrations include indexes; verify they exist; consider `audit_log_partitioning` (Aspirational) |
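
Comparing `\d` output by eye gets tedious beyond a couple of tables. The sketch below diffs column sets instead, assuming you have already collected table-to-columns mappings from each environment (for example from Postgres's `information_schema.columns`); `missing_columns` is a hypothetical helper.

```python
from typing import Dict, Set


def missing_columns(dev: Dict[str, Set[str]],
                    prod: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return columns present in dev but absent from prod, keyed by table.

    A table missing from prod entirely reports all of its dev columns.
    """
    drift: Dict[str, Set[str]] = {}
    for table, dev_cols in dev.items():
        gap = dev_cols - prod.get(table, set())
        if gap:
            drift[table] = gap
    return drift
```

An empty result after a backend restart suggests the schema bootstrap has closed the gap.
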
Investor / tracking surface
| Symptom | Likely cause | Diagnostic | Fix |
|---|---|---|---|
| `POST /api/v1/investor/track` returns a storm of 4xx responses | Frontend payload shape changed | Browser network tab → request body | Roll back frontend; tracking schema is in `src/routes/investorTracking.ts` |
| Tracking returns 413 | Body > 1 MB | Inspect payload size | Cap client-side; do NOT raise the server limit |
| Replay deep-link missing `stakeholder=…` | `generateReplayLink()` regression | Inspect alert email | Roll back backend; the helper is in `src/services/investorResponseEngine.ts` |
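
For the 413 row, a captured payload can be measured before deciding how to cap it client-side. A minimal sketch: the runbook only says "1 MB", so the constant below takes the conservative reading of 1,000,000 bytes, and both function names are hypothetical. Sizes are approximate because the frontend's own JSON serializer may escape characters differently.

```python
import json
from typing import Any

# Conservative reading of the 1 MB limit; the server's exact byte count is
# not stated in this runbook.
MAX_TRACKING_BYTES = 1_000_000


def tracking_payload_size(payload: Any) -> int:
    """Size in bytes of the payload once serialized as compact UTF-8 JSON."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def fits_tracking_limit(payload: Any) -> bool:
    """True when the serialized payload stays within the assumed cap."""
    return tracking_payload_size(payload) <= MAX_TRACKING_BYTES
```
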
Observability adapters
| Symptom | Likely cause | Diagnostic | Fix |
|---|---|---|---|
| `/super-admin/infrastructure/health` shows `unconfigured` for Railway | `RAILWAY_API_TOKEN` unset | Check env | Set the token; redeploy |
| Same for Vercel / Neon | Respective token unset | Same | Set the token |
| Adapter shows `down` (not `unconfigured`) | Vendor outage or token revoked | Vendor status page | Wait for vendor or rotate token |
| Sentry receives no events after deploy | SENTRY_DSN not set or release filter mismatch | Sentry → Releases | Set SENTRY_DSN; verify release tag in events |
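
"Check env" can be scripted against an exported environment. The sketch below uses only variable names that appear in this runbook; the Vercel and Neon token names are not given here, so they are left out rather than guessed. `unset_vars` and `RUNBOOK_VARS` are hypothetical names.

```python
from typing import Iterable, List, Mapping

# Names taken from this runbook. The Vercel and Neon token names are not
# given here, so they are left out rather than guessed.
RUNBOOK_VARS = ["RAILWAY_API_TOKEN", "SENTRY_DSN"]


def unset_vars(env: Mapping[str, str], required: Iterable[str]) -> List[str]:
    """Return the required names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `unset_vars(os.environ, RUNBOOK_VARS)` in a shell on the backend service would list whatever still needs to be set.
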
When to roll back vs. fix forward
See ../rollback-plan.md for the full decision matrix. Quick guide:
- Hydration error / 5xx storm / founder-alert duplicates → roll back
- CORS 403 from a legitimate origin → fix env, do not roll back
- DB down → investigate Neon, do not roll back the backend
- Resend transport missing → set the three env vars; engine heals
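
The quick guide can also be kept as a lookup for on-call tooling. This is only a sketch: the symptom keys are hypothetical labels, the actions restate the bullets above, and anything unlisted defers to the full decision matrix.

```python
# Symptom keys are hypothetical labels for this sketch; the actions restate
# the quick guide.
QUICK_GUIDE = {
    "hydration_error": "roll back",
    "5xx_storm": "roll back",
    "founder_alert_duplicates": "roll back",
    "cors_403_legit_origin": "fix forward: fix env",
    "db_down": "fix forward: investigate Neon",
    "resend_transport_missing": "fix forward: set the three env vars",
}


def recommended_action(symptom: str) -> str:
    """Look up the quick-guide action; defer to the full matrix otherwise."""
    return QUICK_GUIDE.get(symptom, "see ../rollback-plan.md")
```
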