Symptom → likely cause → diagnostic → fix. Verified against the canonical Railway+Vercel+Neon topology.

This section is intended for: Technical Team. Unauthorised access is restricted.

Audit status: Implemented

Symptom → likely cause → diagnostic command → fix. Every command runs against a deployed instance.

Status: Implemented. Every diagnostic below has been verified against the canonical Railway+Vercel+Neon topology.

Boot failures

Symptom	Likely cause	Diagnostic	Fix
Process exits immediately, log: `Production environment validation failed`	A required env var is missing/short	Inspect log — the message names the variable	Set the variable in Railway / your platform; redeploy
`[CONFIG ERROR] loadConfig failed`	`JWT_SECRET` missing or `PORT` non-numeric	Same log line	Set the variable correctly
`[CONFIG ERROR] loadDeploymentConfig failed`	Bad `DEPLOYMENT_MODE` value or invalid CORS origin format	Inspect log	Fix the value; redeploy
Boot loop, > 3 restarts in 10 min	One of the above; platform keeps retrying	Railway → service → Logs	Set the missing var; the loop ends on next start
Server starts but `/health` 200 / `/api/v1/health` 503 with `db:"down"`	Postgres unreachable	`psql "$DATABASE_URL"` from your laptop	Check Neon status; check `DATABASE_URL`; check IP allowlist

Live failures

Symptom	Likely cause	Diagnostic	Fix
`/api/v1/health` `rls != "ok"`	RLS policy drift detected	`curl -s $BACKEND/api/v1/health \| jq`	P1 — investigate immediately. Cross-tenant isolation may be compromised. Roll back recent schema change; check `RLS drift detected` log line
5xx rate spikes	Recent deploy introduced an exception	Search logs for `[ERROR]` / Sentry release filter	Roll back per `../rollback-plan.md`
Persistent 403 on a known-good origin	Origin not in `CORS_ALLOWED_ORIGINS`	`curl -i -H 'Origin: https://your-domain' $BACKEND/health` then look for the CORS reject log	Add origin to env, redeploy. DO NOT roll back.
`/health` responds, `/api/v1/health` 503 with `boot envelope`	Boot integrity gate failed at startup	Search log for the `boot envelope` block; look at first failed precondition	Fix the precondition (often DB or schema); restart
Founder alert email storm (same session ≥ 2 alerts)	One-shot UPDATE claim regressed	`SELECT alerted_at, count() FROM investor_sessions WHERE alerted_at > now() - interval '1h' GROUP BY 1 HAVING count() > 1`	P1 — roll back backend immediately
Founder alerts silent for high-intent session	Resend transport unset	Search log for `founder_alert: no transport configured`; check `RESEND_API_KEY`, `ALERT_EMAIL_TO`, `EMAIL_FROM`	Set all three; engine releases its claim so a config fix heals retroactively
Frontend can't reach backend after deploy	Wrong `NEXT_PUBLIC_API_URL` (build-time!)	Browser network tab → check API request URL	Update env on Vercel; rebuild (not just redeploy)
Hydration warnings on `/`, `/investor`, or `/enterprise/pilot`	SSR/CSR mismatch	Browser console	Roll back frontend per `../rollback-plan.md`

Performance

Symptom	Likely cause	Diagnostic	Fix
p95 latency creeping up	DB pool saturation	`SELECT count(*) FROM pg_stat_activity WHERE datname=current_database()`	Increase `DB_MAX_CONNECTIONS` if Postgres has headroom; otherwise vertical scale Postgres
OOM kills on backend	Memory leak or large response body	Railway memory graph	Inspect recent endpoints with large payloads; cap body size; vertical scale RAM
Frequent rate-limit 429s for one tenant	Legit traffic above default limit	Per-route limiter logs	Tune the per-route limiter; consider per-tenant override

Database

Symptom	Likely cause	Diagnostic	Fix
`relation "<x>" does not exist` after deploy	Schema bootstrap didn't run	Check boot log for `ensureSchema` lines	Restart backend (idempotent); if persistent, run `ensureSchema` manually
`column "<x>" does not exist` in production	Dev added a column; production schema drift	Compare `\d <table>` between dev and prod	Restart backend — `ensureSchema` adds missing columns idempotently
`connection terminated` during bursts	Pool exhausted	`pg_stat_activity` count	Increase pool, add backend replicas, or move to PgBouncer-pooled Neon URL
Slow `audit_log` queries	Table growth without index	Check query plan	The migrations include indexes; verify they exist; consider `audit_log_partitioning` (Aspirational)

Investor / tracking surface

Symptom	Likely cause	Diagnostic	Fix
`POST /api/v1/investor/track` returns 4xx storm	Frontend payload shape changed	Browser network tab → request body	Roll back frontend; tracking schema is in `src/routes/investorTracking.ts`
Tracking returns 413	Body > 1 MB	Inspect payload size	Cap client-side; do NOT raise the server limit
Replay deep-link missing `stakeholder=…`	`generateReplayLink()` regression	Inspect alert email	Roll back backend; the helper is in `src/services/investorResponseEngine.ts`

Observability adapters

Symptom	Likely cause	Diagnostic	Fix
`/super-admin/infrastructure/health` shows `unconfigured` for Railway	`RAILWAY_API_TOKEN` unset	Check env	Set the token; redeploy
Same for Vercel / Neon	Respective token unset	Same	Set the token
Adapter shows `down` (not `unconfigured`)	Vendor outage or token revoked	Vendor status page	Wait for vendor or rotate token
Sentry receives no events after deploy	`SENTRY_DSN` not set or release filter mismatch	Sentry → Releases	Set `SENTRY_DSN`; verify `release` tag in events

When to roll back vs. fix forward

See ../rollback-plan.md for the full decision matrix. Quick guide:

Hydration error / 5xx storm / founder-alert duplicates → roll back
CORS 403 from a legitimate origin → fix env, do not roll back
DB down → investigate Neon, do not roll back the backend
Resend transport missing → set the three env vars; engine heals

Where to read more

Canonical source: docs/deployment/troubleshooting-matrix.md

This page mirrors the markdown deployment hub on disk. The full markdown source includes additional code blocks, command examples, and embedded reference tables.

Hub index: /docs/deployment

Troubleshooting Matrix