How to wire each observability surface, and how each one degrades when its credential is unset.
Status: Implemented. Every adapter exists in src/services/infraHealth/* (Phase 1 audit §6); each is timeboxed at 3 s and degrades to unconfigured instead of crashing.
Surfaces, at a glance
| Surface | Code path | Required env | Behaviour when unset |
|---|---|---|---|
| Sentry (errors) | src/observability/sentry.ts | SENTRY_DSN | No-op; structured warn at boot |
| Structured logs (Pino) | src/utils/logger.ts | none | Always on; JSON to stdout |
| Request correlation | src/utils/requestContext.ts, src/middleware/requestId.ts | none | Always on; AsyncLocalStorage propagates request_id / tenant_id / user_id / route |
| Railway adapter | src/services/infraHealth/railwayAdapter.ts | RAILWAY_API_TOKEN, RAILWAY_PROJECT_ID | unconfigured status surfaced in /super-admin/infrastructure/health |
| Vercel adapter | src/services/infraHealth/vercelAdapter.ts | VERCEL_API_TOKEN, VERCEL_TEAM_ID | unconfigured |
| Neon adapter | src/services/infraHealth/neonAdapter.ts | NEON_API_KEY, NEON_PROJECT_ID | Falls back to pg_stat_activity query |
| Health endpoints | src/app.ts, src/routes/health.ts | none | /health always-on; /api/v1/health is boot-envelope-gated |
1. Sentry
Sentry is initialised BEFORE any framework module loads. src/index.ts uses zero static imports and dynamically requires ./observability/sentry first, then ./bootApp. This is intentional — Sentry must wrap process exceptions, not just request errors.
src/observability/sentry.ts's beforeSend strips:
- event.request.{data,cookies}
- Authorization, Cookie, X-API-Key headers
- event.user.{email,username,ip_address}
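The scrubbing rules above can be sketched as a pure function. This is a minimal illustration, not the contents of src/observability/sentry.ts: the `ScrubbableEvent` type is a hypothetical, trimmed-down stand-in for Sentry's event type, and the name `scrubEvent` is an assumption.

```typescript
// Hypothetical, trimmed-down stand-in for the Sentry event shape (assumption).
interface ScrubbableEvent {
  request?: {
    url?: string;
    data?: unknown;
    cookies?: Record<string, string>;
    headers?: Record<string, string>;
  };
  user?: { id?: string; email?: string; username?: string; ip_address?: string };
}

// Header names are compared case-insensitively.
const SENSITIVE_HEADERS = new Set(["authorization", "cookie", "x-api-key"]);

// Returns a scrubbed copy; the input event is left unmodified.
function scrubEvent(event: ScrubbableEvent): ScrubbableEvent {
  const scrubbed: ScrubbableEvent = { ...event };

  if (event.request) {
    // Keep only the headers that are not on the sensitive list.
    const headers: Record<string, string> = {};
    for (const [name, value] of Object.entries(event.request.headers ?? {})) {
      if (!SENSITIVE_HEADERS.has(name.toLowerCase())) headers[name] = value;
    }
    const request = { ...event.request, headers };
    delete request.data;
    delete request.cookies;
    scrubbed.request = request;
  }

  if (event.user) {
    const user = { ...event.user };
    delete user.email;
    delete user.username;
    delete user.ip_address;
    scrubbed.user = user;
  }

  return scrubbed;
}
```

With the Sentry SDK, a function of this kind would be supplied as the `beforeSend` option of `Sentry.init`, which receives each event before it is sent.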
To enable: set SENTRY_DSN and redeploy. To disable: leave SENTRY_DSN unset.
2. Structured logs
Pino emits JSON on stdout. Every line carries the AsyncLocalStorage-propagated correlation fields (request_id, tenant_id, user_id, and route).
Redaction list (src/utils/logger.ts:55-63) covers password, token, secret, authorization, cookie, apiKey at top-level, nested, and metadata.* paths.
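Pino performs redaction itself through its `redact` option, so the snippet below is not how src/utils/logger.ts works. It is only a dependency-free illustration of the effect on a log object: the listed keys are masked at top level, when nested, and under metadata. The function name and the censor string are assumptions.

```typescript
// Keys named in the redaction list above.
const SENSITIVE_KEYS = new Set([
  "password", "token", "secret", "authorization", "cookie", "apiKey",
]);

// Assumption: the censor string configured in the real logger may differ.
const CENSOR = "[Redacted]";

// Recursively copies a value, masking any property whose key is sensitive.
function redactSensitive(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redactSensitive);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key) ? CENSOR : redactSensitive(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

For example, an object with `password` at top level, `user.token` nested, and `metadata.apiKey` comes back with all three values replaced by the censor string while unrelated fields are preserved.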
Shipping logs to your aggregator:
- Railway: logs are captured automatically; no setup. Forward to Datadog / Better Stack / etc. via Railway's log drains.
- Vercel: frontend logs live in Vercel's dashboard; ship via Vercel Log Drains.
- Self-hosted Docker: set logging.driver (see docker.md). For ELK, use gelf or fluentd.
- Kubernetes: standard container-log harvesting (Fluent Bit / Vector → Loki / Elasticsearch).
3. Request correlation
Every HTTP request gets a request_id (header x-request-id, generated if absent). The ID propagates into:
- Every Pino log line for the request
- Every audit_log row written during the request
- The Sentry breadcrumb tags
This makes a single request_id the join key across logs, audit, and error reporting. Operators can search audit_log by request_id directly (src/migrations/036_observability.sql adds the partial index).
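The propagation mechanism can be sketched with Node's built-in AsyncLocalStorage. This is a minimal sketch, not the code in src/utils/requestContext.ts or src/middleware/requestId.ts; the function names and the choice of a UUID for generated IDs are assumptions.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Fields carried for the lifetime of one request.
interface RequestContext {
  request_id: string;
  tenant_id?: string;
  user_id?: string;
  route?: string;
}

const storage = new AsyncLocalStorage<RequestContext>();

// Reuse the incoming x-request-id header when present, otherwise generate one.
function resolveRequestId(headerValue: string | undefined): string {
  return headerValue && headerValue.trim() !== "" ? headerValue : randomUUID();
}

// Runs `handler` inside a context that any code it calls, sync or async, can read.
function runWithRequestContext<T>(
  headerValue: string | undefined,
  route: string,
  handler: () => T,
): T {
  const context: RequestContext = { request_id: resolveRequestId(headerValue), route };
  return storage.run(context, handler);
}

// What a logger, audit writer, or error reporter would call to stamp its output.
function currentRequestId(): string | undefined {
  return storage.getStore()?.request_id;
}
```

Because the store is looked up rather than passed as an argument, the logger and the audit writer can both read the same request_id without every function in between forwarding it.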
4. Infra-health adapters
Aggregator at src/services/infraHealth/index.ts calls each adapter under Promise.allSettled so a slow vendor cannot block the operator surface. Each adapter wraps its HTTP call in a 3 s AbortController-backed timeout.
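The pattern described above, a per-adapter AbortController timeout combined with Promise.allSettled, can be sketched as follows. The `Adapter` and `AdapterResult` types and the function names are hypothetical; the real aggregator in src/services/infraHealth/index.ts may be shaped differently.

```typescript
type AdapterStatus = "ok" | "unconfigured" | "error";

// Hypothetical adapter contract: the signal is meant to be passed to the HTTP call.
interface Adapter {
  name: string;
  check: (signal: AbortSignal) => Promise<AdapterStatus>;
}

interface AdapterResult {
  name: string;
  status: AdapterStatus;
  detail?: string;
}

// Runs one adapter and rejects if it has not settled within timeoutMs.
async function runWithTimeout(adapter: Adapter, timeoutMs: number): Promise<AdapterStatus> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  // Rejects when the controller aborts, even if the adapter ignores the signal.
  const aborted = new Promise<never>((_, reject) => {
    controller.signal.addEventListener(
      "abort",
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      { once: true },
    );
  });
  try {
    return await Promise.race([adapter.check(controller.signal), aborted]);
  } finally {
    clearTimeout(timer);
  }
}

// Runs all adapters concurrently; one slow or failing vendor cannot block the rest.
async function collectHealth(adapters: Adapter[], timeoutMs = 3000): Promise<AdapterResult[]> {
  const settled = await Promise.allSettled(adapters.map((a) => runWithTimeout(a, timeoutMs)));
  return settled.map((outcome, i): AdapterResult =>
    outcome.status === "fulfilled"
      ? { name: adapters[i].name, status: outcome.value }
      : {
          name: adapters[i].name,
          status: "error",
          detail: outcome.reason instanceof Error ? outcome.reason.message : String(outcome.reason),
        },
  );
}
```

Promise.allSettled never rejects, so the caller always receives one result per adapter, and the overall call takes roughly as long as the slowest adapter, capped by the timeout.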
The output is exposed at /api/v1/super-admin/infrastructure/health and rendered in /super-admin/infrastructure/health (mounted as a sibling route BEFORE the broader /super-admin mount so the stricter RBAC gate wins — see ../audits/phase1-readiness-audit.md §1).
Configuring each adapter
# Railway visibility (service status, recent deploy state)
RAILWAY_API_TOKEN=<token from railway.app/account/tokens>
RAILWAY_PROJECT_ID=<project id from railway.app/dashboard>
# Vercel visibility (deployment state)
VERCEL_API_TOKEN=<token from vercel.com/account/tokens>
VERCEL_TEAM_ID=<team id, optional for personal accounts>
# Neon visibility (compute state, branch list, storage)
NEON_API_KEY=<key from neon.tech/app/settings/api-keys>
NEON_PROJECT_ID=<project id from neon.tech/app/projects>
Each token is read-only — none mutates infrastructure.
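One way an adapter could decide between running and reporting unconfigured is to check its required variables up front. The sketch below assumes this approach; `resolveAdapterConfig` and the `AdapterConfig` type are hypothetical names, not taken from the adapters themselves.

```typescript
type AdapterConfig =
  | { configured: true; values: Record<string, string> }
  | { configured: false; missing: string[] };

// Returns the required values, or the names of the variables that are unset or blank.
function resolveAdapterConfig(
  env: Record<string, string | undefined>,
  required: readonly string[],
): AdapterConfig {
  const missing = required.filter((name) => (env[name] ?? "").trim() === "");
  if (missing.length > 0) return { configured: false, missing };

  const values: Record<string, string> = {};
  for (const name of required) values[name] = env[name] as string;
  return { configured: true, values };
}
```

An adapter would call it as `resolveAdapterConfig(process.env, ["RAILWAY_API_TOKEN", "RAILWAY_PROJECT_ID"])` and return the unconfigured status without making any network call when `configured` is false.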
5. Alerts to actually configure
See ../monitoring-guide.md for the full table. The minimum viable alert set:
| Signal | Threshold | Severity | Action / notes |
|---|---|---|---|
| /api/v1/health non-200 | for >2 min | P1 | Page on-call |
| /api/v1/health data.rls != "ok" | any single sample | P1 | RLS drift; cross-tenant isolation may be compromised |
| 5xx rate | > 1% over 5 min | P2 | Inspect Railway logs |
| Founder alert duplicates | any session alerted >1 time | P2 | one-shot UPDATE has regressed |
| Boot loop | > 3 restarts in 10 min | P1 | Inspect logs for Production environment validation failed |
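The three health-related thresholds above can be expressed as small predicates, which may help when translating them into an alerting tool's rule syntax. This is a sketch under assumptions: the sample shapes and function names are hypothetical, and only the data.rls field and the thresholds come from the table.

```typescript
// Hypothetical shape of one probe of /api/v1/health.
interface HealthSample {
  atMs: number; // probe time, epoch milliseconds
  httpStatus: number;
  body?: { data?: { rls?: string } };
}

// RLS drift: fires on any single successful sample whose data.rls is not "ok".
function isRlsDrift(sample: HealthSample): boolean {
  return sample.httpStatus === 200 && sample.body?.data?.rls !== "ok";
}

// 5xx rate: fires when more than 1% of requests in the window returned 5xx.
function is5xxRateBreached(totalRequests: number, serverErrors: number, threshold = 0.01): boolean {
  if (totalRequests === 0) return false;
  return serverErrors / totalRequests > threshold;
}

// Health down: fires when the current run of non-200 samples started more than
// windowMs ago. Samples are assumed to be sorted oldest first.
function isHealthDown(samples: readonly HealthSample[], nowMs: number, windowMs = 2 * 60 * 1000): boolean {
  let streakStart: number | undefined;
  for (const sample of samples) {
    if (sample.httpStatus === 200) streakStart = undefined;
    else if (streakStart === undefined) streakStart = sample.atMs;
  }
  return streakStart !== undefined && nowMs - streakStart > windowMs;
}
```

The RLS check is deliberately stateless because the table asks for an alert on any single sample, while the availability check tracks a streak so that one failed probe does not page on-call.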
6. Degraded-mode summary
When a vendor token is unset, that surface is unconfigured — not down. The /super-admin/infrastructure/health UI shows the per-surface status; operators get an at-a-glance "what's dark" without the system itself crashing.
This is the right trade-off for air-gapped or self-hosted deployments where reaching out to Railway / Vercel APIs is impossible.
Where to read more
- ../monitoring-guide.md: alerts + log greps + DB checks
- production-hardening.md: what's enforced at boot
- troubleshooting-matrix.md: symptom → fix
- In-app: /docs/deployment/observability-setup