Brain Runbook

This runbook covers operational tasks against brain deployed on Railway with a Neon Postgres backend and Redis (Upstash or self-hosted). For background context, read Brain, Deployment Environment, and Observability.

Topology recap

flowchart LR
  subgraph Public
    sirloin
    fennec
    brisket
  end
  sirloin -- gRPC + REST(api/*) --> brain
  fennec -- REST + Clerk --> brain
  brain -- Prisma --> neon[(Neon Postgres / fennec schema)]
  brain -- BullMQ --> redis[(Redis)]
  brain -- gRPC --> round
  brain -- HTTPS --> providers[FAL / RunPod / Vertex / Kling / Hive]
  vigen[VI Generator] -- webhook --> brain

Deploy

CI

.github/workflows/brain.yml runs lint, typecheck, and unit tests on PRs and pushes to main/release whenever apps/brain/** changes. Steps:

pnpm install --frozen-lockfile (working dir apps/brain).
pnpm prisma generate — required before lint and tsc, since generated types are referenced from source.
pnpm lint — ESLint with --max-warnings 0.
pnpm tsc — typecheck only (no emit).
pnpm test — unit tests.

The workflow does not run integration tests against a real DB and does not deploy. Pinned versions: NODE_VERSION: 22, PNPM_VERSION: 10.26.0.

Production deploy (Railway)

Railway builds from the Dockerfile at apps/brain/Dockerfile whenever apps/brain/** changes (apps/brain/railway.json watchPatterns). The CI workflow runs on main and release (.github/workflows/brain.yml:5); Railway’s branch trigger is configured in the Railway service settings (not in railway.json).

Build args: SENTRY_AUTH_TOKEN, SENTRY_RELEASE (Railway derives SENTRY_RELEASE from RAILWAY_GIT_COMMIT_SHA).
Stages: prisma generate (cached on schema hash) → pnpm build → multi-stage runtime.
Healthcheck path: /health, timeout 120s (apps/brain/railway.json deploy.healthcheckPath/healthcheckTimeout).

Brain starts as a hybrid app via app.startAllMicroservices() then app.listen(PORT, '::') — IPv6 dual-stack is required for Railway’s private network.

Manual production deploy

Use the Railway CLI (preferred) or the dashboard. Local Docker-only path (builds and runs the stack locally):

make dev-build     # builds all images including brain
make dev-up-d      # full stack
docker compose logs -f brain

Rollback

Railway dashboard → brain service → Deployments → choose the previous green deploy → Redeploy.
If a Prisma migration shipped with the bad release, see Prisma migration rollback below.
Confirm via /queues (Bull Board) that workers are draining and via Axiom that 5xx rate has returned to baseline.

A rollback that does not include a DB schema change is safe: brain is stateless beyond its connection pools.

Prisma migration discipline

Schema lives in apps/brain/prisma/schema.prisma. Migration history in apps/brain/prisma/migrations/.
New migrations are produced via the project’s create_migration.sh script (custom Docker workflow — see apps/brain/CLAUDE.md Gotchas).
Migrations are applied at deploy with prisma migrate deploy against DIRECT_DATABASE_URL (non-pooled) — see Brain Env.
Forward-only: never edit a committed migration. To revert a schema change, ship a new migration that compensates.
Two-phase rollouts for breaking changes: ship the additive migration first, deploy the code that reads the new columns, then remove the old columns in a follow-up PR.

Prisma migration rollback

Ship a compensating migration. Do not run prisma migrate resolve --rolled-back in production unless the intervention has been coordinated.
If the migration partially applied: connect via psql using DIRECT_DATABASE_URL, inspect _prisma_migrations, and align state before rerunning.
Snapshot Neon (preview branches retain history — see Deployment Environment) before destructive interventions.
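
When inspecting _prisma_migrations, the rows that need attention are those that started but never finished. The sketch below classifies rows already fetched via psql or a query tool. It assumes the standard bookkeeping columns (started_at, finished_at, rolled_back_at); the helper names classifyMigration and findIncomplete are hypothetical and not part of brain.

```typescript
// Subset of the columns in Prisma's _prisma_migrations table.
interface MigrationRow {
  migration_name: string;
  started_at: Date;
  finished_at: Date | null;
  rolled_back_at: Date | null;
}

type MigrationState = "applied" | "rolled_back" | "incomplete";

// Hypothetical helper: decide what state a single row represents.
function classifyMigration(row: MigrationRow): MigrationState {
  if (row.rolled_back_at !== null) return "rolled_back";
  if (row.finished_at !== null) return "applied";
  // Started but neither finished nor marked rolled back:
  // the migration failed or was interrupted part-way.
  return "incomplete";
}

// Names of migrations whose state must be aligned before rerunning.
function findIncomplete(rows: MigrationRow[]): string[] {
  return rows
    .filter((row) => classifyMigration(row) === "incomplete")
    .map((row) => row.migration_name);
}
```

Any name returned by findIncomplete marks a migration to reconcile by hand before the next prisma migrate deploy.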

Queue stuck — failed or stalled jobs

Symptoms: mediaflows backlog grows in /queues; users see generations not progressing; BullMQ stalled events spike.

Open Bull Board at /queues (Clerk-protected). Filter to the affected queue.
Check Failed tab: cluster errors by message. Common causes:
- Provider 4xx/5xx (FAL/RunPod/Kling) → check provider status and key validity.
- ContentModerationException — expected; counts toward “failed” but does not page.
- RoundServiceUnavailableException — round outage; see Round outage below.
Check Stalled tab: jobs whose worker died. Restart brain replicas if stalled count grows continuously (onStalled is logged in media-flows.processor.ts).
Retry: select jobs and click Retry. For mass retry, prefer a one-off script that re-enqueues with the same payload + new id rather than retrying directly (avoids duplicate event log records).
If Redis is the cause: verify REDIS_HOST/REDIS_PORT/REDIS_PASSWORD; check Railway Redis service status.
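
The mass-retry advice above (same payload, new id) can be sketched as a pure function that turns failed jobs into re-enqueue requests. This is a minimal sketch, not brain's actual script. FailedJob, ReenqueueRequest, buildReenqueueRequests, and the id format are hypothetical; the output shape is assumed to match the { name, data, opts } entries that BullMQ's Queue.addBulk accepts.

```typescript
// Hypothetical minimal view of a failed job.
interface FailedJob<T> {
  id: string;
  name: string;
  data: T;
}

// Assumed target shape: entries for BullMQ's Queue.addBulk,
// with opts.jobId carrying the new id.
interface ReenqueueRequest<T> {
  name: string;
  data: T;
  opts: { jobId: string };
}

// Build one new job per failed job: identical payload, fresh id.
// batchTag labels the retry run so new ids stay traceable to the originals.
function buildReenqueueRequests<T>(
  failed: FailedJob<T>[],
  batchTag: string,
): ReenqueueRequest<T>[] {
  return failed.map((job) => ({
    name: job.name,
    data: job.data, // same payload, untouched
    opts: { jobId: `${job.id}-retry-${batchTag}` }, // new id
  }));
}
```

Deriving the new id from the old one keeps a retried job easy to correlate with its failed predecessor. The separator is a hyphen because BullMQ rejects custom ids containing a colon.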

Job retention is 24h for completed and 7 days for failed (apps/brain/src/config/queue.config.ts).
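
For reference, this is roughly how such retention is expressed in BullMQ job options, where age is given in seconds. The real contents of queue.config.ts are not reproduced here; the constant names below are assumptions.

```typescript
const SECONDS_PER_HOUR = 60 * 60;
const SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR;

// Assumed shape: BullMQ job options with age-based cleanup.
const defaultJobOptions = {
  removeOnComplete: { age: 24 * SECONDS_PER_HOUR }, // completed jobs kept 24h
  removeOnFail: { age: 7 * SECONDS_PER_DAY }, // failed jobs kept 7 days
};
```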

Clerk session issues

Symptoms: 401 spike; users reporting forced logouts.

Check Clerk status page.
Confirm CLERK_SECRET_KEY matches the Clerk instance for the deployed stage.
If only OAuth (oat_) tokens are failing, Clerk Frontend API exchange is broken — see apps/brain/src/modules/application/auth/strategies/clerk.strategy.ts:40-103. Verify CLERK_FRONTEND_API_URL is reachable.
Roll forward, not back: do not revert Clerk env values without also rotating clients.
Cross-link: Brain Clerk Flow, Auth Model.

Round outage

Symptoms: surge in RoundServiceUnavailableException; circuit breaker open.

Check round Railway service status and recent deploys.
Confirm gRPC reachability: from a brain shell, attempt a connection to ROUND_HOST (default round:8080).
The circuit breaker in apps/brain/src/modules/application/round/interceptors/circuit-breaker.interceptor.ts:37-43 opens at errorThresholdPercentage: 50 over a 10s rolling window once volumeThreshold: 10 requests have been seen, and probes recovery via half-open after resetTimeout: 15000ms. Per-call timeout is 35000ms.
Generation queue jobs that depend on round will fail and be retained 7 days; once round recovers, retry from Bull Board.
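
To reason about when the breaker trips, the thresholds above can be modelled as two small functions. This is a simplified sketch of the decision logic, not the interceptor's code. The type and function names are hypothetical, and treating the 50% threshold as inclusive is an assumption (some breaker libraries use a strict comparison).

```typescript
interface BreakerOptions {
  errorThresholdPercentage: number;
  volumeThreshold: number;
  resetTimeoutMs: number;
}

// Counts observed inside the current rolling window (10s in brain).
interface WindowStats {
  successes: number;
  failures: number;
}

// Values quoted in this runbook.
const roundBreaker: BreakerOptions = {
  errorThresholdPercentage: 50,
  volumeThreshold: 10,
  resetTimeoutMs: 15000,
};

// The breaker opens only after enough traffic has been seen
// and the error rate reaches the threshold.
function shouldOpen(stats: WindowStats, opts: BreakerOptions): boolean {
  const total = stats.successes + stats.failures;
  if (total < opts.volumeThreshold) return false;
  const errorPct = (stats.failures / total) * 100;
  return errorPct >= opts.errorThresholdPercentage; // inclusive: assumption
}

// Once open, a half-open probe is allowed after resetTimeoutMs.
function canProbe(openedAtMs: number, nowMs: number, opts: BreakerOptions): boolean {
  return nowMs - openedAtMs >= opts.resetTimeoutMs;
}
```

Under this model, 5 failures out of 9 calls does not open the breaker (below the volume threshold), while 7 failures out of 10 does.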

Provider key compromise / rotation

Generate new key in the provider console.
Update Railway env (BRAIN_FAL_AI_KEY, BRAIN_RUNPOD_TOKEN, etc.).
Trigger a redeploy (env changes alone do not always restart the service).
Revoke old key only after the new deploy is live.

Database connection saturation

Symptoms: P1001/P1017 Prisma errors; latency spike on every endpoint that touches Prisma.

Check Neon dashboard for connection count and CPU.
Brain uses pgbouncer (pooled) via DATABASE_URL; DIRECT_DATABASE_URL is for migrations only.
Reduce queue concurrency temporarily: MediaFlowsProcessor runs at concurrency: 500. Drop replicas or scale Redis/Postgres before raising concurrency further.
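
A rough way to choose a temporary concurrency value is to compare the worst-case number of in-flight jobs with a connection budget. The sketch below is only back-of-the-envelope arithmetic. Both function names are hypothetical, and the budget of 100 in the usage note is an illustrative figure, not a measured Neon or pgbouncer limit. Jobs do not map one-to-one onto connections, so treat the result as an upper bound on contention.

```typescript
// Worst case: every worker slot on every replica holds a job at once.
function maxInFlightJobs(replicas: number, concurrencyPerWorker: number): number {
  return replicas * concurrencyPerWorker;
}

// Largest per-worker concurrency that keeps in-flight jobs within a budget.
function concurrencyForBudget(replicas: number, budget: number): number {
  return Math.max(1, Math.floor(budget / replicas));
}
```

With one replica at concurrency 500, maxInFlightJobs(1, 500) is 500; for an assumed budget of 100, concurrencyForBudget(1, 100) suggests 100.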

Emergency disable of an endpoint

There is no feature-flag layer at the brain controller level today. Options:

Marking the controller method @Public() is not suitable for disabling: it bypasses auth rather than blocking access.
Ship a hotfix that returns 503 Service Unavailable from the affected controller, or temporarily remove the controller from its module.
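
The 503 hotfix can be as small as a hard-coded kill list checked at the top of the handler. The sketch below is framework-agnostic and entirely hypothetical: DISABLED_ROUTES, rejectIfDisabled, and the example route are illustrative names. In a NestJS controller the equivalent would typically be throwing ServiceUnavailableException from @nestjs/common, which is an assumption about how the hotfix would be written.

```typescript
interface HttpResult {
  status: number;
  body: { message: string };
}

// Hypothetical kill list shipped with the hotfix.
const DISABLED_ROUTES: ReadonlySet<string> = new Set(["POST /api/example"]);

// Returns a 503 result for disabled routes, or null to let the
// request continue to the normal handler.
function rejectIfDisabled(method: string, path: string): HttpResult | null {
  const key = `${method.toUpperCase()} ${path}`;
  if (!DISABLED_ROUTES.has(key)) return null;
  return { status: 503, body: { message: "Service Unavailable" } };
}
```

Returning 503 rather than removing the controller keeps the route registered, so clients see an explicit temporary failure instead of a 404.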

TODO

Healthcheck path is /health (apps/brain/railway.json deploy.healthcheckPath).
Brain runs a single replica in production: numReplicas: 1 globally and in the us-east4-eqdc4a region (apps/brain/railway.json deploy.numReplicas and deploy.multiRegionConfig).
TODO(@pawel): Document expected SLOs for mediaflows queue depth and 5xx rate; pull from observability dashboards once defined.