
Brain Runbook

This runbook covers operational tasks against brain deployed on Railway with a Neon Postgres backend and Redis (Upstash or self-hosted). For background context, read Brain, Deployment Environment, and Observability.

Topology recap

flowchart LR
  subgraph Public
    sirloin
    fennec
    brisket
  end
  sirloin -- gRPC + REST(api/*) --> brain
  fennec -- REST + Clerk --> brain
  brain -- Prisma --> neon[(Neon Postgres / fennec schema)]
  brain -- BullMQ --> redis[(Redis)]
  brain -- gRPC --> round
  brain -- HTTPS --> providers[FAL / RunPod / Vertex / Kling / Hive]
  vigen[VI Generator] -- webhook --> brain

Deploy

CI

.github/workflows/brain.yml runs lint, typecheck, and unit tests on PRs and pushes to main/release whenever apps/brain/** changes. Steps:

  1. pnpm install --frozen-lockfile (working dir apps/brain).
  2. pnpm prisma generate — required before lint and tsc, since generated types are referenced from source.
  3. pnpm lint — ESLint with --max-warnings 0.
  4. pnpm tsc — typecheck only (no emit).
  5. pnpm test — unit tests.

The workflow does not run integration tests against a real DB and does not deploy. Pinned versions: NODE_VERSION: 22, PNPM_VERSION: 10.26.0.

Production deploy (Railway)

Railway builds from the Dockerfile at apps/brain/Dockerfile whenever apps/brain/** changes (apps/brain/railway.json watchPatterns). The CI workflow runs on main and release (.github/workflows/brain.yml:5); Railway’s branch trigger is configured in the Railway service settings (not in railway.json).

  • Build args: SENTRY_AUTH_TOKEN, SENTRY_RELEASE (Railway derives SENTRY_RELEASE from RAILWAY_GIT_COMMIT_SHA).
  • Stages: prisma generate (cached on schema hash) → pnpm build → multi-stage runtime.
  • Healthcheck path: /health, timeout 120s (apps/brain/railway.json deploy.healthcheckPath/healthcheckTimeout).

Brain starts as a hybrid app via app.startAllMicroservices() then app.listen(PORT, '::') — IPv6 dual-stack is required for Railway’s private network.

Manual production deploy

Use the Railway CLI (preferred) or the dashboard. To build and run the stack locally with Docker only:

make dev-build # builds all images including brain
make dev-up-d # full stack
docker compose logs -f brain

Rollback

  1. Railway dashboard → brain service → Deployments → choose the previous green deploy → Redeploy.
  2. If a Prisma migration shipped with the bad release, see Prisma migration rollback below.
  3. Confirm via /queues (Bull Board) that workers are draining and via Axiom that 5xx rate has returned to baseline.

A rollback that does not include a DB schema change is safe: brain is stateless beyond its connection pools.

Prisma migration discipline

  • Schema lives in apps/brain/prisma/schema.prisma. Migration history in apps/brain/prisma/migrations/.
  • New migrations are produced via the project’s create_migration.sh script (custom Docker workflow — see apps/brain/CLAUDE.md Gotchas).
  • Migrations are applied at deploy with prisma migrate deploy against DIRECT_DATABASE_URL (non-pooled) — see Brain Env.
  • Forward-only: never edit a committed migration. To revert a schema change, ship a new migration that compensates.
  • Two-phase rollouts for breaking changes: ship the additive migration first, deploy the code that reads the new columns, then remove the old columns in a follow-up PR.

Prisma migration rollback

  1. Ship a compensating migration. Do not run prisma migrate resolve --rolled-back in production unless the change is coordinated.
  2. If the migration partially applied: connect via psql using DIRECT_DATABASE_URL, inspect _prisma_migrations, and align state before rerunning.
  3. Snapshot Neon (preview branches retain history — see Deployment Environment) before destructive interventions.
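
When inspecting _prisma_migrations in step 2, the rows that matter are those that started but were neither finished nor marked as rolled back. The following is a minimal sketch of that classification, assuming the rows have already been fetched (for example via psql) and mapped to the shape below. The helper names and the row type are illustrative and are not part of brain.

```typescript
// Subset of the columns Prisma keeps in the _prisma_migrations table.
interface MigrationRow {
  migration_name: string;
  started_at: Date;
  finished_at: Date | null;
  rolled_back_at: Date | null;
  applied_steps_count: number;
}

type MigrationState = "applied" | "rolled_back" | "incomplete";

// Hypothetical helper: classify one row so incomplete migrations stand out.
function classifyMigration(row: MigrationRow): MigrationState {
  if (row.rolled_back_at !== null) return "rolled_back";
  if (row.finished_at !== null) return "applied";
  return "incomplete"; // started but never finished: align state manually
}

// Names of migrations that need attention before rerunning a deploy.
function findIncomplete(rows: MigrationRow[]): string[] {
  return rows
    .filter((row) => classifyMigration(row) === "incomplete")
    .map((row) => row.migration_name);
}
```

An empty result from findIncomplete suggests the migration table is consistent and the problem lies elsewhere.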

Queue stuck — failed or stalled jobs

Symptoms: mediaflows backlog grows in /queues; users see generations not progressing; BullMQ stalled events spike.

  1. Open Bull Board at /queues (Clerk-protected). Filter to the affected queue.
  2. Check Failed tab: cluster errors by message. Common causes:
    • Provider 4xx/5xx (FAL/RunPod/Kling) → check provider status and key validity.
    • ContentModerationException — expected; counts toward “failed” but does not page.
    • RoundServiceUnavailableException — round outage; see Round outage below.
  3. Check Stalled tab: jobs whose worker died. Restart brain replicas if stalled count grows continuously (onStalled is logged in media-flows.processor.ts).
  4. Retry: select jobs and click Retry. For mass retry, prefer a one-off script that re-enqueues with the same payload + new id rather than retrying directly (avoids duplicate event log records).
  5. If Redis is the cause: verify REDIS_HOST/REDIS_PORT/REDIS_PASSWORD; check Railway Redis service status.
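
The mass-retry approach in step 4 (same payload, new id) can be sketched as a pure planning function. This is an illustration under stated assumptions, not the project's script: the FailedJob shape, the planReenqueue name, and the runTag suffix are hypothetical. Only the (name, data, opts) triple is modelled on the arguments of BullMQ's Queue.add.

```typescript
// Minimal view of a failed job as read from the queue.
interface FailedJob<T> {
  id: string;
  name: string;
  data: T;
}

// One planned re-enqueue, shaped like the arguments to Queue.add.
interface EnqueueRequest<T> {
  name: string;
  data: T;
  opts: { jobId: string };
}

// Build re-enqueue requests that keep the payload and assign a fresh id.
// runTag is a hypothetical suffix (for example an incident id) that keeps
// ids from different retry runs distinct.
function planReenqueue<T>(
  failed: FailedJob<T>[],
  runTag: string,
): EnqueueRequest<T>[] {
  return failed.map((job) => ({
    name: job.name,
    data: job.data,
    opts: { jobId: `${job.id}-retry-${runTag}` },
  }));
}
```

Each planned request would then be passed to the queue as queue.add(request.name, request.data, request.opts). Keeping the planning step separate makes it easy to review the ids before anything is enqueued.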

Job retention is 24h for completed and 7 days for failed (apps/brain/src/config/queue.config.ts).
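
For orientation, the retention above could be expressed as follows. This is a sketch that assumes retention is configured through BullMQ's removeOnComplete and removeOnFail options with an age in seconds. The actual contents of queue.config.ts may differ, and the constant and variable names here are illustrative.

```typescript
// Local type mirroring the `age` field (in seconds) that BullMQ accepts
// for removeOnComplete and removeOnFail.
interface KeepJobsByAge {
  age: number;
}

const HOUR_SECONDS = 60 * 60;
const DAY_SECONDS = 24 * HOUR_SECONDS;

// Illustrative values matching the retention stated in this runbook.
// The real settings live in apps/brain/src/config/queue.config.ts.
const retentionSketch: {
  removeOnComplete: KeepJobsByAge;
  removeOnFail: KeepJobsByAge;
} = {
  removeOnComplete: { age: DAY_SECONDS }, // completed jobs kept for 24 hours
  removeOnFail: { age: 7 * DAY_SECONDS }, // failed jobs kept for 7 days
};
```

The practical consequence is that failed jobs older than 7 days are no longer available for retry from Bull Board.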

Clerk session issues

Symptoms: 401 spike; users reporting forced logouts.

  1. Check Clerk status page.
  2. Confirm CLERK_SECRET_KEY matches the Clerk instance for the deployed stage.
  3. If only OAuth (oat_) tokens are failing, Clerk Frontend API exchange is broken — see apps/brain/src/modules/application/auth/strategies/clerk.strategy.ts:40-103. Verify CLERK_FRONTEND_API_URL is reachable.
  4. Roll forward, not back: do not roll Clerk env without rotating clients.
  5. Cross-link: Brain Clerk Flow, Auth Model.
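
To decide whether step 3 applies, it helps to check whether every failing request carries an OAuth token. A minimal sketch, assuming a sample of failing token prefixes has been collected from logs (pass redacted prefixes only, never full tokens); the function names are hypothetical.

```typescript
type TokenKind = "oauth" | "other";

// This runbook identifies OAuth tokens by the oat_ prefix.
function tokenKind(token: string): TokenKind {
  return token.startsWith("oat_") ? "oauth" : "other";
}

// True when there is at least one sample and every failing sample is an
// OAuth token, which points at the Clerk Frontend API exchange.
function onlyOAuthFailing(failingTokens: string[]): boolean {
  return (
    failingTokens.length > 0 &&
    failingTokens.every((token) => tokenKind(token) === "oauth")
  );
}
```

If onlyOAuthFailing returns false, session tokens are failing too and steps 1 and 2 are the more likely cause.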

Round outage

Symptoms: surge in RoundServiceUnavailableException; circuit breaker open.

  1. Check round Railway service status and recent deploys.
  2. Confirm gRPC reachability: from a brain shell, attempt a connection on ROUND_HOST (default round:8080).
  3. The circuit breaker in apps/brain/src/modules/application/round/interceptors/circuit-breaker.interceptor.ts:37-43 opens at errorThresholdPercentage: 50 over a 10s rolling window once volumeThreshold: 10 requests have been seen, and probes recovery via half-open after resetTimeout: 15000ms. Per-call timeout is 35000ms.
  4. Generation queue jobs that depend on round will fail and be retained 7 days; once round recovers, retry from Bull Board.
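
The opening rule described in step 3 can be sketched as a small pure function. This is an illustration of the thresholds quoted above, not the interceptor's code: the rollingWindowMs name is hypothetical, and whether the comparison at exactly 50% is inclusive depends on the breaker library in use (this sketch assumes it is).

```typescript
// Values quoted in step 3 of this runbook.
const breakerOptions = {
  errorThresholdPercentage: 50,
  rollingWindowMs: 10000, // hypothetical name for the 10s rolling window
  volumeThreshold: 10,
  resetTimeout: 15000, // ms before a half-open probe is attempted
  timeout: 35000, // per-call timeout in ms
};

// Request counts observed within the current rolling window.
interface WindowStats {
  successes: number;
  failures: number;
}

// Decide whether the breaker would open for the given window.
function shouldOpen(stats: WindowStats, opts = breakerOptions): boolean {
  const total = stats.successes + stats.failures;
  // Below the volume threshold the error rate is not evaluated at all.
  if (total < opts.volumeThreshold) return false;
  const errorPct = (stats.failures / total) * 100;
  return errorPct >= opts.errorThresholdPercentage; // inclusive by assumption
}
```

Under these values, a quiet period with fewer than 10 calls in 10 seconds never opens the breaker, even if every call fails.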

Provider key compromise / rotation

  1. Generate new key in the provider console.
  2. Update Railway env (BRAIN_FAL_AI_KEY, BRAIN_RUNPOD_TOKEN, etc.).
  3. Trigger a redeploy (env changes alone do not always restart the service).
  4. Revoke old key only after the new deploy is live.

Database connection saturation

Symptoms: P1001/P1017 Prisma errors; latency spike on every endpoint that touches Prisma.

  1. Check Neon dashboard for connection count and CPU.
  2. Brain uses pgbouncer (pooled) via DATABASE_URL; DIRECT_DATABASE_URL is for migrations only.
  3. Reduce queue concurrency temporarily — MediaFlowsProcessor runs at concurrency: 500. Drop replicas or scale Redis/Postgres before raising further.

Emergency disable of an endpoint

There is no feature-flag layer at the brain controller level today. Options:

  • Marking the controller method @Public() is not a way to disable it: @Public() bypasses auth rather than blocking requests.
  • Ship a hotfix that returns 503 Service Unavailable from the affected controller, or temporarily remove the controller from its module.
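
The 503 hotfix can be as small as a lookup against a hard-coded list. The following is a framework-free sketch of the idea; the route key, the set, and the function name are hypothetical and do not come from brain.

```typescript
// Hypothetical hotfix list of disabled routes, keyed as "METHOD /path".
const disabledRoutes = new Set<string>(["POST /api/example"]);

interface HotfixResponse {
  status: number;
  body: { message: string };
}

// Returns a 503 response for a disabled route, or null to let the
// request proceed to the normal handler.
function checkDisabled(method: string, path: string): HotfixResponse | null {
  const key = `${method.toUpperCase()} ${path}`;
  if (!disabledRoutes.has(key)) return null;
  return { status: 503, body: { message: "Service Unavailable" } };
}
```

In a NestJS controller, the equivalent is usually to throw ServiceUnavailableException from @nestjs/common at the top of the affected method.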

TODO

  • Healthcheck path is /health (apps/brain/railway.json deploy.healthcheckPath).
  • Brain runs a single replica in production: numReplicas: 1 globally and in the us-east4-eqdc4a region (apps/brain/railway.json deploy.numReplicas and deploy.multiRegionConfig).
  • TODO(@pawel): Document expected SLOs for mediaflows queue depth and 5xx rate; pull from observability dashboards once defined.