Brain Runbook
This runbook covers operational tasks against brain deployed on Railway with a Neon Postgres backend and Redis (Upstash or self-hosted). For background context, read Brain, Deployment Environment, and Observability.
Topology recap
    flowchart LR
      subgraph Public
        sirloin
        fennec
        brisket
      end
      sirloin -- gRPC + REST(api/*) --> brain
      fennec -- REST + Clerk --> brain
      brain -- Prisma --> neon[(Neon Postgres / fennec schema)]
      brain -- BullMQ --> redis[(Redis)]
      brain -- gRPC --> round
      brain -- HTTPS --> providers[FAL / RunPod / Vertex / Kling / Hive]
      vigen[VI Generator] -- webhook --> brain
Deploy
CI
.github/workflows/brain.yml runs lint, typecheck, and unit tests on PRs and pushes to main/release whenever apps/brain/** changes. Steps:
- pnpm install --frozen-lockfile (working dir apps/brain).
- pnpm prisma generate: required before lint and tsc, since generated types are referenced from source.
- pnpm lint: ESLint with --max-warnings 0.
- pnpm tsc: typecheck only (no emit).
- pnpm test: unit tests.
The workflow does not run integration tests against a real DB and does not deploy. Pinned versions: NODE_VERSION: 22, PNPM_VERSION: 10.26.0.
Production deploy (Railway)
Railway builds from the Dockerfile at apps/brain/Dockerfile whenever apps/brain/** changes (apps/brain/railway.json watchPatterns). The CI workflow runs on main and release (.github/workflows/brain.yml:5); Railway’s branch trigger is configured in the Railway service settings (not in railway.json).
- Build args: SENTRY_AUTH_TOKEN, SENTRY_RELEASE (Railway derives SENTRY_RELEASE from RAILWAY_GIT_COMMIT_SHA).
- Stages: prisma generate (cached on schema hash) → pnpm build → multi-stage runtime.
- Healthcheck path: /health, timeout 120s (apps/brain/railway.json deploy.healthcheckPath / healthcheckTimeout).
Brain starts as a hybrid app via app.startAllMicroservices() then app.listen(PORT, '::') — IPv6 dual-stack is required for Railway’s private network.
Manual production deploy
Use Railway CLI (preferred) or dashboard. Local docker-only path:
    make dev-build   # builds all images including brain
    make dev-up-d    # full stack
    docker compose logs -f brain
Rollback
- Railway dashboard → brain service → Deployments → choose the previous green deploy → Redeploy.
- If a Prisma migration shipped with the bad release, see Prisma migration rollback below.
- Confirm via /queues (Bull Board) that workers are draining and via Axiom that the 5xx rate has returned to baseline.
A rollback that does not include a DB schema change is safe: brain is stateless beyond its connection pools.
Prisma migration discipline
- Schema lives in apps/brain/prisma/schema.prisma. Migration history is in apps/brain/prisma/migrations/.
- New migrations are produced via the project’s create_migration.sh script (custom Docker workflow; see apps/brain/CLAUDE.md Gotchas).
- Migrations are applied at deploy with prisma migrate deploy against DIRECT_DATABASE_URL (non-pooled); see Brain Env.
- Forward-only: never edit a committed migration. To revert a schema change, ship a new migration that compensates.
- Two-phase rollouts for breaking changes: ship the additive migration first, deploy the code that reads the new schema, then remove old columns in a follow-up PR.
Prisma migration rollback
- Ship a compensating migration. Do not run prisma migrate resolve --rolled-back in production unless coordinated.
- If the migration partially applied: connect via psql using DIRECT_DATABASE_URL, inspect _prisma_migrations, and align state before rerunning.
- Snapshot Neon (preview branches retain history; see Deployment Environment) before destructive interventions.
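When inspecting _prisma_migrations after a partial apply, it can help to classify each row before deciding what to align. The sketch below is an illustration only: it assumes the usual column semantics of that table (a row with neither finished_at nor rolled_back_at set was started but never completed), and the helper names and sample migration names are hypothetical, not taken from the brain repository.

```typescript
// Subset of the _prisma_migrations columns that matter for triage.
interface MigrationRow {
  migration_name: string;
  started_at: Date;
  finished_at: Date | null;
  rolled_back_at: Date | null;
  applied_steps_count: number;
}

type MigrationState = "applied" | "rolled_back" | "incomplete";

// Hypothetical helper: classify one row by its timestamps.
function classifyMigration(row: MigrationRow): MigrationState {
  if (row.rolled_back_at !== null) return "rolled_back";
  if (row.finished_at !== null) return "applied";
  return "incomplete"; // started but never finished: needs manual alignment
}

// Names of migrations that need attention before rerunning a deploy.
function findIncomplete(rows: MigrationRow[]): string[] {
  return rows
    .filter((row) => classifyMigration(row) === "incomplete")
    .map((row) => row.migration_name);
}

// Hypothetical sample data, shaped like the result of a SELECT on the table.
const sampleMigrationRows: MigrationRow[] = [
  {
    migration_name: "20240101000000_init",
    started_at: new Date("2024-01-01T00:00:00Z"),
    finished_at: new Date("2024-01-01T00:00:05Z"),
    rolled_back_at: null,
    applied_steps_count: 1,
  },
  {
    migration_name: "20240215120000_add_column",
    started_at: new Date("2024-02-15T12:00:00Z"),
    finished_at: null,
    rolled_back_at: null,
    applied_steps_count: 0,
  },
];

console.log(findIncomplete(sampleMigrationRows));
```

Any row reported as incomplete is a candidate for the manual alignment step; rows marked rolled_back should already be ignored by later deploys.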
Queue stuck — failed or stalled jobs
Symptoms: mediaflows backlog grows in /queues; users see generations not progressing; BullMQ stalled events spike.
- Open Bull Board at /queues (Clerk-protected). Filter to the affected queue.
- Check Failed tab: cluster errors by message. Common causes:
  - Provider 4xx/5xx (FAL/RunPod/Kling) → check provider status and key validity.
  - ContentModerationException: expected; counts toward “failed” but does not page.
  - RoundServiceUnavailableException: round outage; see Round outage below.
- Check Stalled tab: jobs whose worker died. Restart brain replicas if stalled count grows continuously (onStalled is logged in media-flows.processor.ts).
- Retry: select jobs and click Retry. For mass retry, prefer a one-off script that re-enqueues with the same payload + new id rather than retrying directly (avoids duplicate event log records).
- If Redis is the cause: verify REDIS_HOST / REDIS_PORT / REDIS_PASSWORD; check Railway Redis service status.
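The clustering and mass-retry steps above can be sketched as two pure functions. This is a minimal illustration, not the project's script: the FailedJob shape only mirrors the id, name, data, and failedReason fields that BullMQ jobs expose, and the payload type, the sample messages, and the "-retry" id suffix are assumptions made for the example.

```typescript
// Minimal job shape; field names mirror BullMQ's Job (id, name, data, failedReason).
interface FailedJob<T> {
  id: string;
  name: string;
  data: T;
  failedReason: string;
}

// One re-enqueue request; maps onto queue.add(name, data, { jobId }).
interface EnqueueRequest<T> {
  name: string;
  data: T;
  jobId: string;
}

// Group failed jobs by error message, largest cluster first.
function clusterByReason<T>(jobs: FailedJob<T>[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const job of jobs) {
    counts.set(job.failedReason, (counts.get(job.failedReason) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Same payload, new id. Jobs matched by shouldSkip (expected failures) are left alone.
function planReenqueue<T>(
  jobs: FailedJob<T>[],
  shouldSkip: (job: FailedJob<T>) => boolean,
  suffix: string,
): EnqueueRequest<T>[] {
  return jobs
    .filter((job) => !shouldSkip(job))
    .map((job) => ({ name: job.name, data: job.data, jobId: job.id + "-" + suffix }));
}

// Hypothetical payload and failures for illustration.
interface FlowPayload {
  flowId: string;
}

const failedSample: FailedJob<FlowPayload>[] = [
  { id: "1", name: "generate", data: { flowId: "a" }, failedReason: "RoundServiceUnavailableException" },
  { id: "2", name: "generate", data: { flowId: "b" }, failedReason: "ContentModerationException" },
  { id: "3", name: "generate", data: { flowId: "c" }, failedReason: "RoundServiceUnavailableException" },
];

console.log(clusterByReason(failedSample));
console.log(
  planReenqueue(failedSample, (job) => job.failedReason === "ContentModerationException", "retry"),
);
```

Skipping moderation failures keeps expected rejections out of the retry batch, and the fresh jobId is what distinguishes a re-enqueue from an in-place retry.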
Job retention is 24h for completed and 7 days for failed (apps/brain/src/config/queue.config.ts).
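For reference, the retention windows translate into seconds as follows. This sketch assumes the config uses BullMQ-style removeOnComplete / removeOnFail options with an age in seconds; the actual contents of queue.config.ts are not reproduced here, and the local KeepJobs type and isExpired helper are illustrative only.

```typescript
// Local mirror of a BullMQ-style retention option (age is expressed in seconds).
interface KeepJobs {
  age?: number;
  count?: number;
}

interface RetentionOptions {
  removeOnComplete: KeepJobs;
  removeOnFail: KeepJobs;
}

const HOUR_SECONDS = 60 * 60;
const DAY_SECONDS = 24 * HOUR_SECONDS;

// 24h for completed jobs, 7 days for failed jobs, as stated in the runbook.
const retention: RetentionOptions = {
  removeOnComplete: { age: 24 * HOUR_SECONDS }, // 86400
  removeOnFail: { age: 7 * DAY_SECONDS }, // 604800
};

// Hypothetical helper: is a finished job past its retention window?
function isExpired(finishedAtMs: number, nowMs: number, keep: KeepJobs): boolean {
  if (keep.age === undefined) return false;
  return nowMs - finishedAtMs > keep.age * 1000;
}

const twoDaysMs = 2 * DAY_SECONDS * 1000;
console.log(isExpired(0, twoDaysMs, retention.removeOnComplete)); // true
console.log(isExpired(0, twoDaysMs, retention.removeOnFail)); // false
```

In practice this means a failed job from two days ago is still available for retry in Bull Board, while a completed job of the same age has already been cleaned up.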
Clerk session issues
Symptoms: 401 spike; users reporting forced logouts.
- Check Clerk status page.
- Confirm CLERK_SECRET_KEY matches the Clerk instance for the deployed stage.
- If only OAuth (oat_) tokens are failing, Clerk Frontend API exchange is broken; see apps/brain/src/modules/application/auth/strategies/clerk.strategy.ts:40-103. Verify CLERK_FRONTEND_API_URL is reachable.
- Roll forward, not back: do not roll Clerk env without rotating clients.
- Cross-link: Brain Clerk Flow, Auth Model.
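To tell quickly whether only OAuth tokens are affected, a sample of recent auth failures can be bucketed by token prefix. The sketch below is a hypothetical triage helper, not code from clerk.strategy.ts; it assumes log entries carry only the first four characters of the bearer token, so full tokens never need to be logged.

```typescript
// One sampled request; tokenPrefix is the first 4 characters of the bearer token only.
interface AuthSample {
  status: number;
  tokenPrefix: string;
}

type ClerkTriage = "oauth-only" | "mixed" | "no-401s";

// OAuth access tokens are identified by the oat_ prefix.
function isOAuthToken(tokenPrefix: string): boolean {
  return tokenPrefix.slice(0, 4) === "oat_";
}

// Decide which branch of the runbook applies to a sample of requests.
function triageUnauthorized(samples: AuthSample[]): ClerkTriage {
  const unauthorized = samples.filter((sample) => sample.status === 401);
  if (unauthorized.length === 0) return "no-401s";
  return unauthorized.every((sample) => isOAuthToken(sample.tokenPrefix))
    ? "oauth-only"
    : "mixed";
}

// Hypothetical sample: two failing OAuth tokens and one healthy session token.
const authSamples: AuthSample[] = [
  { status: 401, tokenPrefix: "oat_" },
  { status: 401, tokenPrefix: "oat_" },
  { status: 200, tokenPrefix: "eyJh" },
];

console.log(triageUnauthorized(authSamples)); // "oauth-only"
```

An "oauth-only" result points at the Frontend API exchange; "mixed" suggests checking the secret key and the Clerk status page first.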
Round outage
Symptoms: surge in RoundServiceUnavailableException; circuit breaker open.
- Check round Railway service status and recent deploys.
- Confirm gRPC reachability: from a brain shell, attempt a connection on ROUND_HOST (default round:8080).
- The circuit breaker in apps/brain/src/modules/application/round/interceptors/circuit-breaker.interceptor.ts:37-43 opens at errorThresholdPercentage: 50 over a 10s rolling window once volumeThreshold: 10 requests have been seen, and probes recovery via half-open after resetTimeout: 15000 ms. Per-call timeout is 35000 ms.
- Generation queue jobs that depend on round will fail and be retained 7 days; once round recovers, retry from Bull Board.
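To make the breaker timings concrete, here is a simplified model of the state machine using the thresholds listed above. It is a sketch for reasoning about when the breaker opens and recovers, not the interceptor's implementation: the class, its method names, and the inclusive threshold comparison are assumptions, and the 35000 ms per-call timeout is not modeled.

```typescript
interface BreakerOptions {
  errorThresholdPercentage: number;
  rollingWindowMs: number;
  volumeThreshold: number;
  resetTimeoutMs: number;
}

type BreakerState = "closed" | "open" | "half-open";

// Simplified model: timestamps are passed in explicitly so behavior is easy to trace.
class BreakerModel {
  private opts: BreakerOptions;
  private events: Array<{ at: number; ok: boolean }> = [];
  private openedAt: number | null = null;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
  }

  state(now: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return now - this.openedAt >= this.opts.resetTimeoutMs ? "half-open" : "open";
  }

  // Record the outcome of a call attempted at time `now`.
  record(now: number, ok: boolean): void {
    const current = this.state(now);
    if (current === "half-open") {
      // The probe decides: success closes the breaker, failure re-opens it.
      this.openedAt = ok ? null : now;
      this.events = [];
      return;
    }
    if (current === "open") return; // rejected calls are not counted
    this.events = this.events.filter((e) => now - e.at < this.opts.rollingWindowMs);
    this.events.push({ at: now, ok });
    const total = this.events.length;
    const failures = this.events.filter((e) => !e.ok).length;
    const errorRate = (failures / total) * 100;
    // Inclusive comparison is an assumption of this model.
    if (total >= this.opts.volumeThreshold && errorRate >= this.opts.errorThresholdPercentage) {
      this.openedAt = now;
    }
  }
}

// Values from the runbook: 50% errors, 10s window, 10 requests, 15s reset.
const roundBreakerOptions: BreakerOptions = {
  errorThresholdPercentage: 50,
  rollingWindowMs: 10000,
  volumeThreshold: 10,
  resetTimeoutMs: 15000,
};

const breaker = new BreakerModel(roundBreakerOptions);
// One call per second: the first 4 succeed, the next 6 fail (60% errors over 10 calls).
for (let i = 0; i < 10; i++) breaker.record(i * 1000, i < 4);
console.log(breaker.state(9000)); // "open"
console.log(breaker.state(24000)); // "half-open"
breaker.record(24000, true); // successful probe
console.log(breaker.state(24000)); // "closed"
```

The trace shows why a short round outage keeps brain rejecting calls for at least 15 seconds after the breaker opens, even if round recovers sooner.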
Provider key compromise / rotation
- Generate new key in the provider console.
- Update Railway env (BRAIN_FAL_AI_KEY, BRAIN_RUNPOD_TOKEN, etc.).
- Trigger a redeploy (env changes alone do not always restart).
- Revoke old key only after the new deploy is live.
Database connection saturation
Symptoms: P1001/P1017 Prisma errors; latency spike on every endpoint that touches Prisma.
- Check Neon dashboard for connection count and CPU.
- Brain uses pgbouncer (pooled) via DATABASE_URL; DIRECT_DATABASE_URL is for migrations only.
- Reduce queue concurrency temporarily: MediaFlowsProcessor runs at concurrency: 500. Drop replicas or scale Redis/Postgres before raising further.
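If lowering concurrency needs to happen without a code change, one option is to read an override at startup and clamp it. The following is a sketch of that idea only: the override source (for example an env var such as MEDIA_FLOWS_CONCURRENCY) is hypothetical and does not exist in the codebase as far as this runbook states; the resolved value would be passed to the worker's concurrency option.

```typescript
// Current value stated in the runbook; also used as the ceiling.
const DEFAULT_CONCURRENCY = 500;

// Hypothetical helper: parse an operator-supplied override.
// Invalid or missing input falls back to the default; values above the default are clamped.
function resolveConcurrency(
  raw: string | undefined,
  fallback: number = DEFAULT_CONCURRENCY,
): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 1) return fallback;
  return Math.min(parsed, fallback); // an override can lower, never raise
}

console.log(resolveConcurrency(undefined)); // 500
console.log(resolveConcurrency("50")); // 50
console.log(resolveConcurrency("9000")); // 500
```

Clamping to the default keeps an emergency override from accidentally raising load on Postgres during an incident.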
Emergency disable of an endpoint
There is no feature-flag layer at the brain controller level today. Options:
- Marking the controller method @Public() is not suitable for disabling: it bypasses auth.
- Ship a hotfix that returns 503 Service Unavailable from the affected controller, or temporarily remove the controller from its module.
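The hotfix option can be as small as a guard in front of the affected handler. The sketch below is framework-agnostic and purely illustrative: the route key, the DISABLED_ROUTES set, and guardRoute are hypothetical names, and in the real NestJS controller the equivalent would be to throw the framework's 503 exception instead of returning a result object.

```typescript
// Simplified response shape for the example.
interface HttpResult<T> {
  status: number;
  body: T | { message: string };
}

// Hypothetical hard-coded kill list shipped with the hotfix.
const DISABLED_ROUTES = new Set<string>(["POST /example/route"]);

// Short-circuit disabled routes with a 503; otherwise run the handler normally.
function guardRoute<T>(routeKey: string, handler: () => T): HttpResult<T> {
  if (DISABLED_ROUTES.has(routeKey)) {
    return { status: 503, body: { message: "Service Unavailable" } };
  }
  return { status: 200, body: handler() };
}

console.log(guardRoute("POST /example/route", () => "generated"));
console.log(guardRoute("GET /health", () => "ok"));
```

Because the list is hard-coded, re-enabling the endpoint requires another deploy; that matches the absence of a feature-flag layer described above.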
TODO
- Healthcheck path is /health (apps/brain/railway.json deploy.healthcheckPath).
- Brain runs a single replica in production: numReplicas: 1 globally and in the us-east4-eqdc4a region (apps/brain/railway.json deploy.numReplicas and deploy.multiRegionConfig).
- TODO(@pawel): Document expected SLOs for mediaflows queue depth and 5xx rate; pull from observability dashboards once defined.