Brain On-Call Quickstart
You are paged for brain. This page tells you what brain does, the top three things to check, and where to escalate. For deep-dive playbooks, jump to Brain Runbook.
What brain is, in one paragraph
Brain is the NestJS service that owns the fennec Postgres schema (characters, generation media, datasets, tags), drives generation pipelines via BullMQ on Redis, and brokers calls to round (gRPC) plus external AI providers (FAL, RunPod, Kling, Wavespeed, Vertex, OpenRouter, Hive). End-users authenticate via Clerk; sirloin and other internal services authenticate via static API keys.
Code: apps/brain/. Stage resolution: apps/brain/src/common/runtime/stage.ts.
First five minutes
- Open the dashboards. TODO(@pawel): paste the canonical Axiom + Sentry + Bull Board URLs here.
  - Axiom dataset: traces+logs from BRAIN_OTEL_TRACES_LOGS_HEADERS.
  - Axiom dataset: metrics from BRAIN_OTEL_METRICS_HEADERS.
  - Sentry project: brain (set by @sentry/nestjs).
  - Bull Board: https://<brain-host>/queues (Clerk-authenticated).
- Identify the symptom class.
  - 5xx spike → Sentry top issues, then Axiom level=error.
  - 401/403 spike → likely Clerk or API-key issue (see Auth).
  - Slow generations / queue depth → Bull Board mediaflows queue.
  - Memory growth / OOM restarts → Axiom metrics process.memory.rss, process.memory.external, process.memory.array_buffers, process.memory.heap_used.
  - Webhook failures → Axiom filter on path:/api/webhook/*.
- Check recent deploys. Railway → brain → Deployments. A deploy in the last hour is the most common root cause.
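The symptom classes above can be written down as a small lookup, which is handy if you want to paste a first action into an incident channel. This is only a sketch: the `Symptom` names and the `symptomForStatus` helper are hypothetical and do not exist in brain; the check strings are transcribed from the list above.

```typescript
// Symptom classes from the triage list; the identifiers are illustrative only.
type Symptom =
  | "5xx-spike"
  | "auth-spike"
  | "queue-depth"
  | "memory-growth"
  | "webhook-failures";

// First place to look for each symptom class, as stated in the list above.
const firstCheck: Record<Symptom, string> = {
  "5xx-spike": "Sentry top issues, then Axiom level=error",
  "auth-spike": "Clerk or API-key issue (see Auth)",
  "queue-depth": "Bull Board mediaflows queue",
  "memory-growth": "Axiom metrics process.memory.*",
  "webhook-failures": "Axiom filter on path:/api/webhook/*",
};

// Hypothetical helper: map an HTTP status code to a symptom class, if one applies.
function symptomForStatus(status: number): Symptom | undefined {
  if (status >= 500 && status <= 599) return "5xx-spike";
  if (status === 401 || status === 403) return "auth-spike";
  return undefined;
}
```

For example, `firstCheck[symptomForStatus(502)!]` yields the Sentry-then-Axiom check.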
Top alerts
These are the alert classes operators should expect. Specific thresholds and channels live in observability config (TODO(@pawel): link to alert defs).
| Alert | Usual cause | First action |
|---|---|---|
| Brain 5xx rate > baseline | Bad deploy, provider outage, Prisma error storm | Check Sentry top issue; roll back if deploy-correlated. |
| mediaflows queue depth growing | Worker crash, Redis unavailable, round down | Bull Board → check Failed/Stalled; check round and Redis health. |
| RoundServiceUnavailableException spike | round down or networking | See Brain Runbook → Round outage. |
| Clerk 401 spike | Clerk outage or wrong secret | Clerk status; verify CLERK_SECRET_KEY. |
| Postgres connection errors (P1001/P1017) | Neon saturation or pooler hiccup | Neon dashboard; reduce queue concurrency if needed. |
| Sentry: unhandled exception in processor | Provider regression, schema drift | Inspect stack; coordinate with provider owners. |
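When scanning Sentry or Axiom output for the Postgres connection alert, it helps to separate P1001/P1017 from ordinary query errors. The sketch below is a hypothetical helper, not code from brain. It assumes the Prisma error object carries the code on either `code` (known request errors) or `errorCode` (initialization errors) and checks both.

```typescript
// Connection-level Prisma error codes named in the alert table above.
const CONNECTION_ERROR_CODES = new Set<string>(["P1001", "P1017"]);

// Hypothetical helper: true when a thrown value carries one of those codes.
// Checks both `code` and `errorCode`, since the property differs by error class.
function isPostgresConnectionError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as { code?: unknown; errorCode?: unknown };
  const code = candidate.code ?? candidate.errorCode;
  return typeof code === "string" && CONNECTION_ERROR_CODES.has(code);
}
```

Anything that returns false here (for example a unique-constraint violation) points away from Neon saturation and toward application or schema issues.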
Common causes by symptom
- Generations stuck “in progress”: mediaflows worker crashed mid-job; onStalled log entries appear in Axiom. Restart brain.
- API keys not accepted: AUTHORIZED_KEYS was rotated without redeploy. Both old and new keys can be present (comma-separated) during rotation.
- Webhook 4xx from VI Generator: API key mismatch or DTO drift. Check webhook.http.controller.ts against the upstream payload.
- All requests 502 on Railway: brain didn’t bind IPv6. Confirm logs show Listening on [::]:3000. If not, the service was started with default binding and Railway can’t reach it (see main.ts comments).
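To reason about why a key is rejected, it can help to see how a comma-separated AUTHORIZED_KEYS value would accept both old and new keys. The following is a minimal sketch, not brain's actual guard: `parseAuthorizedKeys` and `isAuthorized` are hypothetical names, and trimming whitespace around entries is an assumption about how the value is parsed.

```typescript
// Parse a comma-separated key list, dropping blanks and surrounding whitespace.
// (Whitespace trimming is an assumption, not confirmed behaviour of brain.)
function parseAuthorizedKeys(raw: string | undefined): Set<string> {
  const keys = (raw ?? "")
    .split(",")
    .map((k) => k.trim())
    .filter((k) => k.length > 0);
  return new Set(keys);
}

// Hypothetical check: a presented key is accepted if it appears in the list.
function isAuthorized(presented: string, raw: string | undefined): boolean {
  return presented.length > 0 && parseAuthorizedKeys(raw).has(presented);
}
```

With a value like `old-key,new-key`, both callers still holding `old-key` and callers already on `new-key` pass the check, which is what makes a gradual rotation possible. The value is read from the environment at process start, so a change only takes effect after a redeploy.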
Escalation
- Brain owner: @pawel (per docs/src/content/docs/services/brain.md). TODO(@pawel): confirm rotation.
- Database: Neon owner; see Deployment Environment.
- Redis: TODO(@pawel) Upstash vs. Railway Redis owner.
- Round (ML): separate on-call — see Round.
- Sirloin (calls into brain): see Sirloin on-call (TODO(@pawel) link to sirloin-oncall once published).
- Clerk: external — open ticket in Clerk dashboard if API is implicated.
Communications
- Status: TODO(@pawel) link to internal status channel.
- Customer-facing: coordinate with sirloin/brisket on-call before posting; brain alone does not own customer messaging.
Don’t do during an incident
- Don’t run prisma migrate resolve --rolled-back in production without coordination; see Brain Runbook → Prisma migration rollback.
- Don’t bulk-retry failed mediaflows jobs without checking root cause; you’ll multiply provider spend.
- Don’t rotate AUTHORIZED_KEYS to a single new value without first appending the new key; sirloin and admin tooling may be holding the old key.
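The append-first rule for AUTHORIZED_KEYS amounts to a two-phase change: add the new key alongside the old one, redeploy, move callers over, and only then drop the old key. A sketch of computing the value for each phase follows; `appendKey` and `retireKey` are hypothetical helpers for illustration, not tooling that exists in the repo.

```typescript
// Split a comma-separated value into trimmed, non-empty keys.
function splitKeys(value: string): string[] {
  return value
    .split(",")
    .map((k) => k.trim())
    .filter((k) => k.length > 0);
}

// Phase 1: add the new key while keeping every existing key valid.
function appendKey(current: string, newKey: string): string {
  const keys = splitKeys(current);
  if (keys.indexOf(newKey) === -1) keys.push(newKey);
  return keys.join(",");
}

// Phase 2: remove the old key once all callers have switched over.
// Refuses to produce an empty list, which would lock every caller out.
function retireKey(current: string, oldKey: string): string {
  const remaining = splitKeys(current).filter((k) => k !== oldKey);
  if (remaining.length === 0) {
    throw new Error("refusing to remove the last authorized key");
  }
  return remaining.join(",");
}
```

Each phase still needs its own redeploy so the running service picks up the new value.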
TODO
- TODO(@pawel): Link Axiom dashboards (logs, traces, BullMQ depth).
- TODO(@pawel): Link Sentry project URL.
- TODO(@pawel): Document SLOs (5xx rate, p95 latency, queue depth) once published.
- TODO(@pawel): Confirm primary and secondary on-call rotation source of truth.