Brain On-Call Quickstart

You are paged for brain. This page tells you what brain does, the top three things to check, and where to escalate. For deep-dive playbooks, jump to Brain Runbook.

What brain is, in one paragraph

Brain is the NestJS service that owns the fennec Postgres schema (characters, generation media, datasets, tags), drives generation pipelines via BullMQ on Redis, and brokers calls to round (gRPC) plus external AI providers (FAL, RunPod, Kling, Wavespeed, Vertex, OpenRouter, Hive). End-users authenticate via Clerk; sirloin and other internal services authenticate via static API keys.

Code: apps/brain/. Stage resolution: apps/brain/src/common/runtime/stage.ts.

First five minutes

  1. Open the dashboards. TODO(@pawel): paste the canonical Axiom + Sentry + Bull Board URLs here.
    • Axiom dataset: traces+logs from BRAIN_OTEL_TRACES_LOGS_HEADERS.
    • Axiom dataset: metrics from BRAIN_OTEL_METRICS_HEADERS.
    • Sentry project: brain (set by @sentry/nestjs).
    • Bull Board: https://<brain-host>/queues (Clerk-authenticated).
  2. Identify the symptom class.
    • 5xx spike → Sentry top issues, then Axiom level=error.
    • 401/403 spike → likely Clerk or API-key issue (see Auth).
    • Slow generations / queue depth → Bull Board mediaflows queue.
    • Memory growth / OOM restarts → Axiom metrics process.memory.rss, process.memory.external, process.memory.array_buffers, process.memory.heap_used.
    • Webhook failures → Axiom filter on path:/api/webhook/*.
  3. Check recent deploys. Railway → brain → Deployments. A deploy in the last hour is the most common root cause.
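
The triage in steps 2 and 3 can be sketched as a lookup table plus a deploy-window check. This is a minimal illustration, not code from brain: `SymptomClass`, `FIRST_LOOK`, and `isDeployCorrelated` are hypothetical names, the timestamps are made-up example values, and the one-hour window only mirrors the rule of thumb in step 3.

```typescript
// Hypothetical triage helper; none of these names exist in brain.
type SymptomClass = "5xx" | "auth" | "queue" | "memory" | "webhook";

// First place to look for each symptom class, copied from the list above.
const FIRST_LOOK: Record<SymptomClass, string> = {
  "5xx": "Sentry top issues, then Axiom level=error",
  auth: "Clerk or API-key configuration",
  queue: "Bull Board mediaflows queue",
  memory: "Axiom metrics process.memory.*",
  webhook: "Axiom filter on path:/api/webhook/*",
};

const ONE_HOUR_MS = 60 * 60 * 1000;

// True if any deploy happened within `windowMs` before the incident started.
// Deploys that happened after the incident start are ignored.
function isDeployCorrelated(
  deployTimes: Date[],
  incidentStart: Date,
  windowMs: number = ONE_HOUR_MS,
): boolean {
  return deployTimes.some((deployedAt) => {
    const delta = incidentStart.getTime() - deployedAt.getTime();
    return delta >= 0 && delta <= windowMs;
  });
}

// Example values only.
const incident = new Date("2024-01-01T12:00:00Z");
const deploys = [
  new Date("2024-01-01T08:00:00Z"),
  new Date("2024-01-01T11:30:00Z"),
];

console.log(FIRST_LOOK["queue"]);
console.log(isDeployCorrelated(deploys, incident));
```

If the check returns true, treat the deploy as the leading suspect before digging into provider or infrastructure causes.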

Top alerts

These are the alert classes operators should expect. Specific thresholds and channels live in observability config (TODO(@pawel): link to alert defs).

| Alert | Usual cause | First action |
| --- | --- | --- |
| Brain 5xx rate > baseline | Bad deploy, provider outage, Prisma error storm | Check Sentry top issue; roll back if deploy-correlated. |
| mediaflows queue depth growing | Worker crash, Redis unavailable, round down | Bull Board → check Failed/Stalled; check round and Redis health. |
| RoundServiceUnavailableException spike | round down or networking | See Brain Runbook → Round outage. |
| Clerk 401 spike | Clerk outage or wrong secret | Clerk status; verify CLERK_SECRET_KEY. |
| Postgres connection errors (P1001/P1017) | Neon saturation or pooler hiccup | Neon dashboard; reduce queue concurrency if needed. |
| Sentry: unhandled exception in processor | Provider regression, schema drift | Inspect stack; coordinate with provider owners. |
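
To separate the Postgres connection alert from other Prisma failures when reading logs, a small classifier over the error code may help. The sketch below is an assumption-laden illustration rather than brain's actual error handling: `isDbConnectionError` is a hypothetical helper, and it duck-types the error so it needs no Prisma import. It assumes the code is exposed as either `code` or `errorCode`, depending on which Prisma error class was thrown.

```typescript
// Codes from the alert table: P1001 (cannot reach the database server)
// and P1017 (server closed the connection).
const CONNECTION_ERROR_CODES = new Set(["P1001", "P1017"]);

// Hypothetical helper. Assumption: the Prisma error carries its code in
// `code` or `errorCode`; anything else is treated as "not a connection error".
function isDbConnectionError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { code?: unknown; errorCode?: unknown };
  const code = typeof e.code === "string" ? e.code : e.errorCode;
  return typeof code === "string" && CONNECTION_ERROR_CODES.has(code);
}

console.log(isDbConnectionError({ code: "P1001" }));
console.log(isDbConnectionError({ errorCode: "P1017" }));
console.log(isDbConnectionError({ code: "P2002" }));
```

A storm of errors that match this check points at Neon or the pooler; errors that do not match are more likely query or schema problems.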

Common causes by symptom

  • Generations stuck “in progress”: a mediaflows worker crashed mid-job. Look for onStalled log entries in Axiom to confirm, then restart brain.
  • API keys not accepted: AUTHORIZED_KEYS was rotated without a redeploy. During rotation, the old and new keys can both be present (comma-separated).
  • Webhook 4xx from VI Generator: API key mismatch or DTO drift. Check webhook.http.controller.ts against the upstream payload.
  • All requests 502 on Railway: brain didn’t bind IPv6. Confirm logs show Listening on [::]:3000 — if not, the service was started with default binding and Railway can’t reach it (see main.ts comments).
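
The comma-separated AUTHORIZED_KEYS behaviour mentioned above can be illustrated with a short sketch. It is not brain's real parser: `parseAuthorizedKeys` and `isAuthorized` are hypothetical names, the key values are placeholders, and trimming whitespace around each entry is an assumption.

```typescript
// Hypothetical parser for a comma-separated key list.
// Assumption: surrounding whitespace and empty entries are ignored.
function parseAuthorizedKeys(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((key) => key.trim())
      .filter((key) => key.length > 0),
  );
}

// A presented key is accepted if it matches any configured key.
function isAuthorized(presented: string, keys: Set<string>): boolean {
  return keys.has(presented);
}

// During rotation both keys are listed, so callers on either key still pass.
const duringRotation = parseAuthorizedKeys("old-key, new-key");
// After the old key is removed, callers still holding it are rejected.
const afterRotation = parseAuthorizedKeys("new-key");

console.log(isAuthorized("old-key", duringRotation));
console.log(isAuthorized("old-key", afterRotation));
```

If callers start failing right after a rotation, compare the key they present against the value the running deployment actually loaded, since the variable only takes effect on redeploy.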

Escalation

  1. Brain owner: @pawel (per docs/src/content/docs/services/brain.md). TODO(@pawel): confirm rotation.
  2. Database: Neon owner — see Deployment Environment.
  3. Redis: TODO(@pawel): confirm the owner (Upstash vs. Railway Redis).
  4. Round (ML): separate on-call — see Round.
  5. Sirloin (calls into brain): see Sirloin on-call (TODO(@pawel) link to sirloin-oncall once published).
  6. Clerk: external — open ticket in Clerk dashboard if API is implicated.

Communications

  • Status: TODO(@pawel) link to internal status channel.
  • Customer-facing: coordinate with sirloin/brisket on-call before posting; brain alone does not own customer messaging.

Don’t do during an incident

  • Don’t run prisma migrate resolve --rolled-back in production without coordination; see Brain Runbook → Prisma migration rollback.
  • Don’t bulk-retry failed mediaflows jobs without checking root cause; you’ll multiply provider spend.
  • Don’t rotate AUTHORIZED_KEYS to a single new value without first appending the new key alongside the old one; sirloin and admin tooling may still be holding the old key.
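
Before retrying failed mediaflows jobs, it can help to group them by failure reason so that one root cause is not retried hundreds of times. The sketch below assumes the failed jobs have already been fetched into plain objects. BullMQ jobs expose a `failedReason` field, but `FailedJobLike`, `groupByFailedReason`, and the sample data are hypothetical and only illustrate the idea.

```typescript
// Minimal shape of a failed job for this sketch (illustrative, not brain's type).
interface FailedJobLike {
  id: string;
  failedReason?: string;
}

// Group job ids by failure reason so the dominant root cause is visible
// before anyone retries in bulk.
function groupByFailedReason(jobs: FailedJobLike[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const job of jobs) {
    const reason = job.failedReason ?? "unknown";
    const ids = groups.get(reason) ?? [];
    ids.push(job.id);
    groups.set(reason, ids);
  }
  return groups;
}

// Made-up sample data.
const failedJobs: FailedJobLike[] = [
  { id: "1", failedReason: "provider timeout" },
  { id: "2", failedReason: "provider timeout" },
  { id: "3", failedReason: "validation failed" },
  { id: "4" },
];

for (const [reason, ids] of groupByFailedReason(failedJobs)) {
  console.log(`${reason}: ${ids.length} job(s)`);
}
```

If a single reason dominates, fix or wait out that cause first, then retry only the affected group.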

TODO

  • TODO(@pawel): Link Axiom dashboards (logs, traces, BullMQ depth).
  • TODO(@pawel): Link Sentry project URL.
  • TODO(@pawel): Document SLOs (5xx rate, p95 latency, queue depth) once published.
  • TODO(@pawel): Confirm primary and secondary on-call rotation source of truth.