Fennec On-Call

On-call notes for the fennec admin SPA. Scope: keeping the dashboard reachable and operational for the internal admin/creator audience.

Audience and severity baseline

Fennec is internal-only (admins, moderators, creators). It is not on the end-user critical path. SLO posture reflects that:

  • Sev1 (page now): total outage during business hours blocking moderation, billing review, or ops-critical workflows.
  • Sev2 (page within 1h): auth broken (no one can sign in) or admin pages return 5xx broadly.
  • Sev3 (next business day): single page broken, cosmetic regression, non-blocking errors in console.

If a fennec issue prevents moderation that gates user-visible content (/moderation-review, /shop-vi/review), escalate by one severity level (for example, Sev3 becomes Sev2).
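The baseline and the escalation rule can be sketched as a small classifier. This is an illustration only: the `Symptom` shape and the function names are hypothetical, and the treatment of a total outage outside business hours (Sev2 here) is an assumption, since the baseline above does not cover that case.

```typescript
// Sketch only: names are illustrative and not part of the fennec codebase.
type Severity = "Sev1" | "Sev2" | "Sev3";

interface Symptom {
  totalOutage: boolean;            // dashboard unreachable for everyone
  duringBusinessHours: boolean;
  authBroken: boolean;             // no one can sign in
  broad5xx: boolean;               // admin pages return 5xx broadly
  blocksGatingModeration: boolean; // /moderation-review or /shop-vi/review affected
}

function baseSeverity(s: Symptom): Severity {
  if (s.totalOutage) {
    // Assumption: off-hours total outages are not defined above; treated as Sev2.
    return s.duringBusinessHours ? "Sev1" : "Sev2";
  }
  if (s.authBroken || s.broad5xx) return "Sev2";
  return "Sev3";
}

// Raise by one level; Sev1 is already the top level.
function escalate(sev: Severity): Severity {
  return sev === "Sev3" ? "Sev2" : "Sev1";
}

function classify(s: Symptom): Severity {
  const base = baseSeverity(s);
  return s.blocksGatingModeration ? escalate(base) : base;
}
```

For example, a single broken page is Sev3 on its own, but the same breakage on /moderation-review classifies as Sev2.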

Top alerts and signals

Fennec does not have dedicated synthetic checks today. TODO(@law): confirm once Axiom / monitoring is wired. Until then, watch these signals:

| Signal source | What it tells you | Action |
| --- | --- | --- |
| Railway beef-fennec deploy events | Build/deploy failed; healthcheck failing | See fennec-runbook.md → Rollback |
| Brain error rate spike | Most fennec failures are upstream | Triage brain first; fennec usually self-heals |
| Clerk status (status.clerk.com) | Auth-wide outage | Comms to operators; nothing fennec can do |
| User reports in #ops Slack | Primary alert channel today | Open an incident if more than one operator reports |
| Browser console errors (DevTools) | Bundle-shape or env regressions | Cross-reference with last deploy in Railway |

First five minutes

  1. Confirm scope: one user, one page, or everyone everywhere?
  2. Open Railway → beef-fennec. Check the latest deploy:
    • Status SUCCESS? Healthcheck passing?
    • Recent deploy (< 1h) correlated with reports?
  3. If scope is “everyone everywhere” and a recent deploy exists, roll back on Railway before further investigation. Cheap and reversible.
  4. If no recent deploy, check brain and Strapi backend deploy status — fennec is a thin shell over upstreams.
  5. Check Clerk status if all signs point to auth.
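The decision order in steps 3 to 5 can be written down as a small function. This is a sketch under stated assumptions: the input shape, the action labels, and the `continue-investigating` fallback are hypothetical names chosen for illustration, and "recent" uses the < 1h window from step 2.

```typescript
// Sketch only: types and action labels are illustrative, not an existing tool.
type Scope = "one-user" | "one-page" | "everyone";

type Action =
  | "rollback-on-railway"
  | "check-brain-and-strapi-deploys"
  | "check-clerk-status"
  | "continue-investigating"; // placeholder when no rule above applies

interface TriageInput {
  scope: Scope;
  minutesSinceLastDeploy: number; // read from the Railway deploy list
  looksLikeAuth: boolean;         // all signs point to auth
}

function triagePlan(t: TriageInput): Action[] {
  const recentDeploy = t.minutesSinceLastDeploy < 60;

  // Step 3: everyone affected plus a recent deploy means roll back first.
  if (t.scope === "everyone" && recentDeploy) return ["rollback-on-railway"];

  const plan: Action[] = [];
  // Step 4: no recent deploy, so suspect the upstreams.
  if (!recentDeploy) plan.push("check-brain-and-strapi-deploys");
  // Step 5: auth-shaped symptoms.
  if (t.looksLikeAuth) plan.push("check-clerk-status");

  if (plan.length === 0) plan.push("continue-investigating");
  return plan;
}
```

Returning an ordered list rather than a single action keeps the sequence of the numbered steps visible when more than one of them applies.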

Common failure modes (paired with fennec-errors.md)

| Mode | Detection | Mitigation |
| --- | --- | --- |
| Fresh deploy broke the SPA | Console error, blank page after deploy | Railway → Redeploy previous revision |
| REACT_APP_* env var missing post-deploy | “undefined” in request URL or blank screen | Set env in Railway, trigger rebuild |
| Brain unreachable | All data calls 5xx; auth fine | Page brain on-call |
| Clerk widget won’t mount | /login blank panel | Verify REACT_APP_CLERK_PUBLISHABLE_KEY matches the Clerk env |
| Stale bundle / mismatched assets | 404s on /assets/*-<hash>.js | Hard reload; if persists, redeploy |
| Strapi backend down | Specific admin pages 5xx, brain pages OK | Page Strapi/backend owner; degrade gracefully (most flows hit brain) |
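For the missing REACT_APP_* mode, a quick check of the environment can confirm the diagnosis before triggering a rebuild. The following is a minimal sketch, not something that ships with fennec: `missingEnvVars` is a hypothetical helper, and of the names in the example only REACT_APP_CLERK_PUBLISHABLE_KEY comes from this page; REACT_APP_BRAIN_URL and the key value are placeholders.

```typescript
// Sketch only: helper and example values are hypothetical.
type Env = Record<string, string | undefined>;

// Returns the names that are absent, blank, or the literal string "undefined"
// (the symptom that shows up in request URLs).
function missingEnvVars(env: Env, required: string[]): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "" || value === "undefined";
  });
}

// Assumed list: replace with the variables the build actually needs.
const requiredVars = ["REACT_APP_CLERK_PUBLISHABLE_KEY", "REACT_APP_BRAIN_URL"];

const exampleMissing = missingEnvVars(
  { REACT_APP_CLERK_PUBLISHABLE_KEY: "pk_test_placeholder" },
  requiredVars,
);
// exampleMissing contains only "REACT_APP_BRAIN_URL"
```

The helper takes the environment as an argument so it can be run against a copy of the Railway variables rather than against the running bundle.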

Escalation

Owners (per frontmatter and apps/fennec/CLAUDE.md):

  1. Primary: @law (service owner).
  2. Auth issues: brain on-call (Clerk verification lives in brain).
  3. Strapi backend issues: TODO(@law): name the backend owner — apps/fennec/backend/ is referenced from docker-compose.yml but the directory ships no committed sources or runbook.
  4. Railway / infra: platform on-call rotation.

For a paging incident, drop a one-liner in #ops with:

  • Scope (one user / many / total)
  • Last good deploy ID (Railway)
  • Error class (auth / network / 5xx / blank screen)
  • Whether you tried a rollback
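One way to keep that one-liner consistent is a small formatter. This is a sketch with an assumed layout: the field names, the " | " separator, and the example deploy ID are illustrative, and nothing here posts to Slack.

```typescript
// Sketch only: the message layout is an assumption, not an agreed #ops format.
interface IncidentReport {
  scope: "one user" | "many" | "total";
  lastGoodDeployId: string; // copied from Railway
  errorClass: "auth" | "network" | "5xx" | "blank screen";
  rollbackTried: boolean;
}

function formatOpsOneLiner(r: IncidentReport): string {
  return [
    "fennec incident",
    `scope: ${r.scope}`,
    `last good deploy: ${r.lastGoodDeployId}`,
    `error class: ${r.errorClass}`,
    `rollback tried: ${r.rollbackTried ? "yes" : "no"}`,
  ].join(" | ");
}

// Example with a placeholder deploy ID.
const exampleLine = formatOpsOneLiner({
  scope: "total",
  lastGoodDeployId: "dep_example",
  errorClass: "blank screen",
  rollbackTried: true,
});
```

Typing `scope` and `errorClass` as unions means a report that omits or misspells one of the four fields fails at compile time instead of reaching the channel half-filled.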

Things that look like alerts but aren’t

  • pnpm lint failing in CI for an open PR — that is a contributor problem, not an incident. Comment on the PR; do not page.
  • Knip warnings — dead-code detection only. Not runtime failures.
  • TypeScript errors in dev — only matter when CI fails on release.

After-incident

  • File a brief writeup in the postmortem doc (TODO(@law): pin the location — likely under docs/operations/runbooks/ once standardised).
  • If the incident exposed a new failure mode, add an entry to fennec-errors.md and (if operationally distinct) a row to the Common failure modes table above.
  • If any TODO(@law) on this page was resolved during the incident, replace it with the confirmed answer.