Fennec On-Call

On-call notes for the fennec admin SPA. Scope: keeping the dashboard reachable and operational for the internal admin/creator audience.

Audience and severity baseline

Fennec is internal-only (admins, moderators, creators). It is not on the end-user critical path. SLO posture reflects that:

Sev1 (page now): total outage during business hours blocking moderation, billing review, or ops-critical workflows.
Sev2 (page within 1h): auth broken (no one can sign in) or admin pages return 5xx broadly.
Sev3 (next business day): single page broken, cosmetic regression, non-blocking errors in console.

If a fennec issue is preventing moderation that gates user-visible content (/moderation-review, /shop-vi/review), escalate one severity level.

Fennec does not have dedicated synthetic checks today. TODO(@law): confirm once Axiom / monitoring is wired. The signals to watch:

Signal source	What it tells you	Action
Railway `beef-fennec` deploy events	Build/deploy failed; healthcheck failing	See `fennec-runbook.md` → Rollback
Brain error rate spike	Most fennec failures are upstream	Triage brain first; fennec usually self-heals
Clerk status (status.clerk.com)	Auth-wide outage	Comms to operators; nothing fennec can do
User reports in `#ops` Slack	Primary alert channel today	Open an incident if more than one operator reports
Browser console errors (DevTools)	Bundle-shape or env regressions	Cross-reference with last deploy in Railway

Confirm scope: one user, one page, or everyone everywhere?
Open Railway → beef-fennec. Check the latest deploy:
- Status SUCCESS? Healthcheck passing?
- Recent deploy (< 1h) correlated with reports?
If scope is “everyone everywhere” and a recent deploy exists, roll back on Railway before further investigation. Cheap and reversible.
If no recent deploy, check brain and Strapi backend deploy status — fennec is a thin shell over upstreams.
Check Clerk status if all signs point to auth.

Mode	Detection	Mitigation
Fresh deploy broke the SPA	Console error, blank page after deploy	Railway → Redeploy previous revision
`REACT_APP_*` env var missing post-deploy	”undefined” in request URL or blank screen	Set env in Railway, trigger rebuild
Brain unreachable	All data calls 5xx; auth fine	Page brain on-call
Clerk widget won’t mount	`/login` blank panel	Verify `REACT_APP_CLERK_PUBLISHABLE_KEY` matches the Clerk env
Stale bundle / mismatched assets	404s on `/assets/*-<hash>.js`	Hard reload; if persists, redeploy
Strapi backend down	Specific admin pages 5xx, brain pages OK	Page Strapi/backend owner; degrade gracefully — most flows hit brain

Owners (per frontmatter and apps/fennec/CLAUDE.md):

Primary: @law (service owner).
Auth issues: brain on-call (Clerk verification lives in brain).
Strapi backend issues: TODO(@law): name the backend owner — apps/fennec/backend/ is referenced from docker-compose.yml but the directory ships no committed sources or runbook.
Railway / infra: platform on-call rotation.

For a paging incident, drop a one-liner in #ops with:

pnpm lint failing in CI for an open PR — that is a contributor problem, not an incident. Comment on the PR; do not page.
Knip warnings — dead-code detection only. Not runtime failures.
TypeScript errors in dev — only matter when CI fails on release.

File a brief writeup in the postmortem doc (TODO(@law): pin the location — likely under docs/operations/runbooks/ once standardised).
If the incident exposed a new failure mode, add an entry to fennec-errors.md and (if operationally distinct) a row above.
If any TODO(@law) on this page was resolved during the incident, fix it.