Fennec On-Call
On-call notes for the fennec admin SPA. Scope: keeping the dashboard reachable and operational for the internal admin/creator audience.
Audience and severity baseline
Fennec is internal-only (admins, moderators, creators). It is not on the end-user critical path. SLO posture reflects that:
- Sev1 (page now): total outage during business hours blocking moderation, billing review, or ops-critical workflows.
- Sev2 (page within 1h): auth broken (no one can sign in) or admin pages return 5xx broadly.
- Sev3 (next business day): single page broken, cosmetic regression, non-blocking errors in console.
If a fennec issue is preventing moderation that gates user-visible
content (/moderation-review, /shop-vi/review), escalate one severity
level.
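The severity rules above, including the one-level escalation, can be sketched as a small classifier. This is a hypothetical model for illustration: the type and field names are invented here, and the handling of an off-hours total outage is an assumption, since the notes do not cover that case.

```typescript
// Illustrative sketch only: type and field names are hypothetical, not part of fennec.
type Severity = "Sev1" | "Sev2" | "Sev3";

interface IncidentFacts {
  totalOutage: boolean;
  businessHours: boolean;
  authBroken: boolean;
  broad5xx: boolean;
  // True when the issue blocks moderation that gates user-visible content
  // (/moderation-review, /shop-vi/review).
  blocksGatingModeration: boolean;
}

function baseSeverity(f: IncidentFacts): Severity {
  if (f.totalOutage && f.businessHours) return "Sev1";
  // Assumption: an off-hours total outage is treated like broad 5xx (Sev2);
  // the notes above do not say how to classify it.
  if (f.totalOutage || f.authBroken || f.broad5xx) return "Sev2";
  return "Sev3";
}

function escalateOneLevel(s: Severity): Severity {
  if (s === "Sev3") return "Sev2";
  return "Sev1"; // Sev2 becomes Sev1; Sev1 is already the ceiling.
}

function classify(f: IncidentFacts): Severity {
  const base = baseSeverity(f);
  return f.blocksGatingModeration ? escalateOneLevel(base) : base;
}
```

Under this sketch, a single broken page on /moderation-review starts at Sev3 and is raised to Sev2, while an auth outage that also blocks gating moderation is raised from Sev2 to Sev1.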
Top alerts and signals
Fennec does not have dedicated synthetic checks today. TODO(@law): confirm once Axiom / monitoring is wired. The signals to watch:
| Signal source | What it tells you | Action |
|---|---|---|
| Railway beef-fennec deploy events | Build/deploy failed; healthcheck failing | See fennec-runbook.md → Rollback |
| Brain error rate spike | Most fennec failures are upstream | Triage brain first; fennec usually self-heals |
| Clerk status (status.clerk.com) | Auth-wide outage | Comms to operators; nothing fennec can do |
| User reports in #ops Slack | Primary alert channel today | Open an incident if more than one operator reports |
| Browser console errors (DevTools) | Bundle-shape or env regressions | Cross-reference with last deploy in Railway |
First five minutes
- Confirm scope: one user, one page, or everyone everywhere?
- Open Railway → beef-fennec. Check the latest deploy:
  - Status SUCCESS? Healthcheck passing?
  - Recent deploy (< 1h) correlated with reports?
- If scope is “everyone everywhere” and a recent deploy exists, roll back on Railway before further investigation. Cheap and reversible.
- If no recent deploy, check brain and Strapi backend deploy status — fennec is a thin shell over upstreams.
- Check Clerk status if all signs point to auth.
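The decision order in the steps above can be sketched as a pure function. The names and the returned action strings are hypothetical. The 60-minute window mirrors the "< 1h" check in the list, and the "correlate reports" branch is an assumption for the case the list does not spell out (a recent deploy with limited scope).

```typescript
// Illustrative sketch only: names and action strings are hypothetical.
type Scope = "one-user" | "one-page" | "everyone";

interface TriageInput {
  scope: Scope;
  minutesSinceLastDeploy: number | null; // null when no deploy is known
  symptomsPointToAuth: boolean;
}

function firstActions(t: TriageInput): string[] {
  const recentDeploy =
    t.minutesSinceLastDeploy !== null && t.minutesSinceLastDeploy < 60;
  const actions: string[] = [];

  if (t.scope === "everyone" && recentDeploy) {
    // Roll back before investigating further: cheap and reversible.
    actions.push("roll back on Railway");
  } else if (recentDeploy) {
    // Assumption: with limited scope, confirm the correlation before rolling back.
    actions.push("correlate reports with the latest deploy");
  } else {
    // No recent deploy: fennec is a thin shell, so look upstream.
    actions.push("check brain and Strapi deploy status");
  }

  if (t.symptomsPointToAuth) actions.push("check Clerk status");
  return actions;
}
```

Returning an ordered list rather than a single action keeps the Clerk check visible even when a rollback is already under way.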
Common failure modes (paired with fennec-errors.md)
| Mode | Detection | Mitigation |
|---|---|---|
| Fresh deploy broke the SPA | Console error, blank page after deploy | Railway → Redeploy previous revision |
| REACT_APP_* env var missing post-deploy | "undefined" in request URL or blank screen | Set env in Railway, trigger rebuild |
| Brain unreachable | All data calls 5xx; auth fine | Page brain on-call |
| Clerk widget won’t mount | /login blank panel | Verify REACT_APP_CLERK_PUBLISHABLE_KEY matches the Clerk env |
| Stale bundle / mismatched assets | 404s on /assets/*-<hash>.js | Hard reload; if persists, redeploy |
| Strapi backend down | Specific admin pages 5xx, brain pages OK | Page Strapi/backend owner; degrade gracefully — most flows hit brain |
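For the missing REACT_APP_* row, a minimal sketch of how the "undefined" symptom arises and how it might be detected. Both helper functions are hypothetical, and REACT_APP_API_URL is an invented variable name used only for illustration; the only variable named in these notes is REACT_APP_CLERK_PUBLISHABLE_KEY.

```typescript
// Illustrative sketch only: helper names and REACT_APP_API_URL are hypothetical.
type Env = Record<string, string | undefined>;

// Returns the required variable names that are unset or blank.
function missingEnvVars(env: Env, required: readonly string[]): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Crude symptom check: an unset variable interpolated into a template
// literal becomes the literal text "undefined". This can false-positive
// on URLs that legitimately contain that word.
function urlLooksUnconfigured(url: string): boolean {
  return url.includes("undefined");
}

// Example: the build ran without the (hypothetical) API base URL.
const buildEnv: Env = { REACT_APP_CLERK_PUBLISHABLE_KEY: "pk_test_example" };
const requestUrl = `${buildEnv["REACT_APP_API_URL"]}/users`; // "undefined/users"
```

Because these variables are baked into the bundle at build time, setting the value in Railway only takes effect after a rebuild, which matches the mitigation in the table.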
Escalation
Owners (per frontmatter and apps/fennec/CLAUDE.md):
- Primary: @law (service owner).
- Auth issues: brain on-call (Clerk verification lives in brain).
- Strapi backend issues: TODO(@law): name the backend owner. apps/fennec/backend/ is referenced from docker-compose.yml, but the directory ships no committed sources or runbook.
- Railway / infra: platform on-call rotation.
For a paging incident, drop a one-liner in #ops with:
- Scope (one user / many / total)
- Last good deploy ID (Railway)
- Error class (auth / network / 5xx / blank screen)
- Whether you tried a rollback
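The four fields above fit a fixed one-line format. The sketch below shows one possible shape. The function name, field names, and separator are hypothetical, and nothing here posts to Slack; it only builds the string.

```typescript
// Illustrative sketch only: the message layout is an assumption, not a team standard.
interface OpsNote {
  scope: "one user" | "many" | "total";
  lastGoodDeployId: string; // Railway deploy ID of the last known-good deploy
  errorClass: "auth" | "network" | "5xx" | "blank screen";
  rollbackTried: boolean;
}

function formatOpsNote(n: OpsNote): string {
  return [
    "fennec incident",
    `scope: ${n.scope}`,
    `last good deploy: ${n.lastGoodDeployId}`,
    `error: ${n.errorClass}`,
    `rollback tried: ${n.rollbackTried ? "yes" : "no"}`,
  ].join(" | ");
}
```

Restricting scope and error class to the values listed above keeps the notes comparable across incidents.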
Things that look like alerts but aren’t
- pnpm lint failing in CI for an open PR: that is a contributor problem, not an incident. Comment on the PR; do not page.
- Knip warnings: dead-code detection only. Not runtime failures.
- TypeScript errors in dev: only matter when CI fails on release.
After-incident
- File a brief writeup in the postmortem doc (TODO(@law): pin the location, likely under docs/operations/runbooks/ once standardised).
- If the incident exposed a new failure mode, add an entry to fennec-errors.md and (if operationally distinct) a row above.
- If any TODO(@law) on this page was resolved during the incident, fix it.