Chuck On-Call
Chuck On-Call
This page is the on-call cheat sheet for Chuck. It complements chuck-runbook (procedures) by setting expectations on tier, alerts, and escalation.
Service Tier
Tier: secondary. Chuck is a marketing-content CMS. It serves brisket public landing pages, FAQ sections, blog/academy entries, and policy pages.
| Dimension | Chuck | Compare to sirloin / brain |
|---|---|---|
| User-facing impact of outage | Marketing pages render with stale or empty content. | sirloin / brain outage breaks signup, billing, generation. |
| Revenue impact | Indirect (conversion of new visitors). | Direct. |
| Page time-to-recover (informal target) | < 4h business hours. | Minutes. |
| Wake on-call at 03:00? | No unless the marketing page is fully broken AND the on-call agrees the impact warrants it. | Yes for sirloin / brain. |
There is no formal SLO doc for chuck in this repo (find docs/src/content/docs -name '*chuck*slo*' returns no matches). The
billing SLO doc sets the model
to follow — TODO(@marty): decide whether chuck warrants its own SLO doc.
Alerts and Signals
There is no alert routing checked into this repo for chuck
(verified by grep -rn 'chuck\|strapi' .github/workflows,
grep -nE 'chuck' docker-compose*.yml, no apps/chuck/railway.json).
Recommended signals to wire (TODO(@marty): confirm which exist in the
runtime — none of the items below are sourced from a checked-in alert
config):
Top alerts to set up
- chuck unreachable —
https://<chuck-host>/_healthreturns non-2xx for > 5 minutes. Source: synthetic check. - brisket → chuck error rate — brisket logs
STRAPI_*fetch failures spike. Source: brisket OTel trace (operations/observability). - chuck 5xx rate — Strapi process emits 500-class responses
(DB / R2 / plugin failures). TODO(@marty): document where chuck
logs ship —
apps/chuck/meat/config/has no log-shipper config and no OTel exporter. - R2 upload failures — provider returns non-2xx; surfaces as admin-side error. Source: R2 provider metrics, if exported.
- Postgres unreachable — DB connection acquire times out.
- Boot failures — process exits during startup. Source: deploy platform logs.
Symptom → triage map
| Pager / report | First check | Doc |
|---|---|---|
| ”Marketing FAQ is empty” | Hit GET /api/checkout-faq-section?populate=* with brisket’s STRAPI_TOKEN. Is the entry published? | chuck-api |
| ”brisket 500s on landing” | Did STRAPI_URL or STRAPI_TOKEN change in brisket env? Did chuck rotate API_TOKEN_SALT? | chuck-env |
| ”Editor can’t log in to /admin” | ADMIN_JWT_SECRET rotated, or APP_KEYS rotation invalidated session cookies | chuck-env, chuck-runbook |
| ”Image upload fails” | R2 credentials, bucket, public URL | chuck-errors |
| ”Process won’t start” | Env vars missing, Node version, Postgres reachable | chuck-errors |
| ”Editor reports save failed” | Field-level validation; ValidationError payload | chuck-errors |
Escalation
There is no @oncall-chuck rotation defined; escalation is best-effort.
- Page primary on-call. Same rotation as the rest of beef. They judge whether to wake other people.
- Loop in service owner. Chuck owner is
@martyper chuck.md frontmatter. - Loop in brisket owner if the visible failure is on brisket landing pages.
- Loop in marketing / content team for editor-side issues
(publish-state, missing copy). TODO(@marty): record the marketing
point of contact — no
MARKETING.mdor marketing handle is referenced in this repo.
If the failure mode is ambiguous (e.g. brisket renders empty hero —
chuck bug or brisket bug?), reproduce the failing fetch with curl
and STRAPI_TOKEN against chuck. If chuck returns the expected
payload, the issue is on brisket.
What is not a chuck on-call concern
- Sign-up / login — handled by Clerk via brisket / fennec / brain.
- Checkout / billing — handled by sirloin (see billing runbook).
- Media generation — handled by brain / round.
- Email delivery for transactional flows — that is sirloin’s
Nodemailer + shank templates, not chuck’s
emailplugin (which exists but is largely idle in this deployment).
Comms During Incidents
- A chuck-only outage typically does not warrant a status-page notice. Use internal Slack instead.
- A multi-service outage where chuck is implicated should follow the
cross-service incident playbook. TODO(@marty): link the playbook —
no cross-service incident doc is currently checked into
docs/src/content/docs/operations/. - Do not paste API tokens, salts, or
APP_KEYSinto incident channels. See security-model.
Anchor Docs
- API surface: chuck-api
- Errors: chuck-errors
- Runbook: chuck-runbook
- Observability: operations/observability