Skip to content

Chuck On-Call

Chuck On-Call

This page is the on-call cheat sheet for Chuck. It complements chuck-runbook (procedures) by setting expectations on tier, alerts, and escalation.

Service Tier

Tier: secondary. Chuck is a marketing-content CMS. It serves brisket public landing pages, FAQ sections, blog/academy entries, and policy pages.

DimensionChuckCompare to sirloin / brain
User-facing impact of outageMarketing pages render with stale or empty content.sirloin / brain outage breaks signup, billing, generation.
Revenue impactIndirect (conversion of new visitors).Direct.
Page time-to-recover (informal target)< 4h business hours.Minutes.
Wake on-call at 03:00?No unless the marketing page is fully broken AND the on-call agrees the impact warrants it.Yes for sirloin / brain.

There is no formal SLO doc for chuck in this repo (find docs/src/content/docs -name '*chuck*slo*' returns no matches). The billing SLO doc sets the model to follow — TODO(@marty): decide whether chuck warrants its own SLO doc.

Alerts and Signals

There is no alert routing checked into this repo for chuck (verified by grep -rn 'chuck\|strapi' .github/workflows, grep -nE 'chuck' docker-compose*.yml, no apps/chuck/railway.json). Recommended signals to wire (TODO(@marty): confirm which exist in the runtime — none of the items below are sourced from a checked-in alert config):

Top alerts to set up

  1. chuck unreachablehttps://<chuck-host>/_health returns non-2xx for > 5 minutes. Source: synthetic check.
  2. brisket → chuck error rate — brisket logs STRAPI_* fetch failures spike. Source: brisket OTel trace (operations/observability).
  3. chuck 5xx rate — Strapi process emits 500-class responses (DB / R2 / plugin failures). TODO(@marty): document where chuck logs ship — apps/chuck/meat/config/ has no log-shipper config and no OTel exporter.
  4. R2 upload failures — provider returns non-2xx; surfaces as admin-side error. Source: R2 provider metrics, if exported.
  5. Postgres unreachable — DB connection acquire times out.
  6. Boot failures — process exits during startup. Source: deploy platform logs.

Symptom → triage map

Pager / reportFirst checkDoc
”Marketing FAQ is empty”Hit GET /api/checkout-faq-section?populate=* with brisket’s STRAPI_TOKEN. Is the entry published?chuck-api
”brisket 500s on landing”Did STRAPI_URL or STRAPI_TOKEN change in brisket env? Did chuck rotate API_TOKEN_SALT?chuck-env
”Editor can’t log in to /admin”ADMIN_JWT_SECRET rotated, or APP_KEYS rotation invalidated session cookieschuck-env, chuck-runbook
”Image upload fails”R2 credentials, bucket, public URLchuck-errors
”Process won’t start”Env vars missing, Node version, Postgres reachablechuck-errors
”Editor reports save failed”Field-level validation; ValidationError payloadchuck-errors

Escalation

There is no @oncall-chuck rotation defined; escalation is best-effort.

  1. Page primary on-call. Same rotation as the rest of beef. They judge whether to wake other people.
  2. Loop in service owner. Chuck owner is @marty per chuck.md frontmatter.
  3. Loop in brisket owner if the visible failure is on brisket landing pages.
  4. Loop in marketing / content team for editor-side issues (publish-state, missing copy). TODO(@marty): record the marketing point of contact — no MARKETING.md or marketing handle is referenced in this repo.

If the failure mode is ambiguous (e.g. brisket renders empty hero — chuck bug or brisket bug?), reproduce the failing fetch with curl and STRAPI_TOKEN against chuck. If chuck returns the expected payload, the issue is on brisket.

What is not a chuck on-call concern

  • Sign-up / login — handled by Clerk via brisket / fennec / brain.
  • Checkout / billing — handled by sirloin (see billing runbook).
  • Media generation — handled by brain / round.
  • Email delivery for transactional flows — that is sirloin’s Nodemailer + shank templates, not chuck’s email plugin (which exists but is largely idle in this deployment).

Comms During Incidents

  • A chuck-only outage typically does not warrant a status-page notice. Use internal Slack instead.
  • A multi-service outage where chuck is implicated should follow the cross-service incident playbook. TODO(@marty): link the playbook — no cross-service incident doc is currently checked into docs/src/content/docs/operations/.
  • Do not paste API tokens, salts, or APP_KEYS into incident channels. See security-model.

Anchor Docs