Strip On-Call
Pager guide for the strip Go SSR service. Strip is an internal admin frontend — user impact is limited to operators and creator-success staff, but a strip outage blocks every internal workflow that depends on sirloin admin RPCs.
Severity matrix
| Severity | Trigger | Response time |
|---|---|---|
| SEV-1 | Strip unreachable in production for >5 min, or the log line "SECURITY WARNING: Authentication bypassed via UUID" observed in production logs. | Page immediately. |
| SEV-2 | All protected routes returning 5xx (sirloin gRPC down) for >5 min; or auth loop affecting all operators. | Page within 15 min. |
| SEV-3 | One feature broken (e.g. Ask Strip disabled, shop-VI links missing, single 5xx route). | Next business hour. |
| SEV-4 | Cosmetic / template render glitch, single-user role issue. | Next business day. |
Top alerts
No apps/strip alert rules exist in the repo (verified: no strip references under any *.yml/*.yaml alert/monitor/axiom/grafana config). Until dedicated alerting is wired, monitor the following in logs / uptime checks:
| Alert | Source | What it usually means |
|---|---|---|
| /health non-200 | external uptime check | Process crash; container restart loop. |
| 5xx rate > 5% over 5 min on protected routes | edge / Railway metrics | Sirloin gRPC down → see runbook §1. |
| 401/302 ratio spiking on /login | edge | Clerk misconfig or attack; see runbook §2. |
| transport: connection error log volume | strip stdout | gRPC connection broken. |
| SECURITY WARNING: Authentication bypassed via UUID | strip stdout | Page SEV-1. Bypass UUID exercised; rotate. |
| SECURITY WARNING: Skipping authentication - development mode without Clerk outside dev | strip stdout | Stage misconfig; treat as SEV-1. |
| 429 spike from a single IP | strip stdout | Possible brute-force on /login. |
| Recover middleware panic logs | strip stdout | Templ/handler regression — open incident. |
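
Until dedicated alert rules exist, a saved excerpt of strip stdout can be checked by hand for the log strings in the table above. The following is a minimal sketch, not part of the strip repo: the function name `scan_strip_logs` and its tab-separated output are illustrative assumptions, and only the three patterns whose text appears in this guide are matched.

```shell
# Count alert-worthy lines in strip stdout read from standard input.
# Patterns are the log strings quoted in the alert table; the function
# name and output format are hypothetical.
scan_strip_logs() {
  local input pattern count
  input="$(cat)"
  for pattern in \
    "SECURITY WARNING: Authentication bypassed via UUID" \
    "SECURITY WARNING: Skipping authentication" \
    "transport: connection error"; do
    # -c counts matching lines, -F matches the pattern literally.
    # grep exits non-zero on zero matches but still prints 0.
    count="$(printf '%s\n' "$input" | grep -cF -- "$pattern" || true)"
    printf '%s\t%s\n' "$count" "$pattern"
  done
}
```

Usage, assuming the logs were first saved to a file: `scan_strip_logs < strip.log`. Any non-zero count on either SECURITY WARNING line should be handled as described in the table.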
First 5 minutes

    flowchart TD
        A[Page fires] --> B{Strip /health 200?}
        B -- no --> C[Restart container, check Railway events]
        B -- yes --> D{5xx on /dashboard?}
        D -- yes --> E[Test sirloin gRPC reachability]
        D -- no --> F{Auth loop?}
        F -- yes --> G[Verify Clerk env triplet]
        F -- no --> H[Inspect logs for SECURITY WARNING]
        E --> I[Page sirloin oncall if down]

Escalation
- Primary: strip code owner (@zen).
- Secondary: sirloin on-call when failures are gRPC/upstream.
- Security: any auth-bypass log line, any unexpected SECURITY WARNING outside dev → security on-call. See /standards/security-model/.
- Platform: Railway / network / DNS issues → platform on-call.
Page in Slack #oncall with: stage, last deploy SHA, affected routes, sample request IDs, and any SECURITY WARNING lines verbatim.
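
The page contents listed above can be assembled consistently with a small helper. This is a sketch under stated assumptions: the function name `format_page`, the argument order, and the field labels are hypothetical, and the SECURITY WARNING lines are taken verbatim from whatever log excerpt is supplied on standard input.

```shell
# Build the text of an #oncall page from the fields listed above.
# Arguments: stage, last deploy SHA, affected routes, sample request IDs.
# Log lines are read from standard input; only SECURITY WARNING lines
# are passed through, unmodified.
format_page() {
  local stage="$1" sha="$2" routes="$3" request_ids="$4"
  printf 'Stage: %s\n' "$stage"
  printf 'Last deploy SHA: %s\n' "$sha"
  printf 'Affected routes: %s\n' "$routes"
  printf 'Sample request IDs: %s\n' "$request_ids"
  printf 'SECURITY WARNING lines:\n'
  # grep exits non-zero when nothing matches, in which case print "none".
  grep -F -- 'SECURITY WARNING' || printf 'none\n'
}
```

Usage, with placeholder values: `format_page production abc1234 '/dashboard' 'req-1,req-2' < strip.log`, then paste the output into Slack.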
Standing instructions
- Never disable recover.New to “see the panic” in prod; pull the stack trace from logs (EnableStackTrace: true is already on).
- Never raise globalRateLimitMax to silence a 429 alert. Investigate the source.
- Never set STRIP_AUTH_BYPASS_UUID outside development. If you find one set in staging or prod, treat it as a credential leak: rotate, audit usage, file a security incident.
- Strip is stateless: when in doubt, restart and roll back.
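
The bypass-UUID rule above is easy to check mechanically from a shell that has the service's environment loaded. The sketch below is illustrative only: `check_bypass_uuid` is a hypothetical helper, and the stage is passed as an argument because this guide does not name the variable strip uses for its stage.

```shell
# Return 0 when the bypass setting is safe for the given stage, and 1 when
# STRIP_AUTH_BYPASS_UUID is non-empty anywhere other than "development".
# The stage argument is an assumption; strip's own stage variable is not
# named in this guide.
check_bypass_uuid() {
  local stage="$1"
  if [ -n "${STRIP_AUTH_BYPASS_UUID:-}" ] && [ "$stage" != "development" ]; then
    echo "STRIP_AUTH_BYPASS_UUID is set in stage '$stage': treat as a credential leak" >&2
    return 1
  fi
  return 0
}
```

Usage: `check_bypass_uuid staging || echo "rotate, audit usage, file a security incident"`. The function never prints the UUID itself, so its output is safe to paste into an incident channel.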
Useful commands during an incident

    # tail strip logs (Railway / kubectl)
    railway logs -s strip

    # from any host: hit strip health
    curl -i https://strip.<domain>/health

    # verify gRPC reachability from a strip pod
    # (nc expects host and port as separate arguments; split the value if it is host:port)
    nc -vz "$STRIP_SIRLOIN_GRPC_HOST"

Post-incident
- File a 5-line summary: trigger, blast radius, root cause, fix, follow-ups.
- If the incident touched auth bypass or Clerk misconfig, link the security model standard and add a regression test under apps/strip/internal/app/middleware/auth_test.go.
- Update /services/strip-errors/ if a new symptom surfaced that operators will see again.