Strip On-Call

Pager guide for the strip Go SSR service. Strip is an internal admin frontend — user impact is limited to operators and creator-success staff, but a strip outage blocks every internal workflow that depends on sirloin admin RPCs.

Severity matrix

Severity	Trigger	Response time
SEV-1	Strip unreachable in production for >5 min, or `SECURITY WARNING: Authentication bypassed via UUID` observed in production logs.	Page immediately.
SEV-2	All protected routes returning 5xx (sirloin gRPC down) for >5 min; or auth loop affecting all operators.	Page within 15 min.
SEV-3	One feature broken (e.g. Ask Strip disabled, shop-VI links missing, single 5xx route).	Next business hour.
SEV-4	Cosmetic / template render glitch, single-user role issue.	Next business day.

Top alerts

No apps/strip alert rules exist in the repo (verified: no strip references under any *.yml/*.yaml alert/monitor/axiom/grafana config). Until dedicated alerting is wired, monitor the following in logs / uptime checks:

Alert	Source	What it usually means
`/health` non-200	external uptime check	Process crash; container restart loop.
5xx rate > 5% over 5 min on protected routes	edge / Railway metrics	Sirloin gRPC down → see runbook §1.
401/302 ratio spiking on `/login`	edge	Clerk misconfig or attack — see runbook §2.
`transport: connection error` log volume	strip stdout	gRPC connection broken.
`SECURITY WARNING: Authentication bypassed via UUID`	strip stdout	Page SEV-1. Bypass UUID exercised; rotate.
`SECURITY WARNING: Skipping authentication - development mode without Clerk` outside dev	strip stdout	Stage misconfig; treat as SEV-1.
429 spike from a single IP	strip stdout	Possible brute-force on `/login`.
Recover middleware panic logs	strip stdout	Templ/handler regression — open incident.
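
Until alert rules exist, the log-based rows above can be approximated with a small line classifier. The following is a minimal sketch, not something strip ships: the function name, the returned labels, and the stage value `development` are assumptions, while the matched substrings are copied from the table.

```go
package main

import (
	"fmt"
	"strings"
)

// Substrings copied from the alert table.
const (
	bypassWarning  = "SECURITY WARNING: Authentication bypassed via UUID"
	devModeWarning = "SECURITY WARNING: Skipping authentication - development mode without Clerk"
	grpcConnError  = "transport: connection error"
)

// classifyLogLine returns a triage label for one line of strip stdout.
// The labels and the "development" stage name are hypothetical.
// An empty string means the line needs no action.
func classifyLogLine(line, stage string) string {
	isDev := stage == "development"
	switch {
	case strings.Contains(line, bypassWarning) && !isDev:
		return "SEV-1"
	case strings.Contains(line, devModeWarning) && !isDev:
		return "SEV-1"
	case strings.Contains(line, grpcConnError):
		return "check-grpc"
	default:
		return ""
	}
}

func main() {
	samples := []string{
		"SECURITY WARNING: Authentication bypassed via UUID",
		"rpc error: transport: connection error",
		"GET /health 200",
	}
	for _, s := range samples {
		fmt.Printf("%q -> %q\n", s, classifyLogLine(s, "production"))
	}
}
```

Matching on substrings rather than whole lines keeps the sketch independent of whatever timestamp or level prefix the logger adds.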

First 5 minutes

flowchart TD
  A[Page fires] --> B{Strip /health 200?}
  B -- no --> C[Restart container, check Railway events]
  B -- yes --> D{5xx on /dashboard?}
  D -- yes --> E[Test sirloin gRPC reachability]
  D -- no --> F{Auth loop?}
  F -- yes --> G[Verify Clerk env triplet]
  F -- no --> H[Inspect logs for SECURITY WARNING]
  E --> I[Page sirloin oncall if down]
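
The same decision tree can be written as a pure function, which may help if the first-five-minutes checks are ever scripted. This is an illustrative sketch only: the `Observations` type and `nextStep` function are hypothetical names, and the returned strings mirror the flowchart node labels.

```go
package main

import "fmt"

// Observations holds the three checks from the flowchart, in order.
type Observations struct {
	HealthOK     bool // strip /health returned 200
	Dashboard5xx bool // /dashboard is returning 5xx
	AuthLoop     bool // an auth loop is being reported
}

// nextStep walks the flowchart top to bottom and returns the first action that applies.
func nextStep(o Observations) string {
	switch {
	case !o.HealthOK:
		return "Restart container, check Railway events"
	case o.Dashboard5xx:
		return "Test sirloin gRPC reachability; page sirloin on-call if down"
	case o.AuthLoop:
		return "Verify Clerk env triplet"
	default:
		return "Inspect logs for SECURITY WARNING"
	}
}

func main() {
	fmt.Println(nextStep(Observations{HealthOK: true, Dashboard5xx: true}))
}
```

The order of the cases matters: a failed health check short-circuits everything else, exactly as in the chart.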

Escalation

Primary: strip code owner (@zen).
Secondary: sirloin on-call when failures are gRPC/upstream.
Security: any auth-bypass log line, any unexpected SECURITY WARNING outside dev → security on-call. See /standards/security-model/.
Platform: Railway / network / DNS issues → platform on-call.

Page in Slack #oncall with: stage, last deploy SHA, affected routes, sample request IDs, and any SECURITY WARNING lines verbatim.

Standing instructions

Never disable recover.New to “see the panic” in prod — pull the stack trace from logs (EnableStackTrace: true is already on).
Never raise globalRateLimitMax to silence a 429 alert. Investigate the source.
Never set STRIP_AUTH_BYPASS_UUID outside development. If you find one set in staging or prod, treat it as a credential leak: rotate, audit usage, file a security incident.
Strip is stateless, so when in doubt, restart the container and roll back to the last known-good deploy.
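
The bypass rule above lends itself to a fail-closed startup check. The following is a sketch of what such a guard might look like; this guide does not say strip implements one. `STRIP_STAGE` and the stage value `development` are hypothetical names; only `STRIP_AUTH_BYPASS_UUID` comes from this guide.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errBypassOutsideDev = errors.New("STRIP_AUTH_BYPASS_UUID is set outside development")

// checkBypass rejects any non-empty bypass UUID unless the stage is exactly "development".
// An unset or unknown stage is treated as non-development, so the check fails closed.
func checkBypass(stage, bypassUUID string) error {
	if bypassUUID != "" && stage != "development" {
		return errBypassOutsideDev
	}
	return nil
}

func main() {
	// STRIP_STAGE is an assumed variable name used for illustration.
	stage := os.Getenv("STRIP_STAGE")
	if err := checkBypass(stage, os.Getenv("STRIP_AUTH_BYPASS_UUID")); err != nil {
		fmt.Fprintln(os.Stderr, "refusing to start:", err)
		os.Exit(1)
	}
	fmt.Println("bypass check passed")
}
```

Comparing against the single allowed stage, rather than listing forbidden ones, means a new or misspelled stage name cannot silently permit the bypass.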

Useful commands during an incident

# tail strip logs (Railway / kubectl)
railway logs -s strip

# from any host: hit strip health
curl -i https://strip.<domain>/health

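# verify gRPC reachability from a strip pod
# nc needs host and port as separate arguments; this assumes the variable holds host:port
nc -vz "${STRIP_SIRLOIN_GRPC_HOST%:*}" "${STRIP_SIRLOIN_GRPC_HOST##*:}"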

Post-incident

File a 5-line summary: trigger, blast radius, root cause, fix, follow-ups.
If the incident touched auth bypass or Clerk misconfig, link the security model standard and add a regression test under apps/strip/internal/app/middleware/auth_test.go.
Update /services/strip-errors/ if a new symptom surfaced that operators will see again.