Strip On-Call

Pager guide for the strip Go SSR service. Strip is an internal admin frontend — user impact is limited to operators and creator-success staff, but a strip outage blocks every internal workflow that depends on sirloin admin RPCs.

Severity matrix

| Severity | Trigger | Response time |
| --- | --- | --- |
| SEV-1 | Strip unreachable in production for >5 min, or SECURITY WARNING: Authentication bypassed via UUID observed in production logs. | Page immediately. |
| SEV-2 | All protected routes returning 5xx (sirloin gRPC down) for >5 min; or auth loop affecting all operators. | Page within 15 min. |
| SEV-3 | One feature broken (e.g. Ask Strip disabled, shop-VI links missing, single 5xx route). | Next business hour. |
| SEV-4 | Cosmetic / template render glitch, single-user role issue. | Next business day. |

Top alerts

No apps/strip alert rules exist in the repo (verified: no strip references under any *.yml/*.yaml alert/monitor/axiom/grafana config). Until dedicated alerting is wired, monitor the following in logs / uptime checks:

| Alert | Source | What it usually means |
| --- | --- | --- |
| /health non-200 | external uptime check | Process crash; container restart loop. |
| 5xx rate > 5% over 5 min on protected routes | edge / Railway metrics | Sirloin gRPC down → see runbook §1. |
| 401/302 ratio spiking on /login | edge | Clerk misconfig or attack; see runbook §2. |
| transport: connection error log volume | strip stdout | gRPC connection broken. |
| SECURITY WARNING: Authentication bypassed via UUID | strip stdout | Page SEV-1. Bypass UUID exercised; rotate. |
| SECURITY WARNING: Skipping authentication - development mode without Clerk outside dev | strip stdout | Stage misconfig; treat as SEV-1. |
| 429 spike from a single IP | strip stdout | Possible brute-force on /login. |
| Recover middleware panic logs | strip stdout | Templ/handler regression; open an incident. |

First 5 minutes

flowchart TD
A[Page fires] --> B{Strip /health 200?}
B -- no --> C[Restart container, check Railway events]
B -- yes --> D{5xx on /dashboard?}
D -- yes --> E[Test sirloin gRPC reachability]
D -- no --> F{Auth loop?}
F -- yes --> G[Verify Clerk env triplet]
F -- no --> H[Inspect logs for SECURITY WARNING]
E --> I[Page sirloin oncall if down]
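
The flowchart above is a fixed-order decision tree, so it can also be written as a single function. This is an illustrative sketch only: the `Observations` type and `nextStep` function are hypothetical names, and the returned strings are the node labels from the chart.

```go
package main

import "fmt"

// Observations holds the answers to the three questions in the flowchart.
type Observations struct {
	HealthOK     bool // Strip /health returns 200
	Dashboard5xx bool // 5xx on /dashboard
	AuthLoop     bool // operators are stuck in an auth loop
}

// nextStep walks the chart top to bottom and returns the first action that applies.
func nextStep(o Observations) string {
	switch {
	case !o.HealthOK:
		return "Restart container, check Railway events"
	case o.Dashboard5xx:
		return "Test sirloin gRPC reachability; page sirloin oncall if down"
	case o.AuthLoop:
		return "Verify Clerk env triplet"
	default:
		return "Inspect logs for SECURITY WARNING"
	}
}

func main() {
	fmt.Println(nextStep(Observations{HealthOK: true, Dashboard5xx: true}))
	// prints "Test sirloin gRPC reachability; page sirloin oncall if down"
}
```

The order of the cases matters: a failing /health check short-circuits everything else, which mirrors the chart's first branch.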

Escalation

  1. Primary: strip code owner (@zen).
  2. Secondary: sirloin on-call when failures are gRPC/upstream.
  3. Security: any auth-bypass log line, any unexpected SECURITY WARNING outside dev → security on-call. See /standards/security-model/.
  4. Platform: Railway / network / DNS issues → platform on-call.

Page in Slack #oncall with: stage, last deploy SHA, affected routes, sample request IDs, and any SECURITY WARNING lines verbatim.

Standing instructions

  • Never disable recover.New to “see the panic” in prod — pull the stack trace from logs (EnableStackTrace: true is already on).
  • Never raise globalRateLimitMax to silence a 429 alert. Investigate the source.
  • Never set STRIP_AUTH_BYPASS_UUID outside development. If you find one set in staging or prod, treat it as a credential leak: rotate, audit usage, file a security incident.
  • Strip is stateless: when in doubt, restart and roll back.

Useful commands during an incident

# tail strip logs (Railway / kubectl)
railway logs -s strip
# from any host: hit strip health
curl -i https://strip.<domain>/health
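# verify gRPC reachability from a strip pod
# (nc needs host and port as separate arguments; this assumes
#  STRIP_SIRLOIN_GRPC_HOST holds a host:port pair)
nc -vz "${STRIP_SIRLOIN_GRPC_HOST%:*}" "${STRIP_SIRLOIN_GRPC_HOST##*:}"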

Post-incident

  • File a 5-line summary: trigger, blast radius, root cause, fix, follow-ups.
  • If the incident touched auth bypass or Clerk misconfig, link the security model standard and add a regression test in apps/strip/internal/app/middleware/auth_test.go.
  • Update /services/strip-errors/ if a new symptom surfaced that operators will see again.