Strip Runbook
Operational playbook for the strip Go SSR service. Optimised for sirloin connectivity loss, session/auth bugs, deploys, and rollback.
Service shape
- Binary: apps/strip/bin/strip, built from cmd/app/main.go.
- Container: apps/strip/Dockerfile (Chainguard wolfi-base + chainguard/go).
- Listens on STRIP_PORT (default :8080). Single Fiber process.
- One outbound dependency that matters: sirloin gRPC at STRIP_SIRLOIN_GRPC_HOST (gRPC keepalive 30s/10s).
- Auth: Clerk session cookie (__session).
- No dedicated .github/workflows/strip.yml exists. CI/CD is wired through Railway’s GitHub integration using apps/strip/railway.json (builder: DOCKERFILE, watchPatterns: ["/apps/strip/**"]). Pushes touching apps/strip/** on the connected branch trigger a Railway image build + rollout.
flowchart LR
  subgraph deploy
    GHA[GitHub] -->|Railway GitHub integration<br/>watchPatterns: apps/strip/**| Railway[(Railway)]
    Railway --> Strip[strip container]
  end
  Strip -->|gRPC| Sirloin[(sirloin)]
  Strip -->|HTTPS| Clerk[(Clerk)]
  Strip --> OpenRouter[(OpenRouter)]

Deploy
- Merge to main. Railway’s GitHub integration detects changes under apps/strip/** (apps/strip/railway.json watchPatterns) and builds the image from apps/strip/Dockerfile.
- Railway picks up the new image and rolls one instance at a time. /health (open route, 204) is the readiness signal.
- Boot logs to confirm:
  - config loaded successfully
  - database migrated is not emitted by strip; that line belongs to sirloin.
  - No SECURITY WARNING lines.
- Smoke: hit /login (200), /health (204), and /dashboard while authenticated (200).
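The smoke step above can be scripted. The following is a minimal sketch, not part of the strip repo: the STRIP_BASE_URL and STRIP_SMOKE_SESSION environment variables are hypothetical names, and it assumes a tester's __session value is enough to authenticate the /dashboard request.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// probe pairs a route with the status the runbook expects from it.
type probe struct {
	path string
	want int
	auth bool // send the session cookie with this request
}

// Expected statuses are taken from the smoke step in this runbook.
var probes = []probe{
	{"/login", http.StatusOK, false},
	{"/health", http.StatusNoContent, false},
	{"/dashboard", http.StatusOK, true},
}

// smoke runs every probe through get and returns one message per mismatch.
// get is injected so the check can be exercised without a live deployment.
func smoke(get func(path string, auth bool) (int, error)) []string {
	var failures []string
	for _, p := range probes {
		got, err := get(p.path, p.auth)
		switch {
		case err != nil:
			failures = append(failures, fmt.Sprintf("%s: %v", p.path, err))
		case got != p.want:
			failures = append(failures, fmt.Sprintf("%s: got %d, want %d", p.path, got, p.want))
		}
	}
	return failures
}

func main() {
	base := os.Getenv("STRIP_BASE_URL")         // hypothetical variable name
	session := os.Getenv("STRIP_SMOKE_SESSION") // hypothetical variable name

	client := &http.Client{
		Timeout: 5 * time.Second,
		// Do not follow redirects: a redirect to /login must surface as a failure.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	get := func(path string, auth bool) (int, error) {
		req, err := http.NewRequest(http.MethodGet, base+path, nil)
		if err != nil {
			return 0, err
		}
		if auth {
			req.AddCookie(&http.Cookie{Name: "__session", Value: session})
		}
		resp, err := client.Do(req)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		return resp.StatusCode, nil
	}

	if failures := smoke(get); len(failures) > 0 {
		for _, f := range failures {
			fmt.Println("FAIL", f)
		}
		os.Exit(1)
	}
	fmt.Println("smoke ok")
}
```

Redirects are deliberately not followed, so an unauthenticated bounce from /dashboard to /login is reported as a failure rather than counted as a 200.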
Rollback
- In Railway, redeploy the previous image SHA.
- Strip is stateless — no schema migrations, no queues — rollback is safe.
- Verify session cookies still resolve. If Clerk env changed in the bad release, you may need to clear __session cookies for testers.
Common incidents
1. Sirloin connectivity loss
Symptoms. All protected pages 500 with “Failed to fetch …” strings. Logs show transport: connection error or context-deadline-exceeded from gRPC client.
Diagnose.
- Confirm sirloin health (its own runbook). Strip cannot recover without sirloin.
- From a strip pod: resolve STRIP_SIRLOIN_GRPC_HOST and ensure TCP reachability.
- Check requestid-tagged log lines for repeated keepalive ping failures.
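Where no shell tooling is available in the container, the resolve-and-connect check can be done with a small Go program. This is a sketch under one assumption: that STRIP_SIRLOIN_GRPC_HOST holds a host:port pair. The checkReachable helper is hypothetical, not part of strip. A successful TCP connect only proves the network path; it says nothing about gRPC or TLS health.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// checkReachable resolves the host part of addr and attempts a TCP connect
// to each resolved IP. It returns nil as soon as one connect succeeds.
func checkReachable(addr string, timeoutMs int) error {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("parse %q: %w", addr, err)
	}
	timeout := time.Duration(timeoutMs) * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ips, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return fmt.Errorf("resolve %q: %w", host, err)
	}
	var lastErr error
	for _, ip := range ips {
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(ip, port), timeout)
		if err != nil {
			lastErr = err
			continue
		}
		conn.Close()
		return nil
	}
	return fmt.Errorf("no resolved address accepted a TCP connection: %w", lastErr)
}

func main() {
	// Assumption: the variable holds host:port.
	addr := os.Getenv("STRIP_SIRLOIN_GRPC_HOST")
	if err := checkReachable(addr, 3000); err != nil {
		fmt.Println("unreachable:", err)
		os.Exit(1)
	}
	fmt.Println("reachable:", addr)
}
```

A resolve error points at DNS drift, while a connect error after a successful resolve points at routing or at sirloin itself.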
Mitigate.
- If sirloin is the root cause, page sirloin oncall; strip needs no action.
- If DNS/network drift only affects strip, restart strip — gRPC client is created once at boot, so a stale endpoint won’t self-heal until the process restarts.
- Last resort: temporarily switch STRIP_SIRLOIN_GRPC_HOST to a healthy region.
2. Session/auth bugs
Symptoms. Users redirected to /login in a loop; HTMX panels return 403 permission_denied; “invalid session” toasts.
Diagnose.
- Inspect browser cookies. __session should be present, Secure outside dev, SameSite=Lax, 7-day expiry.
- Verify Clerk env triplet: STRIP_CLERK_PUBLISHABLE_KEY, STRIP_CLERK_DOMAIN, STRIP_CLERK_SECRET_KEY all match the same Clerk environment.
- Tail strip logs for SECURITY WARNING: bypass paths in non-dev mean the dev fallback or UUID bypass is being triggered.
- For 403s, confirm the user’s role grants the required permission (internal/app/authorization/authorization.go).
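The cookie attributes listed above can be checked mechanically against a Set-Cookie header copied from the browser's network tab. The sketch below is illustrative only: auditSessionCookie is a hypothetical helper, and it reads "7-day expiry" as "lifetime is positive and no longer than seven days", which is an assumption about how strip sets the cookie.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const sevenDays = 7 * 24 * time.Hour

// auditSessionCookie parses one Set-Cookie header value and returns the
// ways it deviates from the attributes listed in this runbook.
func auditSessionCookie(setCookie string, dev bool) []string {
	// Reuse net/http's Set-Cookie parser via a throwaway response.
	resp := http.Response{Header: http.Header{"Set-Cookie": {setCookie}}}
	var c *http.Cookie
	for _, candidate := range resp.Cookies() {
		if candidate.Name == "__session" {
			c = candidate
		}
	}
	if c == nil {
		return []string{"__session cookie missing"}
	}

	var problems []string
	if !dev && !c.Secure {
		problems = append(problems, "Secure flag missing outside dev")
	}
	if c.SameSite != http.SameSiteLaxMode {
		problems = append(problems, "SameSite is not Lax")
	}
	// Prefer Max-Age; fall back to Expires when Max-Age is absent.
	lifetime := time.Duration(c.MaxAge) * time.Second
	if c.MaxAge == 0 && !c.Expires.IsZero() {
		lifetime = time.Until(c.Expires)
	}
	if lifetime <= 0 || lifetime > sevenDays {
		problems = append(problems, fmt.Sprintf("unexpected lifetime %s", lifetime))
	}
	return problems
}

func main() {
	// Sample header with the Secure attribute deliberately left out.
	sample := "__session=abc; Path=/; Max-Age=604800; HttpOnly; SameSite=Lax"
	for _, p := range auditSessionCookie(sample, false) {
		fmt.Println("finding:", p)
	}
}
```

An empty result means the header matches the expected attributes; it does not prove the session itself is valid on the Clerk side.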
Mitigate.
- Wrong Clerk env → re-deploy with corrected secrets.
- Missing role → fix in sirloin role mgmt, not in strip.
- Cookie domain mismatch → ensure STRIP_CLERK_DOMAIN matches the host strip is served from.
3. Bypass UUID misuse
If STRIP_AUTH_BYPASS_UUID is set in a non-dev stage, rotate immediately and clear the env var. Audit logs for Authentication bypassed via UUID lines, capture source IPs and userAgent, and treat as a security incident per /standards/security-model/.
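For the audit step, a small filter over exported logs can pull out the bypass lines. This is a sketch that assumes strip's zerolog output is one JSON object per line using zerolog's default "time" and "message" keys; the "ip" key is a hypothetical field name, and "userAgent" is taken from the wording above. Check the actual field names against a real log line before relying on it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

const bypassMarker = "Authentication bypassed via UUID"

// bypassEvent holds the fields worth capturing for the incident record.
type bypassEvent struct {
	Time      string `json:"time"`
	Message   string `json:"message"`
	IP        string `json:"ip"`        // hypothetical field name
	UserAgent string `json:"userAgent"` // name taken from the runbook text
	Raw       string `json:"-"`         // original line, kept as evidence
}

// findBypassEvents returns every log line that mentions the bypass marker.
// Lines that are not valid JSON are kept with only Raw populated, so that
// nothing relevant to the incident is silently dropped.
func findBypassEvents(logs string) []bypassEvent {
	var events []bypassEvent
	for _, line := range strings.Split(logs, "\n") {
		if !strings.Contains(line, bypassMarker) {
			continue
		}
		ev := bypassEvent{Raw: line}
		_ = json.Unmarshal([]byte(line), &ev) // best effort; Raw survives a failure
		events = append(events, ev)
	}
	return events
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read stdin:", err)
		os.Exit(1)
	}
	for _, ev := range findBypassEvents(string(data)) {
		fmt.Printf("%s ip=%q userAgent=%q\n", ev.Time, ev.IP, ev.UserAgent)
	}
}
```

Pipe an exported log file into the program on stdin; an entry with empty fields means the line matched but did not parse, so inspect it by hand.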
4. Rate-limit storm
429 spikes on /login or globally. Check getRealClientIP is returning the real client (look for proxies stripping X-Forwarded-For). Don’t raise limits without a security review.
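To see why a stripped X-Forwarded-For produces a 429 storm, consider the sketch below. It is a hypothetical stand-in, not strip's actual getRealClientIP: it takes the first parseable address in the header and otherwise falls back to the socket peer. When a proxy drops the header, every request resolves to the proxy's address and shares one rate-limit bucket.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// clientIP is a hypothetical illustration of header-based client IP
// resolution; the real getRealClientIP in strip may differ.
func clientIP(xForwardedFor, remoteAddr string) string {
	// Use the first entry in X-Forwarded-For that parses as an IP.
	for _, part := range strings.Split(xForwardedFor, ",") {
		if ip := net.ParseIP(strings.TrimSpace(part)); ip != nil {
			return ip.String()
		}
	}
	// No usable header: fall back to the peer address of the connection.
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

func main() {
	// Header intact: each client gets its own rate-limit key.
	fmt.Println(clientIP("203.0.113.7, 10.0.0.5", "10.0.0.5:41234"))
	// Header stripped: every request collapses onto the proxy address.
	fmt.Println(clientIP("", "10.0.0.5:41234"))
}
```

The leftmost header entry is client-supplied, so a resolver like this is only safe behind a proxy that overwrites the header, which is one reason limit changes need a security review.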
5. Templ render panics
recover.New middleware logs the stack and returns 500. Locate the panicking handler via requestid correlation and revert the offending Templ change. Templ panics often indicate a generated *_templ.go is stale — re-run make build-ui locally to reproduce, then patch.
6. CSP / asset breakage
After a deploy, browser console shows CSP/MIME violations. Cause is usually an inline script that worked in dev (stage == development relaxes CSP) but is blocked in production (Cross-Origin-Embedder-Policy: credentialless). Bundle the asset; don’t relax the policy.
Maintenance ops
- Restart: kubectl rollout restart / Railway “Redeploy”. Safe any time.
- Drain: scale to zero. No background workers, no in-flight long jobs to drain.
- Cache: in-memory Ristretto only — restart clears it; no external cache to flush.
Observability hooks
Logs only (zerolog → stdout) at the time of writing. apps/strip/cmd/app/main.go wires no OTel SDK and no Sentry; repo grep confirms zero references in apps/strip/. Correlate via X-Request-ID to sirloin spans. TODO(@zen): confirm with /operations/observability/ whether strip is intentionally excluded or if a future PR adds it.
Escalation
See /services/strip-oncall/ for paging targets and severity matrix.