Strip Runbook

Operational playbook for the strip Go SSR service. Optimised for sirloin connectivity loss, session/auth bugs, deploys, and rollback.

Service shape

  • Binary: apps/strip/bin/strip built from cmd/app/main.go.
  • Container: apps/strip/Dockerfile (Chainguard wolfi-base + chainguard/go).
  • Listens on STRIP_PORT (default :8080). Single Fiber process.
  • One outbound dependency that matters: sirloin gRPC at STRIP_SIRLOIN_GRPC_HOST (gRPC keepalive 30s/10s).
  • Auth: Clerk session cookie (__session).
  • No dedicated .github/workflows/strip.yml exists. CI/CD is wired through Railway’s GitHub integration using apps/strip/railway.json (builder: DOCKERFILE, watchPatterns: ["/apps/strip/**"]). Pushes touching apps/strip/** on the connected branch trigger a Railway image build + rollout.
flowchart LR
subgraph deploy
GHA[GitHub] -->|Railway GitHub integration<br/>watchPatterns: apps/strip/**| Railway[(Railway)]
Railway --> Strip[strip container]
end
Strip -->|gRPC| Sirloin[(sirloin)]
Strip -->|HTTPS| Clerk[(Clerk)]
Strip --> OpenRouter[(OpenRouter)]

Deploy

  1. Merge to main. Railway’s GitHub integration detects changes under apps/strip/** (apps/strip/railway.json watchPatterns) and builds the image from apps/strip/Dockerfile.
  2. Railway picks up the new image and rolls one instance at a time.
  3. /health (open route, 204) is the readiness signal.
  4. Boot logs to confirm:
    • config loaded successfully
    • No database migrated line: strip does not emit it (that’s sirloin), so its absence is expected.
    • No SECURITY WARNING lines.
  5. Smoke: hit /login (200), /health (204), and /dashboard while authenticated (200).
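
The unauthenticated part of the smoke step can be scripted. The following is a minimal sketch, not part of the strip repo: STRIP_BASE_URL, runSmoke, and httpStatus are hypothetical names introduced for illustration. It covers /login and /health only, since /dashboard needs a valid __session cookie.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// check pairs a path with the status code the deploy steps expect.
type check struct {
	Path string
	Want int
}

// smokeChecks covers the unauthenticated routes only. /dashboard needs a
// valid __session cookie, so it is left for a manual check in a browser.
var smokeChecks = []check{
	{Path: "/login", Want: 200},
	{Path: "/health", Want: 204},
}

// statusFunc fetches a URL and returns the HTTP status code.
type statusFunc func(url string) (int, error)

// runSmoke calls get for every check and returns one message per failure.
func runSmoke(get statusFunc, baseURL string, checks []check) []string {
	var failures []string
	for _, c := range checks {
		got, err := get(baseURL + c.Path)
		switch {
		case err != nil:
			failures = append(failures, fmt.Sprintf("%s: %v", c.Path, err))
		case got != c.Want:
			failures = append(failures, fmt.Sprintf("%s: got %d, want %d", c.Path, got, c.Want))
		}
	}
	return failures
}

// httpStatus is the real statusFunc: a GET with a short timeout.
func httpStatus(url string) (int, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// STRIP_BASE_URL is a hypothetical variable used only by this sketch.
	base := os.Getenv("STRIP_BASE_URL")
	if base == "" {
		fmt.Println("set STRIP_BASE_URL, e.g. http://localhost:8080")
		return
	}
	failures := runSmoke(httpStatus, base, smokeChecks)
	for _, f := range failures {
		fmt.Println("FAIL", f)
	}
	if len(failures) > 0 {
		os.Exit(1)
	}
	fmt.Println("smoke checks passed")
}
```

Passing the fetch function in as a parameter keeps runSmoke testable without a live server.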

Rollback

  1. In Railway, redeploy the previous image SHA.
  2. Strip is stateless — no schema migrations, no queues — rollback is safe.
  3. Verify session cookies still resolve. If Clerk env changed in the bad release, you may need to clear __session cookies for testers.

Common incidents

1. Sirloin connectivity loss

Symptoms. All protected pages 500 with “Failed to fetch …” strings. Logs show transport: connection error or context deadline exceeded from the gRPC client.

Diagnose.

  1. Confirm sirloin health (its own runbook). Strip cannot recover without sirloin.
  2. From a strip pod: resolve STRIP_SIRLOIN_GRPC_HOST and ensure TCP reachability.
  3. Check requestid-tagged log lines for repeated keepalive ping failures.
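
Step 2 can be run as a small probe where no shell tools are available in the container. This is a sketch under one assumption: that STRIP_SIRLOIN_GRPC_HOST holds a host:port pair. The probe function is hypothetical and checks DNS and TCP only, not gRPC health.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// probe resolves the host part of addr, then attempts a TCP connection.
// It returns the resolved addresses so DNS drift is visible in the output.
func probe(addr string, timeout time.Duration) ([]string, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, fmt.Errorf("expected host:port, got %q: %w", addr, err)
	}
	ips, err := net.LookupHost(host)
	if err != nil {
		return nil, fmt.Errorf("DNS lookup failed: %w", err)
	}
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return ips, fmt.Errorf("TCP dial failed: %w", err)
	}
	conn.Close()
	return ips, nil
}

func main() {
	// Assumption: the variable holds host:port, as a gRPC dial target would.
	addr := os.Getenv("STRIP_SIRLOIN_GRPC_HOST")
	if addr == "" {
		fmt.Println("STRIP_SIRLOIN_GRPC_HOST is not set")
		return
	}
	ips, err := probe(addr, 3*time.Second)
	fmt.Println("resolved:", ips)
	if err != nil {
		fmt.Println("unreachable:", err)
		os.Exit(1)
	}
	fmt.Println("TCP reachable:", addr)
}
```

If the resolved addresses differ from what sirloin currently advertises, the long-lived gRPC client in strip may still be holding the old endpoint.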

Mitigate.

  • If sirloin is the root cause, page sirloin oncall; strip needs no action.
  • If DNS/network drift only affects strip, restart strip. The gRPC client is created once at boot, so a stale endpoint won’t self-heal until the process restarts.
  • Last resort: temporarily switch STRIP_SIRLOIN_GRPC_HOST to a healthy region.

2. Session/auth bugs

Symptoms. Users redirected to /login in a loop; HTMX panels return 403 permission_denied; “invalid session” toasts.

Diagnose.

  1. Inspect browser cookies. __session should be present, Secure outside dev, SameSite=Lax, 7-day expiry.
  2. Verify Clerk env triplet: STRIP_CLERK_PUBLISHABLE_KEY, STRIP_CLERK_DOMAIN, STRIP_CLERK_SECRET_KEY all match the same Clerk environment.
  3. Tail strip logs for SECURITY WARNING. In a non-dev stage, any such line means the dev fallback or the UUID bypass is being triggered.
  4. For 403s, confirm the user’s role grants the required permission (internal/app/authorization/authorization.go).
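
The cookie attributes in step 1 can be checked mechanically against a Set-Cookie header copied from the browser's network tab. This is a sketch: checkSetCookie and livesSevenDays are hypothetical helpers. Whether strip sets the lifetime via Max-Age or Expires is not stated here, so both are accepted.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const sevenDays = 7 * 24 * time.Hour

// checkSetCookie parses one Set-Cookie header value and lists every
// attribute that differs from what the runbook expects for __session.
func checkSetCookie(line string, dev bool) []string {
	resp := http.Response{Header: http.Header{"Set-Cookie": {line}}}
	cookies := resp.Cookies()
	if len(cookies) != 1 || cookies[0].Name != "__session" {
		return []string{"no __session cookie in header"}
	}
	c := cookies[0]
	var problems []string
	if !c.Secure && !dev {
		problems = append(problems, "Secure flag missing outside dev")
	}
	if c.SameSite != http.SameSiteLaxMode {
		problems = append(problems, "SameSite is not Lax")
	}
	if !livesSevenDays(c) {
		problems = append(problems, "lifetime is not 7 days")
	}
	return problems
}

// livesSevenDays accepts either Max-Age or Expires. Expires gets an hour
// of slack, so this only suits a freshly issued cookie.
func livesSevenDays(c *http.Cookie) bool {
	if c.MaxAge > 0 {
		return time.Duration(c.MaxAge)*time.Second == sevenDays
	}
	if c.Expires.IsZero() {
		return false // no lifetime set: the cookie ends with the browser session
	}
	remaining := time.Until(c.Expires)
	return remaining <= sevenDays && remaining > sevenDays-time.Hour
}

func main() {
	// Example header value; paste the real one from browser dev tools.
	line := "__session=abc; Path=/; Max-Age=604800; Secure; HttpOnly; SameSite=Lax"
	problems := checkSetCookie(line, false)
	if len(problems) == 0 {
		fmt.Println("cookie attributes look right")
		return
	}
	for _, p := range problems {
		fmt.Println("problem:", p)
	}
}
```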

Mitigate.

  • Wrong Clerk env → re-deploy with corrected secrets.
  • Missing role → fix in sirloin role mgmt, not in strip.
  • Cookie domain mismatch → ensure STRIP_CLERK_DOMAIN matches the host strip is served from.
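
Before re-deploying with corrected secrets, a quick sanity check can catch the most common mix-up: a test publishable key paired with a live secret key. The sketch below relies on Clerk's key prefixes (pk_test_ / pk_live_ and sk_test_ / sk_live_). keyEnv and checkClerkKeys are hypothetical helpers. The check cannot tell two instances of the same type apart, and it does not validate STRIP_CLERK_DOMAIN.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// keyEnv returns "test" or "live" for a Clerk key with the given kind
// prefix ("pk" or "sk"), or "" if the key does not have that shape.
func keyEnv(key, kind string) string {
	for _, env := range []string{"test", "live"} {
		if strings.HasPrefix(key, kind+"_"+env+"_") {
			return env
		}
	}
	return ""
}

// checkClerkKeys reports whether the publishable and secret keys are
// both well-formed and belong to the same type of Clerk instance.
func checkClerkKeys(publishable, secret string) error {
	pk, sk := keyEnv(publishable, "pk"), keyEnv(secret, "sk")
	switch {
	case pk == "":
		return fmt.Errorf("publishable key lacks a pk_test_ or pk_live_ prefix")
	case sk == "":
		return fmt.Errorf("secret key lacks a sk_test_ or sk_live_ prefix")
	case pk != sk:
		return fmt.Errorf("publishable key is %s but secret key is %s", pk, sk)
	}
	return nil
}

func main() {
	// Only the verdict is printed; the secret itself never reaches stdout.
	err := checkClerkKeys(
		os.Getenv("STRIP_CLERK_PUBLISHABLE_KEY"),
		os.Getenv("STRIP_CLERK_SECRET_KEY"),
	)
	if err != nil {
		fmt.Println("Clerk key mismatch:", err)
		return
	}
	fmt.Println("Clerk keys agree on instance type")
}
```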

3. Bypass UUID misuse

If STRIP_AUTH_BYPASS_UUID is set in a non-dev stage, clear the env var and rotate the UUID immediately. Audit logs for Authentication bypassed via UUID lines, capture each source IP and userAgent, and treat the event as a security incident per /standards/security-model/.
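
The log audit can be sketched as a small scanner over exported log lines. It assumes strip's zerolog output is one JSON object per line with zerolog's default time and message keys. The ip field name is a guess and userAgent is taken from the wording above, so adjust both to the real log schema.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

const bypassMessage = "Authentication bypassed via UUID"

// bypassEvent holds the fields worth capturing for the incident record.
// "ip" and "userAgent" are assumed field names; adjust to the real logs.
type bypassEvent struct {
	Time      string `json:"time"`
	Message   string `json:"message"`
	IP        string `json:"ip"`
	UserAgent string `json:"userAgent"`
}

// parseBypass decodes one log line and reports whether it is a bypass event.
func parseBypass(line string) (bypassEvent, bool) {
	var ev bypassEvent
	if !strings.Contains(line, bypassMessage) {
		return ev, false // cheap filter before paying for JSON decoding
	}
	if err := json.Unmarshal([]byte(line), &ev); err != nil {
		return ev, false
	}
	return ev, strings.Contains(ev.Message, bypassMessage)
}

// findBypasses scans r line by line and returns every bypass event.
func findBypasses(r io.Reader) ([]bypassEvent, error) {
	var events []bypassEvent
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if ev, ok := parseBypass(sc.Text()); ok {
			events = append(events, ev)
		}
	}
	return events, sc.Err()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: bypass-audit <log-file>")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()
	events, err := findBypasses(f)
	if err != nil {
		fmt.Println("scan error:", err)
	}
	for _, ev := range events {
		fmt.Printf("%s ip=%s userAgent=%q\n", ev.Time, ev.IP, ev.UserAgent)
	}
	fmt.Println("bypass events:", len(events))
}
```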

4. Rate-limit storm

429 spikes on /login or globally. Check getRealClientIP is returning the real client (look for proxies stripping X-Forwarded-For). Don’t raise limits without a security review.
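
To see why a stripped X-Forwarded-For produces a storm, consider how a client address is typically derived behind proxies. The sketch below is not strip's getRealClientIP, whose implementation is not shown here. clientIP and the trustedProxies count are hypothetical, illustrating the common "n-th entry from the right" approach.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// clientIP picks the address to rate-limit on. Each trusted proxy appends
// the peer it saw to X-Forwarded-For, so with n trusted proxies the client
// is the n-th entry from the right. Anything further left is supplied by
// the client and must not be trusted.
func clientIP(xff, remoteAddr string, trustedProxies int) string {
	var hops []string
	for _, part := range strings.Split(xff, ",") {
		if p := strings.TrimSpace(part); p != "" {
			hops = append(hops, p)
		}
	}
	if i := len(hops) - trustedProxies; trustedProxies > 0 && i >= 0 {
		if ip := net.ParseIP(hops[i]); ip != nil {
			return ip.String()
		}
	}
	// Header missing, too short, or malformed: fall back to the TCP peer.
	// Behind a proxy this is the proxy itself, which is how every user
	// ends up sharing one rate-limit bucket.
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

func main() {
	// Header intact: the limiter keys on the real client.
	fmt.Println(clientIP("203.0.113.7", "10.0.0.2:4312", 1))
	// Header stripped: the limiter keys on the proxy address.
	fmt.Println(clientIP("", "10.0.0.2:4312", 1))
}
```

When the second case shows up for all traffic, every request counts against the proxy's address, and the fix belongs in the proxy configuration, not the limits.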

5. Templ render panics

recover.New middleware logs the stack and returns 500. Locate the panicking handler via requestid correlation and revert the offending Templ change. Templ panics often indicate a generated *_templ.go is stale — re-run make build-ui locally to reproduce, then patch.

6. CSP / asset breakage

After a deploy, the browser console shows CSP/MIME violations. The cause is usually an inline script or cross-origin asset that worked in dev (stage == development relaxes CSP) but is blocked by the stricter production headers (which include Cross-Origin-Embedder-Policy: credentialless). Bundle the asset; don’t relax the policy.

Maintenance ops

  • Restart: kubectl rollout restart / Railway “Redeploy”. Safe any time.
  • Drain: scale to zero. No background workers, no in-flight long jobs to drain.
  • Cache: in-memory Ristretto only — restart clears it; no external cache to flush.

Observability hooks

Logs only (zerolog → stdout) at the time of writing. apps/strip/cmd/app/main.go wires no OTel SDK and no Sentry; repo grep confirms zero references in apps/strip/. Correlate via X-Request-ID to sirloin spans. TODO(@zen): confirm with /operations/observability/ whether strip is intentionally excluded or if a future PR adds it.

Escalation

See /services/strip-oncall/ for paging targets and severity matrix.