Sirloin Runbook
Operational guide for the sirloin Go service. For the deeper billing runbook (saga state, payment reconciliation) see runbooks/billing and runbooks/billing-pitfalls. For service overview see Sirloin.
Deploy
CI workflow: .github/workflows/sirloin.yml. Triggers on push or PR
touching apps/sirloin/**.
Pipeline jobs:
- lint: golangci-lint v2 against apps/sirloin.
- checks: go mod tidy cleanliness, make verify-migrations, govulncheck.
- test: go test -race -coverprofile=coverage.out -count=1 -timeout=10m ./..., coverage gate >= 3%.
Image build / push to ECR is currently commented out in the workflow
(docker-build job and ECR login steps in
.github/workflows/sirloin.yml:110-:146). Sirloin is deployed via Railway
only. Production deploys come from Railway tied to the
release branch — see deployment-env and
railway. Railway picks up apps/sirloin/railway.json.
Index migrations on large tables
go-pg-migrate runs each migration inside a transaction, and CREATE INDEX
CONCURRENTLY cannot run inside a transaction block, so migration files use
plain CREATE INDEX (which takes a SHARE lock and blocks writes to the table
until the build finishes). For indexes on hot, large tables (e.g. media.media,
11M+ rows) build them out-of-band before deploying the migration so the
rollout takes no blocking lock:

    psql "$DATABASE_URL" -f apps/sirloin/ops/strip_perf_indexes_concurrent.sql

The script uses CREATE INDEX CONCURRENTLY IF NOT EXISTS, so the matching
migration (e.g. apps/sirloin/internal/app/migrate/schema/116_strip_perf_indexes.sql)
no-ops for indexes already built. Otherwise schedule the deploy off-peak.
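
For cases where the out-of-band build is driven from Go rather than psql, the following is a minimal sketch. The index name and columns in the usage note are hypothetical, and the helper names are not part of sirloin. The point it illustrates is that the statement runs directly on the connection pool, never inside a transaction.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// concurrentIndexSQL builds a CREATE INDEX CONCURRENTLY IF NOT EXISTS
// statement. Identifiers are interpolated verbatim, so pass only trusted,
// hard-coded names (hypothetical helper, not part of sirloin).
func concurrentIndexSQL(index, table string, columns ...string) string {
	return fmt.Sprintf(
		"CREATE INDEX CONCURRENTLY IF NOT EXISTS %s ON %s (%s)",
		index, table, strings.Join(columns, ", "),
	)
}

// buildIndexOutOfBand executes the statement directly on the pool. It must
// not be wrapped in db.BeginTx: PostgreSQL rejects CREATE INDEX CONCURRENTLY
// inside a transaction block.
func buildIndexOutOfBand(ctx context.Context, db *sql.DB, stmt string) error {
	_, err := db.ExecContext(ctx, stmt)
	return err
}
```

For example, concurrentIndexSQL("idx_media_user_id", "media.media", "user_id") yields a statement that is safe to re-run, because IF NOT EXISTS skips an index that already exists.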
Strip popularity rollup
StripListPopularExamples / StripListPopularCategories read precomputed
COUNT(DISTINCT user_id) from media.example_popularity / media.tag_popularity
instead of aggregating media.media live. The worker job
TaskRefreshMediaPopularity (apps/sirloin/internal/app/worker/refreshmediapopularity.go)
rebuilds both tables every 10 minutes; the endpoints fall back to the live
query when a count-affecting filter is set or before the first refresh
completes. If the popular pages look stale or empty, confirm the job is
running and that media.example_popularity is non-empty.
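
The fallback rule and the staleness check can be sketched as follows. This is an illustration only: rollupState, source, and stale are hypothetical names, and the real endpoints may track refresh state differently.

```go
package main

import "time"

// popularitySource names where a popular-list endpoint reads its counts.
type popularitySource string

const (
	sourceRollup popularitySource = "rollup" // precomputed popularity tables
	sourceLive   popularitySource = "live"   // aggregate media.media directly
)

// rollupState is a hypothetical view of the refresh job's progress.
type rollupState struct {
	LastRefresh time.Time // zero until the first refresh completes
}

// source applies the fallback rule: use the live query when a count-affecting
// filter is set or no refresh has completed yet.
func (s rollupState) source(hasCountAffectingFilter bool) popularitySource {
	if hasCountAffectingFilter || s.LastRefresh.IsZero() {
		return sourceLive
	}
	return sourceRollup
}

// stale reports whether more than maxMissed refresh intervals have passed
// since the last refresh, which suggests the worker job is not running.
func (s rollupState) stale(now time.Time, interval time.Duration, maxMissed int) bool {
	if s.LastRefresh.IsZero() {
		return true
	}
	return now.Sub(s.LastRefresh) > time.Duration(maxMissed)*interval
}
```

With the 10-minute interval above, calling stale(time.Now(), 10*time.Minute, 2) would flag a rollup that has not refreshed for more than 20 minutes; the threshold of two missed runs is an assumption, not a documented alert.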
sequenceDiagram
    participant Dev
    participant GH as GitHub Actions
    participant Railway
    participant Prod as sirloin (prod)
    Dev->>GH: push to release
    GH->>GH: lint, checks, test (sirloin.yml)
    GH-->>Dev: green
    Railway->>Prod: build + rollout
    Prod-->>Railway: /health (gRPC + HTTP)
Manual deploy / promote
Promote main → release via PR; Railway watches release. No manual
gh step required for rollout. To force a rebuild without code change,
trigger a Railway redeploy.
Rollback
Railway → service → deployments → previous green → Redeploy. Sirloin is
stateless except for distributed locks (Redis) and inflight HTTP servers,
so rollback is safe within the same DB schema. Do not roll back across a
DB migration boundary — make verify-migrations enforces unique numbers,
but down-migrations are not run automatically. Coordinate with the data
owner or accept forward-fix only.
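
The migration-boundary rule can be expressed as a pre-rollback check. The sketch below assumes you can look up the highest migration number each deployment shipped with; Railway does not record this, so the deployment type and its fields are hypothetical.

```go
package main

import "fmt"

// deployment is a hypothetical record of a deployment and the highest
// migration number its image applied.
type deployment struct {
	ID            string
	LastMigration int
}

// checkRollback returns an error when redeploying target would cross a
// migration boundary relative to the currently running deployment.
func checkRollback(current, target deployment) error {
	if target.LastMigration < current.LastMigration {
		return fmt.Errorf(
			"rollback %s -> %s crosses migrations %d..%d: forward-fix or coordinate with the data owner",
			current.ID, target.ID, target.LastMigration+1, current.LastMigration,
		)
	}
	return nil
}
```

A nil result means both deployments expect the same schema, which is the case where the Redeploy path above is safe.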
Scaling
- Container memory ceiling: 12 GB (per docker-compose.yml); the Railway limit is set independently. CPU/replica count: TODO(@zen) Railway plan.
- Horizontal scaling is safe: state lives in Postgres + Redis; locks are distributed (internal/pkg/locks).
- Single-replica concerns: the Chargebee event poller is leader-elected via lock, so it is fine to run multi-replica; only one replica polls at a time. Same for the monitor probe (SIRLOIN_MONITOR_ENABLED).
- Watch for noisy retries from httpclient when scaling up: Chargebee rate limits will trip the circuit breaker (ErrServiceUnavailable).
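
To make the leader-election behavior concrete, here is a minimal sketch of a lock-gated poller. The Locker interface and the in-memory implementation are assumptions for illustration; the real API in internal/pkg/locks may look different, and a Redis-backed lock would typically also need a TTL and renewal so a dead leader's lock expires.

```go
package main

import "sync"

// Locker is a hypothetical stand-in for the distributed lock API.
type Locker interface {
	// TryAcquire returns true if owner holds the named lock after the call.
	TryAcquire(name, owner string) bool
	Release(name, owner string)
}

// memLocker is an in-memory Locker used only for illustration.
type memLocker struct {
	mu      sync.Mutex
	holders map[string]string
}

func newMemLocker() *memLocker {
	return &memLocker{holders: make(map[string]string)}
}

func (l *memLocker) TryAcquire(name, owner string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if cur, held := l.holders[name]; held && cur != owner {
		return false
	}
	l.holders[name] = owner
	return true
}

func (l *memLocker) Release(name, owner string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holders[name] == owner {
		delete(l.holders, name)
	}
}

// pollOnce runs poll only if this replica holds the lock and reports whether
// it ran. Replicas that lose the lock skip the tick instead of blocking. The
// leader keeps the lock between ticks, so leadership is stable.
func pollOnce(l Locker, replica string, poll func()) bool {
	if !l.TryAcquire("events.poller", replica) {
		return false
	}
	poll()
	return true
}
```

Every replica can call pollOnce on each tick; only the current lock holder does any work, which is why adding replicas does not multiply Chargebee polling.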
Common Incidents
Use observability for the canonical Axiom
queries; this section names the failure mode and points to the right
dashboard. All claims here trace to source code under
apps/sirloin/internal/app/services/billing/.
Chargebee polling lag
- Symptom: GetCurrentUsage returns stale credits; users report “I paid and didn’t get credit”.
- Detect: in Axiom, service.name = "sirloin" AND event_type starts_with "chargebee." over the last 30 minutes; expect a steady cadence ≤ 15s. Look for gaps in events.poller logs.
- Mitigate:
  - Confirm leader replica is alive (locks namespace events.poller).
  - Check Chargebee status; circuit breaker may be open (ErrServiceUnavailable).
  - Run cmd/scripts/wallet-recovery-collect against affected users.
  - Worst case: bounce sirloin to force re-election.
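
If you export poller event timestamps from Axiom, a gap check against the 15s cadence can be sketched like this. The function and type names are hypothetical; nothing here calls Axiom or sirloin code.

```go
package main

import (
	"sort"
	"time"
)

// gap is a pause between two consecutive events that exceeded the expected
// cadence.
type gap struct {
	From, To time.Time
}

// findGaps returns every pause longer than maxCadence between consecutive
// event timestamps. The input may be unsorted; it is copied, not modified.
func findGaps(maxCadence time.Duration, events ...time.Time) []gap {
	sorted := append([]time.Time(nil), events...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Before(sorted[j]) })

	var gaps []gap
	for i := 1; i < len(sorted); i++ {
		if sorted[i].Sub(sorted[i-1]) > maxCadence {
			gaps = append(gaps, gap{From: sorted[i-1], To: sorted[i]})
		}
	}
	return gaps
}
```

Calling findGaps(15*time.Second, timestamps...) returns the windows in which the poller went quiet, which narrows down when the leader stalled or the circuit breaker opened.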
Primer webhook backlog
- Symptom: Payment confirmations delayed, dunning retries fail.
- Detect: 401s on /webhooks/primer/payments or 5xx from sirloin. Check Primer dashboard for retry queue depth.
- Mitigate:
  - Verify SIRLOIN_PRIMER_WEBHOOK_SECRET matches Primer dashboard.
  - Verify host clock: a 3-minute skew window is enforced (primer.WebhookSignedAtWithinSkew).
  - Replay from Primer dashboard once root cause fixed.
  - Manual recovery via SubmitPaidInvoice per affected invoice.
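
The clock-skew check is the easiest of these to reason about in code. The sketch below is not the body of primer.WebhookSignedAtWithinSkew; it is a hypothetical reimplementation that assumes the window applies in both directions (timestamps too old or too far in the future).

```go
package main

import "time"

// maxWebhookSkew is the 3-minute window named in this runbook.
const maxWebhookSkew = 3 * time.Minute

// webhookEvent is a hypothetical, minimal view of a signed webhook.
type webhookEvent struct {
	SignedAt time.Time
}

// withinSkew reports whether the signed-at timestamp is within maxSkew of
// now, in either direction. If the host clock drifts past the window, every
// webhook fails this check even though the secret is correct.
func (e webhookEvent) withinSkew(now time.Time, maxSkew time.Duration) bool {
	delta := now.Sub(e.SignedAt)
	if delta < 0 {
		delta = -delta
	}
	return delta <= maxSkew
}
```

This is why a drifting host clock shows up as rejected webhooks: a valid event signed one minute ago is rejected outright if the host believes it is five minutes later than it is.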
Payment saga stuck
See billing for the deep version.
- Symptom: Subscription stuck in future; payment recorded; no activation.
- Detect: query Postgres for subscription.status = 'future' AND last_payment_at > checkout_expiration. Cross-reference Chargebee.
- Mitigate:
  - Acquire the per-user lock (internal/pkg/locks) via admin tool.
  - Replay through payments/processor.go (idempotent on transactionID).
  - If Chargebee believes paid + active, force-reconcile DB.
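
Replay is safe because processing is idempotent on transactionID. The sketch below shows the general pattern with an in-memory store; the types are hypothetical, and the real payments/processor.go presumably persists this state in Postgres rather than in a map.

```go
package main

import "sync"

// payment is a hypothetical, simplified payment record.
type payment struct {
	TransactionID string
	UserID        string
	AmountCents   int64
}

// processor applies each transaction at most once, keyed by TransactionID,
// so replaying the same payment during recovery is a no-op.
type processor struct {
	mu      sync.Mutex
	applied map[string]bool
	credits map[string]int64 // credited cents per user
}

func newProcessor() *processor {
	return &processor{
		applied: make(map[string]bool),
		credits: make(map[string]int64),
	}
}

// Process returns true if the payment was applied and false if it was a
// replay of an already-applied transaction.
func (p *processor) Process(pay payment) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.applied[pay.TransactionID] {
		return false
	}
	p.applied[pay.TransactionID] = true
	p.credits[pay.UserID] += pay.AmountCents
	return true
}
```

Under this pattern, replaying a transaction that already landed changes nothing, so an operator can replay every suspect payment for a user without first working out which ones were applied.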
S3 / R2 upload failures
- Symptom: Character reference upload URLs returning 5xx.
- Detect: error rate on GetCharacterReferenceImageUploadURL; look for s3.PutObject / pre-sign errors in internal/pkg/s3.
- Mitigate: rotate SIRLOIN_S3_* keys if 403; check R2 status; fail open by serving cached URLs only when safe.
Auth verification failures (Clerk)
- Symptom: spike in UNAUTHENTICATED from brisket.
- Detect: log query on clerk.verify errors; internal/pkg/clerk.
- Mitigate: verify SIRLOIN_CLERK_API_KEY; confirm Clerk JWKS reachable; rotate key only with brisket coordination (auth-model).
High DB load
- Symptom: latency jumps across all RPCs.
- Detect: BUN auto-instrumented spans (bunotel); long queries surface in Axiom traces.
- Mitigate: identify hot query; consider read-runtime split via SIRLOIN_DATABASE_RUNTIME_URL; engage Neon support if connection ceiling hit.
Background Jobs
All run in-process under internal/app/worker/. Restarting sirloin restarts
them. Multi-replica safe via leader-elected locks. Notable workers:
- chargebeesync: Chargebee event poller (see incident above).
- checkmediageneration: polls round/brain for media completion.
- updatecharacterstatus: moves character state machine forward.
- monitorprobe_scoring_generation: synthetic media probe (gated by SIRLOIN_MONITOR_ENABLED).
Resolved
- Hosting topology: Railway only. The docker-build ECR job in .github/workflows/sirloin.yml:110-:146 is fully commented out; only lint, checks, and test run on push/PR. Railway picks up apps/sirloin/railway.json from the release branch.
TODO(@zen)
- TODO(@zen): replica count and HPA settings for prod.
- TODO(@zen): authoritative migration rollback policy — do we ever run down migrations in prod?