Sirloin Runbook

Operational guide for the sirloin Go service. For the deeper billing runbook (saga state, payment reconciliation) see runbooks/billing and runbooks/billing-pitfalls. For service overview see Sirloin.

Deploy

CI workflow: .github/workflows/sirloin.yml. Triggers on push or PR touching apps/sirloin/**.

Pipeline jobs:

  1. lint: golangci-lint v2 against apps/sirloin.
  2. checks: go mod tidy cleanliness, make verify-migrations, govulncheck.
  3. test: go test -race -coverprofile=coverage.out -count=1 -timeout=10m ./..., coverage gate >= 3%.

Image build / push to ECR is currently commented out in the workflow (docker-build job and ECR login steps in .github/workflows/sirloin.yml:110-:146). Sirloin is deployed via Railway only. Production deploys come from Railway tied to the release branch — see deployment-env and railway. Railway picks up apps/sirloin/railway.json.

Index migrations on large tables

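go-pg-migrate runs each migration inside a transaction, and CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so migration files use plain CREATE INDEX. Plain CREATE INDEX takes a SHARE lock on the table, which blocks writes (but not reads) until the build finishes. For indexes on hot, large tables (e.g. media.media, 11M+ rows), build them out-of-band before deploying the migration so the rollout never holds a write-blocking lock: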

psql "$DATABASE_URL" -f apps/sirloin/ops/strip_perf_indexes_concurrent.sql

The script uses CREATE INDEX CONCURRENTLY IF NOT EXISTS, so the matching migration (e.g. apps/sirloin/internal/app/migrate/schema/116_strip_perf_indexes.sql) no-ops for indexes already built. If the indexes cannot be pre-built, schedule the deploy off-peak.
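Because Postgres rejects CREATE INDEX CONCURRENTLY inside a transaction block, a concurrent build that slips into a migration file would fail at deploy time. The sketch below shows one way a pre-merge check could flag such statements. It is an illustration only: `usesConcurrentIndex`, the index name, and the column are hypothetical, and the regex does not try to skip SQL comments or string literals.

```go
package main

import (
	"fmt"
	"regexp"
)

// concurrentIndexRe matches CREATE [UNIQUE] INDEX CONCURRENTLY,
// case-insensitively and across arbitrary whitespace.
var concurrentIndexRe = regexp.MustCompile(`(?i)\bCREATE\s+(UNIQUE\s+)?INDEX\s+CONCURRENTLY\b`)

// usesConcurrentIndex reports whether a SQL body contains a concurrent index
// build, which cannot run inside a transactional migration.
// Hypothetical helper; not part of the sirloin codebase.
func usesConcurrentIndex(sql string) bool {
	return concurrentIndexRe.MatchString(sql)
}

func main() {
	// Index and column names are made up for illustration.
	migration := "CREATE INDEX IF NOT EXISTS idx_media_user ON media.media (user_id);"
	opsScript := "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_media_user ON media.media (user_id);"

	fmt.Println("migration file uses CONCURRENTLY:", usesConcurrentIndex(migration))
	fmt.Println("ops script uses CONCURRENTLY:", usesConcurrentIndex(opsScript))
}
```

A check like this belongs on migration files only; the out-of-band script under ops/ is expected to match.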

Strip popularity rollup

StripListPopularExamples / StripListPopularCategories read precomputed COUNT(DISTINCT user_id) from media.example_popularity / media.tag_popularity instead of aggregating media.media live. The worker job TaskRefreshMediaPopularity (apps/sirloin/internal/app/worker/refreshmediapopularity.go) rebuilds both tables every 10 minutes; the endpoints fall back to the live query when a count-affecting filter is set or before the first refresh completes. If the popular pages look stale or empty, confirm the job is running and that media.example_popularity is non-empty.

sequenceDiagram
participant Dev
participant GH as GitHub Actions
participant Railway
participant Prod as sirloin (prod)
Dev->>GH: push to release
GH->>GH: lint, checks, test (sirloin.yml)
GH-->>Dev: green
Railway->>Prod: build + rollout
Prod-->>Railway: /health (gRPC + HTTP)

Manual deploy / promote

Promote main → release via PR; Railway watches release. No manual gh step required for rollout. To force a rebuild without code change, trigger a Railway redeploy.

Rollback

Railway → service → deployments → previous green → Redeploy. Sirloin is stateless except for distributed locks (Redis) and inflight HTTP servers, so rollback is safe within the same DB schema. Do not roll back across a DB migration boundary: make verify-migrations enforces unique numbers, but down-migrations are not run automatically. Coordinate with the data owner or accept forward-fix only.
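To make the two checks in this section concrete, here is a minimal sketch of a unique-number check over migration file names and a guard for rollbacks that cross a migration boundary. It assumes file names follow the `NNN_description.sql` pattern seen in `116_strip_perf_indexes.sql`. All function names are hypothetical, and the real `make verify-migrations` target may work differently.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// migrationNumber extracts the numeric prefix from a name such as
// "116_strip_perf_indexes.sql".
func migrationNumber(name string) (int, error) {
	prefix, _, ok := strings.Cut(name, "_")
	if !ok {
		return 0, fmt.Errorf("no numeric prefix in %q", name)
	}
	return strconv.Atoi(prefix)
}

// duplicateNumbers returns, in ascending order, every migration number that
// is used by more than one file.
func duplicateNumbers(names []string) ([]int, error) {
	count := make(map[int]int)
	for _, name := range names {
		num, err := migrationNumber(name)
		if err != nil {
			return nil, err
		}
		count[num]++
	}
	var dups []int
	for num, c := range count {
		if c > 1 {
			dups = append(dups, num)
		}
	}
	sort.Ints(dups)
	return dups, nil
}

// crossesMigrationBoundary reports whether the rollback target was built
// against an older schema than the one currently applied.
func crossesMigrationBoundary(currentSchema, targetSchema int) bool {
	return targetSchema < currentSchema
}

func main() {
	// Only the 116 file name comes from this runbook; the others are made up.
	names := []string{"115_example.sql", "116_strip_perf_indexes.sql", "116_other.sql"}
	dups, err := duplicateNumbers(names)
	fmt.Println("duplicates:", dups, "error:", err)
	fmt.Println("rollback 116 -> 115 crosses boundary:", crossesMigrationBoundary(116, 115))
}
```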

Scaling

  • Container memory ceiling: 12 GB (per docker-compose.yml); the Railway limit is set independently. CPU/replica count: TODO(@zen) Railway plan.
  • Horizontal scaling is safe: state lives in Postgres + Redis; locks are distributed (internal/pkg/locks).
  • Single-replica concerns: the Chargebee event poller is leader-elected via lock — fine to run multi-replica, only one will poll at a time. Same for the monitor probe (SIRLOIN_MONITOR_ENABLED).
  • Watch for noisy retries from httpclient when scaling up: Chargebee rate limits will trip the circuit breaker (ErrServiceUnavailable).

Common Incidents

Use observability for the canonical Axiom queries; this section names the failure mode and points to the right dashboard. All claims here trace to source code under apps/sirloin/internal/app/services/billing/.

Chargebee polling lag

  • Symptom: GetCurrentUsage returns stale credits; users report “I paid and didn’t get credit”.
  • Detect: Axiom — service.name = "sirloin" AND event_type starts_with "chargebee." over last 30 minutes; expect steady cadence ≤ 15s. Look for gaps in events.poller logs.
  • Mitigate:
    1. Confirm leader replica is alive (locks namespace events.poller).
    2. Check Chargebee status; circuit breaker may be open (ErrServiceUnavailable).
    3. Run cmd/scripts/wallet-recovery-collect against affected users.
    4. Worst case: bounce sirloin to force re-election.

Primer webhook backlog

  • Symptom: Payment confirmations delayed, dunning retries fail.
  • Detect: 401s on /webhooks/primer/payments or 5xx from sirloin. Check Primer dashboard for retry queue depth.
  • Mitigate:
    1. Verify SIRLOIN_PRIMER_WEBHOOK_SECRET matches Primer dashboard.
    2. Verify host clock — 3-minute skew window is enforced (primer.WebhookSignedAtWithinSkew).
    3. Replay from Primer dashboard once root cause fixed.
    4. Manual recovery via SubmitPaidInvoice per affected invoice.

Payment saga stuck

See billing for the deep version.

  • Symptom: Subscription stuck in future; payment recorded; no activation.
  • Detect: query Postgres for subscription.status = 'future' AND last_payment_at > checkout_expiration. Cross-reference Chargebee.
  • Mitigate:
    1. Acquire the per-user lock (internal/pkg/locks) via admin tool.
    2. Replay through payments/processor.go (idempotent on transactionID).
    3. If Chargebee believes paid + active, force-reconcile DB.

S3 / R2 upload failures

  • Symptom: Character reference upload URLs returning 5xx.
  • Detect: error rate on GetCharacterReferenceImageUploadURL; look for s3.PutObject / pre-sign errors in internal/pkg/s3.
  • Mitigate: rotate SIRLOIN_S3_* keys if 403; check R2 status; fail open by serving cached URLs only when safe.

Auth verification failures (Clerk)

  • Symptom: spike in UNAUTHENTICATED from brisket.
  • Detect: log query on clerk.verify errors; internal/pkg/clerk.
  • Mitigate: verify SIRLOIN_CLERK_API_KEY; confirm Clerk JWKS reachable; rotate key only with brisket coordination (auth-model).

High DB load

  • Symptom: latency jumps across all RPCs.
  • Detect: BUN auto-instrumented spans (bunotel); long queries surface in Axiom traces.
  • Mitigate: identify hot query; consider read-runtime split via SIRLOIN_DATABASE_RUNTIME_URL; engage Neon support if connection ceiling hit.

Background Jobs

All run in-process under internal/app/worker/. Restarting sirloin restarts them. Multi-replica safe via leader-elected locks. Notable workers:

  • chargebeesync — Chargebee event poller (see incident above).
  • checkmediageneration — polls round/brain for media completion.
  • updatecharacterstatus — moves character state machine forward.
  • monitorprobe_scoring_generation — synthetic media probe (gated by SIRLOIN_MONITOR_ENABLED).

Resolved

  • Hosting topology: Railway only. The docker-build ECR job in .github/workflows/sirloin.yml:110-:146 is fully commented out; only lint, checks, and test run on push/PR. Railway picks up apps/sirloin/railway.json from the release branch.

TODO(@zen)

  • TODO(@zen): replica count and HPA settings for prod.
  • TODO(@zen): authoritative migration rollback policy — do we ever run down migrations in prod?