Sirloin On-call

You just got paged on sirloin. Read this first.

What sirloin is

Main Go API. gRPC :50051, webhooks :8087, MCP :8086. Owns billing, characters, media user-state, credits, admin/audit. Stateless except for Postgres (Neon) and Redis. See Sirloin for the full dossier and sirloin-runbook for incident detail. For deep billing scenarios go straight to runbooks/billing.

Check first (5 min)

  1. Is the service up?
    • Railway dashboard → sirloin → green deployments
    • grpc_health_probe -addr=<host>:50051 (container healthcheck uses this)
  2. Is anything obviously broken in logs?
    • Axiom dataset = SIRLOIN_AXIOM_DATASET (see observability)
    • Filter service.name = "sirloin" and severity >= "error"
  3. Are upstreams healthy?
    • Chargebee status page
    • Primer status page
    • Neon (Postgres) status
    • Clerk status
  4. Is it a known recurring incident?
    • Chargebee polling lag, Primer webhook 401s, payment saga stuck — all in sirloin-runbook.
  5. Recent deploy?
    • gh run list --workflow=sirloin.yml --limit 5
    • Railway deploys list, last 24h. If the pager fires within 30 min of a deploy, roll back first, debug second, unless a DB migration shipped with that deploy (see Rollback).

Top alerts and what they mean

Dashboards / alert routing: TODO(@zen) — link Axiom monitors and Sentry projects here once formalised.

  1. Sirloin gRPC error rate elevated: Most likely a downstream (Chargebee/Primer) outage or a recent regression. Check the Axiom error breakdown by RPC. If errors are concentrated on BillingService, skip to runbooks/billing.
  2. Chargebee event poller silent (> 60s gap): Multi-replica leader election failed, or Chargebee is returning 5xx. See runbook → Chargebee polling lag.
  3. Primer webhook 401 spike: Webhook secret rotation drift, or NTP drift > 3 min. See runbook → Primer webhook backlog.
  4. Payment saga stuck (subscription future past expiration): See billing-pitfalls and the runbook → Payment saga stuck section.
  5. Sirloin pod OOM / restart loop: Memory ceiling is 12 GB. Check for a runaway worker (likely checkmediageneration or a brain client retry storm). Bounce the pod and collect a heap profile via the pprof debug endpoint (apps/sirloin/internal/app/debug/pprof.go:18-:28). The endpoint is enabled by setting SIRLOIN_PPROF_PORT to a listen address (e.g. :6060); an empty value disables the server (apps/sirloin/internal/app/config/config.go:421-:425, apps/sirloin/cmd/app/main.go:346-:348). Routes are mounted under /debug/pprof/*.
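
The SIRLOIN_PPROF_PORT gating in item 5 can be pictured with a minimal sketch. This is not the code in pprof.go; the helper name newPprofServer is hypothetical, and only the env-var behaviour (empty value disables the server, routes under /debug/pprof/) is taken from this page.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/pprof"
	"os"
)

// newPprofServer is a hypothetical helper, not sirloin's actual code.
// It returns nil when addr is empty (profiling disabled); otherwise it
// returns a server exposing the standard pprof handlers under /debug/pprof/.
func newPprofServer(addr string) *http.Server {
	if addr == "" {
		return nil
	}
	mux := http.NewServeMux()
	// pprof.Index also serves the named profiles (heap, goroutine, ...).
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return &http.Server{Addr: addr, Handler: mux}
}

func main() {
	srv := newPprofServer(os.Getenv("SIRLOIN_PPROF_PORT"))
	if srv == nil {
		fmt.Println("pprof disabled")
		return
	}
	fmt.Println("pprof listening on", srv.Addr)
	// ListenAndServe blocks until the server fails or is shut down.
	if err := srv.ListenAndServe(); err != nil {
		fmt.Println("pprof server stopped:", err)
	}
}
```

With a server like this listening on :6060, `go tool pprof http://<host>:6060/debug/pprof/heap` would pull a heap profile; the host and port here are placeholders.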

Triage cheatsheet

| Symptom | First check | Doc |
|---|---|---|
| Users report “I paid, no credit” | Chargebee event poller logs | billing-pitfalls |
| Brisket showing UNAUTHENTICATED | Clerk JWKS reachable; SIRLOIN_CLERK_API_KEY valid | auth-model |
| Strip admin RPCs failing | Strip Clerk admin token / role | auth-model |
| Character upload broken | SIRLOIN_S3_*, R2 status | sirloin-runbook |
| All RPCs slow | Postgres connections / slow queries | sirloin-runbook |
| Webhook 401 storm | Secret drift, clock drift | sirloin-errors |

Rollback

Railway → sirloin → deployments → previous green → Redeploy. Safe so long as no DB migration was applied since. If a migration shipped with the bad release, don’t roll back — forward-fix, escalate to data owner (sirloin-runbook → Rollback).

Escalation

flowchart LR
    A[On-call] -->|<10 min stuck| B[Service owner: @zen]
    A -->|billing-specific| C[Billing owner]
    A -->|infra / Railway| D[Platform]
    B --> E[CTO]

  • Owner of record: @zen (per docs/.../services/sirloin.md frontmatter).
  • Billing-specific incidents: paging tree TODO(@zen).
  • Provider escalations: open ticket with Chargebee / Primer / Neon / Clerk in parallel — do not wait for an internal answer to engage them.

Useful commands

# HTTP health check (one-shot)
curl -fsS https://<sirloin>/health
# gRPC health (matches container probe)
grpc_health_probe -addr=<sirloin>:50051
# Recent CI runs
gh run list --workflow=sirloin.yml --limit 10
# Local repro of CI gate
cd apps/sirloin && make ci-all

Dashboards

TODO(@zen): canonical Axiom dashboard URLs, Sentry project link, and Railway alerting policy aren’t yet captured. Add them here once owned.

Resolved

  • pprof endpoint: gated by SIRLOIN_PPROF_PORT (empty = disabled), served at /debug/pprof/* (apps/sirloin/internal/app/debug/pprof.go:18-:28, apps/sirloin/internal/app/config/config.go:421-:425).

TODO(@zen)

  • TODO(@zen): authoritative paging policy (PagerDuty / Linear / Slack).
  • TODO(@zen): Axiom dashboard ids for sirloin RPC error rate, poller lag, and webhook health.
  • TODO(@zen): rotation procedure for Clerk and Chargebee webhook creds.