Sirloin On-call
You just got paged on sirloin. Read this first.
What sirloin is
Main Go API. gRPC :50051, webhooks :8087, MCP :8086. Owns billing,
characters, media user-state, credits, admin/audit. Stateless except for
Postgres (Neon) and Redis. See Sirloin for the full
dossier and sirloin-runbook for incident
detail. For deep billing scenarios go straight to
runbooks/billing.
Check first (5 min)
- Is the service up?
- Railway dashboard → sirloin → green deployments
grpc_health_probe -addr=<host>:50051(container healthcheck uses this)
- Is anything obviously broken in logs?
- Axiom dataset =
SIRLOIN_AXIOM_DATASET(see observability) - Filter
service.name = "sirloin"andseverity >= "error"
- Axiom dataset =
- Are upstreams healthy?
- Chargebee status page
- Primer status page
- Neon (Postgres) status
- Clerk status
- Is it a known recurring incident?
- Chargebee polling lag, Primer webhook 401s, payment saga stuck — all in sirloin-runbook.
- Recent deploy?
gh run list --workflow=sirloin.yml --limit 5- Railway deploys list — last 24h. If the pager fires within 30 min of a deploy, rollback first, debug second.
Top alerts and what they mean
Dashboards / alert routing: TODO(@zen) — link Axiom monitors and Sentry projects here once formalised.
- Sirloin gRPC error rate elevated
Most likely a downstream (Chargebee/Primer) outage or a recent regression.
Check Axiom error breakdown by RPC. If concentrated on
BillingService, skip to runbooks/billing. - Chargebee event poller silent (> 60s gap) Multi-replica leader election failed, or Chargebee 5xx. See runbook → Chargebee polling lag.
- Primer webhook 401 spike Webhook secret rotation drift, or NTP drift > 3 min. See runbook → Primer webhook backlog.
- Payment saga stuck (subscription
futurepast expiration) See billing-pitfalls and the runbook → Payment saga stuck section. - Sirloin pod OOM / restart loop
Memory ceiling 12 GB. Check for runaway worker (likely
checkmediagenerationor a brain client retry storm). Bounce + collect heap profile via the pprof debug endpoint (apps/sirloin/internal/app/debug/pprof.go:18-:28). Enabled by settingSIRLOIN_PPROF_PORTto a listen address (e.g.:6060); empty value disables the server (apps/sirloin/internal/app/config/config.go:421-:425,apps/sirloin/cmd/app/main.go:346-:348). Routes mounted under/debug/pprof/*.
Triage cheatsheet
| Symptom | First check | Doc |
|---|---|---|
| Users report “I paid, no credit” | Chargebee event poller logs | billing-pitfalls |
Brisket showing UNAUTHENTICATED | Clerk JWKS reachable; SIRLOIN_CLERK_API_KEY valid | auth-model |
| Strip admin RPCs failing | Strip Clerk admin token / role | auth-model |
| Character upload broken | SIRLOIN_S3_*, R2 status | sirloin-runbook |
| All RPCs slow | Postgres connections / slow queries | sirloin-runbook |
| Webhook 401 storm | Secret drift, clock drift | sirloin-errors |
Rollback
Railway → sirloin → deployments → previous green → Redeploy. Safe so long as no DB migration was applied since. If a migration shipped with the bad release, don’t roll back — forward-fix, escalate to data owner (sirloin-runbook → Rollback).
Escalation
flowchart LR A[On-call] -->|<10 min stuck| B[Service owner: @zen] A -->|billing-specific| C[Billing owner] A -->|infra / Railway| D[Platform] B --> E[CTO]- Owner of record:
@zen(perdocs/.../services/sirloin.mdfrontmatter). - Billing-specific incidents: paging tree TODO(@zen).
- Provider escalations: open ticket with Chargebee / Primer / Neon / Clerk in parallel — do not wait for an internal answer to engage them.
Useful commands
# Tail healthcurl -fsS https://<sirloin>/health
# gRPC health (matches container probe)grpc_health_probe -addr=<sirloin>:50051
# Recent CI runsgh run list --workflow=sirloin.yml --limit 10
# Local repro of CI gatecd apps/sirloin && make ci-allDashboards
TODO(@zen): canonical Axiom dashboard URLs, Sentry project link, and Railway alerting policy aren’t yet captured. Add them here once owned.
Resolved
- pprof endpoint: gated by
SIRLOIN_PPROF_PORT(empty = disabled), served at/debug/pprof/*—apps/sirloin/internal/app/debug/pprof.go:18-:28,apps/sirloin/internal/app/config/config.go:421-:425.
TODO(@zen)
- TODO(@zen): authoritative paging policy (PagerDuty / Linear / Slack).
- TODO(@zen): Axiom dashboard ids for sirloin RPC error rate, poller lag, and webhook health.
- TODO(@zen): rotation procedure for Clerk and Chargebee webhook creds.