Billing Operational Runbook
This document provides step-by-step procedures for common operational incidents involving the billing system.
Operational Architecture
Sirloin’s billing package owns subscription management, payment processing, checkout cleanup, renewal retries, product listing, credit allocation, cache invalidation, and billing analytics. The core package areas are:
- domain/: shared billing entities, sentinel errors, plan parsing, and analytics interfaces.
- chargebee/: Chargebee client wrapper and retry behavior.
- checkout/: Primer checkout creation and expired pending-subscription cleanup.
- payments/: unified payment recording, idempotency checks, and subscription activation.
- events/: Chargebee polling, credit extraction, refund detection, and analytics notifications.
- subscriptions/, renewals/, and products/: subscription lifecycle, renewal retry, and product display operations.
Operational incidents usually cross Chargebee, Primer, the Sirloin database, Redis/cache invalidation, and billing background workers. Treat Chargebee as the authoritative subscription and invoice source, and use local purchase records plus distributed locks to verify whether credits were applied exactly once.
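One way to make the "exactly once" check concrete is to look for invoices that have more than one local purchase row. The sketch below is illustrative only: it assumes the invoice IDs have already been exported from the purchases table, one per line, and the function name `find_duplicate_invoices` is hypothetical rather than part of the billing package.

```shell
# Hypothetical helper: reads invoice IDs (one per line) on stdin and prints
# each ID that appears more than once, followed by its count.
# Assumption: the IDs come from an export of the local purchases table.
find_duplicate_invoices() {
    sort | uniq -c | awk '$1 > 1 { print $2 " " $1 }'
}

# Example with inline sample data (not real invoices).
printf '%s\n' inv_1 inv_2 inv_1 inv_3 | find_duplicate_invoices
```

An empty result suggests each invoice was recorded at most once; any line of output is a candidate for the double-charging procedure under Fraud Alert Response.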
Table of Contents
- Orphaned Payments
- Stuck Subscriptions
- Chargebee Sync Drift
- Circuit Breaker Triage
- Rate Limit Spike
- Fraud Alert Response
Orphaned Payments
Symptom
User reports: “I paid but my subscription isn’t active” or “Credits didn’t appear”
Root Causes
- Primer webhook didn’t arrive (network issue)
- Polling worker crashed (missing fallback detection)
- Idempotency key mismatch (Primer txn ID not recorded)
- Chargebee invoice not marked as paid
- EventPoller hasn’t run yet (payment too recent; <15 seconds old)
Investigation
# 1. Check Chargebee for invoice status
chargebee-cli invoice get <invoice_id>
# Look for: status = "paid", payment_received, transaction_id
# 2. Check local DB for purchase record
SELECT * FROM purchases WHERE invoice_id = '<invoice_id>';
# If empty: payment not recorded locally
# 3. Check logs for payment recording errors
grep "invoice_id=<invoice_id>" logs/billing.log
# Look for: errors, rate limiting, lock failures
# 4. Check polling worker status
SELECT * FROM event_poller_state ORDER BY created_at DESC LIMIT 10;
# If > 5 min old: polling worker may be stuck
# 5. Check if payment is recent
SELECT TIMESTAMPDIFF(SECOND, paid_at, NOW()) FROM invoices WHERE id='<invoice_id>';
# If < 15 seconds: polling hasn't run yet; wait
Resolution
Case 1: Chargebee shows paid; DB shows no purchase
# Trigger manual event processing
curl -X POST http://localhost:8080/internal/billing/poll \
  -H "Content-Type: application/json" \
  -d '{"all_invoices": false, "since": "2026-04-08T12:00:00Z"}'
# Wait 30 seconds, check if credits appeared
SELECT credits FROM users WHERE id = '<user_id>';
Case 2: Chargebee shows unpaid
# Check if payment was actually captured in Primer
primer-cli transaction get <primer_transaction_id>
# If status = "authorized" (not "captured"):
# → Contact Primer support; payment not actually settled
# If status = "captured":
# → Payment was captured by Primer but not recorded in Chargebee
# → See "Chargebee Sync Drift" section below
Case 3: Polling worker stuck
# Check worker health
systemctl status billing-poller
# If not running: systemctl start billing-poller
# Check logs
journalctl -u billing-poller -n 100 --no-pager
# Look for: rate limit errors, API errors, crashes
# If rate-limited: Wait 60 seconds for backoff, then restart
systemctl restart billing-poller
Case 4: Idempotency check failed
# Verify transaction ID format
SELECT transaction_id FROM purchases WHERE invoice_id='<invoice_id>';
# Should match Primer transaction ID
# If mismatch: Data corruption issue
# → Contact engineering; may need DB cleanup
Prevention
- Monitor payment_recording_latency_seconds (alert if > 60s)
- Monitor events_polled_total (alert if 0 for > 5 min)
- Set up Primer webhook retry alerts
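The investigation and resolution steps in this section can be condensed into a small decision helper. This is a sketch, not an existing tool: the function name, argument order, and output labels are all hypothetical, and the three inputs are assumed to have been gathered by hand from Chargebee and the local database.

```shell
# Hypothetical triage helper mirroring the cases in this section.
#   $1  Chargebee invoice status, e.g. "paid" or "payment_due"
#   $2  number of matching rows in the local purchases table
#   $3  seconds since the payment was made
classify_orphaned_payment() {
    cb_status=$1
    purchase_rows=$2
    age_seconds=$3

    if [ "$purchase_rows" -gt 0 ]; then
        # A purchase exists locally, so compare its transaction ID with Primer.
        echo "case4-verify-transaction-id"
    elif [ "$age_seconds" -lt 15 ]; then
        # Too recent: the EventPoller has probably not run yet.
        echo "wait-for-poller"
    elif [ "$cb_status" = "paid" ]; then
        # Paid in Chargebee but not recorded locally.
        echo "case1-trigger-poll"
    else
        # Not paid in Chargebee: check the capture status in Primer.
        echo "case2-check-primer"
    fi
}
```

For example, `classify_orphaned_payment paid 0 120` prints `case1-trigger-poll`, which corresponds to triggering manual event processing.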
Stuck Subscriptions
Symptom
User’s subscription is in “future” or “non_renewing” state when it should be “active” or “cancelled”
Root Causes
- Pending checkout never completed (user started but didn’t finish)
- Activation failed (update_term_end API error)
- Cancellation API error (the call returned an error, but the subscription was actually cancelled)
- Chargebee-side state mismatch (DB disagrees with Chargebee)
Investigation
# 1. Check local DB
SELECT id, status, start_date, current_term_end FROM subscriptions
WHERE customer_id='<customer_id>';
# 2. Check Chargebee
chargebee-cli subscription get <subscription_id>
# Compare status between DB and Chargebee
# 3. Check if subscription is pending checkout
SELECT * FROM subscriptions WHERE status='future' AND start_date > NOW() + INTERVAL 1 YEAR;
# If matches: this is a pending checkout
# 4. Check age
SELECT TIMESTAMPDIFF(DAY, created_at, NOW()) FROM subscriptions WHERE id='<subscription_id>';
# If > 1 day: eligible for cleanup
Resolution
Case 1: Subscription is future + start_date > 1 year (pending checkout)
# Is checkout still active? (quote the URL so the shell does not interpret "?")
curl -X GET "http://localhost:8080/api/checkout/status?order_id=<invoice_id>"
# If "expired": User didn't complete payment
# Option A: User wants to retry
# → Recreate checkout (new Primer session)
curl -X POST http://localhost:8080/api/billing/checkout \
  -d '{"user_id": "<user_id>", "item_price_id": "<item_price_id>"}'
# Option B: Cleanup (if > 1 day old)
# → Manually cancel
chargebee-cli subscription cancel <subscription_id>
# Then re-create if user wants to retry
Case 2: Subscription should be active but is still future
# Payment was recorded but activation failed?
SELECT * FROM purchases WHERE subscription_id='<subscription_id>';
# If empty: payment never recorded
# If purchase exists: activation failed
# → Try to manually activate
curl -X POST http://localhost:8080/internal/billing/activate \
  -d '{"subscription_id": "<subscription_id>"}'
# If successful: Check Chargebee status updated
chargebee-cli subscription get <subscription_id> | grep status
Case 3: Subscription should be cancelled but is still active
# Was cancellation API called?
grep "subscription_id=<subscription_id>" logs/billing.log | grep cancel
# If no matches: cancellation was never initiated
# If matches but status still "active":
# → Chargebee API returned success but didn't actually update
# → Try again
chargebee-cli subscription cancel <subscription_id> --force
# If still doesn't work: Chargebee issue
# → Contact Chargebee support with subscription ID
Case 4: DB/Chargebee mismatch
# Force sync from Chargebee to DB
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
# Verify match
chargebee-cli subscription get <subscription_id> > /tmp/cb.json
curl http://localhost:8080/api/subscriptions/<subscription_id> > /tmp/db.json
diff /tmp/cb.json /tmp/db.json
Prevention
- Monitor subscriptions in “future” state (alert if > 1 day old + start_date not near now)
- Monitor activation errors in logs (alert if > 1% of payments)
- Test pending checkout cleanup regularly
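The "pending checkout" pattern used throughout this section (status "future", start date more than a year away, record more than a day old) can be expressed as a single predicate, which may be handy when testing cleanup. The sketch below is an assumption-laden illustration: the function name is hypothetical, all times are taken as Unix epoch seconds, and a year is approximated as 365 days.

```shell
# Hypothetical check for a stale pending checkout.
#   $1  subscription status
#   $2  start_date as epoch seconds
#   $3  created_at as epoch seconds
#   $4  current time as epoch seconds
# Exit status 0 means the subscription looks eligible for cleanup.
is_stale_pending_checkout() {
    status=$1
    start_epoch=$2
    created_epoch=$3
    now_epoch=$4

    one_year=$((365 * 24 * 3600))   # approximation: ignores leap years
    one_day=$((24 * 3600))

    [ "$status" = "future" ] &&
        [ $((start_epoch - now_epoch)) -gt "$one_year" ] &&
        [ $((now_epoch - created_epoch)) -gt "$one_day" ]
}
```

A call such as `is_stale_pending_checkout future "$start" "$created" "$(date +%s)" && echo "eligible for cleanup"` keeps the three conditions in one place instead of spreading them across separate queries.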
Chargebee Sync Drift
Symptom
DB shows subscription in different state than Chargebee, or invoice amounts don’t match
Root Causes
- Failed API call that partially succeeded (Chargebee changed, DB didn’t)
- Stale cache (local cache hasn’t been invalidated)
- Manual change in Chargebee UI (not synced back to DB)
- Eventual consistency window (recent change, not synced yet)
Investigation
# 1. Get current state from both systems
chargebee-cli subscription get <subscription_id> | jq . > /tmp/chargebee.json
curl -X GET http://localhost:8080/api/subscriptions/<subscription_id> | jq . > /tmp/db.json
# 2. Compare critical fields
diff <(jq '.status, .current_term_end, .coupon_ids' /tmp/chargebee.json) \
     <(jq '.status, .current_term_end, .coupon_ids' /tmp/db.json)
# 3. Check invoice amounts
chargebee-cli invoice get <invoice_id> | jq '.total'
SELECT total_amount FROM invoices WHERE id='<invoice_id>';
Resolution
Case 1: Chargebee is newer (DB is stale)
# Option A: Invalidate cache + re-fetch
curl -X POST http://localhost:8080/internal/cache/invalidate \
  -d '{"customer_id": "<customer_id>"}'
# Then fetch again (will re-query Chargebee)
curl http://localhost:8080/api/subscriptions/<subscription_id>
# Option B: Full sync
curl -X POST http://localhost:8080/internal/billing/sync-all \
  -d '{"since": "2026-04-08T00:00:00Z"}'
Case 2: DB is newer (Chargebee is stale)
This should be rare, because we sync from Chargebee rather than pushing to it. It can still happen if:
- Manual update was attempted but failed halfway
- Chargebee API returned success but didn’t apply
# Re-apply the update to Chargebee
chargebee-cli subscription update <subscription_id> \
  --new-field-name="<expected_value>"
# Then verify sync
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
Case 3: Manual change in Chargebee UI
# User or support made manual changes in Chargebee
# Sync DB to match
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
# Verify match
chargebee-cli subscription get <subscription_id> | jq '.status' > /tmp/cb_status
curl http://localhost:8080/api/subscriptions/<subscription_id> | jq '.status' > /tmp/db_status
diff /tmp/cb_status /tmp/db_status
Prevention
- Monitor sync errors in logs (alert if > 0 per hour)
- Monitor DB/Chargebee consistency gaps (audit query every 1 hour)
- Disable manual Chargebee changes for “system” subscriptions; route through API
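For the periodic consistency audit, a field-by-field comparison is easier to read than a raw diff. The following sketch assumes each side has already been flattened into a file of `key=value` lines (one field per line, in any order) and that the two files have different paths; the function name `drift_fields` and that file format are illustrative assumptions, not an existing billing tool.

```shell
# Hypothetical drift check between two "key=value" files.
# Prints every key whose value differs or that is missing on one side.
#   $1  first file  (for example, the Chargebee view)
#   $2  second file (for example, the local DB view)
drift_fields() {
    awk -F= '
        # Split each line at the first "=" so values may contain "=".
        { key = $1; val = substr($0, index($0, "=") + 1) }
        # First file: remember its values.
        FILENAME == ARGV[1] { left[key] = val; next }
        # Second file: compare against the remembered values.
        {
            seen[key] = 1
            if (!(key in left))        print key ": missing in first file"
            else if (left[key] != val) print key ": " left[key] " != " val
        }
        END {
            for (k in left) if (!(k in seen)) print k ": missing in second file"
        }
    ' "$1" "$2" | sort
}
```

No output means the compared fields agree. Each printed line names one field to reconcile, which maps directly onto the sync and re-apply cases above.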
Circuit Breaker Triage
Symptom
“Circuit breaker is open” error in logs; Chargebee API calls failing
Root Causes
- Chargebee API is down (service issue)
- Our API key is invalid or rate-limited (configuration issue)
- Network connectivity problem (firewall, DNS, proxy)
- Sustained error rate too high (threshold exceeded; e.g., > 50% failures)
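To illustrate the error-rate threshold in the last item, the sketch below decides whether a breaker would open for a given window of calls. It is not the breaker used by the billing service: the function name is hypothetical, and the 50% default simply reuses the example figure above.

```shell
# Hypothetical error-rate check for one observation window.
#   $1  failed calls in the window
#   $2  total calls in the window
#   $3  threshold as a percentage (default 50, the example figure above)
# Exit status 0 means the failure rate exceeds the threshold.
breaker_should_open() {
    failures=$1
    total=$2
    threshold_pct=${3:-50}

    # No traffic means no evidence of failure.
    [ "$total" -gt 0 ] || return 1

    # Integer form of failures/total > threshold_pct/100.
    [ $((failures * 100)) -gt $((total * threshold_pct)) ]
}
```

With 75 failures out of 100 calls the function succeeds (rate 0.75), while exactly 50 out of 100 does not, because the comparison is strictly greater than the threshold.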
Investigation
# 1. Check circuit breaker state
curl http://localhost:8080/internal/health/circuit-breaker
# Output might show: {"chargebee": {"state": "open", "error_rate": 0.75}}
# 2. Check error logs
grep "circuit.*open" logs/billing.log | tail -20
# 3. Test Chargebee API directly
# (Chargebee uses HTTP Basic auth with the API key as username and an empty password;
#  the API host is site-specific)
curl -u "$CHARGEBEE_API_KEY:" \
  https://<site>.chargebee.com/api/v2/health
# 4. Check rate limiting
grep "rate.*limit\|429\|Please try after" logs/billing.log | tail -10
# 5. Check our API key configuration
echo "$CHARGEBEE_API_KEY" | head -c 10 # Show first 10 chars (don't log full key)
# If missing or changed recently: configuration issue
Resolution
Case 1: Chargebee API is actually down
# Wait for Chargebee to recover (~5-30 minutes typically)
# Monitor status page: https://status.chargebee.com
# In the meantime:
# - Primer payments may still work (Primer is independent)
# - Event polling will retry automatically (exponential backoff)
# - User-facing requests will fail; serve cached data if available
# Alert users (if outage > 30 minutes)
curl -X POST http://localhost:8080/internal/notifications/alert \
  -d '{"message": "Subscription operations temporarily unavailable"}'
Case 2: Rate limiting
# Chargebee rate limit is per API key, per minute
# Standard limit: ~60 requests/minute
# Check rate limit errors
grep "Please try after some time" logs/billing.log | wc -l
# Reduce request frequency if possible:
# - Increase polling interval (15s → 30s)
# - Batch operations (list instead of individual gets)
# - Implement caching more aggressively
# Contact Chargebee if sustained: request rate limit increase
Case 3: Invalid/rotated API key
# Check if key was recently rotated
git log --all --grep="API_KEY\|chargebee" --oneline | head -5
# If key was rotated but env var not updated:
export CHARGEBEE_API_KEY="<new_key>"
# Or update .env file and restart service
systemctl restart billing-service
# Verify key works (HTTP Basic auth: API key as username, empty password)
curl -u "$CHARGEBEE_API_KEY:" \
  "https://<site>.chargebee.com/api/v2/items?limit=1"
Case 4: Network connectivity
# Test DNS resolution (the Chargebee API host is site-specific)
nslookup <site>.chargebee.com
# Test raw connectivity
nc -zv <site>.chargebee.com 443
# Check proxy settings (if behind proxy)
curl -v -x <proxy> https://<site>.chargebee.com/api/v2/health
# Check firewall rules
# If behind corporate firewall, ensure <site>.chargebee.com is whitelisted
Prevention
- Monitor circuit breaker state (alert if open > 5 minutes)
- Set up status page monitoring (Chargebee status page)
- Implement graceful degradation (cached data when CB unavailable)
- Test disaster scenario quarterly (simulate Chargebee down)
Rate Limit Spike
Symptom
Logs show "Please try after some time" errors; multiple retries with exponential backoff; user requests slow down
Root Causes
- Sudden traffic spike (load test, marketing campaign, viral growth)
- Polling worker making too many requests (config error, infinite loop)
- Chargebee’s rate limit lowered (API downgrade, or our usage miscounted)
- Retry storms (exponential backoff causing thundering herd)
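The retry-storm cause is easier to reason about with the delay calculation written out. The sketch below shows exponential backoff with "full jitter", where each client waits a random time between zero and the exponential ceiling. It is explicitly not the code in retry/retry.go: the function names, the 1-second base, and the 60-second cap are assumptions chosen for illustration.

```shell
# Sketch of exponential backoff with full jitter (illustrative values only).
#   $1  attempt number (1-based)
#   $2  base delay in seconds (default 1)
#   $3  maximum delay in seconds (default 60)
# Prints the ceiling for this attempt: base * 2^(attempt-1), capped.
backoff_ceiling() {
    attempt=$1
    base=${2:-1}
    cap=${3:-60}
    delay=$base
    i=1
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$cap" ]; then
        delay=$cap
    fi
    echo "$delay"
}

# Prints a random whole-second delay in [0, ceiling], so that clients which
# failed together are unlikely to retry together.
backoff_delay() {
    ceiling=$(backoff_ceiling "$@")
    # Seed from time and PID; calls within the same second in the same shell
    # will repeat, which is acceptable for a sketch.
    awk -v max="$ceiling" -v seed="$(date +%s)$$" \
        'BEGIN { srand(seed); print int(rand() * (max + 1)) }'
}
```

Without the random component, every client that failed at the same moment would retry after exactly 1, 2, 4, 8 seconds and so on, which is the synchronized pattern described above.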
Investigation
# 1. Check request volume in the last hour (assumes the log covers roughly the last hour)
grep "chargebee.*request" logs/billing.log | wc -l
# Compare to baseline
# 2. Count requests by operation
grep "chargebee.*request" logs/billing.log | \
  sed 's/.*operation=//' | cut -d' ' -f1 | sort | uniq -c
# 3. Check retry attempts
grep "retry.*attempt" logs/billing.log | tail -20
# 4. Check if polling worker is looping
# Counts distinct timestamps among the last 20 poll entries
# (assumes the first log field is a per-second timestamp)
grep "events_polled" logs/billing.log | tail -20 | awk '{print $1}' | uniq | wc -l
# At a 15-second interval all 20 should differ; a much smaller count means
# several polls share a timestamp, which suggests looping
Resolution
Case 1: Normal traffic spike
# Chargebee will recover automatically (1-2 minutes)
# Our system retries with backoff; requests will eventually succeed
# Monitor recovery
grep "rate.*limit\|Please try after" logs/billing.log | tail -1 | awk '{print $1}'
# If timestamp is > 2 minutes ago: recovered
# Check if user requests are succeeding now (prints only the HTTP status code)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/subscriptions/<sub_id>
# Should be 200
Case 2: Polling worker misconfiguration
# Check polling frequency
ps aux | grep polling-worker
# Look for: --frequency=15s (should be 15+ seconds)
# If frequency too low: update config
systemctl stop billing-poller
# Edit /etc/billing/config.yaml: set frequency to 30s
systemctl start billing-poller
# Monitor again
grep "events_polled" logs/billing.log | tail -5
Case 3: Chargebee rate limit was lowered
# This is rare; would require Chargebee proactively lowering your limit
# Check account status: https://app.chargebee.com/settings/your-account
# If limit was lowered:
# Option A: Request increase (contact support)
# Option B: Reduce request frequency (increase poll interval, batch requests)
# Calculate required frequency reduction (assumes the log covers the last hour)
requests=$(grep -c "chargebee.*request" logs/billing.log)
awk -v n="$requests" 'BEGIN {
  current_rps = n / 3600
  limit_rps = 60 / 60   # 60 requests/minute = 1 RPS
  if (current_rps > limit_rps) printf "reduction_factor=%.2f\n", limit_rps / current_rps
  else print "within limit; no reduction needed"
}'
# Example: If we're doing 2 RPS and limit is 1 RPS, reduction_factor is 0.50 (reduce by 50%)
Case 4: Retry storms (thundering herd)
# If multiple requests all hit rate limit and retry simultaneously,
# backoff + jitter helps, but may still spike
# Check for synchronization:
grep "retry.*exponential" logs/billing.log | \
  sed 's/.*attempt=//' | sort | uniq -c | sort -rn | head
# If one attempt number dominates: synchronized retries
# Add random jitter to retry delay
# (Already implemented in retry/retry.go with exponential backoff)
# If problem persists: contact engineering for backoff tuning
Prevention
- Monitor chargebee_rate_limit_errors_total (alert if > 0)
- Implement request batching (list instead of individual lookups)
- Cache aggressively (extend TTLs, pre-warm cache)
- Load test with Chargebee to understand rate limit behavior
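When choosing a polling interval, it can help to work backwards from the rate limit. The sketch below computes the shortest interval that keeps the poller within a share of the limit. The ~60 requests/minute default is the approximate figure quoted earlier in this runbook; the number of API calls per poll and the 50% budget share are assumptions, and the function name is hypothetical.

```shell
# Hypothetical sizing helper for the polling interval.
#   $1  API calls made by one polling cycle (assumed, measure it first)
#   $2  rate limit in requests per minute (default 60)
#   $3  percentage of the limit the poller may use (default 50)
# Prints the minimum interval between polls, in whole seconds.
min_poll_interval() {
    calls_per_poll=$1
    limit_per_min=${2:-60}
    share_pct=${3:-50}

    # interval = 60 * calls / (limit * share/100), kept in integers.
    budget=$((limit_per_min * share_pct))
    needed=$((calls_per_poll * 60 * 100))

    # Ceiling division: round up to the next whole second.
    echo $(( (needed + budget - 1) / budget ))
}
```

For instance, a poll cycle that makes 5 calls needs at least a 10-second interval to stay within half of a 60 requests/minute limit, which leaves the other half for user-facing requests.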
Fraud Alert Response
Symptom
Multiple failed payment attempts from one account, or a fraud alert raised by Primer or PostHog
Root Causes
- Testing (developer testing payment flows)
- Card decline (legitimate user’s card being rejected)
- Actual fraud (stolen card or account takeover)
- Billing system error (charging multiple times unintentionally)
Investigation
# 1. Get user details
SELECT user_id, email, created_at FROM users WHERE id='<user_id>';
# 2. Check payment history
SELECT transaction_id, amount, status, created_at FROM transactions
WHERE customer_id='<chargebee_customer_id>'
ORDER BY created_at DESC LIMIT 10;
# 3. Check if account is newly created (more likely to be fraud)
SELECT TIMESTAMPDIFF(MINUTE, created_at, NOW()) FROM users WHERE id='<user_id>';
# If < 30 min: new account (higher fraud risk)
# 4. Check activity pattern (testing vs real usage)
SELECT * FROM subscriptions WHERE customer_id='<chargebee_customer_id>';
# Multiple subscriptions in short time = potential testing
# 5. Check IP/device changes
SELECT ip_address, COUNT(*) FROM login_attempts
WHERE user_id='<user_id>'
GROUP BY ip_address;
# Multiple IPs = possible account compromise
Resolution
Case 1: Testing (developer)
# If internal user: No action needed
# If external user: Contact support to explain
# To prevent: Use separate test account with test API key
# Make sure test environment doesn't use production Chargebee
Case 2: Legitimate card decline
# Card was declined (insufficient funds, expired, etc.)
# User will retry naturally; no action needed
# But: Check if we're charging multiple times on decline
# (lists the most recent failures; COUNT(*) cannot be combined with ORDER BY created_at)
SELECT transaction_id, amount, created_at FROM transactions
WHERE customer_id='<id>' AND status='failed'
ORDER BY created_at DESC LIMIT 5;
# Should be 1 failed transaction per user attempt
# If multiple failures from single user attempt: Bug in payment retry logic
# Contact engineering
Case 3: Actual fraud
# Steps:
# 1. Freeze account (disable further payments)
curl -X POST http://localhost:8080/internal/users/<user_id>/freeze
# 2. Void any pending invoices
chargebee-cli invoice void <invoice_id>
# 3. Notify user (send email)
curl -X POST http://localhost:8080/internal/notifications/alert \
  -d '{"user_id": "<user_id>", "message": "We detected suspicious activity"}'
# 4. Contact payment processor (Primer)
# Report transaction ID to Primer support with details
# 5. Review transaction logs
grep "user_id=<user_id>" logs/billing.log logs/auth.log | tail -50
# Look for: unusual patterns, high-frequency attempts, geographic anomalies
# 6. Consider additional verification
# If high-value account: require 2FA, manual review before unlocking
Case 4: Billing system error (double-charging)
# Check for duplicate transactions
SELECT transaction_id, COUNT(*) FROM transactions
WHERE customer_id='<id>'
GROUP BY transaction_id
HAVING COUNT(*) > 1;
# If duplicates found: Data corruption issue
# 1. Contact engineering
# 2. Issue refund for duplicate charges
chargebee-cli credit-note create \
  --customer-id=<customer_id> \
  --amount=<duplicate_amount>
# 3. Fix root cause (idempotency check, locks, etc.)
Prevention
- Set up fraud alerts in Primer dashboard (monitor > N failed attempts)
- Monitor unusual payment patterns (>$1000/min, >10 charges/min)
- Require verification for high-value accounts
- Rate-limit payment attempts per user (max 5/hour)
- Regular fraud report reviews (quarterly)
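The per-user limit in the list above (max 5 attempts per hour) amounts to a sliding-window count. The sketch below shows that check in isolation. It assumes the user's payment-attempt times are available as Unix epoch seconds, one per line; the function name is hypothetical and this is not the limiter used by the billing service.

```shell
# Hypothetical sliding-window check for payment attempts.
# Reads attempt times (epoch seconds, one per line) on stdin.
#   $1  current time as epoch seconds
#   $2  maximum attempts allowed per hour (default 5)
# Exit status 0 means another attempt is still allowed.
payment_attempt_allowed() {
    now_epoch=$1
    max_attempts=${2:-5}

    # Count attempts that fall inside the last 3600 seconds.
    recent=$(awk -v now="$now_epoch" \
        '$1 > now - 3600 && $1 <= now { n++ } END { print n + 0 }')

    [ "$recent" -lt "$max_attempts" ]
}
```

Attempts older than an hour drop out of the count automatically, so a user who hit the limit can retry without any manual reset.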
Quick Reference: Common Commands
# Check subscription status
chargebee-cli subscription get <subscription_id> | jq '.status, .coupon_ids'
# List unpaid invoices
chargebee-cli invoice list --filter='status:payment_due' --limit=50
# Void invoice
chargebee-cli invoice void <invoice_id>
# Create credit note (refund)
chargebee-cli credit-note create --customer-id=<id> --amount=<cents>
# Check payment transaction
chargebee-cli transaction get <transaction_id>
# Trigger manual event polling
curl -X POST http://localhost:8080/internal/billing/poll \
  -H "Content-Type: application/json" \
  -d '{"all_invoices": false}'
# Invalidate user cache
curl -X POST http://localhost:8080/internal/cache/invalidate \
  -d '{"customer_id": "<customer_id>"}'
# Sync subscription from Chargebee
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
# Check circuit breaker status
curl http://localhost:8080/internal/health/circuit-breaker | jq .
# View billing logs
journalctl -u billing-service -n 100 -f
# Restart billing service
systemctl restart billing-service