Skip to content

Billing Operational Runbook

Billing Operational Runbook

This document provides step-by-step procedures for common operational incidents involving the billing system.

Operational Architecture

Sirloin’s billing package owns subscription management, payment processing, checkout cleanup, renewal retries, product listing, credit allocation, cache invalidation, and billing analytics. The core package areas are:

  • domain/: shared billing entities, sentinel errors, plan parsing, and analytics interfaces.
  • chargebee/: Chargebee client wrapper and retry behavior.
  • checkout/: Primer checkout creation and expired pending-subscription cleanup.
  • payments/: unified payment recording, idempotency checks, and subscription activation.
  • events/: Chargebee polling, credit extraction, refund detection, and analytics notifications.
  • subscriptions/, renewals/, and products/: subscription lifecycle, renewal retry, and product display operations.

Operational incidents usually cross Chargebee, Primer, the Sirloin database, Redis/cache invalidation, and billing background workers. Treat Chargebee as the authoritative subscription and invoice source, and use local purchase records plus distributed locks to verify whether credits were applied exactly once.

Table of Contents

  1. Orphaned Payments
  2. Stuck Subscriptions
  3. Chargebee Sync Drift
  4. Circuit Breaker Triage
  5. Rate Limit Spike
  6. Fraud Alert Response

Orphaned Payments

Symptom

User reports: “I paid but my subscription isn’t active” or “Credits didn’t appear”

Root Causes

  1. Primer webhook didn’t arrive (network issue)
  2. Polling worker crashed (missing fallback detection)
  3. Idempotency key mismatch (Primer txn ID not recorded)
  4. Chargebee invoice not marked as paid
  5. EventPoller hasn’t run yet (payment too recent; <15 seconds old)

Investigation

Terminal window
# 1. Check Chargebee for invoice status
chargebee-cli invoice get <invoice_id>
# Look for: status = "paid", payment_received, transaction_id
# 2. Check local DB for purchase record
SELECT * FROM purchases WHERE invoice_id = '<invoice_id>';
# If empty: payment not recorded locally
# 3. Check logs for payment recording errors
grep "invoice_id=<invoice_id>" logs/billing.log
# Look for: errors, rate limiting, lock failures
# 4. Check polling worker status
SELECT * FROM event_poller_state ORDER BY created_at DESC LIMIT 10;
# If > 5 min old: polling worker may be stuck
# 5. Check if payment is recent
SELECT TIMESTAMPDIFF(SECOND, paid_at, NOW()) FROM invoices WHERE id='<invoice_id>';
# If < 15 seconds: polling hasn't run yet; wait

Resolution

Case 1: Chargebee shows paid; DB shows no purchase

Terminal window
# Trigger manual event processing
curl -X POST http://localhost:8080/internal/billing/poll \
-H "Content-Type: application/json" \
-d '{"all_invoices": false, "since": "2026-04-08T12:00:00Z"}'
# Wait 30 seconds, check if credits appeared
SELECT credits FROM users WHERE id = '<user_id>';

Case 2: Chargebee shows unpaid

Terminal window
# Check if payment was actually captured in Primer
primer-cli transaction get <primer_transaction_id>
# If status = "authorized" (not "captured"):
# → Contact Primer support; payment not actually settled
# If status = "captured":
# → Payment was captured by Primer but not recorded in Chargebee
# → See "Sync Drift" section below

Case 3: Polling worker stuck

Terminal window
# Check worker health
systemctl status billing-poller
# If not running: systemctl start billing-poller
# Check logs
journalctl -u billing-poller -n 100 --no-pager
# Look for: rate limit errors, API errors, crashes
# If rate-limited: Wait 60 seconds for backoff, then restart
systemctl restart billing-poller

Case 4: Idempotency check failed

Terminal window
# Verify transaction ID format
SELECT transaction_id FROM purchases WHERE invoice_id='<invoice_id>';
# Should match Primer transaction ID
# If mismatch: Data corruption issue
# → Contact engineering; may need DB cleanup

Prevention

  • Monitor payment_recording_latency_seconds (alert if > 60s)
  • Monitor events_polled_total (alert if 0 for > 5 min)
  • Set up Primer webhook retry alerts

Stuck Subscriptions

Symptom

User’s subscription is in “future” or “non_renewing” state when it should be “active” or “cancelled”

Root Causes

  1. Pending checkout never completed (user started but didn’t finish)
  2. Activation failed (update_term_end API error)
  3. Cancellation API error (returned error but was actually canceled)
  4. Chargebee-side state mismatch (DB disagrees with Chargebee)

Investigation

Terminal window
# 1. Check local DB
SELECT id, status, start_date, current_term_end FROM subscriptions
WHERE customer_id='<customer_id>';
# 2. Check Chargebee
chargebee-cli subscription get <subscription_id>
# Compare status between DB and Chargebee
# 3. Check if subscription is pending checkout
SELECT * FROM subscriptions WHERE status='future' AND start_date > NOW() + INTERVAL '1 YEAR';
# If matches: this is a pending checkout
# 4. Check age
SELECT TIMESTAMPDIFF(DAY, created_at, NOW()) FROM subscriptions WHERE id='<subscription_id>';
# If > 1 day: eligible for cleanup

Resolution

Case 1: Subscription is future + start_date > 1 year (pending checkout)

Terminal window
# Is checkout still active?
curl -X GET http://localhost:8080/api/checkout/status?order_id=<invoice_id>
# If "expired": User didn't complete payment
# Option A: User wants to retry
# → Recreate checkout (new Primer session)
curl -X POST http://localhost:8080/api/billing/checkout \
-d '{"user_id": "<user_id>", "item_price_id": "<item_price_id>"}'
# Option B: Cleanup (if > 1 day old)
# → Manually cancel
chargebee-cli subscription cancel <subscription_id>
# Then re-create if user wants to retry

Case 2: Subscription should be active but is still future

Terminal window
# Payment was recorded but activation failed?
SELECT * FROM purchases WHERE subscription_id='<subscription_id>';
# If empty: payment never recorded
# If purchase exists: activation failed
# → Try to manually activate
curl -X POST http://localhost:8080/internal/billing/activate \
-d '{"subscription_id": "<subscription_id>"}'
# If successful: Check Chargebee status updated
chargebee-cli subscription get <subscription_id> | grep status

Case 3: Subscription should be cancelled but is still active

Terminal window
# Was cancellation API called?
grep "subscription_id=<subscription_id>" logs/billing.log | grep cancel
# If no matches: cancellation was never initiated
# If matches but status still "active":
# → Chargebee API returned success but didn't actually update
# → Try again
chargebee-cli subscription cancel <subscription_id> --force
# If still doesn't work: Chargebee issue
# → Contact Chargebee support with subscription ID

Case 4: DB/Chargebee mismatch

Terminal window
# Force sync from Chargebee to DB
curl -X POST http://localhost:8080/internal/billing/sync \
-d '{"subscription_id": "<subscription_id>"}'
# Verify match
chargebee-cli subscription get <subscription_id> > /tmp/cb.json
curl http://localhost:8080/api/subscriptions/<subscription_id> > /tmp/db.json
diff /tmp/cb.json /tmp/db.json

Prevention

  • Monitor subscriptions in “future” state (alert if > 1 day old + start_date not near now)
  • Monitor activation errors in logs (alert if > 1% of payments)
  • Test pending checkout cleanup regularly

Chargebee Sync Drift

Symptom

DB shows subscription in different state than Chargebee, or invoice amounts don’t match

Root Causes

  1. Failed API call that partially succeeded (Chargebee changed, DB didn’t)
  2. Stale cache (local cache hasn’t been invalidated)
  3. Manual change in Chargebee UI (not synced back to DB)
  4. Eventual consistency window (recent change, not synced yet)

Investigation

Terminal window
# 1. Get current state from both systems
chargebee-cli subscription get <subscription_id> | jq . > /tmp/chargebee.json
curl -X GET http://localhost:8080/api/subscriptions/<subscription_id> | jq . > /tmp/db.json
# 2. Compare critical fields
diff <(jq '.status, .current_term_end, .coupon_ids' /tmp/chargebee.json) \
<(jq '.status, .current_term_end, .coupon_ids' /tmp/db.json)
# 3. Check invoice amounts
chargebee-cli invoice get <invoice_id> | jq '.total'
SELECT total_amount FROM invoices WHERE id='<invoice_id>';

Resolution

Case 1: Chargebee is newer (DB is stale)

Terminal window
# Option A: Invalidate cache + re-fetch
curl -X POST http://localhost:8080/internal/cache/invalidate \
-d '{"customer_id": "<customer_id>"}'
# Then fetch again (will re-query Chargebee)
curl http://localhost:8080/api/subscriptions/<subscription_id>
# Option B: Full sync
curl -X POST http://localhost:8080/internal/billing/sync-all \
-d '{"since": "2026-04-08T00:00:00Z"}'

Case 2: DB is newer (Chargebee is stale)

This should be rare (we sync from Chargebee, not push to it). But can happen if:

  • Manual update was attempted but failed halfway
  • Chargebee API returned success but didn’t apply
Terminal window
# Re-apply the update to Chargebee
chargebee-cli subscription update <subscription_id> \
--new-field-name="<expected_value>"
# Then verify sync
curl -X POST http://localhost:8080/internal/billing/sync \
-d '{"subscription_id": "<subscription_id>"}'

Case 3: Manual change in Chargebee UI

Terminal window
# User or support made manual changes in Chargebee
# Sync DB to match
curl -X POST http://localhost:8080/internal/billing/sync \
-d '{"subscription_id": "<subscription_id>"}'
# Verify match
chargebee-cli subscription get <subscription_id> | jq '.status' > /tmp/cb_status
curl http://localhost:8080/api/subscriptions/<subscription_id> | jq '.status' > /tmp/db_status
diff /tmp/cb_status /tmp/db_status

Prevention

  • Monitor sync errors in logs (alert if > 0 per hour)
  • Monitor DB/Chargebee consistency gaps (audit query every 1 hour)
  • Disable manual Chargebee changes for “system” subscriptions; route through API

Circuit Breaker Triage

Symptom

“Circuit breaker is open” error in logs; Chargebee API calls failing

Root Causes

  1. Chargebee API is down (service issue)
  2. Our API key is invalid or rate-limited (configuration issue)
  3. Network connectivity problem (firewall, DNS, proxy)
  4. Sustained error rate too high (threshold exceeded; e.g., > 50% failures)

Investigation

Terminal window
# 1. Check circuit breaker state
curl http://localhost:8080/internal/health/circuit-breaker
# Output might show: {"chargebee": {"state": "open", "error_rate": 0.75}}
# 2. Check error logs
grep "circuit.*open" logs/billing.log | tail -20
# 3. Test Chargebee API directly
curl -H "Authorization: Bearer $CHARGEBEE_API_KEY" \
https://api.chargebee.com/api/v2/health
# 4. Check rate limiting
grep "rate.*limit\|429\|Please try after" logs/billing.log | tail -10
# 5. Check our API key configuration
echo $CHARGEBEE_API_KEY | head -c 10 # Show first 10 chars (don't log full key)
# If missing or changed recently: configuration issue

Resolution

Case 1: Chargebee API is actually down

Terminal window
# Wait for Chargebee to recover (~5-30 minutes typically)
# Monitor status page: https://status.chargebee.com
# In the meantime:
# - Primer payments may still work (Primer is independent)
# - Event polling will retry automatically (exponential backoff)
# - User-facing requests will fail; serve cached data if available
# Alert users (if outage > 30 minutes)
curl -X POST http://localhost:8080/internal/notifications/alert \
-d '{"message": "Subscription operations temporarily unavailable"}'

Case 2: Rate limiting

Terminal window
# Chargebee rate limit is per API key, per minute
# Standard limit: ~60 requests/minute
# Check rate limit errors
grep "Please try after some time" logs/billing.log | wc -l
# Reduce request frequency if possible:
# - Increase polling interval (15s → 30s)
# - Batch operations (list instead of individual gets)
# - Implement caching more aggressively
# Contact Chargebee if sustained: request rate limit increase

Case 3: Invalid/rotated API key

Terminal window
# Check if key was recently rotated
git log --all --grep="API_KEY\|chargebee" --oneline | head -5
# If key was rotated but env var not updated:
export CHARGEBEE_API_KEY="<new_key>"
# Or update .env file and restart service
systemctl restart billing-service
# Verify key works
curl -H "Authorization: Bearer $CHARGEBEE_API_KEY" \
https://api.chargebee.com/api/v2/items?limit=1

Case 4: Network connectivity

Terminal window
# Test DNS resolution
nslookup api.chargebee.com
# Test raw connectivity
nc -zv api.chargebee.com 443
# Check proxy settings (if behind proxy)
curl -v -x <proxy> https://api.chargebee.com/api/v2/health
# Check firewall rules
# If behind corporate firewall, ensure api.chargebee.com is whitelisted

Prevention

  • Monitor circuit breaker state (alert if open > 5 minutes)
  • Set up status page monitoring (Chargebee status page)
  • Implement graceful degradation (cached data when CB unavailable)
  • Test disaster scenario quarterly (simulate Chargebee down)

Rate Limit Spike

Symptom

Logs show "Please try after some time" errors; multiple retries with exponential backoff; user requests slow down

Root Causes

  1. Sudden traffic spike (load test, marketing campaign, viral growth)
  2. Polling worker making too many requests (config error, infinite loop)
  3. Chargebee’s rate limit lowered (API downgrade, or our usage miscounted)
  4. Retry storms (exponential backoff causing thundering herd)

Investigation

Terminal window
# 1. Check request volume in the last hour
grep "chargebee.*request" logs/billing.log | wc -l
# Compare to baseline
# 2. Count requests by operation
grep "chargebee.*request" logs/billing.log | \
sed 's/.*operation=//' | cut -d' ' -f1 | sort | uniq -c
# 3. Check retry attempts
grep "retry.*attempt" logs/billing.log | tail -20
# 4. Check if polling worker is looping
grep "events_polled" logs/billing.log | tail -20 | awk '{print $1}' | uniq | wc -l
# Should be ~60 entries over 15 seconds; if > 100: looping

Resolution

Case 1: Normal traffic spike

Terminal window
# Chargebee will recover automatically (1-2 minutes)
# Our system retries with backoff; requests will eventually succeed
# Monitor recovery
grep "rate.*limit\|Please try after" logs/billing.log | tail -1 | awk '{print $1}'
# If timestamp is > 2 minutes ago: recovered
# Check if user requests are succeeding now
curl -X GET http://localhost:8080/api/subscriptions/<sub_id> -w "%{http_code}"
# Should be 200

Case 2: Polling worker misconfiguration

Terminal window
# Check polling frequency
ps aux | grep polling-worker
# Look for: --frequency=15s (should be 15+ seconds)
# If frequency too low: update config
systemctl stop billing-poller
# Edit /etc/billing/config.yaml: set frequency to 30s
systemctl start billing-poller
# Monitor again
grep "events_polled" logs/billing.log | tail -5

Case 3: Chargebee rate limit was lowered

Terminal window
# This is rare; would require Chargebee proactively lowering your limit
# Check account status: https://app.chargebee.com/settings/your-account
# If limit was lowered:
# Option A: Request increase (contact support)
# Option B: Reduce request frequency (increase poll interval, batch requests)
# Calculate required frequency reduction
current_rps = $(grep "chargebee.*request" logs/billing.log | wc -l) / 3600
new_rps = min(60 / 60, 1.0) # 60 requests/minute = 1 RPS
reduction_factor = new_rps / current_rps
# Example: If we're doing 2 RPS and limit is 1 RPS, reduce by 50%

Case 4: Retry storms (thundering herd)

Terminal window
# If multiple requests all hit rate limit and retry simultaneously,
# backoff + jitter helps, but may still spike
# Check for synchronization:
grep "retry.*exponential" logs/billing.log | \
sed 's/.*attempt=//' | sort | uniq -c | sort -rn | head
# If one attempt number dominates: synchronized retries
# Add random jitter to retry delay
# (Already implemented in retry/retry.go with exponential backoff)
# If problem persists: contact engineering for backoff tuning

Prevention

  • Monitor chargebee_rate_limit_errors_total (alert if > 0)
  • Implement request batching (list instead of individual lookups)
  • Cache aggressively (extend TTLs, pre-warm cache)
  • Load test with Chargebee to understand rate limit behavior

Fraud Alert Response

Symptom

Multiple failed payment attempts; potential fraudster; fraud alert triggered (alert from Primer or PostHog)

Root Causes

  1. Testing (developer testing payment flows)
  2. Card decline (legitimate user’s card being rejected)
  3. Actual fraud (stolen card or account takeover)
  4. Billing system error (charging multiple times unintentionally)

Investigation

Terminal window
# 1. Get user details
SELECT user_id, email, created_at FROM users WHERE id='<user_id>';
# 2. Check payment history
SELECT transaction_id, amount, status, created_at FROM transactions
WHERE customer_id='<chargebee_customer_id>'
ORDER BY created_at DESC LIMIT 10;
# 3. Check if account is newly created (more likely to be fraud)
SELECT TIMESTAMPDIFF(MINUTE, created_at, NOW()) FROM users WHERE id='<user_id>';
# If < 30 min: new account (higher fraud risk)
# 4. Check activity pattern (testing vs real usage)
SELECT * FROM subscriptions WHERE customer_id='<chargebee_customer_id>';
# Multiple subscriptions in short time = potential testing
# 5. Check IP/device changes
SELECT ip_address, COUNT(*) FROM login_attempts
WHERE user_id='<user_id>'
GROUP BY ip_address;
# Multiple IPs = possible account compromise

Resolution

Case 1: Testing (developer)

Terminal window
# If internal user: No action needed
# If external user: Contact support to explain
# To prevent: Use separate test account with test API key
# Make sure test environment doesn't use production Chargebee

Case 2: Legitimate card decline

Terminal window
# Card was declined (insufficient funds, expired, etc.)
# User will retry naturally; no action needed
# But: Check if we're charging multiple times on decline
SELECT COUNT(*) FROM transactions
WHERE customer_id='<id>' AND status='failed'
ORDER BY created_at DESC LIMIT 5;
# Should be 1 failed transaction per user attempt
# If multiple failures from single user attempt: Bug in payment retry logic
# Contact engineering

Case 3: Actual fraud

Terminal window
# Steps:
# 1. Freeze account (disable further payments)
curl -X POST http://localhost:8080/internal/users/<user_id>/freeze
# 2. Void any pending invoices
chargebee-cli invoice void <invoice_id>
# 3. Notify user (send email)
curl -X POST http://localhost:8080/internal/notifications/alert \
-d '{"user_id": "<user_id>", "message": "We detected suspicious activity"}'
# 4. Contact payment processor (Primer)
# Report transaction ID to Primer support with details
# 5. Review transaction logs
grep "user_id=<user_id>" logs/billing.log logs/auth.log | tail -50
# Look for: unusual patterns, high-frequency attempts, geographic anomalies
# 6. Consider additional verification
# If high-value account: require 2FA, manual review before unlocking

Case 4: Billing system error (double-charging)

Terminal window
# Check for duplicate transactions
SELECT transaction_id, COUNT(*) FROM transactions
WHERE customer_id='<id>'
GROUP BY transaction_id
HAVING COUNT(*) > 1;
# If duplicates found: Data corruption issue
# 1. Contact engineering
# 2. Issue refund for duplicate charges
chargebee-cli credit-note create \
--customer-id=<customer_id> \
--amount=<duplicate_amount>
# 3. Fix root cause (idempotency check, locks, etc.)

Prevention

  • Set up fraud alerts in Primer dashboard (monitor > N failed attempts)
  • Monitor unusual payment patterns (>$1000/min, >10 charges/min)
  • Require verification for high-value accounts
  • Rate-limit payment attempts per user (max 5/hour)
  • Regular fraud report reviews (quarterly)

Quick Reference: Common Commands

Terminal window
# Check subscription status
chargebee-cli subscription get <subscription_id> | jq '.status, .coupon_ids'
# List unpaid invoices
chargebee-cli invoice list --filter='status:payment_due' --limit=50
# Void invoice
chargebee-cli invoice void <invoice_id>
# Create credit note (refund)
chargebee-cli credit-note create --customer-id=<id> --amount=<cents>
# Check payment transaction
chargebee-cli transaction get <transaction_id>
# Trigger manual event polling
curl -X POST http://localhost:8080/internal/billing/poll \
-H "Content-Type: application/json" \
-d '{"all_invoices": false}'
# Invalidate user cache
curl -X POST http://localhost:8080/internal/cache/invalidate \
-d '{"customer_id": "<customer_id>"}'
# Sync subscription from Chargebee
curl -X POST http://localhost:8080/internal/billing/sync \
-d '{"subscription_id": "<subscription_id>"}'
# Check circuit breaker status
curl http://localhost:8080/internal/health/circuit-breaker | jq .
# View billing logs
journalctl -u billing-service -n 100 -f
# Restart billing service
systemctl restart billing-service