Billing Operational Runbook
This document provides step-by-step procedures for common operational incidents involving the billing system.
Operational Architecture
Sirloin’s billing package owns subscription management, payment processing, checkout cleanup, renewal retries, product listing, credit allocation, cache invalidation, and billing analytics. The core package areas are:
- domain/: shared billing entities, sentinel errors, plan parsing, and analytics interfaces.
- chargebee/: Chargebee client wrapper and retry behavior.
- checkout/: Primer checkout creation and expired pending-subscription cleanup.
- payments/: unified payment recording, idempotency checks, and subscription activation.
- events/: Chargebee polling, credit extraction, refund detection, and analytics notifications.
- subscriptions/, renewals/, and products/: subscription lifecycle, renewal retry, and product display operations.
Operational incidents usually cross Chargebee, Primer, the Sirloin database, Redis/cache invalidation, and billing background workers. Treat Chargebee as the authoritative subscription and invoice source, and use local purchase records plus distributed locks to verify whether credits were applied exactly once.
Table of Contents
- Orphaned Payments
- Stuck Subscriptions
- Chargebee Sync Drift
- Circuit Breaker Triage
- Rate Limit Spike
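- Fraud Alert Response
- Quota Exhaustion (Authorization Attempt Limits)
- NSFW Renewal Card Substitution Issues
- Processor Assignment Issues
- Quick Reference: Common Commands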
Orphaned Payments
Symptom
User reports: “I paid but my subscription isn’t active” or “Credits didn’t appear”
Root Causes
- Primer webhook didn’t arrive (network issue)
- Polling worker crashed (missing fallback detection)
- Idempotency key mismatch (Primer txn ID not recorded)
- Chargebee invoice not marked as paid
- EventPoller hasn’t run yet (payment too recent; <15 seconds old)
Investigation
# 1. Check Chargebee for invoice status
chargebee-cli invoice get <invoice_id>
# Look for: status = "paid", payment_received, transaction_id
# 2. Check local DB for purchase record
SELECT * FROM purchases WHERE invoice_id = '<invoice_id>';
# If empty: payment not recorded locally
# 3. Check logs for payment recording errors
grep "invoice_id=<invoice_id>" logs/billing.log
# Look for: errors, rate limiting, lock failures
# 4. Check polling worker status
SELECT * FROM event_poller_state ORDER BY created_at DESC LIMIT 10;
# If > 5 min old: polling worker may be stuck
# 5. Check if payment is recent
SELECT EXTRACT(EPOCH FROM (NOW() - paid_at)) AS age_seconds FROM invoices WHERE id='<invoice_id>';
# If < 15 seconds: polling hasn't run yet; wait
Resolution
Case 1: Chargebee shows paid; DB shows no purchase
# Trigger manual event processing
curl -X POST http://localhost:8080/internal/billing/poll \
  -H "Content-Type: application/json" \
  -d '{"all_invoices": false, "since": "2026-04-08T12:00:00Z"}'
# Wait 30 seconds, check if credits appeared
SELECT credits FROM users WHERE id = '<user_id>';
Case 2: Chargebee shows unpaid
# Check if payment was actually captured in Primer
primer-cli transaction get <primer_transaction_id>
# If status = "authorized" (not "captured"):
# → Contact Primer support; payment not actually settled
# If status = "captured":
# → Payment was captured by Primer but not recorded in Chargebee
# → See "Chargebee Sync Drift" section below
Case 3: Polling worker stuck
# Check worker health
systemctl status billing-poller
# If not running: systemctl start billing-poller
# Check logs
journalctl -u billing-poller -n 100 --no-pager
# Look for: rate limit errors, API errors, crashes
# If rate-limited: Wait 60 seconds for backoff, then restart
systemctl restart billing-poller
Case 4: Idempotency check failed
# Verify transaction ID format
SELECT transaction_id FROM purchases WHERE invoice_id='<invoice_id>';
# Should match Primer transaction ID
# If mismatch: Data corruption issue
# → Contact engineering; may need DB cleanup
Slack Alert: “ORPHANED PAYMENT RECOVERY EXHAUSTED - manual review required”
The Primer settled-payments poller (TaskPollPrimerPayments) auto-recovers payments that settled for a canceled subscription (checkout-retry orphans). Recovery retries with exponential backoff (starting at 1m, capped at 30m) for 10 attempts over ~2.5h; the polling cursor holds behind the payment so every attempt actually runs. Typical orphans clear on the first one or two attempts, within minutes. The ~2.5h figure is the worst-case ceiling before exhaustion, not the usual wait. The user-facing checkout-timeout toast intentionally quotes an optimistic ~30 min rather than that ceiling so it does not alarm users; treat that copy as deliberate, not a stale number. This alert means all 10 attempts failed: the payment is now human-owned and the poller has moved on.
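The ~2.5h ceiling can be reproduced with a short calculation. The sketch below assumes the delay doubles after each failed attempt and that the first attempt runs immediately; the runbook states neither, only the 1m start, the 30m cap, and the 10 attempts. `backoff_schedule` is a hypothetical helper, not part of the billing code.

```shell
# backoff_schedule BASE_MIN CAP_MIN ATTEMPTS
# Prints the wait before each retry and the total wait in minutes.
# ASSUMPTION: delays double (1m, 2m, 4m, ...) until they reach the cap, and
# the first attempt is immediate, so N attempts have N-1 waits between them.
backoff_schedule() {
  base=$1; cap=$2; attempts=$3
  delay=$base; total=0; n=1
  while [ "$n" -lt "$attempts" ]; do
    echo "wait before attempt $((n + 1)): ${delay}m"
    total=$((total + delay))
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
    n=$((n + 1))
  done
  echo "total: ${total}m"
}

backoff_schedule 1 30 10
```

Under those assumptions the nine waits (1, 2, 4, 8, 16, then four at the 30m cap) sum to 151 minutes, which is consistent with the ~2.5h worst case quoted above.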
# 1. Pull the payment from Primer using the ID in the alert
primer-cli payment get <payment_id>
# Confirm status (SETTLED = money captured) and customerId
# 2. Check whether the user already got what they paid for
SELECT * FROM purchases WHERE user_id = '<customer_id>' ORDER BY created_at DESC LIMIT 5;
chargebee-cli subscription list --customer <customer_id>
# 3a. User has an active sub for the same plan → duplicate charge, refund via Primer
# 3b. User has nothing → manual recovery: create subscription + record payment
# (mirror recoverOrphanedPayment steps), or refund if the user re-purchased
Note: the retry counter is in-memory, so a deploy mid-cycle restarts the 10 attempts and can produce a second exhaustion alert for the same payment. Deduplicate alerts by payment ID.
Slack Alert: same-plan duplicate charge during orphan recovery
Recovery found the user already holds an active subscription for the plan the orphaned payment paid for — the user paid twice for one intent (checkout double-click). The payment is marked processed; the alert is the refund workflow trigger. Refund the alerted payment ID in Primer.
Prevention
- Monitor payment_recording_latency_seconds (alert if > 60s)
- Monitor events_polled_total (alert if 0 for > 5 min)
- Set up Primer webhook retry alerts
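The orphaned-payment investigation and resolution cases in this section can be read as a small decision table. The sketch below is one possible encoding of it; `triage_orphan` and its three inputs are hypothetical and would be filled in by hand from the Chargebee status, the purchase-record query, and the invoice-age query.

```shell
# triage_orphan CHARGEBEE_STATUS PURCHASE_ROWS AGE_SECONDS
# CHARGEBEE_STATUS: invoice status reported by Chargebee (e.g. "paid")
# PURCHASE_ROWS:    number of rows found in the local purchases table
# AGE_SECONDS:      seconds since the invoice was paid
# ASSUMPTION: the checks are evaluated in this order; the runbook lists the
# cases but does not prescribe a precedence.
triage_orphan() {
  cb_status=$1; purchase_rows=$2; age=$3
  if [ "$age" -lt 15 ]; then
    echo "wait: polling has not run yet"
  elif [ "$cb_status" != "paid" ]; then
    echo "case 2: check capture status in Primer"
  elif [ "$purchase_rows" -eq 0 ]; then
    echo "case 1: trigger manual event processing"
  else
    echo "recorded: verify transaction ID (case 4)"
  fi
}

triage_orphan paid 0 120
```

The polling-worker check (Case 3) is left out on purpose, since it depends on service state rather than on a single invoice.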
Stuck Subscriptions
Symptom
User’s subscription is in “future” or “non_renewing” state when it should be “active” or “cancelled”
Root Causes
- Pending checkout never completed (user started but didn’t finish)
- Activation failed (update_term_end API error)
- Cancellation API error (returned error but was actually canceled)
- Chargebee-side state mismatch (DB disagrees with Chargebee)
Investigation
# 1. Check local DB
SELECT id, status, start_date, current_term_end FROM subscriptions
WHERE customer_id='<customer_id>';
# 2. Check Chargebee
chargebee-cli subscription get <subscription_id>
# Compare status between DB and Chargebee
# 3. Check if subscription is pending checkout
SELECT * FROM subscriptions WHERE status='future' AND start_date > NOW() + INTERVAL '1 YEAR';
# If matches: this is a pending checkout
# 4. Check age
SELECT EXTRACT(DAY FROM (NOW() - created_at)) AS age_days FROM subscriptions WHERE id='<subscription_id>';
# If > 1 day: eligible for cleanup
Resolution
Case 1: Subscription is future + start_date > 1 year (pending checkout)
# Is checkout still active?
curl -X GET "http://localhost:8080/api/checkout/status?order_id=<invoice_id>"
# If "expired": User didn't complete payment
# Option A: User wants to retry
# → Recreate checkout (new Primer session)
curl -X POST http://localhost:8080/api/billing/checkout \
  -d '{"user_id": "<user_id>", "item_price_id": "<item_price_id>"}'
# Option B: Cleanup (if > 1 day old)
# → Manually cancel
chargebee-cli subscription cancel <subscription_id>
# Then re-create if user wants to retry
Case 2: Subscription should be active but is still future
# Payment was recorded but activation failed?
SELECT * FROM purchases WHERE subscription_id='<subscription_id>';
# If empty: payment never recorded
# If purchase exists: activation failed
# → Try to manually activate
curl -X POST http://localhost:8080/internal/billing/activate \
  -d '{"subscription_id": "<subscription_id>"}'
# If successful: Check Chargebee status updated
chargebee-cli subscription get <subscription_id> | grep status
Case 3: Subscription should be cancelled but is still active
# Was cancellation API called?
grep "subscription_id=<subscription_id>" logs/billing.log | grep cancel
# If no matches: cancellation was never initiated
# If matches but status still "active":
# → Chargebee API returned success but didn't actually update
# → Try again
chargebee-cli subscription cancel <subscription_id> --force
# If still doesn't work: Chargebee issue
# → Contact Chargebee support with subscription ID
Case 4: DB/Chargebee mismatch
# Force sync from Chargebee to DB
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
# Verify match
chargebee-cli subscription get <subscription_id> > /tmp/cb.json
curl http://localhost:8080/api/subscriptions/<subscription_id> > /tmp/db.json
diff /tmp/cb.json /tmp/db.json
Prevention
- Monitor subscriptions in “future” state (alert if older than 1 day and start_date is more than 1 year out)
- Monitor activation errors in logs (alert if > 1% of payments)
- Test pending checkout cleanup regularly
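The pending-checkout rule used throughout this section (status "future", start_date more than a year out, cleanup once older than a day) can be expressed as a small classifier. This is a sketch under the assumption that all times are handled as Unix timestamps and that a year is 365 days; `classify_future_sub` is a hypothetical helper, not the cleanup job itself.

```shell
# classify_future_sub START_EPOCH CREATED_EPOCH NOW_EPOCH
# Classifies a subscription that is already known to be in "future" state.
# All arguments are Unix timestamps in seconds.
classify_future_sub() {
  start=$1; created=$2; now=$3
  year=$((365 * 24 * 3600)); day=$((24 * 3600))
  if [ "$start" -le $((now + year)) ]; then
    echo "scheduled"   # starts within a year: a genuine future subscription
  elif [ $((now - created)) -gt "$day" ]; then
    echo "cleanup"     # pending checkout older than 1 day
  else
    echo "pending"     # pending checkout still inside the 1-day window
  fi
}

classify_future_sub 1900000000 1699000000 1700000000
```

Only the "cleanup" result corresponds to Option B above; "pending" subscriptions may still complete checkout.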
Chargebee Sync Drift
Symptom
DB shows subscription in different state than Chargebee, or invoice amounts don’t match
Root Causes
- Failed API call that partially succeeded (Chargebee changed, DB didn’t)
- Stale cache (local cache hasn’t been invalidated)
- Manual change in Chargebee UI (not synced back to DB)
- Eventual consistency window (recent change, not synced yet)
Investigation
# 1. Get current state from both systems
chargebee-cli subscription get <subscription_id> | jq . > /tmp/chargebee.json
curl -X GET http://localhost:8080/api/subscriptions/<subscription_id> | jq . > /tmp/db.json
# 2. Compare critical fields
diff <(jq '.status, .current_term_end, .coupon_ids' /tmp/chargebee.json) \
  <(jq '.status, .current_term_end, .coupon_ids' /tmp/db.json)
# 3. Check invoice amounts
chargebee-cli invoice get <invoice_id> | jq '.total'
SELECT total_amount FROM invoices WHERE id='<invoice_id>';
Resolution
Case 1: Chargebee is newer (DB is stale)
# Option A: Invalidate cache + re-fetch
curl -X POST http://localhost:8080/internal/cache/invalidate \
  -d '{"customer_id": "<customer_id>"}'
# Then fetch again (will re-query Chargebee)
curl http://localhost:8080/api/subscriptions/<subscription_id>
# Option B: Full sync
curl -X POST http://localhost:8080/internal/billing/sync-all \
  -d '{"since": "2026-04-08T00:00:00Z"}'
Case 2: DB is newer (Chargebee is stale)
This should be rare, because we sync from Chargebee rather than push to it, but it can happen if:
- Manual update was attempted but failed halfway
- Chargebee API returned success but didn’t apply
# Re-apply the update to Chargebee
chargebee-cli subscription update <subscription_id> \
  --new-field-name="<expected_value>"
# Then verify sync
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
Case 3: Manual change in Chargebee UI
# User or support made manual changes in Chargebee
# Sync DB to match
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
# Verify match
chargebee-cli subscription get <subscription_id> | jq '.status' > /tmp/cb_status
curl http://localhost:8080/api/subscriptions/<subscription_id> | jq '.status' > /tmp/db_status
diff /tmp/cb_status /tmp/db_status
Prevention
- Monitor sync errors in logs (alert if > 0 per hour)
- Monitor DB/Chargebee consistency gaps (audit query every 1 hour)
- Disable manual Chargebee changes for “system” subscriptions; route through API
Circuit Breaker Triage
Symptom
“Circuit breaker is open” error in logs; Chargebee API calls failing
Root Causes
- Chargebee API is down (service issue)
- Our API key is invalid or rate-limited (configuration issue)
- Network connectivity problem (firewall, DNS, proxy)
- Sustained error rate too high (threshold exceeded; e.g., > 50% failures)
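The last root cause describes a threshold on the failure rate. A minimal sketch of that check follows, assuming a simple two-state breaker and the 50% example threshold. The real breaker may use a sliding window or a half-open state, which this does not model, and `breaker_state` is a hypothetical helper.

```shell
# breaker_state FAILURES TOTAL THRESHOLD_PERCENT
# Prints "open" when the failure rate is strictly above the threshold,
# otherwise "closed". With no requests at all the breaker stays closed.
breaker_state() {
  failures=$1; total=$2; threshold=$3
  if [ "$total" -eq 0 ]; then
    echo "closed"
    return
  fi
  # Integer form of failures/total > threshold/100, avoiding division.
  if [ $((failures * 100)) -gt $((threshold * total)) ]; then
    echo "open"
  else
    echo "closed"
  fi
}

breaker_state 75 100 50
```

The example call uses 75 failures out of 100 requests, matching the error_rate of 0.75 shown in the sample health output in the investigation steps.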
Investigation
# 1. Check circuit breaker state
curl http://localhost:8080/internal/health/circuit-breaker
# Output might show: {"chargebee": {"state": "open", "error_rate": 0.75}}
# 2. Check error logs
grep "circuit.*open" logs/billing.log | tail -20
# 3. Test Chargebee API directly
# (Chargebee uses HTTP Basic auth: API key as username, empty password; the host is your site subdomain)
curl -u "$CHARGEBEE_API_KEY:" "https://<site>.chargebee.com/api/v2/items?limit=1"
# 4. Check rate limiting
grep "rate.*limit\|429\|Please try after" logs/billing.log | tail -10
# 5. Check our API key configuration
echo $CHARGEBEE_API_KEY | head -c 10 # Show first 10 chars (don't log full key)
# If missing or changed recently: configuration issue
Resolution
Case 1: Chargebee API is actually down
# Wait for Chargebee to recover (~5-30 minutes typically)
# Monitor status page: https://status.chargebee.com
# In the meantime:
# - Primer payments may still work (Primer is independent)
# - Event polling will retry automatically (exponential backoff)
# - User-facing requests will fail; serve cached data if available
# Alert users (if outage > 30 minutes)
curl -X POST http://localhost:8080/internal/notifications/alert \
  -d '{"message": "Subscription operations temporarily unavailable"}'
Case 2: Rate limiting
# Chargebee rate limit is per API key, per minute
# Standard limit: ~60 requests/minute
# Check rate limit errors
grep "Please try after some time" logs/billing.log | wc -l
# Reduce request frequency if possible:
# - Increase polling interval (15s → 30s)
# - Batch operations (list instead of individual gets)
# - Implement caching more aggressively
# Contact Chargebee if sustained: request rate limit increase
Case 3: Invalid/rotated API key
# Check if key was recently rotated
git log --all --grep="API_KEY\|chargebee" --oneline | head -5
# If key was rotated but env var not updated:
export CHARGEBEE_API_KEY="<new_key>"
# Or update .env file and restart service
systemctl restart billing-service
# Verify key works (Chargebee uses HTTP Basic auth: API key as username, empty password)
curl -u "$CHARGEBEE_API_KEY:" "https://<site>.chargebee.com/api/v2/items?limit=1"
Case 4: Network connectivity
# Test DNS resolution
nslookup <site>.chargebee.com
# Test raw connectivity
nc -zv <site>.chargebee.com 443
# Check proxy settings (if behind proxy)
curl -v -x <proxy> -u "$CHARGEBEE_API_KEY:" "https://<site>.chargebee.com/api/v2/items?limit=1"
# Check firewall rules
# If behind corporate firewall, ensure <site>.chargebee.com is whitelisted
Prevention
- Monitor circuit breaker state (alert if open > 5 minutes)
- Set up status page monitoring (Chargebee status page)
- Implement graceful degradation (cached data when CB unavailable)
- Test disaster scenario quarterly (simulate Chargebee down)
Rate Limit Spike
Symptom
Logs show "Please try after some time" errors; multiple retries with exponential backoff; user requests slow down
Root Causes
- Sudden traffic spike (load test, marketing campaign, viral growth)
- Polling worker making too many requests (config error, infinite loop)
- Chargebee’s rate limit lowered (API downgrade, or our usage miscounted)
- Retry storms (exponential backoff causing thundering herd)
Investigation
# 1. Check request volume in the last hour
grep "chargebee.*request" logs/billing.log | wc -l
# Compare to baseline
# 2. Count requests by operation
grep "chargebee.*request" logs/billing.log | \
  sed 's/.*operation=//' | cut -d' ' -f1 | sort | uniq -c
# 3. Check retry attempts
grep "retry.*attempt" logs/billing.log | tail -20
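# 4. Check if polling worker is looping
grep "events_polled" logs/billing.log | tail -20 | awk '{print $1}' | uniq | wc -l
# Counts distinct timestamps among the last 20 poll entries (assumes the first log field is a per-second timestamp)
# At a 15-second interval every poll has its own timestamp, so expect ~20; a much lower count means several polls per second: looping
Resolution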
Case 1: Normal traffic spike
# Chargebee will recover automatically (1-2 minutes)
# Our system retries with backoff; requests will eventually succeed
# Monitor recovery
grep "rate.*limit\|Please try after" logs/billing.log | tail -1 | awk '{print $1}'
# If timestamp is > 2 minutes ago: recovered
# Check if user requests are succeeding now (prints only the HTTP status code)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/api/subscriptions/<sub_id>
# Should be 200
Case 2: Polling worker misconfiguration
# Check polling frequency
ps aux | grep polling-worker
# Look for: --frequency=15s (should be 15+ seconds)
# If frequency too low: update config
systemctl stop billing-poller
# Edit /etc/billing/config.yaml: set frequency to 30s
systemctl start billing-poller
# Monitor again
grep "events_polled" logs/billing.log | tail -5
Case 3: Chargebee rate limit was lowered
# This is rare; would require Chargebee proactively lowering your limit
# Check account status: https://app.chargebee.com/settings/your-account
# If limit was lowered:
# Option A: Request increase (contact support)
# Option B: Reduce request frequency (increase poll interval, batch requests)
# Calculate required frequency reduction (assumes logs/billing.log covers the last hour)
requests=$(grep -c "chargebee.*request" logs/billing.log)
# limit=1 because 60 requests/minute = 1 RPS
awk -v n="$requests" -v limit=1 'BEGIN { cur = n / 3600; if (cur > 0) printf "current_rps=%.2f reduction_factor=%.2f\n", cur, limit / cur }'
# Example: If we're doing 2 RPS and limit is 1 RPS, reduction_factor is 0.50 (reduce by 50%)
Case 4: Retry storms (thundering herd)
# If multiple requests all hit rate limit and retry simultaneously,
# backoff + jitter helps, but may still spike
# Check for synchronization:
grep "retry.*exponential" logs/billing.log | \
  sed 's/.*attempt=//' | sort | uniq -c | sort -rn | head
# If one attempt number dominates: synchronized retries
# Add random jitter to retry delay
# (Already implemented in retry/retry.go with exponential backoff)
# If problem persists: contact engineering for backoff tuning
Prevention
- Monitor chargebee_rate_limit_errors_total (alert if > 0)
- Implement request batching (list instead of individual lookups)
- Cache aggressively (extend TTLs, pre-warm cache)
- Load test with Chargebee to understand rate limit behavior
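The retry-storm case relies on jitter to keep clients from retrying in lockstep. The sketch below illustrates one common variant, "full jitter", where each delay is drawn uniformly between 0 and the capped exponential ceiling. It is not a description of retry/retry.go, whose strategy the runbook does not specify; `jittered_delay`, its parameters, and the explicit seed argument are all hypothetical.

```shell
# jittered_delay ATTEMPT BASE_SEC CAP_SEC SEED
# Prints a delay in whole seconds, uniform in [0, min(CAP, BASE * 2^(ATTEMPT-1))].
# SEED makes the draw reproducible; callers would pass something that differs
# per process and per attempt so that clients do not pick the same delay.
jittered_delay() {
  attempt=$1; base=$2; cap=$3; seed=$4
  ceiling=$base; i=1
  while [ "$i" -lt "$attempt" ] && [ "$ceiling" -lt "$cap" ]; do
    ceiling=$((ceiling * 2))
    i=$((i + 1))
  done
  if [ "$ceiling" -gt "$cap" ]; then ceiling=$cap; fi
  awk -v max="$ceiling" -v seed="$seed" \
    'BEGIN { srand(seed); print int(rand() * (max + 1)) }'
}

jittered_delay 3 1 30 "$$"
```

When the attempt-number histogram from the synchronization check is dominated by one value, comparing the logged delays against a spread like this can show whether jitter is actually being applied.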
Fraud Alert Response
Symptom
Multiple failed payment attempts; potential fraudster; fraud alert triggered (alert from Primer or PostHog)
Root Causes
- Testing (developer testing payment flows)
- Card decline (legitimate user’s card being rejected)
- Actual fraud (stolen card or account takeover)
- Billing system error (charging multiple times unintentionally)
Investigation
# 1. Get user details
SELECT id, email, created_at FROM users WHERE id='<user_id>';
# 2. Check payment history
SELECT transaction_id, amount, status, created_at FROM transactions
WHERE customer_id='<chargebee_customer_id>'
ORDER BY created_at DESC LIMIT 10;
# 3. Check if account is newly created (more likely to be fraud)
SELECT EXTRACT(EPOCH FROM (NOW() - created_at)) / 60 AS age_minutes FROM users WHERE id='<user_id>';
# If < 30 min: new account (higher fraud risk)
# 4. Check activity pattern (testing vs real usage)
SELECT * FROM subscriptions WHERE customer_id='<chargebee_customer_id>';
# Multiple subscriptions in short time = potential testing
# 5. Check IP/device changes
SELECT ip_address, COUNT(*) FROM login_attempts
WHERE user_id='<user_id>'
GROUP BY ip_address;
# Multiple IPs = possible account compromise
Resolution
Case 1: Testing (developer)
# If internal user: No action needed
# If external user: Contact support to explain
# To prevent: Use separate test account with test API key
# Make sure test environment doesn't use production Chargebee
Case 2: Legitimate card decline
# Card was declined (insufficient funds, expired, etc.)
# User will retry naturally; no action needed
# But: Check if we're charging multiple times on decline
# (lists the most recent failures; COUNT(*) cannot be combined with ORDER BY created_at)
SELECT transaction_id, created_at FROM transactions
WHERE customer_id='<id>' AND status='failed'
ORDER BY created_at DESC LIMIT 5;
# Should be 1 failed transaction per user attempt
# If multiple failures from single user attempt: Bug in payment retry logic
# Contact engineering
Case 3: Actual fraud
# Steps:
# 1. Freeze account (disable further payments)
curl -X POST http://localhost:8080/internal/users/<user_id>/freeze
# 2. Void any pending invoices
chargebee-cli invoice void <invoice_id>
# 3. Notify user (send email)
curl -X POST http://localhost:8080/internal/notifications/alert \
  -d '{"user_id": "<user_id>", "message": "We detected suspicious activity"}'
# 4. Contact payment processor (Primer)
# Report transaction ID to Primer support with details
# 5. Review transaction logs
grep "user_id=<user_id>" logs/billing.log logs/auth.log | tail -50
# Look for: unusual patterns, high-frequency attempts, geographic anomalies
# 6. Consider additional verification
# If high-value account: require 2FA, manual review before unlocking
Case 4: Billing system error (double-charging)
# Check for duplicate transactions
SELECT transaction_id, COUNT(*) FROM transactions
WHERE customer_id='<id>'
GROUP BY transaction_id
HAVING COUNT(*) > 1;
# If duplicates found: Data corruption issue
# 1. Contact engineering
# 2. Issue refund for duplicate charges
chargebee-cli credit-note create \
  --customer-id=<customer_id> \
  --amount=<duplicate_amount>
# 3. Fix root cause (idempotency check, locks, etc.)
Prevention
- Set up fraud alerts in Primer dashboard (monitor > N failed attempts)
- Monitor unusual payment patterns (>$1000/min, >10 charges/min)
- Require verification for high-value accounts
- Authorization attempt quota enforced per user (NSFW tier, production only): 5 daily / 20 weekly / 30 monthly. Customer-initiated transactions (CIT) leave 1 slot reserved for merchant-initiated (MIT) renewals, so the effective CIT limits are 4/19/29
- Regular fraud report reviews (quarterly)
Quota Exhaustion (Authorization Attempt Limits)
Symptom
User sees “checkout temporarily blocked due to too many payment attempts” or renewal retry logs show MIT blocked: authorization count quota exhausted.
Investigation
-- Check user's authorization attempts in last 24h / 7d / 30d
SELECT event_type, created_at, metadata->>'merchant_context' as ctx, metadata->>'amount' as amount
FROM fraud_events
WHERE user_id = '<user_id>'
  AND event_type = 'authorization_attempt'
  AND created_at > NOW() - INTERVAL '30 days'
ORDER BY created_at DESC;
-- Check if velocity block was triggered
SELECT * FROM fraud_events
WHERE user_id = '<user_id>'
  AND event_type = 'velocity_triggered'
ORDER BY created_at DESC LIMIT 5;
-- Check dunning attempts with quota_blocked status
SELECT * FROM dunning_attempts
WHERE invoice_id IN (
  SELECT invoice_id FROM chargebee_invoices WHERE customer_id = '<customer_id>'
)
ORDER BY created_at DESC LIMIT 10;
-- Count per window
SELECT
  COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '1 day') as daily,
  COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '7 days') as weekly,
  COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '30 days') as monthly
FROM fraud_events
WHERE user_id = '<user_id>'
  AND event_type = 'authorization_attempt'
  AND metadata->>'merchant_context' = 'full';
-- Limits: 5 daily / 20 weekly / 30 monthly (CIT: 4/19/29)
Resolution
Legitimate user hit limit: Wait for the window to expire. Windows are rolling: the daily limit frees a slot 24h after the oldest attempt in the window.
Fraud/abuse: No action needed — quota is working as intended. Monitor for escalation.
Bug (false blocks): Check if authorization_attempt rows are being written without corresponding payment attempts. Look for duplicate inserts or stale events.
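To make the limits concrete, the sketch below checks the three window counts from the "Count per window" query against the quota, applying the one-slot reservation for CIT. It assumes an attempt is allowed while the count of prior attempts is strictly below the limit; the runbook does not state whether the comparison is inclusive. `quota_allows` is a hypothetical helper.

```shell
# quota_allows CONTEXT DAILY WEEKLY MONTHLY
# CONTEXT: "cit" (customer-initiated) or "mit" (merchant-initiated renewal)
# DAILY/WEEKLY/MONTHLY: prior authorization attempts counted in each window
# Limits are 5/20/30; CIT keeps one slot free for MIT, giving 4/19/29.
quota_allows() {
  context=$1; daily=$2; weekly=$3; monthly=$4
  reserve=0
  if [ "$context" = "cit" ]; then reserve=1; fi
  if [ "$daily" -lt $((5 - reserve)) ] &&
     [ "$weekly" -lt $((20 - reserve)) ] &&
     [ "$monthly" -lt $((30 - reserve)) ]; then
    echo "allow"
  else
    echo "block"
  fi
}

quota_allows cit 4 4 4   # fourth daily slot is reserved for MIT
quota_allows mit 4 4 4   # the renewal can still use the reserved slot
```

A user who is blocked at checkout while renewals still succeed is therefore expected behavior under the reservation, not a false block.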
NSFW Renewal Card Substitution Issues
Symptom
NSFW renewal fails with EMP hard decline despite user having a saved card. Or: renewal unexpectedly uses a different card than the user’s primary.
Investigation
# 1. Check if card substitution was attempted
grep "nsfw_renewal" logs/billing.log | grep "<customer_id>"
# Look for: nsfw_renewal_card_substitution (swapped) or nsfw_renewal_no_card_found
# 2. Check user's saved payment instruments
# Query Primer for customer's vaulted methods
primer-cli customer payment-instruments list <customer_id>
# Look for: PAYMENT_CARD type instruments that are not deleted/expired
# 3. Check if primary was auto-set after checkout
grep "nsfw_checkout_set_primary" logs/billing.log | grep "<customer_id>"
# If nsfw_checkout_set_primary_failed: Primer SetDefault API failed
# 4. Check what the user's current default instrument is
primer-cli customer payment-instruments list <customer_id> | grep '"default": true'
Resolution
Case 1: Card substitution worked but EMP still declined
Card may be expired or have insufficient funds. Check the decline code in the dunning attempt record. This is normal dunning behavior — not a substitution bug.
Case 2: No card found, wallet sent to EMP
User only has wallet methods saved. Expected behavior — EMP rejects, subscription enters dunning. User needs to add a card through checkout (frontend forces fresh card entry on SFW→NSFW switch).
Case 3: Auto-set primary failed after checkout
maybeSetNSFWCheckoutPrimary is non-fatal. If it failed, the user’s old primary remains. Next scheduled renewal will attempt card substitution via resolveNSFWRenewalPaymentToken. No manual action needed unless the user has no saved cards at all.
Prevention
- Monitor nsfw_renewal_no_card_found log frequency (rising trend = users without compatible cards)
- Monitor nsfw_checkout_set_primary_failed (Primer API issues)
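The substitution behavior described in this section amounts to "use a usable card if one exists, otherwise report that none was found". The sketch below shows that selection over a simplified listing. The input format (one `ID TYPE STATE` line per instrument), the `active` state value, and the choice of the first matching card are all assumptions for illustration; the actual logic lives in resolveNSFWRenewalPaymentToken and may differ.

```shell
# pick_renewal_instrument
# Reads "ID TYPE STATE" lines on stdin and prints the ID of the first
# PAYMENT_CARD whose state is "active". If none qualifies, prints the
# nsfw_renewal_no_card_found marker used in the logs.
pick_renewal_instrument() {
  awk '
    $2 == "PAYMENT_CARD" && $3 == "active" { print $1; found = 1; exit }
    END { if (!found) print "nsfw_renewal_no_card_found" }
  '
}

printf '%s\n' \
  'pi_1 APPLE_PAY active' \
  'pi_2 PAYMENT_CARD expired' \
  'pi_3 PAYMENT_CARD active' | pick_renewal_instrument
```

In the example, the wallet and the expired card are skipped and pi_3 is selected; a wallet-only customer would produce the no-card marker, which corresponds to Case 2 above.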
Processor Assignment Issues
Symptom
- User routed to wrong PSP (e.g., NSFW payment hitting Cybersource instead of EMP/NMI)
- metadata["psp"] in Primer dashboard doesn’t match expected processor
- Checkout fails with processor-related error
Investigation
-- Check user's assigned processors
SELECT user_id, sfw_processor, nsfw_processor FROM users.credits WHERE user_id = '<user_id>';
-- Find users with no processor assignment (pre-backfill)
SELECT COUNT(*) FROM users.credits WHERE nsfw_processor IS NULL;
Check Primer dashboard for the transaction’s metadata.psp field to confirm what was sent.
Resolution
- Wrong processor stored: Manual fix via SQL (only if user hasn’t transacted on the stored processor):
UPDATE users.credits SET nsfw_processor = 'emp' WHERE user_id = '<user_id>';
- Missing assignment (pre-backfill): Run backfill:
UPDATE users.credits SET sfw_processor = 'cybersource' WHERE sfw_processor IS NULL;
UPDATE users.credits SET nsfw_processor = 'emp' WHERE nsfw_processor IS NULL;
- ResolveProcessor returning error: Check DB connectivity. The first GetCredits call is fail-hard (blocks checkout). Subsequent calls in the ensure/assign path are soft fallbacks.
Prevention
- Backfill all existing users before switching decideProcessorForNewUser to return NMI
- Monitor "failed to ensure credits row" and "failed to persist processor assignment" warn logs
Quick Reference: Common Commands
# Check subscription status
chargebee-cli subscription get <subscription_id> | jq '.status, .coupon_ids'
# List unpaid invoices
chargebee-cli invoice list --filter='status:payment_due' --limit=50
# Void invoice
chargebee-cli invoice void <invoice_id>
# Create credit note (refund)
chargebee-cli credit-note create --customer-id=<id> --amount=<cents>
# Check payment transaction
chargebee-cli transaction get <transaction_id>
# Trigger manual event polling
curl -X POST http://localhost:8080/internal/billing/poll \
  -H "Content-Type: application/json" \
  -d '{"all_invoices": false}'
# Invalidate user cache
curl -X POST http://localhost:8080/internal/cache/invalidate \
  -d '{"customer_id": "<customer_id>"}'
# Sync subscription from Chargebee
curl -X POST http://localhost:8080/internal/billing/sync \
  -d '{"subscription_id": "<subscription_id>"}'
# Check circuit breaker status
curl http://localhost:8080/internal/health/circuit-breaker | jq .
# View billing logs
journalctl -u billing-service -n 100 -f
# Restart billing service
systemctl restart billing-service