Billing Service Level Objectives (SLOs)
This draft defines target SLOs, SLIs, error budgets, and alerting policies for the billing system.
The current code registers base OpenTelemetry instruments in apps/sirloin/internal/app/services/billing/metrics/: billing_payment_total, billing_checkout_duration_seconds, billing_subscription_activated_total, billing_dunning_retry_total, billing_credit_applied_total, billing_fraud_event_total, and billing_rate_limit_hits_total. Derived success-rate, consistency, latency, and dashboard artifacts below are target observability work unless explicitly listed as existing.
Overview
Billing is a critical path for revenue. The system must be highly reliable and observable.
SLO Philosophy
- User-focused: Measure what users experience (credit latency, success rate)
- Conservative: Set realistic targets; miss them rarely
- Actionable: Alert when trending toward breach; include remediation steps
Core SLOs
1. Payment Success Rate
Definition: Percentage of completed Primer payments that successfully record in our system.
SLI: (successful_payments) / (payments_completed_in_primer)
Target: 99.9% (allow 0.1% error rate = 1 failure per 1000 payments)
Measurement:
- Count: Primer webhooks + polling results
- Include: First attempt + retries
- Exclude: User-canceled payments, declined cards (not our system’s fault)
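As a rough illustration of the SLI and its exclusions, the sketch below computes the ratio from a list of per-payment outcomes. The outcome labels (`recorded`, `record_failed`, `user_canceled`, `card_declined`) are hypothetical; the attributes actually attached to billing_payment_total may differ.

```python
from collections import Counter

# Hypothetical outcome labels; not taken from the billing metrics package.
EXCLUDED = {"user_canceled", "card_declined"}  # not our system's fault


def payment_success_rate(outcomes):
    """Return recorded / eligible payments, or None when nothing is eligible."""
    counts = Counter(outcomes)
    eligible = sum(n for outcome, n in counts.items() if outcome not in EXCLUDED)
    if eligible == 0:
        return None  # no signal; avoid dividing by zero
    return counts["recorded"] / eligible


sample = ["recorded"] * 997 + ["record_failed"] * 3 + ["card_declined"] * 40
print(f"{payment_success_rate(sample):.4f}")  # 0.9970, below the 99.9% target
```

Declined cards are dropped from both numerator and denominator, so a spike in declines does not move the SLI.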
Current Metric: billing_payment_total
Required Metric: add paired success/failure counters or a derived success-rate gauge before enforcing this SLO.
Planned Dashboard: internal/dashboards/billing-payment-success.json
Draft Alert Threshold:
- Warning: below 99.5% (trending down)
- Critical: below 99.0% (well below target)
- Duration: 5 minutes
Remediation:
- If webhook failures: Check network/firewall to Primer
- If polling failures: Check Chargebee API health, our rate limiting
- If idempotency failures: Check distributed lock service
- If activation failures: Check Chargebee subscription API
2. Credit Application Latency
Definition: Time from payment completion in Primer to credits appearing in user account.
Fast-path SLI: P(latency_seconds < 15) (percentage of payments processed in under 15 seconds)
Fast-path Target: 95% within 15 seconds.
Eventual-consistency SLI: P(latency_seconds < 1800) (percentage of payments processed within 30 minutes)
Eventual-consistency Target: 99.9% within 30 minutes.
Measurement:
- From: Primer webhook timestamp OR Chargebee invoice paid_at
- To: EventPoller processes invoice, applies credits
- Current code records billing_credit_applied_total when credits are applied.
- Required: add a latency histogram, for example billing_credits_applied_latency_seconds, before enforcing latency alerts.
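To make the two latency SLIs concrete, here is a minimal sketch that evaluates both against raw latency samples. The sample values are invented for illustration. A production implementation would more likely read bucket counts from the planned histogram than keep raw samples.

```python
def fraction_within(latencies_s, threshold_s):
    """Share of payments whose credit latency is strictly below threshold_s."""
    if not latencies_s:
        return None  # no payments in the window
    return sum(1 for x in latencies_s if x < threshold_s) / len(latencies_s)


# Hypothetical latencies in seconds (payment completion -> credits applied).
latencies = [3.2, 4.8, 7.5, 9.1, 12.0, 14.9, 16.3, 45.0, 600.0, 2400.0]

fast_path = fraction_within(latencies, 15)    # compare against the 95% target
eventual = fraction_within(latencies, 1800)   # compare against the 99.9% target
print(fast_path, eventual)  # 0.6 0.9
```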
Planned Metric Name: billing_credits_applied_latency_seconds
Planned Dashboard: internal/dashboards/billing-credit-latency.json
Draft Alert Threshold:
- Warning: fast-path P95 > 15 seconds
- Critical: fast-path P99 > 15 seconds or eventual-consistency P99.9 > 30 minutes
- Duration: 10 minutes
Remediation:
- If polling frequency issue: Check polling worker schedule
- If Chargebee slow: Monitor Chargebee API latency; contact support if > 10s
- If DB slow: Check purchase record write performance
- If cache invalidation slow: Check Redis latency
3. Subscription State Consistency
Definition: Percentage of subscriptions where DB status matches Chargebee status.
Planned SLI: (consistent_subscriptions) / (total_subscriptions)
Target: 99.95% consistency once a consistency checker exists.
Measurement:
- Current code runs TaskChargebeeSyncSubsAll daily in protected stages and syncs all subscriptions from a Chargebee export.
- Current code does not perform an hourly sampled field-level consistency check.
- Required: add an explicit consistency checker before enforcing this SLO.
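One possible shape for the missing checker is sketched below: it compares a status map from the database with one from Chargebee and returns the ratio plus the mismatched IDs. The function name, the dictionary inputs, and the choice to count a subscription present on only one side as a mismatch are all assumptions, not existing behavior.

```python
def consistency_ratio(db_status, chargebee_status):
    """Return (ratio of matching subscriptions, sorted list of mismatched ids)."""
    ids = set(db_status) | set(chargebee_status)
    if not ids:
        return None, []
    # A subscription missing on either side is treated as a mismatch (assumption).
    mismatched = sorted(i for i in ids if db_status.get(i) != chargebee_status.get(i))
    return 1 - len(mismatched) / len(ids), mismatched


# Hypothetical data for illustration only.
db = {"sub_1": "active", "sub_2": "cancelled", "sub_3": "active", "sub_4": "active"}
cb = {"sub_1": "active", "sub_2": "active", "sub_3": "active", "sub_4": "active"}

ratio, bad = consistency_ratio(db, cb)
print(ratio, bad)  # 0.75 ['sub_2']
```

Returning the mismatched IDs alongside the ratio supports the "investigate specific mismatches" remediation step without a second query.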
Planned Metric Name: billing_subscription_consistency_ratio
Planned Dashboard: internal/dashboards/billing-subscription-consistency.json
Draft Alert Threshold:
- Warning: > 0.5% mismatch
- Critical: > 1% mismatch
- Duration: 2 check cycles after a checker is implemented
Remediation:
- Run the Chargebee subscription sync task manually if supported
- Check for failed API calls in logs (partial success, API down)
- Investigate specific mismatches (manual changes, bugs)
4. Checkout Success Rate
Definition: Percentage of users who successfully create a Primer checkout session.
SLI: (completed_checkouts) / (initiated_checkouts)
Target: 99.5% (allow 0.5% creation failures)
Measurement:
- Count: CreatePrimerCheckout API calls
- Success: Returns client_token to user
- Exclude: Network errors during API call (client’s issue)
Current Metric: billing_checkout_duration_seconds
Required Metric: add initiated/completed checkout counters or a derived checkout success-rate gauge before enforcing this SLO.
Planned Dashboard: internal/dashboards/billing-checkout.json
Draft Alert Threshold:
- Warning: below 99.0%
- Critical: below 98.0%
- Duration: 10 minutes
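The interaction between a threshold and a duration can be sketched as follows, assuming one success-rate sample per minute and assuming the thresholds mean "below this value". An alert fires only when every sample in the most recent window breaches, which is what suppresses flapping. The function and parameter names are illustrative.

```python
def alert_level(per_minute_rates, warning=0.99, critical=0.98, duration_min=10):
    """Return 'critical', 'warning', or 'ok' for the most recent window."""
    window = per_minute_rates[-duration_min:]
    if len(window) < duration_min:
        return "ok"  # not enough data to satisfy the duration requirement
    if all(r < critical for r in window):
        return "critical"
    if all(r < warning for r in window):
        return "warning"
    return "ok"


print(alert_level([0.985] * 10))            # warning
print(alert_level([0.97] * 10))             # critical
print(alert_level([0.97] * 9 + [0.995]))    # ok: one healthy minute resets the window
```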
Remediation:
- If Chargebee errors: Check customer creation, invoice creation
- If Primer errors: Check client token generation
- If cache errors: Check Redis availability
- If coupon validation: Check coupon fetch/validate logic
5. Dunning Retry Success Rate
Definition: Percentage of retry attempts that either succeed or are legitimately unrecoverable.
SLI: (successful_retries + canceled_subscriptions) / (total_retry_attempts)
Target: 95% (allow 5% “stuck” invoices requiring manual intervention)
Measurement:
- From: Dunning retry worker triggers renewal invoice payment
- To: Payment recorded OR subscription canceled (max retries exhausted)
- Current code records billing_dunning_retry_total when dunning retries run.
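A minimal sketch of the SLI, assuming retry outcomes are split into three hypothetical counters: `success`, `exhausted` (subscription canceled after the maximum number of retries), and `pending` (still stuck). Treating `exhausted` as the SLI's canceled_subscriptions term is an assumption.

```python
def dunning_success_rate(success, exhausted, pending):
    """(successful retries + subscriptions canceled after max retries) / all attempts."""
    total = success + exhausted + pending
    if total == 0:
        return None  # no retries ran in this window
    return (success + exhausted) / total


print(dunning_success_rate(success=180, exhausted=12, pending=8))  # 0.96
```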
Current Metric: billing_dunning_retry_total
Required Metric: add retry outcome counters, for example success, pending, and exhausted, or a derived success-rate gauge before enforcing this SLO.
Planned Dashboard: internal/dashboards/billing-dunning.json
Draft Alert Threshold:
- Warning: below 90% (trending down)
- Critical: below 80% (failing too many retries)
- Duration: 1 hour
Remediation:
- If payment failures: Check Primer vaulting, card validity
- If subscription cancellation failures: Check Chargebee API
- If stuck invoices: Manual review + support ticket
SLI Dashboards
All SLIs should be visualized on Grafana dashboards before this runbook is promoted from draft. Each dashboard should include:
- Current status (latest value, SLO target)
- Trend (last 7 days, 30 days)
- Alert status (warning, critical)
- Error budget (remaining budget for month)
- Remediation links (runbook, logs, relevant metrics)
Planned Dashboard Locations
These dashboard JSON files are not currently checked into the repository. Add them with the corresponding metrics before treating this runbook as active.
- Payment Success: internal/dashboards/billing-payment-success.json
- Credit Latency: internal/dashboards/billing-credit-latency.json
- Subscription Consistency: internal/dashboards/billing-subscription-consistency.json
- Checkout Success: internal/dashboards/billing-checkout.json
- Dunning Retry: internal/dashboards/billing-dunning.json
- System Health: internal/dashboards/billing-system-health.json (all metrics)
Error Budget
Monthly Error Budget Calculation
For each SLO, calculate the monthly error budget, expressed as the equivalent number of minutes of total failure:
Error Budget (%) = 100% - SLO Target
Error Budget (minutes/month) = Error Budget (%) × 30 days × 1440 minutes/day
Example: Payment Success Rate (99.9% SLO)
Error Budget = 100% - 99.9% = 0.1%
Error Budget = 0.1% × 43,200 minutes = 43.2 minutes/month
If payment success rate drops below 99.9%, we consume error budget. When budget hits zero, we’re in SLO violation.
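The same arithmetic, applied to each target in this document, can be checked with a short script. The function name is illustrative; the 30-day month comes from the formula above.

```python
MINUTES_PER_MONTH = 30 * 1440  # 43,200


def error_budget_minutes(slo_target_pct):
    """Time-equivalent monthly error budget for an SLO target given in percent."""
    return (100 - slo_target_pct) / 100 * MINUTES_PER_MONTH


targets = [
    ("Payment Success Rate", 99.9),
    ("Credit Application Latency (fast path)", 95),
    ("Subscription Consistency", 99.95),
    ("Checkout Success Rate", 99.5),
    ("Dunning Retry Success", 95),
]
for name, target in targets:
    print(f"{name}: {error_budget_minutes(target):.1f} min/month")
# Prints 43.2, 2160.0, 21.6, 216.0, and 2160.0 min/month respectively.
```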
Error Budget Tracking
- Track monthly error budget spent per SLO
- Alert when budget > 50% spent (mid-month warning)
- Report monthly SLO compliance to stakeholders
- Plan remediation when the remaining budget trends toward zero
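For ratio-based SLOs, budget consumption can also be tracked in events rather than minutes. The sketch below is one possible approach, not an existing implementation: it assumes a forecast of the month's event volume (`expected_monthly_events`, a hypothetical input) and reports what fraction of the allowed failures has already occurred.

```python
def budget_spent(expected_monthly_events, bad_events_so_far, slo_target):
    """Fraction of the month's error budget consumed so far (can exceed 1.0)."""
    allowed_bad = expected_monthly_events * (1 - slo_target)
    if allowed_bad == 0:
        return None  # a 100% target leaves no budget to measure against
    return bad_events_so_far / allowed_bad


# Hypothetical month: 200,000 payments forecast, 120 recording failures so far.
spent = budget_spent(200_000, 120, 0.999)
print(f"{spent:.0%} of budget spent")  # 60% of budget spent
if spent > 0.5:
    print("budget alert: more than 50% spent")
```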
Key Metrics to Monitor
In addition to SLOs, monitor these operational metrics. This table mixes current and target metrics; add explicit instruments before installing alerts for target-only metrics.
Performance Metrics
| Metric | Target | Alert |
|---|---|---|
| chargebee_api_latency_p95_ms | < 1000ms | > 2000ms |
| primer_api_latency_p95_ms | < 500ms | > 1000ms |
| payment_recording_latency_p95_s | < 5s | > 10s |
| chargebee_list_subscriptions_latency_p95_ms | < 2000ms | > 5000ms |
| event_polling_latency_p95_ms | < 5000ms | > 10000ms |
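To show how a P95 value maps onto the Target and Alert columns, the sketch below computes a nearest-rank percentile from raw samples and classifies it. The sample data and the three-state result are illustrative assumptions. With an OpenTelemetry histogram, the percentile would normally be estimated from bucket counts by the metrics backend instead.

```python
def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


def classify(p95, target, alert):
    """Compare a P95 value with the table's target and alert thresholds."""
    if p95 > alert:
        return "alert"
    if p95 >= target:
        return "above_target"
    return "ok"


# Hypothetical Chargebee API latencies in ms: 100, 200, ..., 2000.
samples_ms = list(range(100, 2100, 100))
p95 = percentile_nearest_rank(samples_ms, 95)
print(p95, classify(p95, target=1000, alert=2000))  # 1900 above_target
```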
Error Rate Metrics
| Metric | Target | Alert |
|---|---|---|
| chargebee_api_error_rate | < 0.1% | > 1% |
| primer_api_error_rate | < 0.1% | > 1% |
| distributed_lock_timeout_rate | < 0.01% | > 0.1% |
| idempotency_check_error_rate | < 0.01% | > 0.1% |
| circuit_breaker_open_count | 0 | > 0 (immediate) |
Business Metrics
| Metric | Purpose |
|---|---|
| billing_revenue_usd | Total revenue processed (daily, monthly) |
| billing_mrr_usd | Monthly recurring revenue |
| billing_churn_rate | % subscriptions canceled (monthly) |
| billing_refund_rate | % of revenue refunded (monthly) |
| billing_failed_payment_rate | % of renewal payments that failed (daily) |
Alerting Policies
All alerts include:
- Symptom: What’s failing (e.g., “Payment success rate below 99%”)
- Impact: Why it matters (e.g., “Users can’t pay for subscriptions”)
- Severity: Warning, Critical, or Page (wake up on-call)
- Duration: Min time in breach before alert fires (avoid flapping)
- Remediation: Steps to investigate (see Billing Runbook)
Alert Routing
- SLO Breaches (Critical) → Page on-call engineer immediately
- Performance Warnings (Warning) → Slack #billing-alerts (next business day OK)
- System Health (Info) → Log only, no alert (for debugging)
Alert Rule Status
Do not install alert rules for this runbook until the matching metrics and dashboards exist. In particular, billing_payment_success_total, billing_credits_applied_latency_seconds, billing_subscription_consistency_ratio, billing_dunning_success_rate, and billing_circuit_breaker_open are not currently registered by the billing metrics package.
Quarterly SLO Review
Each quarter, review and update SLOs:
- Historical compliance: Did we meet targets? By how much?
- Incidents: Review billing incidents; were SLOs predictive?
- Trends: Are metrics improving or degrading?
- Adjustments: Update targets if sustainable level changes
- Roadmap: Plan improvements to increase reliability
Review Checklist
- Pull monthly compliance reports (last 3 months)
- Review incident post-mortems (any billing-related?)
- Check if error budgets match reality (e.g., were warnings accurate?)
- Identify systemic issues (e.g., “Chargebee always slow at 8am”)
- Propose SLO changes (tighter/looser targets)
- Plan next quarter's reliability improvements
SLO Documentation
For Engineers
- Building SLOs: Use apps/sirloin/internal/app/services/billing/metrics
- Logging metrics: record with the existing OTel instruments or add explicit instruments before documenting new SLOs
- Testing: Mock metrics in unit tests; validate in integration tests
- Dashboard: Update Grafana when adding new SLI
For Operations
- Monitoring: Check dashboards during on-call rotation
- Alerting: Configure PagerDuty/Slack routing per severity
- Runbooks: Keep the Billing Runbook up to date
- Reporting: Monthly SLO compliance report to stakeholders
For Product
- Understanding: SLOs represent what users experience
- Roadmap: Reliability improvements reduce error budget burn
- Incidents: Major incidents typically indicate an SLO miss
Emergency Response
If Multiple SLOs Breach Simultaneously
- Page on-call immediately
- Assess: Check system health dashboard
- Decide: Continue operation (with manual intervention) or rollback?
- Communicate: Update status page; notify support team
- Remediate: See the Billing Runbook for specific issues
- Post-mortem: Schedule within 24 hours
If Chargebee Is Down
Estimate impact:
- Payment recording: Paused (polling can’t reach API)
- Subscription queries: Paused
- Primer payments: Continue (independent system)
- User access: Continues (based on cached subscription state)
Actions:
- Activate cache for subscription queries (extend TTL)
- Queue payments for retry after Chargebee recovers
- Update status page (“Subscription features temporarily unavailable”)
- Estimate recovery time based on Chargebee status page
Appendix: SLO Targets Summary
| Objective | SLI | Target | Error Budget |
|---|---|---|---|
| Payment Success Rate | Success % | 99.9% | 43 min/month |
| Credit Application Latency (fast path) | % within 15 s | 95% | 2,160 min (36 hours)/month |
| Subscription Consistency | Consistency % | 99.95% | 22 min/month |
| Checkout Success Rate | Success % | 99.5% | 216 min/month |
| Dunning Retry Success | Success % | 95% | 2,160 min (36 hours)/month |
Note: Error budgets vary widely by objective. Payment Success Rate and Subscription Consistency have tight budgets (~43 min and ~22 min), while Credit Application Latency and Dunning Retry Success have generous budgets (~36 hours each) reflecting their lower targets. Tightest budgets should drive operational priorities.