Billing Service Level Objectives (SLOs)

This draft defines target SLOs, SLIs, error budgets, and alerting policies for the billing system.

The current code registers base OpenTelemetry instruments in apps/sirloin/internal/app/services/billing/metrics/: billing_payment_total, billing_checkout_duration_seconds, billing_subscription_activated_total, billing_dunning_retry_total, billing_credit_applied_total, billing_fraud_event_total, and billing_rate_limit_hits_total. Derived success-rate, consistency, latency, and dashboard artifacts below are target observability work unless explicitly listed as existing.

Overview

Billing is a critical path for revenue. The system must be highly reliable and observable.

SLO Philosophy

  • User-focused: Measure what users experience (credit latency, success rate)
  • Conservative: Set realistic targets; miss them rarely
  • Actionable: Alert when trending toward breach; include remediation steps

Core SLOs

1. Payment Success Rate

Definition: Percentage of completed Primer payments that are successfully recorded in our system.

SLI: (successful_payments) / (payments_completed_in_primer)

Target: 99.9% (allow 0.1% error rate = 1 failure per 1000 payments)

Measurement:

  • Count: Primer webhooks + polling results
  • Include: First attempt + retries
  • Exclude: User-canceled payments, declined cards (not our system’s fault)
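
The inclusion and exclusion rules above can be sketched as a small calculation. This is a minimal illustration, not the billing service's implementation: the `PaymentEvent` type and the outcome labels (`recorded`, `record_failed`, `user_canceled`, `card_declined`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical outcome labels; the real billing code may classify payments differently.
EXCLUDED_OUTCOMES = {"user_canceled", "card_declined"}


@dataclass(frozen=True)
class PaymentEvent:
    payment_id: str
    outcome: str  # "recorded", "record_failed", "user_canceled", or "card_declined"


def payment_success_rate(events):
    """Return recorded / eligible payment events, or None when nothing is eligible.

    Each attempt (first attempt or retry) is assumed to arrive as its own event.
    User cancellations and card declines are dropped from both numerator and
    denominator because they are not failures of our system.
    """
    eligible = [e for e in events if e.outcome not in EXCLUDED_OUTCOMES]
    if not eligible:
        return None
    recorded = sum(1 for e in eligible if e.outcome == "recorded")
    return recorded / len(eligible)
```

Returning `None` for an empty window keeps a quiet period from being reported as either 0% or 100% success.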

Current Metric: billing_payment_total

Required Metric: add paired success/failure counters or a derived success-rate gauge before enforcing this SLO.

Planned Dashboard: internal/dashboards/billing-payment-success.json

Draft Alert Threshold:

  • Warning: success rate below 99.5% (trending down)
  • Critical: success rate below 99.0%
  • Duration: 5 minutes
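
One way to read the thresholds and the 5-minute duration together is "fire only when every sample in the trailing 5 minutes is below the threshold". The sketch below assumes that interpretation and a hypothetical input of `(timestamp_seconds, success_rate)` samples; the eventual alert rules may be expressed differently in the alerting system.

```python
WARNING_THRESHOLD = 0.995
CRITICAL_THRESHOLD = 0.990
BREACH_DURATION_SECONDS = 5 * 60


def payment_alert_severity(samples):
    """Return "critical", "warning", or None for the trailing 5-minute window.

    samples: list of (timestamp_seconds, success_rate) tuples, oldest first.
    An alert fires only if every sample in the window is below the threshold,
    which keeps a single bad scrape from flapping the alert.
    """
    if not samples:
        return None
    window_start = samples[-1][0] - BREACH_DURATION_SECONDS
    # Require history covering the full window before judging a breach.
    if samples[0][0] > window_start:
        return None
    window = [rate for ts, rate in samples if ts >= window_start]
    if all(rate < CRITICAL_THRESHOLD for rate in window):
        return "critical"
    if all(rate < WARNING_THRESHOLD for rate in window):
        return "warning"
    return None
```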

Remediation:

  • If webhook failures: Check network/firewall to Primer
  • If polling failures: Check Chargebee API health, our rate limiting
  • If idempotency failures: Check distributed lock service
  • If activation failures: Check Chargebee subscription API

2. Credit Application Latency

Definition: Time from payment completion in Primer to credits appearing in user account.

Fast-path SLI: P(latency_seconds < 15) (percentage of payments processed in under 15 seconds)

Fast-path Target: 95% within 15 seconds.

Eventual-consistency SLI: P(latency_seconds < 1800) (percentage of payments processed within 30 minutes)

Eventual-consistency Target: 99.9% within 30 minutes.

Measurement:

  • From: Primer webhook timestamp OR Chargebee invoice paid_at
  • To: EventPoller processes invoice, applies credits
  • Current code records billing_credit_applied_total when credits are applied.
  • Required: add a latency histogram, for example billing_credits_applied_latency_seconds, before enforcing latency alerts.
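
Both latency SLIs are "fraction of payments under a limit" computed over the same latency samples. A minimal sketch, assuming the latencies (payment completion to credits applied, in seconds) have already been collected into a list; in production these would more likely come from histogram buckets than raw samples.

```python
FAST_PATH_SECONDS = 15
EVENTUAL_SECONDS = 1800  # 30 minutes


def fraction_within(latencies_seconds, limit_seconds):
    """Return the share of samples strictly below the limit, or None if empty."""
    if not latencies_seconds:
        return None
    within = sum(1 for value in latencies_seconds if value < limit_seconds)
    return within / len(latencies_seconds)


def credit_latency_slis(latencies_seconds):
    """Compute the fast-path and eventual-consistency SLIs from one sample set."""
    return {
        "fast_path": fraction_within(latencies_seconds, FAST_PATH_SECONDS),
        "eventual": fraction_within(latencies_seconds, EVENTUAL_SECONDS),
    }
```

The fast-path result is compared against the 95% target and the eventual-consistency result against the 99.9% target.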

Planned Metric Name: billing_credits_applied_latency_seconds

Planned Dashboard: internal/dashboards/billing-credit-latency.json

Draft Alert Threshold:

  • Warning: fast-path P95 > 15 seconds
  • Critical: fast-path P99 > 15 seconds or eventual-consistency P99.9 > 30 minutes
  • Duration: 10 minutes

Remediation:

  • If polling frequency issue: Check polling worker schedule
  • If Chargebee slow: Monitor Chargebee API latency; contact support if > 10s
  • If DB slow: Check purchase record write performance
  • If cache invalidation slow: Check Redis latency

3. Subscription State Consistency

Definition: Percentage of subscriptions where DB status matches Chargebee status.

Planned SLI: (consistent_subscriptions) / (total_subscriptions)

Target: 99.95% consistency once a consistency checker exists.

Measurement:

  • Current code runs TaskChargebeeSyncSubsAll daily in protected stages and syncs all subscriptions from a Chargebee export.
  • Current code does not perform an hourly sampled field-level consistency check.
  • Required: add an explicit consistency checker before enforcing this SLO.
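
Since no consistency checker exists yet, the following is only a sketch of what one could compute. It assumes two hypothetical inputs, each a mapping from subscription ID to status string, one read from the database and one from Chargebee; how those mappings are produced is out of scope here.

```python
def subscription_consistency_ratio(db_status_by_id, chargebee_status_by_id):
    """Return (ratio, mismatched_ids) across every subscription seen on either side.

    A subscription present on only one side counts as a mismatch. The ratio is
    None when there are no subscriptions to compare.
    """
    all_ids = set(db_status_by_id) | set(chargebee_status_by_id)
    if not all_ids:
        return None, []
    mismatched = sorted(
        sub_id
        for sub_id in all_ids
        if db_status_by_id.get(sub_id) != chargebee_status_by_id.get(sub_id)
    )
    ratio = (len(all_ids) - len(mismatched)) / len(all_ids)
    return ratio, mismatched
```

Returning the mismatched IDs alongside the ratio gives the on-call engineer a starting list for the "investigate specific mismatches" remediation step.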

Planned Metric Name: billing_subscription_consistency_ratio

Planned Dashboard: internal/dashboards/billing-subscription-consistency.json

Draft Alert Threshold:

  • Warning: > 0.5% mismatch
  • Critical: > 1% mismatch
  • Duration: 2 check cycles after a checker is implemented

Remediation:

  • Run the Chargebee subscription sync task manually if supported
  • Check for failed API calls in logs (partial success, API down)
  • Investigate specific mismatches (manual changes, bugs)

4. Checkout Success Rate

Definition: Percentage of checkout attempts (CreatePrimerCheckout calls) that successfully create a Primer checkout session.

SLI: (completed_checkouts) / (initiated_checkouts)

Target: 99.5% (allow 0.5% creation failures)

Measurement:

  • Count: CreatePrimerCheckout API calls
  • Success: Returns client_token to user
  • Exclude: Network errors during API call (client’s issue)

Current Metric: billing_checkout_duration_seconds

Required Metric: add initiated/completed checkout counters or a derived checkout success-rate gauge before enforcing this SLO.

Planned Dashboard: internal/dashboards/billing-checkout.json

Draft Alert Threshold:

  • Warning: success rate below 99.0%
  • Critical: success rate below 98.0%
  • Duration: 10 minutes

Remediation:

  • If Chargebee errors: Check customer creation, invoice creation
  • If Primer errors: Check client token generation
  • If cache errors: Check Redis availability
  • If coupon validation: Check coupon fetch/validate logic

5. Dunning Retry Success Rate

Definition: Percentage of retry attempts that either succeed or are legitimately unrecoverable.

SLI: (successful_retries + canceled_subscriptions) / (total_retry_attempts)

Target: 95% (allow 5% “stuck” invoices requiring manual intervention)

Measurement:

  • From: Dunning retry worker triggers renewal invoice payment
  • To: Payment recorded OR subscription canceled (max retries exhausted)
  • Current code records billing_dunning_retry_total when dunning retries run.

Current Metric: billing_dunning_retry_total

Required Metric: add retry outcome counters, for example success, pending, and exhausted, or a derived success-rate gauge before enforcing this SLO.

Planned Dashboard: internal/dashboards/billing-dunning.json

Draft Alert Threshold:

  • Warning: success rate below 90% (trending down)
  • Critical: success rate below 80% (failing too many retries)
  • Duration: 1 hour

Remediation:

  • If payment failures: Check Primer vaulting, card validity
  • If subscription cancellation failures: Check Chargebee API
  • If stuck invoices: Manual review + support ticket

SLI Dashboards

All SLIs should be visualized on Grafana dashboards before this runbook is promoted from draft. Each dashboard should include:

  1. Current status (latest value, SLO target)
  2. Trend (last 7 days, 30 days)
  3. Alert status (warning, critical)
  4. Error budget (remaining budget for month)
  5. Remediation links (runbook, logs, relevant metrics)

Planned Dashboard Locations

These dashboard JSON files are not currently checked into the repository. Add them with the corresponding metrics before treating this runbook as active.

  • Payment Success: internal/dashboards/billing-payment-success.json
  • Credit Latency: internal/dashboards/billing-credit-latency.json
  • Subscription Consistency: internal/dashboards/billing-subscription-consistency.json
  • Checkout Success: internal/dashboards/billing-checkout.json
  • Dunning Retry: internal/dashboards/billing-dunning.json
  • System Health: internal/dashboards/billing-system-health.json (all metrics)

Error Budget

Monthly Error Budget Calculation

For each SLO, calculate allowed downtime per month:

Error Budget (%) = 100% - SLO Target
Error Budget (minutes/month) = Error Budget (%) × 30 days × 1440 minutes/day

Example: Payment Success Rate (99.9% SLO)

Error Budget = 100% - 99.9% = 0.1%
Error Budget = 0.1% × 43,200 minutes = 43.2 minutes/month
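
The same arithmetic as a small helper, useful for checking the budgets quoted elsewhere in this document. This is an illustrative sketch; the function name is hypothetical, and the target is passed as a string so that values such as 99.9 are parsed exactly rather than as binary floats.

```python
from fractions import Fraction

MINUTES_PER_MONTH = 30 * 1440  # 43,200, using the 30-day month from the formula above


def error_budget_minutes(slo_target_percent: str) -> float:
    """Return the monthly error budget in minutes for a target such as "99.9"."""
    budget_percent = 100 - Fraction(slo_target_percent)
    return float(budget_percent / 100 * MINUTES_PER_MONTH)
```

For example, `error_budget_minutes("99.9")` gives 43.2 and `error_budget_minutes("99.95")` gives 21.6, matching the figures in the summary table.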

Each minute during which the payment success rate is below 99.9% consumes error budget. When the budget reaches zero, the SLO is violated for that month.

Error Budget Tracking

  • Track monthly error budget spent per SLO
  • Alert when budget > 50% spent (mid-month warning)
  • Report monthly SLO compliance to stakeholders
  • Plan remediation when budget trends toward zero
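
Budget tracking can be sketched as "bad minutes so far this month divided by the monthly budget". The example assumes a hypothetical `bad_minutes` input (minutes in which the SLI was below its target); how those minutes are counted depends on metrics that do not exist yet.

```python
from fractions import Fraction

MINUTES_PER_MONTH = 30 * 1440  # 30-day month, as in the budget formula


def budget_spent_fraction(slo_target_percent: str, bad_minutes: float) -> float:
    """Return the share of the monthly error budget consumed (may exceed 1.0)."""
    budget_minutes = (100 - Fraction(slo_target_percent)) / 100 * MINUTES_PER_MONTH
    if budget_minutes <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return bad_minutes / float(budget_minutes)


def should_warn(spent_fraction: float, warn_at: float = 0.5) -> bool:
    """Apply the "budget > 50% spent" warning rule."""
    return spent_fraction > warn_at
```

A result above 1.0 means the month's budget is exhausted and the SLO has been missed.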

Key Metrics to Monitor

In addition to SLOs, monitor these operational metrics. This table mixes current and target metrics; add explicit instruments before installing alerts for target-only metrics.

Performance Metrics

| Metric | Target | Alert |
| --- | --- | --- |
| chargebee_api_latency_p95_ms | < 1000ms | > 2000ms |
| primer_api_latency_p95_ms | < 500ms | > 1000ms |
| payment_recording_latency_p95_s | < 5s | > 10s |
| chargebee_list_subscriptions_latency_p95_ms | < 2000ms | > 5000ms |
| event_polling_latency_p95_ms | < 5000ms | > 10000ms |

Error Rate Metrics

| Metric | Target | Alert |
| --- | --- | --- |
| chargebee_api_error_rate | < 0.1% | > 1% |
| primer_api_error_rate | < 0.1% | > 1% |
| distributed_lock_timeout_rate | < 0.01% | > 0.1% |
| idempotency_check_error_rate | < 0.01% | > 0.1% |
| circuit_breaker_open_count | 0 | > 0 (immediate) |

Business Metrics

| Metric | Purpose |
| --- | --- |
| billing_revenue_usd | Total revenue processed (daily, monthly) |
| billing_mrr_usd | Monthly recurring revenue |
| billing_churn_rate | % subscriptions canceled (monthly) |
| billing_refund_rate | % of revenue refunded (monthly) |
| billing_failed_payment_rate | % of renewal payments that failed (daily) |

Alerting Policies

All alerts include:

  1. Symptom: What’s failing (e.g., “Payment success rate below 99%”)
  2. Impact: Why it matters (e.g., “Users can’t pay for subscriptions”)
  3. Severity: Warning, Critical, or Page (wake up on-call)
  4. Duration: Min time in breach before alert fires (avoid flapping)
  5. Remediation: Steps to investigate (see Billing Runbook)

Alert Routing

SLO Breaches (Critical) → Page on-call engineer immediately
Performance Warnings (Warning) → Slack #billing-alerts (next business day OK)
System Health (Info) → Log only, no alert (for debugging)
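
The routing above amounts to a lookup from severity to destination. A minimal sketch, with hypothetical destination labels standing in for the real PagerDuty and Slack integrations:

```python
# Destination labels are placeholders, not real integration identifiers.
ROUTES = {
    "critical": "page_on_call",
    "warning": "slack:#billing-alerts",
    "info": "log_only",
}


def route_alert(severity: str) -> str:
    """Return the destination for a severity; reject unknown severities loudly."""
    try:
        return ROUTES[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Raising on an unknown severity, rather than defaulting to log-only, avoids silently dropping a misconfigured critical alert.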

Alert Rule Status

Do not install alert rules for this runbook until the matching metrics and dashboards exist. In particular, billing_payment_success_total, billing_credits_applied_latency_seconds, billing_subscription_consistency_ratio, billing_dunning_success_rate, and billing_circuit_breaker_open are not currently registered by the billing metrics package.


Quarterly SLO Review

Each quarter, review and update SLOs:

  1. Historical compliance: Did we meet targets? By how much?
  2. Incidents: Review billing incidents; were SLOs predictive?
  3. Trends: Are metrics improving or degrading?
  4. Adjustments: Update targets if sustainable level changes
  5. Roadmap: Plan improvements to increase reliability

Review Checklist

  • Pull monthly compliance reports (last 3 months)
  • Review incident post-mortems (any billing-related?)
  • Check if error budgets match reality (e.g., were warnings accurate?)
  • Identify systemic issues (e.g., “Chargebee always slow at 8am”)
  • Propose SLO changes (tighter/looser targets)
  • Plan next quarter's reliability improvements

SLO Documentation

For Engineers

  • Building SLOs: Use apps/sirloin/internal/app/services/billing/metrics
  • Logging metrics: record with the existing OTel instruments or add explicit instruments before documenting new SLOs
  • Testing: Mock metrics in unit tests; validate in integration tests
  • Dashboard: Update Grafana when adding a new SLI

For Operations

  • Monitoring: Check dashboards during on-call rotation
  • Alerting: Configure PagerDuty/Slack routing per severity
  • Runbooks: Keep the Billing Runbook up to date
  • Reporting: Monthly SLO compliance report to stakeholders

For Product

  • Understanding: SLOs represent what users experience
  • Roadmap: Reliability improvements reduce error budget burn
  • Incidents: Major incidents typically indicate an SLO miss

Emergency Response

If Multiple SLOs Breach Simultaneously

  1. Page on-call immediately
  2. Assess: Check system health dashboard
  3. Decide: Continue operation (with manual intervention) or rollback?
  4. Communicate: Update status page; notify support team
  5. Remediate: See the Billing Runbook for specific issues
  6. Post-mortem: Schedule within 24 hours

If Chargebee Is Down

Estimate impact:

  • Payment recording: Paused (polling can’t reach API)
  • Subscription queries: Paused
  • Primer payments: Continue (independent system)
  • User access: Continues (based on cached subscription state)

Actions:

  1. Activate cache for subscription queries (extend TTL)
  2. Queue payments for retry after Chargebee recovers
  3. Update status page (“Subscription features temporarily unavailable”)
  4. Estimate recovery time based on Chargebee status page

Appendix: SLO Targets Summary

| Objective | SLI | Target | Error Budget |
| --- | --- | --- | --- |
| Payment Success Rate | Success % | 99.9% | 43 min/month |
| Credit Application Latency | P95 < 15s | 95% | 2,160 min (36 hours)/month |
| Subscription Consistency | Consistency % | 99.95% | 22 min/month |
| Checkout Success Rate | Success % | 99.5% | 216 min/month |
| Dunning Retry Success | Success % | 95% | 2,160 min (36 hours)/month |

Note: Error budgets vary widely by objective. Payment Success Rate and Subscription Consistency have tight budgets (~43 min and ~22 min), Checkout Success Rate sits in between (~216 min), and Credit Application Latency and Dunning Retry Success have generous budgets (~36 hours each), reflecting their lower targets. The tightest budgets should drive operational priorities.