Brisket On-Call
Brisket is a customer-facing Next.js app. Most user-impacting incidents present as elevated tRPC error rates, a spike in 5xx from route handlers, or front-end performance regressions. This page lists what the on-call engineer should watch and how to respond.
What’s Instrumented
- OpenTelemetry: `server/api/trpc.ts` `tracingMiddleware` emits `trpc.${type}.${path}` spans with status codes; `recordTrpcRequest` (`src/lib/metrics.ts`) emits a metric per request. Exporters are bootstrapped in `apps/brisket/otel.server.config.ts` (URLs from `BRISKET_OTEL_URL`; trace/log + metric headers parsed from `BRISKET_OTEL_TRACES_LOGS_HEADERS` / `BRISKET_OTEL_METRICS_HEADERS`, which carry the Axiom `x-axiom-dataset` token) and gated by `apps/brisket/src/otel-runtime-config.ts`.
- Sentry (`@sentry/nextjs`): captures unhandled errors on both server and browser; production errors carry a `digest` for cross-reference.
- PostHog: product analytics + lifecycle events (`account_created`, `logged_in`) emitted from the Clerk webhook.
- Health probe: `GET /api/health`, used by the Railway healthcheck.
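To make the exporter header parsing and span naming above concrete, here is a minimal sketch. It assumes the header variables use the common OTLP `key=value,key2=value2` format; the helper names `parseOtelHeaders` and `trpcSpanName` are hypothetical, and the real logic in `apps/brisket/otel.server.config.ts` and `server/api/trpc.ts` may differ.

```typescript
// Hypothetical helper, not the actual code in otel.server.config.ts.
// Assumes the env value looks like "key=value,key2=value2".
function parseOtelHeaders(raw: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (!raw) return headers;
  for (const pair of raw.split(",")) {
    // Split on the first "=" only, so values may themselves contain "=".
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip entries with no key or no "="
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    if (key) headers[key] = value;
  }
  return headers;
}

// Span naming as described above: trpc.${type}.${path}.
function trpcSpanName(
  type: "query" | "mutation" | "subscription",
  path: string,
): string {
  return `trpc.${type}.${path}`;
}
```

When alert 10 (OTel exporter failures) fires with a 401, running the configured header value through a parser like this is a quick way to check whether an expected key is missing or malformed.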
Top Alerts
Targets below are starting points — calibrate against historical baselines.
| # | Alert | Signal | Threshold | Why it matters |
|---|---|---|---|---|
| 1 | tRPC error rate (overall) | Axiom: % of trpc.* spans with status=ERROR | > 2 % over 5 min | Indicates a broad outage (sirloin down, deploy regression, Clerk failure). |
| 2 | Checkout error rate | Axiom: subscription.createPrimerCheckout errors | > 5 % over 10 min | Direct revenue impact. |
| 3 | Auth failure rate | Axiom: middleware redirects / Sentry Clerk errors | > baseline × 3 over 5 min | Sign-in broken. |
| 4 | /api/health failing | Railway healthcheck red | any | Pod cannot serve traffic. Auto-restarts, but page if the failure repeats. |
| 5 | Webhook signature failures | Sentry / log filter Clerk webhook signature verification failed | > 5 in 5 min | Misconfigured CLERK_WEBHOOK_SECRET or tampering. |
| 6 | Sirloin Connect Unavailable (code 14) | Axiom span tag | sustained > 1 % for 5 min | Upstream outage; brisket has no fallback. |
| 7 | Sentry release error volume | Sentry alert | > N errors in first 30 min after deploy | Bad deploy — candidate for rollback. |
| 8 | Build failures on release | GitHub Actions / Railway | any | Production deploy stuck. |
| 9 | Core Web Vitals regression | PostHog $web_vitals (sampled at 0.2 in features/analytics/posthog/PosthogProvider.tsx) | LCP p75 > 2.5 s, INP p75 > 200 ms | UX degradation. |
| 10 | OTel exporter failures | Axiom ingest 401 / log line “Failed to export traces” | any sustained | Observability gap; on-call is flying blind. |
TODO(@marty): which of these are wired into PagerDuty / Slack today. The codebase shows OTel + Sentry + PostHog instrumentation but does not encode alert rules.
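The thresholds in the table take two shapes: a fixed percentage over a window (alerts 1, 2, 6) and a multiple of a historical baseline (alert 3). Since the codebase does not encode alert rules, the following is only an illustrative sketch of how each shape could be evaluated; all type and function names are hypothetical.

```typescript
// Error and request counts aggregated over one alert window (e.g. 5 min).
interface WindowCounts {
  errors: number;
  total: number;
}

// Error rate as a percentage; 0 when the window saw no traffic.
function errorRatePct({ errors, total }: WindowCounts): number {
  return total === 0 ? 0 : (errors * 100) / total;
}

// Fixed-threshold rule, e.g. alert 1: error rate > 2 % over the window.
function breachesThreshold(w: WindowCounts, thresholdPct: number): boolean {
  return errorRatePct(w) > thresholdPct;
}

// Baseline-relative rule, e.g. alert 3: current rate > baseline x 3.
function breachesBaseline(
  currentRate: number,
  baselineRate: number,
  multiplier = 3,
): boolean {
  return currentRate > baselineRate * multiplier;
}
```

Treating an empty window as a 0 % error rate avoids false pages during quiet periods, but it also hides a total loss of traffic, which is why a separate healthcheck alert (alert 4) is still needed.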
Dashboards
- Axiom — Brisket overview: tRPC throughput, error rate, p50/p95/p99 by procedure. TODO(@marty): dashboard link.
- Sentry — Brisket project: unresolved issues, release health, regression detection.
- PostHog: funnel for sign-up → first generation → paid checkout.
- Railway, `brisket` service: deploy log, healthcheck status, build logs.
Triage Playbook
- Confirm scope: public outage (`/api/health` red, all users) vs. partial (one tRPC procedure spiking). Check Axiom by the `trpc.procedure` tag.
- Check upstreams: sirloin status, Chargebee status, Clerk status, Cloudflare Turnstile status. Brisket fails in user-visible ways when any of these degrade.
- Check recent deploys: Railway deploy log; Sentry “regressions” tab. If the spike began at deploy time, roll back (see `brisket-runbook`).
- Open a war room: paste Axiom + Sentry links and a current error sample. Avoid speculation; lead with data.
- User comms: for customer-visible issues lasting more than 5 minutes, post a status update via the standard channel. TODO(@marty): confirm status-page integration.
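The "confirm scope" step can be approximated in code. The sketch below is a rough heuristic, not an existing Brisket utility: it assumes you already have the health probe result and per-procedure counts (for example from an Axiom query), and the names `classifyScope` and `ProcedureStats` and the default 2 % threshold are hypothetical.

```typescript
type Scope = "public-outage" | "partial" | "none";

// Per-procedure counts for the window under investigation.
interface ProcedureStats {
  procedure: string;
  errors: number;
  total: number;
}

// Classify an incident from the health probe and per-procedure error rates.
// A red health probe is treated as a public outage regardless of the stats.
function classifyScope(
  healthOk: boolean,
  stats: ProcedureStats[],
  thresholdPct = 2,
): { scope: Scope; affected: string[] } {
  // Procedures with traffic whose error rate exceeds the threshold.
  const affected = stats
    .filter((s) => s.total > 0 && (s.errors * 100) / s.total > thresholdPct)
    .map((s) => s.procedure);
  if (!healthOk) return { scope: "public-outage", affected };
  return { scope: affected.length > 0 ? "partial" : "none", affected };
}
```

The `affected` list is useful to paste into the war room alongside the Axiom and Sentry links, since it names the spiking procedures directly.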
Escalation
| Tier | Owner | When |
|---|---|---|
| L1 | Brisket on-call | First responder; can roll back, restart, redeploy. |
| L2 | Frontend lead | Persistent client-side bug, deploy-time build issues. |
| L3 | Sirloin on-call | Confirmed upstream gRPC failure (sirloin returning systemic errors). |
| L3 | Auth/identity owner | Clerk webhook or session integrity issues. |
| L3 | Billing owner | Chargebee/Primer outage during checkout. |
Owner directory: TODO(@marty): link to internal on-call rotation.
Useful Queries
Axiom (tRPC error rate by procedure, last 15 min):
```
['beef-prod']
| where _time > ago(15m) and span_name startswith "trpc."
| summarize errors = countif(status == "ERROR"), total = count() by trpc_procedure
| extend error_rate = todouble(errors) / todouble(total)
| order by error_rate desc
```

TODO(@marty): exact Axiom field/dataset names; this is illustrative.
Anchors
- `operations/observability`: full observability standards.
- `operations/testing-strategy`: pre-deploy gating.
- `standards/auth-model`, `standards/security-model`: context for auth/billing alerts.
- Companion docs: `brisket-errors`, `brisket-runbook`.