Brisket On-Call

Brisket is a customer-facing Next.js app. Most user-impacting incidents present as elevated tRPC error rates, a spike in 5xx responses from route handlers, or front-end performance regressions. This page lists what the on-call engineer should watch and how to respond.

What’s Instrumented

  • OpenTelemetry: tracingMiddleware in server/api/trpc.ts emits trpc.${type}.${path} spans with status codes, and recordTrpcRequest (src/lib/metrics.ts) emits one metric per request. Exporters are bootstrapped in apps/brisket/otel.server.config.ts and gated by apps/brisket/src/otel-runtime-config.ts. Exporter URLs come from BRISKET_OTEL_URL; trace/log headers are parsed from BRISKET_OTEL_TRACES_LOGS_HEADERS and metric headers from BRISKET_OTEL_METRICS_HEADERS, both of which carry the Axiom x-axiom-dataset token.
  • Sentry (@sentry/nextjs): captures unhandled errors on both server and browser; production errors carry a digest for cross-reference.
  • PostHog: product analytics and lifecycle events (account_created, logged_in) emitted from the Clerk webhook.
  • Health probe: GET /api/health, used by the Railway healthcheck.
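
When debugging exporter auth problems, it helps to know how a header variable turns into a header map. The sketch below assumes the BRISKET_OTEL_*_HEADERS variables use the comma-separated key=value shape of the standard OTEL_EXPORTER_OTLP_HEADERS variable; this page does not confirm that format, and parseOtelHeaders is a hypothetical name rather than the function in otel.server.config.ts.

```typescript
// Hypothetical helper: parses "key1=value1,key2=value2" into a header map.
// Assumption: the env var format matches OTEL_EXPORTER_OTLP_HEADERS.
// The real parsing lives in apps/brisket/otel.server.config.ts and may differ.
function parseOtelHeaders(raw: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (!raw) return headers;
  for (const pair of raw.split(",")) {
    const idx = pair.indexOf("=");
    if (idx <= 0) continue; // skip entries with no "=" or an empty key
    const key = pair.slice(0, idx).trim();
    // Split on the first "=" only, so values may themselves contain "=".
    const value = pair.slice(idx + 1).trim();
    if (key) headers[key] = value;
  }
  return headers;
}
```

If a helper like this returns an empty map in production, the exporter would send unauthenticated requests, which is one plausible cause of the ingest 401s listed under Top Alerts. Avoid logging the parsed values, since they contain credentials.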

Top Alerts

Targets below are starting points — calibrate against historical baselines.

| # | Alert | Signal | Threshold | Why it matters |
|---|-------|--------|-----------|----------------|
| 1 | tRPC error rate (overall) | Axiom: % of trpc.* spans with status=ERROR | > 2 % over 5 min | Indicates a broad outage (sirloin down, deploy regression, Clerk failure). |
| 2 | Checkout error rate | Axiom: subscription.createPrimerCheckout errors | > 5 % over 10 min | Direct revenue impact. |
| 3 | Auth failure rate | Axiom: middleware redirects / Sentry Clerk errors | > baseline × 3 over 5 min | Sign-in broken. |
| 4 | /api/health failing | Railway healthcheck red | any | Pod cannot serve traffic. Auto-restart, but page if repeats. |
| 5 | Webhook signature failures | Sentry / log filter Clerk webhook signature verification failed | > 5 in 5 min | Misconfigured CLERK_WEBHOOK_SECRET or tampering. |
| 6 | Sirloin Connect Unavailable (code 14) | Axiom span tag | sustained > 1 % for 5 min | Upstream outage; brisket has no fallback. |
| 7 | Sentry release error volume | Sentry alert | > N errors in first 30 min after deploy | Bad deploy; candidate for rollback. |
| 8 | Build failures on release | GitHub Actions / Railway | any | Production deploy stuck. |
| 9 | Core Web Vitals regression | PostHog $web_vitals (sampled at 0.2 in features/analytics/posthog/PosthogProvider.tsx) | LCP p75 > 2.5 s, INP p75 > 200 ms | UX degradation. |
| 10 | OTel exporter failures | Axiom ingest 401 / log line “Failed to export traces” | any sustained | Observability gap; on-call is flying blind. |

TODO(@marty): confirm which of these are wired into PagerDuty / Slack today. The codebase shows OTel, Sentry, and PostHog instrumentation but does not encode alert rules.
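
Since the alert rules are not encoded anywhere yet, the two threshold shapes used in the table (a fixed error rate over a window, and a multiple of a historical baseline) can be sketched as below. All type and function names here are hypothetical, and the numbers are the starting-point targets from the table, not calibrated values.

```typescript
// Hypothetical types and names; the brisket codebase does not define these.
type WindowCounts = { errors: number; total: number };

type Rule =
  | { kind: "rate"; maxRate: number } // e.g. alert 1: > 2 %
  | { kind: "baselineMultiple"; baselineRate: number; multiple: number }; // e.g. alert 3: > baseline × 3

// Error rate for one evaluation window; an empty window is treated as healthy.
function errorRate({ errors, total }: WindowCounts): number {
  return total === 0 ? 0 : errors / total;
}

// True when the window's error rate strictly exceeds the rule's threshold.
function shouldFire(rule: Rule, window: WindowCounts): boolean {
  const rate = errorRate(window);
  switch (rule.kind) {
    case "rate":
      return rate > rule.maxRate;
    case "baselineMultiple":
      return rate > rule.baselineRate * rule.multiple;
  }
}
```

For example, shouldFire({ kind: "rate", maxRate: 0.02 }, { errors: 3, total: 100 }) evaluates alert 1 against a 5-minute window containing 100 spans. Treating an empty window as healthy is a design choice; a real rule may prefer to alert on missing data, which would also cover alert 10.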

Dashboards

  • Axiom — Brisket overview: tRPC throughput, error rate, p50/p95/p99 by procedure. TODO(@marty): dashboard link.
  • Sentry — Brisket project: unresolved issues, release health, regression detection.
  • PostHog: funnel for sign-up → first generation → paid checkout.
  • Railway — brisket service: deploy log, healthcheck status, build logs.

Triage Playbook

  1. Confirm scope: public outage (/api/health red, all users affected) vs. partial (one tRPC procedure spiking). Check Axiom by trpc.procedure tag.
  2. Check upstreams: sirloin status, Chargebee status, Clerk status, Cloudflare Turnstile status. Brisket fails in user-visible ways when any of these degrade.
  3. Check recent deploys: Railway deploy log; Sentry “regressions” tab. If the spike began at deploy time, roll back (see brisket-runbook).
  4. Open a war room: paste Axiom and Sentry links and a current error sample. Avoid speculation; lead with data.
  5. User comms: for customer-visible issues lasting longer than 5 minutes, post a status update via the standard channel. TODO(@marty): confirm status-page integration.
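
Steps 1 and 3 of the playbook are mechanical enough to sketch in code. The snippet below is illustrative only: the Snapshot shape, the 2 % per-procedure threshold, and the 15-minute deploy window are assumptions, and nothing like this exists in the brisket codebase today.

```typescript
// Hypothetical inputs; in practice these would come from Railway and Axiom.
type Snapshot = {
  healthOk: boolean; // result of GET /api/health
  errorRateByProcedure: Record<string, number>; // rate per tRPC procedure, 0..1
};

type Scope =
  | { kind: "public-outage" }
  | { kind: "partial"; procedures: string[] }
  | { kind: "healthy" };

// Step 1: a red health probe means a public outage; otherwise list the
// procedures above the (assumed) threshold, worst first.
function classifyScope(s: Snapshot, threshold = 0.02): Scope {
  if (!s.healthOk) return { kind: "public-outage" };
  const procedures = Object.entries(s.errorRateByProcedure)
    .filter(([, rate]) => rate > threshold)
    .sort(([, a], [, b]) => b - a)
    .map(([name]) => name);
  return procedures.length > 0 ? { kind: "partial", procedures } : { kind: "healthy" };
}

// Step 3: did the spike start within an (assumed) window after the deploy?
function spikeFollowsDeploy(spikeStart: Date, deployAt: Date, windowMinutes = 15): boolean {
  const deltaMs = spikeStart.getTime() - deployAt.getTime();
  return deltaMs >= 0 && deltaMs <= windowMinutes * 60 * 1000;
}
```

A spike that starts shortly after a deploy is only a hint, not proof of a bad release; step 2 (upstream status) still needs checking before rolling back.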

Escalation

| Tier | Owner | When |
|------|-------|------|
| L1 | Brisket on-call | First responder; can roll back, restart, redeploy. |
| L2 | Frontend lead | Persistent client-side bug, deploy-time build issues. |
| L3 | Sirloin on-call | Confirmed upstream gRPC failure (sirloin returning systemic errors). |
| L3 | Auth/identity owner | Clerk webhook or session integrity issues. |
| L3 | Billing owner | Chargebee/Primer outage during checkout. |

Owner directory: TODO(@marty): link to internal on-call rotation.

Useful Queries

Axiom (tRPC error rate by procedure, last 15 min):

['beef-prod']
| where _time > ago(15m) and span_name startswith "trpc."
| summarize errors = countif(status == "ERROR"), total = count() by trpc_procedure
| extend error_rate = todouble(errors) / todouble(total)
| order by error_rate desc

TODO(@marty): exact Axiom field/dataset names; this is illustrative.

Anchors

  • operations/observability — full observability standards.
  • operations/testing-strategy — pre-deploy gating.
  • standards/auth-model, standards/security-model — context for auth/billing alerts.
  • Companion docs: brisket-errors, brisket-runbook.