Axiom Integration Playbook
This page is the operator-side companion to Observability. The spine documents what gets logged and how trace context propagates. This page documents the Axiom account itself — datasets, dashboards, monitors, the OTLP shipping path, and APL queries you’d run during an incident.
Overview
Axiom is the logs + traces + metrics sink for beef. Three signals, two dataset types:
- axiom:events:v1 datasets receive logs and traces (default beef-staging).
- otel:metrics:v1 datasets receive OTel metrics (default beef-staging-otel-metrics).
Code wiring is OTLP-over-HTTP directly to api.axiom.co, gated by an env
flag. There is no Railway log drain in repo (grep -r 'AXIOM' apps/*/railway.json
returns nothing as of 2026-05-05) — every service that ships does so via its
own OTel SDK. Services without an Axiom token print to stdout only.
| Service | Wired in code? | Evidence |
|---|---|---|
| sirloin | Yes — Go OTLP | apps/sirloin/internal/pkg/tracing/tracing.go |
| brisket | Yes — Node OTLP | apps/brisket/otel.server.config.ts |
| brain | Yes — Node OTLP | apps/brain/src/otel.server.config.ts |
| round | No | apps/round/go.mod has no OTel deps |
| strip | No | apps/strip/go.mod has no OTel deps |
| flank | Partial | @opentelemetry/api@^1.9.1 only (no SDK exporter) per apps/flank/package.json; spans terminate locally |
| fennec | No | No @opentelemetry/* package in apps/fennec/package.json |
| chuck | No | No @opentelemetry/* or Axiom package in apps/chuck/package.json |
Gap. round, strip, flank, fennec, and chuck do not ship to Axiom today. Their logs land in Railway stdout and are not searchable in Axiom. For full-stack tracing, those services are dark.
Datasets in use
TODO(@law): live dataset list. The Axiom MCP probe (listDatasets) failed with “user token not found or expired” on the latest re-run; the table below is reconstructed from .env.example. Re-run mcp__axiom__listDatasets once a token is set and reconcile.
| Dataset | Kind | Signals | Producers | Retention |
|---|---|---|---|---|
| beef-staging | axiom:events:v1 | logs + traces | sirloin, brisket, brain | TODO(@law) |
| beef-staging-otel-metrics | otel:metrics:v1 | metrics | sirloin, brisket, brain | TODO(@law) |
| beef-production | axiom:events:v1 | logs + traces | TODO(@law): naming convention only | TODO(@law) |
| beef-production-otel-metrics | otel:metrics:v1 | metrics | TODO(@law) | TODO(@law) |
Dataset names referenced verbatim:

    # .env.example:59-60
    SIRLOIN_AXIOM_DATASET=beef-staging
    SIRLOIN_AXIOM_METRICS_DATASET=beef-staging-otel-metrics

    # apps/brisket/.env.example:25-26
    BRISKET_OTEL_METRICS_HEADERS=...,x-axiom-dataset=beef-staging-otel-metrics
    BRISKET_OTEL_TRACES_LOGS_HEADERS=...,x-axiom-dataset=beef-staging

Per-service log shipping
| Service | Mechanism | Endpoint | Dataset (logs+traces / metrics) | Code |
|---|---|---|---|---|
| sirloin | Go OTLP/HTTP — otlploghttp, otlptracehttp, otlpmetrichttp | api.axiom.co (hard-coded) | $SIRLOIN_AXIOM_DATASET / $SIRLOIN_AXIOM_METRICS_DATASET | apps/sirloin/internal/pkg/tracing/tracing.go:34 |
| brisket | Node OTLP — @opentelemetry/exporter-{trace,metrics,logs}-otlp-* via NodeSDK | $BRISKET_OTEL_URL/v1/{traces,metrics,logs} | header-encoded (x-axiom-dataset) | apps/brisket/otel.server.config.ts |
| brain | Node OTLP — same exporter family + SentrySpanProcessor | $BRAIN_OTEL_URL/v1/{traces,metrics,logs} | header-encoded | apps/brain/src/otel.server.config.ts |
| round | None — zerolog to stdout | — | — | apps/round/cmd/app/main.go |
| strip | None — zerolog to stdout | — | — | apps/strip/cmd/app/main.go |
| flank | None — @opentelemetry/api is the only OTel package; no SDK exporter | — | — | apps/flank/package.json |
Authentication on the wire is Authorization: Bearer <AXIOM_TOKEN> plus
X-Axiom-Dataset: <dataset>. sirloin and brisket send the same headers; only the
OTel SDK differs (Go vs Node).
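
The header-encoded form used by the Node services can be unpacked as in the sketch below. It assumes the value is a comma-separated list of key=value pairs, as the .env.example entries suggest. Both parse_otel_headers and missing_axiom_headers are hypothetical helpers for operator-side checks, not functions in the repo.

```python
def parse_otel_headers(raw: str) -> dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict with lower-cased keys.

    Assumption: header values never contain commas, which holds for the
    Authorization and X-Axiom-Dataset headers described above.
    """
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed header pair: {pair!r}")
        headers[key.strip().lower()] = value.strip()
    return headers


def missing_axiom_headers(headers: dict[str, str]) -> list[str]:
    """Return the required Axiom headers that are absent or empty."""
    required = ("authorization", "x-axiom-dataset")
    return [name for name in required if not headers.get(name)]
```

Running missing_axiom_headers on a parsed *_HEADERS value before a deploy is a cheap way to catch a dataset header that was dropped during a token rotation.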
OTel disabled when token is empty. sirloin’s tracing.Init short-circuits
if cfg.AxiomToken == "" (tracing.go:84), returning a no-op provider. The
Node services follow the same gating in
brain/src/common/otel/runtime-config.ts — exported booleans
isBrainOtelEnabled, isBrainOtelMetricsEnabled,
isBrainOtelTracesAndLogsEnabled derived from BRAIN_OTEL_URL /
BRAIN_OTEL_*_HEADERS.
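
The gating described above can be pictured roughly as follows. This is a sketch under assumptions: the exact derivation in runtime-config.ts is not reproduced here, so the rule "a signal is enabled when the URL and that signal's headers are both non-empty" is a guess at the shape, and OtelFlags / derive_otel_flags are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OtelFlags:
    metrics_enabled: bool
    traces_and_logs_enabled: bool

    @property
    def enabled(self) -> bool:
        # Assumed: OTel as a whole is on if at least one signal is on.
        return self.metrics_enabled or self.traces_and_logs_enabled


def derive_otel_flags(env: dict[str, str], prefix: str = "BRAIN") -> OtelFlags:
    """Derive per-signal flags from <PREFIX>_OTEL_* variables (assumed rule)."""
    url = env.get(f"{prefix}_OTEL_URL", "").strip()
    metrics = env.get(f"{prefix}_OTEL_METRICS_HEADERS", "").strip()
    traces_logs = env.get(f"{prefix}_OTEL_TRACES_LOGS_HEADERS", "").strip()
    return OtelFlags(
        metrics_enabled=bool(url and metrics),
        traces_and_logs_enabled=bool(url and traces_logs),
    )
```

With an empty environment every flag is false, which matches the documented behaviour that a service without a token prints to stdout only.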
Foxy360 query connector
In addition to shipping to Axiom, sirloin can query Axiom on behalf of
Foxy360 tools. This is a separate code path
(apps/sirloin/internal/app/foxy360/server.go:1216,
executeAxiomRawFallback) that hits https://api.axiom.co with a separate
token (SIRLOIN_FOXY360_AXIOM_API_TOKEN) and applies guardrails:
- max time window 30 days (server.go:1292).
- requires params.dataset or SIRLOIN_FOXY360_AXIOM_DATASET env.
Use this when Foxy360 needs an APL query as part of a tool call. It is not part of the log-shipping path.
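
The two guardrails can be expressed as a small pre-flight check. The sketch below is illustrative and not a port of executeAxiomRawFallback: the precedence of params.dataset over the env default and the "end must be after start" check are assumptions, and resolve_query is a hypothetical name.

```python
from datetime import datetime, timedelta

MAX_WINDOW = timedelta(days=30)  # guardrail from the list above


def resolve_query(params: dict, env: dict[str, str],
                  start: datetime, end: datetime) -> str:
    """Return the dataset to query, or raise ValueError if a guardrail fails."""
    if end <= start:
        # Assumed sanity check; not stated in the source.
        raise ValueError("end must be after start")
    if end - start > MAX_WINDOW:
        raise ValueError("time window exceeds 30 days")
    # Assumed precedence: explicit parameter first, then the env default.
    dataset = params.get("dataset") or env.get("SIRLOIN_FOXY360_AXIOM_DATASET")
    if not dataset:
        raise ValueError("no dataset in params or SIRLOIN_FOXY360_AXIOM_DATASET")
    return dataset
```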
Dashboards
TODO(@law): dashboard inventory. mcp__axiom__listDashboards failed again with user token not found or expired. Re-run when an MCP token is provisioned and populate the table below.
| Dashboard | Purpose | Owner |
|---|---|---|
| TODO(@law): per-service traces | latency p95/p99, error rate, RPS by service | TODO(@law) |
| TODO(@law): billing SLOs | success rate of billing_payment_total, dunning lag | @billing |
| TODO(@law): OTel metrics overview | resource utilisation by service | TODO(@law) |
Cross-reference: apps/sirloin/internal/app/services/billing/metrics/ defines
the metric names the billing dashboard should be built on (billing_payment_total,
billing_checkout_duration_seconds, etc.; see
Billing SLOs runbook).
Monitors and alerts
TODO(@law): monitor inventory. mcp__axiom__checkMonitors re-run still returns user token not found or expired. Run checkMonitors() then getMonitorHistory() per monitor and document below once a token is available.
Monitors known to be needed (from the observability spine’s open TODOs):
- 5xx-rate per service (sirloin, brain, brisket).
- OTLP ingest dropping — alert when a service’s log volume falls > 80% in 10m.
- Billing event SLOs (see runbook).
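
The "falls > 80% in 10m" condition for the ingest-drop monitor can be read as a comparison of two adjacent 10-minute windows. That reading is an assumption, and volume_dropped below is a hypothetical helper that only illustrates the arithmetic, not an Axiom monitor definition.

```python
def volume_dropped(previous_count: int, current_count: int,
                   threshold: float = 0.80) -> bool:
    """True when the current window fell by more than `threshold`
    relative to the previous window of the same length."""
    if previous_count <= 0:
        # No baseline to compare against; a silent service needs a
        # separate "no data" check rather than a ratio.
        return False
    drop = (previous_count - current_count) / previous_count
    return drop > threshold
```

A service that is already silent has no baseline, so a ratio-based monitor of this shape would not fire for it; that case needs its own absence-of-data alert.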
What we know is wired:
Sirloin Billing Drift
| Field | Value |
|---|---|
| Monitor name | Sirloin Billing Drift |
| Source metric | billing_invoices_drift_total (OTel Int64Counter) |
| Dataset | beef-staging-otel-metrics (and beef-production-otel-metrics once provisioned) |
| OTel scope | sirloin/billing |
| Threshold | any non-zero count over a rolling 15-minute window |
| Window | 15m, evaluated every 5m |
| Routing | Slack #billing-alerts |
| Owner | @billing |
| Status | wired (operator-managed via the Axiom UI / IaC repo) |
Rationale. The local billing.invoices registry is reconciled against
Chargebee by apps/sirloin/internal/app/worker/reconcileunpaidinvoices.go
(reconciler Pass 1). Per-field drift checks emit
billing_invoices_drift_total with an attribute.String("field", ...) label.
Field values currently emitted by the reconciler: chargebee_status,
amount_due_cents, total_cents, currency, chargebee_invoice_date,
due_date, chargebee_updated_at, subscription_status,
effective_auto_collection_off, customer_id, chargebee_subscription_id,
and deleted. Treat this list as the current reconciler emitter set, not as
exhaustive — new fields can be added in reconcileunpaidinvoices.go without
requiring an Axiom monitor change. The counter is registered at
apps/sirloin/internal/app/services/billing/metrics/billing_invoices.go:69
(BillingInvoicesDriftTotal). During the legacy CB-overlay window this is an
overlay-period observability signal; once the legacy overlay is closed it
becomes the primary signal that local state has diverged from Chargebee.
Query (Axiom metrics dataset; run via mcp__axiom__queryMetrics, not APL,
because the OTLP metrics dataset is not queryable with APL; see the “Metric
query — OTLP ingest rate” note below):

    # in queryMetrics:
    # metric: billing_invoices_drift_total
    # aggregation: sum
    # group_by: field
    # window: 15m
    # alert when sum(rate) > 0 over the 15m window

For ad-hoc inspection of which field values are firing, group by the
field attribute. The reconciler emits one increment per detected field per
invoice per pass, so any non-zero sum indicates real divergence (not a noisy
counter).
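
To make the threshold concrete, the sketch below evaluates the "any non-zero count over a rolling 15-minute window" rule against a list of increments. It is a local illustration only: the (timestamp, field, value) tuple shape is an assumption, and the real evaluation happens inside the Axiom monitor.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # matches the monitor window in the table above


def drift_by_field(increments, now: datetime) -> dict[str, int]:
    """Sum increments per `field` label inside the trailing 15-minute window.

    `increments` is an iterable of (timestamp, field, value) tuples.
    """
    totals: dict[str, int] = {}
    for ts, field, value in increments:
        if now - WINDOW < ts <= now:
            totals[field] = totals.get(field, 0) + value
    return totals


def should_alert(increments, now: datetime) -> bool:
    """Fire when the windowed sum across all fields is non-zero."""
    return sum(drift_by_field(increments, now).values()) > 0
```

Grouping by field before summing mirrors the group_by in the query, so the same pass answers both "should this alert?" and "which fields drifted?".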
Test procedure (sandbox). The monitor itself is created in the Axiom UI / IaC repo, not from this codebase. To validate end-to-end alert delivery without waiting for organic drift:
- In the sandbox sirloin instance, force one drift increment, either via a temporary debug worker call to metrics.BillingInvoicesDriftTotal().Add(ctx, 1, metric.WithAttributes(attribute.String("field", "chargebee_status"))), or by deliberately writing a sentinel billing.invoices row whose chargebee_status differs from the upstream Chargebee invoice and waiting one reconciler tick.
- Confirm the counter increment lands in beef-staging-otel-metrics (queryMetrics on billing_invoices_drift_total with a 5-minute window).
- Verify the Slack alert arrives in #billing-alerts within the monitor’s evaluation interval. Acknowledge and silence in Axiom afterwards so the sentinel does not leave a permanent firing condition.
Related metrics. Cross-referenced with the rest of the INVR (local invoice
registry) metric surface from
apps/sirloin/internal/app/services/billing/metrics/billing_invoices.go:
- billing_invoices_upsert_total: labeled by source and result; tracks every write into billing.invoices.
- billing_invoices_pending_total: gauge of rows in a non-terminal Chargebee state, scraped each reconciler tick (a sustained high value implies the reconciler is falling behind, not that drift is occurring).
- billing_invoices_reconciler_actions_total: labeled by action; action="drift" rows correlate with billing_invoices_drift_total increments.
The observability spine flags “Alerts / SLOs / on-call ownership” as item 10 of the open-TODO list — beyond the Sirloin Billing Drift monitor above, assume nothing else is wired until proven otherwise.
APL queries
Useful APL queries for an on-call engineer. Run from the Axiom UI or via
mcp__axiom__queryDataset. All assume beef-staging as the events dataset;
swap for beef-production once that dataset exists.
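
Swapping the dataset by hand in every query is error-prone during an incident, so a small template helper may be useful. The sketch below is hypothetical (recent_5xx_query is not in the repo); it rebuilds the sirloin 5xx query from this page with the dataset, service, and window as parameters, and rejects values that could break out of the APL string.

```python
import re

_NAME_RE = re.compile(r"[A-Za-z0-9_.-]+")
_WINDOW_RE = re.compile(r"\d+[smhd]")


def recent_5xx_query(dataset: str = "beef-staging",
                     service: str = "sirloin",
                     window: str = "30m") -> str:
    """Build the 'recent 5xx with trace IDs' APL query for a given dataset."""
    if not _NAME_RE.fullmatch(dataset) or not _NAME_RE.fullmatch(service):
        raise ValueError("dataset and service must be plain identifiers")
    if not _WINDOW_RE.fullmatch(window):
        raise ValueError("window must look like 30m, 6h, 1d")
    lines = [
        f"['{dataset}']",
        f"| where _time > ago({window})",
        f"| where ['resource.service.name'] == '{service}'",
        "| where toint(['attributes']['http.status_code']) >= 500",
        "| project _time, ['attributes']['http.route'], "
        "status = ['attributes']['http.status_code'], trace_id",
        "| order by _time desc",
    ]
    return "\n".join(lines)
```

Calling recent_5xx_query("beef-production") yields the same query against the production dataset once it exists.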
Recent errors across all services (last 1h)

    ['beef-staging']
    | where _time > ago(1h)
    | where ['attributes.severity_text'] in ('ERROR', 'FATAL') or ['attributes']['level'] >= 50 // pino numeric level
    | project _time, ['resource.service.name'], ['attributes']['msg'], trace_id, span_id
    | order by _time desc
    | take 200

Per-service error rate (5m buckets, last 6h)

    ['beef-staging']
    | where _time > ago(6h)
    | extend service = tostring(['resource.service.name'])
    | extend is_error = ['attributes.severity_text'] in ('ERROR', 'FATAL')
    | summarize errors = countif(is_error), total = count() by bin(_time, 5m), service
    | extend error_rate = todouble(errors) / todouble(total)

HTTP latency p95 by route (sirloin)

    ['beef-staging']
    | where _time > ago(1h)
    | where ['resource.service.name'] == 'sirloin'
    | where isnotempty(['attributes']['http.route'])
    | summarize p95 = percentile(['attributes']['http.duration_ms'], 95) by ['attributes']['http.route']
    | order by p95 desc
    | take 25

Recent 5xx with trace IDs (sirloin)

    ['beef-staging']
    | where _time > ago(30m)
    | where ['resource.service.name'] == 'sirloin'
    | where toint(['attributes']['http.status_code']) >= 500
    | project _time, ['attributes']['http.route'], status = ['attributes']['http.status_code'], trace_id
    | order by _time desc

One trace, all spans (paste a trace_id)

    ['beef-staging']
    | where _time > ago(24h)
    | where trace_id == 'PASTE_TRACE_ID_HERE'
    | project _time, ['resource.service.name'], name, kind, duration, status_code
    | order by _time asc

BullMQ job failures (brain)

    ['beef-staging']
    | where _time > ago(2h)
    | where ['resource.service.name'] == 'brain'
    | where name == 'bullmq.process'
    | where status_code != 1 // OK
    | summarize failures = count() by ['attributes']['queue.name'], ['attributes']['job.name']
    | order by failures desc

Round inference latency (via brain client span)

    ['beef-staging']
    | where _time > ago(1h)
    | where name == 'round.infer'
    | summarize p50 = percentile(duration, 50), p95 = percentile(duration, 95), p99 = percentile(duration, 99), count = count() by ['attributes']['round.model_id']

Span-name constants live in apps/brain/src/common/otel/tracing.ts
(OTEL_SPAN_NAMES).
Brisket → sirloin call breakdown

    ['beef-staging']
    | where _time > ago(1h)
    | where ['resource.service.name'] == 'brisket'
    | where kind == 'CLIENT'
    | summarize p95 = percentile(duration, 95), errors = countif(status_code != 1), n = count() by name
    | order by n desc

Metric query — OTLP ingest rate
Use mcp__axiom__queryMetrics (not queryDataset) for the beef-staging-otel-metrics dataset. APL does not apply.

    # in queryMetrics, metric: otelcol_exporter_sent_spans (or equivalent)
    # group by service.name

Failure modes
| Mode | Detection | Mitigation |
|---|---|---|
| Token rotated, exporter 401s | service stdout shows OTel SDK errors; Axiom volume drops to zero | rotate *_AXIOM_TOKEN in Railway; restart service |
| Wrong dataset header | Axiom returns 400; logs visible in stdout but absent in Axiom UI | check X-Axiom-Dataset matches an existing axiom:events:v1 dataset; metrics dataset must be otel:metrics:v1 (see tracing.go:160 comment) |
| Metrics 400 from JSON exporter | brisket/brain crash log: “exporter-metrics-otlp” | must use @opentelemetry/exporter-metrics-otlp-proto, not -http. Already enforced in otel.server.config.ts; do not “fix” by swapping. |
| OTLP batch lag | Axiom UI shows recent gap, then catch-up | sirloin uses BatchProcessor with default flush; tolerate up to ~30s lag during quiet periods |
| Ingest quota exceeded | Axiom rate-limits → exporter 429s in stdout | reduce log verbosity; check Sensitive Data in Logs in observability spine for over-logging candidates |
| Query timeout | APL query > 30s in MCP | narrow _time > ago(...) window; project specific fields; aggregate before take |
| Service has no OTel SDK | service silent in Axiom (round, strip, flank, fennec, chuck) | known gap — see Observability |
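
The HTTP-status rows of the table can be folded into a quick triage lookup for use when scanning exporter errors in service stdout. The mapping below only restates the table; triage_export_status is a hypothetical helper and the hint strings are illustrative.

```python
# Hints restate the failure-mode table above; wording is illustrative.
FAILURE_HINTS = {
    401: "token rejected: rotate *_AXIOM_TOKEN in Railway and restart the service",
    400: "bad request: check X-Axiom-Dataset and the dataset kind; "
         "for Node metrics, check the exporter package",
    429: "rate limited: ingest quota exceeded, reduce log verbosity",
}


def triage_export_status(status: int) -> str:
    """Map an OTLP export HTTP status to the matching mitigation hint."""
    if 200 <= status < 300:
        return "ok"
    return FAILURE_HINTS.get(
        status,
        f"unmapped status {status}: check service stdout for OTel SDK errors",
    )
```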
Secrets
Storage: Railway environment variables, per service. Names verbatim from
apps/sirloin/internal/pkg/env/variables.go and the service .env.examples:
| Variable | Service | Purpose |
|---|---|---|
| SIRLOIN_AXIOM_TOKEN | sirloin | OTLP exporter auth |
| SIRLOIN_AXIOM_DATASET | sirloin | logs + traces dataset |
| SIRLOIN_AXIOM_METRICS_DATASET | sirloin | metrics dataset |
| SIRLOIN_FOXY360_AXIOM_API_TOKEN | sirloin | Foxy360 query connector; separate from shipping token |
| SIRLOIN_FOXY360_AXIOM_DATASET | sirloin | default dataset for Foxy360 queries |
| BRISKET_OTEL_URL | brisket | OTLP base URL (https://api.axiom.co) |
| BRISKET_OTEL_TRACES_LOGS_HEADERS | brisket | authorization=Bearer xaat-...,x-axiom-dataset=... |
| BRISKET_OTEL_METRICS_HEADERS | brisket | as above, metrics dataset |
| BRAIN_OTEL_URL, BRAIN_OTEL_TRACES_LOGS_HEADERS, BRAIN_OTEL_METRICS_HEADERS | brain | mirror of brisket |
Org id and personal-token (PAT) for the Axiom MCP integration are stored in
the operator’s local config; they are not repo-side and are not the same
as the service-side xaat- ingest tokens. TODO(@law): document where the
shared org id is stashed (1Password? Railway shared env?).
Never commit Axiom tokens. All xaat- tokens shown in .env.example are placeholders; rotate immediately if a real token leaks via git.
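
A pre-commit style check can help enforce the rule above. The sketch below is an assumption-heavy illustration: it treats any xaat- prefix followed by at least four alphanumeric characters as a suspect token, which is a guess at the token shape rather than a documented format, and find_suspect_tokens is a hypothetical name.

```python
import re

# Assumed shape: "xaat-" followed by alphanumeric groups separated by dashes.
# The literal placeholder "xaat-..." does not match because dots are excluded.
_TOKEN_RE = re.compile(r"xaat-[0-9A-Za-z]{4,}(?:-[0-9A-Za-z]+)*")


def find_suspect_tokens(text: str) -> list[str]:
    """Return substrings that look like real Axiom tokens rather than elisions."""
    return _TOKEN_RE.findall(text)
```

Word-style placeholders such as xaat-your-token-here would also be flagged by this pattern, which errs on the side of a false positive over a leaked token.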
Cost model
TODO(@law): plan + cost. Need to confirm with billing owner:
- Axiom plan tier and monthly ingest GB cap.
- Per-dataset retention (events vs metrics).
- Whether beef-production is on a separate billing line.
Rough levers when ingest gets expensive:
- Drop debug-level logs in production (apps/round/cmd/app/main.go setupLogger accepts a level; apply the same discipline to brain pino / sirloin slog).
- Reduce trace sampling. Confirmed neither apps/sirloin/internal/pkg/tracing/tracing.go nor apps/brain/src/otel.server.config.ts configures a sampler, so SDK defaults apply (always-on / parent-based). TODO(@law): pin a sampler per environment.
- Consolidate metrics: every PeriodicExportingMetricReader instance ships on a 60s interval; merging dimensions reduces cardinality.
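
For the sampling lever, the usual approach is a deterministic trace-ID ratio decision, so that every service reaches the same keep-or-drop verdict for a given trace. The sketch below illustrates that idea in plain Python; it is not the OTel SDK's implementation, and ratio_sampled is a hypothetical name.

```python
def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64.

    The decision depends only on the trace ID, so it is stable across
    services and restarts.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    bound = int(ratio * (1 << 64))
    return low64 < bound
```

A ratio of 1.0 keeps everything, which corresponds to the always-on default noted above, and lowering the ratio reduces span volume roughly in proportion.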
PII and hygiene
See the Observability “Sensitive Data in Logs” section for the canonical list. Quick rules:
- No raw Chargebee card data, Primer tokens, or full Clerk JWTs in logs or span attributes.
- Email addresses currently appear in some sirloin/brain log lines (Sirloin.SubsAll.email, brain User.email); masking is TODO.
- The forthcoming Security Model page (path TBD; not yet present at docs/src/content/docs/concepts/security-model.md or operations/security-model.md) will own the redaction policy. Cross-link once it lands. TODO(@law).
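
Until the redaction policy exists, one possible shape for email masking is sketched below. It is only an illustration: keeping the first character and the domain is an assumption about what the eventual policy might allow, and mask_emails is a hypothetical helper, not code from sirloin or brain.

```python
import re

# Captures the first local-part character and the domain; the rest of the
# local part is replaced. The pattern is deliberately simple, not RFC-complete.
_EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def mask_emails(text: str) -> str:
    """Replace each email's local part with its first character plus '***'."""
    return _EMAIL_RE.sub(r"\1***@\2", text)
```

A stricter policy could hash the whole address instead, which would keep log lines joinable without exposing the domain.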
Runbook hooks
- Billing SLOs runbook references the billing_* metrics in the OTel metrics dataset; queries against beef-staging-otel-metrics are the canonical source there.
- The observability spine’s “I see a 500 in production” playbook starts in Sentry, then jumps to Axiom for the trace; the “Recent 5xx with trace IDs” query above is the entry point.
- New runbooks should cite a specific APL query block from this page rather than re-deriving it.
Open items (TODO(@law))
- Re-run mcp__axiom__listDatasets with a valid MCP token; reconcile dataset table.
- Re-run mcp__axiom__listDashboards; populate dashboard table with names + owners.
- Re-run mcp__axiom__checkMonitors; record monitor IDs and what each alerts on. Sirloin Billing Drift is documented above; backfill its monitor ID and any sibling billing monitors once the MCP token is provisioned.
- Confirm production dataset names (beef-production vs other).
- Document Axiom plan + retention in the cost section.
- Confirm flank’s OTel exporter destination (or document that spans are dropped).
- Decide whether round/strip should ship to Axiom; if yes, wire zerolog → OTel logs like sirloin’s Provider.ZerologWriter (apps/sirloin/internal/pkg/tracing/tracing.go, apps/sirloin/internal/pkg/tracing/zerolog_otel_writer.go). The writer forwards the full zerolog JSON payload (body, severity, and every structured field) into OTel log records.
- Cross-link Security Model page once it exists.
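
The zerolog-to-OTel mapping in the round/strip item can be pictured with the sketch below. It is a Python illustration of the described behaviour (body, severity, and every remaining structured field), not a port of zerolog_otel_writer.go. It assumes zerolog's default field names (level, message, time), and the output dict shape is hypothetical.

```python
import json

# zerolog level names mapped to upper-case severity text (assumed mapping;
# panic is folded into FATAL for simplicity).
_SEVERITY = {
    "trace": "TRACE", "debug": "DEBUG", "info": "INFO", "warn": "WARN",
    "error": "ERROR", "fatal": "FATAL", "panic": "FATAL",
}


def zerolog_line_to_record(line: str) -> dict:
    """Split one zerolog JSON line into body, severity, time, and attributes."""
    fields = json.loads(line)
    level = str(fields.pop("level", "")).lower()
    body = fields.pop("message", "")
    timestamp = fields.pop("time", None)
    return {
        "body": body,
        "severity_text": _SEVERITY.get(level, "UNSPECIFIED"),
        "timestamp": timestamp,
        "attributes": fields,  # every remaining structured field
    }
```

Because every leftover field becomes an attribute, the PII rules above apply to structured fields as much as to message text.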