Axiom Integration Playbook

This page is the operator-side companion to Observability. The spine documents what gets logged and how trace context propagates. This page documents the Axiom account itself — datasets, dashboards, monitors, the OTLP shipping path, and APL queries you’d run during an incident.

Overview

Axiom is the logs + traces + metrics sink for beef. Three signals, two dataset types:

axiom:events:v1 datasets receive logs and traces (default beef-staging).
otel:metrics:v1 datasets receive OTel metrics (default beef-staging-otel-metrics).

Code wiring is OTLP-over-HTTP directly to api.axiom.co, gated per service by env vars (token or headers). There is no Railway log drain in the repo (grep -r 'AXIOM' apps/*/railway.json returns nothing as of 2026-05-05), so every service that ships does so via its own OTel SDK. Services without an Axiom token print to stdout only.

Service	Wired in code?	Evidence
sirloin	Yes — Go OTLP	`apps/sirloin/internal/pkg/tracing/tracing.go`
brisket	Yes — Node OTLP	`apps/brisket/otel.server.config.ts`
brain	Yes — Node OTLP	`apps/brain/src/otel.server.config.ts`
round	No	`apps/round/go.mod` has no OTel deps
strip	No	`apps/strip/go.mod` has no OTel deps
flank	Partial	`@opentelemetry/api@^1.9.1` only (no SDK exporter) per `apps/flank/package.json`; spans terminate locally
fennec	No	No `@opentelemetry/*` package in `apps/fennec/package.json`
chuck	No	No `@opentelemetry/*` or Axiom package in `apps/chuck/package.json`

Gap. round, strip, flank, fennec, and chuck do not ship to Axiom today. Their logs land in Railway stdout and are not searchable in Axiom. For full-stack tracing, those services are dark.

Datasets in use

TODO(@law): live dataset list. The Axiom MCP probe (listDatasets) failed with “user token not found or expired” on the latest re-run; the table below is reconstructed from .env.example. Re-run mcp__axiom__listDatasets once a token is set and reconcile.

Dataset	Kind	Signals	Producers	Retention
`beef-staging`	`axiom:events:v1`	logs + traces	sirloin, brisket, brain	TODO(@law)
`beef-staging-otel-metrics`	`otel:metrics:v1`	metrics	sirloin, brisket, brain	TODO(@law)
`beef-production`	`axiom:events:v1`	logs + traces	TODO(@law): naming convention only	TODO(@law)
`beef-production-otel-metrics`	`otel:metrics:v1`	metrics	TODO(@law)	TODO(@law)

Dataset names referenced verbatim:

# .env.example:59-60
SIRLOIN_AXIOM_DATASET=beef-staging
SIRLOIN_AXIOM_METRICS_DATASET=beef-staging-otel-metrics

# apps/brisket/.env.example:25-26
BRISKET_OTEL_METRICS_HEADERS=...,x-axiom-dataset=beef-staging-otel-metrics
BRISKET_OTEL_TRACES_LOGS_HEADERS=...,x-axiom-dataset=beef-staging

Per-service log shipping

Service	Mechanism	Endpoint	Dataset (logs+traces / metrics)	Code
sirloin	Go OTLP/HTTP — `otlploghttp`, `otlptracehttp`, `otlpmetrichttp`	`api.axiom.co` (hard-coded)	`$SIRLOIN_AXIOM_DATASET` / `$SIRLOIN_AXIOM_METRICS_DATASET`	`apps/sirloin/internal/pkg/tracing/tracing.go:34`
brisket	Node OTLP — `@opentelemetry/exporter-{trace,metrics,logs}-otlp-*` via `NodeSDK`	`$BRISKET_OTEL_URL/v1/{traces,metrics,logs}`	header-encoded (`x-axiom-dataset`)	`apps/brisket/otel.server.config.ts`
brain	Node OTLP — same exporter family + `SentrySpanProcessor`	`$BRAIN_OTEL_URL/v1/{traces,metrics,logs}`	header-encoded	`apps/brain/src/otel.server.config.ts`
round	None — `zerolog` to stdout	—	—	`apps/round/cmd/app/main.go`
strip	None — `zerolog` to stdout	—	—	`apps/strip/cmd/app/main.go`
flank	None — `@opentelemetry/api` is the only OTel package; no SDK exporter	—	—	`apps/flank/package.json`

Authentication on the wire is Authorization: Bearer <AXIOM_TOKEN> plus X-Axiom-Dataset: <dataset>. sirloin and brisket send the same header shape over OTLP/HTTP; only the SDK runtime differs (Go vs Node).
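For the Node services the two headers arrive packed into a single env var as comma-separated key=value pairs. A minimal sketch of unpacking that shape, assuming values contain no commas; parseOTLPHeaders is a hypothetical helper (not repo code) and the token is a placeholder:

```go
package main

import (
	"fmt"
	"strings"
)

// parseOTLPHeaders splits a "k1=v1,k2=v2" string into a header map.
// Hypothetical helper for illustration; not taken from the repo.
func parseOTLPHeaders(raw string) map[string]string {
	headers := make(map[string]string)
	for _, pair := range strings.Split(raw, ",") {
		// Cut on the first '=' only, so values may themselves contain '='.
		key, value, found := strings.Cut(pair, "=")
		key = strings.ToLower(strings.TrimSpace(key))
		if !found || key == "" {
			continue // skip malformed fragments instead of failing the whole string
		}
		headers[key] = strings.TrimSpace(value)
	}
	return headers
}

func main() {
	// Placeholder token; real xaat- tokens never belong in source.
	raw := "authorization=Bearer xaat-placeholder,x-axiom-dataset=beef-staging"
	headers := parseOTLPHeaders(raw)
	fmt.Println(headers["authorization"])
	fmt.Println(headers["x-axiom-dataset"])
}
```

Keys are lower-cased because HTTP header names are case-insensitive; a wrong or missing x-axiom-dataset entry is what produces the "Wrong dataset header" failure mode listed later.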

OTel disabled when token is empty. sirloin’s tracing.Init short-circuits if cfg.AxiomToken == "" (tracing.go:84), returning a no-op provider. The Node services follow the same gating; for brain it lives in brain/src/common/otel/runtime-config.ts, which exports the booleans isBrainOtelEnabled, isBrainOtelMetricsEnabled, and isBrainOtelTracesAndLogsEnabled, derived from BRAIN_OTEL_URL / BRAIN_OTEL_*_HEADERS.
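The short-circuit pattern can be sketched as follows. This is not the code in tracing.go: shippingConfig, exporter, and newExporter are hypothetical names chosen for illustration, and the only behaviour carried over from the text is "empty token means no-op".

```go
package main

import "fmt"

// shippingConfig is a hypothetical stand-in for the service config.
type shippingConfig struct {
	AxiomToken string
	Dataset    string
}

// exporter is the minimal behaviour this sketch needs from a sink.
type exporter interface {
	Name() string
}

// noopExporter accepts nothing and ships nothing.
type noopExporter struct{}

func (noopExporter) Name() string { return "noop" }

// otlpExporter stands in for a real OTLP/HTTP exporter.
type otlpExporter struct{ dataset string }

func (e otlpExporter) Name() string { return "otlp:" + e.dataset }

// newExporter returns a no-op exporter when no token is configured,
// so callers never need to nil-check before emitting telemetry.
func newExporter(cfg shippingConfig) exporter {
	if cfg.AxiomToken == "" {
		return noopExporter{}
	}
	return otlpExporter{dataset: cfg.Dataset}
}

func main() {
	fmt.Println(newExporter(shippingConfig{}).Name())
	fmt.Println(newExporter(shippingConfig{AxiomToken: "xaat-placeholder", Dataset: "beef-staging"}).Name())
}
```

Returning a no-op value instead of nil keeps call sites identical in local development and in deployed environments.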

Foxy360 query connector

In addition to shipping to Axiom, sirloin can query Axiom on behalf of Foxy360 tools. This is a separate code path (apps/sirloin/internal/app/foxy360/server.go:1216, executeAxiomRawFallback) that hits https://api.axiom.co with a separate token (SIRLOIN_FOXY360_AXIOM_API_TOKEN) and applies guardrails:

max time window 30 days (server.go:1292).
requires params.dataset or SIRLOIN_FOXY360_AXIOM_DATASET env.

Use this when Foxy360 needs an APL query as part of a tool call. It is not part of the log-shipping path.
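The two guardrails above can be sketched as one validation step. This is an illustration under assumptions, not the logic in server.go: resolveQuery and the error values are hypothetical, and the per-call dataset is assumed to take precedence over the env default.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// maxQueryWindow is the 30-day cap described above.
const maxQueryWindow = 30 * 24 * time.Hour

var (
	errWindowTooLarge = errors.New("time window exceeds 30 days")
	errNoDataset      = errors.New("no dataset: set params.dataset or SIRLOIN_FOXY360_AXIOM_DATASET")
)

// resolveQuery applies both guardrails and returns the dataset to query.
// paramDataset is the per-call value; envDataset is the env default.
// Assumption: the per-call value wins when both are set.
func resolveQuery(paramDataset, envDataset string, window time.Duration) (string, error) {
	if window > maxQueryWindow {
		return "", errWindowTooLarge
	}
	if paramDataset != "" {
		return paramDataset, nil
	}
	if envDataset != "" {
		return envDataset, nil
	}
	return "", errNoDataset
}

func main() {
	dataset, err := resolveQuery("", "beef-staging", 7*24*time.Hour)
	fmt.Println(dataset, err)

	_, err = resolveQuery("beef-staging", "", 45*24*time.Hour)
	fmt.Println(err)
}
```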

Dashboards

TODO(@law): dashboard inventory. mcp__axiom__listDashboards failed again with “user token not found or expired”. Re-run when an MCP token is provisioned and populate the table below.

Dashboard	Purpose	Owner
TODO(@law): per-service traces	latency p95/p99, error rate, RPS by service	TODO(@law)
TODO(@law): billing SLOs	success rate of `billing_payment_total`, dunning lag	`@billing`
TODO(@law): OTel metrics overview	resource utilisation by service	TODO(@law)

Cross-reference: apps/sirloin/internal/app/services/billing/metrics/ defines the metric names the billing dashboard should be built on (billing_payment_total, billing_checkout_duration_seconds, etc.; see Billing SLOs runbook).

Monitors and alerts

TODO(@law): monitor inventory. The mcp__axiom__checkMonitors re-run still returns “user token not found or expired”. Run checkMonitors() then getMonitorHistory() per monitor and document below once a token is available.

Monitors known to be needed (from the observability spine’s open TODOs):

5xx-rate per service (sirloin, brain, brisket).
OTLP ingest dropping: alert when a service’s log volume falls by more than 80% within 10 minutes.
Billing event SLOs (see runbook).
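The ingest-drop condition in the list above reduces to a ratio check between two windows. A minimal sketch of that arithmetic, assuming the baseline is the event count of the preceding 10-minute window (the text does not fix what the drop is measured against); volumeDropped is a hypothetical name, and the real monitor would be defined in Axiom rather than in service code:

```go
package main

import "fmt"

// volumeDropped reports whether current fell by more than threshold
// (a fraction between 0 and 1) relative to baseline.
func volumeDropped(baseline, current int64, threshold float64) bool {
	if baseline <= 0 {
		return false // no baseline traffic, so there is nothing to compare against
	}
	drop := float64(baseline-current) / float64(baseline)
	return drop > threshold
}

func main() {
	// 1000 events in the previous window, 150 in the current one: an 85% drop.
	fmt.Println(volumeDropped(1000, 150, 0.80))
	// 1000 -> 250 is a 75% drop, below the alert threshold.
	fmt.Println(volumeDropped(1000, 250, 0.80))
}
```

The zero-baseline guard matters for services that are legitimately idle overnight; without it a quiet service would alert on every evaluation.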

What we know is wired:

Sirloin Billing Drift

Field	Value
Monitor name	`Sirloin Billing Drift`
Source metric	`billing_invoices_drift_total` (OTel `Int64Counter`)
Dataset	`beef-staging-otel-metrics` (and `beef-production-otel-metrics` once provisioned)
OTel scope	`sirloin/billing`
Threshold	any non-zero count over a rolling 15-minute window
Window	15m, evaluated every 5m
Routing	Slack `#billing-alerts`
Owner	`@billing`
Status	wired (operator-managed via the Axiom UI / IaC repo)

Rationale. The local billing.invoices registry is reconciled against Chargebee by apps/sirloin/internal/app/worker/reconcileunpaidinvoices.go (reconciler Pass 1). Per-field drift checks emit billing_invoices_drift_total with an attribute.String("field", ...) label. Field values currently emitted by the reconciler: chargebee_status, amount_due_cents, total_cents, currency, chargebee_invoice_date, due_date, chargebee_updated_at, subscription_status, effective_auto_collection_off, customer_id, chargebee_subscription_id, and deleted. Treat this list as the current reconciler emitter set, not as exhaustive — new fields can be added in reconcileunpaidinvoices.go without requiring an Axiom monitor change. The counter is registered at apps/sirloin/internal/app/services/billing/metrics/billing_invoices.go:69 (BillingInvoicesDriftTotal). During the legacy CB-overlay window this is an overlay-period observability signal; once the legacy overlay is closed it becomes the primary signal that local state has diverged from Chargebee.

Query (Axiom metrics dataset). Run it via mcp__axiom__queryMetrics, not APL: the OTLP metrics dataset is not queryable with APL (see the OTLP ingest rate metric-query note later on this page):

# in queryMetrics:
#   metric: billing_invoices_drift_total
#   aggregation: sum
#   group_by: field
#   window: 15m
# alert when the summed increase is > 0 over the 15m window

For ad-hoc inspection of which field values are firing, group by the field attribute. The reconciler emits one increment per detected field per invoice per pass, so any non-zero sum indicates real divergence (not a noisy counter).
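The "one increment per detected field per invoice per pass" behaviour can be sketched as below. This is not the reconciler's code: the invoice struct carries only three of the compared fields, driftCounter is an in-memory stand-in for the OTel Int64Counter, and the status values are illustrative.

```go
package main

import "fmt"

// invoice holds a few of the compared fields; the real registry row has more.
type invoice struct {
	ChargebeeStatus string
	AmountDueCents  int64
	Currency        string
}

// driftFields returns the names of fields whose local and upstream values differ.
func driftFields(local, upstream invoice) []string {
	var fields []string
	if local.ChargebeeStatus != upstream.ChargebeeStatus {
		fields = append(fields, "chargebee_status")
	}
	if local.AmountDueCents != upstream.AmountDueCents {
		fields = append(fields, "amount_due_cents")
	}
	if local.Currency != upstream.Currency {
		fields = append(fields, "currency")
	}
	return fields
}

// driftCounter is an in-memory stand-in for the counter, keyed by the "field" label.
type driftCounter map[string]int64

// record adds one increment per drifted field.
func (c driftCounter) record(fields []string) {
	for _, f := range fields {
		c[f]++
	}
}

// total sums all labels, which is what the monitor threshold evaluates.
func (c driftCounter) total() int64 {
	var sum int64
	for _, n := range c {
		sum += n
	}
	return sum
}

func main() {
	counter := driftCounter{}
	local := invoice{ChargebeeStatus: "paid", AmountDueCents: 0, Currency: "USD"}
	upstream := invoice{ChargebeeStatus: "payment_due", AmountDueCents: 1200, Currency: "USD"}
	counter.record(driftFields(local, upstream))
	fmt.Println(counter["chargebee_status"], counter["amount_due_cents"], counter.total())
}
```

An invoice that matches upstream contributes nothing, which is why a non-zero total is a signal rather than noise.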

Test procedure (sandbox). The monitor itself is created in the Axiom UI / IaC repo, not from this codebase. To validate end-to-end alert delivery without waiting for organic drift:

In the sandbox sirloin instance, force one drift increment — either via a temporary debug worker call to metrics.BillingInvoicesDriftTotal().Add(ctx, 1, metric.WithAttributes(attribute.String("field", "chargebee_status"))), or by deliberately writing a sentinel billing.invoices row whose chargebee_status differs from the upstream Chargebee invoice and waiting one reconciler tick.
Confirm the counter increment lands in beef-staging-otel-metrics (queryMetrics on billing_invoices_drift_total with a 5-minute window).
Verify the Slack alert arrives in #billing-alerts within the monitor’s evaluation interval. Acknowledge and silence in Axiom afterwards so the sentinel does not leave a permanent firing condition.

Related metrics. Cross-referenced with the rest of the INVR (local invoice registry) metric surface from apps/sirloin/internal/app/services/billing/metrics/billing_invoices.go:

billing_invoices_upsert_total — labeled by source and result; tracks every write into billing.invoices.
billing_invoices_pending_total — gauge of rows in a non-terminal Chargebee state, scraped each reconciler tick (a sustained high value implies the reconciler is falling behind, not that drift is occurring).
billing_invoices_reconciler_actions_total — labeled by action; action="drift" rows correlate with billing_invoices_drift_total increments.

The observability spine flags “Alerts / SLOs / on-call ownership” as item 10 of the open-TODO list — beyond the Sirloin Billing Drift monitor above, assume nothing else is wired until proven otherwise.

APL queries

Useful APL queries for an on-call engineer. Run from the Axiom UI or via mcp__axiom__queryDataset. All assume beef-staging as the events dataset; swap for beef-production once that dataset exists.

Recent errors across all services (last 1h)

['beef-staging']
| where _time > ago(1h)
| where ['attributes.severity_text'] in ('ERROR', 'FATAL')
   or ['attributes']['level'] >= 50    // pino numeric level
| project _time, ['resource.service.name'], ['attributes']['msg'], trace_id, span_id
| order by _time desc
| take 200

Per-service error rate (5m buckets, last 6h)

['beef-staging']
| where _time > ago(6h)
| extend service = tostring(['resource.service.name'])
| extend is_error = ['attributes.severity_text'] in ('ERROR', 'FATAL')
| summarize errors = countif(is_error), total = count() by bin(_time, 5m), service
| extend error_rate = todouble(errors) / todouble(total)

HTTP latency p95 by route (sirloin)

['beef-staging']
| where _time > ago(1h)
| where ['resource.service.name'] == 'sirloin'
| where isnotempty(['attributes']['http.route'])
| summarize p95 = percentile(todouble(['attributes']['http.duration_ms']), 95)
    by ['attributes']['http.route']
| order by p95 desc
| take 25

Recent 5xx with trace IDs (sirloin)

['beef-staging']
| where _time > ago(30m)
| where ['resource.service.name'] == 'sirloin'
| where toint(['attributes']['http.status_code']) >= 500
| project _time, ['attributes']['http.route'], status = ['attributes']['http.status_code'], trace_id
| order by _time desc

One trace, all spans (paste a trace_id)

['beef-staging']
| where _time > ago(24h)
| where trace_id == 'PASTE_TRACE_ID_HERE'
| project _time, ['resource.service.name'], name, kind, duration, status_code
| order by _time asc

BullMQ job failures (brain)

['beef-staging']
| where _time > ago(2h)
| where ['resource.service.name'] == 'brain'
| where name == 'bullmq.process'
| where status_code == 2   // ERROR; 0 = UNSET and 1 = OK are not failures
| summarize failures = count() by ['attributes']['queue.name'], ['attributes']['job.name']
| order by failures desc

Round inference latency (via brain client span)

['beef-staging']
| where _time > ago(1h)
| where name == 'round.infer'
| summarize p50 = percentile(duration, 50),
            p95 = percentile(duration, 95),
            p99 = percentile(duration, 99),
            count = count()
    by ['attributes']['round.model_id']

Span-name constants live in apps/brain/src/common/otel/tracing.ts (OTEL_SPAN_NAMES).

Brisket → sirloin call breakdown

['beef-staging']
| where _time > ago(1h)
| where ['resource.service.name'] == 'brisket'
| where kind == 'CLIENT'
| summarize p95 = percentile(duration, 95), errors = countif(status_code == 2), n = count()
    by name
| order by n desc

Metric query — OTLP ingest rate

Use mcp__axiom__queryMetrics (not queryDataset) for the beef-staging-otel-metrics dataset. APL does not apply.

# in queryMetrics, metric: otelcol_exporter_sent_spans (or equivalent)
# group by service.name

Failure modes

Mode	Detection	Mitigation
Token rotated, exporter 401s	service stdout shows OTel SDK errors; Axiom volume drops to zero	rotate `SIRLOIN_AXIOM_TOKEN`, or the bearer token inside `*_OTEL_*_HEADERS` for brisket/brain, in Railway; restart service
Wrong dataset header	Axiom returns 400; logs visible in stdout but absent in Axiom UI	check `X-Axiom-Dataset` matches an existing `axiom:events:v1` dataset; metrics dataset must be `otel:metrics:v1` (see `tracing.go:160` comment)
Metrics 400 from JSON exporter	brisket/brain crash log: “exporter-metrics-otlp”	must use `@opentelemetry/exporter-metrics-otlp-proto`, not `-http`. Already enforced in `otel.server.config.ts`; do not “fix” by swapping.
OTLP batch lag	Axiom UI shows recent gap, then catch-up	sirloin uses `BatchProcessor` with default flush; tolerate up to ~30s lag during quiet periods
Ingest quota exceeded	Axiom rate-limits → exporter `429`s in stdout	reduce log verbosity; check `Sensitive Data in Logs` in observability spine for over-logging candidates
Query timeout	APL query > 30s in MCP	narrow `_time > ago(...)` window; project specific fields; aggregate before `take`
Service has no OTel SDK	service silent in Axiom (round, strip, flank, fennec, chuck)	known gap — see Observability
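The HTTP-status rows of the table above can be read as a lookup from exporter response code to failure mode. A minimal sketch for triage scripts; classifyExportStatus is a hypothetical helper, and the returned strings only paraphrase the table:

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyExportStatus maps an OTLP exporter HTTP status to the failure
// mode it most likely indicates, following the table above.
func classifyExportStatus(status int) string {
	switch {
	case status >= 200 && status < 300:
		return "ok"
	case status == http.StatusUnauthorized:
		return "token rotated or invalid: rotate the token in Railway and restart"
	case status == http.StatusBadRequest:
		return "wrong dataset header or exporter encoding: check X-Axiom-Dataset and the metrics exporter package"
	case status == http.StatusTooManyRequests:
		return "ingest quota exceeded: reduce log verbosity"
	default:
		return "unclassified: inspect service stdout"
	}
}

func main() {
	for _, status := range []int{200, 400, 401, 429, 503} {
		fmt.Printf("%d -> %s\n", status, classifyExportStatus(status))
	}
}
```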

Secrets

Storage: Railway environment variables, per service. Names verbatim from apps/sirloin/internal/pkg/env/variables.go and each service’s .env.example:

Variable	Service	Purpose
`SIRLOIN_AXIOM_TOKEN`	sirloin	OTLP exporter auth
`SIRLOIN_AXIOM_DATASET`	sirloin	logs + traces dataset
`SIRLOIN_AXIOM_METRICS_DATASET`	sirloin	metrics dataset
`SIRLOIN_FOXY360_AXIOM_API_TOKEN`	sirloin	Foxy360 query connector — separate from shipping token
`SIRLOIN_FOXY360_AXIOM_DATASET`	sirloin	default dataset for Foxy360 queries
`BRISKET_OTEL_URL`	brisket	OTLP base URL (`https://api.axiom.co`)
`BRISKET_OTEL_TRACES_LOGS_HEADERS`	brisket	`authorization=Bearer xaat-...,x-axiom-dataset=...`
`BRISKET_OTEL_METRICS_HEADERS`	brisket	as above, metrics dataset
`BRAIN_OTEL_URL`, `BRAIN_OTEL_TRACES_LOGS_HEADERS`, `BRAIN_OTEL_METRICS_HEADERS`	brain	mirror of brisket

The org ID and personal access token (PAT) for the Axiom MCP integration are stored in the operator’s local config; they are not repo-side and are not the same as the service-side xaat- ingest tokens. TODO(@law): document where the shared org ID is stashed (1Password? Railway shared env?).

Never commit Axiom tokens. All xaat- tokens shown in .env.example are placeholders; rotate immediately if a real token leaks via git.

Cost model

TODO(@law): plan + cost. Need to confirm with billing owner:

Axiom plan tier and monthly ingest GB cap.
Per-dataset retention (events vs metrics).
Whether beef-production is on a separate billing line.

Rough levers when ingest gets expensive:

Drop debug-level logs in production (apps/round/cmd/app/main.go setupLogger accepts a level — apply same discipline to brain pino / sirloin slog).
Reduce trace sampling. Neither apps/sirloin/internal/pkg/tracing/tracing.go nor apps/brain/src/otel.server.config.ts configures a sampler, so the SDK default applies: parent-based always-on, meaning every root trace is sampled. TODO(@law): pin a sampler per environment.
Consolidate metrics — every PeriodicExportingMetricReader instance ships on a 60s interval; merging dimensions reduces cardinality.

PII and hygiene

See the Observability “Sensitive Data in Logs” section for the canonical list. Quick rules:

No raw Chargebee card data, Primer tokens, or full Clerk JWTs in logs or span attributes.
Email addresses currently appear in some sirloin/brain log lines (Sirloin.SubsAll.email, brain User.email); masking is TODO.
The forthcoming Security Model page (path TBD; not yet present at docs/src/content/docs/concepts/security-model.md or operations/security-model.md) will own the redaction policy. Cross-link once it lands. TODO(@law).

Runbook hooks

Billing SLOs runbook references the billing_* metrics in the OTel metrics dataset; queries against beef-staging-otel-metrics are the canonical source there.
The observability spine’s “I see a 500 in production” playbook starts in Sentry, then jumps to Axiom for the trace — the “Recent 5xx with trace IDs” query above is the entry point.
New runbooks should cite a specific APL query block from this page rather than re-deriving it.

Open items (TODO(@law))

Re-run mcp__axiom__listDatasets with a valid MCP token; reconcile dataset table.
Re-run mcp__axiom__listDashboards; populate dashboard table with names + owners.
Re-run mcp__axiom__checkMonitors; record monitor IDs and what each alerts on. Sirloin Billing Drift is documented above; backfill its monitor ID and any sibling billing monitors once the MCP token is provisioned.
Confirm production dataset names (beef-production vs other).
Document Axiom plan + retention in the cost section.
Confirm flank’s OTel exporter destination (or document that spans are dropped).
Decide whether round/strip should ship to Axiom; if yes, wire zerolog → OTel logs like sirloin’s Provider.ZerologWriter (apps/sirloin/internal/pkg/tracing/tracing.go, apps/sirloin/internal/pkg/tracing/zerolog_otel_writer.go). The writer forwards the full zerolog JSON payload — body, severity, and every structured field — into OTel log records.
Cross-link Security Model page once it exists.
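For the round/strip item above, the zerolog-to-OTel bridge boils down to an io.Writer that receives one JSON event per Write call and splits it into severity, body, and attributes. The sketch below is not sirloin's Provider.ZerologWriter: logRecord is a simplified stand-in for an OTel log record, recordWriter is a hypothetical name, and the only zerolog facts relied on are its default "level" and "message" field names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logRecord is a simplified stand-in for an OTel log record.
type logRecord struct {
	Severity   string
	Body       string
	Attributes map[string]any
}

// recordWriter implements io.Writer; zerolog calls Write once per JSON event.
type recordWriter struct {
	records []logRecord
}

// Write parses one JSON event and keeps every structured field.
func (w *recordWriter) Write(p []byte) (int, error) {
	var event map[string]any
	if err := json.Unmarshal(p, &event); err != nil {
		// Not JSON: keep the raw line as the body rather than dropping it.
		w.records = append(w.records, logRecord{Body: string(p)})
		return len(p), nil
	}
	rec := logRecord{Attributes: map[string]any{}}
	for key, value := range event {
		switch key {
		case "level": // zerolog's default level field
			rec.Severity, _ = value.(string)
		case "message": // zerolog's default message field
			rec.Body, _ = value.(string)
		default:
			rec.Attributes[key] = value
		}
	}
	w.records = append(w.records, rec)
	return len(p), nil
}

func main() {
	w := &recordWriter{}
	line := []byte(`{"level":"error","message":"inference failed","model_id":"m-1"}`)
	if _, err := w.Write(line); err != nil {
		panic(err)
	}
	rec := w.records[0]
	fmt.Println(rec.Severity, rec.Body, rec.Attributes["model_id"])
}
```

A real bridge would hand each record to an OTel logger instead of appending to a slice; the slice just keeps the sketch observable.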