Axiom Integration Playbook

This page is the operator-side companion to Observability. The spine documents what gets logged and how trace context propagates. This page documents the Axiom account itself — datasets, dashboards, monitors, the OTLP shipping path, and APL queries you’d run during an incident.

Overview

Axiom is the logs + traces + metrics sink for beef. Three signals, two dataset types:

  • axiom:events:v1 datasets receive logs and traces (default beef-staging).
  • otel:metrics:v1 datasets receive OTel metrics (default beef-staging-otel-metrics).

Code wiring is OTLP-over-HTTP directly to api.axiom.co, gated by an env flag. There is no Railway log drain in the repo (grep -r 'AXIOM' apps/*/railway.json returns nothing as of 2026-05-05); every service that ships does so via its own OTel SDK. Services without an Axiom token print to stdout only.

| Service | Wired in code? | Evidence |
| --- | --- | --- |
| sirloin | Yes (Go OTLP) | apps/sirloin/internal/pkg/tracing/tracing.go |
| brisket | Yes (Node OTLP) | apps/brisket/otel.server.config.ts |
| brain | Yes (Node OTLP) | apps/brain/src/otel.server.config.ts |
| round | No | apps/round/go.mod has no OTel deps |
| strip | No | apps/strip/go.mod has no OTel deps |
| flank | Partial | @opentelemetry/api@^1.9.1 only (no SDK exporter) per apps/flank/package.json; spans terminate locally |
| fennec | No | No @opentelemetry/* package in apps/fennec/package.json |
| chuck | No | No @opentelemetry/* or Axiom package in apps/chuck/package.json |

Gap. round, strip, flank, fennec, and chuck do not ship to Axiom today. Their logs land in Railway stdout and are not searchable in Axiom. For full-stack tracing, those services are dark.

Datasets in use

TODO(@law): live dataset list. The Axiom MCP probe (listDatasets) failed with “user token not found or expired” on the latest re-run; the table below is reconstructed from .env.example. Re-run mcp__axiom__listDatasets once a token is set and reconcile.

| Dataset | Kind | Signals | Producers | Retention |
| --- | --- | --- | --- | --- |
| beef-staging | axiom:events:v1 | logs + traces | sirloin, brisket, brain | TODO(@law) |
| beef-staging-otel-metrics | otel:metrics:v1 | metrics | sirloin, brisket, brain | TODO(@law) |
| beef-production | axiom:events:v1 | logs + traces | TODO(@law): naming convention only | TODO(@law) |
| beef-production-otel-metrics | otel:metrics:v1 | metrics | TODO(@law) | TODO(@law) |

Dataset names referenced verbatim:

# .env.example:59-60
SIRLOIN_AXIOM_DATASET=beef-staging
SIRLOIN_AXIOM_METRICS_DATASET=beef-staging-otel-metrics
# apps/brisket/.env.example:25-26
BRISKET_OTEL_METRICS_HEADERS=...,x-axiom-dataset=beef-staging-otel-metrics
BRISKET_OTEL_TRACES_LOGS_HEADERS=...,x-axiom-dataset=beef-staging
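
The *_HEADERS values pack several HTTP headers into one comma-separated key=value string. The sketch below shows one way an operator script could read the dataset name back out of such a value. It assumes the format follows the usual OTLP headers convention (pairs separated by commas, key and value split on the first "="). The function name parse_otel_headers is hypothetical and does not exist in the repo.

```python
def parse_otel_headers(raw: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict with lower-cased header names.

    Assumption: values never contain a comma. The split is on the first '='
    only, so a value that itself contains '=' is kept whole.
    """
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas and empty input
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed header pair: {pair!r}")
        headers[key.strip().lower()] = value.strip()
    return headers


# Placeholder token, same shape as the .env.example entries.
example = "authorization=Bearer xaat-placeholder,x-axiom-dataset=beef-staging"
print(parse_otel_headers(example)["x-axiom-dataset"])
```

Lower-casing the keys makes the lookup independent of how the variable was typed in Railway, since HTTP header names are case-insensitive.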

Per-service log shipping

| Service | Mechanism | Endpoint | Dataset (logs+traces / metrics) | Code |
| --- | --- | --- | --- | --- |
| sirloin | Go OTLP/HTTP: otlploghttp, otlptracehttp, otlpmetrichttp | api.axiom.co (hard-coded) | $SIRLOIN_AXIOM_DATASET / $SIRLOIN_AXIOM_METRICS_DATASET | apps/sirloin/internal/pkg/tracing/tracing.go:34 |
| brisket | Node OTLP: @opentelemetry/exporter-{trace,metrics,logs}-otlp-* via NodeSDK | $BRISKET_OTEL_URL/v1/{traces,metrics,logs} | header-encoded (x-axiom-dataset) | apps/brisket/otel.server.config.ts |
| brain | Node OTLP: same exporter family + SentrySpanProcessor | $BRAIN_OTEL_URL/v1/{traces,metrics,logs} | header-encoded | apps/brain/src/otel.server.config.ts |
| round | None (zerolog to stdout) | | | apps/round/cmd/app/main.go |
| strip | None (zerolog to stdout) | | | apps/strip/cmd/app/main.go |
| flank | None (@opentelemetry/api is the only OTel package; no SDK exporter) | | | apps/flank/package.json |

Authentication on the wire is Authorization: Bearer <AXIOM_TOKEN> plus X-Axiom-Dataset: <dataset>. sirloin and brisket send the same header shape; only the SDK differs (Go vs Node).
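
As a rough illustration of that header shape, the sketch below builds the two headers and the per-signal endpoint the way the tables on this page describe them. The helper names are hypothetical, and the /v1/{traces,metrics,logs} path layout is taken from the Node rows of the shipping table. Nothing here sends a request.

```python
SIGNALS = ("traces", "metrics", "logs")


def axiom_otlp_headers(token: str, dataset: str) -> dict[str, str]:
    """Return the two headers an OTLP exporter sends to Axiom."""
    if not token:
        raise ValueError("empty Axiom token")
    if not dataset:
        raise ValueError("empty dataset name")
    return {
        "Authorization": f"Bearer {token}",
        "X-Axiom-Dataset": dataset,
    }


def otlp_endpoint(base_url: str, signal: str) -> str:
    """Join the OTLP base URL and the per-signal path."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal: {signal!r}")
    return f"{base_url.rstrip('/')}/v1/{signal}"


# Placeholder token; real xaat- tokens live in Railway, never in code.
print(otlp_endpoint("https://api.axiom.co", "logs"))
print(axiom_otlp_headers("xaat-placeholder", "beef-staging"))
```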

OTel disabled when token is empty. sirloin’s tracing.Init short-circuits if cfg.AxiomToken == "" (tracing.go:84), returning a no-op provider. The Node services follow the same gating in brain/src/common/otel/runtime-config.ts — exported booleans isBrainOtelEnabled, isBrainOtelMetricsEnabled, isBrainOtelTracesAndLogsEnabled derived from BRAIN_OTEL_URL / BRAIN_OTEL_*_HEADERS.
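
The gating can be pictured with a small sketch. It is written in Python for illustration only and is not a port of runtime-config.ts. The exact rule is an assumption: a signal counts as enabled only when both the base URL and the matching headers variable are non-empty. The dictionary keys reuse the exported boolean names from the paragraph above.

```python
from typing import Mapping


def _is_set(env: Mapping[str, str], name: str) -> bool:
    """True when the variable is present and not just whitespace."""
    return bool(env.get(name, "").strip())


def brain_otel_flags(env: Mapping[str, str]) -> dict[str, bool]:
    # Assumption: URL and headers must both be set for a signal to ship.
    # The authoritative rule lives in the brain runtime config, not here.
    has_url = _is_set(env, "BRAIN_OTEL_URL")
    metrics = has_url and _is_set(env, "BRAIN_OTEL_METRICS_HEADERS")
    traces_logs = has_url and _is_set(env, "BRAIN_OTEL_TRACES_LOGS_HEADERS")
    return {
        "isBrainOtelEnabled": metrics or traces_logs,
        "isBrainOtelMetricsEnabled": metrics,
        "isBrainOtelTracesAndLogsEnabled": traces_logs,
    }


# With nothing configured, every flag is False and the service logs to stdout only.
print(brain_otel_flags({}))
```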

Foxy360 query connector

In addition to shipping to Axiom, sirloin can query Axiom on behalf of Foxy360 tools. This is a separate code path (apps/sirloin/internal/app/foxy360/server.go:1216, executeAxiomRawFallback) that hits https://api.axiom.co with a separate token (SIRLOIN_FOXY360_AXIOM_API_TOKEN) and applies guardrails:

  • max time window 30 days (server.go:1292).
  • requires params.dataset or SIRLOIN_FOXY360_AXIOM_DATASET env.

Use this when Foxy360 needs an APL query as part of a tool call. It is not part of the log-shipping path.
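
The two guardrails above can be sketched as a pre-flight check. This is an illustration in Python, not the Go code in server.go. The function name, the precedence of params.dataset over the env default, and the rejection of empty or inverted windows are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping

MAX_WINDOW = timedelta(days=30)  # guardrail: max time window


def check_foxy360_query(
    params: Mapping[str, str],
    env: Mapping[str, str],
    start: datetime,
    end: datetime,
) -> str:
    """Validate the time window and return the dataset to query."""
    if end <= start:
        # Assumption: an empty or inverted window is rejected outright.
        raise ValueError("end must be after start")
    if end - start > MAX_WINDOW:
        raise ValueError("time window exceeds 30 days")
    # Assumption: an explicit params.dataset wins over the env default.
    dataset = params.get("dataset") or env.get("SIRLOIN_FOXY360_AXIOM_DATASET")
    if not dataset:
        raise ValueError("no dataset in params or SIRLOIN_FOXY360_AXIOM_DATASET")
    return dataset


now = datetime(2026, 5, 5, tzinfo=timezone.utc)
print(check_foxy360_query({"dataset": "beef-staging"}, {}, now - timedelta(days=7), now))
```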

Dashboards

TODO(@law): dashboard inventory. mcp__axiom__listDashboards failed again with user token not found or expired. Re-run when an MCP token is provisioned and populate the table below.

| Dashboard | Purpose | Owner |
| --- | --- | --- |
| TODO(@law): per-service traces | latency p95/p99, error rate, RPS by service | TODO(@law) |
| TODO(@law): billing SLOs | success rate of billing_payment_total, dunning lag | @billing |
| TODO(@law): OTel metrics overview | resource utilisation by service | TODO(@law) |

Cross-reference: apps/sirloin/internal/app/services/billing/metrics/ defines the metric names the billing dashboard should be built on (billing_payment_total, billing_checkout_duration_seconds, etc.; see Billing SLOs runbook).

Monitors and alerts

TODO(@law): monitor inventory. mcp__axiom__checkMonitors re-run still returns user token not found or expired. Run checkMonitors() then getMonitorHistory() per monitor and document below once a token is available.

Monitors known to be needed (from the observability spine’s open TODOs):

  • 5xx-rate per service (sirloin, brain, brisket).
  • OTLP ingest dropping — alert when a service’s log volume falls > 80% in 10m.
  • Billing event SLOs (see runbook).
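
For the ingest-dropping monitor, the comparison could look like the sketch below. It assumes the baseline is the log count from the preceding 10-minute window. The monitor itself is not built yet, so the function names and the choice of baseline are hypothetical.

```python
def volume_drop_fraction(baseline: int, current: int) -> float:
    """Fraction by which log volume fell relative to the baseline window."""
    if baseline <= 0:
        # No traffic in the baseline window: nothing to compare against.
        return 0.0
    return max(0.0, (baseline - current) / baseline)


def ingest_dropping(baseline: int, current: int, threshold: float = 0.8) -> bool:
    """True when volume fell by more than the threshold (80% by default)."""
    return volume_drop_fraction(baseline, current) > threshold


# 1000 log lines in the previous 10 minutes, 150 in the current 10 minutes.
print(ingest_dropping(1000, 150))
```

Returning 0.0 for an empty baseline keeps a service that is normally quiet from paging on every idle window. The trade-off is that a service which has been silent for two consecutive windows will not alert, so this check pairs best with an absolute floor.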

What we know is wired:

Sirloin Billing Drift

| Field | Value |
| --- | --- |
| Monitor name | Sirloin Billing Drift |
| Source metric | billing_invoices_drift_total (OTel Int64Counter) |
| Dataset | beef-staging-otel-metrics (and beef-production-otel-metrics once provisioned) |
| OTel scope | sirloin/billing |
| Threshold | any non-zero count over a rolling 15-minute window |
| Window | 15m, evaluated every 5m |
| Routing | Slack #billing-alerts |
| Owner | @billing |
| Status | wired (operator-managed via the Axiom UI / IaC repo) |

Rationale. The local billing.invoices registry is reconciled against Chargebee by apps/sirloin/internal/app/worker/reconcileunpaidinvoices.go (reconciler Pass 1). Per-field drift checks emit billing_invoices_drift_total with an attribute.String("field", ...) label. Field values currently emitted by the reconciler: chargebee_status, amount_due_cents, total_cents, currency, chargebee_invoice_date, due_date, chargebee_updated_at, subscription_status, effective_auto_collection_off, customer_id, chargebee_subscription_id, and deleted. Treat this list as the current reconciler emitter set, not as exhaustive — new fields can be added in reconcileunpaidinvoices.go without requiring an Axiom monitor change. The counter is registered at apps/sirloin/internal/app/services/billing/metrics/billing_invoices.go:69 (BillingInvoicesDriftTotal). During the legacy CB-overlay window this is an overlay-period observability signal; once the legacy overlay is closed it becomes the primary signal that local state has diverged from Chargebee.

Query (Axiom metrics dataset; run via mcp__axiom__queryMetrics, not APL, because the OTLP metrics dataset is not queryable with APL; see the OTLP ingest rate metric query note below):

# in queryMetrics:
# metric: billing_invoices_drift_total
# aggregation: sum
# group_by: field
# window: 15m
# alert when sum(rate) > 0 over the 15m window

For ad-hoc inspection of which field values are firing, group by the field attribute. The reconciler emits one increment per detected field per invoice per pass, so any non-zero sum indicates real divergence (not a noisy counter).
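
The threshold logic (any non-zero sum over a rolling 15-minute window, grouped by field) can be sketched offline as follows. The input shape, a list of (timestamp, field, count) tuples, is an assumption made for illustration. It is not the queryMetrics response format.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def drift_by_field(
    increments: Iterable[tuple[datetime, str, int]],
    now: datetime,
    window: timedelta = timedelta(minutes=15),
) -> dict[str, int]:
    """Sum counter increments per field inside the rolling window ending at now."""
    cutoff = now - window
    totals: dict[str, int] = {}
    for ts, field, count in increments:
        if cutoff < ts <= now:
            totals[field] = totals.get(field, 0) + count
    return totals


def should_alert(totals: dict[str, int]) -> bool:
    """The monitor fires on any non-zero drift count."""
    return sum(totals.values()) > 0


now = datetime(2026, 5, 5, 12, 0, tzinfo=timezone.utc)
samples = [
    (now - timedelta(minutes=3), "chargebee_status", 1),
    (now - timedelta(minutes=40), "currency", 2),  # outside the window
]
print(drift_by_field(samples, now), should_alert(drift_by_field(samples, now)))
```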

Test procedure (sandbox). The monitor itself is created in the Axiom UI / IaC repo, not from this codebase. To validate end-to-end alert delivery without waiting for organic drift:

  1. In the sandbox sirloin instance, force one drift increment — either via a temporary debug worker call to metrics.BillingInvoicesDriftTotal().Add(ctx, 1, metric.WithAttributes(attribute.String("field", "chargebee_status"))), or by deliberately writing a sentinel billing.invoices row whose chargebee_status differs from the upstream Chargebee invoice and waiting one reconciler tick.
  2. Confirm the counter increment lands in beef-staging-otel-metrics (queryMetrics on billing_invoices_drift_total with a 5-minute window).
  3. Verify the Slack alert arrives in #billing-alerts within the monitor’s evaluation interval. Acknowledge and silence in Axiom afterwards so the sentinel does not leave a permanent firing condition.

Related metrics. Cross-referenced with the rest of the INVR (local invoice registry) metric surface from apps/sirloin/internal/app/services/billing/metrics/billing_invoices.go:

  • billing_invoices_upsert_total — labeled by source and result; tracks every write into billing.invoices.
  • billing_invoices_pending_total — gauge of rows in a non-terminal Chargebee state, scraped each reconciler tick (a sustained high value implies the reconciler is falling behind, not that drift is occurring).
  • billing_invoices_reconciler_actions_total — labeled by action; action="drift" rows correlate with billing_invoices_drift_total increments.

The observability spine flags “Alerts / SLOs / on-call ownership” as item 10 of the open-TODO list — beyond the Sirloin Billing Drift monitor above, assume nothing else is wired until proven otherwise.

APL queries

Useful APL queries for an on-call engineer. Run from the Axiom UI or via mcp__axiom__queryDataset. All assume beef-staging as the events dataset; swap for beef-production once that dataset exists.

Recent errors across all services (last 1h)

['beef-staging']
| where _time > ago(1h)
| where ['attributes.severity_text'] in ('ERROR', 'FATAL')
    or ['attributes']['level'] >= 50 // pino numeric level
| project _time, ['resource.service.name'], ['attributes']['msg'], trace_id, span_id
| order by _time desc
| take 200

Per-service error rate (5m buckets, last 6h)

['beef-staging']
| where _time > ago(6h)
| extend service = tostring(['resource.service.name'])
| extend is_error = ['attributes.severity_text'] in ('ERROR', 'FATAL')
| summarize errors = countif(is_error), total = count() by bin(_time, 5m), service
| extend error_rate = todouble(errors) / todouble(total)

HTTP latency p95 by route (sirloin)

['beef-staging']
| where _time > ago(1h)
| where ['resource.service.name'] == 'sirloin'
| where isnotempty(['attributes']['http.route'])
| summarize p95 = percentile(todouble(['attributes']['http.duration_ms']), 95)
    by ['attributes']['http.route']
| order by p95 desc
| take 25

Recent 5xx with trace IDs (sirloin)

['beef-staging']
| where _time > ago(30m)
| where ['resource.service.name'] == 'sirloin'
| where toint(['attributes']['http.status_code']) >= 500
| project _time, ['attributes']['http.route'], status = ['attributes']['http.status_code'], trace_id
| order by _time desc

One trace, all spans (paste a trace_id)

['beef-staging']
| where _time > ago(24h)
| where trace_id == 'PASTE_TRACE_ID_HERE'
| project _time, ['resource.service.name'], name, kind, duration, status_code
| order by _time asc

BullMQ job failures (brain)

['beef-staging']
| where _time > ago(2h)
| where ['resource.service.name'] == 'brain'
| where name == 'bullmq.process'
| where status_code == 2 // ERROR (OTLP enum: 0 = UNSET, 1 = OK, 2 = ERROR; UNSET is not a failure)
| summarize failures = count() by ['attributes']['queue.name'], ['attributes']['job.name']
| order by failures desc

Round inference latency (via brain client span)

['beef-staging']
| where _time > ago(1h)
| where name == 'round.infer'
| summarize p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    count = count()
    by ['attributes']['round.model_id']

Span-name constants live in apps/brain/src/common/otel/tracing.ts (OTEL_SPAN_NAMES).

Brisket → sirloin call breakdown

['beef-staging']
| where _time > ago(1h)
| where ['resource.service.name'] == 'brisket'
| where kind == 'CLIENT'
| summarize p95 = percentile(duration, 95), errors = countif(status_code == 2), n = count() // 2 = ERROR; UNSET (0) spans are not errors
    by name
| order by n desc

Metric query — OTLP ingest rate

Use mcp__axiom__queryMetrics (not queryDataset) for the beef-staging-otel-metrics dataset. APL does not apply.

# in queryMetrics, metric: otelcol_exporter_sent_spans (or equivalent)
# group by service.name

Failure modes

| Mode | Detection | Mitigation |
| --- | --- | --- |
| Token rotated, exporter 401s | service stdout shows OTel SDK errors; Axiom volume drops to zero | rotate *_AXIOM_TOKEN in Railway; restart service |
| Wrong dataset header | Axiom returns 400; logs visible in stdout but absent in Axiom UI | check X-Axiom-Dataset matches an existing axiom:events:v1 dataset; metrics dataset must be otel:metrics:v1 (see tracing.go:160 comment) |
| Metrics 400 from JSON exporter | brisket/brain crash log: “exporter-metrics-otlp” | must use @opentelemetry/exporter-metrics-otlp-proto, not -http. Already enforced in otel.server.config.ts; do not “fix” by swapping. |
| OTLP batch lag | Axiom UI shows recent gap, then catch-up | sirloin uses BatchProcessor with default flush; tolerate up to ~30s lag during quiet periods |
| Ingest quota exceeded | Axiom rate-limits → exporter 429s in stdout | reduce log verbosity; check Sensitive Data in Logs in observability spine for over-logging candidates |
| Query timeout | APL query > 30s in MCP | narrow _time > ago(...) window; project specific fields; aggregate before take |
| Service has no OTel SDK | service silent in Axiom (round, strip, flank, fennec, chuck) | known gap; see Observability |
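
The HTTP-status rows of the table can be condensed into a small triage helper for on-call use. This is a sketch: the function name is hypothetical, and the mapping covers only the 401, 400, and 429 cases the table names.

```python
# Status code -> (failure mode, first mitigation step), taken from the table above.
EXPORTER_TRIAGE = {
    401: ("Token rotated, exporter 401s",
          "rotate *_AXIOM_TOKEN in Railway; restart service"),
    400: ("Wrong dataset header or metrics 400 from JSON exporter",
          "check X-Axiom-Dataset and the dataset kind; keep the -proto metrics exporter"),
    429: ("Ingest quota exceeded",
          "reduce log verbosity"),
}


def triage_exporter_status(status: int) -> tuple[str, str]:
    """Map an exporter HTTP status seen in stdout to a failure mode and a first step."""
    return EXPORTER_TRIAGE.get(
        status,
        ("Not covered by the failure-mode table", "inspect service stdout for OTel SDK errors"),
    )


mode, step = triage_exporter_status(429)
print(f"{mode}: {step}")
```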

Secrets

Storage: Railway environment variables, per service. Names are verbatim from apps/sirloin/internal/pkg/env/variables.go and each service’s .env.example:

| Variable | Service | Purpose |
| --- | --- | --- |
| SIRLOIN_AXIOM_TOKEN | sirloin | OTLP exporter auth |
| SIRLOIN_AXIOM_DATASET | sirloin | logs + traces dataset |
| SIRLOIN_AXIOM_METRICS_DATASET | sirloin | metrics dataset |
| SIRLOIN_FOXY360_AXIOM_API_TOKEN | sirloin | Foxy360 query connector; separate from the shipping token |
| SIRLOIN_FOXY360_AXIOM_DATASET | sirloin | default dataset for Foxy360 queries |
| BRISKET_OTEL_URL | brisket | OTLP base URL (https://api.axiom.co) |
| BRISKET_OTEL_TRACES_LOGS_HEADERS | brisket | authorization=Bearer xaat-...,x-axiom-dataset=... |
| BRISKET_OTEL_METRICS_HEADERS | brisket | as above, metrics dataset |
| BRAIN_OTEL_URL, BRAIN_OTEL_TRACES_LOGS_HEADERS, BRAIN_OTEL_METRICS_HEADERS | brain | mirror of brisket |

The org ID and personal access token (PAT) for the Axiom MCP integration are stored in the operator’s local config; they are not repo-side and are not the same as the service-side xaat- ingest tokens. TODO(@law): document where the shared org ID is stashed (1Password? Railway shared env?).

Never commit Axiom tokens. All xaat- tokens shown in .env.example are placeholders; rotate immediately if a real token leaks via git.

Cost model

TODO(@law): plan + cost. Need to confirm with billing owner:

  • Axiom plan tier and monthly ingest GB cap.
  • Per-dataset retention (events vs metrics).
  • Whether beef-production is on a separate billing line.

Rough levers when ingest gets expensive:

  1. Drop debug-level logs in production (apps/round/cmd/app/main.go setupLogger accepts a level; apply the same discipline to brain pino / sirloin slog).
  2. Reduce trace sampling. Confirmed neither apps/sirloin/internal/pkg/tracing/tracing.go nor apps/brain/src/otel.server.config.ts configures a sampler — SDK defaults apply (always-on / parent-based). TODO(@law): pin a sampler per environment.
  3. Consolidate metrics — every PeriodicExportingMetricReader instance ships on a 60s interval; merging dimensions reduces cardinality.

PII and hygiene

See the Observability “Sensitive Data in Logs” section for the canonical list. Quick rules:

  • No raw Chargebee card data, Primer tokens, or full Clerk JWTs in logs or span attributes.
  • Email addresses currently appear in some sirloin/brain log lines (Sirloin.SubsAll.email, brain User.email); masking is TODO.
  • The forthcoming Security Model page (path TBD; not yet present at docs/src/content/docs/concepts/security-model.md or operations/security-model.md) will own the redaction policy. Cross-link once it lands. TODO(@law).
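
Until the redaction policy lands, the email-masking TODO could be approached along these lines. This is a sketch only, not the project's policy: the regex is a deliberately simple approximation of an email address, and keeping the first character and the domain is an assumption that the Security Model page may overrule.

```python
import re

# Simple approximation of an email address; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(line: str) -> str:
    """Replace the local part of each email with its first character plus '***'."""

    def _mask(match: re.Match) -> str:
        local, _, domain = match.group(0).partition("@")
        return f"{local[0]}***@{domain}"

    return EMAIL_RE.sub(_mask, line)


print(mask_emails("subscription lookup email=alice@example.com status=ok"))
```

A masker like this would sit in the log pipeline before the OTLP exporter, so that neither stdout nor Axiom receives the raw address.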

Runbook hooks

  • Billing SLOs runbook references the billing_* metrics in the OTel metrics dataset; queries against beef-staging-otel-metrics are the canonical source there.
  • The observability spine’s “I see a 500 in production” playbook starts in Sentry, then jumps to Axiom for the trace — the “Recent 5xx with trace IDs” query above is the entry point.
  • New runbooks should cite a specific APL query block from this page rather than re-deriving it.

Open items (TODO(@law))

  1. Re-run mcp__axiom__listDatasets with a valid MCP token; reconcile dataset table.
  2. Re-run mcp__axiom__listDashboards; populate dashboard table with names + owners.
  3. Re-run mcp__axiom__checkMonitors; record monitor IDs and what each alerts on. Sirloin Billing Drift is documented above; backfill its monitor ID and any sibling billing monitors once the MCP token is provisioned.
  4. Confirm production dataset names (beef-production vs other).
  5. Document Axiom plan + retention in the cost section.
  6. Confirm flank’s OTel exporter destination (or document that spans are dropped).
  7. Decide whether round/strip should ship to Axiom; if yes, wire zerolog → OTel logs like sirloin’s Provider.ZerologWriter (apps/sirloin/internal/pkg/tracing/tracing.go, apps/sirloin/internal/pkg/tracing/zerolog_otel_writer.go). The writer forwards the full zerolog JSON payload — body, severity, and every structured field — into OTel log records.
  8. Cross-link Security Model page once it exists.