
Observability

How we see what production is doing — logs, metrics, traces, and the playbook for chasing a 500 back to its root cause.

Stack Inventory

What is actually wired in, per service, with file evidence.

| Service | Logs | Metrics | Traces | Errors | Product analytics |
| --- | --- | --- | --- | --- | --- |
| sirloin (Go) | log/slog (stdlib) shipped via OTLP logs (go.opentelemetry.io/otel/log, otlploghttp in apps/sirloin/go.mod) | OpenTelemetry metrics → OTLP HTTP (go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp) | OpenTelemetry traces → OTLP HTTP, with gRPC + HTTP middleware (otelgrpc, otelhttp); BUN auto-instrumented (bunotel, otelsql) | Sentry Go SDK (github.com/getsentry/sentry-go v0.44.1) | PostHog (github.com/posthog/posthog-go v1.11.2) |
| brain (NestJS) | pino + nestjs-pino + pino-http + pino-pretty (in apps/brain/package.json); auto-correlated with traces via @opentelemetry/instrumentation-pino | OpenTelemetry metrics → OTLP proto exporter (@opentelemetry/exporter-metrics-otlp-proto); endpoint set via BRAIN_OTEL_URL (.env.example:240); custom process-memory gauges in apps/brain/src/common/otel/metrics.ts | OpenTelemetry SDK Node (@opentelemetry/sdk-node, auto-instrumentations-node); OTLP HTTP exporters | Sentry NestJS (@sentry/nestjs, @sentry/profiling-node, @sentry/opentelemetry) | None: posthog-node is absent from apps/brain/package.json |
| brisket (Next.js) | Server: OTLP logs shipped via OpenTelemetry headers in BRISKET_OTEL_TRACES_LOGS_HEADERS (.env.example:63); browser: console + Sentry capture | OpenTelemetry metrics, exported to Axiom (BRISKET_OTEL_URL=https://api.axiom.co set in code) | @opentelemetry/sdk-node + auto-instrumentations; explicit trace/SpanStatusCode usage in apps/brisket/src | @sentry/nextjs (server + client) | posthog-js (useFeatureFlagEnabled, usePostHog across apps/brisket/src); posthog-node for server |
| fennec (React) | Browser console only (no log shipper in apps/fennec/package.json); SPA backed by nginx (apps/fennec/nginx.conf) | None observed | None observed | None: no @sentry/* in apps/fennec/package.json | None: no posthog-* in apps/fennec/package.json |
| flank (TS) | Custom logger apps/flank/server/engine/logger.ts (stdout JSON) | @opentelemetry/api only; no SDK exporters in apps/flank/package.json, so spans are never exported | API surface only; spans created via @opentelemetry/api are dropped (no SDK registered in apps/flank/package.json) | None: no @sentry/* in apps/flank/package.json | None: no posthog-* in apps/flank/package.json |
| round (Go ML) | zerolog (apps/round/cmd/app/main.go: setupLogger returns zerolog.Logger) | None observed in go.mod | None observed | None observed | None observed |
| strip (Go SSR) | zerolog (apps/strip/cmd/app/main.go uses log.Info().Msg(...)) | None observed | None observed | None observed | None observed |
| chuck (Strapi) | Strapi 5 default logger only; no logger override in apps/chuck/meat/config/ | None: no metrics deps in apps/chuck/meat/package.json | None: no OTel deps in apps/chuck/meat/package.json | None: no @sentry/* in apps/chuck/meat/package.json | None: no posthog-* in apps/chuck/meat/package.json |
| shank (email) | n/a (build-time export only) | n/a | n/a | n/a | n/a |

Backends: OTLP traces, metrics, and logs are exported to Axiom. Endpoints and credentials are configured per service via env: SIRLOIN_AXIOM_TOKEN / _DATASET / _METRICS_DATASET (.env.example:58-60), BRAIN_OTEL_URL / BRAIN_OTEL_METRICS_HEADERS / BRAIN_OTEL_TRACES_LOGS_HEADERS (.env.example:240-242), BRISKET_OTEL_URL / BRISKET_OTEL_METRICS_HEADERS / BRISKET_OTEL_TRACES_LOGS_HEADERS (.env.example:61-63). Sentry is the error sink for sirloin, brain, and brisket. PostHog is the product-analytics and feature-flag sink for sirloin and brisket.
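As a rough illustration of how a *_HEADERS variable might be turned into exporter headers, the sketch below parses a comma-separated key=value string. It assumes these variables follow the same format as the standard OTEL_EXPORTER_OTLP_HEADERS variable; the function name, header names, and values are hypothetical and are not taken from the repo.

```typescript
// Sketch only: assumes the *_HEADERS variables hold comma-separated key=value
// pairs, as the standard OTEL_EXPORTER_OTLP_HEADERS variable does.
type HeaderMap = Record<string, string>;

function parseOtlpHeaders(raw: string | undefined): HeaderMap {
  const headers: HeaderMap = {};
  if (!raw) return headers;
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip empty or malformed entries
    const key = pair.slice(0, eq).trim();
    // Everything after the first "=" is the value, so values may contain "=".
    const value = pair.slice(eq + 1).trim();
    if (key) headers[key] = value;
  }
  return headers;
}

// Hypothetical value; real tokens and dataset names live in the environment.
const exampleHeaders = parseOtlpHeaders(
  "Authorization=Bearer xaat-example, X-Axiom-Dataset=brain-metrics",
);
```

Parsing once at startup and failing fast on an empty result is one way to catch a missing variable before spans are silently dropped.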

Log Levels and Conventions

Go services (sirloin, round, strip)

  • sirloin uses Go stdlib log/slog with structured key/value pairs, OTLP-exported.
  • round and strip use zerolog (apps/round/cmd/app/main.go, apps/strip/cmd/app/main.go). Levels: Debug, Info, Warn, Error, Fatal.
  • Log level is configurable via env (round: setupLogger(level string)).
  • Convention: log.Error().Err(err).Msg("failed to X") — error in err field, human message in Msg.

TS services (brain, brisket, flank, fennec)

  • brain uses pino via nestjs-pino/pino-http. Prod = JSON, dev = pino-pretty. Auto-instrumented by OpenTelemetry so each log line carries trace_id/span_id.
  • brisket uses Next.js + OpenTelemetry. Server logs ship via OTLP to Axiom (headers in BRISKET_OTEL_TRACES_LOGS_HEADERS, .env.example:63); browser uses console + Sentry capture.
  • Levels: trace, debug, info, warn, error, fatal (pino defaults). TODO(@pawel): production minimum log level per service (brain pino config).
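To make the level ordering concrete, here is a minimal sketch of how a minimum level filters lines, using pino's default numeric level values. The log-line shape is simplified (real pino output also carries fields such as pid and hostname), and the function names are illustrative rather than brain's actual configuration.

```typescript
// Pino's default numeric level values.
const PINO_LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type Level = keyof typeof PINO_LEVELS;

// A line is emitted only when its level is at or above the configured minimum.
function isLevelEnabled(minimum: Level, candidate: Level): boolean {
  return PINO_LEVELS[candidate] >= PINO_LEVELS[minimum];
}

// Simplified log-line shape with the trace correlation fields.
interface LogLine {
  level: number;
  time: number;
  msg: string;
  trace_id?: string;
  span_id?: string;
}

// Returns the JSON line, or undefined when the level is below the minimum.
function formatLine(
  minimum: Level,
  level: Level,
  msg: string,
  ctx: { traceId?: string; spanId?: string } = {},
  now: number = Date.now(),
): string | undefined {
  if (!isLevelEnabled(minimum, level)) return undefined;
  const line: LogLine = { level: PINO_LEVELS[level], time: now, msg };
  if (ctx.traceId) line.trace_id = ctx.traceId;
  if (ctx.spanId) line.span_id = ctx.spanId;
  return JSON.stringify(line);
}
```

With a production minimum of info, debug and trace lines are dropped before serialization, which is the trade-off the open TODO needs to settle per service.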

Metric Naming

TODO(@law): no documented convention found in code. Recommend OTel semantic conventions for HTTP/gRPC (http.server.request.duration, which replaced the older http.server.duration in the stable HTTP conventions, and rpc.server.duration) and domain prefixes for custom metrics (sirloin.billing.invoice.failed, brain.generation.queued). File a follow-up if you want this enforced.
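If the convention were enforced, a check along the following lines could run in CI or a unit test. This is a sketch of one possible rule (dot-separated, lowercase snake_case segments, with an allow-list of leading prefixes); the rule, the prefix list, and the function name are assumptions, since no convention is documented yet.

```typescript
// Hypothetical rule: <prefix>.<segment>[.<segment>...] where every segment is
// lowercase snake_case. A proposal sketch, not an enforced convention.
const SEGMENT = /^[a-z][a-z0-9]*(?:_[a-z0-9]+)*$/;

function isValidMetricName(metric: string, allowedPrefixes: readonly string[]): boolean {
  const segments = metric.split(".");
  if (segments.length < 2) return false; // need at least prefix + one segment
  const prefix = segments[0];
  if (prefix === undefined || !allowedPrefixes.includes(prefix)) return false;
  return segments.every((segment) => SEGMENT.test(segment));
}

// Assumed prefix allow-list: service names plus the runtime and semconv roots
// that appear on this page.
const PREFIXES = ["sirloin", "brain", "process", "v8", "http", "rpc"] as const;
```

For example, isValidMetricName("brain.file_io_limiter.oldest_queued_ms", PREFIXES) passes, while a capitalised or single-segment name is rejected.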

Brain exports these process-memory gauges through the same metrics datastream when BRAIN_OTEL_METRICS_HEADERS is configured:

  • process.memory.rss — resident set size for the Node process.
  • process.memory.external — native/external memory tracked by Node.
  • process.memory.array_buffers — ArrayBuffer and Buffer backing memory.
  • process.memory.heap_used — V8 heap used bytes.
  • process.memory.heap_total — total V8 heap allocated bytes.
  • process.memory.unaccounted — RSS not explained by V8 heap total plus Node-tracked external memory.
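The unaccounted gauge is plain arithmetic over the values Node reports. The sketch below derives it from a sample shaped like the result of process.memoryUsage(); clamping at zero and the helper names are assumptions of this sketch, not necessarily what apps/brain/src/common/otel/metrics.ts does.

```typescript
// Subset of the shape returned by Node's process.memoryUsage(), in bytes.
interface MemorySample {
  rss: number;
  heapTotal: number;
  heapUsed: number;
  external: number;
  arrayBuffers: number;
}

// RSS minus V8 heap total minus Node-tracked external memory.
// Clamping at zero is an assumption of this sketch.
function unaccountedBytes(sample: MemorySample): number {
  return Math.max(0, sample.rss - sample.heapTotal - sample.external);
}

// Maps one sample onto the gauge names listed above.
function memoryGauges(sample: MemorySample): Record<string, number> {
  return {
    "process.memory.rss": sample.rss,
    "process.memory.external": sample.external,
    "process.memory.array_buffers": sample.arrayBuffers,
    "process.memory.heap_used": sample.heapUsed,
    "process.memory.heap_total": sample.heapTotal,
    "process.memory.unaccounted": unaccountedBytes(sample),
  };
}

const MIB = 1024 * 1024;
// Made-up numbers: 512 - 200 - 40 = 272 MiB unaccounted.
const exampleSample: MemorySample = {
  rss: 512 * MIB,
  heapTotal: 200 * MIB,
  heapUsed: 150 * MIB,
  external: 40 * MIB,
  arrayBuffers: 10 * MIB,
};
```

arrayBuffers is not subtracted separately because Node already counts it inside external.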

Brain also exports portable V8 and Sharp memory diagnostics:

  • v8.heap.used_heap_size, v8.heap.total_heap_size, v8.heap.total_physical_size, v8.heap.total_available_size, v8.heap.heap_size_limit, v8.heap.malloced_memory, and v8.heap.peak_malloced_memory.
  • v8.heap_space.used_size, v8.heap_space.space_size, v8.heap_space.physical_size, and v8.heap_space.available_size, grouped by v8.space.name.
  • v8.code.code_and_metadata_size and v8.code.bytecode_and_metadata_size.
  • v8.cpp_heap.used_size and v8.cpp_heap.committed_size when Node exposes C++ heap statistics.
  • brain.sharp.queue, brain.sharp.process, brain.sharp.cache.memory, brain.sharp.cache.files, and brain.sharp.cache.items.

Brain also exports Node-specific runtime diagnostics:

  • brain.runtime.active_resources — active resources from process.getActiveResourcesInfo(), grouped by node.resource.type.

Brain also exports file I/O limiter gauges from apps/brain/src/common/otel/metrics.ts, backed by apps/brain/src/modules/application/storage/services/file-io-limiter.service.ts:

  • brain.file_io_limiter.active — active work items by limiter.name.
  • brain.file_io_limiter.queued — queued work items by limiter.name.
  • brain.file_io_limiter.limit — configured concurrency limit by limiter.name.
  • brain.file_io_limiter.oldest_queued_ms — oldest queued wait by limiter.name.

Current limiter names are remote_downloads, video_transforms, provider_output_downloads, provider_preuploads, and archive_creation.
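To show where the four limiter gauges could come from, here is a minimal concurrency limiter whose snapshot maps one-to-one onto active, queued, limit, and oldest_queued_ms. It is a sketch under assumptions: the class, its method names, and the injected clock are hypothetical, and the real file-io-limiter.service.ts may be structured differently.

```typescript
interface LimiterSnapshot {
  limiterName: string;
  active: number;
  queued: number;
  limit: number;
  oldestQueuedMs: number;
}

interface QueuedEntry {
  enqueuedAt: number;
  start: () => void;
}

class ConcurrencyLimiter {
  private readonly limiterName: string;
  private readonly limit: number;
  private readonly now: () => number;
  private active = 0;
  private readonly waiting: QueuedEntry[] = [];

  constructor(limiterName: string, limit: number, now: () => number = Date.now) {
    this.limiterName = limiterName;
    this.limit = limit;
    this.now = now;
  }

  // Runs the task immediately if a slot is free, otherwise queues it (FIFO).
  run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const release = (): void => {
        this.active -= 1;
        this.waiting.shift()?.start(); // hand the slot to the oldest waiter
      };
      const start = (): void => {
        this.active += 1;
        // Promise.resolve().then(task) also converts a synchronous throw
        // inside task into a rejection, so the slot is always released.
        Promise.resolve()
          .then(task)
          .then(
            (value) => { release(); resolve(value); },
            (error) => { release(); reject(error); },
          );
      };
      if (this.active < this.limit) start();
      else this.waiting.push({ enqueuedAt: this.now(), start });
    });
  }

  // Values a gauge callback could report, labelled by limiter.name.
  snapshot(): LimiterSnapshot {
    const oldest = this.waiting[0];
    return {
      limiterName: this.limiterName,
      active: this.active,
      queued: this.waiting.length,
      limit: this.limit,
      oldestQueuedMs: oldest ? this.now() - oldest.enqueuedAt : 0,
    };
  }
}
```

A sustained non-zero oldestQueuedMs alongside active equal to limit is the signal that a limiter is saturated rather than idle.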

Brain also exports mediaflows BullMQ job-state gauges from the generation module. Each gauge uses queue.name=mediaflows:

  • brain.queue.jobs.active — jobs claimed by Brain from BullMQ; downstream provider work may still be pending.
  • brain.queue.jobs.waiting — jobs waiting in BullMQ.
  • brain.queue.jobs.delayed — jobs delayed in BullMQ.
  • brain.queue.jobs.failed — jobs failed in BullMQ.
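The mapping from job counts to gauge points is mechanical, as the sketch below illustrates. It assumes the per-state counts have already been fetched from the queue (how the generation module collects them is not shown here), and the types and function name are hypothetical.

```typescript
type JobState = "active" | "waiting" | "delayed" | "failed";

interface GaugePoint {
  metric: string;
  value: number;
  attributes: { "queue.name": string };
}

const JOB_STATES: readonly JobState[] = ["active", "waiting", "delayed", "failed"];

// Turns per-state job counts into one gauge point per state.
// Missing states are reported as 0 so every gauge is always present.
function jobCountsToGauges(
  queueName: string,
  counts: Partial<Record<JobState, number>>,
): GaugePoint[] {
  return JOB_STATES.map((state) => ({
    metric: `brain.queue.jobs.${state}`,
    value: counts[state] ?? 0,
    attributes: { "queue.name": queueName },
  }));
}

// Made-up counts for illustration.
const examplePoints = jobCountsToGauges("mediaflows", { active: 3, waiting: 12 });
```

Reporting zero for absent states keeps charts continuous instead of showing gaps when a state is empty.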

For FOXY-202 native-memory leak checks, start with three Axiom chart groups in the brain metrics dataset:

  1. RSS shape: process.memory.rss, process.memory.unaccounted, process.memory.external, process.memory.array_buffers, and process.memory.heap_total.
  2. V8/native runtime: v8.heap.total_physical_size, v8.heap.malloced_memory, v8.heap.peak_malloced_memory, v8.cpp_heap.used_size, v8.code.code_and_metadata_size, and brain.runtime.active_resources.
  3. Workload pressure: generation.media.total, round.infer.total, brain.file_io_limiter.active, brain.file_io_limiter.queued, brain.sharp.queue, and brain.sharp.process.

If RSS and unaccounted memory grow while V8 heap, V8 malloced memory, and Sharp counters stay flat, the cause is unlikely to be ordinary JS heap growth. Look instead at the native allocator, provider SDKs, ffmpeg/gRPC, or platform-level RSS retention.
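The triage rule can be expressed as a small decision function over chart samples. The sketch below is illustrative only: the 20% growth threshold, the verdict labels, and the choice of first-versus-last comparison are all assumptions, and a real check would look at longer windows in Axiom.

```typescript
// Relative growth between the first and last sample of a series.
function relativeGrowth(samples: readonly number[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0] ?? 0;
  const last = samples[samples.length - 1] ?? 0;
  if (first === 0) return last > 0 ? Infinity : 0;
  return (last - first) / first;
}

type Verdict = "no-growth" | "native-or-platform" | "js-heap-or-workload";

interface TriageSeries {
  rss: number[];
  unaccounted: number[];
  heapTotal: number[];
  malloced: number[];
  sharpQueue: number[];
}

// Hypothetical threshold: a series "grows" if it rises by more than 20%.
function triageMemory(series: TriageSeries, threshold = 0.2): Verdict {
  const grows = (samples: readonly number[]): boolean => relativeGrowth(samples) > threshold;
  if (!grows(series.rss) && !grows(series.unaccounted)) return "no-growth";
  const runtimeFlat =
    !grows(series.heapTotal) && !grows(series.malloced) && !grows(series.sharpQueue);
  // RSS growing with flat V8 and Sharp counters points at native or platform memory.
  return runtimeFlat ? "native-or-platform" : "js-heap-or-workload";
}
```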

Trace Propagation

Trace IDs cross service boundaries via standard W3C traceparent headers.

Evidence (grep traceparent across apps/sirloin, apps/brain/src):

  • TS code reads/writes traceparent and tracestate (e.g., normalizeTraceContextHeaderValue(headers.traceparent)).
  • gRPC: otelgrpc interceptors on sirloin (Go side) auto-propagate; brain’s @opentelemetry/sdk-node + auto-instrumentations handle the TS side.
  • Correlation ID fallback: x-correlation-id is read across NestJS / Fastify layers (req.headers['x-correlation-id'] ?? uuidv4()) and surfaced to logs as correlationId / req.id.
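For reference, a version-00 traceparent value has four dash-separated lowercase hex fields: version, a 32-character trace ID, a 16-character parent (span) ID, and 2-character flags. The sketch below parses that format and mirrors the correlation-ID fallback; it is not the repo's normalizeTraceContextHeaderValue, and the function names and injected ID generator are hypothetical.

```typescript
interface TraceContext {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Strict match for the version-00 layout: version-traceId-parentId-flags.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | undefined {
  if (!header) return undefined;
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return undefined;
  const [, version, traceId, parentId, flags] = match;
  if (version === undefined || traceId === undefined || parentId === undefined || flags === undefined) {
    return undefined;
  }
  if (version === "ff") return undefined; // version ff is invalid per the spec
  // All-zero trace or parent IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return undefined;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Same shape as req.headers['x-correlation-id'] ?? uuidv4(), with the ID
// generator injected so the sketch needs no dependencies.
function correlationIdFor(
  headers: Record<string, string | undefined>,
  generate: () => string,
): string {
  return headers["x-correlation-id"] ?? generate();
}
```

Rejecting malformed headers and starting a fresh trace is generally safer than propagating a value downstream services cannot parse.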

So a request hitting brisket gets a traceparent that propagates through sirloin over gRPC to brain. The trace terminates at the sirloin/brain → round boundary: apps/round/go.mod declares no go.opentelemetry.io/* dependencies, and no OTel imports exist under apps/round/cmd or apps/round/internal.

Dashboards

TODO(@law): catalogue of Axiom dashboards, Sentry projects, PostHog dashboards (URLs and ownership). Suggested sections to fill:

  • Axiom: per-service traces dashboard, error-rate dashboard, slow-query view.
  • Sentry: one project per service (sirloin, brain, brisket).
  • PostHog: funnels for signup → first generation → paid conversion.

Alerts and SLOs

TODO(@law): no alerts or SLO definitions found in this repo. If alerts live in Axiom / Sentry / PostHog UI rather than as code, document at minimum:

  • Owning team / on-call.
  • Pager destination.
  • Threshold and rationale.

Debugging Playbook — “I see a 500 in production”

Order assumes you have nothing but a Sentry alert or a user report.

  1. Sentry — find the exception. Service → environment → time window. Grab the trace_id from the event tags (auto-attached for sirloin via getsentry/sentry-go + OTel, for brain via @sentry/opentelemetry).
  2. Axiom — pivot on trace_id. Search across services for that trace ID. You will see the full request span tree: brisket → sirloin (gRPC) → brain (gRPC) → round/external. Identify the failing span.
  3. Logs in Axiom. Filter logs by trace_id (sirloin slog → OTLP logs; brain pino → OTel logs). Correlate timestamps with the failing span. Fallback: filter by correlationId / x-correlation-id.
  4. PostHog — user impact. Look up the affected user / session by distinct_id to confirm severity and reproduction path.
  5. Reproduce locally. Capture the request payload from the trace, replay against a local stack (make dev-up-d). For billing or payment paths see the relevant ADRs in Decisions — many bugs in those flows are state-machine issues that need the right preconditions.

If the trace dead-ends at a service that currently has no OTel (round, strip), fall back to that service’s zerolog output via container/pod log search.
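The log pivot in step 3 amounts to "match on trace_id, and only if that finds nothing, match on correlationId". In practice this is a query in Axiom; the in-memory sketch below just illustrates the selection logic on already-fetched records. The record shape and function name are assumptions.

```typescript
interface LogRecord {
  service: string;
  time: number; // epoch milliseconds
  msg: string;
  trace_id?: string;
  correlationId?: string;
}

// Returns the records for one request in timestamp order. Prefers trace_id
// matches and falls back to correlationId only when no trace match exists.
function logsForRequest(
  records: readonly LogRecord[],
  traceId: string,
  correlationId?: string,
): LogRecord[] {
  const byTrace = records.filter((record) => record.trace_id === traceId);
  const matched =
    byTrace.length > 0 || correlationId === undefined
      ? byTrace
      : records.filter((record) => record.correlationId === correlationId);
  return [...matched].sort((a, b) => a.time - b.time);
}
```

Sorting by time lets the resulting lines be read side by side with the failing span's start and end timestamps.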

Sensitive Data in Logs

  • Never log raw Chargebee card data, Primer tokens, or full Clerk JWTs.
  • Email addresses currently appear in Sirloin.SubsAll.email and brain User.email. No global PII redaction layer exists today (see security-model.md → Logging hygiene for the partial sanitisers in apps/sirloin/internal/app/foxy360/server.go:526 and apps/sirloin/internal/app/services/media/listmediaexamples.go:53).
  • See the Security Model page for the authoritative redaction rules.
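Since no global redaction layer exists today, a log-time scrubber along these lines is one option to consider. This is a sketch, not the Security Model's authoritative rules: the regular expressions are heuristic (they will miss some formats and may over-match), and the function name and placeholder strings are hypothetical.

```typescript
// Heuristic patterns; neither is a complete validator.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// JWTs are three base64url segments; a JSON header encodes to a string
// starting with "eyJ".
const JWT_PATTERN = /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g;

// Replaces JWT-like tokens and email addresses in a log message.
// Tokens are scrubbed first so their contents are never partially matched.
function redactLogText(text: string): string {
  return text
    .replace(JWT_PATTERN, "[REDACTED_JWT]")
    .replace(EMAIL_PATTERN, "[REDACTED_EMAIL]");
}
```

String-level scrubbing is a last line of defence; not passing the sensitive field to the logger in the first place remains the more reliable control.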

Open follow-ups

  1. Production minimum log level per service (TODO(@pawel) for brain).
  2. Metric naming convention enforcement (TODO(@law)).
  3. Dashboard inventory — Axiom / Sentry / PostHog URLs (TODO(@law)).
  4. Alerts / SLOs / on-call ownership (TODO(@law)).
  5. PII masking policy in logs (TODO(@law)).