Observability

How we see what production is doing — logs, metrics, traces, and the playbook for chasing a 500 back to its root cause.

Stack Inventory

What is actually wired in, per service, with file evidence.

Service	Logs	Metrics	Traces	Errors	Product analytics
sirloin (Go)	`log/slog` (stdlib) shipped via OTLP logs (`go.opentelemetry.io/otel/log`, `otlploghttp` in `apps/sirloin/go.mod`)	OpenTelemetry metrics → OTLP HTTP (`go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp`)	OpenTelemetry traces → OTLP HTTP, with gRPC + HTTP middleware (`otelgrpc`, `otelhttp`); BUN auto-instrumented (`bunotel`, `otelsql`)	Sentry Go SDK (`github.com/getsentry/sentry-go v0.44.1`)	PostHog (`github.com/posthog/posthog-go v1.11.2`)
brain (NestJS)	`pino` + `nestjs-pino` + `pino-http` + `pino-pretty` (in `apps/brain/package.json`); auto-correlated with traces via `@opentelemetry/instrumentation-pino`	OpenTelemetry metrics → OTLP proto exporter (`@opentelemetry/exporter-metrics-otlp-proto`); endpoint set via `BRAIN_OTEL_URL` (`.env.example:240`); custom process-memory gauges in `apps/brain/src/common/otel/metrics.ts`	OpenTelemetry SDK Node (`@opentelemetry/sdk-node`, `auto-instrumentations-node`); OTLP HTTP exporters	Sentry NestJS (`@sentry/nestjs`, `@sentry/profiling-node`, `@sentry/opentelemetry`)	None — `posthog-node` is absent from `apps/brain/package.json`
brisket (Next.js)	Server: OTLP logs shipped via OpenTelemetry headers in `BRISKET_OTEL_TRACES_LOGS_HEADERS` (`.env.example:63`); browser: console + Sentry capture	OpenTelemetry metrics, exported to Axiom (`BRISKET_OTEL_URL=https://api.axiom.co` set in code)	`@opentelemetry/sdk-node` + auto-instrumentations; explicit `trace`/`SpanStatusCode` usage in `apps/brisket/src`	`@sentry/nextjs` (server + client)	`posthog-js` (`useFeatureFlagEnabled`, `usePostHog` across `apps/brisket/src`); `posthog-node` for server
fennec (React)	Browser `console` only (no log shipper in `apps/fennec/package.json`); SPA backed by nginx (`apps/fennec/nginx.conf`)	None observed	None observed	None — no `@sentry/*` in `apps/fennec/package.json`	None — no `posthog-*` in `apps/fennec/package.json`
flank (TS)	Custom logger `apps/flank/server/engine/logger.ts` (stdout JSON)	`@opentelemetry/api` only — no SDK exporters in `apps/flank/package.json`, so metrics are never exported	API surface only; spans created via `@opentelemetry/api` are dropped (no SDK registered in `apps/flank/package.json`)	None — no `@sentry/*` in `apps/flank/package.json`	None — no `posthog-*` in `apps/flank/package.json`
round (Go ML)	`zerolog` (`apps/round/cmd/app/main.go`: `setupLogger` returns `zerolog.Logger`)	None observed in `go.mod`	None observed	None observed	None observed
strip (Go SSR)	`zerolog` (`apps/strip/cmd/app/main.go` uses `log.Info().Msg(...)`)	None observed	None observed	None observed	None observed
chuck (Strapi)	Strapi 5 default logger only — no logger override in `apps/chuck/meat/config/`	None — no metrics deps in `apps/chuck/meat/package.json`	None — no OTel deps in `apps/chuck/meat/package.json`	None — no `@sentry/*` in `apps/chuck/meat/package.json`	None — no `posthog-*` in `apps/chuck/meat/package.json`
shank (email)	n/a — build-time export only	n/a	n/a	n/a	n/a

Backends: OTLP traces, metrics, and logs are exported to Axiom. Endpoints and credentials are configured per service via env: SIRLOIN_AXIOM_TOKEN / _DATASET / _METRICS_DATASET (.env.example:58-60), BRAIN_OTEL_URL / BRAIN_OTEL_METRICS_HEADERS / BRAIN_OTEL_TRACES_LOGS_HEADERS (.env.example:240-242), BRISKET_OTEL_URL / BRISKET_OTEL_METRICS_HEADERS / BRISKET_OTEL_TRACES_LOGS_HEADERS (.env.example:61-63). Sentry is the error sink for sirloin, brain, and brisket. PostHog is the product-analytics and feature-flag sink for sirloin and brisket.

Log Levels and Conventions

Go services (sirloin, round, strip)

sirloin uses Go stdlib log/slog with structured key/value pairs, OTLP-exported.
round and strip use zerolog (apps/round/cmd/app/main.go, apps/strip/cmd/app/main.go). Levels: Debug, Info, Warn, Error, Fatal.
Log level is configurable via env (round: setupLogger(level string)).
Convention: log.Error().Err(err).Msg("failed to X") — error in err field, human message in Msg.

TS services (brain, brisket, flank, fennec)

brain uses pino via nestjs-pino/pino-http. Prod = JSON, dev = pino-pretty. Auto-instrumented by OpenTelemetry so each log line carries trace_id/span_id.
brisket uses Next.js + OpenTelemetry. Server logs ship via OTLP to Axiom (headers in BRISKET_OTEL_TRACES_LOGS_HEADERS, .env.example:63); browser uses console + Sentry capture.
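flank logs JSON to stdout through a custom logger (apps/flank/server/engine/logger.ts).
fennec logs to the browser console only (no log shipper in apps/fennec/package.json).
Levels: trace, debug, info, warn, error, fatal (pino defaults). TODO(@pawel): production minimum log level per service (brain pino config).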

Metric Naming

TODO(@law): no documented convention found in code. Recommend OTel semantic conventions for HTTP/gRPC (http.server.request.duration, named http.server.duration before the HTTP conventions were stabilized; rpc.server.duration) and domain prefixes for custom metrics (sirloin.billing.invoice.failed, brain.generation.queued). File a follow-up if you want this enforced.
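If the domain-prefix recommendation were adopted, a check along these lines could run in CI or a unit test. This is a minimal sketch, not existing repo code: the function name, the list of services, and the three-segment minimum are all assumptions, and runtime metrics such as process.* and v8.* are deliberately outside its scope.

```typescript
// Hypothetical lint helper for the suggested "<service>.<domain>.<metric>" shape.
// Assumption: only these services emit custom metrics.
const KNOWN_SERVICES: string[] = ["sirloin", "brain", "brisket"];

// One dotted segment: lowercase, starts with a letter, digits/underscores allowed.
const SEGMENT = /^[a-z][a-z0-9_]*$/;

function isValidCustomMetricName(name: string): boolean {
  const segments = name.split(".");
  // Require a service prefix plus at least a domain and a metric segment.
  if (segments.length < 3) return false;
  if (KNOWN_SERVICES.indexOf(segments[0]) < 0) return false;
  return segments.every((s) => SEGMENT.test(s));
}
```

Both example names from the recommendation pass this check, as do the existing brain.file_io_limiter.* and brain.queue.jobs.* gauges.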

Brain exports these process-memory gauges through the same metrics datastream when BRAIN_OTEL_METRICS_HEADERS is configured:

process.memory.rss — resident set size for the Node process.
process.memory.external — native/external memory tracked by Node.
process.memory.array_buffers — ArrayBuffer and Buffer backing memory.
process.memory.heap_used — V8 heap used bytes.
process.memory.heap_total — total V8 heap allocated bytes.
process.memory.unaccounted — RSS not explained by V8 heap total plus Node-tracked external memory.
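The unaccounted gauge is a derived value, so it helps to see the arithmetic. The sketch below follows the definition above (RSS minus V8 heap total minus Node-tracked external memory); the field names mirror Node's process.memoryUsage() result, while the function name and the clamp at zero are assumptions and may not match apps/brain/src/common/otel/metrics.ts.

```typescript
// Shape of the values returned by Node's process.memoryUsage(), in bytes.
interface MemorySample {
  rss: number;
  heapTotal: number;
  heapUsed: number;
  external: number;
  arrayBuffers: number; // already counted inside `external` by Node
}

// RSS that neither the V8 heap nor Node-tracked external memory explains.
// Assumption: negative results are clamped to zero.
function unaccountedBytes(m: MemorySample): number {
  return Math.max(0, m.rss - m.heapTotal - m.external);
}
```

Because Node includes arrayBuffers in external, it is not subtracted a second time.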

Brain also exports portable V8 and Sharp memory diagnostics:

v8.heap.used_heap_size, v8.heap.total_heap_size, v8.heap.total_physical_size, v8.heap.total_available_size, v8.heap.heap_size_limit, v8.heap.malloced_memory, and v8.heap.peak_malloced_memory.
v8.heap_space.used_size, v8.heap_space.space_size, v8.heap_space.physical_size, and v8.heap_space.available_size, grouped by v8.space.name.
v8.code.code_and_metadata_size and v8.code.bytecode_and_metadata_size.
v8.cpp_heap.used_size and v8.cpp_heap.committed_size when Node exposes C++ heap statistics.
brain.sharp.queue, brain.sharp.process, brain.sharp.cache.memory, brain.sharp.cache.files, and brain.sharp.cache.items.

Brain also exports Node-specific runtime diagnostics:

brain.runtime.active_resources — active resources from process.getActiveResourcesInfo(), grouped by node.resource.type.

Brain also exports file I/O limiter gauges from apps/brain/src/common/otel/metrics.ts, backed by apps/brain/src/modules/application/storage/services/file-io-limiter.service.ts:

brain.file_io_limiter.active — active work items by limiter.name.
brain.file_io_limiter.queued — queued work items by limiter.name.
brain.file_io_limiter.limit — configured concurrency limit by limiter.name.
brain.file_io_limiter.oldest_queued_ms — oldest queued wait by limiter.name.

Current limiter names are remote_downloads, video_transforms, provider_output_downloads, provider_preuploads, and archive_creation.
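To make the four limiter gauges concrete, here is a simplified state model of a concurrency limiter. It is an illustration only: LimiterModel and its methods are hypothetical, timestamps are passed in explicitly, and the real file-io-limiter.service.ts may behave differently.

```typescript
interface LimiterGauges {
  active: number;         // brain.file_io_limiter.active
  queued: number;         // brain.file_io_limiter.queued
  limit: number;          // brain.file_io_limiter.limit
  oldestQueuedMs: number; // brain.file_io_limiter.oldest_queued_ms
}

class LimiterModel {
  readonly name: string;
  private readonly limit: number;
  private active = 0;
  private queuedAt: number[] = []; // enqueue timestamps, oldest first

  constructor(name: string, limit: number) {
    this.name = name;
    this.limit = limit;
  }

  // A work item arrives: take a free slot, or wait in the queue.
  submit(nowMs: number): void {
    if (this.active < this.limit) this.active += 1;
    else this.queuedAt.push(nowMs);
  }

  // A running item finishes: free its slot and promote the oldest waiter.
  complete(): void {
    if (this.active === 0) return;
    this.active -= 1;
    if (this.queuedAt.length > 0) {
      this.queuedAt.shift();
      this.active += 1;
    }
  }

  // Snapshot of the four gauge values at time nowMs.
  gauges(nowMs: number): LimiterGauges {
    const oldest = this.queuedAt.length > 0 ? this.queuedAt[0] : undefined;
    return {
      active: this.active,
      queued: this.queuedAt.length,
      limit: this.limit,
      oldestQueuedMs: oldest === undefined ? 0 : nowMs - oldest,
    };
  }
}
```

In this model, active pinned at limit with a rising oldestQueuedMs means work is arriving faster than the limiter drains it.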

Brain also exports mediaflows BullMQ job-state gauges from the generation module. Each gauge uses queue.name=mediaflows:

brain.queue.jobs.active — jobs claimed by Brain from BullMQ; downstream provider work may still be pending.
brain.queue.jobs.waiting — jobs waiting in BullMQ.
brain.queue.jobs.delayed — jobs delayed in BullMQ.
brain.queue.jobs.failed — jobs failed in BullMQ.

For FOXY-202 native-memory leak checks, start with three Axiom chart groups in the brain metrics dataset:

RSS shape: process.memory.rss, process.memory.unaccounted, process.memory.external, process.memory.array_buffers, and process.memory.heap_total.
V8/native runtime: v8.heap.total_physical_size, v8.heap.malloced_memory, v8.heap.peak_malloced_memory, v8.cpp_heap.used_size, v8.code.code_and_metadata_size, and brain.runtime.active_resources.
Workload pressure: generation.media.total, round.infer.total, brain.file_io_limiter.active, brain.file_io_limiter.queued, brain.sharp.queue, and brain.sharp.process.

If RSS or unaccounted memory grows while V8 heap, V8 malloced memory, and Sharp counters stay flat, ordinary JS heap growth is an unlikely cause. Look instead at native allocator, provider SDK, ffmpeg/gRPC, or platform-level RSS retention.
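The same reading can be written as a rough predicate over two snapshots of these gauges taken some time apart. Everything here is a sketch under stated assumptions: the type, the function names, and the 20% growth and 5% flatness thresholds are illustrative, not tuned values or existing tooling.

```typescript
// Hypothetical snapshot of a few of the gauges above (bytes or counts).
interface LeakSample {
  rss: number;          // process.memory.rss
  unaccounted: number;  // process.memory.unaccounted
  heapTotal: number;    // process.memory.heap_total
  v8Malloced: number;   // v8.heap.malloced_memory
  sharpQueue: number;   // brain.sharp.queue
  sharpProcess: number; // brain.sharp.process
}

// Assumed thresholds, for illustration only.
const GROWTH_RATIO = 0.2;
const FLAT_RATIO = 0.05;

function grew(before: number, after: number): boolean {
  return before > 0 && (after - before) / before >= GROWTH_RATIO;
}

function stayedFlat(before: number, after: number): boolean {
  return Math.abs(after - before) <= FLAT_RATIO * Math.max(before, 1);
}

// True when RSS or unaccounted memory grew while JS heap, V8 malloced
// memory, and Sharp counters stayed flat.
function pointsToNativeRetention(a: LeakSample, b: LeakSample): boolean {
  const rssGrowth = grew(a.rss, b.rss) || grew(a.unaccounted, b.unaccounted);
  const jsAndSharpFlat =
    stayedFlat(a.heapTotal, b.heapTotal) &&
    stayedFlat(a.v8Malloced, b.v8Malloced) &&
    stayedFlat(a.sharpQueue, b.sharpQueue) &&
    stayedFlat(a.sharpProcess, b.sharpProcess);
  return rssGrowth && jsAndSharpFlat;
}
```

A true result narrows the search toward native memory; it does not by itself identify which native component is retaining it.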

Trace Propagation

Trace IDs cross service boundaries via standard W3C traceparent headers.

Evidence (grep traceparent across apps/sirloin, apps/brain/src):

TS code reads/writes traceparent and tracestate (e.g., normalizeTraceContextHeaderValue(headers.traceparent)).
gRPC: otelgrpc interceptors on sirloin (Go side) auto-propagate; brain’s @opentelemetry/sdk-node + auto-instrumentations handle the TS side.
Correlation ID fallback: x-correlation-id is read across NestJS / Fastify layers (req.headers['x-correlation-id'] ?? uuidv4()) and surfaced to logs as correlationId / req.id.

So a request hitting brisket gets a traceparent, which then propagates through sirloin (gRPC) to brain. The trace terminates at the sirloin/brain → round boundary: apps/round/go.mod declares no go.opentelemetry.io/* dependencies, and no OTel imports exist under apps/round/cmd or apps/round/internal.
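For reference when reading headers by hand, the sketch below parses a version-00 W3C traceparent value and applies the x-correlation-id fallback described above. parseTraceparent and resolveCorrelationId are hypothetical names, not the repo's normalizeTraceContextHeaderValue, and the ID generator is injected so the example stays dependency-free.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Strict version-00 layout: version-traceid-parentid-flags.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(value: string | undefined): TraceParent | null {
  if (!value) return null;
  const m = TRACEPARENT_RE.exec(value.trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version ff and all-zero IDs are invalid per the W3C spec.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Reuse the inbound correlation ID if present, otherwise mint a new one.
function resolveCorrelationId(
  headers: Record<string, string | undefined>,
  newId: () => string,
): string {
  return headers["x-correlation-id"] ?? newId();
}
```

The traceId field is the value to search for in Axiom; the correlation ID matters only when no valid traceparent reached the service.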

Dashboards

TODO(@law): catalogue of Axiom dashboards, Sentry projects, PostHog dashboards (URLs and ownership). Suggested sections to fill:

Axiom: per-service traces dashboard, error-rate dashboard, slow-query view.
Sentry: one project per service (sirloin, brain, brisket).
PostHog: funnels for signup → first generation → paid conversion.

Alerts and SLOs

TODO(@law): no alerts or SLO definitions found in this repo. If alerts live in Axiom / Sentry / PostHog UI rather than as code, document at minimum:

Owning team / on-call.
Pager destination.
Threshold and rationale.

Debugging Playbook — “I see a 500 in production”

The order below assumes you start with nothing but a Sentry alert or a user report.

Sentry — find the exception. Service → environment → time window. Grab the trace_id from the event tags (auto-attached for sirloin via getsentry/sentry-go + OTel, for brain via @sentry/opentelemetry).
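Axiom — pivot on trace_id. Search across services for that trace ID. You will see the request span tree brisket → sirloin (gRPC) → brain (gRPC). Calls into round or external providers show up at most as client-side spans recorded by the caller, because round exports no spans of its own. Identify the failing span.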
Logs in Axiom. Filter logs by trace_id (sirloin slog → OTLP logs; brain pino → OTel logs). Correlate timestamps with the failing span. Fallback: filter by correlationId / x-correlation-id.
PostHog — user impact. Look up the affected user / session by distinct_id to confirm severity and reproduction path.
Reproduce locally. Capture the request payload from the trace, replay against a local stack (make dev-up-d). For billing or payment paths see the relevant ADRs in Decisions — many bugs in those flows are state-machine issues that need the right preconditions.

If the trace dead-ends at a service with no OTel (currently round and strip), fall back to that service's zerolog output by searching its container/pod logs.

Sensitive Data in Logs

Never log raw Chargebee card data, Primer tokens, or full Clerk JWTs.
Email addresses currently appear in Sirloin.SubsAll.email and brain User.email. No global PII redaction layer exists today (see security-model.md → Logging hygiene for the partial sanitisers in apps/sirloin/internal/app/foxy360/server.go:526 and apps/sirloin/internal/app/services/media/listmediaexamples.go:53).
See the Security Model page for the authoritative redaction rules.
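Since no global redaction layer exists today, the following is only a sketch of what a shallow log-field sanitiser could look like. The function names and placeholder strings are hypothetical, it covers only email addresses and JWT-shaped strings, it does not recurse into nested objects, and it does not replace the authoritative rules in the Security Model page.

```typescript
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// Three base64url segments, the first starting with "eyJ" (an encoded "{").
const JWT_RE = /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g;

// Keep the first character and the domain, hide the rest of the local part.
function maskEmail(email: string): string {
  const at = email.indexOf("@");
  return email.charAt(0) + "***" + email.slice(at);
}

function redactString(s: string): string {
  return s.replace(JWT_RE, "[REDACTED_JWT]").replace(EMAIL_RE, maskEmail);
}

// Shallow pass over structured log fields; non-string values are left untouched.
function redactFields(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    const v = fields[key];
    out[key] = typeof v === "string" ? redactString(v) : v;
  }
  return out;
}
```

Pattern-based masking like this is best-effort; card data and provider tokens are safer excluded at the call site than caught after the fact.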

Open follow-ups

Production minimum log level per service (TODO(@pawel) for brain).
Metric naming convention enforcement (TODO(@law)).
Dashboard inventory — Axiom / Sentry / PostHog URLs (TODO(@law)).
Alerts / SLOs / on-call ownership (TODO(@law)).
PII masking policy in logs (TODO(@law)).