Observability
How we see what production is doing — logs, metrics, traces, and the playbook for chasing a 500 back to its root cause.
Stack Inventory
What is actually wired in, per service, with file evidence.
| Service | Logs | Metrics | Traces | Errors | Product analytics |
|---|---|---|---|---|---|
| sirloin (Go) | log/slog (stdlib) shipped via OTLP logs (go.opentelemetry.io/otel/log, otlploghttp in apps/sirloin/go.mod) | OpenTelemetry metrics → OTLP HTTP (go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp) | OpenTelemetry traces → OTLP HTTP, with gRPC + HTTP middleware (otelgrpc, otelhttp); BUN auto-instrumented (bunotel, otelsql) | Sentry Go SDK (github.com/getsentry/sentry-go v0.44.1) | PostHog (github.com/posthog/posthog-go v1.11.2) |
| brain (NestJS) | pino + nestjs-pino + pino-http + pino-pretty (in apps/brain/package.json); auto-correlated with traces via @opentelemetry/instrumentation-pino | OpenTelemetry metrics → OTLP proto exporter (@opentelemetry/exporter-metrics-otlp-proto); endpoint set via BRAIN_OTEL_URL (.env.example:240); custom process-memory gauges in apps/brain/src/common/otel/metrics.ts | OpenTelemetry SDK Node (@opentelemetry/sdk-node, auto-instrumentations-node); OTLP HTTP exporters | Sentry NestJS (@sentry/nestjs, @sentry/profiling-node, @sentry/opentelemetry) | None — posthog-node is absent from apps/brain/package.json |
| brisket (Next.js) | Server: OTLP logs shipped via OpenTelemetry headers in BRISKET_OTEL_TRACES_LOGS_HEADERS (.env.example:63); browser: console + Sentry capture | OpenTelemetry metrics, exported to Axiom (BRISKET_OTEL_URL=https://api.axiom.co set in code) | @opentelemetry/sdk-node + auto-instrumentations; explicit trace/SpanStatusCode usage in apps/brisket/src | @sentry/nextjs (server + client) | posthog-js (useFeatureFlagEnabled, usePostHog across apps/brisket/src); posthog-node for server |
| fennec (React) | Browser console only (no log shipper in apps/fennec/package.json); SPA backed by nginx (apps/fennec/nginx.conf) | None observed | None observed | None — no @sentry/* in apps/fennec/package.json | None — no posthog-* in apps/fennec/package.json |
| flank (TS) | Custom logger apps/flank/server/engine/logger.ts (stdout JSON) | @opentelemetry/api only (no SDK or exporters in apps/flank/package.json), so nothing recorded through the API is exported | API surface only; spans created via @opentelemetry/api are dropped (no SDK registered in apps/flank/package.json) | None (no @sentry/* in apps/flank/package.json) | None (no posthog-* in apps/flank/package.json) |
| round (Go ML) | zerolog (apps/round/cmd/app/main.go: setupLogger returns zerolog.Logger) | None observed in go.mod | None observed | None observed | None observed |
| strip (Go SSR) | zerolog (apps/strip/cmd/app/main.go uses log.Info().Msg(...)) | None observed | None observed | None observed | None observed |
| chuck (Strapi) | Strapi 5 default logger only — no logger override in apps/chuck/meat/config/ | None — no metrics deps in apps/chuck/meat/package.json | None — no OTel deps in apps/chuck/meat/package.json | None — no @sentry/* in apps/chuck/meat/package.json | None — no posthog-* in apps/chuck/meat/package.json |
| shank (email) | n/a — build-time export only | n/a | n/a | n/a | n/a |
Backends: OTLP traces/metrics/logs are exported to Axiom. Endpoints,
credentials, and datasets are configured per service via env:
SIRLOIN_AXIOM_TOKEN / _DATASET / _METRICS_DATASET (.env.example:58-60),
BRAIN_OTEL_URL / BRAIN_OTEL_METRICS_HEADERS / BRAIN_OTEL_TRACES_LOGS_HEADERS
(.env.example:240-242), BRISKET_OTEL_URL / BRISKET_OTEL_METRICS_HEADERS
/ BRISKET_OTEL_TRACES_LOGS_HEADERS (.env.example:61-63). Sentry is the
error sink for sirloin, brain, and brisket. PostHog is the product-analytics
and feature-flag sink for sirloin and brisket.
Log Levels and Conventions
Go services (sirloin, round, strip)
- sirloin uses Go stdlib log/slog with structured key/value pairs, OTLP-exported.
- round and strip use zerolog (apps/round/cmd/app/main.go, apps/strip/cmd/app/main.go). Levels: Debug, Info, Warn, Error, Fatal.
- Log level is configurable via env (round: setupLogger(level string)).
- Convention: log.Error().Err(err).Msg("failed to X"), with the error in the err field and the human message in Msg.
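The same convention can be sketched for a log/slog service such as sirloin. This is a minimal illustration rather than sirloin's actual setup: the LOG_LEVEL variable name, the newLogger and parseLevel helpers, and the invoice_id field are assumptions made for the example, and OTLP export is left out.

```go
package main

import (
	"errors"
	"io"
	"log/slog"
	"os"
)

// parseLevel maps a level string such as "debug" or "warn" to a slog.Level.
// Empty or unrecognized values fall back to Info.
func parseLevel(s string) slog.Level {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(s)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

// newLogger returns a JSON logger that drops records below the given level.
func newLogger(w io.Writer, level slog.Level) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: level}))
}

func main() {
	// LOG_LEVEL is a hypothetical variable name, not taken from .env.example.
	logger := newLogger(os.Stdout, parseLevel(os.Getenv("LOG_LEVEL")))

	err := errors.New("connection refused")
	// The error goes in the "err" field; the human-readable text is the message.
	logger.Error("failed to load invoice", "err", err, "invoice_id", "inv_123")
	// Suppressed unless the level is set to debug.
	logger.Debug("cache miss", "key", "inv_123")
}
```

Keeping the error in its own field means log searches can filter on the error text without parsing the message string.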
TS services (brain, brisket, flank, fennec)
- brain uses pino via nestjs-pino / pino-http. Prod = JSON, dev = pino-pretty. Auto-instrumented by OpenTelemetry so each log line carries trace_id / span_id.
- brisket uses Next.js + OpenTelemetry. Server logs ship via OTLP to Axiom (headers in BRISKET_OTEL_TRACES_LOGS_HEADERS, .env.example:63); browser uses console + Sentry capture.
- Levels: trace, debug, info, warn, error, fatal (pino defaults).

TODO(@pawel): production minimum log level per service (brain pino config).
Metric Naming
TODO(@law): no documented convention found in code. Recommend OTel
semantic conventions for HTTP/gRPC (http.server.duration,
rpc.server.duration) and domain prefixes for custom metrics
(sirloin.billing.invoice.failed, brain.generation.queued). File a
follow-up if you want this enforced.
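If the convention is ever enforced, a check along these lines could run in CI or in a unit test. It is only a sketch of the recommendation above: the allowedPrefixes list and the checkMetricName helper are hypothetical, and nothing like this exists in the repo today.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// metricNameRe accepts two or more dot-separated segments made of lowercase
// letters, digits, and underscores, each segment starting with a letter.
var metricNameRe = regexp.MustCompile(`^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$`)

// allowedPrefixes is an assumed allow-list: service prefixes for custom
// metrics plus the OTel semantic-convention namespaces named above.
var allowedPrefixes = []string{"sirloin.", "brain.", "brisket.", "http.", "rpc."}

// checkMetricName returns nil when name follows the sketched convention.
func checkMetricName(name string) error {
	if !metricNameRe.MatchString(name) {
		return fmt.Errorf("metric %q: want lowercase dot-separated segments", name)
	}
	for _, p := range allowedPrefixes {
		if strings.HasPrefix(name, p) {
			return nil
		}
	}
	return fmt.Errorf("metric %q: prefix not in %v", name, allowedPrefixes)
}

func main() {
	for _, n := range []string{
		"sirloin.billing.invoice.failed",
		"http.server.duration",
		"Brain.Generation.Queued",
		"billing.invoice.failed",
	} {
		fmt.Println(n, "->", checkMetricName(n))
	}
}
```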
Brain exports these process-memory gauges through the same metrics datastream
when BRAIN_OTEL_METRICS_HEADERS is configured:
- process.memory.rss: resident set size for the Node process.
- process.memory.external: native/external memory tracked by Node.
- process.memory.array_buffers: ArrayBuffer and Buffer backing memory.
- process.memory.heap_used: V8 heap used bytes.
- process.memory.heap_total: total V8 heap allocated bytes.
- process.memory.unaccounted: RSS not explained by V8 heap total plus Node-tracked external memory.
Brain also exports portable V8 and Sharp memory diagnostics:
- v8.heap.used_heap_size, v8.heap.total_heap_size, v8.heap.total_physical_size, v8.heap.total_available_size, v8.heap.heap_size_limit, v8.heap.malloced_memory, and v8.heap.peak_malloced_memory.
- v8.heap_space.used_size, v8.heap_space.space_size, v8.heap_space.physical_size, and v8.heap_space.available_size, grouped by v8.space.name.
- v8.code.code_and_metadata_size and v8.code.bytecode_and_metadata_size.
- v8.cpp_heap.used_size and v8.cpp_heap.committed_size when Node exposes C++ heap statistics.
- brain.sharp.queue, brain.sharp.process, brain.sharp.cache.memory, brain.sharp.cache.files, and brain.sharp.cache.items.
Brain also exports Node-specific runtime diagnostics:
- brain.runtime.active_resources: active resources from process.getActiveResourcesInfo(), grouped by node.resource.type.
Brain also exports file I/O limiter gauges from
apps/brain/src/common/otel/metrics.ts, backed by
apps/brain/src/modules/application/storage/services/file-io-limiter.service.ts:
- brain.file_io_limiter.active: active work items by limiter.name.
- brain.file_io_limiter.queued: queued work items by limiter.name.
- brain.file_io_limiter.limit: configured concurrency limit by limiter.name.
- brain.file_io_limiter.oldest_queued_ms: oldest queued wait by limiter.name.
Current limiter names are remote_downloads, video_transforms,
provider_output_downloads, provider_preuploads, and archive_creation.
Brain also exports mediaflows BullMQ job-state gauges from the generation
module. Each gauge uses queue.name=mediaflows:
- brain.queue.jobs.active: jobs claimed by Brain from BullMQ; downstream provider work may still be pending.
- brain.queue.jobs.waiting: jobs waiting in BullMQ.
- brain.queue.jobs.delayed: jobs delayed in BullMQ.
- brain.queue.jobs.failed: jobs failed in BullMQ.
For FOXY-202 native-memory leak checks, start with three Axiom chart groups in the brain metrics dataset:
- RSS shape: process.memory.rss, process.memory.unaccounted, process.memory.external, process.memory.array_buffers, and process.memory.heap_total.
- V8/native runtime: v8.heap.total_physical_size, v8.heap.malloced_memory, v8.heap.peak_malloced_memory, v8.cpp_heap.used_size, v8.code.code_and_metadata_size, and brain.runtime.active_resources.
- Workload pressure: generation.media.total, round.infer.total, brain.file_io_limiter.active, brain.file_io_limiter.queued, brain.sharp.queue, and brain.sharp.process.
RSS/unaccounted growth while V8 heap, V8 malloced memory, and Sharp counters are flat points away from ordinary JS heap growth and toward native allocator, provider SDK, ffmpeg/gRPC, or platform-level RSS retention.
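The rule of thumb above can be written down as a small check over two readings of the same gauges. This is an illustrative sketch only: the Sample type, the 10% tolerance, and the use of brain.sharp.cache.memory as the representative Sharp counter are assumptions, and the exact formula used for process.memory.unaccounted in apps/brain/src/common/otel/metrics.ts may differ (for example by clamping at zero).

```go
package main

import "fmt"

const MiB int64 = 1 << 20

// Sample holds one point-in-time reading, in bytes.
type Sample struct {
	RSS        int64 // process.memory.rss
	HeapTotal  int64 // process.memory.heap_total
	External   int64 // process.memory.external
	Malloced   int64 // v8.heap.malloced_memory
	SharpCache int64 // brain.sharp.cache.memory (assumed representative)
}

// Unaccounted follows the gauge description: RSS minus V8 heap total minus
// Node-tracked external memory.
func (s Sample) Unaccounted() int64 { return s.RSS - s.HeapTotal - s.External }

// grew reports whether a value rose by more than tol (a fraction of before).
func grew(before, after int64, tol float64) bool {
	return float64(after-before) > tol*float64(before)
}

// PointsToNative is true when RSS or unaccounted memory grew while the V8
// and Sharp readings stayed within tolerance.
func PointsToNative(a, b Sample, tol float64) bool {
	rssUp := grew(a.RSS, b.RSS, tol) || grew(a.Unaccounted(), b.Unaccounted(), tol)
	jsFlat := !grew(a.HeapTotal, b.HeapTotal, tol) &&
		!grew(a.Malloced, b.Malloced, tol) &&
		!grew(a.SharpCache, b.SharpCache, tol)
	return rssUp && jsFlat
}

func main() {
	before := Sample{RSS: 1000 * MiB, HeapTotal: 400 * MiB, External: 100 * MiB, Malloced: 10 * MiB, SharpCache: 50 * MiB}
	after := before
	after.RSS = 1600 * MiB // RSS climbs while everything else is unchanged

	fmt.Println("unaccounted before (MiB):", before.Unaccounted()/MiB)
	fmt.Println("unaccounted after (MiB):", after.Unaccounted()/MiB)
	fmt.Println("points to native retention:", PointsToNative(before, after, 0.10))
}
```

A true result is only a hint about where to look next; it does not distinguish between the native candidates listed above.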
Trace Propagation
Trace IDs cross service boundaries via standard W3C traceparent headers.
Evidence (grep traceparent across apps/sirloin, apps/brain/src):
- TS code reads/writes traceparent and tracestate (e.g., normalizeTraceContextHeaderValue(headers.traceparent)).
- gRPC: otelgrpc interceptors on sirloin (Go side) auto-propagate; brain’s @opentelemetry/sdk-node + auto-instrumentations handle the TS side.
- Correlation ID fallback: x-correlation-id is read across NestJS / Fastify layers (req.headers['x-correlation-id'] ?? uuidv4()) and surfaced to logs as correlationId / req.id.
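For reference, a W3C traceparent value has the shape version-traceid-parentid-flags, where the trace ID is 32 hex digits and the parent ID is 16. The sketch below shows how the trace ID can be pulled out by hand, plus a correlation-ID fallback in the same spirit as the TS expression above. Both helpers are hypothetical and are not the repo's normalizeTraceContextHeaderValue; the generated ID is 16 random bytes in hex, not a formatted UUID.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the trace ID from a traceparent header
// value. It is lenient about hex case; the spec requires lowercase.
func traceIDFromTraceparent(v string) (string, bool) {
	parts := strings.Split(strings.TrimSpace(v), "-")
	if len(parts) < 4 || len(parts[0]) != 2 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", false
	}
	if _, err := hex.DecodeString(parts[1]); err != nil {
		return "", false
	}
	if parts[1] == strings.Repeat("0", 32) {
		return "", false // an all-zero trace ID is invalid
	}
	return parts[1], true
}

// correlationIDOrNew returns the incoming x-correlation-id value when
// present, otherwise a freshly generated random ID.
func correlationIDOrNew(headerValue string) string {
	if headerValue != "" {
		return headerValue
	}
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

func main() {
	id, ok := traceIDFromTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(id, ok)
	fmt.Println(correlationIDOrNew("req-42"))
	fmt.Println(len(correlationIDOrNew("")))
}
```

In the instrumented services the OTel SDK does this parsing already; the manual version is mainly useful when reading raw headers from a request dump.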
So a request hitting brisket gets a traceparent, which then propagates through
sirloin gRPC to brain. The trace terminates at the sirloin/brain → round boundary:
apps/round/go.mod declares no go.opentelemetry.io/* dependencies, and no
OTel imports exist under apps/round/cmd or apps/round/internal.
Dashboards
TODO(@law): catalogue of Axiom dashboards, Sentry projects, PostHog dashboards (URLs and ownership). Suggested sections to fill:
- Axiom: per-service traces dashboard, error-rate dashboard, slow-query view.
- Sentry: one project per service (sirloin, brain, brisket).
- PostHog: funnels for signup → first generation → paid conversion.
Alerts and SLOs
TODO(@law): no alerts or SLO definitions found in this repo. If alerts live in Axiom / Sentry / PostHog UI rather than as code, document at minimum:
- Owning team / on-call.
- Pager destination.
- Threshold and rationale.
Debugging Playbook — “I see a 500 in production”
Order assumes you have nothing but a Sentry alert or a user report.
- Sentry: find the exception. Service → environment → time window. Grab the trace_id from the event tags (auto-attached for sirloin via getsentry/sentry-go + OTel, for brain via @sentry/opentelemetry).
- Axiom: pivot on trace_id. Search across services for that trace ID. You will see the full request span tree: brisket → sirloin (gRPC) → brain (gRPC) → round/external. Identify the failing span.
- Logs in Axiom. Filter logs by trace_id (sirloin slog → OTLP logs; brain pino → OTel logs). Correlate timestamps with the failing span. Fallback: filter by correlationId / x-correlation-id.
- PostHog: user impact. Look up the affected user / session by distinct_id to confirm severity and reproduction path.
- Reproduce locally. Capture the request payload from the trace, replay against a local stack (make dev-up-d). For billing or payment paths see the relevant ADRs in Decisions; many bugs in those flows are state-machine issues that need the right preconditions.
If the trace dead-ends at a service with no OTel (currently round and strip),
fall back to that service’s zerolog output by container/pod log search.
Sensitive Data in Logs
- Never log raw Chargebee card data, Primer tokens, or full Clerk JWTs.
- Email addresses currently appear in Sirloin.SubsAll.email and brain User.email. No global PII redaction layer exists today (see security-model.md → Logging hygiene for the partial sanitisers in apps/sirloin/internal/app/foxy360/server.go:526 and apps/sirloin/internal/app/services/media/listmediaexamples.go:53).
- See the Security Model page for the authoritative redaction rules.
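As a starting point for the PII follow-up, a handler-level hook in log/slog could mask fields before they leave the process. This is a hypothetical sketch, not something present in sirloin: the sensitiveKeys deny-list, the maskEmail format, and the field names are all assumptions, and the Security Model page remains the authority on what must be redacted.

```go
package main

import (
	"log/slog"
	"os"
	"strings"
)

// sensitiveKeys is an assumed deny-list of attribute keys whose values
// should never reach the log sink.
var sensitiveKeys = map[string]bool{
	"card":          true,
	"token":         true,
	"jwt":           true,
	"authorization": true,
}

// maskEmail keeps the first character and the domain,
// e.g. "jane@example.com" becomes "j***@example.com".
func maskEmail(s string) string {
	at := strings.LastIndex(s, "@")
	if at < 1 {
		return "***"
	}
	return s[:1] + "***" + s[at:]
}

// redact is a slog ReplaceAttr hook: it runs for every non-group attribute
// before the handler writes the record.
func redact(_ []string, a slog.Attr) slog.Attr {
	key := strings.ToLower(a.Key)
	switch {
	case sensitiveKeys[key]:
		return slog.String(a.Key, "[REDACTED]")
	case key == "email" && a.Value.Kind() == slog.KindString:
		return slog.String(a.Key, maskEmail(a.Value.String()))
	}
	return a
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{ReplaceAttr: redact}))
	logger.Info("subscription loaded", "email", "jane@example.com", "token", "tok_abc")
}
```

A key-based hook like this only catches values logged under known field names; secrets embedded in free-text messages would still need the per-call-site sanitisers.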
Open follow-ups
- Production minimum log level per service (TODO(@pawel) for brain).
- Metric naming convention enforcement (TODO(@law)).
- Dashboard inventory — Axiom / Sentry / PostHog URLs (TODO(@law)).
- Alerts / SLOs / on-call ownership (TODO(@law)).
- PII masking policy in logs (TODO(@law)).