Round On-Call

Day-of-incident reference for round. For background context use services/round; for fix steps use services/round-runbook.

At a glance

Blast radius. Round is on the synchronous path for embeddings and face-detection. Outages translate directly into degraded responses in sirloin and brain.
Replicas. 1 replica in us-east4-eqdc4a. There is no hot failover — a Railway redeploy or rollback is the recovery action.
CPU-only. No GPU dependencies, no model partitioning. Restarts are clean.
Drain. Health flips to NOT_SERVING on SIGTERM; clients should observe UNAVAILABLE and back off.

Top alerts

These are the symptom-first alerts on-call should expect. The exact axiom queries / monitor IDs are in the observability standard (operations/observability); the table below is the “what does this mean and what do I do” cheat sheet.

Alert	Trigger	Likely cause	First action
Round latency p95 high	`Infer` p95 > 250 ms over 5 min, or > 500 ms for 1 min.	CPU contention; payload size creep; cold start after redeploy.	Check replica CPU in Railway; check `heap_alloc_mb` trend; if isolated to one `model_id`, suspect a payload regression upstream.
Round error rate	`INTERNAL` rate > 1 % over 5 min.	Model file missing or corrupted; ONNX session error; runaway panic in a handler.	Check axiom for `Inference failed` and `model_id`. Compare with the latest deploy time — if it correlates, roll back per `services/round-runbook`.
Round availability	`/health` failing for > 60 s, or `UNAVAILABLE` rate spike at callers.	Process crash loop; OOMKilled; deploy stuck on `loadModels`.	Railway dashboard → check Deployments. Roll back if a deploy is in progress; bump memory if OOM.
Round FD pressure	Log: `High file descriptor usage detected` (>80 % of `RLIMIT_NOFILE`).	Connection leak; goroutine leak; clients not closing streams.	Check `goroutines` in `Resource usage snapshot`. Restart the service to recover; investigate the leak in the next business day.
Round goroutine leak	Log: `High goroutine count detected` (>1000).	Same root cause set as FD pressure.	Capture goroutine snapshot if available, then restart.
Round build failure	Railway build red.	Model download stub or 404; Dockerfile change.	Check the build log for the `0-byte or LFS-pointer stub` guard. Roll back the offending commit.

Signals available

apps/round/internal/pkg/monitoring/monitor.go emits a structured zerolog line every 30 s (Resource usage snapshot). It includes:

goroutines
heap_alloc_mb, heap_sys_mb, num_gc
open_file_descriptors, fd_limit, fd_usage_percent

Plus per-RPC info from interceptors:

loggingInterceptor — method, duration, status, error
monitoringInterceptor — request totals, failure counter
recoveryInterceptor — panic stack traces

There are no Prometheus metrics today — grep -r prometheus apps/round returns no hits, and the resource-usage signal in apps/round/internal/pkg/monitoring/monitor.go is zerolog-only. TODO(@law): track adding a /metrics endpoint as observability hardening — no GitHub issue or ADR is linked from this repo.

Escalation

Round currently has a single primary owner (@law). The escalation chain for a real incident:

L1 (you). Triage with this page. Mitigate via redeploy / rollback in Railway. Stop bleed first.
L2 — service owner (@law). Engage if the L1 mitigation does not stick within 15 min, or if the incident requires a model-file change, a Dockerfile change, or a memory-tier bump.
L3 — caller owners (sirloin, brain). Engage when the symptom is at the caller layer (response timeouts in user-facing flows) and round itself looks healthy. Coordinate to confirm whether the issue is round, the network, or a caller-side bug.

Communication channel: incident channel for the run; status updates pinned every 15 min until resolved.

Decision tree

flowchart TD
  A[Alert fires] --> B{Health 200?}
  B -- No --> C[Process down or boot stuck]
  C --> D{Deploy in progress?}
  D -- Yes --> E[Roll back deploy]
  D -- No --> F[Check OOM / crash loop<br/>bump memory or restart]
  B -- Yes --> G{Error rate elevated?}
  G -- Yes --> H[Inspect axiom 'Inference failed'<br/>per model_id]
  H --> I{Single model_id?}
  I -- Yes --> J[Likely payload or model file regression<br/>check recent deploy]
  I -- No --> K[Likely runtime issue<br/>FD / memory / panic]
  G -- No --> L{Latency elevated?}
  L -- Yes --> M[Check CPU and replica<br/>scale per runbook]
  L -- No --> N[Probably caller-side<br/>escalate to L3]

What is intentionally not paged

INVALID_ARGUMENT rates. These are caller hygiene; they should be tracked at the caller side, not round.
Optional model not loaded at boot (warning log only). The service is still healthy; treat as a follow-up ticket.
Single Inference failed log lines. Only sustained rates page.

Common diagnostic commands

# Status of latest deploy
railway status --json | jq '.services[] | select(.name=="beef-round")'

# Confirm the live model registry
grpcurl -plaintext round:8080 round.v1.RoundService/ListModels

# Health probe (HTTP)
curl -i http://round:8080/health

# Health probe (gRPC) — used by grpc_health_probe in the container
grpc_health_probe -addr=round:8080 -service=round.v1.RoundService

After-incident checklist

File a follow-up issue with the timeline, blast radius, and root cause.
If the incident exposed a new failure mode, add a row to services/round-errors and an alert row here.
If a model file was the culprit, update services/round-models with the version that is actually serving.
Update last_reviewed on the touched docs.