Round On-Call

Day-of-incident reference for round. For background context use services/round; for fix steps use services/round-runbook.

At a glance

  • Blast radius. Round is on the synchronous path for embeddings and face-detection. Outages translate directly into degraded responses in sirloin and brain.
  • Replicas. 1 replica in us-east4-eqdc4a. There is no hot failover — a Railway redeploy or rollback is the recovery action.
  • CPU-only. No GPU dependencies, no model partitioning. Restarts are clean.
  • Drain. Health flips to NOT_SERVING on SIGTERM; clients should observe UNAVAILABLE and back off.

Top alerts

These are the symptom-first alerts on-call should expect. The exact axiom queries / monitor IDs are in the observability standard (operations/observability); the table below is the “what does this mean and what do I do” cheat sheet.

| Alert | Trigger | Likely cause | First action |
| --- | --- | --- | --- |
| Round latency p95 high | Infer p95 > 250 ms over 5 min, or > 500 ms for 1 min. | CPU contention; payload size creep; cold start after redeploy. | Check replica CPU in Railway; check heap_alloc_mb trend; if isolated to one model_id, suspect a payload regression upstream. |
| Round error rate | INTERNAL rate > 1 % over 5 min. | Model file missing or corrupted; ONNX session error; runaway panic in a handler. | Check axiom for Inference failed and model_id. Compare with the latest deploy time; if it correlates, roll back per services/round-runbook. |
| Round availability | /health failing for > 60 s, or UNAVAILABLE rate spike at callers. | Process crash loop; OOMKilled; deploy stuck on loadModels. | Railway dashboard → check Deployments. Roll back if a deploy is in progress; bump memory if OOM. |
| Round FD pressure | Log: High file descriptor usage detected (>80 % of RLIMIT_NOFILE). | Connection leak; goroutine leak; clients not closing streams. | Check goroutines in Resource usage snapshot. Restart the service to recover; investigate the leak on the next business day. |
| Round goroutine leak | Log: High goroutine count detected (>1000). | Same root cause set as FD pressure. | Capture goroutine snapshot if available, then restart. |
| Round build failure | Railway build red. | Model download stub or 404; Dockerfile change. | Check the build log for the 0-byte or LFS-pointer stub guard. Roll back the offending commit. |

Signals available

apps/round/internal/pkg/monitoring/monitor.go emits a structured zerolog line every 30 s (Resource usage snapshot). It includes:

  • goroutines
  • heap_alloc_mb, heap_sys_mb, num_gc
  • open_file_descriptors, fd_limit, fd_usage_percent
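When reading these snapshots by hand during an incident, it can help to check a raw line against the two log-based alert thresholds (FD usage above 80 %, goroutines above 1000). The sketch below assumes the line is JSON, which is zerolog's default output, and that the fields are numeric; the struct, the `triage` function, and the sample line are hypothetical and not taken from monitor.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshot picks out the fields needed for triage.
// Field names come from this page; the JSON types are assumptions.
type snapshot struct {
	Goroutines     int     `json:"goroutines"`
	HeapAllocMB    float64 `json:"heap_alloc_mb"`
	FDUsagePercent float64 `json:"fd_usage_percent"`
}

// triage parses one snapshot line and returns the alert conditions it meets.
func triage(line string) ([]string, error) {
	var s snapshot
	if err := json.Unmarshal([]byte(line), &s); err != nil {
		return nil, fmt.Errorf("parse snapshot: %w", err)
	}
	var findings []string
	if s.FDUsagePercent > 80 {
		findings = append(findings, "fd pressure")
	}
	if s.Goroutines > 1000 {
		findings = append(findings, "goroutine leak")
	}
	return findings, nil
}

func main() {
	// Illustrative line only; real field values and ordering will differ.
	line := `{"level":"info","goroutines":1500,"heap_alloc_mb":212.5,"fd_usage_percent":11.7,"message":"Resource usage snapshot"}`
	findings, err := triage(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(findings)
}
```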

Plus per-RPC info from interceptors:

  • loggingInterceptor — method, duration, status, error
  • monitoringInterceptor — request totals, failure counter
  • recoveryInterceptor — panic stack traces
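For readers unfamiliar with the recovery pattern, the sketch below shows the general idea of turning a handler panic into an error plus a logged stack trace. It is not the code in apps/round: `handler` is a hypothetical stand-in for a gRPC unary handler, and `withRecovery` and its `logf` parameter are illustrative names.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// handler is a hypothetical stand-in for a gRPC unary handler.
type handler func(req string) (string, error)

// withRecovery wraps next so that a panic is logged with its stack trace
// and returned to the caller as an error instead of crashing the process.
func withRecovery(next handler, logf func(format string, args ...any)) handler {
	return func(req string) (resp string, err error) {
		defer func() {
			if r := recover(); r != nil {
				logf("panic recovered: %v\n%s", r, debug.Stack())
				err = fmt.Errorf("internal: recovered panic: %v", r)
			}
		}()
		return next(req)
	}
}

func main() {
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }
	h := withRecovery(func(req string) (string, error) {
		panic("nil model session") // simulated handler bug
	}, logf)

	_, err := h("infer")
	fmt.Println("handler returned:", err)
}
```

This is why a runaway panic tends to surface as an elevated error rate rather than a crash loop: the process stays up and the failure is visible per request.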

There are no Prometheus metrics today — grep -r prometheus apps/round returns no hits, and the resource-usage signal in apps/round/internal/pkg/monitoring/monitor.go is zerolog-only. TODO(@law): track adding a /metrics endpoint as observability hardening — no GitHub issue or ADR is linked from this repo.

Escalation

Round currently has a single primary owner (@law). The escalation chain for a real incident:

  1. L1 (you). Triage with this page. Mitigate via redeploy / rollback in Railway. Stop the bleeding first.
  2. L2 — service owner (@law). Engage if the L1 mitigation does not stick within 15 min, or if the incident requires a model-file change, a Dockerfile change, or a memory-tier bump.
  3. L3 — caller owners (sirloin, brain). Engage when the symptom is at the caller layer (response timeouts in user-facing flows) and round itself looks healthy. Coordinate to confirm whether the issue is round, the network, or a caller-side bug.

Communication channel: incident channel for the run; status updates pinned every 15 min until resolved.

Decision tree

```mermaid
flowchart TD
A[Alert fires] --> B{Health 200?}
B -- No --> C[Process down or boot stuck]
C --> D{Deploy in progress?}
D -- Yes --> E[Roll back deploy]
D -- No --> F[Check OOM / crash loop<br/>bump memory or restart]
B -- Yes --> G{Error rate elevated?}
G -- Yes --> H[Inspect axiom 'Inference failed'<br/>per model_id]
H --> I{Single model_id?}
I -- Yes --> J[Likely payload or model file regression<br/>check recent deploy]
I -- No --> K[Likely runtime issue<br/>FD / memory / panic]
G -- No --> L{Latency elevated?}
L -- Yes --> M[Check CPU and replica<br/>scale per runbook]
L -- No --> N[Probably caller-side<br/>escalate to L3]
```
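The same tree can be written as a small function, which is handy for checking the branch order or for wiring into a chat-ops helper. This is a sketch only: the `signals` struct and `nextStep` are hypothetical names, and the inputs are assumed to be gathered by hand from Railway and axiom.

```go
package main

import "fmt"

// signals holds the yes/no answers the decision tree asks for.
type signals struct {
	HealthOK          bool // /health returns 200
	DeployInProgress  bool
	ErrorRateElevated bool
	SingleModelID     bool // failures isolated to one model_id
	LatencyElevated   bool
}

// nextStep walks the tree top to bottom and returns the suggested action.
func nextStep(s signals) string {
	switch {
	case !s.HealthOK && s.DeployInProgress:
		return "roll back deploy"
	case !s.HealthOK:
		return "check OOM / crash loop; bump memory or restart"
	case s.ErrorRateElevated && s.SingleModelID:
		return "likely payload or model file regression; check recent deploy"
	case s.ErrorRateElevated:
		return "likely runtime issue; check FD / memory / panic"
	case s.LatencyElevated:
		return "check CPU and replica; scale per runbook"
	default:
		return "probably caller-side; escalate to L3"
	}
}

func main() {
	fmt.Println(nextStep(signals{HealthOK: true, ErrorRateElevated: true, SingleModelID: true}))
}
```

Note that health is checked first, so an unhealthy process short-circuits the error-rate and latency branches, matching the flowchart.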

What is intentionally not paged

  • INVALID_ARGUMENT rates. These are caller hygiene; they should be tracked at the caller side, not round.
  • Optional model not loaded at boot (warning log only). The service is still healthy; treat as a follow-up ticket.
  • Single Inference failed log lines. Only sustained rates page.

Common diagnostic commands

```shell
# Status of latest deploy
railway status --json | jq '.services[] | select(.name=="beef-round")'
# Confirm the live model registry
grpcurl -plaintext round:8080 round.v1.RoundService/ListModels
# Health probe (HTTP)
curl -i http://round:8080/health
# Health probe (gRPC) — used by grpc_health_probe in the container
grpc_health_probe -addr=round:8080 -service=round.v1.RoundService
```

After-incident checklist

  1. File a follow-up issue with the timeline, blast radius, and root cause.
  2. If the incident exposed a new failure mode, add a row to services/round-errors and an alert row here.
  3. If a model file was the culprit, update services/round-models with the version that is actually serving.
  4. Update last_reviewed on the touched docs.