Round On-Call
Day-of-incident reference for round. For background context use services/round; for fix steps use services/round-runbook.
At a glance
- Blast radius. Round is on the synchronous path for embeddings and face-detection. Outages translate directly into degraded responses in sirloin and brain.
- Replicas. 1 replica in us-east4-eqdc4a. There is no hot failover; a Railway redeploy or rollback is the recovery action.
- CPU-only. No GPU dependencies, no model partitioning. Restarts are clean.
- Drain. Health flips to NOT_SERVING on SIGTERM; clients should observe UNAVAILABLE and back off.
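
Because there is a single replica and recovery is a redeploy or rollback, on-call usually needs to know when the service is serving again. The following is a minimal sketch of a polling loop with exponential backoff, not an existing tool in this repo: `wait_until_ok` is a hypothetical helper, and the attempt count and delays in the example are illustrative assumptions. The health URL is the one used in the diagnostic commands on this page.

```shell
# wait_until_ok MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0, doubling the delay
# after each failed attempt. Returns 0 on success, 1 if every attempt fails.
wait_until_ok() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Back off before the next attempt; no sleep after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (attempt count and delay are assumptions, tune to taste):
#   wait_until_ok 6 2 curl -fsS -o /dev/null http://round:8080/health
```

The probe command is passed in as arguments, so the same loop can wrap the gRPC health probe instead of curl.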
Top alerts
These are the symptom-first alerts on-call should expect. The exact axiom queries / monitor IDs are in the observability standard (operations/observability); the table below is the “what does this mean and what do I do” cheat sheet.
| Alert | Trigger | Likely cause | First action |
|---|---|---|---|
| Round latency p95 high | Infer p95 > 250 ms over 5 min, or > 500 ms for 1 min. | CPU contention; payload size creep; cold start after redeploy. | Check replica CPU in Railway; check heap_alloc_mb trend; if isolated to one model_id, suspect a payload regression upstream. |
| Round error rate | INTERNAL rate > 1 % over 5 min. | Model file missing or corrupted; ONNX session error; runaway panic in a handler. | Check axiom for Inference failed and model_id. Compare with the latest deploy time — if it correlates, roll back per services/round-runbook. |
| Round availability | /health failing for > 60 s, or UNAVAILABLE rate spike at callers. | Process crash loop; OOMKilled; deploy stuck on loadModels. | Railway dashboard → check Deployments. Roll back if a deploy is in progress; bump memory if OOM. |
| Round FD pressure | Log: High file descriptor usage detected (>80 % of RLIMIT_NOFILE). | Connection leak; goroutine leak; clients not closing streams. | Check goroutines in Resource usage snapshot. Restart the service to recover; investigate the leak on the next business day. |
| Round goroutine leak | Log: High goroutine count detected (>1000). | Same root cause set as FD pressure. | Capture goroutine snapshot if available, then restart. |
| Round build failure | Railway build red. | Model download stub or 404; Dockerfile change. | Check the build log for the 0-byte or LFS-pointer stub guard. Roll back the offending commit. |
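
The latency and error-rate triggers in the table are plain threshold rules. As a sketch, they can be restated as shell predicates to sanity-check numbers read off a dashboard by hand. The function names are hypothetical and the inputs are assumed to be supplied manually; the real monitors are defined in operations/observability, not here.

```shell
# latency_alert P95_OVER_5MIN_MS P95_OVER_1MIN_MS
# Exits 0 (would fire) if p95 > 250 ms over 5 min, or > 500 ms for 1 min.
latency_alert() {
  awk -v p5="$1" -v p1="$2" 'BEGIN { exit !(p5 > 250 || p1 > 500) }'
}

# error_rate_alert INTERNAL_COUNT TOTAL_COUNT   (both counted over 5 min)
# Exits 0 (would fire) if INTERNAL responses exceed 1 % of all responses.
# Compares bad * 100 against total to avoid floating-point division.
error_rate_alert() {
  awk -v bad="$1" -v total="$2" \
    'BEGIN { if (total <= 0) exit 1; exit !(bad * 100 > total) }'
}

# Example:
#   latency_alert 310 280 && echo "latency alert would fire"
#   error_rate_alert 12 900 && echo "error-rate alert would fire"
```

Both thresholds are strict, matching the table: exactly 250 ms or exactly 1 % does not fire.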
Signals available
apps/round/internal/pkg/monitoring/monitor.go emits a structured zerolog line every 30 s (Resource usage snapshot). It includes:
- goroutines
- heap_alloc_mb, heap_sys_mb, num_gc
- open_file_descriptors, fd_limit, fd_usage_percent
Plus per-RPC info from interceptors:
- loggingInterceptor: method, duration, status, error
- monitoringInterceptor: request totals, failure counter
- recoveryInterceptor: panic stack traces
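
The snapshot fields map directly onto the FD-pressure and goroutine-leak alerts. A minimal sketch of checking one snapshot against the limits quoted in the alert table follows. `check_snapshot` is a hypothetical helper, and the commented wiring assumes the snapshots are available as JSON lines in a local file (`round.log` is a placeholder name) with the message text under zerolog's default `message` key; neither assumption is confirmed by this page.

```shell
# check_snapshot GOROUTINES FD_USAGE_PERCENT
# Prints one line per threshold crossed, using the limits from the alert
# table (>1000 goroutines, >80 % of RLIMIT_NOFILE). Prints "ok" otherwise.
check_snapshot() {
  awk -v g="$1" -v fd="$2" 'BEGIN {
    flagged = 0
    if (g > 1000) { print "goroutine count high: " g; flagged = 1 }
    if (fd > 80)  { print "fd usage high: " fd "%"; flagged = 1 }
    if (!flagged) print "ok"
  }'
}

# Example wiring (file name and JSON key are assumptions, see above):
#   jq -r 'select(.message == "Resource usage snapshot")
#          | "\(.goroutines) \(.fd_usage_percent)"' round.log |
#     while read -r g fd; do check_snapshot "$g" "$fd"; done
```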
There are no Prometheus metrics today — grep -r prometheus apps/round returns no hits, and the resource-usage signal in apps/round/internal/pkg/monitoring/monitor.go is zerolog-only. TODO(@law): track adding a /metrics endpoint as observability hardening — no GitHub issue or ADR is linked from this repo.
Escalation
Round currently has a single primary owner (@law). The escalation chain for a real incident:
- L1 (you). Triage with this page. Mitigate via redeploy / rollback in Railway. Stop bleed first.
- L2, service owner (@law). Engage if the L1 mitigation does not stick within 15 min, or if the incident requires a model-file change, a Dockerfile change, or a memory-tier bump.
- L3, caller owners (sirloin, brain). Engage when the symptom is at the caller layer (response timeouts in user-facing flows) and round itself looks healthy. Coordinate to confirm whether the issue is round, the network, or a caller-side bug.
Communication channel: incident channel for the run; status updates pinned every 15 min until resolved.
Decision tree
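
As a text-only companion to the flowchart in this section, the same branches can be sketched as a shell function. `triage` is a hypothetical helper for illustration, not a script that exists in the repo; it takes the answers to the five questions as "yes" or "no" and prints the action that branch leads to.

```shell
# triage HEALTH_OK DEPLOY_IN_PROGRESS ERRORS_ELEVATED SINGLE_MODEL LATENCY_ELEVATED
# Each argument is "yes" or "no". Prints the action the flowchart leads to.
triage() {
  local health_ok=$1 deploying=$2 errors=$3 single_model=$4 latency=$5
  if [ "$health_ok" != yes ]; then
    # Process down or boot stuck.
    if [ "$deploying" = yes ]; then
      echo "roll back deploy"
    else
      echo "check OOM / crash loop; bump memory or restart"
    fi
  elif [ "$errors" = yes ]; then
    # Inspect 'Inference failed' per model_id first.
    if [ "$single_model" = yes ]; then
      echo "likely payload or model file regression; check recent deploy"
    else
      echo "likely runtime issue (FD / memory / panic)"
    fi
  elif [ "$latency" = yes ]; then
    echo "check CPU and replica; scale per runbook"
  else
    echo "probably caller-side; escalate to L3"
  fi
}

# Example: health is fine, errors elevated across several model_ids.
#   triage yes no yes no no
```

The order of the checks matters: health is evaluated before error rate, and error rate before latency, as in the flowchart.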
    flowchart TD
        A[Alert fires] --> B{Health 200?}
        B -- No --> C[Process down or boot stuck]
        C --> D{Deploy in progress?}
        D -- Yes --> E[Roll back deploy]
        D -- No --> F[Check OOM / crash loop<br/>bump memory or restart]
        B -- Yes --> G{Error rate elevated?}
        G -- Yes --> H[Inspect axiom 'Inference failed'<br/>per model_id]
        H --> I{Single model_id?}
        I -- Yes --> J[Likely payload or model file regression<br/>check recent deploy]
        I -- No --> K[Likely runtime issue<br/>FD / memory / panic]
        G -- No --> L{Latency elevated?}
        L -- Yes --> M[Check CPU and replica<br/>scale per runbook]
        L -- No --> N[Probably caller-side<br/>escalate to L3]

What is intentionally not paged
- INVALID_ARGUMENT rates. These are caller hygiene; they should be tracked at the caller side, not round.
- Optional model not loaded at boot (warning log only). The service is still healthy; treat as a follow-up ticket.
- Single Inference failed log lines. Only sustained rates page.
Common diagnostic commands
    # Status of latest deploy
    railway status --json | jq '.services[] | select(.name=="beef-round")'

    # Confirm the live model registry
    grpcurl -plaintext round:8080 round.v1.RoundService/ListModels

    # Health probe (HTTP)
    curl -i http://round:8080/health

    # Health probe (gRPC), used by grpc_health_probe in the container
    grpc_health_probe -addr=round:8080 -service=round.v1.RoundService

After-incident checklist
- File a follow-up issue with the timeline, blast radius, and root cause.
- If the incident exposed a new failure mode, add a row to services/round-errors and an alert row here.
- If a model file was the culprit, update services/round-models with the version that is actually serving.
- Update last_reviewed on the touched docs.