Round Runbook
Operational steps for deploying, rolling back, swapping models, and provisioning capacity for the round service.
Topology
- Single Railway service beef-round, region us-east4-eqdc4a, 1 replica (apps/round/railway.json).
- Builder: Dockerfile, watch path /apps/round/**.
- Healthcheck: HTTP GET /health on GRPC_PORT (default 8080), timeout 120 s.
- Restart policy: ON_FAILURE, max 10 retries.
- Callers: sirloin (gRPC), brain (gRPC). No public ingress.

CPU-only; no GPU node provisioning today (see services/round-env).
Deploy
Round deploys via Railway from main on push, exactly like other services. There is no separate release workflow under .github/workflows/ for round.
Standard flow:
- Open a PR touching apps/round/** (or proto/round/v1/** if the RPC shape changes).
- CI runs make lint and make run-tests (with -race).
- Merge to main. Railway picks up the change via watchPatterns and builds the Dockerfile.
- Build downloads model files from R2 / HuggingFace into the runtime image. Watch for a non-zero exit on the model-download stage: the Dockerfile fails the build if a download produced a 0-byte or LFS-pointer stub.
- Healthcheck GET /health must return 200 within 120 s of the new container starting. Health flips to SERVING only after loadModels completes (apps/round/cmd/app/main.go), so failures inside loadModels will stall the deploy.
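The Dockerfile's download check is not reproduced here. As a rough illustration of what "0-byte or LFS-pointer stub" means, the sketch below flags a file that is empty or that starts with the standard Git LFS pointer header. The function names looksLikeStub and checkModelFile are hypothetical and are not taken from the round codebase.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// lfsPointerPrefix is the first line of a Git LFS pointer file.
var lfsPointerPrefix = []byte("version https://git-lfs.github.com/spec/v1")

// looksLikeStub reports whether a download is unusable as a model file:
// either empty, or a Git LFS pointer instead of the real binary.
// head holds the first bytes of the file; size is its total length.
func looksLikeStub(head []byte, size int64) bool {
	if size == 0 {
		return true
	}
	return bytes.HasPrefix(head, lfsPointerPrefix)
}

// checkModelFile (hypothetical helper) applies looksLikeStub to a file on disk.
func checkModelFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return err
	}
	// Read only as many bytes as the pointer header needs.
	head := make([]byte, len(lfsPointerPrefix))
	n, err := io.ReadFull(f, head)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return err
	}
	if looksLikeStub(head[:n], info.Size()) {
		return fmt.Errorf("%s: empty file or LFS pointer stub", path)
	}
	return nil
}
```

A check like this only catches the two failure modes named above; it does not validate that the file is a well-formed ONNX model.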
Verify a deploy:
    # from the repo root
    railway status --json | jq '.services[] | select(.name=="beef-round") | .latestDeployment'
    # or via grpcurl against the public-internal hostname (only inside the network)
    grpcurl -plaintext round:8080 round.v1.RoundService/ListModels

Cross-check logs in axiom for "Models loaded successfully" and "gRPC+health listener accepting connections".
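The 120 s health window can also be checked from a script running inside the network. This is a minimal sketch, not tooling that exists in the repo: waitHealthy and probeHTTP are hypothetical names, and the URL in the usage note assumes the round:8080 hostname from the grpcurl example also serves /health.

```go
package main

import (
	"errors"
	"net/http"
	"time"
)

// probeHTTP returns a probe that issues GET url and reports the status code.
func probeHTTP(url string) func() (int, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	return func() (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

// waitHealthy calls probe every interval until it returns 200
// or the next attempt would fall past the timeout.
func waitHealthy(probe func() (int, error), timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if code, err := probe(); err == nil && code == http.StatusOK {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errors.New("health check did not return 200 before the deadline")
		}
		time.Sleep(interval)
	}
}
```

For example, waitHealthy(probeHTTP("http://round:8080/health"), 120*time.Second, 2*time.Second) mirrors the Railway healthcheck timeout. Taking the probe as a function keeps the retry loop testable without a live service.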
Rollback
Railway exposes a one-click rollback to the previous successful deployment. Use it when:
- Healthcheck is failing post-deploy.
- A model file was swapped to a bad URL and round is now producing INTERNAL for one model_id.
- A proto change is incompatible with the live sirloin / brain clients.
Steps:
- Railway dashboard → beef-round → Deployments → previous green build → Redeploy.
- Confirm /health returns 200 and ListModels returns the expected set.
- Revert the offending commit on main so the next deploy does not re-introduce the bug.
If the bad deploy was a proto change, also redeploy any caller services that already shipped with the new client stubs (sirloin / brain) so their request shape matches the rolled-back round.
Model swap
Changing a model URL or version goes through the build, not a runtime config. The standard flow:
- Update the relevant default URL in apps/round/Dockerfile (RETINAFACE_MODEL_URL, LVFACE_MODEL_URL, MODELS_BASE_URL) or the embeddings URLs in code / Railway env (ROUND_EMBEDDINGS_MODEL_URL, ROUND_EMBEDDINGS_TOKENIZER_URL). See services/round-env for the full list.
- Bump any version metadata in services/round-models so the doc reflects what is actually serving.
- Open a PR; let CI build and Railway deploy.
- After the deploy is green, call ListModels and confirm version and model_id match the new spec.
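The final confirmation step can be made mechanical by comparing the expected model_id and version pairs against what ListModels reports. The sketch below works on plain maps so it makes no assumption about the RPC response shape. The function name diffModels and the model ids in the usage note are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// diffModels compares expected model_id -> version pairs with the pairs
// the service reports and returns one message per mismatch, sorted.
func diffModels(expected, served map[string]string) []string {
	var problems []string
	for id, want := range expected {
		got, ok := served[id]
		switch {
		case !ok:
			problems = append(problems, fmt.Sprintf("%s: not served", id))
		case got != want:
			problems = append(problems, fmt.Sprintf("%s: version %q, want %q", id, got, want))
		}
	}
	sort.Strings(problems)
	return problems
}
```

With expected = {"detector": "2", "embedder": "1"} and served = {"detector": "1"}, the result lists a version mismatch for detector and a missing embedder. An empty result means the deploy serves the new spec.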
For one-off A/B testing without a code change:
- Set the override env at the Railway service level (e.g. ROUND_LVFACE_URL=https://...).
- Trigger a redeploy so the runtime fetches the new file into MODEL_CACHE_DIR.
- Watch heap_alloc_mb in the resource snapshot logs for unexpected growth.
There is no hot-swap path. The registry is built once per process, so a model change always implies a restart.
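To illustrate why a build-once registry rules out hot-swapping, here is a minimal sketch of that pattern. It is not the round implementation: Model, Registry, and NewRegistry are hypothetical names, and the real registry holds ONNX sessions rather than plain structs.

```go
package main

import "fmt"

// Model is a hypothetical stand-in for a loaded ONNX session.
type Model struct {
	ID      string
	Version string
}

// Registry is populated once by NewRegistry and exposes no setters,
// so replacing a model requires building a new process.
type Registry struct {
	models map[string]Model
}

// NewRegistry loads every model up front and fails if any load fails.
func NewRegistry(ids []string, load func(id string) (Model, error)) (*Registry, error) {
	models := make(map[string]Model, len(ids))
	for _, id := range ids {
		m, err := load(id)
		if err != nil {
			return nil, fmt.Errorf("load %s: %w", id, err)
		}
		models[id] = m
	}
	return &Registry{models: models}, nil
}

// Get is read-only; the map is never written after construction,
// so concurrent callers need no locking.
func (r *Registry) Get(id string) (Model, bool) {
	m, ok := r.models[id]
	return m, ok
}
```

The trade-off in a design like this is simplicity over flexibility: lookups on the request path stay lock-free, at the cost of a restart for every model change.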
flowchart LR
    edit[Edit Dockerfile / env] --> pr[Open PR]
    pr --> ci[CI lint + tests]
    ci --> merge[Merge to main]
    merge --> build[Railway Docker build<br/>downloads ONNX]
    build --> healthcheck["/health 200"]
    healthcheck --> serving[SERVING flag flipped]
    serving --> done[Live traffic]

Capacity
There is one replica today and no horizontal autoscaling. To scale:
- Edit multiRegionConfig.us-east4-eqdc4a.numReplicas in apps/round/railway.json.
- Mirror the change in the Railway service config (it does not auto-follow the file; see operations/railway).
- After scaling, watch FD and goroutine counters in axiom. Each replica re-loads all ONNX models into RAM, so memory cost scales linearly.
If GPU is ever introduced, capacity provisioning becomes a separate decision tracked by an ADR; today the answer is “more CPU replicas”.

TODO(@law): confirm the Railway plan tier sustains the memory ceiling needed for both face models (RetinaFace mv1_0.25 + LVFace-B Glint360K, see apps/round/Dockerfile) loaded simultaneously. The plan/tier is not declared in apps/round/railway.json.
Common operations
Restart the service
Railway dashboard → beef-round → Restart. Round handles SIGTERM gracefully:
- Health flips to NOT_SERVING immediately.
- gRPC server stops accepting new streams; in-flight RPCs are given up to 30 s (shutdownTimeout).
- HTTP server gets a 10 s grace on Shutdown.
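The shutdown sequence above can be sketched as follows. This is an illustration under assumptions, not the code in main.go: drainOrForce, shutdown, and httpGrace are hypothetical names, the steps are assumed to run sequentially, and the gRPC server is passed in as two functions (for a *grpc.Server these would be its GracefulStop and Stop methods) so the sketch needs only the standard library.

```go
package main

import (
	"context"
	"net/http"
	"time"
)

const (
	shutdownTimeout = 30 * time.Second // grace for in-flight RPCs, per the runbook
	httpGrace       = 10 * time.Second // grace for the HTTP server, per the runbook
)

// drainOrForce runs graceful in the background and waits up to timeout.
// If the drain does not finish in time it calls force.
// It reports whether the drain completed on its own.
func drainOrForce(graceful, force func(), timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		graceful()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		force()
		return false
	}
}

// shutdown applies the three steps in order: mark NOT_SERVING,
// drain gRPC, then shut down the HTTP server.
func shutdown(setNotServing, gracefulStop, stop func(), h *http.Server) error {
	setNotServing()
	drainOrForce(gracefulStop, stop, shutdownTimeout)
	ctx, cancel := context.WithTimeout(context.Background(), httpGrace)
	defer cancel()
	return h.Shutdown(ctx)
}
```

Flipping health first gives callers and the platform a chance to stop routing new work before the drain begins, which is why it comes ahead of the gRPC stop.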
Inspect live RPCs
Run grpcurl -plaintext round:8080 list to enumerate the services (reflection is enabled). Then call Health/Check, ListModels, or Infer with a small payload.
Tail logs
Filter axiom for service:round. Useful queries:
- service:round msg:"Inference failed" surfaces INTERNAL errors with model_id.
- service:round msg:"Resource usage snapshot" shows the periodic monitoring lines, emitted every 30 s.
- service:round level:warn msg:"High file descriptor usage detected" indicates FD pressure.
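For readers unfamiliar with what a periodic resource snapshot contains, the sketch below shows one way such lines could be produced with the Go runtime package. It is an assumption-laden illustration, not the round logger: takeSnapshot and logSnapshots are hypothetical names, only heap_alloc_mb and the message text come from this runbook, and the real service may record more fields (such as file descriptors).

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// resourceSnapshot holds the two counters this sketch reports.
type resourceSnapshot struct {
	HeapAllocMB uint64
	Goroutines  int
}

// takeSnapshot reads current heap usage and goroutine count from the runtime.
func takeSnapshot() resourceSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return resourceSnapshot{
		HeapAllocMB: m.HeapAlloc / (1024 * 1024),
		Goroutines:  runtime.NumGoroutine(),
	}
}

// logSnapshots emits one line per interval until stop is closed.
func logSnapshots(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			s := takeSnapshot()
			log.Printf("Resource usage snapshot heap_alloc_mb=%d goroutines=%d",
				s.HeapAllocMB, s.Goroutines)
		case <-stop:
			return
		}
	}
}
```

Started as go logSnapshots(30*time.Second, stop), this would match the 30 s cadence mentioned above. A steadily rising goroutine count across snapshots is usually the first sign of a leak.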
Disaster scenarios
| Scenario | First action | Escalation |
|---|---|---|
| Deploy stuck — healthcheck never returns 200 | Check Railway build logs for model-download failures or Failed to load models. | Roll back to previous deployment. |
| All Infer calls returning INTERNAL | Check axiom for Inference failed and ONNX init errors. Confirm model files are present in MODEL_CACHE_DIR. | Roll back; restart with verified model URLs. |
| OOMKilled loop | Inspect Resource usage snapshot for memory trend before the kill. | Bump Railway memory tier; consider unloading the optional face-embedding model. |
| Caller reports UNAVAILABLE storms | Confirm replica count and that MaxConcurrentStreams (100) is not the bottleneck under load. | Add a replica; coordinate keepalive defaults with sirloin / brain clients. |
See services/round-oncall for alert thresholds and paging routes.