Round Runbook

Operational steps for deploying, rolling back, swapping models, and provisioning capacity for the round service.

Topology

Single Railway service beef-round, region us-east4-eqdc4a, 1 replica (apps/round/railway.json).
Builder: Dockerfile, watch path /apps/round/**.
Healthcheck: HTTP GET /health on GRPC_PORT (default 8080), timeout 120 s.
Restart policy: ON_FAILURE, max 10 retries.
Callers: sirloin (gRPC), brain (gRPC). No public ingress.

CPU-only; no GPU node provisioning today (see services/round-env).

Deploy

Round deploys via Railway on every push to main, like the other services. There is no separate release workflow under .github/workflows/ for round.

Standard flow:

Open a PR touching apps/round/** (or proto/round/v1/** if RPC shape changes).
CI runs make lint and make run-tests (with -race).
Merge to main. Railway picks up the change via watchPatterns and builds the Dockerfile.
The build downloads model files from R2 / HuggingFace into the runtime image. Watch for a non-zero exit on the model-download stage: the Dockerfile fails the build if a download produces a 0-byte file or an LFS-pointer stub.
Healthcheck GET /health must return 200 within 120 s of the new container starting. Health flips to SERVING only after loadModels completes (apps/round/cmd/app/main.go), so a failure or hang inside loadModels stalls the deploy until the healthcheck times out.
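
A minimal sketch of the gating described in the last step, assuming /health is a plain HTTP handler that answers 503 until model loading has finished. The healthGate type, MarkServing, and probe are hypothetical names for illustration and are not taken from apps/round/cmd/app/main.go.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthGate answers 503 until MarkServing is called, then 200.
// Hypothetical type: the real handler in round may be wired differently.
type healthGate struct {
	serving atomic.Bool
}

// MarkServing is meant to be called once, after model loading succeeds.
func (g *healthGate) MarkServing() { g.serving.Store(true) }

func (g *healthGate) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if !g.serving.Load() {
		http.Error(w, "NOT_SERVING", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "SERVING")
}

// probe issues an in-process GET /health and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	gate := &healthGate{}
	fmt.Println("before load:", probe(gate))
	// Model loading would run here; if it fails or hangs, the gate never opens
	// and the platform healthcheck eventually times out.
	gate.MarkServing()
	fmt.Println("after load:", probe(gate))
}
```

Keeping the flag behind a single atomic means the healthcheck can never observe a half-loaded state: either every model is in place or the handler still reports 503.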

Verify a deploy:

# from the repo root
railway status --json | jq '.services[] | select(.name=="beef-round") | .latestDeployment'

# or via grpcurl against the internal hostname (reachable only from inside the private network)
grpcurl -plaintext round:8080 round.v1.RoundService/ListModels

Cross-check logs in axiom for the messages "Models loaded successfully" and "gRPC+health listener accepting connections".

Rollback

Railway exposes a one-click rollback to the previous successful deployment. Use it when:

Healthcheck is failing post-deploy.
A model URL was changed to a bad value and round now returns INTERNAL for that model_id.
A proto change is incompatible with the live sirloin / brain clients.

Steps:

Railway dashboard → beef-round → Deployments → previous green build → Redeploy.
Confirm /health returns 200 and ListModels returns the expected set.
Revert the offending commit on main so the next deploy does not re-introduce the bug.

If the bad deploy was a proto change, also redeploy any caller services that already shipped with the new client stubs (sirloin / brain) so their request shape matches the rolled-back round.

Model swap

Changing a model URL or version goes through the build, not runtime configuration. The flow:

Update the relevant default URL in apps/round/Dockerfile (RETINAFACE_MODEL_URL, LVFACE_MODEL_URL, MODELS_BASE_URL) or the embeddings URLs in code / Railway env (ROUND_EMBEDDINGS_MODEL_URL, ROUND_EMBEDDINGS_TOKENIZER_URL). See services/round-env for the full list.
Bump any version metadata in services/round-models so the doc reflects what is actually serving.
Open a PR; let CI build and Railway deploy.
After the deploy is green, call ListModels and confirm version and model_id match the new spec.
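
The final confirmation step can be made mechanical by diffing the expected model set against what ListModels reports. The sketch below assumes the response has already been decoded into a slice; modelInfo, diffModels, and the model IDs and versions in main are hypothetical placeholders, and the real response type in proto/round/v1 may carry more fields.

```go
package main

import (
	"fmt"
	"sort"
)

// modelInfo holds the two fields the runbook checks after a swap.
// Hypothetical shape: the generated ListModels type may differ.
type modelInfo struct {
	ModelID string
	Version string
}

// diffModels returns one sorted message per missing, mismatched,
// or unexpected model. An empty result means the sets agree.
func diffModels(want, got []modelInfo) []string {
	gotByID := make(map[string]string, len(got))
	for _, m := range got {
		gotByID[m.ModelID] = m.Version
	}
	var problems []string
	for _, w := range want {
		v, ok := gotByID[w.ModelID]
		switch {
		case !ok:
			problems = append(problems, fmt.Sprintf("missing model %s", w.ModelID))
		case v != w.Version:
			problems = append(problems,
				fmt.Sprintf("model %s: want version %s, got %s", w.ModelID, w.Version, v))
		}
		delete(gotByID, w.ModelID)
	}
	for id := range gotByID {
		problems = append(problems, fmt.Sprintf("unexpected model %s", id))
	}
	sort.Strings(problems)
	return problems
}

func main() {
	// Placeholder IDs and versions, for illustration only.
	want := []modelInfo{{"model-a", "2"}, {"model-b", "1"}}
	got := []modelInfo{{"model-a", "1"}, {"model-b", "1"}}
	for _, p := range diffModels(want, got) {
		fmt.Println(p)
	}
}
```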

For one-off A/B testing without a code change:

Set the override env at the Railway service level (e.g. ROUND_LVFACE_URL=https://...).
Trigger a redeploy so the runtime fetches the new file into MODEL_CACHE_DIR.
Watch heap_alloc_mb in the resource snapshot logs for unexpected growth.

There is no hot-swap path. The registry is built once per process, so a model change always implies a restart.

flowchart LR
  edit[Edit Dockerfile / env] --> pr[Open PR]
  pr --> ci[CI lint + tests]
  ci --> merge[Merge to main]
  merge --> build[Railway Docker build<br/>downloads ONNX]
  build --> load[loadModels completes]
  load --> serving[SERVING flag flipped]
  serving --> healthcheck["/health 200"]
  healthcheck --> done[Live traffic]

Capacity

There is one replica today and no horizontal autoscaling. To scale:

Edit multiRegionConfig.us-east4-eqdc4a.numReplicas in apps/round/railway.json.
Mirror the change in the Railway service config (it does not auto-follow the file — see operations/railway).
After scaling, watch FD and goroutine counters in axiom. Each replica re-loads all ONNX models into RAM, so memory cost scales linearly with the replica count.
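
As a rough worked example of that linear scaling, the total is simply the replica count multiplied by the per-replica model footprint. The footprints below are hypothetical placeholders; real numbers should come from heap_alloc_mb in the resource snapshot logs.

```go
package main

import "fmt"

// fleetMemoryMB returns the total model memory across all replicas,
// assuming every replica loads the same set of models.
func fleetMemoryMB(replicas int, modelMB []int) int {
	perReplica := 0
	for _, mb := range modelMB {
		perReplica += mb
	}
	return replicas * perReplica
}

func main() {
	// Hypothetical per-model footprints in MB, not measured values.
	models := []int{50, 400}
	for replicas := 1; replicas <= 3; replicas++ {
		fmt.Printf("replicas=%d total=%d MB\n", replicas, fleetMemoryMB(replicas, models))
	}
}
```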

If GPU is ever introduced, capacity provisioning becomes a separate decision tracked by an ADR; today the answer is “more CPU replicas”. TODO(@law): confirm the Railway plan tier sustains the memory ceiling needed for both face models (RetinaFace mv1_0.25 + LVFace-B Glint360K, see apps/round/Dockerfile) loaded simultaneously — the plan/tier is not declared in apps/round/railway.json.

Common operations

Restart the service

Railway dashboard → beef-round → Restart. Round handles SIGTERM gracefully:

Health flips to NOT_SERVING immediately.
gRPC server stops accepting new streams; in-flight RPCs are given up to 30 s (shutdownTimeout).
HTTP server gets a 10 s grace on Shutdown.

Inspect live RPCs

Run grpcurl -plaintext round:8080 list to enumerate the services (reflection is enabled), then call Health/Check, ListModels, or Infer with a small payload.

Tail logs

Filter axiom for service:round. Useful queries:

service:round msg:"Inference failed" — surfaces INTERNAL errors with model_id.
service:round msg:"Resource usage snapshot" — periodic monitoring lines, every 30 s.
service:round level:warn msg:"High file descriptor usage detected" — FD pressure.

Disaster scenarios

Scenario	First action	Escalation
Deploy stuck — healthcheck never returns 200	Check Railway build logs for model-download failures or `Failed to load models`.	Roll back to previous deployment.
All `Infer` calls returning `INTERNAL`	Check axiom for `Inference failed` and ONNX init errors. Confirm model files are present in `MODEL_CACHE_DIR`.	Roll back; restart with verified model URLs.
OOMKilled loop	Inspect `Resource usage snapshot` for memory trend before the kill.	Bump Railway memory tier; consider disabling the optional `face-embedding` model (takes effect only after a restart).
Caller reports `UNAVAILABLE` storms	Confirm replica count and that `MaxConcurrentStreams` (100) is not the bottleneck under load.	Add a replica; coordinate keepalive defaults with sirloin / brain clients.
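
For the "confirm model files are present" check in the table, a quick validation can catch the same two failure modes the Dockerfile guards against: 0-byte files and Git LFS pointer stubs. This is a sketch with hypothetical helper names (checkModelBytes, checkModelFile); it relies only on the fact that a Git LFS pointer file begins with the line "version https://git-lfs.github.com/spec/v1". Pass the file paths under MODEL_CACHE_DIR as arguments.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
)

// lfsPointerPrefix is the first line of a Git LFS pointer file.
var lfsPointerPrefix = []byte("version https://git-lfs.github.com/spec/v1")

// checkModelBytes rejects empty files and LFS pointer stubs, given the
// file size and its leading bytes.
func checkModelBytes(size int64, head []byte) error {
	if size == 0 {
		return errors.New("file is empty (0 bytes)")
	}
	if bytes.HasPrefix(head, lfsPointerPrefix) {
		return errors.New("file is a Git LFS pointer stub, not model data")
	}
	return nil
}

// checkModelFile reads just enough of path to apply checkModelBytes.
func checkModelFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return err
	}
	head := make([]byte, len(lfsPointerPrefix))
	n, err := io.ReadFull(f, head)
	// Short or empty files are expected here; only real I/O errors abort.
	if err != nil && !errors.Is(err, io.EOF) && !errors.Is(err, io.ErrUnexpectedEOF) {
		return err
	}
	return checkModelBytes(info.Size(), head[:n])
}

func main() {
	for _, p := range os.Args[1:] {
		if err := checkModelFile(p); err != nil {
			fmt.Printf("%s: BAD: %v\n", p, err)
			continue
		}
		fmt.Printf("%s: ok\n", p)
	}
}
```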

See services/round-oncall for alert thresholds and paging routes.