Round Errors

Round returns errors via gRPC status codes only — never inside InferResponse.output. This page enumerates the codes, the conditions that produce them, and the recovery action expected from callers and on-call.

Status code map

Code	When	Source	Caller action
`INVALID_ARGUMENT`	Malformed request — empty `model_id`, missing input, oversized text or image, invalid base64.	`services/inference.go::Infer`	Fix the client. Do not retry.
`NOT_FOUND`	Health Watch for an unknown service name.	`services/health.go`	Check service name; mostly a developer error.
`INTERNAL`	Anything raised below the validator: registry lookup miss, ONNX session error, OOM, model output decode failure.	`services/inference.go` (catch-all `inference failed: %v`)	Retry with backoff (idempotent). Page on-call if sustained.
`UNAVAILABLE`	Returned by the gRPC stack when the listener is draining or rejecting due to keepalive enforcement.	grpc-go server	Retry with backoff and jitter; respect Railway drain.
`RESOURCE_EXHAUSTED`	Request message exceeds the server's `MaxRecvMsgSize` (15 MiB). grpc-go rejects the oversized frame before the handler runs.	grpc-go server	Trim payload before retry.
`DEADLINE_EXCEEDED`	Caller deadline elapsed mid-inference (round itself does not impose per-RPC deadlines today).	grpc-go server	Increase deadline or shrink input.

The catch-all wrapping at apps/round/internal/app/services/inference.go is intentionally broad — adding finer-grained codes (NOT_FOUND for unknown model_id, RESOURCE_EXHAUSTED for OOM) is tracked under operational hardening but is not implemented yet. TODO(@law): coordinate with sirloin/brain caller owners before introducing finer-grained codes — today every non-validation failure surfaces as INTERNAL, so callers retry on it.
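The status code map can be collapsed into a small caller-side lookup. The following is a minimal sketch, not round or sirloin code: `Action` and `callerAction` are hypothetical names, and the lookup is keyed on the code names as spelled on this page. A real client would more likely switch on the value returned by grpc-go's `status.Code(err)`.

```go
package main

import "fmt"

// Action is a hypothetical caller-side decision derived from the status code map.
type Action int

const (
	FixClient      Action = iota // do not retry; the request itself is wrong
	Retry                        // retry with backoff and jitter
	ShrinkAndRetry               // trim the payload or raise the deadline first
)

// callerAction maps a status code name, as spelled on this page, to an action.
// Unknown codes fall back to FixClient so they are never retried blindly;
// that default is an assumption of this sketch, not documented behaviour.
func callerAction(code string) Action {
	switch code {
	case "INTERNAL", "UNAVAILABLE":
		return Retry
	case "RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED":
		return ShrinkAndRetry
	default: // INVALID_ARGUMENT, NOT_FOUND, anything unexpected
		return FixClient
	}
}

func main() {
	for _, c := range []string{"INVALID_ARGUMENT", "INTERNAL", "RESOURCE_EXHAUSTED"} {
		fmt.Println(c, callerAction(c))
	}
}
```

Because every non-validation failure currently surfaces as INTERNAL, the `Retry` branch also covers cases such as an unknown model_id where a retry cannot succeed; the retry cap in the caller checklist bounds that cost.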

Failure scenarios

Model not loaded

Symptom. The caller sends a model_id that is not in the registry. registry.Infer returns an error, and the handler wraps it as INTERNAL: inference failed: model not found: <id>.

Why. The registry is populated once at boot by cmd/app/main.go::loadModels. Models are not hot-loaded.

The embeddings and face-detection models are required: failure to load them is fatal at boot (logger.Fatal).
The face-embedding model (LVFace) is optional: if its ONNX file is missing, loadModels logs a warning and continues. Calls to model_id=face-embedding will then return INTERNAL until the file is present and the service is restarted.

Recovery. Verify the file is at MODEL_CACHE_DIR/lvface/LVFace-B_Glint360K.onnx (or its overridden URL is reachable), then restart the deployment. See the rollback steps in services/round-runbook.

Out-of-memory / resource exhaustion

Symptom. The container restarts with OOMKilled. Health flips to NOT_SERVING. Callers see UNAVAILABLE until Railway brings the replica back.

Causes.

ONNX session memory + heap exceeding the Railway memory limit. Each model holds its weights in RAM continuously.
File-descriptor exhaustion — the resource monitor (internal/pkg/monitoring) warns at 80 % FD usage and at >1000 goroutines, both visible in axiom.
Bursts of large image_base64 payloads up to 15 MiB on the wire.

Recovery. Bump the Railway memory ceiling, then investigate via the resource snapshot logs (Resource usage snapshot lines from monitoring.logResourceUsage) covering heap_alloc_mb, goroutines, and fd_usage_percent. See services/round-oncall for thresholds.
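For orientation, the snapshot fields and warning thresholds mentioned above can be approximated with the Go runtime alone. This is a rough sketch, not the code in internal/pkg/monitoring: the `snapshot` type and its methods are hypothetical, FD usage is passed in because measuring it is platform-specific, and treating the 80 % threshold as inclusive is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshot holds the three fields named in the resource snapshot logs.
type snapshot struct {
	HeapAllocMB    float64
	Goroutines     int
	FDUsagePercent float64
}

// takeSnapshot reads heap and goroutine figures from the Go runtime.
// FD usage is supplied by the caller in this sketch.
func takeSnapshot(fdUsagePercent float64) snapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return snapshot{
		HeapAllocMB:    float64(m.HeapAlloc) / (1024 * 1024),
		Goroutines:     runtime.NumGoroutine(),
		FDUsagePercent: fdUsagePercent,
	}
}

// warnings applies the thresholds quoted above: 80 % FD usage (assumed
// inclusive) and more than 1000 goroutines.
func (s snapshot) warnings() []string {
	var out []string
	if s.FDUsagePercent >= 80 {
		out = append(out, "fd_usage_percent at or above 80")
	}
	if s.Goroutines > 1000 {
		out = append(out, "goroutines above 1000")
	}
	return out
}

func main() {
	s := takeSnapshot(12.5)
	fmt.Printf("heap_alloc_mb=%.1f goroutines=%d fd_usage_percent=%.1f warnings=%v\n",
		s.HeapAllocMB, s.Goroutines, s.FDUsagePercent, s.warnings())
}
```

Note that heap figures from the Go runtime do not include memory held by ONNX sessions outside the Go heap, so a low heap_alloc_mb does not rule out an approaching OOM kill.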

Malformed input

Symptom. INVALID_ARGUMENT returned synchronously. Common subcases:

Message	Trigger
`model_id is required`	Empty `model_id`.
`either text or image_base64 input is required`	Neither input field set.
`text input exceeds maximum length of N bytes`	`len(text) > MAX_TEXT_LENGTH`.
`invalid base64 encoding: <err>`	`base64.StdEncoding.DecodeString` failed — caller sent URL-safe base64 or junk.
`decoded image exceeds maximum size of N bytes`	Decoded bytes > `MAX_BINARY_SIZE`.

These are hot paths during integration work. Validate at the caller (sirloin / brain) when possible to keep network and CPU cost off round.
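A caller-side pre-check that mirrors the messages in the table might look like the sketch below. The function name `validateInferInput` is hypothetical, the order of checks is an assumption, and the limits are parameters because the actual MAX_TEXT_LENGTH and MAX_BINARY_SIZE values are not stated on this page.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// validateInferInput mirrors round's documented validation messages on the
// caller side. maxText and maxBinary must be set to match round's
// MAX_TEXT_LENGTH and MAX_BINARY_SIZE for the check to be meaningful.
func validateInferInput(modelID, text, imageBase64 string, maxText, maxBinary int) error {
	if modelID == "" {
		return errors.New("model_id is required")
	}
	if text == "" && imageBase64 == "" {
		return errors.New("either text or image_base64 input is required")
	}
	if len(text) > maxText {
		return fmt.Errorf("text input exceeds maximum length of %d bytes", maxText)
	}
	if imageBase64 != "" {
		// Standard alphabet only: URL-safe base64 is rejected, as on the server.
		raw, err := base64.StdEncoding.DecodeString(imageBase64)
		if err != nil {
			return fmt.Errorf("invalid base64 encoding: %w", err)
		}
		if len(raw) > maxBinary {
			return fmt.Errorf("decoded image exceeds maximum size of %d bytes", maxBinary)
		}
	}
	return nil
}

func main() {
	// "-_-_" uses the URL-safe alphabet, which StdEncoding does not accept.
	fmt.Println(validateInferInput("face-embedding", "", "-_-_", 1024, 1024))
	fmt.Println(validateInferInput("face-embedding", "hello", "", 1024, 1024))
}
```

Decoding the image locally costs the caller one extra pass over the payload, which is usually cheaper than a round trip that ends in INVALID_ARGUMENT.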

GPU errors

Not applicable — round has no CUDA path today. Any future GPU-specific failures (CUDA_ERROR_OUT_OF_MEMORY, missing libcudart) will need to be added here when the GPU build lands. See services/round-env for the current CPU-only posture.

Panics inside handlers

The recoveryInterceptor (see internal/app/server/interceptors.go) converts panics into INTERNAL and logs the stack at error level. The process keeps serving (panics do not bring the listener down), but a sustained spike usually means a model is producing un-decodable output, and on-call should be paged.

Logging

Inference logs are structured (zerolog) and include:

model_id
has_text, has_binary
output_length on success
err on failure

Search axiom with service:round level:error and pivot on model_id to scope an incident to a single model.

Caller checklist

When wiring sirloin or brain to round:

Validate inputs locally before the RPC. INVALID_ARGUMENT should never escape your service to a user.
Treat INTERNAL and UNAVAILABLE as retryable. Use exponential backoff with jitter; cap retries at 3 inside a request and shed load if all fail.
Wrap Infer with a per-call deadline (1–5 s for embeddings, 5–10 s for face-detection).
On every error, log both the model_id you sent and the round-side metadata.model_id. model_id is the only correlation key when round logs are sampled.

Mapping to alerts

The on-call surface in services/round-oncall watches:

error_rate of INTERNAL over a 5-minute window (model not loaded, ONNX errors).
error_rate of UNAVAILABLE (drain, restart loops).
INVALID_ARGUMENT is intentionally not alerted — it is caller hygiene.