Round API
Round exposes a single gRPC service, round.v1.RoundService, plus a small HTTP surface for liveness on the same listener. There is no public REST surface — only sirloin and brain talk to round, and they both use gRPC.
The full proto definition lives at proto/round/v1/round.proto, with generated Go bindings at apps/round/internal/pkg/pb/round/v1/. The generated reference is also rendered under Generated References → Round v1.
Listener
Round serves gRPC and the HTTP health endpoint on the same port using h2c multiplexing (see apps/round/internal/app/server/server.go).
| Concern | Value |
|---|---|
| Bind | HOST:GRPC_PORT (defaults 0.0.0.0:8080) |
| Protocol | gRPC over HTTP/2 (h2c, plain text) — TLS terminated at the platform edge |
| HTTP path | /health (HTTP/1.1 OK / NOT OK string body, see services.HealthService.HTTPHandler) |
| gRPC reflection | enabled (reflection.Register) — used by grpcui / grpcurl |
| gRPC health | grpc.health.v1.Health/Check and Watch registered |
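Serving both protocols on one port comes down to a per-request routing decision. The following is a minimal sketch of that decision, not the code in server.go: the names isGRPC, splitHandler and route are hypothetical. It assumes gRPC calls are recognised by HTTP/2 plus an application/grpc content type, which is the usual convention. In a real h2c setup a handler like this is typically wrapped with h2c.NewHandler from golang.org/x/net/http2/h2c; whether round does exactly that is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isGRPC reports whether a request looks like a gRPC call: HTTP/2 with an
// application/grpc content type. Anything else is treated as plain HTTP.
func isGRPC(r *http.Request) bool {
	return r.ProtoMajor == 2 &&
		strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc")
}

// splitHandler sends gRPC traffic to grpcH and everything else to httpH.
func splitHandler(grpcH, httpH http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isGRPC(r) {
			grpcH.ServeHTTP(w, r)
			return
		}
		httpH.ServeHTTP(w, r)
	})
}

// route is a helper for experimenting: it reports which side ("grpc" or
// "http") handles a request with the given protocol version and content type.
func route(protoMajor int, contentType string) string {
	tag := func(name string) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprint(w, name)
		})
	}
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	req.ProtoMajor = protoMajor
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	splitHandler(tag("grpc"), tag("http")).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(route(2, "application/grpc+proto")) // gRPC call
	fmt.Println(route(1, "text/plain"))             // e.g. GET /health probe
}
```

Both conditions matter: an HTTP/1.1 request that happens to carry a gRPC content type still falls through to the HTTP side, which keeps /health reachable for simple probes.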
Resource limits applied at the gRPC server (see server.go):
- MaxConcurrentStreams = 100
- MaxRecvMsgSize = 15 MiB, MaxSendMsgSize = 15 MiB (accounts for base64 overhead on top of MAX_BINARY_SIZE)
- Connection timeout 10 s, keepalive every 2 min, keepalive timeout 20 s, min keepalive 1 min
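To make the base64 overhead concrete: standard base64 turns every 3 raw bytes into 4 encoded bytes, so a 15 MiB message can carry at most 11.25 MiB of raw image data. The sketch below works that out with the standard library. It ignores protobuf framing and the other request fields, and the helper names maxRawBytes and fits are hypothetical. The actual MAX_BINARY_SIZE value is not stated on this page.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

const mib = 1 << 20

// maxRawBytes returns the largest raw payload whose padded standard base64
// encoding fits in limit bytes. Each full group of 4 encoded bytes carries
// 3 raw bytes.
func maxRawBytes(limit int) int {
	return limit / 4 * 3
}

// fits reports whether raw bytes of payload encode to at most limit bytes.
func fits(raw, limit int) bool {
	return base64.StdEncoding.EncodedLen(raw) <= limit
}

func main() {
	limit := 15 * mib
	raw := maxRawBytes(limit)
	fmt.Printf("max raw payload: %d bytes (%.2f MiB)\n", raw, float64(raw)/mib)
	fmt.Println("fits exactly:", fits(raw, limit))
	fmt.Println("one more byte fits:", fits(raw+1, limit))
}
```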
RPCs
Infer(InferRequest) returns (InferResponse)
Runs inference for a model selected by model_id. Input is mutually exclusive: callers set either text or image_base64, never both. options is a free-form JSON string forwarded to the model.
Model selection. model_id is matched against the in-memory registry populated at startup (apps/round/cmd/app/main.go::loadModels). Today the registered IDs are:
- embeddings: text → 384-dim float vector (BAAI/bge-small-en-v1.5)
- face-detection: image → bounding-box JSON (RetinaFace)
- face-embedding: image → 512-dim float vector (LVFace-B_Glint360K); optional, only registered if the LVFace ONNX file is present at startup
Use ListModels to discover the live set rather than hard-coding IDs.
Request shape.
```proto
message InferRequest {
  string model_id = 1;        // e.g. "embeddings", "face-detection"
  oneof input {
    string text = 2;
    string image_base64 = 3;  // JPEG or PNG, base64 standard encoding
  }
  string options = 4;         // optional, JSON, model-specific
}
```

Response shape. output is a model-specific JSON string; metadata is a JSON object with at least the model ID and version, used for observability (see services/inference.go).

```proto
message InferResponse {
  string output = 1;    // embeddings: JSON array of floats; face: JSON with boxes
  string metadata = 2;  // JSON, e.g. {"model_id":"embeddings","version":"1.0"}
}
```

Validation order (all INVALID_ARGUMENT, see round-errors):
- model_id non-empty.
- Either text or image_base64 non-empty.
- len(text) <= MAX_TEXT_LENGTH.
- image_base64 decodes as standard base64.
- Decoded bytes <= MAX_BINARY_SIZE.
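The checks above can be sketched as a single function that returns on the first failure. This is an illustration of the ordering only, not the code in services/inference.go: the Limits type, the validate function and the error values are hypothetical, and it returns plain Go errors where the real service returns INVALID_ARGUMENT statuses.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// Limits holds stand-ins for MAX_TEXT_LENGTH and MAX_BINARY_SIZE.
type Limits struct {
	MaxTextLength int
	MaxBinarySize int
}

var (
	errNoModel   = errors.New("model_id is required")
	errNoInput   = errors.New("either text or image_base64 is required")
	errTextLong  = errors.New("text exceeds MAX_TEXT_LENGTH")
	errBadBase64 = errors.New("image_base64 is not valid standard base64")
	errBinaryBig = errors.New("decoded image exceeds MAX_BINARY_SIZE")
)

// validate applies the checks in the documented order and stops at the
// first one that fails.
func validate(modelID, text, imageB64 string, lim Limits) error {
	if modelID == "" {
		return errNoModel
	}
	if text == "" && imageB64 == "" {
		return errNoInput
	}
	if len(text) > lim.MaxTextLength {
		return errTextLong
	}
	if imageB64 != "" {
		raw, err := base64.StdEncoding.DecodeString(imageB64)
		if err != nil {
			return errBadBase64
		}
		if len(raw) > lim.MaxBinarySize {
			return errBinaryBig
		}
	}
	return nil
}

func main() {
	// Illustrative limits only; the real values come from round's environment.
	lim := Limits{MaxTextLength: 8192, MaxBinarySize: 10 << 20}
	fmt.Println(validate("embeddings", "hello", "", lim))
	fmt.Println(validate("", "hello", "", lim))
	fmt.Println(validate("face-detection", "", "not base64!", lim))
}
```

Because the size limit applies to the decoded bytes, the decode step has to run before the size check, which is why those two checks come last.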
Then the request is dispatched through registry.Infer(ctx, model_id, input) and any failure from the model layer is wrapped as INTERNAL (inference failed: %v).
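On the caller side, a successful embeddings response still needs its output string decoded. A minimal sketch, assuming output is a flat JSON array of numbers as described in the response shape; decodeEmbedding is a hypothetical helper, not part of the generated client:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeEmbedding parses an InferResponse.output string for the embeddings
// model and checks that the vector has the expected dimension.
func decodeEmbedding(output string, wantDim int) ([]float32, error) {
	var vec []float32
	if err := json.Unmarshal([]byte(output), &vec); err != nil {
		return nil, fmt.Errorf("decode embedding: %w", err)
	}
	if len(vec) != wantDim {
		return nil, fmt.Errorf("embedding has %d dims, want %d", len(vec), wantDim)
	}
	return vec, nil
}

func main() {
	// A 3-dim toy vector; the embeddings model documented here returns 384 dims.
	vec, err := decodeEmbedding("[0.25, -0.5, 1]", 3)
	fmt.Println(vec, err)

	_, err = decodeEmbedding("[0.25, -0.5, 1]", 384)
	fmt.Println(err)
}
```

Checking the dimension at the call site catches a model swap early, before a vector of the wrong size reaches a downstream index.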
ListModels(ListModelsRequest) returns (ListModelsResponse)
Returns one ModelInfo per registered model (model_id, name, description, input_type, output_type, version). Used by callers that want to enumerate capabilities at boot. The empty request message is reserved for future pagination.
grpc.health.v1.Health
Both overall ("") and per-service ("round.v1.RoundService") statuses are managed by services.HealthService. They flip to SERVING only after loadModels completes, and to NOT_SERVING immediately on shutdown so load balancers stop routing during the drain window.
The /health HTTP endpoint mirrors the overall serving status — Railway uses it as the platform healthcheck (apps/round/railway.json → healthcheckPath: /health, timeout 120 s).
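The status lifecycle can be pictured as one atomic flag that the HTTP handler reads on every probe. The sketch below is not services.HealthService: the Health type and probe helper are hypothetical, and the 200 and 503 status codes are an assumption (this page only specifies the OK / NOT OK body).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Health tracks a single overall serving flag. The zero value is not serving,
// which matches a process that is still loading models.
type Health struct {
	serving atomic.Bool
}

// SetServing flips the flag; call with true once startup completes and with
// false as soon as shutdown begins.
func (h *Health) SetServing(ok bool) { h.serving.Store(ok) }

// ServeHTTP writes "OK" or "NOT OK". The status codes are assumed.
func (h *Health) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if h.serving.Load() {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "OK")
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
	fmt.Fprint(w, "NOT OK")
}

// probe issues an in-process GET /health and returns the status and body.
func probe(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	h := &Health{}
	fmt.Println(probe(h)) // before models are loaded

	h.SetServing(true)
	fmt.Println(probe(h)) // after startup completes

	h.SetServing(false)
	fmt.Println(probe(h)) // during the drain window
}
```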
Latency expectations
Treat these as soft targets, validated locally with make bench-embeddings and via the round_* p95 traces in axiom. TODO(@law): codify formal SLOs — none are checked into the repo today (no SLO file under docs/src/content/docs/services/ for round, no alert thresholds in apps/round/railway.json beyond healthcheckTimeout: 120).
| RPC / model | p50 | p95 | Notes |
|---|---|---|---|
| Infer / embeddings (≤512 tokens) | ~15 ms | ~40 ms | CPU, single ONNX session, batch=1 |
| Infer / face-detection (≤1 MiB JPEG) | ~50 ms | ~150 ms | CPU, RetinaFace mv1_0.25 |
| Infer / face-embedding (after detection crop) | ~30 ms | ~80 ms | CPU, optional model |
| ListModels | <1 ms | <5 ms | in-memory registry |
| Health.Check | <1 ms | <5 ms | atomic status read |
Round runs CPU-only. There is no GPU code path today (no CUDA execution provider in internal/pkg/onnxrt).
Sequence
```mermaid
sequenceDiagram
    participant C as Caller (sirloin or brain)
    participant R as Round gRPC
    participant V as Validator
    participant Reg as Registry
    participant M as ONNX model
    C->>R: Infer(model_id, text|image_base64, options)
    R->>V: validate shape and limits
    V-->>R: ok or InvalidArgument
    R->>Reg: lookup model by id
    Reg-->>R: ok or NotFound (Internal today)
    R->>M: model.Infer(ctx, input)
    M-->>R: output, metadata
    R-->>C: InferResponse
```

Caller wiring
- sirloin consumes round via apps/sirloin/internal/pkg/pb/round/. The host comes from SIRLOIN_ROUND_HOST (default round:8080).
- brain consumes round via its generated client. The host comes from BRAIN_ROUND_HOST (default round:8080). See services/brain for the brain side.
- Both clients use plaintext gRPC inside the Railway private network.
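Host resolution on the client side can be sketched as an environment lookup with a fallback. The roundHost helper below is hypothetical and is not taken from sirloin or brain; treating an empty variable the same as an unset one is an assumption. Passing the lookup function in makes the fallback easy to exercise without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// defaultRoundHost is the documented default for both callers.
const defaultRoundHost = "round:8080"

// roundHost returns the value of the environment variable named key, or the
// default when the variable is unset or empty.
func roundHost(lookup func(string) (string, bool), key string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return defaultRoundHost
}

func main() {
	fmt.Println(roundHost(os.LookupEnv, "SIRLOIN_ROUND_HOST"))
	fmt.Println(roundHost(os.LookupEnv, "BRAIN_ROUND_HOST"))
}
```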
See services/round-env for full host and port details, and services/round-errors for the gRPC status codes callers need to handle.