Round API

Round exposes a single gRPC service, round.v1.RoundService, plus a small HTTP surface for liveness on the same listener. There is no public REST surface — only sirloin and brain talk to round, and they both use gRPC.

The full proto definition lives at proto/round/v1/round.proto, with generated Go bindings at apps/round/internal/pkg/pb/round/v1/. The generated reference is also rendered under Generated References → Round v1.

Listener

Round serves gRPC and the HTTP health endpoint on the same port using h2c multiplexing (see apps/round/internal/app/server/server.go).

Concern	Value
Bind	`HOST:GRPC_PORT` (defaults `0.0.0.0:8080`)
Protocol	gRPC over HTTP/2 (h2c, plain text) — TLS terminated at the platform edge
HTTP path	`/health` (plain HTTP/1.1; the body is the string OK or NOT OK, see `services.HealthService.HTTPHandler`)
gRPC reflection	enabled (`reflection.Register`) — used by `grpcui` / `grpcurl`
gRPC health	`grpc.health.v1.Health/Check` and `Watch` registered
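
Serving gRPC and plain HTTP on one port usually comes down to a dispatch on protocol version and content type. The following is a minimal sketch of that routing decision using only the standard library; it is not the code in server.go. `muxHandler` and `route` are hypothetical names, and the two backends are stubs. In a real server the gRPC side would typically be a `*grpc.Server` (which implements `http.Handler`) and the combined handler would be wrapped with `h2c.NewHandler` from golang.org/x/net/http2/h2c.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// muxHandler (hypothetical) sends HTTP/2 requests with a gRPC content type to
// grpcHandler and everything else, such as GET /health, to httpHandler.
func muxHandler(grpcHandler, httpHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		isGRPC := r.ProtoMajor == 2 &&
			strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc")
		if isGRPC {
			grpcHandler.ServeHTTP(w, r)
			return
		}
		httpHandler.ServeHTTP(w, r)
	})
}

// route reports which stub backend ("grpc" or "http") handled a request with
// the given protocol major version and Content-Type header.
func route(protoMajor int, contentType string) string {
	stub := func(name string) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprint(w, name)
		})
	}
	req := httptest.NewRequest(http.MethodPost, "/", nil)
	req.ProtoMajor = protoMajor
	req.Header.Set("Content-Type", contentType)
	rec := httptest.NewRecorder()
	muxHandler(stub("grpc"), stub("http")).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(route(2, "application/grpc")) // handled by the gRPC stub
	fmt.Println(route(1, "text/plain"))       // handled by the HTTP stub
}
```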

Resource limits applied at the gRPC server (see server.go):

MaxConcurrentStreams = 100
MaxRecvMsgSize = 15 MiB, MaxSendMsgSize = 15 MiB (accounts for base64 overhead on top of MAX_BINARY_SIZE)
Connection timeout 10 s; keepalive ping every 2 min with a 20 s ack timeout; minimum interval accepted between client keepalive pings 1 min
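
The relationship between the 15 MiB message limit and base64 overhead can be checked with a few lines of arithmetic. The sketch below is illustrative only: the value of MAX_BINARY_SIZE is not stated on this page, so `assumedMaxBinary` is a made-up 10 MiB figure, and the calculation ignores protobuf framing and the other request fields.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// grpcMsgLimit mirrors the 15 MiB MaxRecvMsgSize / MaxSendMsgSize listed above.
const grpcMsgLimit = 15 << 20

// encodedSize returns the length of the padded standard base64 encoding of n
// raw bytes (4 output characters for every 3 input bytes, rounded up).
func encodedSize(n int) int {
	return base64.StdEncoding.EncodedLen(n)
}

// fitsLimit reports whether n raw bytes still fit under grpcMsgLimit once
// base64-encoded. Protobuf framing and other fields are not counted.
func fitsLimit(n int) bool {
	return encodedSize(n) <= grpcMsgLimit
}

func main() {
	// ASSUMPTION: 10 MiB is an illustrative value, not the real MAX_BINARY_SIZE.
	const assumedMaxBinary = 10 << 20
	fmt.Println(encodedSize(assumedMaxBinary), fitsLimit(assumedMaxBinary)) // 13981016 true
}
```

Under this arithmetic the largest raw payload whose encoding fits in exactly 15 MiB is 11.25 MiB, so any MAX_BINARY_SIZE at or below that leaves no risk of a valid image being rejected by the transport limit before validation runs.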

RPCs

`Infer(InferRequest) returns (InferResponse)`

Runs inference for a model selected by model_id. Input is mutually exclusive: callers set either text or image_base64, never both. options is a free-form JSON string forwarded to the model.

Model selection. model_id is matched against the in-memory registry populated at startup (apps/round/cmd/app/main.go::loadModels). Today the registered IDs are:

embeddings — text → 384-dim float vector (BAAI/bge-small-en-v1.5)
face-detection — image → bounding-box JSON (RetinaFace)
face-embedding — image → 512-dim float vector (LVFace-B_Glint360K), optional, only registered if the LVFace ONNX file is present at startup

Use ListModels to discover the live set rather than hard-coding IDs.
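
To make the registry behaviour concrete, here is a minimal sketch of an in-memory registry with one optional model. It is not the code in main.go: the `model` interface, `registry` type, `stubModel`, and `loadModelsSketch` are hypothetical stand-ins, and the boolean flag replaces the real check for the LVFace ONNX file.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
)

// model is a hypothetical stand-in for the interface Round's models implement.
type model interface {
	Infer(ctx context.Context, input string) (string, error)
}

var errUnknownModel = errors.New("unknown model")

// registry maps model IDs to loaded models. It is filled once at startup.
type registry struct{ models map[string]model }

func (r *registry) register(id string, m model) { r.models[id] = m }

// Infer looks the model up by ID and delegates to it.
func (r *registry) Infer(ctx context.Context, id, input string) (string, error) {
	m, ok := r.models[id]
	if !ok {
		return "", fmt.Errorf("%w: %q", errUnknownModel, id)
	}
	return m.Infer(ctx, input)
}

// ids returns the registered model IDs in sorted order.
func (r *registry) ids() []string {
	out := make([]string, 0, len(r.models))
	for id := range r.models {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

// stubModel echoes its input, prefixed with its name, instead of running ONNX.
type stubModel struct{ name string }

func (s stubModel) Infer(_ context.Context, input string) (string, error) {
	return s.name + ":" + input, nil
}

// loadModelsSketch registers the two mandatory models and, only when
// faceEmbeddingPresent is true, the optional face-embedding model.
func loadModelsSketch(faceEmbeddingPresent bool) *registry {
	r := &registry{models: map[string]model{}}
	r.register("embeddings", stubModel{"embeddings"})
	r.register("face-detection", stubModel{"face-detection"})
	if faceEmbeddingPresent {
		r.register("face-embedding", stubModel{"face-embedding"})
	}
	return r
}

// infer is a convenience wrapper that supplies a background context.
func infer(r *registry, id, input string) (string, error) {
	return r.Infer(context.Background(), id, input)
}

func main() {
	fmt.Println(loadModelsSketch(false).ids()) // [embeddings face-detection]
	fmt.Println(loadModelsSketch(true).ids())  // [embeddings face-detection face-embedding]
	_, err := infer(loadModelsSketch(false), "face-embedding", "x")
	fmt.Println(err != nil) // true
}
```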

Request shape.

message InferRequest {
  string model_id = 1;       // e.g. "embeddings", "face-detection"
  oneof input {
    string text         = 2;
    string image_base64 = 3; // JPEG or PNG, base64 standard encoding
  }
  string options = 4;        // optional, JSON, model-specific
}
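
As an illustration of how a caller might populate these fields, the sketch below builds an image request using standard base64 and a JSON-encoded options string. `inferRequest` is a hypothetical stand-in for the generated `InferRequest` type (the real one models the input as a oneof), and the option key `example_key` is invented for the example; valid option keys are model-specific and not listed here.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// inferRequest is a hypothetical, flattened stand-in for the generated type.
type inferRequest struct {
	ModelID     string
	Text        string // set for text models only
	ImageBase64 string // set for image models only
	Options     string // JSON document, empty when no options are given
}

// newImageRequest encodes the raw image with standard (padded) base64 and
// serialises opts to a JSON string. Text is left empty because the two
// inputs are mutually exclusive.
func newImageRequest(modelID string, image []byte, opts map[string]any) (inferRequest, error) {
	req := inferRequest{
		ModelID:     modelID,
		ImageBase64: base64.StdEncoding.EncodeToString(image),
	}
	if len(opts) > 0 {
		raw, err := json.Marshal(opts)
		if err != nil {
			return inferRequest{}, fmt.Errorf("encode options: %w", err)
		}
		req.Options = string(raw)
	}
	return req, nil
}

func main() {
	jpegMagic := []byte{0xFF, 0xD8, 0xFF} // first bytes of a JPEG file
	// ASSUMPTION: "example_key" is a placeholder, not a documented option.
	req, err := newImageRequest("face-detection", jpegMagic, map[string]any{"example_key": 1})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.ImageBase64) // /9j/
	fmt.Println(req.Options)     // {"example_key":1}
}
```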

Response shape. output is a model-specific JSON string; metadata is a JSON object with at least the model ID and version, used for observability (see services/inference.go).

message InferResponse {
  string output   = 1; // embeddings: JSON array of floats; face: JSON with boxes
  string metadata = 2; // JSON, e.g. {"model_id":"embeddings","version":"1.0"}
}
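
Because both fields are JSON carried inside strings, callers decode them a second time after the gRPC call returns. A minimal sketch for the embeddings case follows; `parseEmbedding` and the `metadata` struct are hypothetical helpers, and the struct only covers the two keys shown in the comment above. The metadata object may carry more keys, which this decoder ignores.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadata covers the two keys the response is documented to include.
type metadata struct {
	ModelID string `json:"model_id"`
	Version string `json:"version"`
}

// parseEmbedding decodes output as a JSON array of floats and meta as a JSON
// object, then checks the vector length against wantDim (384 for embeddings).
func parseEmbedding(output, meta string, wantDim int) ([]float32, metadata, error) {
	var vec []float32
	if err := json.Unmarshal([]byte(output), &vec); err != nil {
		return nil, metadata{}, fmt.Errorf("decode output: %w", err)
	}
	if len(vec) != wantDim {
		return nil, metadata{}, fmt.Errorf("got %d dimensions, want %d", len(vec), wantDim)
	}
	var md metadata
	if err := json.Unmarshal([]byte(meta), &md); err != nil {
		return nil, metadata{}, fmt.Errorf("decode metadata: %w", err)
	}
	return vec, md, nil
}

func main() {
	// A 3-element vector keeps the example short; real embeddings have 384.
	vec, md, err := parseEmbedding(`[0.5,-1,2]`, `{"model_id":"embeddings","version":"1.0"}`, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vec), md.ModelID, md.Version) // 3 embeddings 1.0
}
```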

Validation order (all INVALID_ARGUMENT — see round-errors):

model_id non-empty.
Either text or image_base64 non-empty.
len(text) <= MAX_TEXT_LENGTH.
image_base64 decodes as standard base64.
Decoded bytes <= MAX_BINARY_SIZE.

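Once validation passes, the request is dispatched through registry.Infer(ctx, model_id, input). Any failure from the model layer is wrapped as INTERNAL with the message format "inference failed: %v". Today this includes an unknown model_id, which surfaces as INTERNAL rather than NOT_FOUND.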
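
The validation order above can be sketched as a single function that returns on the first failing check. This is an illustration under stated assumptions, not the code in services/inference.go: `inferInput`, `limits`, and `validate` are hypothetical names, the limits are passed in because their values are not given on this page, and a plain Go error stands in for the INVALID_ARGUMENT gRPC status.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// inferInput is a hypothetical, flattened view of the request fields.
type inferInput struct {
	ModelID     string
	Text        string
	ImageBase64 string
}

// limits carries MAX_TEXT_LENGTH and MAX_BINARY_SIZE; values are caller-supplied.
type limits struct {
	MaxTextLength int
	MaxBinarySize int
}

// errInvalidArgument stands in for the INVALID_ARGUMENT status code.
var errInvalidArgument = errors.New("invalid argument")

// validate applies the five checks in the documented order and stops at the
// first failure.
func validate(in inferInput, lim limits) error {
	if in.ModelID == "" {
		return fmt.Errorf("%w: model_id is required", errInvalidArgument)
	}
	if in.Text == "" && in.ImageBase64 == "" {
		return fmt.Errorf("%w: text or image_base64 is required", errInvalidArgument)
	}
	if len(in.Text) > lim.MaxTextLength {
		return fmt.Errorf("%w: text too long", errInvalidArgument)
	}
	if in.ImageBase64 != "" {
		raw, err := base64.StdEncoding.DecodeString(in.ImageBase64)
		if err != nil {
			return fmt.Errorf("%w: image_base64 is not standard base64", errInvalidArgument)
		}
		if len(raw) > lim.MaxBinarySize {
			return fmt.Errorf("%w: image too large", errInvalidArgument)
		}
	}
	return nil
}

func main() {
	lim := limits{MaxTextLength: 8, MaxBinarySize: 2} // tiny limits for the demo
	fmt.Println(validate(inferInput{ModelID: "embeddings", Text: "hello"}, lim))
	fmt.Println(validate(inferInput{Text: "hello"}, lim))
	fmt.Println(validate(inferInput{ModelID: "face-detection", ImageBase64: "/9j/"}, lim))
}
```

Note that the size check runs on the decoded bytes, so a payload that is within MAX_BINARY_SIZE is never rejected because of its larger base64 representation.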

`ListModels(ListModelsRequest) returns (ListModelsResponse)`

Returns one ModelInfo per registered model (model_id, name, description, input_type, output_type, version). Used by callers that want to enumerate capabilities at boot. The empty request message is reserved for future pagination.
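
A boot-time capability check built on this RPC might look like the sketch below. `modelInfo` is a hypothetical, trimmed stand-in for the generated `ModelInfo` message, and `missingModels` is an invented helper; the slice passed in would come from a real `ListModels` call.

```go
package main

import "fmt"

// modelInfo is a trimmed stand-in for the generated ModelInfo message.
type modelInfo struct {
	ModelID string
	Version string
}

// missingModels returns the required IDs that are absent from available,
// in the order they were requested. It returns nil when nothing is missing.
func missingModels(available []modelInfo, required ...string) []string {
	have := make(map[string]bool, len(available))
	for _, m := range available {
		have[m.ModelID] = true
	}
	var missing []string
	for _, id := range required {
		if !have[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Pretend ListModels returned only the two always-registered models.
	live := []modelInfo{{ModelID: "embeddings"}, {ModelID: "face-detection"}}
	fmt.Println(missingModels(live, "embeddings", "face-embedding")) // [face-embedding]
}
```

A caller that depends on the optional face-embedding model can use a check like this to disable the dependent feature instead of failing on the first `Infer` call.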

`grpc.health.v1.Health`

Both overall ("") and per-service ("round.v1.RoundService") statuses are managed by services.HealthService. They flip to SERVING only after loadModels completes, and to NOT_SERVING immediately on shutdown so load balancers stop routing during the drain window.

The /health HTTP endpoint mirrors the overall serving status — Railway uses it as the platform healthcheck (apps/round/railway.json → healthcheckPath: /health, timeout 120 s).
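
The serving lifecycle and its HTTP mirror can be sketched with an atomic flag. This is not the implementation of services.HealthService: `healthState` and `probe` are hypothetical, and the 200 / 503 status codes are an assumption, since this page only documents the OK / NOT OK body.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthState holds the overall serving flag. The zero value is not serving,
// which matches the state before model loading completes.
type healthState struct{ serving atomic.Bool }

// SetServing flips the flag: true after models load, false on shutdown.
func (h *healthState) SetServing(ok bool) { h.serving.Store(ok) }

// HTTPHandler mirrors the flag as an OK / NOT OK text body.
func (h *healthState) HTTPHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if h.serving.Load() {
			w.WriteHeader(http.StatusOK)
			fmt.Fprint(w, "OK")
			return
		}
		// ASSUMPTION: 503 is a guess; only the body text is documented.
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, "NOT OK")
	})
}

// probe issues an in-memory GET /health and returns the status and body.
func probe(h *healthState) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	h.HTTPHandler().ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	h := &healthState{}
	fmt.Println(probe(h)) // 503 NOT OK (models still loading)
	h.SetServing(true)
	fmt.Println(probe(h)) // 200 OK
	h.SetServing(false)
	fmt.Println(probe(h)) // 503 NOT OK (draining)
}
```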

Latency expectations

Treat these as soft targets, validated locally with make bench-embeddings and via the round_* p95 traces in axiom. TODO(@law): codify formal SLOs — none are checked into the repo today (no SLO file under docs/src/content/docs/services/ for round, no alert thresholds in apps/round/railway.json beyond healthcheckTimeout: 120).

RPC / model	p50	p95	Notes
`Infer` / `embeddings` (≤512 tokens)	~15 ms	~40 ms	CPU, single ONNX session, batch=1
`Infer` / `face-detection` (≤1 MiB JPEG)	~50 ms	~150 ms	CPU, RetinaFace mv1_0.25
`Infer` / `face-embedding` (after detection crop)	~30 ms	~80 ms	CPU, optional model
`ListModels`	<1 ms	<5 ms	in-memory registry
`Health.Check`	<1 ms	<5 ms	atomic status read

Round runs CPU-only. There is no GPU code path today (no CUDA execution provider in internal/pkg/onnxrt).

Sequence

sequenceDiagram
  participant C as Caller (sirloin or brain)
  participant R as Round gRPC
  participant V as Validator
  participant Reg as Registry
  participant M as ONNX model
  C->>R: Infer(model_id, text|image_base64, options)
  R->>V: validate shape and limits
  V-->>R: ok or InvalidArgument
  R->>Reg: lookup model by id
  Reg-->>R: ok or NotFound (Internal today)
  R->>M: model.Infer(ctx, input)
  M-->>R: output, metadata
  R-->>C: InferResponse

Caller wiring

sirloin consumes round via apps/sirloin/internal/pkg/pb/round/. Host comes from SIRLOIN_ROUND_HOST (default round:8080).
brain consumes round via its generated client. Host comes from BRAIN_ROUND_HOST (default round:8080). See services/brain for the brain side.
Both clients use plaintext gRPC inside the Railway private network.
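
Host resolution on the caller side amounts to an environment lookup with a fallback. The sketch below shows one way to write it; `roundHostFrom` is a hypothetical helper, and treating an empty variable the same as an unset one is an assumption about the callers' behaviour. The returned address would then be handed to a gRPC client configured without transport security, matching the plaintext setup described above.

```go
package main

import (
	"fmt"
	"os"
)

// defaultRoundHost is the documented default for both callers.
const defaultRoundHost = "round:8080"

// roundHostFrom resolves the round address from envVar using the supplied
// lookup function, falling back to defaultRoundHost. Taking the lookup as a
// parameter keeps the function testable without touching the process env.
func roundHostFrom(lookup func(string) (string, bool), envVar string) string {
	// ASSUMPTION: an empty value is treated like an unset variable.
	if v, ok := lookup(envVar); ok && v != "" {
		return v
	}
	return defaultRoundHost
}

func main() {
	fmt.Println(roundHostFrom(os.LookupEnv, "SIRLOIN_ROUND_HOST"))
	fmt.Println(roundHostFrom(os.LookupEnv, "BRAIN_ROUND_HOST"))
}
```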

See services/round-env for full host and port details, and services/round-errors for the gRPC status codes callers need to handle.