Round API

Round exposes a single gRPC service, round.v1.RoundService, plus a small HTTP surface for liveness on the same listener. There is no public REST surface — only sirloin and brain talk to round, and they both use gRPC.

The full proto definition lives at proto/round/v1/round.proto, with generated Go bindings at apps/round/internal/pkg/pb/round/v1/. The generated reference is also rendered under Generated References → Round v1.

Listener

Round serves gRPC and the HTTP health endpoint on the same port using h2c multiplexing (see apps/round/internal/app/server/server.go).

| Concern | Value |
| --- | --- |
| Bind | HOST:GRPC_PORT (defaults to 0.0.0.0:8080) |
| Protocol | gRPC over HTTP/2 (h2c, plaintext); TLS is terminated at the platform edge |
| HTTP path | /health (HTTP/1.1, OK / NOT OK string body; see services.HealthService.HTTPHandler) |
| gRPC reflection | enabled (reflection.Register); used by grpcui / grpcurl |
| gRPC health | grpc.health.v1.Health/Check and Watch registered |

Resource limits applied at the gRPC server (see server.go):

  • MaxConcurrentStreams = 100
  • MaxRecvMsgSize = 15 MiB, MaxSendMsgSize = 15 MiB (accounts for base64 overhead on top of MAX_BINARY_SIZE)
  • Connection timeout 10 s, keepalive every 2 min, keepalive timeout 20 s, min keepalive 1 min
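The 15 MiB message limits exist because standard base64 emits 4 output bytes for every 3 input bytes. The sketch below works through that arithmetic. It assumes a hypothetical MAX_BINARY_SIZE of 10 MiB, which is not stated on this page, and the helper names are illustrative rather than taken from server.go.

```go
package main

import "encoding/base64"

const (
	mib = 1 << 20
	// maxMsgSize mirrors the MaxRecvMsgSize / MaxSendMsgSize limit above.
	maxMsgSize = 15 * mib
	// assumedMaxBinarySize is a hypothetical MAX_BINARY_SIZE, for illustration only.
	assumedMaxBinarySize = 10 * mib
)

// encodedSize returns the length of n raw bytes after standard (padded)
// base64 encoding.
func encodedSize(n int) int {
	return base64.StdEncoding.EncodedLen(n)
}

// fitsMessageLimit reports whether n raw bytes, once base64-encoded, stay
// within the gRPC message limit. It ignores the small protobuf framing
// overhead and the other request fields.
func fitsMessageLimit(n int) bool {
	return encodedSize(n) <= maxMsgSize
}
```

Under that assumption a 10 MiB image encodes to 13,981,016 bytes (about 13.3 MiB), which fits under the limit, while a 12 MiB image would not.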

RPCs

Infer(InferRequest) returns (InferResponse)

Runs inference for a model selected by model_id. Input is mutually exclusive: callers set either text or image_base64, never both. options is a free-form JSON string forwarded to the model.

Model selection. model_id is matched against the in-memory registry populated at startup (apps/round/cmd/app/main.go::loadModels). Today the registered IDs are:

  • embeddings — text → 384-dim float vector (BAAI/bge-small-en-v1.5)
  • face-detection — image → bounding-box JSON (RetinaFace)
  • face-embedding — image → 512-dim float vector (LVFace-B_Glint360K), optional, only registered if the LVFace ONNX file is present at startup

Use ListModels to discover the live set rather than hard-coding IDs.

Request shape.

message InferRequest {
  string model_id = 1;         // e.g. "embeddings", "face-detection"
  oneof input {
    string text = 2;
    string image_base64 = 3;   // JPEG or PNG, base64 standard encoding
  }
  string options = 4;          // optional, JSON, model-specific
}
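As a rough illustration of how a caller might prepare an image request, the sketch below encodes raw bytes with standard base64 and serializes options to a JSON string. The inferInput struct is a hypothetical stand-in for the generated InferRequest type, and newImageInput is an illustrative helper, not part of the bindings.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
)

// inferInput is a hypothetical stand-in for the generated InferRequest type;
// the real bindings live under apps/round/internal/pkg/pb/round/v1/.
type inferInput struct {
	ModelID     string
	Text        string // set for text models
	ImageBase64 string // set for image models, never together with Text
	Options     string // JSON string, may be empty
}

// newImageInput builds an image request. Raw JPEG or PNG bytes are encoded
// with standard (padded) base64, and opts is serialized to a JSON string.
// Text is left empty so that only one input is ever set.
func newImageInput(modelID string, img []byte, opts map[string]any) (inferInput, error) {
	in := inferInput{
		ModelID:     modelID,
		ImageBase64: base64.StdEncoding.EncodeToString(img),
	}
	if len(opts) > 0 {
		b, err := json.Marshal(opts)
		if err != nil {
			return inferInput{}, err
		}
		in.Options = string(b)
	}
	return in, nil
}
```

The option keys are model-specific and not documented here, so any keys passed in opts are the caller's responsibility.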

Response shape. output is a model-specific JSON string; metadata is a JSON object with at least the model ID and version, used for observability (see services/inference.go).

message InferResponse {
  string output = 1;     // embeddings: JSON array of floats; face: JSON with boxes
  string metadata = 2;   // JSON, e.g. {"model_id":"embeddings","version":"1.0"}
}
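Because both fields are JSON strings, callers decode them a second time after the protobuf layer. The sketch below does this for an embeddings response. It assumes the metadata keys match the example above (model_id, version); the type and function names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inferMetadata holds the two metadata fields shown in the example above.
type inferMetadata struct {
	ModelID string `json:"model_id"`
	Version string `json:"version"`
}

// decodeEmbedding parses an embeddings output (a JSON array of floats) and
// its metadata, and checks the vector length against wantDim
// (384 for the embeddings model).
func decodeEmbedding(output, metadata string, wantDim int) ([]float32, inferMetadata, error) {
	var vec []float32
	if err := json.Unmarshal([]byte(output), &vec); err != nil {
		return nil, inferMetadata{}, fmt.Errorf("decode output: %w", err)
	}
	if len(vec) != wantDim {
		return nil, inferMetadata{}, fmt.Errorf("got %d dims, want %d", len(vec), wantDim)
	}
	var md inferMetadata
	if err := json.Unmarshal([]byte(metadata), &md); err != nil {
		return nil, inferMetadata{}, fmt.Errorf("decode metadata: %w", err)
	}
	return vec, md, nil
}
```

Checking the dimension up front catches a mismatched model_id early, before the vector reaches an index that expects a fixed width.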

Validation runs in this order; every failure returns INVALID_ARGUMENT (see services/round-errors):

  1. model_id non-empty.
  2. Either text or image_base64 non-empty.
  3. len(text) <= MAX_TEXT_LENGTH.
  4. image_base64 decodes as standard base64.
  5. Decoded bytes <= MAX_BINARY_SIZE.

Then the request is dispatched through registry.Infer(ctx, model_id, input) and any failure from the model layer is wrapped as INTERNAL (inference failed: %v).
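The validation order can be sketched as a single function that returns the first failure. This is an illustration of the documented order, not the code in services/inference.go. The limits struct and its values are hypothetical stand-ins for MAX_TEXT_LENGTH and MAX_BINARY_SIZE, and the error messages are invented.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// limits stands in for the service configuration. The caller supplies the
// values because the real MAX_TEXT_LENGTH and MAX_BINARY_SIZE settings are
// not stated on this page.
type limits struct {
	MaxTextLength int
	MaxBinarySize int
}

// validateInfer applies the five checks in the documented order and returns
// the first failure. The real service maps each failure to INVALID_ARGUMENT.
func validateInfer(lim limits, modelID, text, imageBase64 string) error {
	if modelID == "" {
		return errors.New("model_id is required")
	}
	if text == "" && imageBase64 == "" {
		return errors.New("text or image_base64 is required")
	}
	if len(text) > lim.MaxTextLength {
		return fmt.Errorf("text is %d bytes, limit is %d", len(text), lim.MaxTextLength)
	}
	if imageBase64 == "" {
		return nil // text request, nothing left to check
	}
	raw, err := base64.StdEncoding.DecodeString(imageBase64)
	if err != nil {
		return fmt.Errorf("image_base64 is not standard base64: %w", err)
	}
	if len(raw) > lim.MaxBinarySize {
		return fmt.Errorf("image is %d bytes, limit is %d", len(raw), lim.MaxBinarySize)
	}
	return nil
}
```

The size check runs on the decoded bytes, so a payload can pass the gRPC message limit and still be rejected here.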

ListModels(ListModelsRequest) returns (ListModelsResponse)

Returns one ModelInfo per registered model (model_id, name, description, input_type, output_type, version). Used by callers that want to enumerate capabilities at boot. The empty request message is reserved for future pagination.
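A caller that enumerates capabilities at boot might index the result by model_id, so optional models such as face-embedding can be detected without hard-coding. The sketch below assumes a hypothetical modelInfo struct that mirrors the listed fields. It is not the generated ModelInfo type.

```go
package main

// modelInfo is a hypothetical stand-in for the generated ModelInfo message,
// mirroring the fields listed above.
type modelInfo struct {
	ModelID     string
	Name        string
	Description string
	InputType   string
	OutputType  string
	Version     string
}

// indexModels keys a ListModels result by model_id so callers can check
// whether a model is registered before sending Infer requests to it.
func indexModels(models []modelInfo) map[string]modelInfo {
	idx := make(map[string]modelInfo, len(models))
	for _, m := range models {
		idx[m.ModelID] = m
	}
	return idx
}
```

With the index in hand, a lookup such as `_, ok := idx["face-embedding"]` tells the caller whether the optional model was loaded on this instance.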

grpc.health.v1.Health

Both overall ("") and per-service ("round.v1.RoundService") statuses are managed by services.HealthService. They flip to SERVING only after loadModels completes, and to NOT_SERVING immediately on shutdown so load balancers stop routing during the drain window.

The /health HTTP endpoint mirrors the overall serving status. Railway uses it as the platform healthcheck (apps/round/railway.json sets healthcheckPath: /health, with a 120 s timeout).
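For ad hoc checks outside Railway, a probe along the following lines could read the same endpoint. This is a minimal sketch. It inspects only the body, because the HTTP status codes returned by /health are not documented here, and both function names are illustrative.

```go
package main

import (
	"context"
	"io"
	"net/http"
	"strings"
)

// isServing interprets the /health body: "OK" means serving, anything else
// (including "NOT OK") means not serving.
func isServing(body string) bool {
	return strings.TrimSpace(body) == "OK"
}

// probeHealth sends GET <baseURL>/health and reports the serving status.
func probeHealth(ctx context.Context, client *http.Client, baseURL string) (bool, error) {
	url := strings.TrimRight(baseURL, "/") + "/health"
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	// The body is a short string, so cap the read defensively.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 64))
	if err != nil {
		return false, err
	}
	return isServing(string(body)), nil
}
```

A call such as `probeHealth(ctx, http.DefaultClient, "http://round:8080")` would use the default private-network address. Pass a context with a deadline so the probe cannot hang during a drain.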

Latency expectations

Treat these as soft targets, validated locally with make bench-embeddings and via the round_* p95 traces in axiom. TODO(@law): codify formal SLOs — none are checked into the repo today (no SLO file under docs/src/content/docs/services/ for round, no alert thresholds in apps/round/railway.json beyond healthcheckTimeout: 120).

| RPC / model | p50 | p95 | Notes |
| --- | --- | --- | --- |
| Infer / embeddings (≤512 tokens) | ~15 ms | ~40 ms | CPU, single ONNX session, batch=1 |
| Infer / face-detection (≤1 MiB JPEG) | ~50 ms | ~150 ms | CPU, RetinaFace mv1_0.25 |
| Infer / face-embedding (after detection crop) | ~30 ms | ~80 ms | CPU, optional model |
| ListModels | <1 ms | <5 ms | in-memory registry |
| Health.Check | <1 ms | <5 ms | atomic status read |

Round runs CPU-only. There is no GPU code path today (no CUDA execution provider in internal/pkg/onnxrt).

Sequence

sequenceDiagram
    participant C as Caller (sirloin or brain)
    participant R as Round gRPC
    participant V as Validator
    participant Reg as Registry
    participant M as ONNX model
    C->>R: Infer(model_id, text|image_base64, options)
    R->>V: validate shape and limits
    V-->>R: ok or InvalidArgument
    R->>Reg: lookup model by id
    Reg-->>R: ok or NotFound (Internal today)
    R->>M: model.Infer(ctx, input)
    M-->>R: output, metadata
    R-->>C: InferResponse

Caller wiring

  • sirloin consumes round via apps/sirloin/internal/pkg/pb/round/. Host comes from SIRLOIN_ROUND_HOST (default round:8080).
  • brain consumes round via its generated client. Host comes from BRAIN_ROUND_HOST (default round:8080). See services/brain for the brain side.
  • Both clients use plaintext gRPC inside the Railway private network.
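The host lookup described in the first two bullets can be sketched as follows. The helper name is illustrative, and treating an empty value the same as an unset one is an assumption rather than documented behaviour.

```go
package main

import "os"

// defaultRoundHost is the documented default for both callers.
const defaultRoundHost = "round:8080"

// roundHost returns the value of the given environment variable
// (SIRLOIN_ROUND_HOST or BRAIN_ROUND_HOST), falling back to the default
// when the variable is unset or empty (the empty case is an assumption).
func roundHost(envKey string) string {
	if v, ok := os.LookupEnv(envKey); ok && v != "" {
		return v
	}
	return defaultRoundHost
}
```

Resolving the address once at startup, for example `roundHost("SIRLOIN_ROUND_HOST")`, keeps the dial target in one place.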

See services/round-env for full host and port details, and services/round-errors for the gRPC status codes callers need to handle.