Railway Integration Playbook

Overview

Railway is our PaaS for runtime services: every long-running beef-* service runs on Railway. Hosted staging and production run in a single Railway region, us-east4-eqdc4a (US East), colocated with the US production Neon path. Railway private networking is used only between Railway services in the same project; Neon is reached through its normal connection strings (see Neon integration and gRPC mesh).

This page is the integration playbook — how Railway is wired into the codebase, how deploys flow, where secrets live, and what breaks. For the day-to-day operational reference (manifest mirroring, healthcheck table, clone-to-prod checklist), see operations/railway.md.

MCP access at the time of writing: the Railway CLI MCP tool returned Invalid or expired Railway token for both check-railway-status and list-projects. Live project/service IDs and per-service variable inventories are therefore reconstructed from apps/*/railway.json, the .github/workflows/neon-branching.yml workflow, and the existing operations notes. TODO(@law): re-run the MCP queries after refreshing the Railway API token and reconcile any drift.

Project & service map

All services live in the beef Railway project under the Foxy workspace (per operations/railway.md). Each service is wired to a checked-in railway.json so build/deploy config is reviewable in Git.

flowchart LR
  subgraph Railway["Railway project: beef (workspace: Foxy)"]
    subgraph public["Public-facing"]
      brisket["beef-brisket<br/>Next.js 16<br/>/api/health"]
      fennec["beef-fennec<br/>Vite SPA<br/>/"]
      strip["beef-strip<br/>Go Fiber SSR<br/>/health"]
      flank["beef-flank<br/>Next.js (workflows UI)<br/>/sign-in"]
    end
    subgraph internal["Internal (private network)"]
      sirloin["beef-sirloin<br/>Go gRPC+REST<br/>/health on gRPC port"]
      brain["beef-brain<br/>NestJS<br/>/health"]
      round["beef-round<br/>Go ML inference<br/>/health on gRPC port"]
    end
  end

  brisket -->|REST| sirloin
  fennec -->|REST| sirloin
  strip -->|gRPC| sirloin
  flank -->|gRPC| sirloin
  sirloin -->|gRPC| brain
  sirloin -->|gRPC| round
  brain -->|gRPC| round
  sirloin -->|gRPC: FlankExecutionService| flank

  Neon[("Neon Postgres<br/>US")]
  sirloin -.-> Neon
  brain -.-> Neon

  classDef pub fill:#e8f5e8,stroke:#2d7d2d
  classDef int fill:#e8f0ff,stroke:#2d4d8c
  class brisket,fennec,strip,flank pub
  class sirloin,brain,round int

Seven beef-* services have a checked-in railway.json: brain, brisket, fennec, flank, round, sirloin, strip. Other services (chuck — Strapi CMS; shank — email templates) are not Railway runtime services as of the current commit. TODO(@law): confirm the chuck hosting target.
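The manifest inventory above can be regenerated from Git rather than maintained by hand. The following is a minimal sketch, assuming each manifest sits at apps/<service>/railway.json and uses Railway's config-as-code keys build.watchPatterns, deploy.healthcheckPath, and deploy.restartPolicyMaxRetries. The function names extract_fields and summarize_manifests are hypothetical and do not exist in the repo.

```python
import json
from pathlib import Path


def extract_fields(config: dict) -> dict:
    """Pull the fields this page tracks out of one parsed railway.json.

    Missing keys are reported as None rather than guessed.
    """
    build = config.get("build") or {}
    deploy = config.get("deploy") or {}
    return {
        "watchPatterns": build.get("watchPatterns"),
        "healthcheckPath": deploy.get("healthcheckPath"),
        "restartPolicyMaxRetries": deploy.get("restartPolicyMaxRetries"),
    }


def summarize_manifests(repo_root: str) -> dict[str, dict]:
    """Map each apps/<name> directory to its extracted manifest fields."""
    summary = {}
    for manifest in sorted(Path(repo_root).glob("apps/*/railway.json")):
        config = json.loads(manifest.read_text(encoding="utf-8"))
        summary[manifest.parent.name] = extract_fields(config)
    return summary


if __name__ == "__main__":
    # Run from the repository root to print one line per service.
    for name, fields in summarize_manifests(".").items():
        print(name, fields)
```

Comparing this output against the tables on this page is a cheap way to spot drift between the docs and the checked-in manifests.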

Per-service deploy mode

Railway is the deploy executor for runtime services. The .github/workflows/*.yml per-service workflows in this repo run lint/typecheck/test only — none of them invokes railway up or pushes a container image. Deployments are triggered by Railway’s repo watch on the configured branch, gated by watchPatterns in each railway.json.

Service	Trigger	Branch tracked	`watchPatterns`	Healthcheck path	Replicas
`beef-brain`	Railway watch on push	`main`*	`/apps/brain/**`	`/health`	1
`beef-brisket`	Railway watch on push	`main`*	`/apps/brisket/**`	`/api/health`	1
`beef-fennec`	Railway watch on push	`main`*	`[]` (rebuild on every push)	`/`	1
`beef-flank`	Railway watch on push	`main`*	`/apps/flank/**`	`/sign-in`	1
`beef-round`	Railway watch on push	`main`*	`/apps/round/**`	`/health` (on gRPC)	1
`beef-sirloin`	Railway watch on push	`main`*	`/apps/sirloin/**`, `/docs/src/content/docs/**`, `/scripts/docs-build-kb-export.mjs`	`/health` (on gRPC)	1
`beef-strip`	Railway watch on push	`main`*	`/apps/strip/**`	`/health`	1

* Railway’s tracked branch is configured per service in the Railway UI, not in railway.json. The repository’s two long-lived branches are main and release (per .github/workflows/*.yml branches: lists). TODO(@law): confirm whether prod tracks release and staging tracks main, or whether both environments track main.
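To reason about which pushes trigger which deploys, the watchPatterns gate can be approximated locally. This is a rough sketch, not Railway's matcher: Railway evaluates gitignore-style patterns, while the code below only handles the two shapes used in the table (a trailing /** and a literal path). The helper names matches and should_deploy are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, path: str) -> bool:
    """Approximate one watch pattern against a repo-relative path."""
    pattern = pattern.lstrip("/")
    path = path.lstrip("/")
    if pattern.endswith("/**"):
        # "dir/**" covers everything below dir/ (keep the trailing slash
        # so "apps/brain/**" does not match "apps/brainstorm/...").
        return path.startswith(pattern[:-2])
    # Fallback: shell-style match. Note that "*" here also crosses "/",
    # which is looser than gitignore semantics.
    return fnmatchcase(path, pattern)


def should_deploy(watch_patterns: list[str], changed_files: list[str]) -> bool:
    """An empty pattern list means every push triggers a rebuild."""
    if not watch_patterns:
        return True
    return any(matches(p, f) for p in watch_patterns for f in changed_files)
```

For example, should_deploy(["/apps/brain/**"], ["apps/brisket/page.tsx"]) is False, while an empty list such as the one on beef-fennec always returns True.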

CI workflows that do not deploy: brain.yml, brisket.yml, fennec.yml, flank.yml, sirloin.yml, docs-quality.yml, pr-description-check.yml, ai-hygiene.yml, up-to-date-check.yml. The sirloin.yml workflow has a commented-out docker-build job that would push to ECR; it is currently dead (.github/workflows/sirloin.yml:112 is commented out). TODO(@law): decide whether to delete or revive the dead docker-build job.

The only workflow that talks to Railway directly is .github/workflows/neon-branching.yml (see the Environments section).

Environments

Three environment classes:

Class	Railway environment	Branch	Neon branch	Notes
Production	`production`	likely `release`	Neon `production`	Customer traffic.
Staging	`staging`	likely `main`	Neon `staging`	Pre-prod soak.
Preview	`beef-pr-<PR_NUMBER>`	PR head	`preview/pr-<N>-<sanitized-branch>`	Created/destroyed by `neon-branching.yml`. 14-day Neon TTL.

TODO(@law): confirm production tracks release and staging tracks main in the Railway UI.
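The preview naming scheme in the table can be illustrated with a small helper. This is a sketch under assumptions: the exact sanitization rule applied by neon-branching.yml is not documented here, so the rule below (lowercase, collapse every run of non-alphanumeric characters into a hyphen) is a guess, and the function names are hypothetical. Only the beef-pr-<N> and preview/pr-<N>-<branch> shapes and the 14-day TTL come from this page.

```python
import re
from datetime import datetime, timedelta, timezone

NEON_TTL = timedelta(days=14)  # 14-day Neon TTL from the table above


def sanitize_branch(branch: str) -> str:
    # ASSUMED rule: lowercase, then collapse anything outside [a-z0-9]
    # into a single "-". The real workflow may differ.
    return re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")


def preview_names(pr_number: int, branch: str) -> tuple[str, str]:
    """Return (Railway environment name, Neon branch name) for a PR."""
    railway_env = f"beef-pr-{pr_number}"
    neon_branch = f"preview/pr-{pr_number}-{sanitize_branch(branch)}"
    return railway_env, neon_branch


def neon_expiry(created_at: datetime) -> datetime:
    """When a preview branch created at `created_at` would expire."""
    return created_at + NEON_TTL


if __name__ == "__main__":
    print(preview_names(123, "feat/Add_Login"))
    print(neon_expiry(datetime.now(timezone.utc)).isoformat())
```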

Per-app stages are normalized by the *_STAGE env var (SIRLOIN_STAGE, BRAIN_STAGE, BRISKET_STAGE, STRIP_STAGE, FLANK_STAGE) — values production | staging | sandbox | development (see deployment-env standard). NODE_ENV stays reserved for framework runtime mode.

Preview environment lifecycle

Driven entirely by .github/workflows/neon-branching.yml:

PR opened/synchronized → workflow installs the Railway CLI, links it to the project identified by RAILWAY_PROJECT_ID (environment staging), then waits up to 5 min for the Railway preview env beef-pr-<N> to materialize.
Creates a Neon branch preview/pr-<N>-<branch> with a 14-day expiry.
Rewrites database variables on the preview env (with retries):
- beef-sirloin: SIRLOIN_DATABASE_URL, SIRLOIN_DATABASE_POOLED_URL → Neon rump
- beef-brain: DATABASE_URL, DIRECT_DATABASE_URL → Neon fennec (schema fennec)
Disables OTEL on the preview env by deleting the gate vars (BRAIN_OTEL_URL on brain, SIRLOIN_AXIOM_TOKEN on sirloin) so preview logs/traces don’t pollute the staging Axiom dataset.
Verifies via railway variable list --json that no variable still resolves to railway.internal.
PR closed → Neon branch deleted (workflow closed action).

This is the single entry point that mutates Railway state from CI. Any other variable changes are made by hand in the Railway UI.
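The verification step in the lifecycle can be sketched as a check over the CLI's JSON output. Two assumptions are baked in and should be confirmed against the workflow: that railway variable list --json emits a flat object of name to value, and that the check is scoped to the rewritten database variables (inter-service gRPC variables legitimately point at *.railway.internal). The function name and the hostnames in the usage example are placeholders.

```python
import json


def internal_references(variables_json: str, names: list[str]) -> list[str]:
    """Return the checked variable names whose value still contains
    a railway.internal hostname.

    ASSUMES the JSON is a flat {"NAME": "value"} object.
    """
    variables = json.loads(variables_json)
    return sorted(
        name
        for name in names
        if "railway.internal" in str(variables.get(name, ""))
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    payload = json.dumps({
        "SIRLOIN_DATABASE_URL": "postgres://u:p@postgres.railway.internal:5432/db",
        "SIRLOIN_DATABASE_POOLED_URL": "postgres://u:p@db.example.invalid/rump",
    })
    stale = internal_references(
        payload, ["SIRLOIN_DATABASE_URL", "SIRLOIN_DATABASE_POOLED_URL"]
    )
    print(stale)  # ['SIRLOIN_DATABASE_URL']
```

An empty result means every checked variable has been rewritten; a non-empty one names the variables to retry or fix by hand.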

Secrets management

Railway env vars are the canonical secret store for runtime services (per security-model §Secrets management). The shape is documented in .env.example. There is currently no Doppler / AWS Secrets Manager / Vault layer — Railway’s per-environment vars are the source of truth.

Variable scoping:

Per-service vars — most secrets, e.g. SIRLOIN_BRAIN_API_KEY, SIRLOIN_PRIMER_WEBHOOK_SECRET, BRISKET_CLERK_ENCRYPTION_KEY.
Shared variables — Clerk keys (CLERK_PUBLISHABLE_KEY, CLERK_SECRET_KEY) used by brain/flank/fennec/strip; FLANK_ENCRYPTION_KEY used by flank+sirloin. TODO(@law): confirm whether Clerk keys and FLANK_ENCRYPTION_KEY use Railway shared variables or duplicated per-service vars.
Internal references — ${{shared.X}} and ${{Service.VAR}} reference syntax. None of the checked-in railway.json files contain reference expressions; usage is per-environment in the Railway UI. TODO(@law): audit current Railway variable references.
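For the reference audit, a small parser for the ${{namespace.VAR}} syntax can help list which variables depend on shared or cross-service values. This is a sketch: the regular expression assumes namespaces made of letters, digits, underscores, and hyphens (service names containing spaces would need a wider pattern), and find_references and audit_variables are hypothetical names.

```python
import re

# Matches ${{shared.X}} and ${{Service.VAR}}; whitespace inside the
# braces is tolerated.
REFERENCE = re.compile(r"\$\{\{\s*([A-Za-z0-9_-]+)\.([A-Za-z0-9_]+)\s*\}\}")


def find_references(value: str) -> list[tuple[str, str]]:
    """Return (namespace, variable) pairs referenced inside one value."""
    return [(m.group(1), m.group(2)) for m in REFERENCE.finditer(value)]


def audit_variables(variables: dict[str, str]) -> dict[str, list[tuple[str, str]]]:
    """Keep only the variables whose value contains a reference."""
    found = {name: find_references(value) for name, value in variables.items()}
    return {name: refs for name, refs in found.items() if refs}
```

Feeding audit_variables the exported variables of one environment yields a map from variable name to the references it contains, which is enough to answer the shared-versus-duplicated question for the Clerk keys.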

Rotation procedure (manual today):

Generate new value (per the secret’s spec — see flank’s secrets.ts for the secret-name registry, or .env.example for top-level shapes).
Update Railway var on production first; Railway redeploys the service.
Wait for healthcheck to flip green; verify from logs that the new value is in use.
Update staging to match (so promotion deltas don’t surface stale secrets).
For flank workflow secrets (encrypted in sirloin’s DB, AES-256 with FLANK_ENCRYPTION_KEY), use the in-app Secret store, not Railway env. Cache TTL is 5 minutes — wait that long before treating rotation as live.
Revoke the old value at the upstream provider.
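Two of the steps above lend themselves to a short sketch: generating a fresh 256-bit value and working out when a flank secret rotation can be treated as live. The hex encoding is an assumption (check the secret's spec before using it for FLANK_ENCRYPTION_KEY or any other key), and both function names are hypothetical.

```python
import secrets
from datetime import datetime, timedelta

FLANK_SECRET_CACHE_TTL = timedelta(minutes=5)  # cache TTL stated above


def new_256_bit_key_hex() -> str:
    """32 cryptographically random bytes, hex-encoded (64 characters).

    ASSUMPTION: hex is an acceptable encoding for the target secret.
    """
    return secrets.token_bytes(32).hex()


def rotation_live_at(rotated_at: datetime) -> datetime:
    """Earliest time at which cached copies of the old value have expired."""
    return rotated_at + FLANK_SECRET_CACHE_TTL


if __name__ == "__main__":
    print(len(new_256_bit_key_hex()))  # 64
    print(rotation_live_at(datetime.now()).isoformat())
```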

TODO(@law): automate Railway variable rotation from a runbook; currently rotations are done manually in the Railway UI.

Required manual auth secret on GitHub: RAILWAY_API_TOKEN (used by neon-branching.yml). Without it, preview environments stay pointed at Railway-Postgres clones rather than Neon branches.

Networking

Railway provides a private IPv6 network between services in the same project; cross-service gRPC traffic uses *.railway.internal hostnames on that mesh. Public traffic terminates at Railway’s edge with auto-issued TLS certificates.

Inter-service gRPC mesh (sirloin ↔ brain, sirloin ↔ round, brain → round, sirloin ↔ flank, strip → sirloin): traverses the private network. See gRPC mesh for the full topology and TLS posture.
beef-brain and beef-round set ipv6EgressEnabled: true so they can reach external IPv6-only endpoints (Neon proxy, model registries).
Healthcheck port discipline: for round and sirloin, /health is served on the same socket as gRPC (HTTP/2 cleartext + HTTP routing) so Railway’s required PORT variable stays aligned with GRPC_PORT / SIRLOIN_PORT. Splitting these would break Railway’s healthcheck.
Public domains: assigned per service in the Railway UI. TODO(@law): inventory public domains; they are not checked into Git.
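The healthcheck port discipline can be guarded with a startup or CI check. This is a minimal sketch, assuming the listener port variables named above (GRPC_PORT, SIRLOIN_PORT) are plain strings in the service environment; port_mismatches is a hypothetical helper.

```python
from collections.abc import Mapping


def port_mismatches(env: Mapping[str, str], listener_vars: list[str]) -> list[str]:
    """Return the variables that break alignment with Railway's PORT.

    If PORT itself is unset, that is reported as the only problem.
    """
    port = env.get("PORT")
    if not port:
        return ["PORT"]
    return [name for name in listener_vars if env.get(name) != port]


if __name__ == "__main__":
    env = {"PORT": "8080", "GRPC_PORT": "8080", "SIRLOIN_PORT": "9090"}
    print(port_mismatches(env, ["GRPC_PORT", "SIRLOIN_PORT"]))  # ['SIRLOIN_PORT']
```

An empty list means the gRPC listener and the Railway healthcheck target the same socket.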

Logs & metrics

Railway logs: accessible via railway logs --service <name> --environment <env> or the Railway UI. Retention follows the Railway plan default. TODO(@law): exact retention window.
Application telemetry: services emit OTEL traces/metrics directly to Axiom (SIRLOIN_AXIOM_TOKEN, BRAIN_OTEL_URL); Railway is not in the telemetry path.
Railway → Axiom log drain: no sign of one in the repo (a repo-wide grep for log_drain / RAILWAY_LOG_DRAIN returns nothing). That only shows nothing is checked into Git; a drain configured in the Railway UI would not appear in a grep. If platform events (deploy success/fail, OOM kills, healthcheck flaps) need to land in Axiom, configure a Railway log drain pointing at the Axiom HTTP ingest endpoint and add RAILWAY_LOG_DRAIN_TOKEN to the secret rotation list.

Failure modes

Mode	Detection	Mitigation
Deploy stuck in “Building” / “Deploying”	Railway UI deployment hung > 15 min; service still on previous version.	Cancel from UI, retry. If repeated: inspect build logs for OOM during `pnpm install` / Go `go build`; bump build resources or split Dockerfile stages.
Healthcheck fail loop	Railway dashboard shows “Crashed” / repeated restarts; restart counter climbs. `restartPolicyMaxRetries: 10` on every service.	Check `/health` route is reachable on `PORT`. For round/sirloin, confirm gRPC + HTTP share the listener. Roll back via “Redeploy” on prior commit.
Env var drift between envs	Preview behaving differently from staging on identical code.	Run `railway variable list --json --service X --environment staging` vs preview; reconcile. The `neon-branching.yml` retry-loop catches DB-URL drift specifically.
OOM kill	Process restart immediately after warm-up; `dmesg`-style OOM signal in logs.	Bump service memory in Railway UI. Local compose no longer encodes hosted memory sizing; use Railway service metrics and plan limits as the sizing source of truth.
Region outage (`us-east4-eqdc4a`)	Railway status page; all `beef-*` services flapping simultaneously.	No active failover: single region (every service pins `us-east4-eqdc4a` in `railway.json`). Mitigation is to wait; the DR plan is still open.
Preview env not created	`neon-branching.yml` “Wait for preview environment” step times out after 30 × 10 s polls.	Railway auto-creates preview envs from the PR; if the workflow is missing `RAILWAY_API_TOKEN` or the project’s PR-environments toggle is off, the env never appears.
Neon → Railway rewrite fails	Workflow logs `did not match the Neon URL written by this workflow` or `still points at Railway Postgres`.	Re-run the workflow; the retry helper retries 5×. If still failing, set the var manually in the Railway UI.
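For the env var drift row, the comparison between two environments can be sketched as a diff over two variable maps (for example, the parsed output of the variable list command for staging and for a preview env, assuming a flat name-to-value JSON object). The helper name variable_drift is hypothetical. It reports names only, so the result can be pasted into a ticket without leaking secret values.

```python
def variable_drift(
    staging: dict[str, str],
    preview: dict[str, str],
    expected_to_differ: frozenset[str] = frozenset(),
) -> dict[str, list[str]]:
    """Compare two environments and return variable names only.

    `expected_to_differ` holds names that are rewritten or deleted on
    purpose (database URLs, OTEL gate vars) and should not be flagged.
    """
    only_staging = staging.keys() - preview.keys() - expected_to_differ
    only_preview = preview.keys() - staging.keys() - expected_to_differ
    changed = {
        name
        for name in staging.keys() & preview.keys()
        if staging[name] != preview[name] and name not in expected_to_differ
    }
    return {
        "only_in_staging": sorted(only_staging),
        "only_in_preview": sorted(only_preview),
        "changed": sorted(changed),
    }
```

Passing the variables touched by neon-branching.yml as expected_to_differ keeps the report focused on unintended drift.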

Cost model

Railway billing is usage-based (compute + memory + egress + plan tier). This repo does not pin per-service compute/memory caps in railway.json (limitOverride: null everywhere) — Railway uses plan defaults.

TODO(@law) — fill in:

Plan tier (Pro? Team?).
Per-service resource ceilings (vCPU, memory).
Monthly run-rate split by service.
Network transfer to external Neon, Axiom, and S3-compatible endpoints. Railway and Neon are geographically colocated in US East here, but that is not a private same-network guarantee: same-region/zone paths may have minimal transfer cost, while cross-region or public proxy endpoints can still bill as egress.

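The docker-compose memory targets give a rough sense of the prod budget envelope: brisket 12G, sirloin 12G, brain 8G, round 8G, strip 4G, redis 2G, fennec 2G, flank 2G, totalling 50G across services. Railway production should be sized at or above these. The OOM kill row under Failure modes notes that local compose no longer encodes hosted memory sizing, so treat these figures as a historical baseline and confirm them against Railway service metrics.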
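The envelope can be recomputed directly from the listed per-service targets, which is worth doing whenever a target changes. The sketch below only restates the figures from the paragraph above; the constant and function names are hypothetical.

```python
# Per-service memory targets in G, as listed in the paragraph above.
MEMORY_TARGETS_G = {
    "brisket": 12,
    "sirloin": 12,
    "brain": 8,
    "round": 8,
    "strip": 4,
    "redis": 2,
    "fennec": 2,
    "flank": 2,
}


def total_memory_g(targets: dict[str, int] = MEMORY_TARGETS_G) -> int:
    """Sum the per-service targets into one envelope figure."""
    return sum(targets.values())


if __name__ == "__main__":
    print(total_memory_g())  # 50
```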

TODO(@law): define the Railway single-region DR plan.

Local mirror

For full-stack local dev, use the root docker-compose.yml. This brings up Postgres (pgvector/pgvector:pg17-trixie), brain, sirloin, round, brisket, fennec, flank, strip, and gRPC UI helpers — bound to localhost, no Railway in the loop. Per-app stage vars default to sandbox / development.

See the services overview for per-app dev quickstart commands. None of this is Railway-specific, but it is the standard way to reproduce an issue locally before deploying to a Railway preview.

Runbook hooks

operations/railway.md — checked-in service manifest discipline, healthcheck table, clone-to-prod checklist.
standards/deployment-env.md — env names, app stage vocabulary, preview database mapping rules.
standards/security-model.md — secrets policy, classification, rotation expectations.
integrations/neon.md — Neon project, branch naming, preview branch lifecycle (paired with this doc via neon-branching.yml).
integrations/grpc-mesh.md — private-network service-to-service contracts.
Per-service runbooks under services/*-runbook.md — use these for service-level paging, alert thresholds, and rollback procedures.

TODO(@law): confirm runbook coverage for round, strip, and fennec.