Railway Integration Playbook
Overview
Railway is our PaaS for runtime services — every
long-running beef-* service runs on Railway. Hosted staging and production
run in a single Railway region, us-east4-eqdc4a (US East), colocated
with the US production Neon path. Railway private networking is used only
between Railway services in the same project; Neon is reached through its
normal connection strings (see Neon integration and
gRPC mesh).
This page is the integration playbook — how Railway is wired into the
codebase, how deploys flow, where secrets live, and what breaks. For the
day-to-day operational reference (manifest mirroring, healthcheck table,
clone-to-prod checklist), see operations/railway.md.
MCP access at the time of writing: the Railway CLI MCP tool returned
“Invalid or expired Railway token” for both check-railway-status and list-projects. Live project/service IDs and per-service variable inventories are therefore reconstructed from apps/*/railway.json, the .github/workflows/neon-branching.yml workflow, and the existing operations notes. TODO(@law): re-run the MCP queries after refreshing the Railway API token and reconcile any drift.
Project & service map
All services live in the beef Railway project under the Foxy
workspace (per operations/railway.md). Each service is wired to a checked-in
railway.json so build/deploy config is reviewable in Git.
flowchart LR
  subgraph Railway["Railway project: beef (workspace: Foxy)"]
    subgraph public["Public-facing"]
      brisket["beef-brisket<br/>Next.js 16<br/>/api/health"]
      fennec["beef-fennec<br/>Vite SPA<br/>/"]
      strip["beef-strip<br/>Go Fiber SSR<br/>/health"]
      flank["beef-flank<br/>Next.js (workflows UI)<br/>/sign-in"]
    end
    subgraph internal["Internal (private network)"]
      sirloin["beef-sirloin<br/>Go gRPC+REST<br/>/health on gRPC port"]
      brain["beef-brain<br/>NestJS<br/>/health"]
      round["beef-round<br/>Go ML inference<br/>/health on gRPC port"]
    end
  end

  brisket -->|REST| sirloin
  fennec -->|REST| sirloin
  strip -->|gRPC| sirloin
  flank -->|gRPC| sirloin
  sirloin -->|gRPC| brain
  sirloin -->|gRPC| round
  brain -->|gRPC| round
  sirloin -->|gRPC: FlankExecutionService| flank

  Neon[("Neon Postgres<br/>US")]
  sirloin -.-> Neon
  brain -.-> Neon

  classDef pub fill:#e8f5e8,stroke:#2d7d2d
  classDef int fill:#e8f0ff,stroke:#2d4d8c
  class brisket,fennec,strip,flank pub
  class sirloin,brain,round int

Seven beef-* services have a checked-in railway.json:
brain, brisket, fennec, flank, round, sirloin, strip. Other
services (chuck — Strapi CMS; shank — email templates) are not Railway
runtime services as of the current commit. TODO(@law): confirm the chuck hosting target.
Per-service deploy mode
Railway is the deploy executor for runtime services. The
.github/workflows/*.yml per-service workflows in this repo run
lint/typecheck/test only — none of them invokes railway up or pushes a
container image. Deployments are triggered by Railway’s repo watch on the
configured branch, gated by watchPatterns in each railway.json.
| Service | Trigger | Branch tracked | watchPatterns | Healthcheck path | Replicas |
|---|---|---|---|---|---|
| beef-brain | Railway watch on push | main* | /apps/brain/** | /health | 1 |
| beef-brisket | Railway watch on push | main* | /apps/brisket/** | /api/health | 1 |
| beef-fennec | Railway watch on push | main* | [] (rebuild on every push) | / | 1 |
| beef-flank | Railway watch on push | main* | /apps/flank/** | /sign-in | 1 |
| beef-round | Railway watch on push | main* | /apps/round/** | /health (on gRPC) | 1 |
| beef-sirloin | Railway watch on push | main* | /apps/sirloin/**, /docs/src/content/docs/**, /scripts/docs-build-kb-export.mjs | /health (on gRPC) | 1 |
| beef-strip | Railway watch on push | main* | /apps/strip/** | /health | 1 |
* Railway’s tracked branch is configured per service in the Railway UI, not
in railway.json. The repository’s two long-lived branches are main and
release (per .github/workflows/*.yml branches: lists).
TODO(@law): confirm whether prod tracks release and staging tracks main, or whether both environments track main.
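The watchPatterns gating in the table above can be approximated as follows. This is a minimal sketch, not Railway's matcher: it assumes gitignore-like semantics and emulates them with Python's fnmatch, and it assumes an empty pattern list means "rebuild on every push" (as the beef-fennec row states). The function name would_redeploy and the example file paths are hypothetical.

```python
from fnmatch import fnmatch


def would_redeploy(watch_patterns, changed_paths):
    """Return True if a push touching changed_paths would trigger a deploy.

    Assumption: an empty pattern list means "rebuild on every push".
    """
    if not watch_patterns:
        return True
    for path in changed_paths:
        for pattern in watch_patterns:
            # Patterns in railway.json are rooted at the repo ("/apps/brain/**");
            # strip the leading slash so they line up with repo-relative paths.
            # fnmatch's "*" also matches "/", so "**" acts as "any depth" here.
            if fnmatch(path, pattern.lstrip("/")):
                return True
    return False


# Patterns copied from the beef-sirloin row; the changed paths are made up.
sirloin_patterns = [
    "/apps/sirloin/**",
    "/docs/src/content/docs/**",
    "/scripts/docs-build-kb-export.mjs",
]

print(would_redeploy(sirloin_patterns, ["apps/sirloin/main.go"]))
print(would_redeploy(sirloin_patterns, ["apps/brain/src/app.ts"]))
```

A check like this is mainly useful for reasoning about why a push did or did not redeploy a service; the authoritative behaviour is whatever Railway does with the checked-in patterns.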
CI workflows that do not deploy:
brain.yml, brisket.yml, fennec.yml, flank.yml, sirloin.yml,
docs-quality.yml, pr-description-check.yml, ai-hygiene.yml,
up-to-date-check.yml. The sirloin.yml workflow has a commented-out
docker-build job that would push to ECR — currently dead (.github/workflows/sirloin.yml:112
is commented). TODO(@law): decide whether to delete or revive the dead docker-build job.
The only workflow that talks to Railway directly is
.github/workflows/neon-branching.yml (see the Environments
section).
Environments
Three environment classes:
| Class | Railway environment | Branch | Neon branch | Notes |
|---|---|---|---|---|
| Production | production | likely release | Neon production | Customer traffic. |
| Staging | staging | likely main | Neon staging | Pre-prod soak. |
| Preview | beef-pr-<PR_NUMBER> | PR head | preview/pr-<N>-<sanitized-branch> | Created/destroyed by neon-branching.yml. 14-day Neon TTL. |
TODO(@law): confirm production tracks release and staging tracks main in the Railway UI.
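The preview naming scheme in the table can be sketched as two small helpers. The Railway environment name follows the beef-pr-<PR_NUMBER> pattern directly; the branch sanitizer is an assumption (lowercase, with runs of non-alphanumeric characters collapsed to a hyphen), since the actual rule lives in neon-branching.yml and is not reproduced here. Both function names are hypothetical.

```python
import re


def railway_preview_env(pr_number):
    """Railway preview environment name for a PR: beef-pr-<PR_NUMBER>."""
    return f"beef-pr-{pr_number}"


def neon_preview_branch(pr_number, git_branch):
    """Neon branch name for a PR: preview/pr-<N>-<sanitized-branch>.

    Assumed sanitizer: lowercase, then collapse every run of characters
    outside [a-z0-9] into a single "-". The real workflow may differ.
    """
    sanitized = re.sub(r"[^a-z0-9]+", "-", git_branch.lower()).strip("-")
    return f"preview/pr-{pr_number}-{sanitized}"


# Illustrative PR number and branch name.
print(railway_preview_env(123))
print(neon_preview_branch(123, "feature/Add_Login"))
```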
Per-app stages are normalized by the *_STAGE env var
(SIRLOIN_STAGE, BRAIN_STAGE, BRISKET_STAGE, STRIP_STAGE,
FLANK_STAGE) — values production | staging | sandbox | development (see
deployment-env standard). NODE_ENV stays
reserved for framework runtime mode.
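A minimal sketch of the stage normalization, assuming each app reads only its own *_STAGE variable and rejects anything outside the four-value vocabulary. The helper name resolve_stage and the fail-fast behaviour are illustrative choices, not the services' actual startup code.

```python
import os

VALID_STAGES = ("production", "staging", "sandbox", "development")

# Variable names as listed above; one per app.
STAGE_VARS = {
    "sirloin": "SIRLOIN_STAGE",
    "brain": "BRAIN_STAGE",
    "brisket": "BRISKET_STAGE",
    "strip": "STRIP_STAGE",
    "flank": "FLANK_STAGE",
}


def resolve_stage(app, env=None):
    """Return the app's stage, raising if the variable is unset or invalid.

    NODE_ENV is deliberately not consulted: it stays reserved for the
    framework's runtime mode.
    """
    env = os.environ if env is None else env
    var = STAGE_VARS[app]
    value = env.get(var)
    if value not in VALID_STAGES:
        raise ValueError(f"{var} must be one of {VALID_STAGES}, got {value!r}")
    return value


# NODE_ENV=production does not leak into the stage decision.
print(resolve_stage("brain", {"BRAIN_STAGE": "staging", "NODE_ENV": "production"}))
```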
Preview environment lifecycle
Driven entirely by .github/workflows/neon-branching.yml:
- PR opened/synchronized → workflow installs the Railway CLI, links to RAILWAY_PROJECT_ID (env staging), then waits up to 5 min for the Railway preview env beef-pr-<N> to materialize.
- Creates a Neon branch preview/pr-<N>-<branch> with a 14-day expiry.
- Rewrites database variables on the preview env (with retries):
  - beef-sirloin: SIRLOIN_DATABASE_URL, SIRLOIN_DATABASE_POOLED_URL → Neon rump
  - beef-brain: DATABASE_URL, DIRECT_DATABASE_URL → Neon fennec (schema fennec)
- Disables OTEL on the preview env by deleting the gate vars (BRAIN_OTEL_URL on brain, SIRLOIN_AXIOM_TOKEN on sirloin) so preview logs/traces don’t pollute the staging Axiom dataset.
- Verifies via railway variable list --json that no variable still resolves to railway.internal.
- PR closed → Neon branch deleted (workflow closed action).
This is the single entry point that mutates Railway state from CI. Any other variable changes are made by hand in the Railway UI.
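Two of the steps above, the bounded wait for the preview environment and the final railway.internal check, can be sketched as below. This is not the workflow's code: the polling defaults (30 attempts, 10 s apart, roughly the 5-minute budget) mirror the numbers quoted on this page, the probe is injected rather than calling the Railway CLI, and the verification assumes the output of railway variable list --json parses to a flat name-to-value mapping. Hostnames in the example are made up.

```python
import time


def wait_for_preview_env(env_exists, attempts=30, interval_s=10, sleep=time.sleep):
    """Poll env_exists() until it returns True or attempts run out.

    env_exists is a caller-supplied probe (hypothetical; in CI it would
    wrap a Railway CLI or API call).
    """
    for attempt in range(attempts):
        if env_exists():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False


def vars_still_on_railway_internal(variables):
    """Names of variables whose value still references railway.internal.

    Assumes `variables` is a flat {name: value} mapping.
    """
    return sorted(
        name for name, value in variables.items()
        if "railway.internal" in str(value)
    )


preview_vars = {
    "SIRLOIN_DATABASE_URL": "postgres://user@postgres.railway.internal:5432/db",
    "DATABASE_URL": "postgres://user@example-neon-host/fennec",
}
print(vars_still_on_railway_internal(preview_vars))
```

Returning variable names rather than values keeps connection strings out of CI logs.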
Secrets management
Railway env vars are the canonical secret store for runtime services
(per security-model §Secrets management). The
shape is documented in .env.example. There is currently no Doppler / AWS
Secrets Manager / Vault layer — Railway’s per-environment vars are the
source of truth.
Variable scoping:
- Per-service vars: most secrets, e.g. SIRLOIN_BRAIN_API_KEY, SIRLOIN_PRIMER_WEBHOOK_SECRET, BRISKET_CLERK_ENCRYPTION_KEY.
- Shared variables: Clerk keys (CLERK_PUBLISHABLE_KEY, CLERK_SECRET_KEY) used by brain/flank/fennec/strip; FLANK_ENCRYPTION_KEY used by flank+sirloin. TODO(@law): confirm whether Clerk keys and FLANK_ENCRYPTION_KEY use Railway shared variables or duplicated per-service vars.
- Internal references: ${{shared.X}} and ${{Service.VAR}} reference syntax. None of the checked-in railway.json files contain reference expressions; usage is per-environment in the Railway UI. TODO(@law): audit current Railway variable references.
Rotation procedure (manual today):
- Generate new value (per the secret’s spec; see flank’s secrets.ts for the secret-name registry, or .env.example for top-level shapes).
- Update Railway var on production first; Railway redeploys the service.
- Wait for healthcheck to flip green; verify from logs that the new value is in use.
- Update staging to match (so promotion deltas don’t surface stale secrets).
- For flank workflow secrets (encrypted in sirloin’s DB, AES-256 with FLANK_ENCRYPTION_KEY), use the in-app Secret store, not Railway env. Cache TTL is 5 minutes; wait that long before treating rotation as live.
- Revoke the old value at the upstream provider.
TODO(@law): automate Railway variable rotation from a runbook; currently rotations are done manually in the Railway UI.
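For flank workflow secrets, the 5-minute cache TTL determines the earliest point at which the old value can safely be revoked. A minimal sketch of that timing check follows; the helper names are hypothetical and it assumes the TTL is measured from the moment the secret was updated in the in-app store.

```python
from datetime import datetime, timedelta, timezone

# Cache TTL quoted in the rotation procedure above.
SECRET_CACHE_TTL = timedelta(minutes=5)


def rotation_live_at(updated_at):
    """Earliest time at which every cached copy of the old value has expired."""
    return updated_at + SECRET_CACHE_TTL


def is_rotation_live(updated_at, now=None):
    """True once the cache TTL has elapsed since the secret was updated."""
    now = datetime.now(timezone.utc) if now is None else now
    return now >= rotation_live_at(updated_at)


updated = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative timestamp
print(rotation_live_at(updated).isoformat())
```

A runbook script could gate the "revoke the old value" step on is_rotation_live returning True.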
Required manual auth secret on GitHub: RAILWAY_API_TOKEN (used by
neon-branching.yml). Without it, preview environments stay pointed at
Railway-Postgres clones rather than Neon branches.
Networking
Railway provides a private IPv6 network between services in the same
project; cross-service gRPC traffic uses *.railway.internal hostnames on
that mesh. Public traffic terminates at Railway’s edge with auto-issued TLS
certificates.
- Inter-service gRPC mesh (sirloin ↔ brain, sirloin ↔ round, brain → round, sirloin ↔ flank): traverses the private network. See gRPC mesh for the full topology and TLS posture.
- beef-brain and beef-round set ipv6EgressEnabled: true so they can reach external IPv6-only endpoints (Neon proxy, model registries).
- Healthcheck port discipline: for round and sirloin, /health is served on the same socket as gRPC (HTTP/2 cleartext + HTTP routing) so Railway’s required PORT variable stays aligned with GRPC_PORT / SIRLOIN_PORT. Splitting these would break Railway’s healthcheck.
- Public domains: assigned per service in the Railway UI. TODO(@law): inventory public domains; they are not checked into Git.
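The healthcheck port discipline can be expressed as a small pre-deploy check: PORT and the service's gRPC port variable must be set and equal. This is a sketch under assumptions. The helper name is hypothetical, and the variable name is passed in because the mapping of GRPC_PORT and SIRLOIN_PORT to round and sirloin is inferred from the text rather than confirmed.

```python
def check_port_alignment(env, grpc_port_var):
    """Return None if PORT matches the gRPC port variable, else a message.

    env is a {name: value} mapping of the service's variables.
    """
    port = env.get("PORT")
    grpc_port = env.get(grpc_port_var)
    if port is None or grpc_port is None:
        return f"PORT and {grpc_port_var} must both be set"
    if port != grpc_port:
        return (
            f"PORT={port} but {grpc_port_var}={grpc_port}; "
            "the healthcheck would miss the gRPC listener"
        )
    return None


# Illustrative values only.
print(check_port_alignment({"PORT": "8080", "SIRLOIN_PORT": "8080"}, "SIRLOIN_PORT"))
print(check_port_alignment({"PORT": "8080", "GRPC_PORT": "50051"}, "GRPC_PORT"))
```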
Logs & metrics
- Railway logs: accessible via railway logs --service <name> --environment <env> or the Railway UI. Retention follows the Railway plan default. TODO(@law): exact retention window.
- Application telemetry: services emit OTEL traces/metrics directly to Axiom (SIRLOIN_AXIOM_TOKEN, BRAIN_OTEL_URL); Railway is not in the telemetry path.
- Railway → Axiom log drain: not configured as far as the repo shows; a repo-wide grep for log_drain / RAILWAY_LOG_DRAIN returns nothing. If platform events (deploy success/fail, OOM kills, healthcheck flaps) need to land in Axiom, configure a Railway log drain pointing at the Axiom HTTP ingest endpoint and add RAILWAY_LOG_DRAIN_TOKEN to the secret rotation list.
Failure modes
| Mode | Detection | Mitigation |
|---|---|---|
| Deploy stuck in “Building” / “Deploying” | Railway UI deployment hung > 15 min; service still on previous version. | Cancel from UI, retry. If repeated: inspect build logs for OOM during pnpm install or go build; bump build resources or split Dockerfile stages. |
| Healthcheck fail loop | Railway dashboard shows “Crashed” / repeated restarts; restart counter climbs. restartPolicyMaxRetries: 10 on every service. | Check /health route is reachable on PORT. For round/sirloin, confirm gRPC + HTTP share the listener. Roll back via “Redeploy” on prior commit. |
| Env var drift between envs | Preview behaving differently from staging on identical code. | Run railway variable list --json --service X --environment staging vs preview; reconcile. The neon-branching.yml retry-loop catches DB-URL drift specifically. |
| OOM kill | Process restart immediately after warm-up; dmesg-style OOM signal in logs. | Bump service memory in Railway UI. Local compose no longer encodes hosted memory sizing; use Railway service metrics and plan limits as the sizing source of truth. |
| Region outage (us-east4-eqdc4a) | Railway status page; all beef-* services flapping simultaneously. | No active failover: single region (every service pins us-east4-eqdc4a in railway.json). Mitigation is to wait; the DR plan is still open. |
| Preview env not created | neon-branching.yml “Wait for preview environment” step times out after 30 × 10 s polls. | Railway auto-creates preview envs from the PR; if the workflow is missing RAILWAY_API_TOKEN or the project’s PR-environments toggle is off, the env never appears. |
| Neon → Railway rewrite fails | Workflow logs “did not match the Neon URL written by this workflow” or “still points at Railway Postgres”. | Re-run the workflow; the retry helper retries 5×. If still failing, set the var manually in the Railway UI. |
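For the env var drift row, the comparison between two railway variable list --json dumps can be sketched as a plain dictionary diff. This assumes each dump parses to a flat name-to-value mapping; the function name and the example values are hypothetical. Only variable names are reported, so secret values never reach the terminal or CI logs.

```python
def diff_variables(staging, preview, ignore=()):
    """Compare two {name: value} variable dumps and report names that drift.

    ignore lists variables expected to differ, such as the database URLs
    that neon-branching.yml rewrites on purpose.
    """
    keys = (set(staging) | set(preview)) - set(ignore)
    report = {"only_staging": [], "only_preview": [], "different": []}
    for key in sorted(keys):
        if key not in preview:
            report["only_staging"].append(key)
        elif key not in staging:
            report["only_preview"].append(key)
        elif staging[key] != preview[key]:
            report["different"].append(key)
    return report


# Illustrative dumps; values are placeholders.
staging_vars = {"SIRLOIN_STAGE": "staging", "SIRLOIN_DATABASE_URL": "a", "SIRLOIN_AXIOM_TOKEN": "t"}
preview_vars = {"SIRLOIN_STAGE": "sandbox", "SIRLOIN_DATABASE_URL": "b"}

print(diff_variables(staging_vars, preview_vars, ignore=["SIRLOIN_DATABASE_URL"]))
```

In this example SIRLOIN_AXIOM_TOKEN appearing only in staging is expected, because the preview workflow deletes it to disable OTEL.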
Cost model
Railway billing is usage-based (compute + memory + egress + plan tier).
This repo does not pin per-service compute/memory caps in railway.json
(limitOverride: null everywhere) — Railway uses plan defaults.
TODO(@law) — fill in:
- Plan tier (Pro? Team?).
- Per-service resource ceilings (vCPU, memory).
- Monthly run-rate split by service.
- Network transfer to external Neon, Axiom, and S3-compatible endpoints. Railway and Neon are geographically colocated in US East here, but that is not a private same-network guarantee: same-region/zone paths may have minimal transfer cost, while cross-region or public proxy endpoints can still bill as egress.
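The docker-compose memory targets give a rough sense of the prod budget envelope: brisket 12G, sirloin 12G, brain 8G, round 8G, strip 4G, redis 2G, fennec 2G, flank 2G, totalling 50G across services (12 + 12 + 8 + 8 + 4 + 2 + 2 + 2). Treat these as a rough envelope only: as noted under Failure modes, Railway service metrics and plan limits are the sizing source of truth for hosted environments.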
TODO(@law): define the Railway single-region DR plan.
Local mirror
For full-stack local dev, use the root docker-compose.yml. This brings up Postgres
(pgvector/pgvector:pg17-trixie), brain, sirloin, round, brisket, fennec,
flank, strip, and gRPC UI helpers — bound to localhost, no Railway in the
loop. Per-app stage vars default to sandbox / development.
See the services overview for per-app dev quickstart commands; this is not a Railway-specific story but is the standard reproduction path before deploying to a Railway preview.
Runbook hooks
- operations/railway.md: checked-in service manifest discipline, healthcheck table, clone-to-prod checklist.
- standards/deployment-env.md: env names, app stage vocabulary, preview database mapping rules.
- standards/security-model.md: secrets policy, classification, rotation expectations.
- integrations/neon.md: Neon project, branch naming, preview branch lifecycle (paired with this doc via neon-branching.yml).
- integrations/grpc-mesh.md: private-network service-to-service contracts.
- Per-service runbooks under services/*-runbook.md: use these for service-level paging, alert thresholds, and rollback procedures.
TODO(@law): confirm runbook coverage for round, strip, and fennec.