Decisions
ADR-0004: Live runner runs on Fly.io Machines (one per deployment)
- Status: accepted
- Date: 2026-05-27
Context
The live trading runtime has constraints that don't fit Vercel Functions:
- Persistent WebSocket to Hyperliquid (market data, fills)
- Long-lived state across ticks (broker handle, in-flight orders)
- Schedule adherence (cron-aligned ticks; can't afford cold start on each tick)
- Per-Skill isolation — a crash in one Skill must not affect another
Vercel Fluid Compute is excellent for stateless HTTP, but a process that runs for hours while holding a WebSocket connection works against its design.
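To make the schedule-adherence constraint concrete, here is a minimal sketch of how a long-lived process might align ticks to wall-clock boundaries. The function name `msUntilNextTick` and the choice of epoch-aligned boundaries are assumptions for illustration, not the actual apps/live-runner implementation.

```typescript
// Hypothetical helper: milliseconds to wait until the next tick boundary.
// Boundaries are multiples of intervalMs since the Unix epoch, so a
// 60_000 ms interval ticks on the minute.
function msUntilNextTick(nowMs: number, intervalMs: number): number {
  if (!Number.isFinite(intervalMs) || intervalMs <= 0) {
    throw new RangeError("intervalMs must be a positive number");
  }
  const elapsedInWindow = nowMs % intervalMs;
  // Exactly on a boundary: wait one full interval instead of firing twice.
  return intervalMs - elapsedInWindow;
}
```

A resident process would call something like `setTimeout(tick, msUntilNextTick(Date.now(), 60_000))` and recompute the delay after every tick so that drift does not accumulate. A function that cold-starts on each tick has to pay its startup time before it can even begin this calculation.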
Candidates: Fly.io Machines, Railway, Render, self-hosted VMs on Hetzner/DO, AWS ECS/Fargate, single dispatcher process pattern.
Decision
Fly.io Machines, one machine per active Deployment. The apps/live-runner Node service is the only thing on the machine. A machine is provisioned via the Fly Machines API when the user clicks Deploy Live and destroyed when the deployment stops.
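The provision/destroy lifecycle could be sketched as two request builders. This is an illustration under assumptions: the endpoint shapes follow our reading of the public Fly Machines REST API (see References) and should be verified against it. The `DeploymentSpec` type, the `DEPLOYMENT_ID` env var, the guest size, and the restart policy are all hypothetical choices, not decided values. The functions only build request descriptors, so they can be unit-tested without network access.

```typescript
// Assumed base URL of the Fly Machines REST API; verify against the Fly docs.
const FLY_API_BASE = "https://api.machines.dev/v1";

interface HttpRequest {
  method: "POST" | "DELETE";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

// Hypothetical shape: what the platform knows when "Deploy Live" is clicked.
interface DeploymentSpec {
  deploymentId: string;
  image: string;  // container image for apps/live-runner
  region: string; // e.g. "iad"
}

function authHeaders(token: string): Record<string, string> {
  return { Authorization: `Bearer ${token}`, "Content-Type": "application/json" };
}

// One machine per deployment: the deployment ID is passed to the runner via env.
function createMachineRequest(appName: string, token: string, spec: DeploymentSpec): HttpRequest {
  return {
    method: "POST",
    url: `${FLY_API_BASE}/apps/${encodeURIComponent(appName)}/machines`,
    headers: authHeaders(token),
    body: JSON.stringify({
      region: spec.region,
      config: {
        image: spec.image,
        env: { DEPLOYMENT_ID: spec.deploymentId },          // assumed variable name
        guest: { cpu_kind: "shared", cpus: 1, memory_mb: 256 }, // assumed size
        restart: { policy: "on-failure" },                  // assumed policy
      },
    }),
  };
}

// Destroy the machine when the deployment stops; force=true also kills a running machine.
function destroyMachineRequest(appName: string, token: string, machineId: string): HttpRequest {
  const app = encodeURIComponent(appName);
  const id = encodeURIComponent(machineId);
  return {
    method: "DELETE",
    url: `${FLY_API_BASE}/apps/${app}/machines/${id}?force=true`,
    headers: authHeaders(token),
  };
}
```

Each descriptor can be handed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. Keeping construction separate from the network call makes retry and rate-limit handling easier to add later.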
Alternatives considered
Alt A — Railway / Render workers
- Simpler PaaS, single always-on process
- Single shared process for all Skills = noisy-neighbor and blast-radius problems
- Not picked: isolation matters for production trading; one bad Skill should not take down others.
Alt B — Self-hosted VMs (Hetzner/DO/AWS)
- Cheapest at scale, total control
- Operational burden: provisioning automation, OS updates, monitoring, networking
- Not picked: premature. Pay Fly for the orchestration until we're large enough to absorb the toil.
Alt C — Single dispatcher process that spawns workers
- One coordinator on Fly that owns N Skills, manages worker subprocesses
- More moving parts, harder restart semantics, shared failure domain
- Not picked: per-machine isolation is simpler, and because Fly bills per second it is not noticeably more expensive at our scale.
Alt D — Vercel Functions with cron polling
- Skill ticks become cron-fired HTTP requests
- No WS = polling-only agents = higher latency to fills + higher RPC volume
- State must be re-fetched on every tick (no in-process cache of positions)
- Not picked: the latency and cost penalty is significant for active strategies, and polling-only operation is a design constraint we don't want.
Alt E — AWS ECS Fargate
- Production-grade container orchestration
- Significantly more setup (VPC, IAM, ALB if needed)
- Not picked: Fly gives 80% of the value with 5% of the setup at our scale.
Consequences
Positive
- Isolation: one Skill's crash, memory leak, or runaway loop only affects itself
- Fast cold start: Fly machines boot in ~1s; if a machine restarts after a crash, the deployment is back ticking within seconds
- Per-Skill cost transparency: "this Skill costs $X/mo on infra" is a trivial query
- Scale: N Skills = N machines, scales linearly with no orchestration code from us
- Stop = free: Fly bills machines only while they run, and a stopped deployment's machine is destroyed, so a stopped deployment costs $0
- Fly Logs scoped per machine = per-Skill log streams without extra work
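The cost-transparency point follows from one machine per Skill plus per-second billing. A minimal sketch of the aggregation, assuming run intervals are recorded per machine: the `RunInterval` shape is hypothetical, and the per-second rate is passed in as a parameter rather than taken from Fly's price list.

```typescript
// Hypothetical record of one machine run, as the platform might store it.
interface RunInterval {
  skillId: string;
  startedAtMs: number;
  stoppedAtMs: number;
}

// Sums running time per Skill and multiplies by an assumed flat per-second rate.
function costBySkill(intervals: RunInterval[], usdPerSecond: number): Map<string, number> {
  const totals = new Map<string, number>();
  for (const run of intervals) {
    // Guard against malformed rows where stop precedes start.
    const seconds = Math.max(0, run.stoppedAtMs - run.startedAtMs) / 1000;
    const previous = totals.get(run.skillId) ?? 0;
    totals.set(run.skillId, previous + seconds * usdPerSecond);
  }
  return totals;
}
```

Time between intervals contributes nothing to the total, which is the "Stop = free" property expressed in code.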
Negative / trade-offs
- Adds Fly as a vendor (operationally separate from Vercel)
- Fly Machines API has its own auth, rate limits, regions to think about
- Cross-machine coordination (e.g., shared rate limits across all Skills) requires a coordination layer in Postgres
- Fly outages affect all live trading on the platform
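As one possible shape for the Postgres coordination layer, a shared rate limit could be a token bucket stored as a single row that every machine updates inside a transaction. The sketch below covers only the pure decision step. The `BucketRow` and `BucketConfig` shapes are hypothetical, and the surrounding row lock (for example `SELECT ... FOR UPDATE`) is assumed rather than shown.

```typescript
// Hypothetical row shape for a shared bucket stored in Postgres.
interface BucketRow {
  tokens: number;
  updatedAtMs: number;
}

interface BucketConfig {
  capacity: number;        // maximum tokens the bucket can hold
  refillPerSecond: number; // tokens added per second
}

// Pure token-bucket step: refill based on elapsed time, then try to spend `cost`.
// The caller is assumed to hold a row lock and to write `next` back in the same transaction.
function tryConsume(
  row: BucketRow,
  cfg: BucketConfig,
  nowMs: number,
  cost: number = 1,
): { allowed: boolean; next: BucketRow } {
  // Never move the timestamp backwards if machine clocks disagree;
  // using the database clock for nowMs would sidestep skew entirely.
  const effectiveNow = Math.max(nowMs, row.updatedAtMs);
  const elapsedSeconds = (effectiveNow - row.updatedAtMs) / 1000;
  const refilled = Math.min(cfg.capacity, row.tokens + elapsedSeconds * cfg.refillPerSecond);
  if (refilled >= cost) {
    return { allowed: true, next: { tokens: refilled - cost, updatedAtMs: effectiveNow } };
  }
  return { allowed: false, next: { tokens: refilled, updatedAtMs: effectiveNow } };
}
```

Because every machine goes through the same row, the limit holds platform-wide. The price is one database round trip per rate-limited call.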
Things we'll need to revisit
- When we have >100 live deployments, evaluate whether dispatcher-pool architecture saves enough cost to justify the complexity
- If Fly availability becomes a problem, evaluate a multi-region active-passive setup
- Latency: confirm the Fly region (iad) is well placed relative to Hyperliquid's API endpoints. Move to a different region if latency becomes a strategy issue.
References
- docs/architecture/live-runtime.md
- Fly Machines API: https://fly.io/docs/machines/