ADR-0004: Live runner runs on Fly.io Machines (one per deployment)

Status: accepted
Date: 2026-05-27

Context

The live trading runtime has constraints that don't fit Vercel Functions:

Persistent WebSocket to Hyperliquid (market data, fills)
Long-lived state across ticks (broker handle, in-flight orders)
Schedule adherence (cron-aligned ticks; can't afford cold start on each tick)
Per-Skill isolation — a crash in one Skill must not affect another
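
As a rough illustration of the schedule-adherence and long-lived-state constraints above, the sketch below aligns ticks to wall-clock boundaries inside a single long-running process. The names msUntilNextTick and startTickLoop are hypothetical and not taken from apps/live-runner; the real runner may schedule differently.

```typescript
// Hypothetical helper: milliseconds until the next wall-clock-aligned tick.
// With intervalMs = 60_000, ticks land at the top of every minute.
export function msUntilNextTick(nowMs: number, intervalMs: number): number {
  if (!Number.isFinite(intervalMs) || intervalMs <= 0) {
    throw new RangeError("intervalMs must be a positive number");
  }
  // Exactly on a boundary, this returns a full interval (the following tick).
  return intervalMs - (nowMs % intervalMs);
}

// Hypothetical long-lived loop. Anything captured by onTick (broker handle,
// in-flight orders, the WebSocket) survives across ticks because the process
// never exits between them.
export function startTickLoop(
  intervalMs: number,
  onTick: (scheduledAtMs: number) => Promise<void>,
  now: () => number = Date.now,
): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let stopped = false;

  const schedule = (): void => {
    if (stopped) return;
    const nowMs = now();
    const scheduledAt = nowMs + msUntilNextTick(nowMs, intervalMs);
    timer = setTimeout(() => {
      // A failed tick is logged and the loop continues. A tick that overruns
      // its slot causes the next tick to land on the following boundary.
      Promise.resolve()
        .then(() => onTick(scheduledAt))
        .catch((err) => console.error("tick failed", err))
        .finally(schedule);
    }, scheduledAt - nowMs);
  };

  schedule();
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

The loop re-arms only after a tick settles, so ticks never overlap; whether skipped boundaries should instead be caught up is a policy choice this sketch does not make.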

Vercel Fluid Compute is excellent for stateless HTTP, but a process that runs for hours while holding a WebSocket connection works against its design.

Candidates: Fly.io Machines, Railway, Render, self-hosted VMs on Hetzner/DO, AWS ECS/Fargate, single dispatcher process pattern.

Decision

Run the live runner on Fly.io Machines, one machine per active Deployment. The apps/live-runner Node service is the only process on the machine. The machine is provisioned through the Fly Machines API when the user clicks Deploy Live, and destroyed when the deployment stops.
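
A minimal sketch of the provision and destroy calls follows, assuming the public Fly Machines REST endpoints (POST and DELETE under /v1/apps/{app}/machines). The app name, the DEPLOYMENT_ID environment variable, the image reference, and the guest sizing are illustrative assumptions, not values taken from this codebase.

```typescript
// Sketch only: field values below are assumptions, not the platform's config.
const FLY_API = "https://api.machines.dev/v1";

export interface HttpRequest {
  url: string;
  method: "POST" | "DELETE";
  headers: Record<string, string>;
  body?: string;
}

export interface ProvisionInput {
  appName: string;      // hypothetical Fly app holding all runner machines
  deploymentId: string; // our Deployment id, handed to the runner via env
  image: string;        // registry reference for the apps/live-runner image
  region: string;       // e.g. "iad"
  apiToken: string;
}

// Build the "create machine" request for one Deployment.
export function buildCreateMachineRequest(input: ProvisionInput): HttpRequest {
  return {
    url: `${FLY_API}/apps/${encodeURIComponent(input.appName)}/machines`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${input.apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      region: input.region,
      config: {
        image: input.image,
        env: { DEPLOYMENT_ID: input.deploymentId }, // hypothetical variable name
        guest: { cpu_kind: "shared", cpus: 1, memory_mb: 256 }, // assumed sizing
      },
    }),
  };
}

// Build the "destroy machine" request used when a deployment stops.
export function buildDestroyMachineRequest(
  appName: string,
  machineId: string,
  apiToken: string,
): HttpRequest {
  const app = encodeURIComponent(appName);
  const id = encodeURIComponent(machineId);
  return {
    // force=true asks Fly to destroy the machine even if it is still running.
    url: `${FLY_API}/apps/${app}/machines/${id}?force=true`,
    method: "DELETE",
    headers: { Authorization: `Bearer ${apiToken}` },
  };
}

// Thin transport wrapper; throws on non-2xx so callers can retry or alert.
export async function send(req: HttpRequest): Promise<unknown> {
  const res = await fetch(req.url, {
    method: req.method,
    headers: req.headers,
    body: req.body,
  });
  const text = await res.text();
  if (!res.ok) throw new Error(`Fly API ${res.status}: ${text}`);
  return text ? JSON.parse(text) : undefined;
}
```

Keeping the request builders pure makes the provisioning path testable without touching the network; only send talks to Fly.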

Alternatives considered

Alt A — Railway / Render workers

Simpler PaaS, single always-on process
Single shared process for all Skills = noisy-neighbor and blast-radius problems
Not picked: isolation matters for production trading; one bad Skill should not take down others.

Alt B — Self-hosted VMs (Hetzner/DO/AWS)

Cheapest at scale, total control
Operational burden: provisioning automation, OS updates, monitoring, networking
Not picked: premature. Pay Fly for the orchestration until we're large enough to absorb the toil.

Alt C — Single dispatcher process that spawns workers

One coordinator on Fly that owns N Skills, manages worker subprocesses
More moving parts, harder restart semantics, shared failure domain
Not picked: per-machine isolation is simpler, and because Fly bills per second it is not noticeably more expensive at our scale.

Alt D — Vercel Functions with cron polling

Skill ticks become cron-fired HTTP requests
No WS = polling-only agents = higher latency to fills + higher RPC volume
State must be re-fetched on every tick (no in-process cache of positions)
Not picked: the latency and cost penalties are significant for active strategies, and polling-only operation is a design constraint we don't want.

Alt E — AWS ECS Fargate

Production-grade container orchestration
Significantly more setup (VPC, IAM, ALB if needed)
Not picked: Fly gives 80% of the value with 5% of the setup at our scale.

Consequences

Positive

Isolation: one Skill's crash, memory leak, or runaway loop only affects itself
Fast cold start: Fly machines boot in ~1s; if a machine restarts after a crash, the deployment is back ticking within seconds
Per-Skill cost transparency: "this Skill costs $X/mo on infra" is a trivial query
Scale: N Skills = N machines, scales linearly with no orchestration code from us
Stop = free: Fly bills machines per second of runtime and the machine is destroyed when the deployment stops, so a stopped deployment = $0
Fly Logs scoped per machine = per-Skill log streams without extra work
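
To illustrate the per-Skill cost point above, here is a small sketch of the arithmetic under per-second billing. The MachineRun shape and the usdPerSecond rate are hypothetical; real rates depend on machine size and should come from Fly's pricing, not from this example.

```typescript
// Hypothetical record of one machine's lifetime, e.g. loaded from our own DB.
export interface MachineRun {
  skillId: string;
  startedAtMs: number;
  stoppedAtMs: number;
}

// Sum billed seconds per Skill and multiply by an assumed flat per-second rate.
export function costPerSkill(
  runs: MachineRun[],
  usdPerSecond: number,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const run of runs) {
    // Round partial seconds up; ignore malformed runs with negative duration.
    const seconds = Math.max(
      0,
      Math.ceil((run.stoppedAtMs - run.startedAtMs) / 1000),
    );
    const previous = totals.get(run.skillId) ?? 0;
    totals.set(run.skillId, previous + seconds * usdPerSecond);
  }
  return totals;
}
```

Because each machine hosts exactly one Skill, no usage has to be apportioned between tenants; the per-Skill figure is a plain sum over that Skill's machine runs.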

Negative / trade-offs

Adds Fly as a vendor (operationally separate from Vercel)
Fly Machines API has its own auth, rate limits, regions to think about
Cross-machine coordination (e.g., shared rate limits across all Skills) requires a coordination layer in Postgres
Fly outages affect all live trading on the platform
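
One possible shape for the Postgres coordination layer mentioned above is a fixed-window counter shared by all machines. This is a sketch under assumptions: the rate_limit_windows table, its columns, and the tryAcquire helper are hypothetical, and the Query type is merely shaped like node-postgres's pool.query(text, values).

```typescript
// Assumed table (hypothetical):
//   CREATE TABLE rate_limit_windows (
//     bucket TEXT NOT NULL,
//     window_start TIMESTAMPTZ NOT NULL,
//     used INTEGER NOT NULL,
//     PRIMARY KEY (bucket, window_start)
//   );
export const ACQUIRE_SQL = `
  INSERT INTO rate_limit_windows (bucket, window_start, used)
  VALUES ($1, $2, 1)
  ON CONFLICT (bucket, window_start)
  DO UPDATE SET used = rate_limit_windows.used + 1
  RETURNING used
`;

// Minimal query function type, compatible in shape with pool.query.
export type Query = (
  sql: string,
  params: unknown[],
) => Promise<{ rows: Array<{ used: number }> }>;

// Start of the fixed window that contains nowMs.
export function windowStartMs(nowMs: number, windowMs: number): number {
  return Math.floor(nowMs / windowMs) * windowMs;
}

// Atomically count one request against the shared bucket and report whether
// it fits within the limit for the current window.
export async function tryAcquire(
  query: Query,
  bucket: string,
  limit: number,
  windowMs: number,
  nowMs: number = Date.now(),
): Promise<boolean> {
  const start = new Date(windowStartMs(nowMs, windowMs));
  const result = await query(ACQUIRE_SQL, [bucket, start]);
  const used = result.rows[0]?.used;
  if (used === undefined) throw new Error("rate limit query returned no row");
  return used <= limit;
}
```

The upsert makes the increment atomic across machines, at the cost of one database round trip per guarded call. A fixed window also allows short bursts at window edges, so a sliding window or token bucket may fit better if the upstream limit is strict.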

Things we'll need to revisit

When we have >100 live deployments, evaluate whether a dispatcher-pool architecture (Alt C) saves enough cost to justify the complexity
If Fly availability becomes a problem, evaluate a multi-region active-passive setup
Latency: confirm Fly region (iad) is well-placed vs Hyperliquid's API endpoints. Move to a different region if latency becomes a strategy issue.
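
A simple way to check the region question above is to sample round-trip times from a machine in the candidate region and compare percentiles. The helpers below are a hypothetical sketch; the probe is passed in by the caller, so no particular Hyperliquid endpoint is assumed here.

```typescript
// Nearest-rank percentile: the smallest sample with at least p percent of
// all samples at or below it.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p < 0 || p > 100) throw new RangeError("p must be between 0 and 100");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

// Run the probe sequentially `count` times and record each duration in ms.
export async function measureRttMs(
  probe: () => Promise<void>,
  count: number,
): Promise<number[]> {
  const samples: number[] = [];
  for (let i = 0; i < count; i++) {
    const start = performance.now();
    await probe();
    samples.push(performance.now() - start);
  }
  return samples;
}
```

A probe could be as small as an HTTP request to whichever API endpoint the runner already uses, for example `() => fetch(url).then(() => undefined)`. Comparing the 50th and 95th percentiles across regions gives a more stable basis for the decision than a single ping.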

References

docs/architecture/live-runtime.md
Fly Machines API: https://fly.io/docs/machines/