ADR-0004: Live runner runs on Fly.io Machines (one per deployment)

  • Status: accepted
  • Date: 2026-05-27

Context

The live trading runtime has constraints that don't fit Vercel Functions:

  • Persistent WebSocket to Hyperliquid (market data, fills)
  • Long-lived state across ticks (broker handle, in-flight orders)
  • Schedule adherence (cron-aligned ticks; can't afford cold start on each tick)
  • Per-Skill isolation — a crash in one Skill must not affect another
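
The schedule-adherence constraint can be sketched as follows: a long-lived process computes the delay to the next interval boundary and sleeps until then, keeping its broker handle and in-flight orders in memory between ticks. The helper names `msUntilNextTick` and `nextTickAt` and the epoch-aligned interval are illustrative assumptions, not code from apps/live-runner.

```typescript
// Hypothetical helpers, not taken from apps/live-runner.

// Milliseconds until the next boundary that is a whole multiple of
// intervalMs since the Unix epoch (e.g. the top of the next minute).
function msUntilNextTick(nowMs: number, intervalMs: number): number {
  if (!Number.isFinite(intervalMs) || intervalMs <= 0) {
    throw new RangeError("intervalMs must be a positive number");
  }
  const elapsedInWindow = nowMs % intervalMs;
  // Exactly on a boundary: wait a full interval rather than tick twice.
  return intervalMs - elapsedInWindow;
}

// Absolute timestamp (ms since epoch) of the next aligned tick.
function nextTickAt(nowMs: number, intervalMs: number): number {
  return nowMs + msUntilNextTick(nowMs, intervalMs);
}
```

A runner loop would typically pass `Date.now()` and hand the returned delay to a timer. Because the process never exits between ticks, no cold start sits between the timer firing and the Skill running.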

Vercel Fluid Compute is excellent for stateless HTTP, but a process that runs for hours while holding a WebSocket connection open works against its design.

Candidates: Fly.io Machines, Railway, Render, self-hosted VMs on Hetzner/DO, AWS ECS/Fargate, single dispatcher process pattern.

Decision

Fly.io Machines, one machine per active Deployment. The apps/live-runner Node service is the only thing running on the machine. The machine is provisioned via the Fly Machines API when the user clicks Deploy Live, and destroyed when the deployment stops.
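
As a rough sketch of that lifecycle, the two calls can be expressed as plain request descriptors against the Fly Machines REST API (`POST /v1/apps/{app}/machines` to create, `DELETE /v1/apps/{app}/machines/{id}` to destroy). The app name, machine name prefix, `DEPLOYMENT_ID` env var, and machine size below are assumptions made for illustration; the ADR does not specify them.

```typescript
// Sketch only: naming scheme, env var, and guest size are hypothetical.
interface HttpRequest {
  method: "POST" | "DELETE";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

interface DeploymentSpec {
  deploymentId: string; // hypothetical identifier for one live Deployment
  image: string;        // container image that runs apps/live-runner
  region: string;       // Fly region code, e.g. "iad"
}

const FLY_API = "https://api.machines.dev/v1";

// Request that provisions one machine for one Deployment.
function buildCreateMachineRequest(app: string, token: string, spec: DeploymentSpec): HttpRequest {
  return {
    method: "POST",
    url: `${FLY_API}/apps/${encodeURIComponent(app)}/machines`,
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      name: `live-${spec.deploymentId}`,
      region: spec.region,
      config: {
        image: spec.image,
        env: { DEPLOYMENT_ID: spec.deploymentId },
        guest: { cpu_kind: "shared", cpus: 1, memory_mb: 256 },
      },
    }),
  };
}

// Request that destroys the machine when the deployment stops.
function buildDestroyMachineRequest(app: string, token: string, machineId: string): HttpRequest {
  return {
    method: "DELETE",
    // force=true asks Fly to destroy the machine even if it is still running.
    url: `${FLY_API}/apps/${encodeURIComponent(app)}/machines/${encodeURIComponent(machineId)}?force=true`,
    headers: { Authorization: `Bearer ${token}` },
  };
}
```

Keeping the builders free of I/O makes them easy to unit test. The caller would send each descriptor with whatever HTTP client the control plane already uses and record the machine id returned by the create call so the destroy call can reference it later.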

Alternatives considered

Alt A — Railway / Render workers

  • Simpler PaaS, single always-on process
  • Single shared process for all Skills = noisy-neighbor and blast-radius problems
  • Not picked: isolation matters for production trading; one bad Skill should not take down others.

Alt B — Self-hosted VMs (Hetzner/DO/AWS)

  • Cheapest at scale, total control
  • Operational burden: provisioning automation, OS updates, monitoring, networking
  • Not picked: premature. Pay Fly for the orchestration until we're large enough to absorb the toil.

Alt C — Single dispatcher process that spawns workers

  • One coordinator on Fly that owns N Skills, manages worker subprocesses
  • More moving parts, harder restart semantics, shared failure domain
  • Not picked: per-machine isolation is simpler, and Fly bills per second, so it is not noticeably more expensive at our scale.

Alt D — Vercel Functions with cron polling

  • Skill ticks become cron-fired HTTP requests
  • No WS = polling-only agents = higher latency to fills + higher RPC volume
  • State must be re-fetched on every tick (no in-process cache of positions)
  • Not picked: the latency and cost penalties are significant for active strategies, and polling-only operation is a design constraint we don't want.

Alt E — AWS ECS Fargate

  • Production-grade container orchestration
  • Significantly more setup (VPC, IAM, ALB if needed)
  • Not picked: Fly gives 80% of the value with 5% of the setup at our scale.

Consequences

Positive

  • Isolation: one Skill's crash, memory leak, or runaway loop only affects itself
  • Fast cold start: Fly machines boot in ~1s; if a machine restarts after a crash, the deployment is back ticking within seconds
  • Per-Skill cost transparency: "this Skill costs $X/mo on infra" is a trivial query
  • Scale: N Skills = N machines, scales linearly with no orchestration code from us
  • Stop = free: compute is billed only while a machine runs, and the machine is destroyed when the deployment stops, so a stopped deployment = $0
  • Fly Logs scoped per machine = per-Skill log streams without extra work
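
To illustrate why per-Skill cost becomes a simple aggregation under one-machine-per-deployment with per-second billing, here is a minimal sketch that sums running time per deployment and multiplies by a rate. The `RunInterval` shape and the `usdPerSecond` parameter are assumptions; real rates depend on the machine size chosen and are not given in this ADR.

```typescript
// Hypothetical record of one period during which a deployment's machine ran.
interface RunInterval {
  deploymentId: string;
  startedAtMs: number;
  stoppedAtMs: number;
}

// Sums running time per deployment, then converts to cost.
// usdPerSecond is an assumed flat rate, not a published Fly price.
function costByDeployment(intervals: RunInterval[], usdPerSecond: number): Map<string, number> {
  const runningMs = new Map<string, number>();
  for (const run of intervals) {
    // Guard against malformed intervals that end before they start.
    const duration = Math.max(0, run.stoppedAtMs - run.startedAtMs);
    runningMs.set(run.deploymentId, (runningMs.get(run.deploymentId) ?? 0) + duration);
  }
  const costs = new Map<string, number>();
  runningMs.forEach((ms, deploymentId) => {
    costs.set(deploymentId, (ms / 1000) * usdPerSecond);
  });
  return costs;
}
```

With a shared process, the same figure would need per-Skill CPU and memory attribution; here it falls out of machine start and stop times alone.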

Negative / trade-offs

  • Adds Fly as a vendor (operationally separate from Vercel)
  • Fly Machines API has its own auth, rate limits, regions to think about
  • Cross-machine coordination (e.g., shared rate limits across all Skills) requires a coordination layer in Postgres
  • Fly outages affect all live trading on the platform
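
One way the Postgres coordination layer for shared rate limits could look is a fixed-window counter that every machine increments with a single atomic upsert. This is a sketch under assumptions: the `rate_windows` table, the `tryClaim` helper, and the `Query` interface (shaped like a driver call that returns `{ rows }`) are all hypothetical, and a fixed window is only one of several possible limiting schemes.

```typescript
// Hypothetical table, assumed to exist:
//   CREATE TABLE rate_windows (
//     bucket text, window_start bigint, used int,
//     PRIMARY KEY (bucket, window_start));
const CLAIM_SQL = `
  INSERT INTO rate_windows (bucket, window_start, used)
  VALUES ($1, $2, 1)
  ON CONFLICT (bucket, window_start)
  DO UPDATE SET used = rate_windows.used + 1
  RETURNING used`;

// Minimal query interface so the sketch does not depend on a specific driver.
type Query = (sql: string, params: unknown[]) => Promise<{ rows: Array<{ used: number }> }>;

// Start of the fixed window that contains nowMs.
function windowStart(nowMs: number, windowMs: number): number {
  return Math.floor(nowMs / windowMs) * windowMs;
}

// Atomically counts this request against the shared window and reports
// whether the caller is still within the platform-wide limit.
async function tryClaim(
  query: Query,
  bucket: string,
  limit: number,
  windowMs: number,
  nowMs: number,
): Promise<boolean> {
  const result = await query(CLAIM_SQL, [bucket, windowStart(nowMs, windowMs)]);
  const row = result.rows[0];
  return row !== undefined && row.used <= limit;
}
```

Each claim costs one database round trip, which is the price of coordinating machines that share nothing else. Old window rows would need periodic cleanup, which this sketch omits.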

Things we'll need to revisit

  • When we have >100 live deployments, evaluate whether a dispatcher-pool architecture saves enough cost to justify the complexity
  • If Fly availability becomes a problem, evaluate a multi-region active-passive setup
  • Latency: confirm Fly region (iad) is well-placed vs Hyperliquid's API endpoints. Move to a different region if latency becomes a strategy issue.
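
For the region check, one possible approach is to collect round-trip samples from a machine in each candidate region and compare percentiles rather than averages, since tail latency is what delays fills. The sketch below covers only the summary step; how the samples are gathered and the helper names `percentile` and `summarizeLatency` are assumptions.

```typescript
// Nearest-rank percentile over round-trip samples in milliseconds.
// Hypothetical helper; the input array is not modified.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new RangeError("need at least one sample");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  const value = sorted[rank - 1];
  if (value === undefined) throw new RangeError("rank out of bounds");
  return value;
}

// Median and tail latency for one region's samples.
function summarizeLatency(samplesMs: number[]): { p50: number; p95: number } {
  return { p50: percentile(samplesMs, 50), p95: percentile(samplesMs, 95) };
}
```

Running the same measurement from iad and from one or two alternative regions would give a like-for-like comparison before deciding whether a move is worthwhile.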
