ADR-0023: Sim-worker is a scale-to-zero, on-demand pool
- Status: accepted
- Date: 2026-06-11
Context
Backtests run on apps/sim-worker, a Node process on Fly.io that drains the
sim_runs queue. Real-model sims run ~17s/tick (≈80 min for a day at 5m),
which is why they can't live on Vercel Functions (800s ceiling) — see the
worker's own header and the "Where sims run" section of
docs/architecture/simulation.md.
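
The arithmetic behind those figures can be checked with a short sketch. It assumes "a day at 5m" means a full 24 hours of 5-minute ticks; the constant names are illustrative and do not come from the worker.

```typescript
// Back-of-envelope check of the numbers quoted above.
const SECONDS_PER_TICK = 17;        // ~17s per tick for a real-model sim (from this ADR)
const TICK_INTERVAL_MINUTES = 5;    // "5m" ticks
const VERCEL_CEILING_SECONDS = 800; // function ceiling cited above

// Assumption: one simulated day covers all 24 hours.
const ticksPerDay = (24 * 60) / TICK_INTERVAL_MINUTES;
const wallClockSeconds = ticksPerDay * SECONDS_PER_TICK;
const wallClockMinutes = wallClockSeconds / 60;
const fitsOnVercel = wallClockSeconds <= VERCEL_CEILING_SECONDS;

console.log({ ticksPerDay, wallClockSeconds, wallClockMinutes, fitsOnVercel });
```

Under that assumption a one-day run needs roughly six times the function ceiling, before any retries or slow ticks.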
Until now the worker was a single always-on machine, provisioned at
deploy time via a [[vm]] block in fly.toml. Two problems:
- Cost. It billed 24/7 even with an empty queue. Backtests are bursty and infrequent; the machine sits idle most of the time.
- Throughput. One machine drains the queue strictly serially. A user who fires several backtests waits for them one after another, even though the work is embarrassingly parallel across runs.
The platform already has a proven dynamic-provisioning pattern: the
live-runner creates one Fly machine per deployment via the Machines API
(apps/web/lib/fly.ts, ADR-0004), pins its image in
platform_config.live_runner_image, and tears it down on stop. We want to
reuse that machinery rather than invent new infrastructure.
Decision
Make the sim-worker a dynamic pool that scales to zero:
- Wake on enqueue. createAndRunSim / resumeSim / rerunSim insert the queued row and then call ensureSimWorkerCapacity() (apps/web/lib/sim-capacity.ts), which provisions machines via the Fly Machines API (apps/web/lib/sim-fly.ts). No after(), no cron: the Vercel Hobby plan freezes prod deploys on sub-daily crons, so a polling cron couldn't run often enough to feel responsive anyway.
- Scale up to N. Capacity targets min(SIM_WORKER_MAX_MACHINES, queued + running) live machines, so multiple backtests drain in parallel. The existing atomic CAS claim (status='queued' predicate on the UPDATE) already makes concurrent workers safe against double-claiming.
- Scale down to zero by self-destruct. Each worker exits 0 after the queue is idle past SIM_WORKER_IDLE_EXIT_MS; machines are created with auto_destroy: true (+ restart: on-failure), so a clean idle exit removes the machine while a crash restarts it in place. No stopped-machine sweep to maintain.
- Heartbeat liveness replaces the boot orphan sweep. Each worker stamps sim_runs.heartbeat_at on the row it owns (~SIM_WORKER_HEARTBEAT_MS). A non-terminal run whose heartbeat goes stale (SIM_WORKER_STALE_MS, default 3 min) is swept to error. This sweep runs on every enqueue and every worker boot. The old "error every running row on boot" sweep was safe only with a single writer; with N workers it would kill a sibling's in-flight run.
- Pin the image, don't roll machines. Machines boot the image pinned in platform_config.sim_worker_image, upserted by the sim-worker-deploy CI workflow (now build-only + push + pin, mirroring live-runner). A deploy never interrupts an in-flight backtest.
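
The capacity target in the scale-up rule can be sketched as a pure function. This is a minimal illustration, not the contents of sim-capacity.ts: the QueueSnapshot shape and both function names are hypothetical, and how the real code counts live machines is not stated in this ADR.

```typescript
// Hypothetical snapshot of queue and fleet state at enqueue time.
interface QueueSnapshot {
  queued: number;       // rows with status = 'queued'
  running: number;      // rows with status = 'running'
  liveMachines: number; // machines already started or starting (assumed input)
}

// Target from the ADR: min(SIM_WORKER_MAX_MACHINES, queued + running).
function targetMachines(s: QueueSnapshot, maxMachines: number): number {
  return Math.min(maxMachines, s.queued + s.running);
}

// Only ever provision upward; scale-down is left to each worker's idle exit.
function machinesToCreate(s: QueueSnapshot, maxMachines: number): number {
  return Math.max(0, targetMachines(s, maxMachines) - s.liveMachines);
}

// Three queued runs, one running, one machine alive, cap of 3: create two more.
console.log(machinesToCreate({ queued: 3, running: 1, liveMachines: 1 }, 3));
```

The function never returns a negative number, which matches the decision that surplus machines are not destroyed by the web app but exit on their own.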
Schema: migration 20260611000000 adds sim_runs.heartbeat_at +
worker_id, a partial index for the staleness sweep, and seeds the
sim_worker_image config row.
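
The staleness rule that heartbeat_at supports can be sketched in memory as follows. The row shape, function names, and the handling of a missing heartbeat are assumptions made for illustration; the real sweep presumably runs as a single UPDATE backed by the partial index.

```typescript
// Hypothetical in-memory mirror of the columns the sweep cares about.
interface SimRunRow {
  id: string;
  status: string;           // e.g. 'queued', 'running', 'error'
  heartbeatAt: Date | null; // mirrors sim_runs.heartbeat_at
}

const DEFAULT_STALE_MS = 3 * 60 * 1000; // "default 3 min" per this ADR

function isStale(row: SimRunRow, now: Date, staleMs: number = DEFAULT_STALE_MS): boolean {
  // Assumption: only claimed, in-flight runs are candidates for the sweep.
  if (row.status !== "running") return false;
  // Assumption: a run that has not stamped a heartbeat yet is left alone.
  if (row.heartbeatAt === null) return false;
  return now.getTime() - row.heartbeatAt.getTime() > staleMs;
}

// Returns a new array in which every stale run has been moved to 'error'.
function sweepStale(rows: SimRunRow[], now: Date, staleMs: number = DEFAULT_STALE_MS): SimRunRow[] {
  return rows.map((r) => (isStale(r, now, staleMs) ? { ...r, status: "error" } : r));
}
```

Unlike the old boot sweep, this rule looks at each row's own heartbeat, so a healthy sibling worker's run is untouched no matter which process runs the sweep.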
Consequences
- No distributed lock on provisioning. Two concurrent enqueues can briefly create an extra machine or two beyond the strict target. This is self-correcting — a worker that finds nothing to claim hits its idle grace and auto-destroys — so we accept the bounded over-provision rather than add locking surface.
- Best-effort capacity. ensureSimWorkerCapacity() never throws to the enqueue path: the queued row is durable, so a Fly hiccup leaves the sim queued and the next enqueue (or worker boot) retries. The tradeoff: if Fly is down and no further enqueues arrive, a queued sim waits.
- Idle-exit race. A job enqueued in the exact window where the last worker is exiting can be missed until the next enqueue. A final claim-attempt before exit narrows the window; the residual is the same class as the point above and self-heals on the next enqueue. A low-frequency safety-net sweep is a future add (blocked today by the Hobby cron-freeze constraint).
- Multi-worker claim contention. At small N the pick-then-CAS claim is fine (losers simply retry). If contention ever bites, switch to SELECT ... FOR UPDATE SKIP LOCKED via the pg client (noted in the worker).
- Token scope. The web app's FLY_API_TOKEN (which already provisions live-runner machines) must also have access to the agentic-sim-worker app, i.e. be org-scoped, not scoped to a single app, or createSimWorkerMachine 401s. The CI FLY_API_TOKEN repo secret used by sim-worker-deploy is separate and only needs deploy scope on that app.
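
The pick-then-CAS claim mentioned under multi-worker claim contention can be illustrated with an in-memory stand-in for the table. Everything here is hypothetical (the Run shape, pickQueued, casClaim, and the SQL in the comment); it only shows why a loser's claim fails harmlessly and it simply retries.

```typescript
interface Run {
  id: string;
  status: "queued" | "running";
  workerId: string | null;
}

// Pick: read the first queued row without locking it.
function pickQueued(table: Run[]): string | null {
  const row = table.find((r) => r.status === "queued");
  return row ? row.id : null;
}

// CAS: stand-in for an UPDATE guarded by a status = 'queued' predicate, roughly
//   UPDATE sim_runs SET status = 'running', worker_id = $1
//   WHERE id = $2 AND status = 'queued'   -- illustrative SQL, not the worker's
// Returns true only if this call flipped the row.
function casClaim(table: Run[], id: string, workerId: string): boolean {
  const row = table.find((r) => r.id === id && r.status === "queued");
  if (!row) return false;
  row.status = "running";
  row.workerId = workerId;
  return true;
}

const table: Run[] = [
  { id: "run-1", status: "queued", workerId: null },
  { id: "run-2", status: "queued", workerId: null },
];

// Both workers pick before either claims, so they race for the same row.
const pickedByA = pickQueued(table);
const pickedByB = pickQueued(table);
const aWon = pickedByA !== null && casClaim(table, pickedByA, "worker-a");
const bWon = pickedByB !== null && casClaim(table, pickedByB, "worker-b");

// The loser retries and picks the next queued row instead.
const bRetryPick = pickQueued(table);
console.log({ pickedByA, pickedByB, aWon, bWon, bRetryPick });
```

With SKIP LOCKED the second worker would skip the contended row inside the database instead of losing the race and retrying, which is why it is the suggested upgrade if contention grows.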
Alternatives considered
Alt A — Keep one always-on machine
- Simplest; no new code.
- Not picked: the cost and serial-throughput problems above are exactly what prompted this. The machine is idle the vast majority of the time.
Alt B — Fly auto-stop/auto-start via an HTTP shim
- Give the worker an HTTP service so Fly's proxy auto-starts/stops it.
- Fly measures idle by HTTP connections, but our work is DB-queued with no HTTP traffic — a long backtest with no requests would look idle and get stopped. Would need an artificial keep-alive. Not picked: fights the mechanism.
Alt C — Scale 0↔1 only (cost win, no parallelism)
- Wake one worker on enqueue, self-stop when idle; keep the boot orphan sweep as-is (safe with a single writer).
- Smaller change, but leaves the serial-throughput problem unsolved.
- Not picked: we want parallel drain. Going to N is what forces the heartbeat-staleness rework — accepted as the cost of throughput.
Alt D — Vercel cron poller
- A scheduled endpoint counts queued rows and provisions/destroys.
- Decoupled from enqueue, but the Hobby sub-daily-cron freeze means it can't poll often enough to be responsive without a plan change. Not picked.