ADR-0023: Sim-worker is a scale-to-zero, on-demand pool

Status: accepted
Date: 2026-06-11

Context

Backtests run on apps/sim-worker, a Node process on Fly.io that drains the sim_runs queue. Real-model sims run ~17s/tick (≈80 min for a day at 5m), which is why they can't live on Vercel Functions (800s ceiling) — see the worker's own header and the "Where sims run" section of docs/architecture/simulation.md.
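As a back-of-envelope check of those figures, the arithmetic can be written out. This sketch assumes a "day" means 24 hours of 5-minute ticks; the constant names are illustrative and do not come from the codebase.

```typescript
// Rough arithmetic behind "≈80 min for a day at 5m".
// Assumption: one simulated day = 24h of 5-minute ticks.
const SECONDS_PER_TICK = 17;         // approximate real-model cost per tick (from the text)
const TICK_INTERVAL_MINUTES = 5;     // "5m" resolution
const VERCEL_CEILING_SECONDS = 800;  // Vercel Functions ceiling cited above

const ticksPerDay = (24 * 60) / TICK_INTERVAL_MINUTES;        // 288 ticks
const runSeconds = ticksPerDay * SECONDS_PER_TICK;            // 4896 s
const runMinutes = runSeconds / 60;                           // about 81.6 min
const timesOverCeiling = runSeconds / VERCEL_CEILING_SECONDS; // about 6.1x

console.log({ ticksPerDay, runSeconds, runMinutes, timesOverCeiling });
```

Under that assumption a single one-day backtest overshoots the function ceiling roughly sixfold, before counting any multi-day runs.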

Until now the worker was a single always-on machine, provisioned at deploy time via a [[vm]] block in fly.toml. Two problems:

Cost. It billed 24/7 even with an empty queue. Backtests are bursty and infrequent; the machine sits idle most of the time.
Throughput. One machine drains the queue strictly serially. A user who fires several backtests waits for them one after another, even though the work is embarrassingly parallel across runs.

The platform already has a proven dynamic-provisioning pattern: the live-runner creates one Fly machine per deployment via the Machines API (apps/web/lib/fly.ts, ADR-0004), pins its image in platform_config.live_runner_image, and tears it down on stop. We want to reuse that machinery rather than invent new infrastructure.

Decision

Make the sim-worker a dynamic pool that scales to zero:

Wake on enqueue. createAndRunSim / resumeSim / rerunSim insert the queued row and then call ensureSimWorkerCapacity() (apps/web/lib/sim-capacity.ts), which provisions machines via the Fly Machines API (apps/web/lib/sim-fly.ts). No after(), no cron — the Vercel Hobby plan freezes prod deploys on sub-daily crons, so a polling cron couldn't run often enough to feel responsive anyway.
Scale up to N. Capacity targets min(SIM_WORKER_MAX_MACHINES, queued + running) live machines, so multiple backtests drain in parallel. The existing atomic CAS claim (status='queued' predicate on the UPDATE) already makes concurrent workers safe against double-claiming.
Scale down to zero by self-destruct. Each worker exits 0 after the queue is idle past SIM_WORKER_IDLE_EXIT_MS; machines are created with auto_destroy: true (+ restart: on-failure), so a clean idle exit removes the machine while a crash restarts it in place. No stopped-machine sweep to maintain.
Heartbeat liveness replaces the boot orphan sweep. Each worker stamps sim_runs.heartbeat_at on the row it owns, roughly every SIM_WORKER_HEARTBEAT_MS. A non-terminal run whose heartbeat goes stale (older than SIM_WORKER_STALE_MS, default 3 min) is swept to error. This sweep runs on every enqueue and every worker boot. The old "error every running row on boot" sweep was safe only with a single writer; with N workers it would kill a sibling's in-flight run.
Pin the image, don't roll machines. Machines boot the image pinned in platform_config.sim_worker_image, upserted by the sim-worker-deploy CI workflow (now build-only + push + pin, mirroring live-runner). A deploy never interrupts an in-flight backtest.
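The scale-up target and the self-destruct machine settings described above can be sketched as follows. This is a minimal illustration, not the contents of apps/web/lib/sim-capacity.ts or sim-fly.ts: the function names and the SimWorkerMachineConfig type are hypothetical, and the config field names follow our understanding of the Fly Machines API `config` object (`auto_destroy`, `restart.policy`).

```typescript
// Hypothetical helpers; names and shapes are illustrative only.

/** Desired live machines: min(SIM_WORKER_MAX_MACHINES, queued + running), never negative. */
function targetMachineCount(queued: number, running: number, maxMachines: number): number {
  return Math.max(0, Math.min(maxMachines, queued + running));
}

/** Machines still to create, given how many are already live. Never negative. */
function machinesToCreate(target: number, live: number): number {
  return Math.max(0, target - live);
}

// Assumed subset of the Fly Machines API machine config.
interface SimWorkerMachineConfig {
  image: string;
  auto_destroy: boolean;
  restart: { policy: "no" | "on-failure" | "always" };
}

/** Builds the config for one worker machine from the pinned image reference. */
function buildMachineConfig(pinnedImage: string): SimWorkerMachineConfig {
  return {
    image: pinnedImage,                // value read from platform_config.sim_worker_image
    auto_destroy: true,                // a clean exit 0 removes the machine
    restart: { policy: "on-failure" }, // a crash restarts the machine in place
  };
}
```

For example, with 5 queued runs, 1 running, and a cap of 3, the target is 3; if 1 machine is already live, 2 more are created. Note that nothing here scales down: removal happens only when a worker exits cleanly after its idle grace.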

Schema: migration 20260611000000 adds sim_runs.heartbeat_at + worker_id, a partial index for the staleness sweep, and seeds the sim_worker_image config row.
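To make the heartbeat staleness rule concrete, here is a small in-memory sketch of the sweep over rows carrying the new heartbeat_at and worker_id columns. It is an illustration under stated assumptions, not the migration or the real query: the SimRun shape and the "done" status are hypothetical, and the sketch sweeps only running rows that have a recorded heartbeat. How the real sweep treats queued rows or a null heartbeat_at is not stated in this ADR.

```typescript
// Hypothetical row shape; "done" is an assumed terminal status.
type SimStatus = "queued" | "running" | "done" | "error";

interface SimRun {
  id: string;
  status: SimStatus;
  heartbeatAt: Date | null; // mirrors sim_runs.heartbeat_at
  workerId: string | null;  // mirrors sim_runs.worker_id
}

const DEFAULT_STALE_MS = 3 * 60 * 1000; // SIM_WORKER_STALE_MS default of 3 min

/** True when a running row's last heartbeat is older than the staleness threshold. */
function isStale(run: SimRun, now: Date, staleMs: number = DEFAULT_STALE_MS): boolean {
  if (run.status !== "running") return false; // assumption: only running rows are swept
  if (run.heartbeatAt === null) return false; // assumption: no heartbeat yet means skip
  return now.getTime() - run.heartbeatAt.getTime() > staleMs;
}

/** Returns a copy of the rows with every stale run moved to "error". */
function sweepStale(runs: SimRun[], now: Date, staleMs: number = DEFAULT_STALE_MS): SimRun[] {
  return runs.map((run) =>
    isStale(run, now, staleMs) ? { ...run, status: "error" as const } : run
  );
}
```

Unlike the old boot sweep, this rule depends only on each row's own heartbeat, so a booting worker leaves a sibling's healthy in-flight run untouched.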

Consequences

No distributed lock on provisioning. Two concurrent enqueues can briefly create an extra machine or two beyond the strict target. This is self-correcting — a worker that finds nothing to claim hits its idle grace and auto-destroys — so we accept the bounded over-provision rather than add locking surface.
Best-effort capacity. ensureSimWorkerCapacity() never throws to the enqueue path: the queued row is durable, so a Fly hiccup leaves the sim queued and the next enqueue (or worker boot) retries. The tradeoff: if Fly is down and no further enqueues arrive, a queued sim waits.
Idle-exit race. A job enqueued in the exact window where the last worker is exiting can be missed until the next enqueue. A final claim attempt before exit narrows the window; the residual is the same class of failure as the best-effort capacity case above (a queued sim waits) and self-heals on the next enqueue. A low-frequency safety-net sweep is a possible future addition (blocked today by the Hobby cron-freeze constraint).
Multi-worker claim contention. At small N the pick-then-CAS claim is fine (losers simply retry). If contention ever bites, switch to SELECT ... FOR UPDATE SKIP LOCKED via the pg client — noted in the worker.
Token scope. The web app's FLY_API_TOKEN (which already provisions live-runner machines) must also have access to the agentic-sim-worker app, i.e. be org-scoped rather than scoped to a single app; otherwise createSimWorkerMachine fails with a 401. The CI FLY_API_TOKEN repo secret used by sim-worker-deploy is separate and only needs deploy scope on that app.
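The pick-then-CAS claim referred to under multi-worker claim contention can be illustrated with an in-memory stand-in for the sim_runs table. This is a sketch under assumptions: InMemorySimQueue, claimNext, and the retry limit are hypothetical names, and compareAndClaim only mimics the effect of an UPDATE guarded by a status='queued' predicate. The real worker talks to the database.

```typescript
// Hypothetical in-memory model of the claim protocol; not the worker's code.
interface QueueRow {
  id: number;
  status: "queued" | "running";
  workerId: string | null;
}

class InMemorySimQueue {
  private rows: QueueRow[] = [];

  constructor(ids: number[]) {
    for (let i = 0; i < ids.length; i++) {
      this.rows.push({ id: ids[i], status: "queued", workerId: null });
    }
  }

  /** Pick step: read the first queued id without locking it. */
  pickQueued(): number | null {
    for (let i = 0; i < this.rows.length; i++) {
      if (this.rows[i].status === "queued") return this.rows[i].id;
    }
    return null;
  }

  /** CAS step: succeeds only if the row is still queued, like a status='queued' predicate. */
  compareAndClaim(id: number, workerId: string): boolean {
    for (let i = 0; i < this.rows.length; i++) {
      const row = this.rows[i];
      if (row.id === id && row.status === "queued") {
        row.status = "running";
        row.workerId = workerId;
        return true;
      }
    }
    return false; // another worker won the row, or it does not exist
  }
}

/** Pick, then CAS; a loser simply picks again, up to maxAttempts times. */
function claimNext(queue: InMemorySimQueue, workerId: string, maxAttempts: number = 5): number | null {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const id = queue.pickQueued();
    if (id === null) return null;                         // nothing left to claim
    if (queue.compareAndClaim(id, workerId)) return id;   // won the race
  }
  return null; // gave up after repeated losses
}
```

If two workers both pick the same row, only the first compareAndClaim succeeds; the loser's next pick skips the now-running row and moves on. With many workers the wasted picks grow, which is the case where SKIP LOCKED would help.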

Alternatives considered

Alt A — Keep one always-on machine

Simplest; no new code.
Not picked: the cost and serial-throughput problems above are exactly what prompted this. The machine is idle the vast majority of the time.

Alt B — Fly auto-stop/auto-start via an HTTP shim

Give the worker an HTTP service so Fly's proxy auto-starts/stops it.
Fly measures idleness by HTTP connections, but our work is DB-queued with no HTTP traffic, so a long backtest with no requests would look idle and get stopped. Avoiding that would need an artificial keep-alive. Not picked: it fights Fly's idle-detection mechanism instead of using it.

Alt C — Scale 0↔1 only (cost win, no parallelism)

Wake one worker on enqueue, self-stop when idle; keep the boot orphan sweep as-is (safe with a single writer).
Smaller change, but leaves the serial-throughput problem unsolved.
Not picked: we want parallel drain. Going to N is what forces the heartbeat-staleness rework — accepted as the cost of throughput.

Alt D — Vercel cron poller

A scheduled endpoint counts queued rows and provisions/destroys.
Decoupled from enqueue, but the Hobby sub-daily-cron freeze means it can't poll often enough to be responsive without a plan change. Not picked.
