## Problem

`ant dev start` becomes unable to bring up a working devnet after a few failed/interrupted starts on the same machine, even with a full LXC/container restart in between. Symptom: `[1/3] Starting ant devnet…` → `Timed out waiting for devnet manifest` repeatedly, with `ant-devnet` apparently making no progress. The fix that always works is `rm -rf ~/.local/share/ant`.
## Root cause

`src/bin/ant-devnet/main.rs` (in WithAutonomi/ant-node) has two cleanup gaps that compound:
- **Hardcoded persistent data dir** — `default_root_dir()` returns `~/.local/share/ant/` on Linux. Every run writes ~25 node identities under `nodes/` and (unless `Devnet::shutdown()` is called) leaves them on disk. They have to be removed manually before the next run can be trusted.
- **Leaked anvil, no signal handling** — `std::mem::forget(testnet)` leaks the `AnvilInstance` to keep anvil running past `Testnet::new`'s scope. There is no `tokio::signal` handler, so when SIGTERM lands (which is exactly what `ant dev stop`'s `_kill_pid` sends, and what `cmd_start.py` sends when its 6-min wait elapses), the anvil child is never killed and the data-dir cleanup never runs. Anvil is orphaned; the directory grows.
After a few interrupted cycles you have orphan anvils sitting on random ports plus partial node identities, and the next `ant-devnet --preset default` can't stabilize 25 nodes inside the wait window. From the outside this looks like a hard hang.
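For intuition, here is a minimal, self-contained sketch of the mechanism (toy types, not the actual main.rs code):

```rust
// Toy reproduction of the cleanup gap. `Anvil` stands in for AnvilInstance,
// whose Drop is the only thing that kills the anvil child process.
struct Anvil;
impl Drop for Anvil {
    fn drop(&mut self) {
        println!("anvil child killed"); // the real Drop kills the child here
    }
}

fn main() {
    let testnet = Anvil;
    std::mem::forget(testnet); // Drop suppressed: anvil outlives this process
    // Even without the forget, a SIGTERM delivered here would terminate the
    // process immediately -- no unwinding, no Drop -- because nothing installs
    // a handler that converts the signal into a normal return from main.
}
```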
## Bisect (so the failure mode is reproducible)

First, 10 successful `ant dev start` / `ant dev stop` cycles with no manual cleanup between them, running `pkill -9 anvil` between cycles to remove the orphan-anvil variable. Stale node dirs alone aren't the trigger — accumulated dirs from successful cycles, up to 225 of them, didn't slow startup beyond noise:
| iter | stale node dirs before | start time | outcome |
|------|------------------------|------------|---------|
| 1    | 0   | 15 s | ok |
| 2    | 25  | 17 s | ok |
| 3    | 50  | 13 s | ok |
| 4    | 75  | 19 s | ok |
| 5    | 100 | 16 s | ok |
| 6    | 125 | 14 s | ok |
| 7    | 150 | 16 s | ok |
| 8    | 175 | 19 s | ok |
| 9    | 200 | 13 s | ok |
| 10   | 225 | 16 s | ok |
The hang reproduces reliably with a different mix: cycles where `ant dev start` is killed by its own manifest-wait timeout (i.e. the devnet partially started but didn't finish stabilizing). Those leave behind (a) zombie `[ant-devnet] <defunct>` entries because the Python parent never reaped them, (b) orphan `anvil` processes, and (c) half-written node dirs. With 6–10 such interrupted cycles, subsequent `ant dev start` calls hang indefinitely until the LXC is restarted and `~/.local/share/ant/` is wiped.
## How the rest of the ecosystem does this
| Where | Isolation | Cleanup discipline |
|-------|-----------|--------------------|
| `ant-node/tests/e2e/testnet.rs` | `std::env::temp_dir().join("ant_test_{rand:x}")` — unique per run | RAII; the in-process tokio tasks die with the test process |
| `ant-node/scripts/test_e2e.sh` | `MANIFEST_FILE="/tmp/ant_e2e_manifest_${TEST_RUN_ID}.json"` | `trap cleanup EXIT` kills the devnet PID and lingering children |
| `ant-client/ant-core/tests/support/mod.rs` `MiniTestnet` | `tempfile::TempDir` per node, held in a `_temp_dirs: Vec<TempDir>` field | RAII via `Drop`; `_testnet: Testnet` field keeps anvil alive only for the testnet's lifetime |
| `ant-devnet` (this bin) | Hardcoded `~/.local/share/ant/`, shared across runs | `cleanup_data_dir: true` only on graceful `Devnet::shutdown()`; `std::mem::forget(testnet)` leaks anvil |
`ant-devnet` is the only one that holds state across runs, and the only one whose cleanup depends on Rust destructors running — which they don't, on SIGTERM.
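For reference, the `MiniTestnet` discipline in the table is plain RAII; a sketch of the shape (field names from the table above, everything else illustrative):

```rust
use tempfile::TempDir;

// RAII sketch: every resource is a struct field, so cleanup is just Drop
// running when the value goes out of scope. No shared on-disk state survives.
struct MiniTestnet {
    _temp_dirs: Vec<TempDir>, // per-node dirs, deleted when this struct drops
    // _testnet: Testnet,     // would own anvil; the child dies with the field
}

fn run_test() -> std::io::Result<()> {
    let _testnet = MiniTestnet { _temp_dirs: vec![TempDir::new()?] };
    // ... exercise the network ...
    Ok(()) // _testnet drops here; the temp dirs are gone
}
```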
## Proposed fix
Two options, either of which addresses the hang class:
### A. Make `ant-devnet` match its siblings (preferred)

In `src/bin/ant-devnet/main.rs`:
- Default the data dir to a `tempfile::TempDir` (or `temp_dir().join("ant_devnet_{rand}")`) when `--manifest` is given without an explicit `--data-dir`. The manifest already encodes the path, so consumers can discover it.
- Install a tokio signal handler that, on SIGINT/SIGTERM:
  - calls `devnet.shutdown().await` to honour `cleanup_data_dir`,
  - drops (rather than forgets) the `AnvilInstance`, killing the anvil child,
  - then exits 0.
- Drop the `std::mem::forget(testnet)` and instead store the `Testnet` in a long-lived variable in `main` so its `Drop` runs on normal exit too.
Roughly a single-file change of ~40 lines. It mirrors `MiniTestnet`'s lifetime discipline.
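A sketch of the shape this could take (`Testnet`/`Devnet` below are stubs for the real types, whose constructors and `--data-dir` plumbing are elided; the `tempfile` and `tokio::signal` calls are real APIs):

```rust
use tokio::signal::unix::{signal, SignalKind};

struct Testnet;                // stub: would own the AnvilInstance
struct Devnet;                 // stub: would own the ~25 node tasks
impl Devnet {
    async fn shutdown(self) {} // stub: the real one honours cleanup_data_dir
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Per-run data dir; RAII-deleted when `data_dir` drops at the end of main.
    let data_dir = tempfile::TempDir::with_prefix("ant_devnet_")?;
    eprintln!("data dir: {}", data_dir.path().display());

    let testnet = Testnet; // real code: Testnet::new(...), *without* mem::forget
    let devnet = Devnet;   // real code: start nodes under data_dir, write manifest

    // Block until SIGINT (Ctrl-C) or SIGTERM (what ant dev stop's _kill_pid
    // and the 6-min timeout actually send), then tear down in order.
    let mut sigterm = signal(SignalKind::terminate())?;
    tokio::select! {
        _ = tokio::signal::ctrl_c() => {},
        _ = sigterm.recv() => {},
    }

    devnet.shutdown().await; // graceful node teardown + data-dir cleanup
    drop(testnet);           // Drop kills the anvil child instead of leaking it
    Ok(())                   // data_dir's TempDir removes the per-run dir
}
```

With this shape, the graceful path (`ant dev stop` → SIGTERM) and a Ctrl-C converge on the same teardown, and even a `kill -9` can no longer poison later runs, because each run gets a fresh temp dir.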
### B. Fix at the `ant-dev` layer (smaller surface, less ideal)

In `ant-dev/src/ant_dev/cmd_start.py` and `cmd_stop.py`:
- Generate a per-run manifest path (`/tmp/ant_devnet_{pid}_{ts}.json` or under `~/.ant-dev/runs/<id>/`) and pass it through.
- Have `cmd_stop` `pkill -f anvil` and `rm -rf` the run-specific data dir, mirroring `test_e2e.sh`'s `trap cleanup EXIT`.
- Have `cmd_start` register a Python `atexit` / signal handler that does the same, so a Ctrl-C during the manifest wait doesn't leak.
This works around the symptom without changing `ant-devnet`. Worth doing anyway if (A) takes time.
## Repro

```sh
# fresh state
rm -rf ~/.local/share/ant ~/.ant-dev/*

# induce a few "killed mid-startup" cycles — easiest is to start and then kill
# during the manifest wait, 6–8 times. Each leaves a zombie + an orphan anvil.
for i in 1 2 3 4 5 6 7 8; do
  ant dev start --ant-node-dir ~/Projects/ant-node &
  sleep 5
  kill -9 $!
done

# after this, even a clean `ant dev start` hangs in [1/3] until you either
# restart the host/LXC or `rm -rf ~/.local/share/ant`.
```
## Environment

- Ubuntu 24.04.4 LTS in an Incus LXC (Intel Meteor Lake host)
- ant-sdk + ant-node at `main` (today's clone)
- antd v0.6.1 / `ant-devnet` built from the cloned ant-node tree
Found while iterating on the cross-SDK e2e harness; it connects to the earlier report #64 (undocumented anvil prereq) and underlies several flake patterns.