
ant-devnet lifecycle/cleanup diverges from ant-node's own testnet patterns; ant dev start hangs after kill-mid-spawn cycles #73

@Nic-dorman

Problem

ant dev start becomes unable to bring up a working devnet after a few failed/interrupted starts on the same machine, even with a full LXC/container restart in between. Symptom: the "[1/3] Starting ant devnet…" step ends in "Timed out waiting for devnet manifest" on every attempt, with ant-devnet apparently making no progress. The fix that always works is rm -rf ~/.local/share/ant.

Root cause

src/bin/ant-devnet/main.rs (in WithAutonomi/ant-node) has two cleanup gaps that compound:

  1. Hardcoded persistent data dir: default_root_dir() returns ~/.local/share/ant/ on Linux. Every run writes ~25 node identities under nodes/ and (unless Devnet::shutdown() is called) leaves them on disk. They have to be removed manually before the next run can be trusted.
  2. std::mem::forget(testnet) leaks the AnvilInstance to keep anvil running past Testnet::new's scope. There is no tokio::signal handler, so when SIGTERM lands (which is exactly what ant dev stop's _kill_pid sends, and what cmd_start.py sends when its 6-min wait elapses) neither the anvil teardown nor the data-dir cleanup runs. anvil is orphaned; the directory grows.
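
To illustrate the second gap: a forgotten value never has its Drop run, so any teardown that lives in Drop is skipped. The sketch below uses a hypothetical ChildGuard as a stand-in for a handle like AnvilInstance (it only bumps a counter rather than killing a process); none of these names come from ant-node.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a handle whose Drop would kill a child process.
/// Here Drop only increments a counter so the effect can be observed.
struct ChildGuard {
    cleanups: Arc<AtomicUsize>,
}

impl Drop for ChildGuard {
    fn drop(&mut self) {
        self.cleanups.fetch_add(1, Ordering::SeqCst);
    }
}

/// Number of cleanups that ran when the guard is dropped normally.
fn cleanups_after_drop() -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let guard = ChildGuard { cleanups: Arc::clone(&counter) };
    drop(guard); // Drop::drop runs here
    counter.load(Ordering::SeqCst)
}

/// Number of cleanups that ran when the guard is passed to std::mem::forget.
fn cleanups_after_forget() -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let guard = ChildGuard { cleanups: Arc::clone(&counter) };
    std::mem::forget(guard); // Drop::drop is never called for this value
    counter.load(Ordering::SeqCst)
}

fn main() {
    println!("drop:   {} cleanup(s)", cleanups_after_drop());
    println!("forget: {} cleanup(s)", cleanups_after_forget());
}
```

With forget, the only thing that can still stop the child is an explicit kill from somewhere else, which is why an unhandled SIGTERM leaves anvil behind.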

After a few interrupted cycles you have orphan anvils sitting on random ports plus partial node identities, and the next ant-devnet --preset default can't stabilize 25 nodes inside the wait window. From the outside this looks like a hard hang.

Bisect (so the failure mode is reproducible)

10 successful ant dev start / ant dev stop cycles, no manual cleanup between them, with pkill -9 anvil between cycles to remove the orphan-anvil variable. Stale node dirs alone aren't the trigger — accumulated dirs from successful cycles up to 225 didn't slow startup beyond noise:

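| iter | stale node dirs before | start time | outcome |
|------|------------------------|------------|---------|
| 1    | 0                      | 15 s       | ok      |
| 2    | 25                     | 17 s       | ok      |
| 3    | 50                     | 13 s       | ok      |
| 4    | 75                     | 19 s       | ok      |
| 5    | 100                    | 16 s       | ok      |
| 6    | 125                    | 14 s       | ok      |
| 7    | 150                    | 16 s       | ok      |
| 8    | 175                    | 19 s       | ok      |
| 9    | 200                    | 13 s       | ok      |
| 10   | 225                    | 16 s       | ok      |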

The hang reproduces reliably with a different mix: cycles where ant dev start is killed by its own manifest-wait timeout (i.e. devnet partially started but didn't finish stabilizing). Those leave behind (a) zombie [ant-devnet] <defunct> entries because the python parent reaped first, (b) orphan anvil processes, and (c) half-written node dirs. With 6–10 such interrupted cycles, subsequent ant dev start calls hang indefinitely until the LXC is restarted and ~/.local/share/ant/ is wiped.

How the rest of the ecosystem does this

| Where | Isolation | Cleanup discipline |
|-------|-----------|--------------------|
| ant-node/tests/e2e/testnet.rs | std::env::temp_dir().join("ant_test_{rand:x}"), unique per run | RAII; the in-process tokio tasks die with the test process |
| ant-node/scripts/test_e2e.sh | MANIFEST_FILE="/tmp/ant_e2e_manifest_${TEST_RUN_ID}.json" | trap cleanup EXIT kills the devnet PID and lingering children |
| ant-client/ant-core/tests/support/mod.rs MiniTestnet | tempfile::TempDir per node, held in a _temp_dirs: Vec<TempDir> field | RAII via Drop; _testnet: Testnet field keeps anvil alive only for the testnet's lifetime |
| ant-devnet (this bin) | Hardcoded ~/.local/share/ant/, shared across runs | cleanup_data_dir: true only on graceful Devnet::shutdown(); std::mem::forget(testnet) leaks anvil |

ant-devnet is the only one that holds state across runs, and the only one whose cleanup depends on Rust destructors running — which they don't, on SIGTERM.

Proposed fix

Two options, either of which addresses the hang class:

A. Make ant-devnet match its siblings (preferred)

In src/bin/ant-devnet/main.rs:

  1. Default the data dir to a tempfile::TempDir (or temp_dir().join("ant_devnet_{rand}")) when --manifest is given without an explicit --data-dir. The manifest already encodes the path, so consumers can discover it.
  2. Install a tokio signal handler that, on SIGINT/SIGTERM:
    • Calls devnet.shutdown().await to honour cleanup_data_dir.
    • Drops (rather than forgets) the AnvilInstance, killing the anvil child.
    • Then exits 0.
  3. Drop the std::mem::forget(testnet) and instead store the Testnet in a long-lived variable in main so its Drop runs on normal exit too.

Roughly a single-file change of ~40 lines. Mirrors MiniTestnet's lifetime discipline.
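
A minimal std-only sketch of steps 1 and 3, under stated assumptions: RunDir and its naming scheme are hypothetical, not code from ant-node, and a real change would more likely use tempfile::TempDir as MiniTestnet does. It shows a unique per-run directory that is removed when its owner goes out of scope.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical per-run data directory that deletes itself on drop.
struct RunDir {
    path: PathBuf,
}

impl RunDir {
    /// Creates <temp_dir>/<prefix>_<pid>_<nanos-hex>/nodes.
    fn create(prefix: &str) -> io::Result<Self> {
        let nanos = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_nanos())
            .unwrap_or(0);
        let name = format!("{}_{}_{:x}", prefix, process::id(), nanos);
        let path = std::env::temp_dir().join(name);
        fs::create_dir_all(path.join("nodes"))?;
        Ok(Self { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for RunDir {
    fn drop(&mut self) {
        // Best-effort cleanup; errors are ignored because Drop cannot return them.
        let _ = fs::remove_dir_all(&self.path);
    }
}

/// Creates a run dir, writes a placeholder file, drops the guard, and reports
/// whether the directory existed (while held, after drop).
fn run_once() -> io::Result<(bool, bool)> {
    let dir = RunDir::create("ant_devnet")?;
    let path = dir.path().to_path_buf();
    fs::write(path.join("nodes").join("node_0.key"), b"placeholder")?;
    let existed_while_held = path.exists();
    drop(dir);
    Ok((existed_while_held, path.exists()))
}

fn main() -> io::Result<()> {
    let (during, after) = run_once()?;
    println!("exists while held: {during}, exists after drop: {after}");
    Ok(())
}
```

This only helps if main actually returns. Step 2 is what makes that happen on SIGINT/SIGTERM: the handler has to turn the signal into a normal return so that Drop runs, instead of letting the default disposition terminate the process.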

B. Fix at the ant-dev layer (smaller surface, less ideal)

In ant-dev/src/ant_dev/cmd_start.py and cmd_stop.py:

  • Generate a per-run manifest path (/tmp/ant_devnet_{pid}_{ts}.json or under ~/.ant-dev/runs/<id>/) and pass it through.
  • Have cmd_stop pkill -f anvil and rm -rf the run-specific data dir, mirroring test_e2e.sh's trap cleanup EXIT.
  • Have cmd_start register a Python atexit / signal handler that does the same so a Ctrl-C during the manifest wait doesn't leak.

This works around the symptom without changing ant-devnet. Worth doing anyway if (A) takes time.

Repro

# fresh state
rm -rf ~/.local/share/ant ~/.ant-dev/*

# induce a few "killed mid-startup" cycles — easiest way is to start, ctrl-c
# during the manifest wait, repeat 6–8 times. Each leaves a zombie + an orphan anvil.
for i in 1 2 3 4 5 6 7 8; do
  ant dev start --ant-node-dir ~/Projects/ant-node & pid=$!; sleep 5; kill -9 "$pid"
done

# after this, even a clean `ant dev start` hangs in [1/3] until you either
# restart the host/LXC or `rm -rf ~/.local/share/ant`.

Environment

  • Ubuntu 24.04.4 LTS in an Incus LXC (Intel Meteor Lake host)
  • ant-sdk + ant-node at main (today's clone)
  • antd v0.6.1 / ant-devnet built from the cloned ant-node tree

Found while iterating on the cross-SDK e2e harness; this ties in with the earlier report #64 (undocumented anvil prereq) and underlies several flake patterns.
