ant-devnet lifecycle/cleanup diverges from ant-node's own testnet patterns; ant dev start hangs after kill-mid-spawn cycles #73

## Problem

`ant dev start` becomes unable to bring up a working devnet after a few failed/interrupted starts on the same machine, even with a full LXC/container restart in between. Symptom: `[1/3] Starting ant devnet…` → `Timed out waiting for devnet manifest` repeatedly, with `ant-devnet` apparently making no progress. The fix that always works is `rm -rf ~/.local/share/ant`.

## Root cause

`src/bin/ant-devnet/main.rs` (in [WithAutonomi/ant-node](https://github.com/WithAutonomi/ant-node)) has two cleanup gaps that compound:

1. **Hardcoded persistent data dir** — `default_root_dir()` returns `~/.local/share/ant/` on Linux. Every run writes ~25 node identities under `nodes/` and (unless `Devnet::shutdown()` is called) leaves them on disk. They have to be removed manually before the next run can be trusted.
2. **`std::mem::forget(testnet)` leaks the `AnvilInstance`** to keep anvil running past `Testnet::new`'s scope. There is no `tokio::signal` handler, so when SIGTERM lands (which is exactly what `ant dev stop`'s `_kill_pid` sends, and what `cmd_start.py` sends when its 6-min wait elapses) **neither the anvil teardown nor the data-dir cleanup runs**. The anvil process is orphaned and the directory keeps growing.

After a few interrupted cycles you have orphan anvils sitting on random ports plus partial node identities, and the next `ant-devnet --preset default` can't stabilize 25 nodes inside the wait window. From the outside this looks like a hard hang.

## Bisect (so the failure mode is reproducible)

I ran 10 successful `ant dev start` / `ant dev stop` cycles with no manual cleanup between them, running `pkill -9 anvil` between cycles to remove the orphan-anvil variable. **Stale node dirs alone aren't the trigger**: up to 225 accumulated dirs from *successful* cycles didn't slow startup beyond noise:

| iter | stale node dirs before | start time | outcome |
|---:|---:|---:|---|
| 1 |   0 | 15 s | ok |
| 2 |  25 | 17 s | ok |
| 3 |  50 | 13 s | ok |
| 4 |  75 | 19 s | ok |
| 5 | 100 | 16 s | ok |
| 6 | 125 | 14 s | ok |
| 7 | 150 | 16 s | ok |
| 8 | 175 | 19 s | ok |
| 9 | 200 | 13 s | ok |
| 10 | 225 | 16 s | ok |
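
A table like the one above could be collected with a small harness along these lines. This is a sketch, not the script that was actually used: `count_node_dirs` and `run_cycle` are hypothetical helpers, and it assumes the start command blocks until the devnet is up and exits 0 on success.

```python
import subprocess
import time
from pathlib import Path


def count_node_dirs(root: Path) -> int:
    """Count node directories under <root>/nodes (0 if the dir is absent)."""
    nodes = root / "nodes"
    if not nodes.is_dir():
        return 0
    return sum(1 for entry in nodes.iterdir() if entry.is_dir())


def run_cycle(start_cmd, stop_cmd, root: Path, between=()):
    """Run one start/stop cycle; return (stale_dirs_before, seconds, ok)."""
    stale = count_node_dirs(root)
    t0 = time.monotonic()
    # Assumption: exit code 0 from the start command means the devnet came up.
    ok = subprocess.run(start_cmd).returncode == 0
    elapsed = time.monotonic() - t0
    subprocess.run(stop_cmd)
    for cmd in between:  # e.g. ["pkill", "-9", "anvil"] between cycles
        subprocess.run(cmd)
    return stale, elapsed, ok
```

For the run described here the arguments would be something like `["ant", "dev", "start", "--ant-node-dir", "<path>"]`, `["ant", "dev", "stop"]`, and `Path.home() / ".local/share/ant"`, with `between=[["pkill", "-9", "anvil"]]`.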

The hang reproduces reliably with **a different mix**: cycles where `ant dev start` is killed by its own manifest-wait timeout (i.e. the devnet partially started but didn't finish stabilizing). Those leave behind (a) zombie `[ant-devnet] <defunct>` entries, apparently because the Python parent did not reap them, (b) orphan `anvil` processes, and (c) half-written node dirs. With 6–10 such interrupted cycles, subsequent `ant dev start` calls hang indefinitely until the LXC is restarted **and** `~/.local/share/ant/` is wiped.
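
To check for leftovers (a) and (b) before a run, something like the following could be used. It is a rough sketch: `classify_leftovers` and `scan_host` are hypothetical names, and it assumes orphaned `anvil` processes get reparented to PID 1 inside the container, which may not hold if a subreaper is active.

```python
import subprocess


def classify_leftovers(ps_output: str, init_pid: int = 1) -> dict:
    """Pick zombie ant-devnet and orphaned anvil PIDs out of ps output.

    Expects one process per line as "pid ppid stat comm", with no header.
    """
    zombies, orphans = [], []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        name = comm.split()[0]  # drops a trailing "<defunct>" marker
        is_zombie = stat.startswith("Z")
        if name == "ant-devnet" and is_zombie:
            zombies.append(int(pid))
        elif name == "anvil" and not is_zombie and int(ppid) == init_pid:
            # Assumption: an anvil whose parent is init has lost its devnet.
            orphans.append(int(pid))
    return {"zombie_devnets": zombies, "orphan_anvils": orphans}


def scan_host() -> dict:
    """Run ps with empty column headers and classify its output."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "stat=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return classify_leftovers(out)
```

Calling `scan_host()` after an interrupted cycle should list the PIDs to look at; half-written node dirs (c) still need a look under `~/.local/share/ant/nodes/`.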

## How the rest of the ecosystem does this

| Where | Isolation | Cleanup discipline |
|---|---|---|
| `ant-node/tests/e2e/testnet.rs` | `std::env::temp_dir().join("ant_test_{rand:x}")` — unique per run | RAII; the in-process tokio tasks die with the test process |
| `ant-node/scripts/test_e2e.sh` | `MANIFEST_FILE="/tmp/ant_e2e_manifest_${TEST_RUN_ID}.json"` | `trap cleanup EXIT` kills the devnet PID **and** lingering children |
| `ant-client/ant-core/tests/support/mod.rs` `MiniTestnet` | `tempfile::TempDir` per node, held in a `_temp_dirs: Vec<TempDir>` field | RAII via `Drop`; `_testnet: Testnet` field keeps anvil alive *only* for the testnet's lifetime |
| **`ant-devnet`** (this bin) | **Hardcoded `~/.local/share/ant/`, shared across runs** | **`cleanup_data_dir: true` only on graceful `Devnet::shutdown()`; `std::mem::forget(testnet)` leaks anvil** |

`ant-devnet` is the only one that holds state across runs, and the only one whose cleanup depends on Rust destructors running — which they don't, on SIGTERM.
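
The same failure mode is easy to see outside Rust. The sketch below is a Python analogue, not the `ant-devnet` code: a child process registers an `atexit` cleanup (standing in for `Drop` / `cleanup_data_dir`), and the parent sends it SIGTERM. With the default signal disposition the cleanup never runs; with a handler that turns the signal into a normal exit, it does. `CHILD` and `cleanup_ran` are illustrative names.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child: creates a marker file and registers its removal as exit-time cleanup.
CHILD = r"""
import atexit, os, signal, sys, time
marker, mode = sys.argv[1], sys.argv[2]
open(marker, "w").close()
atexit.register(os.remove, marker)
if mode == "handler":
    # Convert SIGTERM into a normal interpreter exit so atexit hooks run.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
time.sleep(30)
"""


def cleanup_ran(mode: str) -> bool:
    """SIGTERM the child and report whether its exit-time cleanup executed."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "data-dir-marker")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker, mode],
            stdout=subprocess.PIPE, text=True,
        )
        proc.stdout.readline()  # block until the child has set itself up
        proc.send_signal(signal.SIGTERM)
        proc.wait(timeout=10)
        proc.stdout.close()
        return not os.path.exists(marker)
```

Here `cleanup_ran("default")` corresponds to today's `ant-devnet` behaviour and `cleanup_ran("handler")` to the behaviour a signal handler would give.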

## Proposed fix

Two options, either of which addresses the hang class:

### A. Make `ant-devnet` match its siblings (preferred)

In `src/bin/ant-devnet/main.rs`:

1. Default the data dir to a `tempfile::TempDir` (or `temp_dir().join("ant_devnet_{rand}")`) when `--manifest` is given without an explicit `--data-dir`. The manifest already encodes the path, so consumers can discover it.
2. Install a tokio signal handler that, on SIGINT/SIGTERM:
   - Calls `devnet.shutdown().await` to honour `cleanup_data_dir`.
   - Drops (rather than `forget`s) the `AnvilInstance`, killing the anvil child.
   - Then exits 0.
3. Drop the `std::mem::forget(testnet)` and instead store the `Testnet` in a long-lived variable in `main` so its `Drop` runs on normal exit too.

Roughly a single-file change of ~40 lines. Mirrors `MiniTestnet`'s lifetime discipline.

### B. Fix at the `ant-dev` layer (smaller surface, less ideal)

In `ant-dev/src/ant_dev/cmd_start.py` and `cmd_stop.py`:

- Generate a per-run manifest path (`/tmp/ant_devnet_{pid}_{ts}.json` or under `~/.ant-dev/runs/<id>/`) and pass it through.
- Have `cmd_stop` `pkill -f anvil` and `rm -rf` the run-specific data dir, mirroring `test_e2e.sh`'s `trap cleanup EXIT`.
- Have `cmd_start` register a Python `atexit` / signal handler that does the same so a Ctrl-C during the manifest wait doesn't leak.

This works around the symptom without changing `ant-devnet`. Worth doing anyway if (A) takes time.
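
A minimal sketch of what (B) could look like, under stated assumptions: `start_run`, `stop_run` and `install_cleanup` are hypothetical helpers rather than existing `ant-dev` functions, and the sketch signals the devnet's own process group instead of using `pkill -f anvil`, so unrelated anvil instances on the host are left alone. That substitution only works if anvil stays in the devnet's process group.

```python
import atexit
import os
import shutil
import signal
import subprocess
import sys
import tempfile
from pathlib import Path


def start_run(build_cmd, base: Path):
    """Create a unique run dir and launch the devnet in its own process group."""
    base.mkdir(parents=True, exist_ok=True)
    run_dir = Path(tempfile.mkdtemp(prefix=f"run_{os.getpid()}_", dir=base))
    # start_new_session=True makes the child a group leader, so the child and
    # anything it spawns can be signalled together with killpg().
    proc = subprocess.Popen(build_cmd(run_dir), start_new_session=True)
    return proc, run_dir


def stop_run(proc, run_dir: Path, grace: float = 5.0) -> None:
    """Signal the whole process group, reap the leader, delete the run dir."""
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # nothing left in the group
        try:
            proc.wait(timeout=grace)  # reaping avoids a <defunct> entry
        except subprocess.TimeoutExpired:
            pass  # leader ignored SIGTERM; the next pass sends SIGKILL
    shutil.rmtree(run_dir, ignore_errors=True)


def install_cleanup(proc, run_dir: Path) -> None:
    """Make stop_run fire on normal exit, Ctrl-C, and SIGTERM."""
    atexit.register(stop_run, proc, run_dir)
    # Turn SIGTERM into SystemExit so the atexit hook above actually runs.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(143))
```

`build_cmd` receives the run dir and returns the argv, for example `["ant-devnet", "--preset", "default", "--manifest", str(d / "manifest.json"), "--data-dir", str(d / "data")]`; the flags are the ones mentioned in this issue, while the file names are placeholders.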

## Repro

```bash
# fresh state
rm -rf ~/.local/share/ant ~/.ant-dev/*

# induce a few "killed mid-startup" cycles: start in the background, kill
# during the manifest wait, repeat 6–8 times (ctrl-c by hand works too).
# Each leaves a zombie + an orphan anvil.
for i in 1 2 3 4 5 6 7 8; do
  # no subshell around the job: `$!` is only set for jobs started by this shell
  ant dev start --ant-node-dir ~/Projects/ant-node & sleep 5; kill -9 "$!"
done

# after this, even a clean `ant dev start` hangs in [1/3] until you
# `rm -rf ~/.local/share/ant` (a host/LXC restart alone does not clear it).
```

## Environment

- Ubuntu 24.04.4 LTS in an Incus LXC (Intel Meteor Lake host)
- ant-sdk + ant-node at `main` (today's clone)
- antd v0.6.1 / ant-devnet built from the cloned ant-node tree

---
Found while iterating on the cross-SDK e2e harness; relates to the earlier report #64 (undocumented anvil prereq) and underlies several flake patterns.
