fix(ant-dev): clean up orphan anvil/antnode and stale node identities on stop#81

Open
Nic-dorman wants to merge 1 commit into main from fix/ant-dev-stop-orphan-anvil

Conversation

@Nic-dorman
Collaborator

Helps mitigate #73 (Option B path).

ant-devnet keeps anvil alive past Testnet::new's scope with std::mem::forget(testnet) and relies on graceful Drop at process exit to clean it up. SIGTERM/SIGKILL skip destructors, so every ant dev stop leaks one anvil child and one ~/.local/share/ant/nodes/<peer_id>/ directory per spawned node (25 dirs on the default preset). After a handful of start/stop cycles — and especially after kill-mid-startup events — the LXC accumulates orphan anvils plus 100+ stale node dirs, and subsequent ant dev start runs flake or hang.

This is the Option B workaround proposed in #73 (the band-aid at the ant-dev layer). The proper fix is Option A: change ant-devnet/main.rs to use tempfile::TempDir + a tokio signal handler, mirroring how ant-client's MiniTestnet and ant-node's tests/e2e/testnet.rs already do it. That lives in WithAutonomi/ant-node and will go up as a separate PR there.
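Option A itself is Rust code in ant-devnet, which this PR does not show. As a loose Python analog of the same pattern (a temporary data directory plus a signal handler that turns SIGTERM into a normal exit so cleanup runs), a minimal sketch might look like the following. The names `CHILD`, `run_and_terminate`, and the `devnet-sketch-` prefix are hypothetical and do not come from ant-devnet.

```python
import os
import signal
import subprocess
import sys

# Hypothetical child program: it owns a temporary data dir and converts
# SIGTERM into a normal exit so the `with` block can remove the dir.
CHILD = r"""
import signal, sys, tempfile, time

def on_term(signum, frame):
    # Raising SystemExit unwinds the stack, so the context manager runs.
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)

with tempfile.TemporaryDirectory(prefix="devnet-sketch-") as data_dir:
    print(data_dir, flush=True)  # tell the parent where the data lives
    while True:
        time.sleep(0.1)
"""


def run_and_terminate():
    """Start the child, SIGTERM it, and return (data_dir, cleaned_up)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    # The child prints the path only after its handler is installed.
    data_dir = proc.stdout.readline().strip()
    existed = os.path.isdir(data_dir)
    proc.send_signal(signal.SIGTERM)
    proc.wait(timeout=10)
    proc.stdout.close()
    return data_dir, existed and not os.path.exists(data_dir)
```

SIGKILL cannot be caught, so a handler like this only helps for catchable signals such as SIGTERM and SIGINT.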

Changes in ant dev stop

  • pkill -9 -f anvil and pkill -9 -f .../antnode in addition to the existing ant-devnet pkill
  • rm -rf ~/.local/share/ant/{nodes,spill} so the next ant dev start begins from a clean slate
  • Centralised the existing pkill call sites into a _pkill() helper for readability

No behaviour change on Windows — pkill and the data-dir cleanup are POSIX-only branches.

Test plan

  • Before: ant dev start followed by ant dev stop left an orphan anvil and 25 dirs in ~/.local/share/ant/nodes/. Reproducible every run.
  • After: the same start/stop cycle leaves zero processes and an empty data dir:
    --- after stop: should be empty ---
    (none - clean)
    --- nodes dir gone? ---
    ls: cannot access '/home/nic/.local/share/ant/nodes': No such file or directory
    (no nodes dir - clean)
    
  • Full cross-SDK e2e harness still green after the change (no SDK breakage)
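The "nodes dir" half of the check in the test plan could be scripted roughly as follows. This is a sketch only: `stale_node_dirs` and its `base` parameter are hypothetical names, and the PR does not show the actual check that produced the output above.

```python
from pathlib import Path


def stale_node_dirs(base):
    """Return the names of per-node directories left under `base`/nodes."""
    nodes = Path(base) / "nodes"
    if not nodes.is_dir():
        return []  # no nodes dir at all counts as clean
    return sorted(p.name for p in nodes.iterdir() if p.is_dir())
```

An empty list after `ant dev stop` corresponds to the "(no nodes dir - clean)" line in the output above.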

… on stop

ant-devnet keeps anvil alive past Testnet::new's scope via std::mem::forget
on the AnvilInstance, then relies on graceful Drop at process exit to
clean it up. SIGTERM/SIGKILL skip destructors, so every ant dev stop
leaks one anvil child and one ~/.local/share/ant/nodes/<peer_id>/ tree
for each of the spawned nodes. After a handful of start/stop or
killed-mid-startup cycles, the LXC accumulates orphan anvils plus 100+
stale node dirs, and subsequent ant dev start runs flake or hang.

This is a workaround at the ant-dev layer (Option B in #73). The proper
fix lives in ant-devnet itself (Option A: tempfile::TempDir + tokio
signal handler, mirroring how ant-client's MiniTestnet and ant-node's
tests/e2e/testnet.rs already do it) and will be a separate PR against
WithAutonomi/ant-node.

In ant dev stop now:
- pkill anvil and antnode in addition to ant-devnet
- rm -rf ~/.local/share/ant/nodes and ~/.local/share/ant/spill so the
  next start begins from a clean state
- Centralise the pkill calls into a _pkill() helper

No behaviour change on Windows (the pkill / rm paths are POSIX-only).

Closes #16 (local task); helps mitigate #73 (upstream).