Save point triggering during Ctrl+C or allocation wall limit and restarting from save point

# Written via Claude Code
# Falcon 0.4.1 — Checkpoint & Resume Enhancement Issues

Three related but independently-fileable GitHub issues for falcon 0.4.1.
They share a common need (durable training state) but solve different
failure modes:

- **Issue 1** — save when the user interrupts (Ctrl+C, kill).
- **Issue 2** — save before a scheduled job runs out of wall-clock time.
- **Issue 3** — restart a previously-saved run from where it stopped.

Issue 3 is the consumer of artefacts produced by Issues 1 & 2, but they
can be implemented and merged in any order: a partial checkpoint with no
resume path is still useful for post-mortem analysis, and resume support
is still useful for runs that finished cleanly.

---

## Issue 1 — Persist estimator state on graceful interrupt (SIGINT / SIGTERM)

### Summary

When a user presses Ctrl+C, or the OS sends SIGTERM (e.g. job
preemption, `kill <pid>`), the in-memory estimator state — network
weights, optimizer/scheduler state, EMA buffers, training history,
W&B run ID — is lost. The current shutdown path drains in-flight
training tasks but does not invoke `estimator.save()` before the
process exits, so partially-trained networks cannot be recovered.

### Current behaviour (0.4.1)

- `_GracefulShutdown` in `falcon/cli.py:263-294` catches SIGINT and
  sets a flag; second Ctrl+C raises `KeyboardInterrupt` immediately.
- `DeployedGraph._launch` in `falcon/core/deployed_graph.py:795-806`
  propagates the flag to each `NodeWrapper` via
  `node.request_stop.remote()`, which causes the training loop to
  break after the current epoch.
- `BaseEstimator.save()` (`falcon/core/base_estimator.py:82-90`) is
  only called from the post-`launch()` cleanup path. If shutdown
  happens via second Ctrl+C → `KeyboardInterrupt`, that path is
  skipped and no `.pth` files are written for that node.
- SIGTERM has no handler in `_GracefulShutdown` (only the interactive
  TUI in `falcon/interactive.py:80-81` registers it), so SLURM
  preemption or `scancel --signal=TERM` results in an immediate kill
  with zero state on disk.

### Proposed behaviour

1. Register `SIGTERM` alongside `SIGINT` in `_GracefulShutdown`
   (mirroring `interactive.py`), so non-interactive jobs respond to
   scheduler signals.
2. On *first* signal, after `request_stop.remote()`, wait for the
   current epoch to finish, then for each node with
   `networks_initialized == True` persist the **model file and
   network weights** plus auxiliary training state. Concretely:

   **Model file & network weights** — write the in-flight
   parameters (not just `_best_*`, which is what `save()` does
   today) so a Ctrl+C two epochs after the best validation loss
   doesn't lose those two epochs of progress:
   - Gaussian / `LossBasedEstimator`: `model.pth` (state dict of
     `self._model`, the `GaussianPosterior` + embedding), plus
     `model_best.pth` if a best snapshot exists, plus
     `init_parameters.pth` (constructor args for re-instantiation
     without the original config).
   - Flow: `conditional_flow.pth`, `marginal_flow.pth`,
     `embedding.pth`, `init_parameters.pth` — both *current* and
     `*_best.pth` variants.
   - Save format: `torch.save(state_dict, path)` (matches today's
     `save()`); also save the `nn.Module` class path inside
     `checkpoint.json` so `load()` can re-instantiate without
     needing the YAML.

   **Sidecar `checkpoint.json`** (per node) for resuming training:
   - epoch number, batch index, elapsed wall-time
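   - optimizer state (`AdamW.state_dict()`)
   - scheduler state (`ReduceLROnPlateau.state_dict()`)
   - RNG state (`torch.get_rng_state()`, `torch.cuda.get_rng_state()`,
     NumPy, Python `random`)
   - note: the optimizer and torch RNG states contain tensors, which
     JSON cannot hold directly; they need a companion file written
     with `torch.save` and referenced from `checkpoint.json`, or an
     explicit encoding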
   - EMA buffers and running stats inside `GaussianPosterior`
     (these are already nn.Module buffers, so they ride along in
     `model.pth` — call this out so it isn't missed)
   - dataset manager IDs of last-seen training / validation batches
   - W&B run ID (so Issue 3 can resume the same run)
   - falcon version, so Issue 3 can refuse a version mismatch
3. On *second* signal within ~5 s, fall back to today's behaviour
   (force exit, no save) so a hung node cannot block shutdown.
4. Surface the save in console output:
   `"⚠ Interrupted at epoch 47/300 — model + weights saved to {graph_dir}/z/"`
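The two-stage signal handling in steps 1 to 3 could be sketched roughly as follows. This is not the existing `_GracefulShutdown` code: the class name `TwoStageShutdown`, its attributes, and the injectable `clock` are hypothetical. The sketch also assumes "second signal within ~5 s" means a second signal arriving within a 5-second window after the first, with a later signal treated as a fresh first signal.

```python
import signal
import time


class TwoStageShutdown:
    """First signal requests a graceful stop; a second one inside the window forces exit.

    Hypothetical sketch, not falcon's actual _GracefulShutdown.
    """

    def __init__(self, force_window_s=5.0, clock=time.monotonic):
        self.force_window_s = force_window_s
        self._clock = clock
        self._first_signal_at = None
        self.stop_requested = False

    def install(self):
        # Register SIGTERM next to SIGINT so scheduler signals behave like Ctrl+C.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.handle)

    def handle(self, signum, frame):
        now = self._clock()
        within_window = (
            self._first_signal_at is not None
            and now - self._first_signal_at <= self.force_window_s
        )
        if within_window:
            # Second signal: give up on saving so a hung node cannot block exit.
            raise KeyboardInterrupt(f"forced exit on signal {signum}")
        # First signal (or one arriving after the window): ask for a graceful stop.
        self._first_signal_at = now
        self.stop_requested = True
```

The training loop would poll `stop_requested` at epoch boundaries and run the save path before exiting. Keeping the handler itself free of I/O avoids doing file writes inside a signal context.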

### Acceptance criteria

- Hitting Ctrl+C mid-training produces, for every node:
  - the network-weight files for that estimator
    (`model.pth` for Gaussian; `conditional_flow.pth`,
    `marginal_flow.pth`, `embedding.pth` for Flow)
  - `init_parameters.pth`
  - `checkpoint.json` with the auxiliary state listed above
- `kill -TERM <pid>` produces the same artefacts.
- Second Ctrl+C still hard-exits within ~5 s.
- `torch.load("model.pth")` followed by `estimator.load(node_dir)`
  produces an estimator whose `sample()` output matches the
  in-process estimator at interrupt time (bit-exact under fixed
  RNG seed).
- A unit/integration test launches `01_minimal`, sends SIGTERM
  during training, and asserts all files above exist and load
  cleanly.
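As an illustration of the `checkpoint.json` artefact named above, here is a minimal stdlib-only sketch that writes the JSON-friendly part of the state. The function names and field names (`epoch`, `batch_index`, `elapsed_s`, `wandb_run_id`, `falcon_version`, `python_rng_state`) are assumptions, not an agreed schema. Tensor-valued state (optimizer, torch RNG) is deliberately left out because it is not JSON-serializable. The file is written to a temporary name and then renamed, so an interrupted save never leaves a truncated `checkpoint.json`.

```python
import json
import os
import random
import tempfile
from pathlib import Path


def write_sidecar(node_dir, *, epoch, batch_index, elapsed_s, wandb_run_id, falcon_version):
    """Atomically write checkpoint.json (hypothetical schema) into node_dir."""
    node_dir = Path(node_dir)
    node_dir.mkdir(parents=True, exist_ok=True)
    # random.getstate() is (version, tuple_of_ints, gauss_next); JSON needs lists.
    version, internal, gauss_next = random.getstate()
    payload = {
        "epoch": epoch,
        "batch_index": batch_index,
        "elapsed_s": elapsed_s,
        "wandb_run_id": wandb_run_id,
        "falcon_version": falcon_version,
        "python_rng_state": [version, list(internal), gauss_next],
    }
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=node_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh, indent=2)
        fh.flush()
        os.fsync(fh.fileno())
    target = node_dir / "checkpoint.json"
    os.replace(tmp_path, target)
    return target


def restore_python_rng(payload):
    """Restore Python's `random` state from a loaded sidecar payload."""
    version, internal, gauss_next = payload["python_rng_state"]
    random.setstate((version, tuple(internal), gauss_next))
```

The same write-then-rename pattern would suit the `.pth` files, since Issue 2 overwrites a single snapshot in place.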

### Out of scope

- Resuming the run — that's Issue 3.
- Periodic background snapshots, which are covered by the
  epoch-boundary periodic checkpoints in Issue 2.

---

## Issue 2 — Wall-clock-budget-aware checkpointing for scheduled jobs

### Summary

On HPC schedulers (SLURM, PBS, LSF), a job is killed when its
allocation expires. Falcon currently has no awareness of the
remaining time budget, so a 24-hour SLURM job that is still training
at 23:59 is terminated at the limit with no checkpoint.
Issue 1 handles signals *when delivered*, but most schedulers
deliver the warning signal only seconds before the hard kill, which
is not enough time to drain a multi-minute epoch.

### Current behaviour (0.4.1)

- No reference to `SLURM_JOB_END_TIME`, `PBS_WALLTIME`, or any
  scheduler env var anywhere in the codebase.
- No periodic checkpoint outside the abstract `save()` contract.
- Even with Issue 1 implemented, a 30-second SIGTERM-to-SIGKILL
  grace period (SLURM default) is far shorter than a typical epoch
  for the Gaussian or Flow estimators (multi-second to multi-minute).

### Proposed behaviour

1. Add a `runtime` block to the YAML config:
   ```yaml
   runtime:
     wall_clock_budget: auto          # "auto" reads SLURM_JOB_END_TIME etc.
                                       # or an explicit duration like "23h30m"
     checkpoint_before_end: 5m         # save+exit when this much time remains
     periodic_checkpoint_every: 30m    # optional: snapshot mid-run
   ```
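2. In `DeployedGraph._launch`'s outer loop
   (`falcon/core/deployed_graph.py:796`), check the remaining time
   alongside the existing `stop_check()`. Scheduler end times are
   epoch seconds, so convert once at startup
   (`deadline = time.monotonic() + (end_time - time.time())`) and
   compare `time.monotonic()` against that deadline afterwards.
   When `(deadline - now) <= checkpoint_before_end`, trigger the
   same path as Issue 1: request stop, drain epoch, save, exit.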
3. For `wall_clock_budget: auto`, read in priority order:
   - `SLURM_JOB_END_TIME` (epoch seconds)
   - `PBS_WALLTIME` + job start time
   - `LSB_JOB_PEND_TIME` (LSF; confirm this variable name against
     the LSF documentation before relying on it)
   Fall back to `null` (feature disabled) if none are set.
4. Periodic checkpoints are written at epoch boundaries (not
   mid-epoch) to keep state consistent. They overwrite the same
   `checkpoint.json` / `*.pth` files used by Issue 1 — there is
   one checkpoint per node, not a chain.
5. Emit a single console line at startup:
   `"⏱ Wall-clock budget: 23h59m (from SLURM_JOB_END_TIME) — will checkpoint at 23h54m"`
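The deadline logic in steps 1 to 3 might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the duration grammar is guessed from the `23h30m` / `5m` examples, and only the `SLURM_JOB_END_TIME` lookup is shown (PBS and LSF are omitted). The scheduler's epoch-seconds end time is converted once into a deadline on the monotonic clock, so later wall-clock adjustments do not move the deadline.

```python
import re
import time

# Assumed grammar: optional hours, minutes, seconds, e.g. "23h30m", "5m", "30s".
_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_duration(text):
    """Parse a duration string into seconds."""
    match = _DURATION_RE.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def resolve_deadline(wall_clock_budget, env, *, wall_now=None, mono_now=None):
    """Return a deadline on the time.monotonic() clock, or None if disabled."""
    wall_now = time.time() if wall_now is None else wall_now
    mono_now = time.monotonic() if mono_now is None else mono_now
    if wall_clock_budget is None:
        return None
    if wall_clock_budget == "auto":
        end_time = env.get("SLURM_JOB_END_TIME")
        if end_time is None:
            return None  # PBS / LSF lookups are omitted in this sketch
        remaining = float(end_time) - wall_now
    else:
        remaining = parse_duration(wall_clock_budget)
    return mono_now + remaining


def should_checkpoint_and_exit(deadline, checkpoint_before_end_s, mono_now):
    """True once the remaining time is at or below the configured margin."""
    return deadline is not None and (deadline - mono_now) <= checkpoint_before_end_s
```

Passing `env` and the two clock readings explicitly keeps the function testable without a real scheduler; in production the caller would pass `os.environ` and leave the clock arguments unset.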

### Acceptance criteria

- Submitting `falcon launch` inside a 5-minute SLURM allocation
  with `checkpoint_before_end: 1m` produces a checkpoint at
  ~T+4m and exits cleanly before SLURM kills it.
- An integration test fakes `SLURM_JOB_END_TIME` to be 90 seconds
  in the future, sets `checkpoint_before_end: 30s`, and asserts
  the checkpoint is written within 60–65 s of launch.
- `periodic_checkpoint_every` produces fresh artefacts on a fixed
  cadence without ending training.
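To reason about these timing criteria without a real scheduler, the epoch-boundary decisions can be simulated. The helper below is a hypothetical sketch, not part of falcon: it takes the times at which epochs finish, in seconds since launch, and reports which boundaries would trigger a periodic save and which one triggers the final save-and-exit.

```python
def plan_checkpoints(epoch_end_times, deadline, checkpoint_before_end_s, periodic_every_s=None):
    """Walk epoch-boundary timestamps and return (time, action) pairs.

    action is "periodic" (save and continue) or "final" (save and exit).
    All times are seconds since launch; decisions happen only at epoch ends.
    """
    actions = []
    last_save = 0.0
    for t in epoch_end_times:
        # The deadline check takes priority over the periodic cadence.
        if deadline is not None and deadline - t <= checkpoint_before_end_s:
            actions.append((t, "final"))
            break
        if periodic_every_s is not None and t - last_save >= periodic_every_s:
            actions.append((t, "periodic"))
            last_save = t
    return actions


# Deadline 90 s after launch, 30 s margin, epochs ending every 7 s:
# the first boundary with <= 30 s remaining is t = 63.
print(plan_checkpoints(range(7, 200, 7), deadline=90, checkpoint_before_end_s=30))
```

Because saves happen only at epoch boundaries, the final checkpoint can land up to one epoch after the nominal trigger time. That suggests choosing `checkpoint_before_end` larger than the longest expected epoch plus the time needed to write the files.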

### Out of scope

- Multi-checkpoint history / rotation (single overwriting snapshot
  is enough for this iteration).
- Resuming the run — that's Issue 3.
- Schedulers other than SLURM / PBS / LSF (env-var hook is the
  extension point; users on Kubernetes/etc. can set
  `wall_clock_budget` explicitly).

---

## Issue 3 — `falcon launch --resume` to continue from a saved checkpoint

### Summary

Even though `BaseEstimator.load()` exists
(`falcon/core/base_estimator.py:92-100`) and every estimator
implements it, there is no user-facing way to *use* a saved
checkpoint to continue training. `falcon launch -o outputs/run` on
an existing directory either errors or starts from scratch — there
is no flag that says "pick up where the last run left off."

This issue depends on Issues 1 and 2 for the checkpoint contents
(epoch counter, optimizer/scheduler state, RNG, W&B run ID) but is
useful on its own: even runs that finish cleanly today could be
"continued" with more epochs if `--resume` were available.

### Current behaviour (0.4.1)

- `load()` is called nowhere in `falcon/cli.py` or
  `falcon/core/deployed_graph.py` (verified by `grep`).
- Re-running into an existing `graph_dir` re-initializes networks
  from scratch via `_initialize_networks()` in each estimator.
- The simulation buffer in `sim_dir/` *is* reused if present, so
  data is preserved across restarts — but the *trained networks*
  are not.

### Proposed behaviour

1. Add `--resume` and `--resume-from` flags to `falcon launch`:
   - `falcon launch -o outputs/run --resume`
     resumes the run in `outputs/run` from its checkpoint.
   - `falcon launch -o outputs/run --resume-from outputs/prior_run`
     copies a checkpoint from another directory and continues.
2. On resume:
   - Refuse to start if config in the saved run differs from the
     current config in a way that affects the network shape
     (`network.*`, `priors`, `embedding`); print a diff and exit.
   - Allow safe overrides like `loop.num_epochs` (training
     proceeds until the *new* target) and `inference.*` (only
     affects post-train sampling).
   - For each node with a checkpoint, call `estimator.load()`
     before training begins, restoring weights, optimizer state,
     scheduler state, EMA buffers, history, and RNG.
   - Resume the W&B run via the saved `run_id` so metrics land
     on the same W&B chart instead of starting a new one.
3. Print a clear status:
   `"↻ Resuming z from epoch 47/300 (last saved 2026-05-26T14:03)"`
4. If a node has no checkpoint (e.g. it was added to the graph
   *after* the prior run), train it from scratch and warn.
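The config-compatibility check in step 2 could be sketched as a comparison of flattened config dictionaries. The helper names are hypothetical, and the top-level key names `network`, `priors`, and `embedding` are taken from the list above; the actual YAML layout may differ. Differences under any other prefix, such as `loop.num_epochs` or `inference.*`, are treated as safe overrides and ignored.

```python
def flatten(config, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} form."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Assumed prefixes whose values determine network shape.
FROZEN_PREFIXES = ("network", "priors", "embedding")


def incompatible_keys(saved, current, frozen_prefixes=FROZEN_PREFIXES):
    """Return sorted dotted keys that differ and fall under a frozen prefix."""
    a, b = flatten(saved), flatten(current)
    missing = object()  # sentinel so added/removed keys count as differences
    changed = []
    for path in sorted(set(a) | set(b)):
        if a.get(path, missing) == b.get(path, missing):
            continue
        if any(path == p or path.startswith(p + ".") for p in frozen_prefixes):
            changed.append(path)
    return changed
```

A non-empty result would be printed as the diff before exiting; an empty list means the resume can proceed with the current config.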

### Acceptance criteria

- Launch `01_minimal`, interrupt at epoch 50, run
  `falcon launch -o <same dir> --resume`, verify training
  continues at epoch 51 (not 1) and the W&B run is the same one.
- Final posterior samples from interrupted+resumed run match
  samples from an uninterrupted run within tolerance
  (sanity-check, not bit-exact).
- Config-mismatch refusal works: changing `network.hidden_dim`
  and resuming exits with a clear "incompatible config" error
  pointing at the changed key.
- Adding a new node to the graph and resuming logs a "training
  from scratch: {node}" warning and proceeds.

### Out of scope

- Branching / forking checkpoints (one linear resume chain only).
- Resuming partial epochs (resumption is always at an epoch
  boundary, matching the checkpoint cadence).
- Cross-version resume (require the same falcon version that
  wrote the checkpoint; record it in `checkpoint.json` and
  refuse mismatches).

