Save point triggering during ctrl+C or allocation wall limit and restarting from save point #75

@JoySanghavi

Written via Claude Code

Falcon 0.4.1 — Checkpoint & Resume Enhancement Issues

Three related but independently-fileable GitHub issues for falcon 0.4.1.
They share a common need (durable training state) but solve different
failure modes:

  • Issue 1 — save when the user interrupts (Ctrl+C, kill).
  • Issue 2 — save before a scheduled job runs out of wall-clock time.
  • Issue 3 — restart a previously-saved run from where it stopped.

Issue 3 is the consumer of artefacts produced by Issues 1 & 2, but they
can be implemented and merged in any order: a partial checkpoint with no
resume path is still useful for post-mortem analysis, and resume support
is still useful for runs that finished cleanly.


Issue 1 — Persist estimator state on graceful interrupt (SIGINT / SIGTERM)

Summary

When a user presses Ctrl+C, or the OS sends SIGTERM (e.g. job
preemption, kill <pid>), the in-memory estimator state — network
weights, optimizer/scheduler state, EMA buffers, training history,
W&B run ID — is lost. The current shutdown path drains in-flight
training tasks but does not invoke estimator.save() before the
process exits, so partially-trained networks cannot be recovered.

Current behaviour (0.4.1)

  • _GracefulShutdown in falcon/cli.py:263-294 catches SIGINT and
    sets a flag; second Ctrl+C raises KeyboardInterrupt immediately.
  • DeployedGraph._launch in falcon/core/deployed_graph.py:795-806
    propagates the flag to each NodeWrapper via
    node.request_stop.remote(), which causes the training loop to
    break after the current epoch.
  • BaseEstimator.save() (falcon/core/base_estimator.py:82-90) is
    only called from the post-launch() cleanup path. If shutdown
    happens via second Ctrl+C → KeyboardInterrupt, that path is
    skipped and no .pth files are written for that node.
  • SIGTERM has no handler in _GracefulShutdown (only the interactive
    TUI in falcon/interactive.py:80-81 registers it), so SLURM
    preemption or scancel --signal=TERM results in an immediate kill
    with zero state on disk.

Proposed behaviour

  1. Register SIGTERM alongside SIGINT in _GracefulShutdown
    (mirroring interactive.py), so non-interactive jobs respond to
    scheduler signals.

  2. On first signal, after request_stop.remote(), wait for the
    current epoch to finish, then for each node with
    networks_initialized == True persist the model file and
    network weights plus auxiliary training state. Concretely:

    Model file & network weights — write the in-flight
    parameters (not just _best_*, which is what save() does
    today) so a Ctrl+C two epochs after the best validation loss
    doesn't lose those two epochs of progress:

    • Gaussian / LossBasedEstimator: model.pth (state dict of
      self._model, the GaussianPosterior + embedding), plus
      model_best.pth if a best snapshot exists, plus
      init_parameters.pth (constructor args for re-instantiation
      without the original config).
    • Flow: conditional_flow.pth, marginal_flow.pth,
      embedding.pth, init_parameters.pth — both current and
      *_best.pth variants.
    • Save format: torch.save(state_dict, path) (matches today's
      save()); also save the nn.Module class path inside
      checkpoint.json so load() can re-instantiate without
      needing the YAML.

    Sidecar checkpoint.json (per node) for resuming training:

    • epoch number, batch index, elapsed wall-time
    • optimizer state (AdamW.state_dict())
    • scheduler state (ReduceLROnPlateau.state_dict())
    • RNG state (torch.get_rng_state(), torch.cuda.get_rng_state(),
      NumPy, Python random)
    • EMA buffers and running stats inside GaussianPosterior
      (these are already nn.Module buffers, so they ride along in
      model.pth — call this out so it isn't missed)
    • dataset manager IDs of last-seen training / validation batches
    • W&B run ID (so Issue 3 can resume the same run)
    • falcon version, so Issue 3 can refuse a version mismatch
  3. On second signal within ~5 s, fall back to today's behaviour
    (force exit, no save) so a hung node cannot block shutdown.

  4. Surface the save in console output:
    "⚠ Interrupted at epoch 47/300 — model + weights saved to {graph_dir}/z/"

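The two-stage signal handling in steps 1 to 3 could be sketched roughly as below. This is a minimal illustration, not falcon's actual _GracefulShutdown: the class name GracefulShutdownSketch, its attributes, and the injectable clock are all hypothetical. It also assumes that a second signal arriving after the ~5 s window is treated like a fresh first signal, which the proposal does not specify.

```python
import signal
import time


class GracefulShutdownSketch:
    """Illustrative two-stage handler; not falcon's actual _GracefulShutdown."""

    def __init__(self, force_window_s=5.0, clock=time.monotonic):
        self.force_window_s = force_window_s
        self._clock = clock
        self._first_signal_at = None
        self.stop_requested = False  # training loop should break after this epoch
        self.save_requested = False  # caller should persist state once nodes drain

    def install(self):
        # Register SIGTERM alongside SIGINT so scheduler signals take the same path.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.handle)

    def handle(self, signum, frame):
        now = self._clock()
        first = self._first_signal_at
        if first is not None and now - first <= self.force_window_s:
            # Second signal inside the window: give up on saving and force exit.
            self.save_requested = False
            raise KeyboardInterrupt
        # First signal (or one arriving after the window, an assumption of this
        # sketch): request a stop and a save at the next epoch boundary.
        self._first_signal_at = now
        self.stop_requested = True
        self.save_requested = True
```

The launch loop would poll stop_requested where it polls the existing flag today and, once the nodes have drained, write the checkpoint only if save_requested is still set.
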
Acceptance criteria

  • Hitting Ctrl+C mid-training produces, for every node:
    • the network-weight files for that estimator
      (model.pth for Gaussian; conditional_flow.pth,
      marginal_flow.pth, embedding.pth for Flow)
    • init_parameters.pth
    • checkpoint.json with the auxiliary state listed above
  • kill -TERM <pid> produces the same artefacts.
  • Second Ctrl+C still hard-exits within ~5 s.
  • torch.load(model.pth) followed by estimator.load(node_dir)
    produces an estimator whose sample() output matches the
    in-process estimator at interrupt time (bit-exact under fixed
    RNG seed).
  • A unit/integration test launches 01_minimal, sends SIGTERM
    during training, and asserts all files above exist and load
    cleanly.
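
One way the checkpoint.json sidecar asserted above might be written is sketched here, using only the standard library. The function names write_sidecar and read_sidecar and every JSON field name are assumptions made for illustration. The sketch covers only the JSON-friendly fields (epoch counters, W&B run ID, falcon version, Python RNG state); optimizer, scheduler, and torch RNG state hold tensors and are left out. Writing to a temporary file and renaming it is a design choice of the sketch, so that a kill mid-write cannot leave a torn file.

```python
import json
import os
import random
import tempfile


def write_sidecar(node_dir, epoch, batch_index, elapsed_s, falcon_version,
                  wandb_run_id=None):
    """Write checkpoint.json atomically (field names are hypothetical)."""
    state = {
        "epoch": epoch,
        "batch_index": batch_index,
        "elapsed_s": elapsed_s,
        "falcon_version": falcon_version,
        "wandb_run_id": wandb_run_id,
        # random.getstate() is a tuple; JSON stores it as nested lists.
        "python_rng_state": random.getstate(),
    }
    os.makedirs(node_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=node_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace is atomic on the same filesystem, so readers never see a partial file.
    os.replace(tmp_path, os.path.join(node_dir, "checkpoint.json"))


def read_sidecar(node_dir):
    """Load checkpoint.json and restore the Python RNG state it recorded."""
    with open(os.path.join(node_dir, "checkpoint.json")) as fh:
        state = json.load(fh)
    version, internal, gauss_next = state["python_rng_state"]
    # setstate() needs the internal state as a tuple, not the list JSON gives back.
    random.setstate((version, tuple(internal), gauss_next))
    return state
```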

Out of scope

  • Resuming the run — that's Issue 3.
  • Periodic background snapshots are covered by the epoch-boundary
    periodic checkpoints in Issue 2.

Issue 2 — Wall-clock-budget-aware checkpointing for scheduled jobs

Summary

On HPC schedulers (SLURM, PBS, LSF), a job is killed when its
allocation expires. Falcon currently has no awareness of the
remaining time budget, so a 24-hour SLURM job that's still training
at hour 23:59 will be SIGKILL'd with no warning and no checkpoint.
Issue 1 handles signals once they are delivered, but most schedulers
deliver the warning signal only seconds before the hard kill, which
is not enough time to drain a multi-minute epoch.

Current behaviour (0.4.1)

  • No reference to SLURM_JOB_END_TIME, PBS_WALLTIME, or any
    scheduler env var anywhere in the codebase.
  • No periodic checkpoint outside the abstract save() contract.
  • Even with Issue 1 implemented, a 30-second SIGTERM-to-SIGKILL
    grace period (SLURM default) can be shorter than a single epoch
    for the Gaussian or Flow estimators, whose epochs range from
    multi-second to multi-minute.

Proposed behaviour

  1. Add a runtime block to the YAML config:
    runtime:
      wall_clock_budget: auto          # "auto" reads SLURM_JOB_END_TIME etc.
                                        # or an explicit duration like "23h30m"
      checkpoint_before_end: 5m         # save+exit when this much time remains
      periodic_checkpoint_every: 30m    # optional: snapshot mid-run
  2. In DeployedGraph._launch's outer loop
    (falcon/core/deployed_graph.py:796), check time.monotonic()
    against the deadline alongside the existing stop_check().
    When (deadline - now) <= checkpoint_before_end, trigger the
    same path as Issue 1 — request stop, drain epoch, save, exit.
  3. For wall_clock_budget: auto, read in priority order:
    • SLURM_JOB_END_TIME (epoch seconds)
    • PBS_WALLTIME + job start time
    • LSB_JOB_PEND_TIME
    Falls back to null (feature disabled) if none are set.
  4. Periodic checkpoints are written at epoch boundaries (not
    mid-epoch) to keep state consistent. They overwrite the same
    checkpoint.json / *.pth files used by Issue 1 — there is
    one checkpoint per node, not a chain.
  5. Emit a single console line at startup:
    "⏱ Wall-clock budget: 23h59m (from SLURM_JOB_END_TIME) — will checkpoint at 23h54m"

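The budget resolution and deadline check in steps 1 to 3 might look roughly like this. It is a sketch under stated assumptions: parse_duration, remaining_budget_s, and Deadline are hypothetical names, only the SLURM variable is read (the PBS and LSF lookups are omitted), and the duration grammar is assumed to be the h/m/s form used in the YAML example. The wall-clock end time is converted to a monotonic deadline once at startup so that later system clock adjustments do not move it.

```python
import os
import re
import time

_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_duration(text):
    """Parse strings like '23h30m', '5m', or '30s' into seconds."""
    match = _DURATION_RE.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def remaining_budget_s(wall_clock_budget, env=None, wall_now=None):
    """Return seconds of budget left, or None when the feature is disabled."""
    env = os.environ if env is None else env
    wall_now = time.time() if wall_now is None else wall_now
    if wall_clock_budget is None:
        return None
    if wall_clock_budget == "auto":
        end = env.get("SLURM_JOB_END_TIME")  # epoch seconds
        # PBS and LSF lookups would follow here; they are omitted in this sketch.
        return float(end) - wall_now if end else None
    return float(parse_duration(wall_clock_budget))


class Deadline:
    """Monotonic deadline checked from the launch loop (hypothetical helper)."""

    def __init__(self, remaining_s, checkpoint_before_end_s, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + remaining_s
        self._margin = checkpoint_before_end_s

    def should_checkpoint(self):
        # True once the time left is no more than checkpoint_before_end.
        return self._deadline - self._clock() <= self._margin
```

With 90 seconds of budget and a 30-second margin, should_checkpoint() first returns True 60 seconds after the Deadline is created, which lines up with the 60 to 65 s window in the integration test below.
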
Acceptance criteria

  • Submitting falcon launch inside a 5-minute SLURM allocation
    with checkpoint_before_end: 1m produces a checkpoint at
    ~T+4m and exits cleanly before SLURM kills it.
  • An integration test fakes SLURM_JOB_END_TIME to be 90 seconds
    in the future, sets checkpoint_before_end: 30s, and asserts
    the checkpoint is written within 60–65 s of launch.
  • periodic_checkpoint_every produces fresh artefacts on a fixed
    cadence without ending training.
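
The fixed cadence for periodic_checkpoint_every, evaluated only at epoch boundaries, can be illustrated with a small trigger object. PeriodicCheckpointTrigger and its method name are hypothetical. Because the check runs only when an epoch ends, a snapshot lands on the first epoch boundary at or after each interval rather than exactly on the interval.

```python
import time


class PeriodicCheckpointTrigger:
    """Fires at the first epoch boundary at or after each interval."""

    def __init__(self, every_s, clock=time.monotonic):
        self._every_s = every_s
        self._clock = clock
        self._last = clock()  # time of the last snapshot (or of startup)

    def due_at_epoch_end(self):
        # Call once per finished epoch; training continues whatever it returns.
        now = self._clock()
        if now - self._last >= self._every_s:
            self._last = now
            return True
        return False
```

For example, with 7-minute epochs and a 30-minute interval, the trigger fires after epochs 5, 10, and 15 (at 35, 70, and 105 minutes).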

Out of scope

  • Multi-checkpoint history / rotation (single overwriting snapshot
    is enough for this iteration).
  • Resuming the run — that's Issue 3.
  • Schedulers other than SLURM / PBS / LSF (env-var hook is the
    extension point; users on Kubernetes/etc. can set
    wall_clock_budget explicitly).

Issue 3 — falcon launch --resume to continue from a saved checkpoint

Summary

Even though BaseEstimator.load() exists
(falcon/core/base_estimator.py:92-100) and every estimator
implements it, there is no user-facing way to use a saved
checkpoint to continue training. falcon launch -o outputs/run on
an existing directory either errors or starts from scratch — there
is no flag that says "pick up where the last run left off."

This issue depends on Issues 1 and 2 for the checkpoint contents
(epoch counter, optimizer/scheduler state, RNG, W&B run ID) but is
useful on its own: even runs that finish cleanly today could be
"continued" with more epochs if --resume were available.

Current behaviour (0.4.1)

  • load() is called nowhere in falcon/cli.py or
    falcon/core/deployed_graph.py (verified by grep).
  • Re-running into an existing graph_dir re-initializes networks
    from scratch via _initialize_networks() in each estimator.
  • The simulation buffer in sim_dir/ is reused if present, so
    data is preserved across restarts — but the trained networks
    are not.

Proposed behaviour

  1. Add --resume and --resume-from flags to falcon launch:
    • falcon launch -o outputs/run --resume
      resumes the run in outputs/run from its checkpoint.
    • falcon launch -o outputs/run --resume-from outputs/prior_run
      copies a checkpoint from another directory and continues.
  2. On resume:
    • Refuse to start if config in the saved run differs from the
      current config in a way that affects the network shape
      (network.*, priors, embedding); print a diff and exit.
    • Allow safe overrides like loop.num_epochs (training
      proceeds until the new target) and inference.* (only
      affects post-train sampling).
    • For each node with a checkpoint, call estimator.load()
      before training begins, restoring weights, optimizer state,
      scheduler state, EMA buffers, history, and RNG.
    • Resume the W&B run via the saved run_id so metrics land
      on the same W&B chart instead of starting a new one.
  3. Print a clear status:
    "↻ Resuming z from epoch 47/300 (last saved 2026-05-26T14:03)"
  4. If a node has no checkpoint (e.g. it was added to the graph
    after the prior run), train it from scratch and warn.
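
The config-compatibility check in step 2 could be sketched as a flattened-key diff. This assumes the saved and current configs are nested dicts and that the shape-affecting sections sit at the top level as network, priors, and embedding; the real falcon config layout may differ. The names flatten, SHAPE_PREFIXES, and incompatible_keys are hypothetical.

```python
_MISSING = object()

# Assumption: these top-level sections determine network shape.
SHAPE_PREFIXES = ("network.", "priors.", "embedding.")


def flatten(config, prefix=""):
    """Turn {'network': {'hidden_dim': 64}} into {'network.hidden_dim': 64}."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def incompatible_keys(saved, current):
    """Return (key, saved_value, current_value) for shape-affecting changes."""
    saved_flat, current_flat = flatten(saved), flatten(current)
    changed = []
    for key in sorted(set(saved_flat) | set(current_flat)):
        if not key.startswith(SHAPE_PREFIXES):
            continue  # safe overrides such as loop.* and inference.* are ignored
        if saved_flat.get(key, _MISSING) != current_flat.get(key, _MISSING):
            changed.append((key, saved_flat.get(key), current_flat.get(key)))
    return changed
```

A non-empty result would be printed as the diff before exiting; an empty result lets the resume proceed.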

Acceptance criteria

  • Launch 01_minimal, interrupt at epoch 50, run
    falcon launch -o <same dir> --resume, verify training
    continues at epoch 51 (not 1) and the W&B run is the same one.
  • Final posterior samples from interrupted+resumed run match
    samples from an uninterrupted run within tolerance
    (sanity-check, not bit-exact).
  • Config-mismatch refusal works: changing network.hidden_dim
    and resuming exits with a clear "incompatible config" error
    pointing at the changed key.
  • Adding a new node to the graph and resuming logs a "training
    from scratch: {node}" warning and proceeds.
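
The per-node decision behind these criteria (resume at the next epoch, train new nodes from scratch, refuse a version mismatch) might be sketched as follows. It assumes one directory per node under graph_dir containing checkpoint.json, and the field names epoch and falcon_version; plan_resume and the shape of its return value are hypothetical.

```python
import json
import os


def plan_resume(graph_dir, node_names, running_version):
    """Decide, per node, whether to resume from a checkpoint or start fresh."""
    plan = {}
    for name in node_names:
        path = os.path.join(graph_dir, name, "checkpoint.json")
        if not os.path.exists(path):
            # Node added after the prior run: train from scratch (caller warns).
            plan[name] = {"action": "scratch", "start_epoch": 1}
            continue
        with open(path) as fh:
            sidecar = json.load(fh)
        if sidecar["falcon_version"] != running_version:
            raise RuntimeError(
                f"{name}: checkpoint written by falcon "
                f"{sidecar['falcon_version']}, running {running_version}"
            )
        # Checkpoints sit at epoch boundaries, so continue with the next epoch.
        plan[name] = {"action": "resume", "start_epoch": sidecar["epoch"] + 1}
    return plan
```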

Out of scope

  • Branching / forking checkpoints (one linear resume chain only).
  • Resuming partial epochs (resumption is always at an epoch
    boundary, matching the checkpoint cadence).
  • Cross-version resume (require the same falcon version that
    wrote the checkpoint; record it in checkpoint.json and
    refuse mismatches).
