Written via Claude Code
Falcon 0.4.1 — Checkpoint & Resume Enhancement Issues
Three related but independently-fileable GitHub issues for falcon 0.4.1.
They share a common need (durable training state) but solve different
failure modes:
- Issue 1 — save when the user interrupts (Ctrl+C, kill).
- Issue 2 — save before a scheduled job runs out of wall-clock time.
- Issue 3 — restart a previously-saved run from where it stopped.
Issue 3 is the consumer of artefacts produced by Issues 1 & 2, but they
can be implemented and merged in any order: a partial checkpoint with no
resume path is still useful for post-mortem analysis, and resume support
is still useful for runs that finished cleanly.
Issue 1 — Persist estimator state on graceful interrupt (SIGINT / SIGTERM)
Summary
When a user presses Ctrl+C, or the OS sends SIGTERM (e.g. job
preemption, kill <pid>), the in-memory estimator state — network
weights, optimizer/scheduler state, EMA buffers, training history,
W&B run ID — is lost. The current shutdown path drains in-flight
training tasks but does not invoke estimator.save() before the
process exits, so partially-trained networks cannot be recovered.
Current behaviour (0.4.1)
- _GracefulShutdown in falcon/cli.py:263-294 catches SIGINT and
sets a flag; second Ctrl+C raises KeyboardInterrupt immediately.
- DeployedGraph._launch in falcon/core/deployed_graph.py:795-806
propagates the flag to each NodeWrapper via
node.request_stop.remote(), which causes the training loop to
break after the current epoch.
- BaseEstimator.save() (falcon/core/base_estimator.py:82-90) is
only called from the post-launch() cleanup path. If shutdown
happens via second Ctrl+C → KeyboardInterrupt, that path is
skipped and no .pth files are written for that node.
- SIGTERM has no handler in _GracefulShutdown (only the interactive
TUI in falcon/interactive.py:80-81 registers it), so SLURM
preemption or scancel --signal=TERM results in an immediate kill
with zero state on disk.
Proposed behaviour
- Register SIGTERM alongside SIGINT in _GracefulShutdown
  (mirroring interactive.py), so non-interactive jobs respond to
  scheduler signals.
- On first signal, after request_stop.remote(), wait for the
  current epoch to finish, then for each node with
  networks_initialized == True persist the model file and
  network weights plus auxiliary training state. Concretely:
  - Model file & network weights: write the in-flight
    parameters (not just _best_*, which is what save() does
    today) so a Ctrl+C two epochs after the best validation loss
    doesn't lose those two epochs of progress:
    - Gaussian / LossBasedEstimator: model.pth (state dict of
      self._model, the GaussianPosterior + embedding), plus
      model_best.pth if a best snapshot exists, plus
      init_parameters.pth (constructor args for re-instantiation
      without the original config).
    - Flow: conditional_flow.pth, marginal_flow.pth,
      embedding.pth, init_parameters.pth, in both current and
      *_best.pth variants.
    - Save format: torch.save(state_dict, path) (matches today's
      save()); also save the nn.Module class path inside
      checkpoint.json so load() can re-instantiate without
      needing the YAML.
  - Sidecar checkpoint.json (per node) for resuming training:
    - epoch number, batch index, elapsed wall-time
    - optimizer state (AdamW.state_dict())
    - scheduler state (ReduceLROnPlateau.state_dict())
    - RNG state (torch.get_rng_state(), torch.cuda.get_rng_state(),
      NumPy, Python random)
    - EMA buffers and running stats inside GaussianPosterior
      (these are already nn.Module buffers, so they ride along in
      model.pth; call this out so it isn't missed)
    - dataset manager IDs of last-seen training / validation batches
    - W&B run ID (so Issue 3 can resume the same run)
    - falcon version, so Issue 3 can refuse a version mismatch
- On second signal within ~5 s, fall back to today's behaviour
  (force exit, no save) so a hung node cannot block shutdown.
- Surface the save in console output:
"⚠ Interrupted at epoch 47/300 — model + weights saved to {graph_dir}/z/"
Acceptance criteria
- Hitting Ctrl+C mid-training produces, for every node:
  - the network-weight files for that estimator
    (model.pth for Gaussian; conditional_flow.pth,
    marginal_flow.pth, embedding.pth for Flow)
  - init_parameters.pth
  - checkpoint.json with the auxiliary state listed above
- kill -TERM <pid> produces the same artefacts.
- Second Ctrl+C still hard-exits within ~5 s.
- torch.load(model.pth) followed by estimator.load(node_dir)
  produces an estimator whose sample() output matches the
  in-process estimator at interrupt time (bit-exact under fixed
  RNG seed).
- A unit/integration test launches 01_minimal, sends SIGTERM
  during training, and asserts all files above exist and load
  cleanly.
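The two-stage shutdown described in this issue (first signal requests a stop and save, a second signal within ~5 s forces exit) could be sketched roughly as below. This is a minimal illustration, not falcon's actual _GracefulShutdown: the class name GracefulShutdownSketch, its attributes, and the injectable clock are hypothetical. It also assumes that a repeat signal arriving after the window simply restarts the window, which the issue text does not specify.

```python
import signal
import time


class GracefulShutdownSketch:
    """Hypothetical stand-in for _GracefulShutdown (names are assumptions)."""

    def __init__(self, force_window_s=5.0, clock=time.monotonic):
        self.stop_requested = False
        self._first_signal_at = None
        self._force_window_s = force_window_s
        self._clock = clock  # injectable so tests can fake time

    def install(self):
        # Use one handler for both signals so scheduler preemption (SIGTERM)
        # follows the same path as an interactive Ctrl+C (SIGINT).
        # signal.signal() must be called from the main thread.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.handle)

    def handle(self, signum, frame):
        now = self._clock()
        first = self._first_signal_at
        if first is not None and now - first <= self._force_window_s:
            # Second signal inside the window: give up on saving and
            # fall back to an immediate exit.
            raise KeyboardInterrupt
        # First signal (or a repeat after the window, an assumption of
        # this sketch): only set a flag. The training loop is expected to
        # notice it, finish the current epoch, and then save.
        self._first_signal_at = now
        self.stop_requested = True
```

A launcher would call install() once at startup and poll stop_requested at the same point where it currently checks its stop flag. Keeping the handler itself free of I/O is deliberate: the save happens later, in normal control flow, once the epoch has drained.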
Out of scope
- Resuming the run — that's Issue 3.
- Periodic background snapshots: covered by the periodic
epoch-boundary checkpoints in Issue 2.
Issue 2 — Wall-clock-budget-aware checkpointing for scheduled jobs
Summary
On HPC schedulers (SLURM, PBS, LSF), a job is killed when its
allocation expires. Falcon currently has no awareness of the
remaining time budget, so a 24-hour SLURM job that's still training
at hour 23:59 will be SIGKILL'd with no warning and no checkpoint.
Issue 1 handles signals when delivered, but most schedulers
deliver the warning signal only seconds before the hard kill, which
is not enough time to drain a multi-minute epoch.
Current behaviour (0.4.1)
- No reference to SLURM_JOB_END_TIME, PBS_WALLTIME, or any
scheduler env var anywhere in the codebase.
- No periodic checkpoint outside the abstract save() contract.
- Even with Issue 1 implemented, a 30-second SIGTERM-to-SIGKILL
grace period (SLURM default) is far shorter than a typical epoch
for the Gaussian or Flow estimators (multi-second to multi-minute).
Proposed behaviour
- Add a runtime block to the YAML config:
    runtime:
      wall_clock_budget: auto         # "auto" reads SLURM_JOB_END_TIME etc.
                                      # or an explicit duration like "23h30m"
      checkpoint_before_end: 5m       # save+exit when this much time remains
      periodic_checkpoint_every: 30m  # optional: snapshot mid-run
- In DeployedGraph._launch's outer loop
  (falcon/core/deployed_graph.py:796), check time.monotonic()
  against the deadline alongside the existing stop_check().
  When (deadline - now) <= checkpoint_before_end, trigger the
  same path as Issue 1: request stop, drain epoch, save, exit.
- For wall_clock_budget: auto, read in priority order:
  - SLURM_JOB_END_TIME (epoch seconds)
  - PBS_WALLTIME + job start time
  - LSB_JOB_PEND_TIME
  Falls back to null (feature disabled) if none are set.
- Periodic checkpoints are written at epoch boundaries (not
  mid-epoch) to keep state consistent. They overwrite the same
  checkpoint.json / *.pth files used by Issue 1; there is
  one checkpoint per node, not a chain.
- Emit a single console line at startup:
"⏱ Wall-clock budget: 23h59m (from SLURM_JOB_END_TIME) — will checkpoint at 23h54m"
Acceptance criteria
- Submitting falcon launch inside a 5-minute SLURM allocation
with checkpoint_before_end: 1m produces a checkpoint at
~T+4m and exits cleanly before SLURM kills it.
- An integration test fakes SLURM_JOB_END_TIME to be 90 seconds
in the future, sets checkpoint_before_end: 30s, and asserts
the checkpoint is written within 60–65 s of launch.
- periodic_checkpoint_every produces fresh artefacts on a fixed
cadence without ending training.
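The deadline check in this issue mixes two clocks: SLURM_JOB_END_TIME is given in epoch seconds, while the loop compares against time.monotonic(). One way to reconcile them, sketched here under assumptions, is to compute the remaining seconds once at startup and add them to the monotonic clock. The helper names parse_duration, resolve_deadline and should_checkpoint are hypothetical, and only the SLURM variable and explicit durations are handled; the PBS and LSF lookups are left out.

```python
import os
import re
import time

# Accepts forms like "23h30m", "5m", "30s" (the duration syntax is assumed
# from the examples in the proposed YAML block).
_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def parse_duration(text):
    """Convert a duration string such as "23h30m" into seconds."""
    match = _DURATION_RE.match(text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def resolve_deadline(budget, environ=os.environ,
                     wall_now=time.time, mono_now=time.monotonic):
    """Return the deadline on the monotonic clock, or None if disabled."""
    if budget is None:
        return None
    if budget == "auto":
        end = environ.get("SLURM_JOB_END_TIME")
        if end is None:
            return None  # other schedulers are not covered by this sketch
        remaining = float(end) - wall_now()  # epoch seconds -> seconds left
    else:
        remaining = parse_duration(budget)
    return mono_now() + remaining


def should_checkpoint(deadline, margin_s, mono_now=time.monotonic):
    """True once the time left is at most checkpoint_before_end."""
    return deadline is not None and (deadline - mono_now()) <= margin_s
```

With a faked SLURM_JOB_END_TIME 90 seconds ahead and a 30-second margin, should_checkpoint first returns True about 60 seconds after launch, which lines up with the integration-test criterion above. Injecting the clocks and the environment keeps such a test independent of a real scheduler.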
Out of scope
- Multi-checkpoint history / rotation (single overwriting snapshot
is enough for this iteration).
- Resuming the run — that's Issue 3.
- Schedulers other than SLURM / PBS / LSF (env-var hook is the
extension point; users on Kubernetes/etc. can set
wall_clock_budget explicitly).
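Because this iteration keeps a single snapshot that is overwritten in place, an interruption during the write itself could leave the only checkpoint.json truncated. The issue text does not prescribe a mitigation; one common approach, shown here purely as an assumption, is to write to a temporary file in the same directory and swap it in with os.replace. The function name write_sidecar_atomically and the payload keys are hypothetical. Note that tensor-valued state (for example an optimizer state dict) is not JSON-serialisable as-is, so this sketch covers scalar metadata only.

```python
import json
import os
import tempfile


def write_sidecar_atomically(node_dir, payload):
    """Write payload to node_dir/checkpoint.json without exposing partial files."""
    os.makedirs(node_dir, exist_ok=True)
    final_path = os.path.join(node_dir, "checkpoint.json")
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, which is what makes the final rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=node_dir, prefix="checkpoint.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temp file; any previous checkpoint.json is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A call such as write_sidecar_atomically("outputs/run/z", {"epoch": 47, "falcon_version": "0.4.1"}) would leave either the old or the new file on disk, never a half-written one. The same pattern could be applied to the *.pth files by saving to a temporary path first.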
Issue 3 — falcon launch --resume to continue from a saved checkpoint
Summary
Even though BaseEstimator.load() exists
(falcon/core/base_estimator.py:92-100) and every estimator
implements it, there is no user-facing way to use a saved
checkpoint to continue training. falcon launch -o outputs/run on
an existing directory either errors or starts from scratch — there
is no flag that says "pick up where the last run left off."
This issue depends on Issues 1 and 2 for the checkpoint contents
(epoch counter, optimizer/scheduler state, RNG, W&B run ID) but is
useful on its own: even runs that finish cleanly today could be
"continued" with more epochs if --resume were available.
Current behaviour (0.4.1)
- load() is called nowhere in falcon/cli.py or
falcon/core/deployed_graph.py (verified by grep).
- Re-running into an existing graph_dir re-initializes networks
from scratch via _initialize_networks() in each estimator.
- The simulation buffer in sim_dir/ is reused if present, so
data is preserved across restarts, but the trained networks
are not.
Proposed behaviour
- Add --resume and --resume-from flags to falcon launch:
  - falcon launch -o outputs/run --resume
    resumes the run in outputs/run from its checkpoint.
  - falcon launch -o outputs/run --resume-from outputs/prior_run
    copies a checkpoint from another directory and continues.
- On resume:
  - Refuse to start if config in the saved run differs from the
    current config in a way that affects the network shape
    (network.*, priors, embedding); print a diff and exit.
  - Allow safe overrides like loop.num_epochs (training
    proceeds until the new target) and inference.* (only
    affects post-train sampling).
  - For each node with a checkpoint, call estimator.load()
    before training begins, restoring weights, optimizer state,
    scheduler state, EMA buffers, history, and RNG.
  - Resume the W&B run via the saved run_id so metrics land
    on the same W&B chart instead of starting a new one.
  - Print a clear status:
    "↻ Resuming z from epoch 47/300 (last saved 2026-05-26T14:03)"
  - If a node has no checkpoint (e.g. it was added to the graph
    after the prior run), train it from scratch and warn.
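The config-compatibility rule above (refuse on network.*, priors, embedding; allow loop.num_epochs and inference.*) could be expressed as a comparison over flattened, dotted keys. The following is a rough sketch under assumptions: the helper names flatten and classify_changes are hypothetical, the patterns are taken from the examples in this issue, and changed keys matching neither list are reported separately because the issue does not say how to treat them.

```python
from fnmatch import fnmatchcase

# Patterns taken from the examples in the issue text; treat them as assumptions.
BLOCKING_PATTERNS = ("network.*", "priors", "priors.*", "embedding", "embedding.*")
SAFE_PATTERNS = ("loop.num_epochs", "inference.*")

_MISSING = object()  # marks a key present in only one of the two configs


def flatten(config, prefix=""):
    """Turn nested dicts into {"a.b.c": value} form."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def classify_changes(saved, current):
    """Return (blocking, safe, other) lists of changed dotted keys."""
    old, new = flatten(saved), flatten(current)
    blocking, safe, other = [], [], []
    for key in sorted(set(old) | set(new)):
        if old.get(key, _MISSING) == new.get(key, _MISSING):
            continue  # unchanged key
        if any(fnmatchcase(key, pat) for pat in BLOCKING_PATTERNS):
            blocking.append(key)
        elif any(fnmatchcase(key, pat) for pat in SAFE_PATTERNS):
            safe.append(key)
        else:
            other.append(key)
    return blocking, safe, other
```

A resume path could exit with an "incompatible config" error whenever the blocking list is non-empty and print those keys as the diff. Whether keys in the third list should block or merely warn is left open here.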
Acceptance criteria
- Launch 01_minimal, interrupt at epoch 50, run
falcon launch -o <same dir> --resume, verify training
continues at epoch 51 (not 1) and the W&B run is the same one.
- Final posterior samples from interrupted+resumed run match
samples from an uninterrupted run within tolerance
(sanity-check, not bit-exact).
- Config-mismatch refusal works: changing network.hidden_dim
and resuming exits with a clear "incompatible config" error
pointing at the changed key.
- Adding a new node to the graph and resuming logs a "training
from scratch: {node}" warning and proceeds.
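The per-node decisions implied by these criteria (continue at the epoch after the saved one, train new nodes from scratch with a warning, refuse a version mismatch) can be captured in a small planning step that runs before any weights are loaded. This is only a sketch: plan_resume is a hypothetical helper, and the sidecar keys "epoch" and "falcon_version" are assumed names for the fields that checkpoint.json is proposed to carry.

```python
def plan_resume(node_names, checkpoints, running_version):
    """Decide the first epoch to train for each node.

    checkpoints maps node name -> parsed checkpoint.json contents
    (hypothetical keys: "epoch", "falcon_version"). Epochs are 1-based,
    matching "continues at epoch 51 (not 1)".
    """
    start_epochs = {}
    warnings = []
    for node in node_names:
        sidecar = checkpoints.get(node)
        if sidecar is None:
            # Node has no checkpoint, e.g. it was added after the prior run.
            warnings.append(f"training from scratch: {node}")
            start_epochs[node] = 1
            continue
        saved_version = sidecar["falcon_version"]
        if saved_version != running_version:
            raise RuntimeError(
                f"checkpoint for {node} was written by falcon {saved_version}, "
                f"but falcon {running_version} is running"
            )
        # Resume at the epoch boundary right after the last completed epoch.
        start_epochs[node] = sidecar["epoch"] + 1
    return start_epochs, warnings
```

For a graph with nodes z and w where only z has a checkpoint at epoch 50, this returns {"z": 51, "w": 1} plus one warning for w. Doing the version check for every node up front means a mismatch is reported before any training state has been touched.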
Out of scope
- Branching / forking checkpoints (one linear resume chain only).
- Resuming partial epochs (resumption is always at an epoch
boundary, matching the checkpoint cadence).
- Cross-version resume (require the same falcon version that
wrote the checkpoint; record it in checkpoint.json and
refuse mismatches).