feat: enable immediate saving on UI stop and enhance optimizer state backup#727

Open
avan06 wants to merge 2 commits into ostris:main from avan06:save-checkpoint-on-stop
Conversation

avan06 commented Feb 27, 2026

This PR improves the training interruption workflow by ensuring that training progress is captured immediately when a user requests a stop via the UI. It also introduces a backup mechanism for the optimizer state to provide users with more flexibility when resuming or rolling back training.

Key Changes

  1. Immediate Save on UI Stop (DiffusionTrainer)
    Modified maybe_stop to trigger self.save() as soon as a stop signal is detected.
    Added an _is_saving flag to track the saving state, preventing infinite recursion or redundant calls during emergency saves.
    Benefit: Users can now resume training from the precise step of interruption rather than being forced to roll back to the last scheduled checkpoint, so LoRA weights, metadata, and optimizer state stay synchronized at the exit point.

  2. Optimizer State Rotation (BaseSDTrainProcess)
    Implemented a backup system for the optimizer state. When saving, the existing optimizer.pt is moved to optimizer_prev.pt instead of being directly overwritten by the new state.
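The stop-and-save flow from change 1 can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: apart from maybe_stop, save, and _is_saving, the names (request_stop, save_count) are hypothetical stand-ins, and the real DiffusionTrainer does far more work in save().

```python
# Minimal sketch of the save-on-stop pattern (hypothetical simplification of
# the real DiffusionTrainer; only maybe_stop/save/_is_saving come from the PR).
class DiffusionTrainer:
    def __init__(self):
        self._stop_requested = False
        self._is_saving = False  # guards against re-entrant saves
        self.save_count = 0      # stand-in for the real checkpoint logic

    def request_stop(self):
        # Called from the UI when the user presses "stop".
        self._stop_requested = True

    def save(self):
        if self._is_saving:
            return  # already saving: avoid recursion and redundant saves
        self._is_saving = True
        try:
            self.save_count += 1  # real code writes weights + optimizer here
        finally:
            self._is_saving = False

    def maybe_stop(self):
        # Checked once per training step.
        if self._stop_requested:
            self.save()  # capture state at the exact interruption step
            raise KeyboardInterrupt("training stopped by user")
```

The _is_saving guard matters because an emergency save triggered from inside the training loop could otherwise re-enter save() (e.g. via hooks that themselves check the stop signal).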

Example Scenario (How to roll back)

If a user decides they prefer the results from a previous scheduled save over the final interrupted save:

  1. Current State: You have a scheduled save at Step 500 (train_000000500.safetensors and optimizer_prev.pt) and an interruption save at Step 666 (train_000000666.safetensors and optimizer.pt).
  2. Rollback Process:
  • Delete the interruption files: train_000000666.safetensors and optimizer.pt.
  • Rename optimizer_prev.pt to optimizer.pt.
  3. Result: The trainer will successfully resume training from Step 500 using the correct historical optimizer state.
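In shell terms, the rollback above amounts to the following. The demo uses empty stand-in files with the names from the scenario; in a real run you would operate on the actual files in your training output folder.

```shell
# Demo of the rollback procedure with empty stand-in files
# (use your real training output folder in practice).
mkdir -p rollback_demo
touch rollback_demo/train_000000500.safetensors rollback_demo/train_000000666.safetensors
touch rollback_demo/optimizer.pt rollback_demo/optimizer_prev.pt

# 1) Delete the interruption-save artifacts (Step 666)
rm rollback_demo/train_000000666.safetensors rollback_demo/optimizer.pt

# 2) Restore the Step 500 optimizer state
mv rollback_demo/optimizer_prev.pt rollback_demo/optimizer.pt
```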

These changes have been verified in a local environment and are confirmed to be working as intended.

- Modified maybe_stop in DiffusionTrainer to trigger self.save() when a stop signal is detected.
- Added an _is_saving flag to manage saving state and prevent infinite recursion during emergency saves.
- Ensures LoRA weights, metadata, and optimizer state are synchronized at the exact exit step.
- Enables users to resume training from the precise interruption point instead of rolling back to the last scheduled save.
- Now maintains a backup of the previous optimizer state: when saving, the existing optimizer.pt is moved to optimizer_prev.pt rather than being overwritten.
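The rotation step can be sketched as a small helper. This is a hypothetical simplification: the PR implements the rotation inside BaseSDTrainProcess and saves real torch optimizer state, whereas this sketch writes raw bytes to show only the move-then-write ordering.

```python
# Sketch of the optimizer-state rotation; rotate_and_save is a hypothetical
# helper (the PR does this inside BaseSDTrainProcess with torch.save).
import os

def rotate_and_save(state: bytes, save_dir: str) -> None:
    current = os.path.join(save_dir, "optimizer.pt")
    backup = os.path.join(save_dir, "optimizer_prev.pt")
    if os.path.exists(current):
        # Keep the previous state as a backup instead of overwriting it.
        os.replace(current, backup)
    with open(current, "wb") as f:
        f.write(state)
```

os.replace is atomic on a single filesystem, so a crash mid-save leaves at least one intact optimizer state on disk.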