Edit: After backporting to the old branch (flux) and testing further, something there is interfering with ramtorch: VRAM use is not dropping as it should. Please use the refresh branch instead; I recommend cloning into a fresh directory:
git clone https://github.com/67372a/LoRA_Easy_Training_Scripts -b refresh

Hello @Darudado, I have been working on ramtorch mostly on the refresh branch, where I am rebasing onto the latest upstream sd_scripts to align better with upstream and to trim out old features that didn't have much benefit. The branches for front end, back end, and sd_scripts are refresh/refresh/sd3-upstream. Ramtorch works there without issue as far as I can tell. I haven't backported everything to the old branch, but I can do that fairly easily and will today. To use ramtorch:
I plan to adjust sd_scripts to default to the accelerate device automatically unless one is provided, and probably to default to CPU/RAM if use_ramtorch_network=True.
The VRAM savings can be massive. For example, one tester was able to train a locon with linear dim 512 and conv dim 256 for SDXL at batch size 7 and still didn't fill their VRAM (~22GB out of 24GB), though of course it used more system RAM. Others have been running batch sizes as high as 18 with smaller dims. With any reasonable batch size (4+), the overhead of ramtorch and CPU offloading appears to be negligible; it may even speed things up thanks to streams, asynchronous operations, non-blocking transfers, etc. As far as I have observed, output quality is not degraded in any way.
Hello, thank you for your kind answer and the information. I had heard that you were working on it. Thanks for your work! So, offloading the model should work out of the box. Since I don't know these two optimizers (OCGOpt and SimplifiedAdEmaMixExM), I'd rather not experiment with them while I don't yet know how ramtorch affects training. I'll try offloading the model and training with bigger batches first, then check network/lora offload. Quite excited.
Correct, in the new branch, base model offloading should work regardless of the optimizer. The VRAM utilization of the network/lora and optimizer states varies:
So for large networks, or for optimizers whose states scale with the number of parameters, the VRAM utilization, and thus the benefit of offloading, is larger. Let me know which optimizers you tend to use and I can patch in the changes they need, assuming they are exposed in a way that lets me. It's tedious, but not difficult to do for one-offs. I am just hoping to eventually come up with a solution that avoids doing it manually, so it can also be applied to optimizers whose code isn't as readily patchable by hand.
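To put rough numbers on "states that scale with parameters", here is a back-of-the-envelope sketch. It is not taken from sd_scripts or ramtorch: the function names are made up for illustration, the 1280x1280 layer is just an example size, and it assumes an Adam-style optimizer that keeps two state tensors per trainable parameter, with everything stored in fp32.

```python
def lora_linear_params(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters of a LoRA pair (down: rank x in, up: out x rank)."""
    return rank * (in_features + out_features)


def training_bytes(num_params: int,
                   bytes_per_value: int = 4,
                   state_tensors_per_param: int = 2) -> int:
    """Bytes for weights + gradients + optimizer state (assumed fp32 throughout).

    state_tensors_per_param=2 models an Adam-style optimizer (two moment
    estimates); 0 models a stateless optimizer such as plain SGD.
    """
    return num_params * bytes_per_value * (2 + state_tensors_per_param)


# Illustrative only: one 1280x1280 linear layer wrapped at rank 512.
params = lora_linear_params(1280, 1280, 512)
adam_like = training_bytes(params)
stateless = training_bytes(params, state_tensors_per_param=0)
print(f"{params:,} params, {adam_like / 2**20:.0f} MiB with Adam-style state, "
      f"{stateless / 2**20:.0f} MiB without")
```

Multiply that by the number of wrapped layers and it becomes clear why a large dim combined with a stateful optimizer is where network/optimizer offload has the most to gain.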
@Darudado @78752 please try running
@Darudado @78752 I pushed some fixes and enhancements to ramtorch's helper for applying it to modules in general and to lycoris specifically. use_ramtorch_network wasn't working as intended; it now appears to be working.
RamTorch
lycoris
So overall: fixed use_ramtorch_network for lycoris, and improved how ramtorch is applied to models so it should be faster and use less RAM. I have not done further testing on Kohya's lora implementations yet. They don't seem to require any additional modifications, and perhaps the improvements to ramtorch application will help there too.
@Darudado I greatly appreciate the testing. I will do some more testing locally and see if I can figure out the gaps. For awareness, I did push some adjustments last night: updates to my ramtorch fork to make sure gradients are accumulated in fp32 and stochastically rounded when using bf16 weights. I also updated sd_scripts to split out use_ramtorch_vae (its value is negligible and it slows things down a bit), and fixed some dtype handling in ramtorch that can be problematic because sd_scripts casts things later, after ramtorch has been applied. I still need to test whether there is a gap caused by the training script casting weights at certain points, or whether some module properties are getting lost (i.e. not properly copied) from the original linear modules.
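For anyone wondering why stochastic rounding matters with bf16 weights, here is a small pure-Python illustration. This is not the code from the ramtorch fork; the function names are hypothetical and it works on single floats through `struct` rather than on tensors. It relies only on the fact that bf16 is the top 16 bits of an fp32 value, so adding a random 16-bit integer to the low bits before truncating rounds up with probability equal to the discarded fraction.

```python
import math
import random
import struct


def fp32_bits(x: float) -> int:
    """Bit pattern of x after conversion to IEEE-754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]


def round_nearest_to_bf16(x: float) -> float:
    """Round-to-nearest-even onto the bf16 grid (top 16 bits of fp32)."""
    if not math.isfinite(x):
        return x
    bits = fp32_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)
    return bits_to_fp32(bits & 0xFFFF0000)


def stochastic_round_to_bf16(x: float, rng: random.Random) -> float:
    """Round up with probability equal to the fraction that would be dropped."""
    if not math.isfinite(x):
        return x
    bits = fp32_bits(x) + rng.getrandbits(16)
    return bits_to_fp32(bits & 0xFFFF0000)


def accumulate(round_fn, steps: int = 1000, update: float = 1e-3,
               start: float = 1.0) -> float:
    """Apply many small updates, rounding the weight to bf16 after each one."""
    w = start
    for _ in range(steps):
        w = round_fn(w + update)
    return w


rng = random.Random(0)
nearest = accumulate(round_nearest_to_bf16)
stochastic = accumulate(lambda v: stochastic_round_to_bf16(v, rng))
print("round-to-nearest:", nearest, "stochastic:", stochastic)
```

Near 1.0 the bf16 grid spacing is 2^-7 (about 0.0078), so an update of 0.001 is always rounded away under round-to-nearest and the weight never moves. With stochastic rounding the weight drifts toward the true sum (about 2.0 here) on average, which is the reason for pairing fp32 accumulation with stochastic rounding when the stored weights are bf16.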
@78752 @Darudado for awareness, as of 67372a/sd-scripts@9ea865b, system memory use should now be much lower for Kohya's lora implementations. The issue was that strong references to the original base model weights were being held; the fix addresses that, so the weights can now be properly offloaded.
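As a generic illustration of that failure mode (not the actual sd-scripts change; all class and function names below are made up, and a `bytearray` stands in for a weight tensor): as long as a wrapper keeps a strong reference to the original weights object, Python cannot free it, even after the caller drops its own reference.

```python
import gc
import weakref


class Weights:
    """Stand-in for a large weight tensor (hypothetical)."""

    def __init__(self, n: int) -> None:
        self.data = bytearray(n)


class WrapperHoldingOriginal:
    """Keeps a strong reference, so the original can never be collected."""

    def __init__(self, original: Weights) -> None:
        self.original = original


class WrapperCopying:
    """Copies what it needs and keeps no reference to the original object."""

    def __init__(self, original: Weights) -> None:
        self.data = bytes(original.data)


def original_is_freed(wrapper_cls) -> bool:
    original = Weights(1024)
    probe = weakref.ref(original)    # lets us observe collection without owning
    wrapper = wrapper_cls(original)  # wrapper stays alive for the whole check
    del original                     # the caller drops its own reference
    gc.collect()
    return probe() is None


print("strong reference kept ->", original_is_freed(WrapperHoldingOriginal))
print("no reference kept     ->", original_is_freed(WrapperCopying))
```

The first check reports that the original is still alive and the second that it was released, which mirrors the symptom of RAM staying occupied until the lingering references are removed.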
As per the title, I wonder whether ramtorch is a feature now, whether it works, whether there are known issues, etc.