RamTorch #51

Darudado · 2025-11-16T12:29:46Z

Darudado
Nov 16, 2025

As per title, I wonder if ramtorch is a feature now. I wonder if it works, issues, etc...

67372a · 2025-11-17T13:33:13Z

67372a
Nov 17, 2025
Maintainer

Edit: Upon further testing after backporting to the old branch (flux), something there is interfering with ramtorch working correctly: VRAM use is not dropping as it should.

As such, please use the refresh branch instead. I recommend a fresh directory: git clone https://github.com/67372a/LoRA_Easy_Training_Scripts -b refresh

Hello @Darudado , I have been working on ramtorch mostly on the refresh branch, where I am rebasing on the latest sd_scripts from upstream, using it to align better with upstream and trim out old features that didn't have much benefit.

The branches for front end, back end, and sd_scripts are refresh/refresh/sd3-upstream.

As for ramtorch there, it works without issue now from what I can tell, I haven't backported everything to the old branch, but I can do that fairly easily and will today.

To use ramtorch:

extra training arg use_ramtorch=True enables it for the base model
extra training arg use_ramtorch_network=True enables it for the network/lora. NOTE: this requires the optimizer to have .to()s defined that move the parameter to the correct device. I have currently implemented that for OCGOpt and SimplifiedAdEmaMixExM, and plan to look into a generalized solution to avoid having to manually update all the optimizers.
the previously mentioned optimizers default to offloading state to CPU/RAM (controllable via the state_storage_device optimizer argument, e.g. setting it to CUDA would place the states in VRAM).
Default handling for optimizers that support offloading:
- If state_storage_dtype is not provided as an arg, default to training dtype
- If state_storage_device is not provided as an arg, default to accelerator's device, unless use_ramtorch_network is true, then use cpu

I plan to adjust sd_scripts to automatically default to the accelerate device unless one is provided, and probably have it default to CPU/RAM if use_ramtorch_network=True

use lycoris-based networks/loras; I have not tested or implemented the changes needed for Kohya's implementations to work, only lycoris
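As an illustration of the default handling for state_storage_dtype and state_storage_device described above, here is a minimal sketch. It is not the actual sd_scripts or optimizer code: the function name resolve_state_storage and the use of plain strings for dtypes and devices are assumptions made only for this example.

```python
def resolve_state_storage(optimizer_args, training_dtype, accelerator_device,
                          use_ramtorch_network):
    """Fill in optimizer state storage defaults (hypothetical helper).

    optimizer_args: dict of user-supplied optimizer arguments.
    training_dtype / accelerator_device: plain strings here for simplicity.
    """
    resolved = dict(optimizer_args)  # do not mutate the caller's dict

    # No dtype given: fall back to the training dtype.
    if resolved.get("state_storage_dtype") is None:
        resolved["state_storage_dtype"] = training_dtype

    # No device given: use the accelerator's device, unless the network
    # itself is offloaded with ramtorch, in which case keep states on cpu.
    if resolved.get("state_storage_device") is None:
        resolved["state_storage_device"] = (
            "cpu" if use_ramtorch_network else accelerator_device
        )
    return resolved


if __name__ == "__main__":
    print(resolve_state_storage({}, "bf16", "cuda:0", use_ramtorch_network=True))
    print(resolve_state_storage({"state_storage_device": "cuda"}, "bf16",
                                "cuda:0", use_ramtorch_network=True))
```

An explicitly provided value always wins; the ramtorch flag only changes the fallback device.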

The VRAM savings can be massive. For example, one of my testers was able to train a 512 linear dim, 256 conv dim locon for SDXL at BS 7 and still didn't fully fill their VRAM (~22GB out of 24GB), though of course it used more system RAM. Others have been running as high as BS 18 with smaller dims.

With any reasonable batch size (4+), the overhead of ramtorch and CPU offloading appears to be negligible. In fact, it may actually speed things up due to the use of streams, asynchronous operations, non-blocking transfers, etc.

The quality of outputs is not degraded in any way as far as I have observed.

0 replies

Darudado · 2025-11-18T10:07:57Z

Darudado
Nov 18, 2025
Author

Hello, thank you for your kind answer and information. I had heard rumors that you were working on it. Thanks for your work!
I'll try to do some SDXL training this weekend, as I'm waiting for some new RAM modules (prices went crazy!). I'm hopeful I'll be able to offload some big models to RAM later, which opens up a lot of opportunities.

So, offloading the model should work out of the box.
Meanwhile, offloading the network/lora needs some code change.
I quickly checked lodestone's implementation, and I understand that replacing functions on each optimizer is quite time-consuming...
As I understand it, the weight of the network/lora is fairly small? I'm unsure whether it's just more efficient to keep it in VRAM if I don't scale dim a lot.

Not knowing these two optimizers (OCGOpt and SimplifiedAdEmaMixExM), I probably don't want to experiment with them when I don't know how ramtorch is affecting training.

I'll try to offload the model and train bigger batches, then check network/lora offload.

Quite excited.

0 replies

67372a · 2025-11-18T12:51:55Z

67372a
Nov 18, 2025
Maintainer

@Darudado

Correct, in the new branch, base model offloading should work regardless of the optimizer.

The VRAM utilization of the network/lora and optimizer states varies with:

the rank/dim of the network, which impacts parameter dimensions and optimizer state dimensions
the type of network, as some have more modules and parameters than others, e.g. glora vs locon vs boft
the number of optimizer states that scale with the parameters

So for large networks, or optimizers with states that scale with the parameters, the VRAM utilization, and thus the offload benefit, is larger.
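To make the scaling above concrete, here is a rough back-of-the-envelope sketch. It assumes a plain LoRA pair on linear layers (rank * (in + out) parameters per layer), gradients and optimizer states stored in the same dtype as the weights, and an optimizer with a fixed number of per-parameter states (AdamW keeps two). The layer shapes are hypothetical, and other network types such as locon conv layers, glora, or boft will differ.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2}


def lora_linear_params(in_features, out_features, rank):
    # A LoRA pair is a down projection (rank x in) plus an up projection (out x rank).
    return rank * (in_features + out_features)


def network_memory_bytes(layer_shapes, rank, dtype="bf16", states_per_param=2):
    """Estimate memory for network weights, gradients, and optimizer states.

    layer_shapes: iterable of (in_features, out_features) tuples (hypothetical).
    states_per_param: per-parameter optimizer states, e.g. 2 for AdamW.
    Activations and the base model are deliberately ignored.
    """
    n_params = sum(lora_linear_params(i, o, rank) for i, o in layer_shapes)
    per_elem = BYTES_PER_ELEMENT[dtype]
    weights = n_params * per_elem
    grads = n_params * per_elem
    optimizer_states = n_params * per_elem * states_per_param
    return {
        "params": n_params,
        "weights": weights,
        "grads": grads,
        "optimizer_states": optimizer_states,
        "total": weights + grads + optimizer_states,
    }


if __name__ == "__main__":
    shapes = [(1280, 1280)] * 10  # made-up layer list, for illustration only
    for rank in (16, 128, 512):
        est = network_memory_bytes(shapes, rank)
        print(rank, est["params"], f'{est["total"] / 2**20:.1f} MiB')
```

Under these assumptions the footprint grows linearly with rank and with the number of optimizer states, which is why high-dim networks benefit most from offloading.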

Let me know which optimizers you tend to use and I can patch in the changes needed for them, assuming they are exposed in a way that lets me. It's tedious, but not difficult to do for one-offs. I am just hoping to eventually come up with a solution that avoids having to do it manually, and that can be applied to optimizers whose code isn't as readily patchable by hand.

15 replies

78752 Nov 20, 2025

I'm still having an issue with installing the forked repos of ramtorch and lycoris via pip on windows, I had to download the .zips of both.

67372a Nov 20, 2025
Maintainer

@78752 can you share what errors are occurring during the pip process? Without them it is unclear to me what the issue might be. Also, are you on Python 3.11 and do you have the latest git installed?

67372a Nov 20, 2025
Maintainer

You can also check that pip is up to date by running python -m pip install --upgrade pip with the venv active. I plan to look into adding logic to automatically check pip for updates.

78752 Nov 20, 2025

@67372a I've tried this on Python 3.10, 3.11, 3.12 and 3.13. I was using a version of git from 2024 and tried updating it to the latest release, and that hasn't worked either.

This is what I'm seeing related to ramtorch in pip:

Collecting RamTorch@ git+https://github.com/67372a/RamTorch (from -r requirements.txt (line 37))
  Cloning https://github.com/67372a/RamTorch to e:\temp\pip-install-y6oyllxw\ramtorch_732280541acc4a848b387be2c4816f36
  Running command git clone --filter=blob:none --quiet https://github.com/67372a/RamTorch 'E:\temp\pip-install-y6oyllxw\ramtorch_732280541acc4a848b387be2c4816f36'
  Resolved https://github.com/67372a/RamTorch to commit db10dd1f13d63662f0f32cc2f54199497d5199af

Building wheels for collected packages: customized-optimizers, RamTorch, library
  Building wheel for customized-optimizers (pyproject.toml) ... done
  Created wheel for customized-optimizers: filename=customized_optimizers-1.0.1-py3-none-any.whl size=71130 sha256=aedddc5f8b1f7f2536a48b54fd6ce3ff673ebac62cab7758a7c5f86f83b8b491
  Stored in directory: E:\temp\pip-ephem-wheel-cache-1qkjcbbg\wheels\d6\03\ff\b762ee06061a3674b8b44a4776f415bfde3eb70cc95c390970
  Building wheel for RamTorch (pyproject.toml) ... done
  Created wheel for RamTorch: filename=ramtorch-0.2.1-py3-none-any.whl size=14278 sha256=e90e6a0a18843eb16063126ad98cbe42353b80b86e2baa70ef25afc320f965d9
  Stored in directory: E:\temp\pip-ephem-wheel-cache-1qkjcbbg\wheels\58\d3\33\6a62fab6fa6cecfa1f3a9aaa225c03ddd3bb89fd3567942d60
  Building editable for library (pyproject.toml) ... done
  Created wheel for library: filename=library-0.0.0-0.editable-py3-none-any.whl size=6894 sha256=53c80313a063c8b89cb4bc499b1c65f5a5d87f7be46317f1db5ec74bc381c94f
  Stored in directory: E:\temp\pip-ephem-wheel-cache-1qkjcbbg\wheels\a7\a8\c4\5c5cc7456b95b263f092f5d41dc50f661f0af7a7a32379a965
Successfully built customized-optimizers RamTorch library

But the actual ramtorch install folder is missing the modules and stochastic_optimizer folders and at least its helpers.py file isn't the one from your repo's latest commit. Lycoris is displaying the same kind of behavior, the files don't match your latest commit.
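One way to confirm this kind of mismatch without browsing site-packages by hand is to ask the installed distribution's metadata which files it recorded. The sketch below uses only the standard library; the helper name missing_subpackages is hypothetical, and the distribution name "ramtorch" is taken from the wheel filename in the log above.

```python
from importlib import metadata


def missing_subpackages(dist_name, expected):
    """Return the expected sub-folders that the installed distribution lacks.

    If the distribution is not installed (or has no file record), every
    expected sub-folder is reported as missing.
    """
    try:
        files = metadata.files(dist_name) or []
    except metadata.PackageNotFoundError:
        files = []

    found = set()
    for path in files:
        parts = path.parts
        # e.g. ramtorch/modules/linear.py -> sub-folder "modules"
        if len(parts) >= 3 and parts[1] != "__pycache__":
            found.add(parts[1])
    return [name for name in expected if name not in found]


if __name__ == "__main__":
    # Run inside the sd_scripts venv.
    print(missing_subpackages("ramtorch", ["modules", "stochastic_optimizer"]))
```

An empty list means both folders were recorded by the install; it does not verify that the file contents match the latest commit.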

67372a Nov 20, 2025
Maintainer

Working on replicating it locally to determine a fix

67372a · 2025-11-20T17:33:20Z

67372a
Nov 20, 2025
Maintainer

@Darudado @78752 please try running pip install git+https://github.com/67372a/RamTorch --force-reinstall again in the sd_scripts venv. There was an issue with the ramtorch pyproject config that I corrected, so it should work now.

12 replies

78752 Nov 22, 2025

@67372a I tested this using the same dataset, batch size, seed, dim size, etc for both the kohya and lycoris versions of locon. Shouldn't any bottlenecks caused by the dataset or hardware have applied equally to both?

67372a Nov 22, 2025
Maintainer

@78752 Ah, I skimmed. Unfortunately they won't be like for like because the implementations differ, in particular in how Kohya's handles the original weights versus how Lycoris does. I'd have to do some digging to understand why ramtorch causes a significant performance regression on Kohya's implementation compared to the Lycoris implementation.

Is your VRAM use greater on Kohya's than on Lycoris, or the same? The massive slowdown could be due to a slight overflow to system memory that doesn't trigger a full OOM, which CUDA doesn't handle well at all. One thing to check on Windows is that CUDA - Sysmem Fallback Policy is set to Prefer No Sysmem Fallback globally in the Nvidia control panel. This can prevent it from allowing overflow, though there is still a boundary where it can happen.

On the sharpness point, I do agree with it probably being a factor of not using fp8, as perhaps it can fit more precisely to the base weights.

78752 Nov 22, 2025

@67372a Overall VRAM usage has consistently been a bit lower on kohya and there shouldn't have been any OOM issues with either network during the tests I've done.

Darudado Nov 23, 2025
Author

Yes, from my tests locon (kohya) is a lot more time and memory efficient, but the quality is lower than locon (lycoris).
I'm not really doing rigorous testing either. I'm mostly trying to use bigger ranks, which is something I couldn't do before.

I keep overfitting very fast or not updating much. I'll just keep changing LR until it seems stable.

67372a Nov 23, 2025
Maintainer

@78752 very odd, will have to do some tests myself to see.

@Darudado usually you want to keep conv less than or equal to network/linear, I've observed half of linear being a rule of thumb as well.

You certainly may need to adjust the LR, more parameters (higher dim/rank) tend to require lower LRs, assuming the ratio of alpha is kept constant, while higher effective batch sizes require higher LRs.

67372a · 2025-11-24T12:47:46Z

67372a
Nov 24, 2025
Maintainer

@Darudado @78752 pushed some fixes and enhancements for ramtorch's helper for applying it to modules generally and lycoris specifically. use_ramtorch_network wasn't working as intended, now it appears to be working.

RamTorch

now properly maintains pinning of memory for its weights and biases; before, it was losing the pinning due to reassignment instead of copy. This should make it faster, as without pinning all transfers would silently become blocking
passes forward requires_grad from the original tensor's setting during ramtorch init
now accepts and sets a target dtype; before, it would always store the RAM weights as fp32 first, so now it should use less RAM and result in faster transfers. This is set in sd_scripts to align with weight_dtype.

lycoris

now properly transfers weights from RAM asynchronously; before, it didn't. To be transparent, that piece was a bad implementation by one LLM versus another, as I'm not really familiar with it. I was more thorough this time, and based on testing this new approach does appear to work.

So overall: fixed use_ramtorch_network for lycoris, and improved ramtorch application to models so it should be faster and use less RAM. I have not done further testing on Kohya's lora impls yet. They don't seem to require any additional modifications, and perhaps the improvements to ramtorch application will help there too.

12 replies

Darudado Nov 25, 2025
Author

@67372a I think there are still issues even after the update. I updated using update.bat and noticed it updated the branch code too, should be fine.

I tried to train higher rank lora again after the update, but unsatisfied by the results I started to suspect there are still problems.
I went back to an old workflow and dataset I used before ramtorch's implementation. Added both ramtorch flags to it and let it train again with the same settings, BS, LR, steps... I just added the flags.

It seems like it didn't train very well, it's not training everything. Regarding what I noticed, the style changed a bit and seemed to converge but way worse, the subjects of the images don't seem to vary much from the model itself.

Example images have been picked from the same epoch
Just the base model:

The old lora I trained without ramtorch:

The new lora with ramtorch:

I think I already shared this workflow, it's the workflow of the past training I did. The changes I made for the later training session just added the flags for ramtorch set to True.
WorkflowExportScorn.txt

Can you check again, just to be sure I'm not crazy?

67372a Nov 25, 2025
Maintainer

@Darudado for awareness, laplace timestep sampling, gradient_noise_scale, and loss_related_use_float64 are not supported on refresh, but if you ran with and without ramtorch on refresh, that shouldn't be a factor.

I wonder if you are out of sync, the submodules and way things are installed makes it a bit screwy once things get out of sync, easier to reinstall usually. Dunno if you already did after I made fixes to the .gitmodules references in front and backend.

refresh/refresh/sd3-upstream should be the branches for front/back/sd_scripts. update.bat should normally update the whole chain, it will call update.bat in the backend as well if it's linked, if it isn't.

You can also directly update ramtorch and lycoris in the sd_scripts venv, though really you want to make sure everything is up to date, as there are changes to optimizers and sd_scripts that could matter:

pip install -U --force-reinstall --no-deps git+https://github.com/67372a/RamTorch
pip install -U --force-reinstall --no-deps git+https://github.com/67372a/LyCORIS@dev

I wouldn't expect an exact like for like, but you should at least see a similar level of training progress.

Darudado Nov 25, 2025
Author

I'll try a new install and repeating the test, maybe I missed something.

67372a Nov 26, 2025
Maintainer

@Darudado I appreciate your patience and feedback, it is very helpful for me to help determine if there are any issues or room for improvement.

Darudado Nov 27, 2025
Author

@67372a I did more testing, since I wasn't convinced.
I noticed the earlier test with ramtorch had an issue where it still had network dim(128) and alpha(64) from earlier tests, so I restarted the whole thing trying to be more careful and rigorous. I reinstalled the refresh branch too in another directory.

These are the 2 workflows I tested:
Ramtorch (both flags)
export-Ramtorch.txt
Without Ramtorch
export-NO-Ramtorch.txt
These should be the same as the earlier scorn workflow, but I'm attaching them to be more complete. In case I changed something during the 2 different sessions of tests and made some mistake.
I also let AI compare them, they should be correct:

The main functional difference is that the Ramtorch version enables RAMTorch optimization with the two additional parameters, while the non-Ramtorch version does not. The output names are also different to distinguish between the two training runs. Both configurations appear to be otherwise identical in terms of training parameters, model settings, and optimization strategies.

The objective is to train them to 24 epochs using the same config I used in the flux branch 2-3 weeks ago, which also trained for 24 epochs, and get a comparable result. The test is to learn a style that's not in the base model's knowledge.

Example image of base model generation and generation using a lora made weeks ago with same config in the flux branch are in the message I wrote earlier. All generated with seed: 1337975433 and same workflow.

I started with the LoRA training with ramtorch, the result is that it burned right away at epoch 1.

I continued and trained the lora without Ramtorch, this time I got to 24 epoch, it seems to have trained the style correctly mostly, the position of the character also varied during training.

I restarted my pc, paranoia mostly, and then retried to train with ramtorch, epoch 1 burned right away again.

My feeling is that ramtorch for the lora/network, or even scorn, probably still has issues, so I disabled the flag for the network and tried 1 epoch.
Seems like it's actually training and not burning.

I'm assuming there is still something wrong. I can't really say what, maybe network/scorn. I didn't notice anything out of the ordinary in the logs.
I didn't have the time to finish the training, so it may still break later... but it seemed to me like enough info to share.
Thanks

67372a · 2025-12-01T13:25:13Z

67372a
Dec 1, 2025
Maintainer

@Darudado greatly appreciate the testing, I will do some more testing locally and see if I can figure out the gaps.

For awareness, I did push some adjustments last night, updates to my ramtorch fork to make sure gradients are accumulated at fp32, and stochastically rounded if using bf16 weights. I also updated sd_scripts to split out use_ramtorch_vae (as value is negligible and slows it down a bit), and fixed some dtyping of ramtorch that can be problematic due to sd_scripts casting things later, after applying ramtorch.
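For readers unfamiliar with stochastic rounding, the sketch below shows the general idea on a single value: an fp32 number is rounded to one of its two neighbouring bf16 values with probability proportional to how close it is to each, so small updates survive on average instead of always being rounded away. This is a generic, standard-library illustration using the usual bit trick (add 16 random bits, then truncate); it is not the code in the ramtorch fork, and the function names are made up for the example.

```python
import math
import random
import struct


def _f32_to_bits(x):
    # Reinterpret a value (rounded to float32) as a 32-bit unsigned integer.
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_to_f32(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def stochastic_round_to_bf16(x, rng):
    """Round x to a bf16-representable value, up or down at random.

    bf16 is the top 16 bits of fp32, so adding a random 16-bit integer to the
    fp32 bit pattern and clearing the low 16 bits rounds away from zero with
    probability equal to the discarded fraction.
    """
    if math.isnan(x) or math.isinf(x):
        return x
    bits = (_f32_to_bits(x) + rng.getrandbits(16)) & 0xFFFF0000
    return _bits_to_f32(bits)


if __name__ == "__main__":
    rng = random.Random(0)
    x = 1.0 + 2.0 ** -10  # lies between the bf16 neighbours 1.0 and 1.0078125
    samples = [stochastic_round_to_bf16(x, rng) for _ in range(20000)]
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map this x to 1.0 every time; with stochastic rounding the mean over many samples stays close to x.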

I still need to do some testing to determine if there is a gap due to casting of weights by the training script at certain points that could be problematic, or perhaps some module properties are getting lost (i.e. not properly copied) from the original linear modules.

1 reply

Darudado Dec 3, 2025
Author

greatly appreciate the testing

No problem, I use the tools so I want to give feedback. I'm grateful myself.

I will do some more testing locally

Absolutely, test on your side. May just be my machine or something. Lot of possible failure points.

Since I disabled network ramtorch, results have been fine.

67372a · 2026-02-22T18:23:02Z

67372a
Feb 22, 2026
Maintainer

@78752 @Darudado for awareness, as of 67372a/sd-scripts@9ea865b, system memory use should now be much lower for Kohya's lora implementations. The issue was that strong references to the original base model weights were being held; the fix addresses that, so now they can be properly offloaded.

1 reply

Darudado Feb 24, 2026
Author

That's good news, I'll try it later
