BETA-Tuned Timestep Distribution #1225
Conversation
Does this apply to all models? Only diffusion models are Beta-sampled during inference. Flow matching models are sampled with linear sigmas and often with timestep-shifting ("Flux-shift"). Is that correct? Did #1124 also only apply to diffusion, not to flow matching?
It’s a tunable distribution, but it’s specifically intended for diffusion models (SD, SDXL, etc.).
Here are examples:
The issue is that #1124 lacks a theoretical basis (it's more of a heuristic method), but it functions similarly. Also, while it supports flow matching by accepting sigmas, requiring both betas and sigmas added too much code.
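For context on the discussion above: the "Flux-shift" applied to flow matching sigmas is commonly written as sigma' = s·sigma / (1 + (s − 1)·sigma). A minimal sketch of that schedule (function names are illustrative, not from this codebase):

```python
import numpy as np

def linear_sigmas(n_steps: int) -> np.ndarray:
    """Flow matching sigmas, spaced linearly from 1 (pure noise) to 0 (clean)."""
    return np.linspace(1.0, 0.0, n_steps + 1)

def shift_sigmas(sigmas: np.ndarray, shift: float = 3.0) -> np.ndarray:
    """Timestep shift as commonly used for Flux/SD3-style models:
    sigma' = s * sigma / (1 + (s - 1) * sigma).
    This warps sampling toward the high-noise end while keeping the
    endpoints sigma = 0 and sigma = 1 fixed."""
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)
```

Note the contrast with this PR: the shift above is an inference/training schedule warp for flow matching models, whereas the Beta distribution here reweights which timesteps are drawn during diffusion training.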
Do we have any results of our own showing this actually works on SD1.5 and SDXL, and not only on these specific datasets? The paper only covers training at 32x32, 128x128, and 256x256, which are not resolutions either model can do.
It is a known observation in diffusion papers that the later timesteps are relatively easy for the model compared to others (since most of the image is still noise).
So we haven't tried it for any training at all?
You mean testing? Yes, I tested it in my recent runs (SDXL, 1024) and they went very well.
Isn't the opposite the case? Later (= low) timesteps are hard, and very late timesteps are impossible (which is what MIN_SNR_GAMMA attempted to solve).
I'm hesitant with this PR for two more reasons:
Outdated models have many flaws that recent papers try to address. This is one of those cases; I've read about five papers proposing a similar method (sampling more heavily from 'hard' timesteps). However, I'll close this PR if you aren't planning to support SD/SDXL-specific features anymore.



This PR implements the timestep distribution proposed in the paper:
Beta-Tuned Timestep Diffusion Model
This method aims to align timestep sampling with the diffusion model's forward pass, resulting in faster convergence and improved training performance. The paper observes that the data distribution changes most significantly during the initial timesteps, rendering standard uniform sampling sub-optimal.
Usage
- Set Timestep Distribution to BETA.
- Set Noising bias to 1 (corresponds to Beta in the paper; recommended: 1).
- Set Noising weight to < 1 (corresponds to Alpha in the paper; recommended: 0.8).

Note: This is compatible with existing loss weighting strategies (e.g., Min-SNR, Debiased, etc.).
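A minimal sketch of how Beta-distributed timestep sampling can look, assuming Noising weight maps to the Beta distribution's alpha and Noising bias to its beta (parameter and function names here are illustrative; the actual wiring in this PR may differ):

```python
import torch

def sample_beta_timesteps(
    batch_size: int,
    num_train_timesteps: int = 1000,
    alpha: float = 0.8,  # "Noising weight" in the UI; < 1 recommended
    beta: float = 1.0,   # "Noising bias" in the UI; 1 recommended
) -> torch.Tensor:
    """Draw u ~ Beta(alpha, beta) on [0, 1) and scale to integer timesteps.

    With alpha < 1 and beta = 1 the density is proportional to u**(alpha - 1),
    which concentrates samples at the early timesteps, where (per the paper)
    the data distribution changes most during the forward pass."""
    u = torch.distributions.Beta(alpha, beta).sample((batch_size,))
    t = (u * num_train_timesteps).long()
    return t.clamp(0, num_train_timesteps - 1)
```

With alpha = beta = 1 the Beta distribution reduces to the uniform distribution, so the standard sampler is recovered as a special case.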