Bugfix for consecutive training steps using the same batch #202
Open
oadams wants to merge 1 commit into dscripka:main
Conversation
There was a bug where consecutive training steps would use the same batch. The number of consecutive training steps using the same batch is equal to `n_cpu`. It has to do with the number of workers prefetching data in the dataloader: each worker independently keeps track of its own `data_counter` member in `mmap_batch_generator`. I don't completely understand how the prefetching caused this; all I know is that it did. The fix substantially changes the training dynamics by bringing stochasticity back into training. In particular, training error is much lower, and validation accuracy and recall are much higher. It's noteworthy that validation false positives are also much higher. The training logic in this codebase is quite idiosyncratic, so it's likely that some of the settings and logic need to be recalibrated to work well with correct minibatch SGD in order to minimize false positives.
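A minimal sketch of the failure mode in plain Python (no PyTorch; the class and loop below are hypothetical stand-ins, not the actual loader code): when a DataLoader forks workers, each worker process receives its own copy of the dataset object, so each one advances its own `data_counter` from zero, and the loader's round-robin draining returns the same batch `n_cpu` times in a row.

```python
import copy

class MmapBatchGenerator:
    """Stand-in for mmap_batch_generator: one stateful counter per object."""
    def __init__(self, data):
        self.data = data
        self.data_counter = 0

    def __next__(self):
        batch = self.data[self.data_counter % len(self.data)]
        self.data_counter += 1
        return batch

source = MmapBatchGenerator(["batch0", "batch1", "batch2", "batch3"])

# Worker processes start from copies of the dataset, each with its own counter.
n_cpu = 2
workers = [copy.deepcopy(source) for _ in range(n_cpu)]

# The loader drains workers round-robin, so every batch repeats n_cpu times.
seen = [next(workers[i % n_cpu]) for i in range(6)]
print(seen)  # ['batch0', 'batch0', 'batch1', 'batch1', 'batch2', 'batch2']
```

This is only a model of the symptom; the real interaction also involves `prefetch_factor`, but the per-worker counter copy is the root of the duplication.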
dankorotin added a commit to dankorotin/openWakeWord that referenced this pull request on Apr 15, 2026:
> …ause
>
> Three training-side bugs inherited from upstream. All three were identified in our own Phase 0 survey and have sat on upstream for months without resolution; each is fixed here with a root-cause change rather than a workaround.
>
> 1. Gradient accumulation was dead code (upstream issue dscripka#316)
>
> openwakeword.train.Model.train_model called optimizer.zero_grad() at the top of every iteration and only called loss.backward() once the running accumulated-sample count crossed the 128-sample target. Gradients from all intermediate-window iterations were silently wiped before they could accumulate in param.grad, so only the final iteration of each window contributed to the parameter update. The "accumulation" logic was tracking sample counts but never actually accumulating gradients.
>
> Fix: zero grads only at the window boundary (immediately after optimizer.step()), call loss.backward() on every iteration (gradients accumulate naturally in param.grad), commit optimizer.step() when the accumulated-sample threshold is crossed, and scale each iteration's loss by (batch_size / gradient_accum_target) so the effective learning rate no longer drifts with the accumulation-window length. A new gradient_accum_target kwarg (default 128) replaces the hard-coded threshold. tests/test_train_gradient_accumulation.py locks in the fix with two assertions: (a) parameters change after training with sub-threshold batch sizes, (b) history['loss'] logs one entry per committed window. Both tests pass locally with the [train] extras installed; they skip cleanly via pytest.importorskip otherwise.
>
> 2. DataLoader prefetch batch duplication (upstream PR dscripka#202, unmerged)
>
> mmap_batch_generator in data.py keeps a per-worker data_counter. With num_workers=n_cpus, prefetch_factor=16, each DataLoader worker independently replayed the first batches from its own counter, so training saw every batch n_cpus times in a row before advancing. On a 56-CPU machine that is 56× batch duplication.
>
> Fix: drop to single-worker data loading, matching oadams' upstream patch in PR dscripka#202. A proper multi-worker solution would shard the generator via torch.utils.data.get_worker_info inside IterDataset.__iter__; that is flagged as a follow-up in the inline comment and in CHANGELOG.
>
> 3. acoustics==0.2.6 is broken on SciPy ≥ 1.15
>
> acoustics imports scipy.special.sph_harm, which was removed in SciPy 1.15 (replaced by sph_harm_y). Upstream pinned acoustics as a training extra, which would force us to hold SciPy back to < 1.15 in the [train] extra just to generate one 1-D colored-noise array per batch during data augmentation.
>
> Fix: drop the acoustics dependency entirely. Colored noise is now generated inline by a small FFT-based helper (_colored_noise, five colors: white/pink/blue/brown/violet) in data.py. The call site in augment_clips switches from acoustics.generator.noise(...) to the new helper. CHANGELOG explains the swap.
>
> No changes to the runtime inference surface: all three touch openwakeword.train / openwakeword.data and are relevant only during retraining. Existing test suite (21 runtime tests + 1 custom-verifier test) remains green.
>
> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
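The corrected accumulation window described in item 1 can be sketched with a toy one-parameter model (hypothetical code, not the actual openwakeword.train implementation): the loss is scaled by `batch_size / gradient_accum_target`, gradients accumulate on every iteration, and the optimizer steps and zeros only at the window boundary.

```python
def train(targets, batch_size=32, gradient_accum_target=128, lr=0.1):
    """Toy loop: loss = 0.5 * (w - target)^2 per batch, one scalar parameter."""
    w = 0.0
    grad = 0.0        # plays the role of param.grad
    accumulated = 0   # samples seen in the current accumulation window
    history = []      # one loss entry per committed window
    window_loss = 0.0
    for target in targets:
        scale = batch_size / gradient_accum_target  # keep effective LR stable
        window_loss += scale * 0.5 * (w - target) ** 2
        grad += scale * (w - target)   # backward(): gradients accumulate
        accumulated += batch_size
        if accumulated >= gradient_accum_target:   # window boundary reached
            w -= lr * grad             # optimizer.step()
            grad = 0.0                 # zero grads only after stepping
            accumulated = 0
            history.append(window_loss)
            window_loss = 0.0
    return w, history

w, history = train([1.0] * 8)
print(len(history))  # 8 batches of 32 samples -> 2 committed windows
```

The buggy version, by contrast, would reset `grad = 0.0` at the top of every loop iteration, so only the last batch of each window would influence the step.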
Depending on one's settings for n_steps, dataset size, and number of CPUs, it's also much more likely without the fix that the model won't see all the training examples. For example, if you have 100,000 positive examples and are using 50 positive examples per batch, then an epoch should take 2,000 steps. But on a machine with 56 CPUs, without the fix it takes 56× more steps to see the same number of unique data samples, so the model may never see all the data.
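The arithmetic above can be checked directly (numbers taken from the example, with `n_cpu` assumed to be 56):

```python
positives = 100_000
per_batch = 50
n_cpu = 56

# With the fix: one pass over the positives takes this many steps.
steps_per_epoch = positives // per_batch
print(steps_per_epoch)    # 2000

# Without the fix each unique batch repeats n_cpu times in a row, so
# covering the same unique data takes n_cpu times as many steps.
steps_without_fix = steps_per_epoch * n_cpu
print(steps_without_fix)  # 112000
```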
Some indicative training profiles from runs on my own data are below. Grey is without the fix; blue is with the fix.
Note that I made a number of other changes to the training code as part of the runs that gave the above profiles, most notably disabling the max_negative_weight scaling, so the comparison is merely indicative. A comparison based on the state of the code in this PR, after reverting my other changes, is below: