Bugfix for consecutive training steps using the same batch #202
Open
oadams wants to merge 1 commit into dscripka:main
Conversation
There was a bug where consecutive training steps would use the same batch. The number of consecutive training steps using the same batch is equal to `n_cpu`. It has to do with the number of workers prefetching data in the dataloader: each worker independently keeps track of its own `data_counter` member in `mmap_batch_generator`. I don't completely understand how the prefetching caused this; all I know is that it did. The fix substantially changes the training dynamics by bringing stochasticity back into training. In particular, training error is much lower, and validation accuracy and recall are much higher. It's noteworthy that validation false positives are also much higher. The training logic in this codebase is quite idiosyncratic, so it's likely that some of the settings and logic need to be recalibrated to work well with correct minibatch SGD in order to minimize false positives.
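A minimal sketch of the failure mode in plain Python (no PyTorch; the class and loop below are hypothetical stand-ins, not the actual loader code): when a DataLoader forks workers, each worker process receives its own copy of the dataset object, so each one advances its own `data_counter` from zero, and the loader's round-robin draining returns the same batch `n_cpu` times in a row.

```python
import copy

class MmapBatchGenerator:
    """Stand-in for mmap_batch_generator: one stateful counter per object."""
    def __init__(self, data):
        self.data = data
        self.data_counter = 0

    def __next__(self):
        batch = self.data[self.data_counter % len(self.data)]
        self.data_counter += 1
        return batch

source = MmapBatchGenerator(["batch0", "batch1", "batch2", "batch3"])

# Worker processes start from copies of the dataset, each with its own counter.
n_cpu = 2
workers = [copy.deepcopy(source) for _ in range(n_cpu)]

# The loader drains workers round-robin, so every batch repeats n_cpu times.
seen = [next(workers[i % n_cpu]) for i in range(6)]
print(seen)  # ['batch0', 'batch0', 'batch1', 'batch1', 'batch2', 'batch2']
```

This is only a model of the symptom; the real interaction also involves `prefetch_factor`, but the per-worker counter copy is the root of the duplication.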
dankorotin added a commit to dankorotin/openWakeWord that referenced this pull request on Apr 15, 2026:
> …ause
>
> Three training-side bugs inherited from upstream. All three were identified in our own Phase 0 survey and have sat on upstream for months without resolution; each is fixed here with a root-cause change rather than a workaround.
>
> 1. Gradient accumulation was dead code (upstream issue dscripka#316)
>
> openwakeword.train.Model.train_model called optimizer.zero_grad() at the top of every iteration and only called loss.backward() once the running accumulated-sample count crossed the 128-sample target. Gradients from all intermediate-window iterations were silently wiped before they could accumulate in param.grad, so only the final iteration of each window contributed to the parameter update. The "accumulation" logic was tracking sample counts but never actually accumulating gradients.
>
> Fix: zero grads only at the window boundary (immediately after optimizer.step()), call loss.backward() on every iteration (gradients accumulate naturally in param.grad), commit optimizer.step() when the accumulated-sample threshold is crossed, and scale each iteration's loss by (batch_size / gradient_accum_target) so the effective learning rate no longer drifts with the accumulation-window length. A new gradient_accum_target kwarg (default 128) replaces the hard-coded threshold. tests/test_train_gradient_accumulation.py locks in the fix with two assertions: (a) parameters change after training with sub-threshold batch sizes, (b) history['loss'] logs one entry per committed window. Both tests pass locally with the [train] extras installed; they skip cleanly via pytest.importorskip otherwise.
>
> 2. DataLoader prefetch batch duplication (upstream PR dscripka#202, unmerged)
>
> mmap_batch_generator in data.py keeps a per-worker data_counter. With num_workers=n_cpus, prefetch_factor=16, each DataLoader worker independently replayed the first batches from its own counter, so training saw every batch n_cpus times in a row before advancing. On a 56-CPU machine that is 56× batch duplication.
>
> Fix: drop to single-worker data loading, matching oadams' upstream patch in PR dscripka#202. A proper multi-worker solution would shard the generator via torch.utils.data.get_worker_info inside IterDataset.__iter__; that is flagged as a follow-up in the inline comment and in CHANGELOG.
>
> 3. acoustics==0.2.6 is broken on SciPy ≥ 1.15
>
> acoustics imports scipy.special.sph_harm, which was removed in SciPy 1.15 (replaced by sph_harm_y). Upstream pinned acoustics as a training extra, which would force us to hold SciPy back to < 1.15 in the [train] extra just to generate one 1-D colored-noise array per batch during data augmentation.
>
> Fix: drop the acoustics dependency entirely. Colored noise is now generated inline by a small FFT-based helper (_colored_noise, five colors: white/pink/blue/brown/violet) in data.py. The call site in augment_clips switches from acoustics.generator.noise(...) to the new helper. CHANGELOG explains the swap.
>
> No changes to the runtime inference surface: all three touch openwakeword.train / openwakeword.data and are relevant only during retraining. Existing test suite (21 runtime tests + 1 custom-verifier test) remains green.
>
> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
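The corrected accumulation window described in item 1 can be sketched with a toy one-parameter model (hypothetical code, not the actual openwakeword.train implementation): the loss is scaled by `batch_size / gradient_accum_target`, gradients accumulate on every iteration, and the optimizer steps and zeros only at the window boundary.

```python
def train(targets, batch_size=32, gradient_accum_target=128, lr=0.1):
    """Toy loop: loss = 0.5 * (w - target)^2 per batch, one scalar parameter."""
    w = 0.0
    grad = 0.0        # plays the role of param.grad
    accumulated = 0   # samples seen in the current accumulation window
    history = []      # one loss entry per committed window
    window_loss = 0.0
    for target in targets:
        scale = batch_size / gradient_accum_target  # keep effective LR stable
        window_loss += scale * 0.5 * (w - target) ** 2
        grad += scale * (w - target)   # backward(): gradients accumulate
        accumulated += batch_size
        if accumulated >= gradient_accum_target:   # window boundary reached
            w -= lr * grad             # optimizer.step()
            grad = 0.0                 # zero grads only after stepping
            accumulated = 0
            history.append(window_loss)
            window_loss = 0.0
    return w, history

w, history = train([1.0] * 8)
print(len(history))  # 8 batches of 32 samples -> 2 committed windows
```

The buggy version, by contrast, would reset `grad = 0.0` at the top of every loop iteration, so only the last batch of each window would influence the step.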
Depending on one's settings for n_steps, dataset size, and number of CPUs, it's also much more likely without the fix that the model won't see all the training examples. For example, if you have 100,000 positive examples and are using 50 positive examples per batch, then an epoch should take 2,000 steps. But on a machine with 56 CPUs, without the fix it takes 56× more steps to see the same number of unique data samples, so the model may never see all the data.
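The arithmetic above can be checked directly (numbers taken from the example, with `n_cpu` assumed to be 56):

```python
positives = 100_000
per_batch = 50
n_cpu = 56

# With the fix: one pass over the positives takes this many steps.
steps_per_epoch = positives // per_batch
print(steps_per_epoch)    # 2000

# Without the fix each unique batch repeats n_cpu times in a row, so
# covering the same unique data takes n_cpu times as many steps.
steps_without_fix = steps_per_epoch * n_cpu
print(steps_without_fix)  # 112000
```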
Some indicative training profiles from runs on my own data are below. Grey is without the fix; blue is with the fix.
Note that I made a number of other changes to the training code as part of the runs that gave the above profiles, most notably disabling the max_negative_weight scaling, so the comparison is merely indicative. A comparison based on the state of the code in this PR, after reverting my other changes, is below: