
Bugfix for consecutive training steps using the same batch #202

Open
oadams wants to merge 1 commit into dscripka:main from zdata-inc:dataloader_prefetch_bug

Conversation


@oadams oadams commented Sep 16, 2024

There was a bug where consecutive training steps would use the same batch. The number of consecutive training steps using the same batch is equal to `n_cpu`. The cause lies in the workers prefetching data in the dataloader: each worker independently keeps track of its own `data_counter` member in `mmap_batch_generator`, so each one replays the same batches. I don't completely understand how the prefetching caused this; all I know is that it did.

The fix substantially changes the training dynamics by bringing a lot more stochasticity into training. In particular, training error is much lower, and validation accuracy and recall are much higher. It's noteworthy that validation false positives are also much higher. The training logic in this codebase is quite idiosyncratic, so it's likely that some of the settings and logic need to be recalibrated to work well with correct minibatch SGD.

Depending on one's settings for n_steps, dataset size, and number of CPUs, it's also much more likely without the fix that the model won't see all the training examples. For example, if you have 100,000 positive examples and are using 50 positive examples per batch, then an epoch should take 2,000 steps. But if you're on a machine with 56 CPUs, then without the fix your model won't see all the data, because it takes 56× as many steps to see the same number of unique data samples.
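The failure mode can be simulated without torch. This is a hypothetical pure-Python model of the bug (the `buggy_stream` function and its numbers are illustrative, not the actual dataloader code): if each of `n_cpu` prefetch workers holds a private copy of the generator's counter, every batch index is emitted `n_cpu` times before the stream advances.

```python
from itertools import islice

def buggy_stream(n_batches, n_workers):
    """Each worker starts from its own private counter, and the loader
    round-robins over workers, so consecutive steps replay the same
    batch index once per worker."""
    counters = [0] * n_workers
    step = 0
    while True:
        w = step % n_workers           # worker serving this step
        yield counters[w] % n_batches  # batch index that worker emits
        counters[w] += 1
        step += 1

n_cpu = 4
seen = list(islice(buggy_stream(n_batches=2000, n_workers=n_cpu), 12))
print(seen)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Epoch arithmetic from the description: 100,000 positives at 50 per
# batch should be one epoch in 2,000 steps...
steps_per_epoch = 100_000 // 50
# ...but with 56 workers each batch repeats 56 times before advancing:
print(steps_per_epoch, steps_per_epoch * 56)  # 2000 112000
```

With the fix, a single shared counter would produce `[0, 1, 2, ...]` instead of each index repeated `n_cpu` times.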

Some indicative training profiles on my own data are below. Grey is without the fix, blue is with the fix.

(Screenshots: training profiles with and without the fix, Sep 16, 2024.)

Note that I made a number of other changes to the training code as part of the runs that gave the above profiles, most notably disabling the `max_negative_weight` scaling, so they are merely indicative. A comparison based on the state of the code in this PR, after reverting my other changes, is below:


|        | Accuracy | Recall | FPPH   |
| ------ | -------- | ------ | ------ |
| No fix | 0.674    | 0.350  | 1.150  |
| Fix    | 0.764    | 0.534  | 17.080 |

dankorotin added a commit to dankorotin/openWakeWord that referenced this pull request Apr 15, 2026
…ause

Three training-side bugs inherited from upstream. All three were identified
in our own Phase 0 survey and have sat on upstream for months without
resolution; each is fixed here with a root-cause change rather than a
workaround.

1. Gradient accumulation was dead code (upstream issue dscripka#316)

   openwakeword.train.Model.train_model called optimizer.zero_grad() at the
   top of every iteration and only called loss.backward() once the running
   accumulated-sample count crossed the 128-sample target. Gradients from
   all intermediate-window iterations were silently wiped before they could
   accumulate in param.grad, so only the final iteration of each window
   contributed to the parameter update. The "accumulation" logic was
   tracking sample counts but never actually accumulating gradients.

   Fix: zero grads only at the window boundary (immediately after
   optimizer.step()), call loss.backward() on every iteration (gradients
   accumulate naturally in param.grad), commit optimizer.step() when the
   accumulated-sample threshold is crossed, and scale each iteration's loss
   by (batch_size / gradient_accum_target) so the effective learning rate
   no longer drifts with the accumulation-window length. A new
   gradient_accum_target kwarg (default 128) replaces the hard-coded
   threshold.

   tests/test_train_gradient_accumulation.py locks in the fix with two
   assertions: (a) parameters change after training with sub-threshold
   batch sizes, (b) history['loss'] logs one entry per committed window.
   Both tests pass locally with the [train] extras installed; they skip
   cleanly via pytest.importorskip otherwise.
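The broken vs. fixed control flow can be illustrated with a toy stand-in for a torch parameter. This is a schematic sketch, not the actual `train_model` code; the `Param` class, learning rate, and batch numbers are all hypothetical:

```python
class Param:
    """Toy stand-in for a torch parameter: .grad accumulates like param.grad."""
    def __init__(self):
        self.value, self.grad = 0.0, 0.0

def train(batches, target=128, buggy=True):
    p, count, updates = Param(), 0, 0
    for batch_size, grad in batches:
        if buggy:
            # upstream bug: zero_grad() at the top of every iteration wipes
            # pending gradients before they can accumulate
            p.grad = 0.0
        count += batch_size
        # backward(); loss scaled by batch_size / target so the effective
        # learning rate doesn't drift with the accumulation-window length
        p.grad += grad * (batch_size / target)
        if count >= target:
            p.value -= 0.1 * p.grad   # optimizer.step() at window boundary
            p.grad, count = 0.0, 0    # fixed code zeros grads only here
            updates += 1
    return p.value, updates

batches = [(32, 1.0)] * 8             # eight 32-sample iterations
buggy_val, _ = train(batches, buggy=True)
fixed_val, _ = train(batches, buggy=False)
# With the bug, only the last iteration of each 128-sample window
# contributes, so each update is 4x too small here.
print(buggy_val, fixed_val)
```

Both variants commit the same number of `optimizer.step()` calls; only the fixed one lets all four iterations of each window contribute to the update.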

2. DataLoader prefetch batch duplication (upstream PR dscripka#202, unmerged)

   mmap_batch_generator in data.py keeps a per-worker data_counter. With
   num_workers=n_cpus, prefetch_factor=16 each DataLoader worker
   independently replayed the first batches from its own counter, so
   training saw every batch n_cpus times in a row before advancing. On a
   56-CPU machine that is 56× batch duplication.

   Fix: drop to single-worker data loading, matching oadams' upstream patch
   in PR dscripka#202. A proper multi-worker solution would shard the generator
   via torch.utils.data.get_worker_info inside IterDataset.__iter__; that
   is flagged as a follow-up in the inline comment and in CHANGELOG.
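The flagged follow-up can be sketched in pure Python: each worker takes a disjoint stride-slice of the batch stream, which is the scheme `torch.utils.data.get_worker_info` enables inside `IterDataset.__iter__` (it supplies the worker id and worker count at iteration time). The `batch_stream` generator below is a hypothetical stand-in for `mmap_batch_generator`, not code from the repo:

```python
from itertools import islice

def shard(generator_factory, worker_id, num_workers):
    """Give each worker every num_workers-th batch, starting at its own
    offset, so no two workers ever emit the same batch."""
    return islice(generator_factory(), worker_id, None, num_workers)

def batch_stream():  # hypothetical stand-in for mmap_batch_generator
    yield from range(8)

shards = [list(shard(batch_stream, w, 4)) for w in range(4)]
print(shards)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The union of the shards covers every batch exactly once, which is the property the per-worker `data_counter` bug violated.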

3. acoustics==0.2.6 is broken on SciPy ≥ 1.15

   acoustics imports scipy.special.sph_harm, which was removed in SciPy
   1.15 (replaced by sph_harm_y). Upstream pinned acoustics as a training
   extra, which would force us to hold SciPy back to < 1.15 in the
   [train] extra just to generate one 1-D colored-noise array per batch
   during data augmentation.

   Fix: drop the acoustics dependency entirely. Colored noise is now
   generated inline by a small FFT-based helper (_colored_noise, five
   colors: white/pink/blue/brown/violet) in data.py. The call site in
   augment_clips switches from acoustics.generator.noise(...) to the new
   helper. CHANGELOG explains the swap.
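A minimal sketch of the FFT-based approach, assuming the usual spectral exponents (power spectrum proportional to f**beta); this is an illustration of the technique, not the repo's `_colored_noise` implementation:

```python
import numpy as np

# Spectral exponents: power spectrum ~ f**beta for each color.
_BETA = {"white": 0.0, "pink": -1.0, "blue": 1.0, "brown": -2.0, "violet": 2.0}

def colored_noise(n, color="pink", rng=None):
    """Shape white noise in the frequency domain by f**(beta/2),
    transform back, and normalize to [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]                   # avoid divide-by-zero at DC
    spectrum *= freqs ** (_BETA[color] / 2.0)
    noise = np.fft.irfft(spectrum, n)
    return noise / np.max(np.abs(noise))  # peak-normalize

x = colored_noise(16000, "pink")
print(x.shape, float(np.max(np.abs(x))))  # (16000,) 1.0
```

Pink noise (beta = -1) attenuates high frequencies by 3 dB/octave; brown attenuates faster, blue and violet boost them, white leaves the spectrum flat.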

No changes to the runtime inference surface — all three touch
openwakeword.train / openwakeword.data and are relevant only during
retraining. Existing test suite (21 runtime tests + 1 custom-verifier
test) remains green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
