Speed up DSAlign by galv · Pull Request #38 · mlcommons/peoples-speech

galv · 2021-07-06T21:04:45Z

Add a unit test (dsalign_lib_test.py) to check that these speedups
actually work.

I add Cython as a dependency in order to make the Smith-Waterman
function faster. This makes "sw_align" approximately 10x faster.
"sw_align_old" is retained in case we ever want to check that the new
function exactly matches the old output (I already checked that it
does with the unit test).
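
For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of Smith-Waterman local alignment scoring. It is not the code in search.py: the function name and the match, mismatch, and gap values are assumptions chosen for illustration. The nested loop over the dynamic-programming matrix is the kind of hot path that typically benefits from a typed Cython rewrite.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Scoring parameters are illustrative assumptions, not DSAlign's values.
    """
    # prev and curr are rolling rows of the dynamic-programming matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Keeping a reference implementation like this around (as the PR does with "sw_align_old") makes it possible to assert that an optimized version returns identical results on the same inputs.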

These are the current results from profiling dsalign_lib_test.py.
Previously it took over 200 seconds to align this segment; now it
takes around 120 seconds.

Ordered by: internal time

         ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
          52003   48.730    0.001  111.226    0.002  text.py:184(similarity)
       69031918   39.739    0.000   57.229    0.000  utils.py:105(enweight)
       69215860   12.950    0.000   12.993    0.000  text.py:152(ngrams)
            670   10.514    0.016   10.523    0.016  search.py:49(sw_align)
       68719564    4.510    0.000    4.510    0.000  {built-in method builtins.abs}
       13218994    2.369    0.000    2.369    0.000  {built-in method builtins.min}
       30927766    2.250    0.000    2.250    0.000  __init__.py:570(__missing__)
            601    1.939    0.003   12.488    0.021  search.py:107(find_best)
          52003    0.422    0.000  111.648    0.002  dsalign_lib.py:196(<lambda>)
          52576    0.206    0.000    0.206    0.000  {built-in method builtins.sum}
         104678    0.187    0.000    0.269    0.000  __init__.py:550(__init__)
              1    0.170    0.170    0.238    0.238  text.py:63(add_original_text)
1278000/1277989    0.150    0.000    0.150    0.000  {built-in method builtins.len}
         312018    0.139    0.000    0.139    0.000  text.py:168(weighted_ngrams)
          52003    0.103    0.000  111.786    0.002  dsalign_lib.py:215(<lambda>)
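
A listing in this format can be produced with the standard library's cProfile and pstats modules. The sketch below is one way to do it, not necessarily how the numbers above were collected; the helper name profile_call and the top parameter are hypothetical.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" sorts by time spent inside each function itself,
    # which pstats reports as "Ordered by: internal time".
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()
```

Sorting by internal time rather than cumulative time is what surfaces leaf functions such as enweight and ngrams, which are called tens of millions of times.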

github-actions · 2021-07-06T21:04:58Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Ignore [noise] and other silence-like words. Convert them to silence in the ctm file. Do basic text normalization with gruut. Split only on word boundaries in forced alignment. Disable (just by commenting out) the gap alignment stage of DSAlign. I have not found it helpful. It tends to include or exclude text that isn't part of the original audio.
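
As a rough illustration of the first change, the sketch below rewrites silence-like words in CTM lines. It assumes the usual CTM field order (recording, channel, start, duration, word, optional confidence). The function name, the token set, and the "<eps>" replacement are hypothetical; the commit does not say which silence token is written.

```python
# Hypothetical set; the commit names only "[noise]" explicitly.
SILENCE_LIKE = {"[noise]"}
# Hypothetical placeholder for whatever the pipeline treats as silence.
SILENCE_TOKEN = "<eps>"


def silence_noise_in_ctm(lines):
    """Replace silence-like words in CTM lines with SILENCE_TOKEN."""
    cleaned = []
    for line in lines:
        fields = line.split()
        # Field 4 (0-based) is the word in a CTM record.
        if len(fields) >= 5 and fields[4].lower() in SILENCE_LIKE:
            fields[4] = SILENCE_TOKEN
        cleaned.append(" ".join(fields))
    return cleaned
```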

Lots of changes in here that I did not do a good job of documenting. Sorry.

Kaldi requires data in sorted order according to key. Keeping the tar file data sorted by key makes that easier to support (i.e., the HDD won't have to seek around as much). Remove " " from key names, since Kaldi doesn't support " " in key names. Add an option to choose the output audio format. Right now it is wav, because Ceron encountered some issues with loading flac from tar files in NeMo. Comment out some bazel targets that we don't need right now (sorry...)
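
The key handling described above can be sketched with the standard tarfile module. This is an assumption-laden illustration, not the repository's code: normalize_key and write_sorted_tar are hypothetical names, and the ".wav" suffix simply mirrors the current output format mentioned in the commit.

```python
import io
import tarfile


def normalize_key(key):
    """Drop spaces, which the commit says Kaldi does not support in keys."""
    return key.replace(" ", "")


def write_sorted_tar(fileobj, items):
    """Write {key: bytes} entries to a tar stream in sorted key order."""
    # Note: two keys that differ only by spaces would collide here.
    cleaned = {normalize_key(key): data for key, data in items.items()}
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        # Sort on encoded bytes so the order does not depend on locale.
        for key in sorted(cleaned, key=lambda k: k.encode("utf-8")):
            data = cleaned[key]
            info = tarfile.TarInfo(name=key + ".wav")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

Because tar members are stored sequentially, writing them in key order means a reader that consumes keys in sorted order can stream the archive front to back.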

Ciroye previously approved these changes Jul 7, 2021

galv force-pushed the daniel/speed-up-alignment branch 2 times, most recently from ae4cdc4 to c4bf91a · September 14, 2021 21:18

galv added 4 commits September 16, 2021 17:26

Segment flac files Creation stages.

8ef3d04

Lots of changes in here that I did not do a good job of documenting. Sorry.

Rerun black.

b9263ff

galv force-pushed the daniel/speed-up-alignment branch from c4bf91a to b9263ff · September 16, 2021 17:50

galv and others added 2 commits September 19, 2021 20:35

fixup: Format Python code with Black

21b8878

nathanwasson dismissed Ciroye’s stale review via 21b8878 May 16, 2023 22:15

galv wants to merge 6 commits into main from daniel/speed-up-alignment
