Speed up DSAlign #38
Open
galv wants to merge 6 commits into
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Ciroye
previously approved these changes
Jul 7, 2021
ae4cdc4 to c4bf91a
Add a unit test (dsalign_lib_test.py) to check that these speedups
actually work.
Add Cython as a dependency in order to make the Smith-Waterman
function faster. This makes "sw_align" run approximately 10x faster.
"sw_align_old" is retained in case we ever want to check that the new
function exactly matches the old output (I already checked that it
does with the unit test).
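For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of Smith-Waterman local alignment scoring, plus the kind of old-versus-new equivalence check described above. It is not the project's "sw_align": the function names, the scoring parameters (match, mismatch, gap) and the linear gap penalty are all assumptions made for illustration. The nested loop over every cell is the sort of code that typically benefits from compilation with Cython.

```python
def sw_score_reference(a, b, match=2, mismatch=-1, gap=-1):
    """Full-matrix Smith-Waterman: best local alignment score of a and b.

    Hypothetical reference version; scoring values are illustrative only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0: that is what makes the alignment local.
            h[i][j] = max(0, h[i - 1][j - 1] + sub, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


def sw_score_two_rows(a, b, match=2, mismatch=-1, gap=-1):
    """Same recurrence, keeping only the previous row to save memory."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def assert_same_scores(old_fn, new_fn, pairs):
    """Check that a rewritten function matches the retained old one."""
    for a, b in pairs:
        old, new = old_fn(a, b), new_fn(a, b)
        assert old == new, f"mismatch for {a!r} vs {b!r}: {old} != {new}"
```

Keeping the slower version around, as the commit does with "sw_align_old", makes a check like assert_same_scores cheap to rerun whenever the fast path changes.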
These are the current results from profiling dsalign_lib_test.py.
Previously we took over 200 seconds to align this segment, but now we
are at around 120 seconds.
   Ordered by: internal time

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          52003   48.730    0.001  111.226    0.002 text.py:184(similarity)
       69031918   39.739    0.000   57.229    0.000 utils.py:105(enweight)
       69215860   12.950    0.000   12.993    0.000 text.py:152(ngrams)
            670   10.514    0.016   10.523    0.016 search.py:49(sw_align)
       68719564    4.510    0.000    4.510    0.000 {built-in method builtins.abs}
       13218994    2.369    0.000    2.369    0.000 {built-in method builtins.min}
       30927766    2.250    0.000    2.250    0.000 __init__.py:570(__missing__)
            601    1.939    0.003   12.488    0.021 search.py:107(find_best)
          52003    0.422    0.000  111.648    0.002 dsalign_lib.py:196(<lambda>)
          52576    0.206    0.000    0.206    0.000 {built-in method builtins.sum}
         104678    0.187    0.000    0.269    0.000 __init__.py:550(__init__)
              1    0.170    0.170    0.238    0.238 text.py:63(add_original_text)
1278000/1277989    0.150    0.000    0.150    0.000 {built-in method builtins.len}
         312018    0.139    0.000    0.139    0.000 text.py:168(weighted_ngrams)
          52003    0.103    0.000  111.786    0.002 dsalign_lib.py:215(<lambda>)
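A listing like the one above can be produced with the standard library's cProfile and pstats modules. The following is a small sketch of one way to do it, not necessarily how the numbers in this pull request were collected; the helper name profile_by_internal_time and its "top" parameter are hypothetical.

```python
import cProfile
import io
import pstats


def profile_by_internal_time(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report text).

    The report is sorted by tottime, which pstats labels "internal time".
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats(pstats.SortKey.TIME).print_stats(top)
    return result, stream.getvalue()
```

Sorting by tottime rather than cumtime highlights where time is spent inside a function itself, which is why similarity and enweight appear at the top of the listing even though the lambdas that call them have a larger cumulative time.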
Ignore [noise] and other silence-like words. Convert them to silence in the ctm file.
Do basic text normalization with gruut.
Split only on word boundaries in forced alignment.
Disable (just by commenting out) the gap alignment stage of DSAlign. I have not found it helpful. It tends to include or exclude text that isn't part of the original audio.
Lots of changes in here that I did not do a good job of documenting. Sorry.
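As a rough illustration of the [noise] handling mentioned in this commit, the sketch below rewrites silence-like words in CTM lines. It assumes the common CTM field order (file, channel, start, duration, word, optional confidence). The list of silence-like words other than [noise], the silence symbol "<eps>", and the function name are all assumptions; the pull request does not say which ones it uses.

```python
# Hypothetical values: only "[noise]" is named in the commit message.
SILENCE_LIKE = frozenset({"[noise]", "[laughter]", "[silence]"})
SILENCE_TOKEN = "<eps>"


def normalize_ctm_line(line, silence_like=SILENCE_LIKE, silence_token=SILENCE_TOKEN):
    """Replace a silence-like word in one CTM line with the silence token.

    Lines with fewer than five fields are returned unchanged apart from
    the trailing newline being stripped.
    """
    stripped = line.rstrip("\n")
    fields = stripped.split()
    if len(fields) < 5:
        return stripped
    if fields[4].lower() in silence_like:
        fields[4] = silence_token
    return " ".join(fields)
```

Doing the substitution on the word field only leaves the timing columns untouched, so downstream alignment still sees the original start times and durations.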
c4bf91a to b9263ff
Kaldi requires data to be sorted by key. Keeping the tar file data sorted by key makes that easier to support (i.e., the HDD won't have to seek around as much).
Remove " " from key names, since Kaldi doesn't support " " in key names.
Allow an option to output audio in whatever codec you want. Right now it's the wav file format, because Ceron encountered some issues with loading flac from tar files in nemo.
Comment out some bazel targets that we don't need right now (sorry...)
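The key handling described above can be sketched with the standard tarfile module. This is an illustration under assumptions, not the code in this pull request: the function names, the in-memory dict of samples, and the ".wav" default extension are hypothetical. Keys are sorted by their UTF-8 bytes on the assumption that Kaldi expects C-locale (byte-wise) ordering.

```python
import io
import tarfile


def clean_key(key):
    """Remove spaces from a key name."""
    return key.replace(" ", "")


def write_sorted_tar(samples, fileobj, extension=".wav"):
    """Write {key: bytes} to a tar archive in byte-wise sorted key order.

    Returns the cleaned keys in the order they were written.
    """
    cleaned = {}
    for key, payload in samples.items():
        new_key = clean_key(key)
        if new_key in cleaned:
            # Removing spaces can merge two distinct keys; fail loudly.
            raise ValueError(f"key collision after removing spaces: {new_key!r}")
        cleaned[new_key] = payload
    ordered = sorted(cleaned, key=lambda k: k.encode("utf-8"))
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for key in ordered:
            info = tarfile.TarInfo(name=key + extension)
            info.size = len(cleaned[key])
            tar.addfile(info, io.BytesIO(cleaned[key]))
    return ordered
```

Because tar members are stored sequentially, writing them in key order means a reader that consumes keys in sorted order can stream the archive front to back without seeking.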