Add a transformer-based default English fallback #74
Something seems wrong with the en_us one I uploaded to huggingface... I will try training that again soon.
Thanks for the PR! @PeterReid A fallback model has long been on the bucket list, and I am impressed with the size being <10 MB each. The dictionary files are periodically patched when errors are found. I have pushed the latest bump to #75 (still a draft PR, may need verification) and also uploaded the latest dictionaries just now to https://huggingface.co/datasets/hexgrad/misaki so if you are retraining, consider grabbing the latest dictionaries off HF. Edit: Consider also using the silver dictionaries, either for training, validation, or both?
Thanks for the advice, and I'm glad this seems like the right direction to you! I have published updated models to huggingface. I updated to the newer dictionaries from #75 and used 90% of the silver dictionaries as additional training data and 10% as eval. After a lot of messing with the training parameters, I've gotten the model down to 3MB and performing better than it did before (based on eval loss and its reading of that poem).
jabberwocky_us.mp4 jabberwocky_gb.mp4
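For readers who want to reproduce a 90/10 split like the one mentioned above, here is a minimal sketch. It assumes the silver dictionary has already been loaded as a plain word-to-phonemes mapping; the function name `split_entries`, the seed, and the toy data are hypothetical and are not taken from the actual training code.

```python
import random

def split_entries(entries, eval_fraction=0.1, seed=0):
    """Deterministically shuffle (word, phonemes) pairs and split them.

    Returns (train, held_out). Sorting first makes the result independent
    of the dictionary's insertion order.
    """
    items = sorted(entries.items())
    random.Random(seed).shuffle(items)
    n_eval = max(1, round(len(items) * eval_fraction))
    return items[n_eval:], items[:n_eval]

# Toy stand-in for a silver dictionary (hypothetical data).
silver = {f"word{i}": f"phonemes{i}" for i in range(20)}
train, held_out = split_entries(silver)
print(len(train), len(held_out))  # 18 2
```

Fixing the seed keeps the eval set stable across retraining runs, so eval loss stays comparable between experiments.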
I trained some BartForConditionalGeneration models from the transformers package to serve as a default fallback. This potentially drops the need for espeak for out-of-domain words. I used en_gold and gb_gold as the basis for the training data.
The main tweak I made was needed because regular plural forms of words do not seem to appear in those dictionaries. To fix that, I scraped pluralized versions of words from wiki.train.raw and got their pronunciations the same way misaki does when it sees a plural.
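To illustrate the plural step, here is a sketch of the usual English rule for pronouncing a regular plural from the stem's phonemes: "s" after a voiceless consonant, a linking vowel plus "z" after a sibilant, and "z" otherwise. This is not copied from misaki or from the training code. The function name `pluralize_phonemes`, the two phoneme sets, and the default linking vowel are assumptions made for the example.

```python
# Assumed phoneme classes, written in the same symbol set as the samples in this PR.
VOICELESS = set("ptkfθ")
SIBILANTS = set("szʃʒʧʤ")

def pluralize_phonemes(stem, linking_vowel="ɪ"):
    """Append the regular plural ending to a phoneme string."""
    if not stem:
        return stem
    last = stem[-1]
    if last in SIBILANTS:
        # e.g. "bus" -> "buses": a vowel is inserted before the z
        return stem + linking_vowel + "z"
    if last in VOICELESS:
        # e.g. "cat" -> "cats"
        return stem + "s"
    # voiced consonants and vowels, e.g. "tove" -> "toves"
    return stem + "z"

print(pluralize_phonemes("tˈOv"))  # tˈOvz
```

Applying a rule like this to each scraped plural whose singular is already in the dictionary yields extra (word, phonemes) training pairs without hand labeling.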
Here are some sample lines from the poem Jabberwocky, and the G2P that it did with the new fallback.
"Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
frumious -> fɹˈuːmɪəs
Jubjub -> ʤˈʌbʤʌb
twas_brillig_en.mp4
"Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;"
wabe -> wˈAb
gimble -> ɡˈɪmbᵊl
toves -> tˈOvz
brillig -> bɹˈɪlɪɡ
jubjub_gb.mp4
Here's my training code, which is not exactly cleaned up, but at least explains what I did.