Added stt-finetune notebook #16
base: main
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks. Powered by ReviewNB.
jqug commented on 2024-02-20T12:29:39Z (Line #3: !git clone https://github.com/jqug/leb.git)
We could use https://github.com/sunbirdai/leb.git instead (more stable than my personal repo, which could have untested changes).
jqug commented on 2024-02-20T12:29:40Z (Line #5: name: multispeaker-lug)
If you change this to studio-lug then it will use the TTS data. (multispeaker-lug is more aimed at ASR, with several speakers, background noise, etc.)
jqug commented on 2024-02-20T13:51:52Z
Oh sorry, I got confused here and thought it was TTS for a moment, ignore this.
jqug commented on 2024-02-20T12:29:40Z (Line #5: name: multispeaker-lug)
Suggested change: studio-lug
jqug commented on 2024-02-20T12:29:41Z (Line #2: # check that all files have the correct sampling rate)
Maybe if these print statements aren't needed anymore they could be deleted?
jqug commented on 2024-02-20T12:29:43Z (Line #2: class DataCollatorCTCWithPadding:)
Can you try leb.utils.DataCollatorCTCWithPadding here? It should be the same code, I think (checked into leb so that we don't need to redefine it each time in the training scripts).
jqug commented on 2024-02-20T12:29:44Z (Line #1: wer_metric = datasets.load_metric("wer"))
wer_metric = evaluate.load('wer') should make the deprecation warning go away.
jqug commented on 2024-02-20T12:29:45Z
These settings could actually go into the config so that all the parameters are logged together in MLflow/wandb/etc.:
config = '''
training:
  output_dir: output/mms-lug
  per_device_train_batch_size: 2
  # ... and so on
'''
then put this into action with:
training_args = TrainingArguments(**config['training'])
Example notebook here.
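A minimal sketch of the config-driven setup suggested in the comment, assuming the YAML string is parsed before unpacking; any keys beyond output_dir and per_device_train_batch_size are illustrative:

```python
import yaml

# All training hyperparameters live in one config mapping, so they can
# be logged together (e.g. to MLflow or wandb) and versioned as a unit.
config = yaml.safe_load('''
training:
  output_dir: output/mms-lug
  per_device_train_batch_size: 2
  num_train_epochs: 5   # illustrative extra key
''')

# The training section then unpacks straight into TrainingArguments
# (left commented here to keep the sketch dependency-light):
# from transformers import TrainingArguments
# training_args = TrainingArguments(**config['training'])
```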
jqug commented on 2024-02-20T13:56:51Z (Line #17: - remove_punctuation)
If using the most recent leb version, this changed from (the implementation was updated to use the cleantext library).
Added eval notebook. Please review, @jqug, @sharonibejih; feedback welcome.
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 10025497 | Triggered | Hugging Face user access token | 8b4ba7c | notebooks/leb_salt_evaluation.ipynb |
| 10025497 | Triggered | Hugging Face user access token | b55f39a | notebooks/leb_salt_evaluation.ipynb |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act: you might break other contributing developers' workflows, and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following best practices for managing and storing secrets, including API keys and other credentials;
- installing secret detection as a pre-commit hook to catch secrets before they leave your machine and ease remediation.
Line #1. auth_token = "xxxxx"
maybe test if os.environ["HF_TOKEN"] has been set, and if not then use getpass to define the auth token?
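A minimal sketch of the fallback suggested here; the helper function name is illustrative, not from the notebook:

```python
import os
from getpass import getpass

def get_auth_token() -> str:
    """Use HF_TOKEN from the environment if set; otherwise prompt for it.

    This keeps the token out of the notebook source (illustrative
    helper, not part of the notebook under review).
    """
    token = os.environ.get("HF_TOKEN")
    if not token:
        # getpass hides the input, so the token is never echoed or saved.
        token = getpass("Hugging Face token: ")
    return token
```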
Line #13. os.environ["HF_TOKEN"] = "xxxxx"
should this be something like:
os.environ["HF_TOKEN"] = auth_token
self.auth_token = os.environ.get("HF_TOKEN")
Line #49. pipe.model.load_adapter(language)
Would this pipeline include a language model to improve the decoding? Wondering how that would be set, so that the right LM is loaded.
Line #71. batch_size = 8
This could be an optional argument to __init__, so that we set self.batch_size and it defaults to 8 (since we might run this pipeline on different machine configurations).
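A sketch of the suggested change; the class name and the model_id parameter are illustrative, only the batch_size handling reflects the review comment:

```python
class EvaluationPipeline:
    """Illustrative skeleton showing batch_size as a constructor argument."""

    def __init__(self, model_id: str, batch_size: int = 8):
        # Default of 8 matches the original hardcoded value; callers can
        # override it for machines with different memory constraints.
        self.model_id = model_id
        self.batch_size = batch_size
```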
@jqug, @sharonibejih after training, please save the adapter as follows
Notebook to fine-tune ASR models with the leb module.