
Conversation

@ak3ra
Contributor

@ak3ra ak3ra commented Feb 19, 2024

Notebook to finetune asr models with leb module

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@ak3ra ak3ra requested review from jqug and sharonibejih February 19, 2024 13:02
@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T12:29:39Z
----------------------------------------------------------------

Line #3.    !git clone https://github.com/jqug/leb.git

We could use https://github.com/sunbirdai/leb.git instead; it's more stable than my personal repo, which could have untested changes.


@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T12:29:40Z
----------------------------------------------------------------

Line #5.        name: multispeaker-lug

If you change this to studio-lug then it will use the TTS data (multispeaker-lug is aimed more at ASR, with several speakers, background noise, etc.).


jqug commented on 2024-02-20T13:51:52Z
----------------------------------------------------------------

Oh sorry, I got confused here and thought it was TTS for a moment, ignore this.

@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T12:29:40Z
----------------------------------------------------------------

Line #5.        name: multispeaker-lug

studio-lug


@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T12:29:41Z
----------------------------------------------------------------

Line #2.        # check that all files have the correct sampling rate

Maybe if these print statements aren't needed anymore they could be deleted?


@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T12:29:43Z
----------------------------------------------------------------

Line #2.    class DataCollatorCTCWithPadding:

Can you try leb.utils.DataCollatorCTCWithPadding here? It should be the same code, I think (it's checked into leb so that we don't need to redefine it in each training script).


@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T12:29:44Z
----------------------------------------------------------------

Line #1.    wer_metric = datasets.load_metric("wer")

wer_metric = evaluate.load('wer') should make the deprecation warning go away.
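For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words, which is what `evaluate.load("wer")` computes. A minimal pure-Python sketch for illustration (the `wer` helper here is ours, not part of `evaluate` or `leb`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance over words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1] / len(ref)

# one substitution among three reference words -> 1/3
print(wer("the cat sat", "the cat sit"))
```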


@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T12:29:45Z
----------------------------------------------------------------

These settings could actually go into the config so that all the parameters are logged together in MLFlow/wandb/etc.

config = '''
training:
    output_dir: output/mms-lug
    per_device_train_batch_size: 2
    # ... and so on...
datasets:
    # Dataset settings can go in the config too
'''

then put this into action with

training_args = TrainingArguments(**yaml.safe_load(config)['training'])

Example notebook here.
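A minimal sketch of that pattern (the config values here are illustrative; `yaml` is PyYAML, and the YAML string must be parsed into a dict before its `training` section can be unpacked):

```python
import yaml  # PyYAML, assumed available in the notebook environment

# hypothetical config mirroring the suggestion above
config_text = '''
training:
    output_dir: output/mms-lug
    per_device_train_batch_size: 2
datasets:
    name: multispeaker-lug
'''

config = yaml.safe_load(config_text)

# With transformers installed, the training block unpacks directly:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**config['training'])
print(config['training']['per_device_train_batch_size'])  # -> 2
```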




@review-notebook-app

review-notebook-app bot commented Feb 20, 2024

jqug commented on 2024-02-20T13:56:51Z
----------------------------------------------------------------

Line #17.        - remove_punctuation

If using the most recent leb version, this changed from remove_punctuation to clean_and_remove_punctuation

(the implementation was updated to use the cleantext library)


@ak3ra
Contributor Author

ak3ra commented Mar 18, 2024

Added the eval notebook. Please review, @jqug; @sharonibejih, feedback welcome.

@gitguardian

gitguardian bot commented Mar 18, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | Status    | Secret                         | Commit  | Filename                            |
|----------------|-----------|--------------------------------|---------|-------------------------------------|
| 10025497       | Triggered | Hugging Face user access token | 8b4ba7c | notebooks/leb_salt_evaluation.ipynb |
| 10025497       | Triggered | Hugging Face user access token | b55f39a | notebooks/leb_salt_evaluation.ipynb |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not trivial: you might break other contributors' workflows, and you risk accidentally deleting legitimate data.


Contributor

@jqug jqug Mar 18, 2024

Line #1.    auth_token = "xxxxx"

maybe test if os.environ["HF_TOKEN"]  has been set, and if not then use getpass to define the auth token?
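A sketch of that suggestion (the `get_auth_token` helper name is ours, not from the notebook):

```python
import os
from getpass import getpass

def get_auth_token() -> str:
    # use HF_TOKEN from the environment if it has been set,
    # otherwise prompt interactively without echoing the input
    token = os.environ.get("HF_TOKEN")
    if not token:
        token = getpass("Hugging Face token: ")
        os.environ["HF_TOKEN"] = token  # cache for the rest of the session
    return token
```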



Contributor

@jqug jqug Mar 18, 2024

Line #13.            os.environ["HF_TOKEN"] = "xxxxx"

should this be something like:

os.environ["HF_TOKEN"] = auth_token

self.auth_token = os.environ.get("HF_TOKEN")




Contributor

@jqug jqug Mar 18, 2024

Line #49.            pipe.model.load_adapter(language)

Would this pipeline include a language model to improve the decoding? Wondering how that would be set, so that the right LM is loaded.



Contributor

@jqug jqug Mar 18, 2024

Line #71.            batch_size = 8

This could be an optional argument to __init__, so that we set self.batch_size and it defaults to 8 (since we might run this pipeline on different machine configurations).
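For example (class and parameter names here are illustrative, not the notebook's actual class):

```python
class EvalPipeline:
    """Illustrative wrapper showing batch_size as a constructor argument."""

    def __init__(self, model_id: str, batch_size: int = 8):
        self.model_id = model_id
        # default of 8 suits a typical GPU; override per machine configuration
        self.batch_size = batch_size

    def transcribe(self, audio_batches):
        # the real implementation would pass self.batch_size to the HF pipeline
        ...
```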



@ak3ra
Contributor Author

ak3ra commented Mar 19, 2024

@jqug, @sharonibejih: after training, please save the adapter as follows

# imports needed for saving the adapter (os is used for the output path below)
import os

from transformers.models.wav2vec2.modeling_wav2vec2 import WAV2VEC2_ADAPTER_SAFE_FILE
from safetensors.torch import save_file as safe_save_file

...
...

# prepare for adapter-only fine-tuning: gradient checkpointing saves memory,
# and freezing the base model means only the adapter layers are updated
model.gradient_checkpointing_enable()
model.init_adapter_layers()
model.freeze_base_model()

# make only the adapter weights trainable
adapter_weights = model._get_adapters()
for param in adapter_weights.values():
    param.requires_grad = True

...

...
trainer.train()

# after training

adapter_file = WAV2VEC2_ADAPTER_SAFE_FILE.format(target_lang)
adapter_file = os.path.join(training_args.output_dir, adapter_file)
safe_save_file(model._get_adapters(), adapter_file, metadata={"format": "pt"})

trainer.push_to_hub()
