Data deduplication#44
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
galv
left a comment
There was a problem hiding this comment.
You are missing a BUILD file for deduplicate.py. It's fine to put deduplicate.py in galvasr2/, but let's talk about BUILD file creation before merging this.
| @@ -0,0 +1,123 @@ | |||
| import logging | |||
| from galvasr2.align.spark.align_lib import load_audio_id_text_id_mapping, load_transcripts | |||
| from datasketch import MinHash, MinHashLSH, MinHashLSHForest | |||
There was a problem hiding this comment.
can you add the appropriate packages to environment.yml? I don't think we have datasketch or nltk right now.
| logging.getLogger("py4j").setLevel(logging.ERROR) | ||
| catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index) | ||
| training_sample_rows = catalogue_df.collect() | ||
| # Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine. |
There was a problem hiding this comment.
I think you should delete "Comment this out to load everything.".
Okay to keep the note about how long loading takes. By the way, is that an old comment? My expectation was that our spark 3.1.2 upgrade fixed the slowdown with loading transcripts.
| catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index) | ||
| training_sample_rows = catalogue_df.collect() | ||
| # Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine. | ||
| if self.num_rows > 1: |
There was a problem hiding this comment.
I am not enthusiastic about self.num_rows == 1 being a special case. I would recommend declaring num_rows: Option[int] = None in __init__ instead. Then you can do if self.num_rows is not None: as the condition here.
Add