Data deduplication by Ciroye · Pull Request #44 · mlcommons/peoples-speech

Ciroye · 2021-07-27T15:50:58Z

Add

All the necessary functions to run the deduplication pipeline

github-actions · 2021-07-27T15:51:13Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

galv

You are missing a BUILD file for deduplicate.py. It's fine to put deduplicate.py in galvasr2/, but let's talk about BUILD file creation before merging this.

galv · 2021-08-09T15:10:32Z

@@ -0,0 +1,123 @@
+import logging
+from galvasr2.align.spark.align_lib import load_audio_id_text_id_mapping, load_transcripts
+from datasketch import MinHash, MinHashLSH, MinHashLSHForest


can you add the appropriate packages to environment.yml? I don't think we have datasketch or nltk right now.
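
A quick way to confirm which of these packages are absent from the active environment before editing environment.yml is sketched below. The helper name `missing_packages` is hypothetical and not part of this PR; it only checks whether a top-level module can be found, not which version is installed.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when no importable module with that name exists.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # datasketch and nltk are the two packages named in the review comment.
    print(missing_packages(["datasketch", "nltk"]))
```

Any name this prints would still need to be added to environment.yml by hand.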

galv · 2021-08-09T15:13:34Z

+        logging.getLogger("py4j").setLevel(logging.ERROR)
+        catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
+        training_sample_rows = catalogue_df.collect()
+        # Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.


I think you should delete "Comment this out to load everything.".

Okay to keep the note about how long loading takes. By the way, is that an old comment? My expectation was that our spark 3.1.2 upgrade fixed the slowdown with loading transcripts.

galv · 2021-08-09T15:19:26Z

+        catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
+        training_sample_rows = catalogue_df.collect()
+        # Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
+        if self.num_rows > 1:


I am not enthusiastic about self.num_rows == 1 being a special case. I would recommend declaring num_rows: Optional[int] = None in __init__ instead. Then you can do if self.num_rows is not None: as the condition here.
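
A minimal sketch of that suggestion follows. It assumes the branch under review truncates the collected rows to num_rows; the diff hunk does not show the body of the if, so that part is a guess. The class name `Deduplicator` and the method `select_rows` are hypothetical stand-ins, not names from this PR.

```python
from typing import List, Optional


class Deduplicator:
    """Hypothetical stand-in for the class under review."""

    def __init__(self, num_rows: Optional[int] = None):
        # None means "no limit": every row is kept.
        self.num_rows = num_rows

    def select_rows(self, rows: List[dict]) -> List[dict]:
        # Truncate only when a limit was explicitly requested.
        # (Assumed behavior; the original branch body is not in the diff.)
        if self.num_rows is not None:
            return rows[: self.num_rows]
        return rows
```

With this shape, num_rows=1 simply means "keep one row", and omitting the argument keeps everything, so no integer value needs to double as a sentinel.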

Ciroye added 2 commits July 26, 2021 16:15

data deduplication pipeline

e434230

deduplication functions

29c32e9

galv reviewed Aug 9, 2021

Ciroye wants to merge 2 commits into main from juanciro/data-deduplication
