Data deduplication#44

Open
Ciroye wants to merge 2 commits into main from juanciro/data-deduplication

Conversation

@Ciroye
Collaborator

@Ciroye Ciroye commented Jul 27, 2021

Add

  • All the necessary functions to run the deduplication pipeline

@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Collaborator

@galv galv left a comment

You are missing a BUILD file for deduplicate.py. It's fine to put deduplicate.py in galvasr2/, but let's talk about BUILD file creation before merging this.

Comment thread galvasr2/deduplicate.py
@@ -0,0 +1,123 @@
import logging
from galvasr2.align.spark.align_lib import load_audio_id_text_id_mapping, load_transcripts
from datasketch import MinHash, MinHashLSH, MinHashLSHForest
Collaborator

Can you add the appropriate packages to environment.yml? I don't think we have datasketch or nltk right now.

Comment thread galvasr2/deduplicate.py
logging.getLogger("py4j").setLevel(logging.ERROR)
catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
Collaborator

I think you should delete "Comment this out to load everything.".

It's okay to keep the note about how long loading takes. By the way, is that an old comment? My expectation was that our Spark 3.1.2 upgrade fixed the slowdown with loading transcripts.

Comment thread galvasr2/deduplicate.py
catalogue_df = load_audio_id_text_id_mapping(spark, data_trans_index)
training_sample_rows = catalogue_df.collect()
# Comment this out to load everything. It might takes ~15 minute, in my experience, on an 8 core machine.
if self.num_rows > 1:
Collaborator

I am not enthusiastic about self.num_rows == 1 being a special case. I would recommend declaring num_rows: Optional[int] = None in __init__ instead (Optional comes from the typing module). Then you can do if self.num_rows is not None: as the condition here.
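
A minimal sketch of that suggestion, for illustration only. The class name `Deduplicator` and the method `select_rows` are hypothetical stand-ins, not names from deduplicate.py, and a plain Python list takes the place of the collected Spark rows:

```python
from typing import List, Optional


class Deduplicator:
    """Hypothetical stand-in for the class under review in deduplicate.py."""

    def __init__(self, num_rows: Optional[int] = None):
        # None means "no limit, load everything"; an integer is an explicit cap.
        self.num_rows = num_rows

    def select_rows(self, rows: List[str]) -> List[str]:
        # Truncate only when a limit was explicitly requested,
        # so no integer value has to act as a sentinel.
        if self.num_rows is not None:
            return rows[: self.num_rows]
        return rows
```

With this shape, `Deduplicator()` keeps every row and `Deduplicator(num_rows=1)` really does mean "one row", so the default carries the "load everything" meaning instead of a magic number.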

2 participants