
preprocess and tokenize datasets#58

Open
tyoc213 wants to merge 2 commits into main from tokenize-datasets

Conversation

@tyoc213
Contributor

@tyoc213 tyoc213 commented Nov 7, 2025

Rewrite of the work done in #43

@tyoc213 tyoc213 force-pushed the tokenize-datasets branch 6 times, most recently from ec5b555 to b1d64ef Compare November 26, 2025 18:17
@tyoc213 tyoc213 force-pushed the tokenize-datasets branch 2 times, most recently from f9c5f49 to b35efc0 Compare December 5, 2025 17:26
@vishalbakshi
Contributor

vishalbakshi commented Jan 5, 2026

TBD if we need to do this or not

@tyoc213 we should replace this:

    def __iter__(self) -> Iterable[dict[str, NDArray]]:
        buffer = []
        for sample in self.hf_dataset:
            encoded = self.tokenizer(
                sample['text'],
                truncation=False,
                padding=False,
            )
            iids = encoded['input_ids']
            buffer = buffer + self.bos_tokens + iids + self.eos_tokens
            while len(buffer) >= self.max_length:
                concat_sample = buffer[:self.max_length]
                buffer = buffer[self.max_length:] if self.should_wrap else []
                yield {
                    # convert to ndarray to store in MDS format
                    'tokens': np.asarray(concat_sample, dtype=np.int32),
                }

to something like this:

    def __iter__(self) -> Iterable[dict[str, NDArray]]:
        for sample in self.hf_dataset:
            encoded = self.tokenizer(
                sample['text'],
                truncation=False,
                padding=False,
            )
            iids = self.bos_tokens + encoded['input_ids'] + self.eos_tokens
            yield {
                # convert to ndarray to store in MDS format
                'tokens': np.asarray(iids[:self.max_length], dtype=np.int32),
            }
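For context, the two variants differ in what they emit: the original buffered version packs tokens from consecutive samples into fixed-length chunks (a sequence can span sample boundaries, and wrapping carries leftover tokens forward), while the proposed version yields one record per sample, truncated to `max_length`, silently dropping any overflow. A minimal standalone sketch of both behaviors, using plain token-id lists in place of the tokenizer and hypothetical function names:

```python
import numpy as np

def pack_tokens(samples, max_length, bos, eos, should_wrap=True):
    """Buffered variant: concatenate samples and yield fixed-length chunks.

    `samples` is an iterable of already-tokenized id lists; `bos`/`eos`
    are lists of special-token ids. Names are illustrative, not the PR's API.
    """
    buffer = []
    for iids in samples:
        buffer += bos + iids + eos
        while len(buffer) >= max_length:
            chunk = buffer[:max_length]
            # wrapping keeps the leftover tokens for the next chunk
            buffer = buffer[max_length:] if should_wrap else []
            yield np.asarray(chunk, dtype=np.int32)

def truncate_tokens(samples, max_length, bos, eos):
    """Proposed variant: one record per sample, truncated to max_length."""
    for iids in samples:
        yield np.asarray((bos + iids + eos)[:max_length], dtype=np.int32)

samples = [[10, 11, 12], [20, 21], [30]]
packed = list(pack_tokens(samples, max_length=4, bos=[1], eos=[2]))
trunc = list(truncate_tokens(samples, max_length=4, bos=[1], eos=[2]))
# packed: every chunk is exactly 4 tokens and no token is lost;
# trunc: records vary in length and overflow tokens are dropped.
```

The trade-off this makes visible: packing wastes no tokens and produces uniform-length training sequences, while per-sample truncation is simpler but discards text beyond `max_length` and yields ragged records.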
