Context
I worked on a project where I needed to tokenize a 14 GB dataset. Since I don't have enough disk space on my PC and wanted to keep using miditok, my solution was to load the dataset with Hugging Face `load_dataset` in streaming mode, so I could process the MIDI data on the fly with miditok's `MusicTokenizer` object.
Feature Proposal
To facilitate the use of miditok with streaming datasets, the proposed feature consists of:
- Creating an abstract dataset class (suggested name `_StreamingDatasetABC`, to stay consistent with the existing abstract class in the repo). This class would inherit from `torch.utils.data.IterableDataset` and `ABC`, hold samples (and optionally labels), and implement the basic magic methods;
- Creating both `IterableDatasetMIDI` and `IterableDatasetJSON` to handle the logic each of them is required to perform;
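The proposed class hierarchy could look roughly like the sketch below. This is only an illustration of the intended structure, not actual miditok code: the class and method names beyond `_StreamingDatasetABC` are hypothetical, and plain `ABC` stands in for `torch.utils.data.IterableDataset` so the sketch stays self-contained.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator

class _StreamingDatasetABC(ABC):
    """Hypothetical abstract streaming dataset.

    In the actual proposal it would also inherit from
    torch.utils.data.IterableDataset; it holds an iterable of samples
    (and optionally labels) and yields processed items lazily.
    """

    def __init__(self, samples, labels=None):
        self._samples = samples
        self._labels = labels

    @abstractmethod
    def _process(self, sample):
        """Transform one raw sample (e.g. tokenize one MIDI file)."""

    def __iter__(self) -> Iterator:
        # Lazily process samples so nothing is materialized on disk or in RAM.
        if self._labels is None:
            for sample in self._samples:
                yield self._process(sample)
        else:
            for sample, label in zip(self._samples, self._labels):
                yield self._process(sample), label


class IterableDatasetDemo(_StreamingDatasetABC):
    """Hypothetical concrete subclass; uppercasing stands in for tokenization."""

    def _process(self, sample):
        return sample.upper()


dataset = IterableDatasetDemo(iter(["a", "b"]))
print(list(dataset))  # ['A', 'B']
```

`IterableDatasetMIDI` and `IterableDatasetJSON` would then each implement only the format-specific processing step, while the iteration and label handling live in the shared base class.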
_StreamingDatasetABCto standardize with the existing abstract class in repo). This class inherits fromtorch.utils.data.IterableDatasetandABCand it will holds samples (and optionally labels) and implements the basic magic methods;IterableDatasetMIDIandIterableDatasetJSONso they can handle the logic that each one is required to perform;