Skip to content

Datasets core#900

Open
ThomSerg wants to merge 196 commits into
masterfrom
dataset_core
Open

Datasets core#900
ThomSerg wants to merge 196 commits into
masterfrom
dataset_core

Conversation

@ThomSerg

@ThomSerg ThomSerg commented Apr 1, 2026

Copy link
Copy Markdown
Collaborator

A first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from branch benchmark_datasets into master. Tried to keep it as minimal as possible, also with just a single dataset implementation; XCSP3Dataset. In some places you will notice some placeholders / things that don't seem needed right now but that future PRs will use to build upon. Some of these are labelled with "TODO".

@ThomSerg ThomSerg requested a review from OrestisLomis April 17, 2026 08:51

@OrestisLomis OrestisLomis left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is alright as is now. See comment for more detailed discussion.

Comment thread cpmpy/tools/datasets/core.py Outdated
}


class IndexedDataset(Dataset):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions he doesn't need is maybe not the cleanest design, however I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over, the indexing is perhaps a little trickier and maybe not so clean, but still possible by naively just starting from the startpoint every time you want to index an instance. This can be improved with hashing/caching perhaps, but up to the implementor.

@tias tias left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!

also nicely extensive test-suite.

Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming so responses very welcome

Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
bytes_num /= 1024.0


class classproperty:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems like sugar coating... do we care?

things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(read other comments lower first)

@ThomSerg ThomSerg Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning for this was to be able to define "abstract" on class fields, so that when a user is developing a new dataset they get immediate feedback when they forgot to overwrite one of these fields (at import time instead of construction time). With that decorator it acts just like an abstractmethod. But I do also fear that this will cause issues with mypyc. An alternative would be to just use normal fields with placeholder values that should be replaced (or leave empty) and then use the __init_subclass__ method to check, before construction, that the user did so.

class Dataset(ABC):
    name: ClassVar[str]
    description: ClassVar[str]
    homepage: ClassVar[str]
    citation: ClassVar[List[str]] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls is Dataset:
            return
        for attr in ("name", "description", "homepage"):
            if attr not in cls.__dict__:
                raise TypeError(f"{cls.__name__} must define class attribute {attr!r}")

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So +- what you also propose later on. And thinking of it, no need to use the special __init__subclass__, could indeed just check in the regular constructor.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I do like __init_subclass__, allows for definition-time exceptions instead of construction-time, just like the previous classproperty.

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
fp = futures[future]
print(f"Error collecting metadata for {fp.name}: {e}")

def _collect_one_metadata(self, file_path):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

euh... we have 'instance_metadata()', why do we need this? it looks duplicate... (its also untyped)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned it up a lot to remove all this duplicate code. Have also removed parallel metadata extraction (it was not really used and added a lot of extra code complexity). Might revisit if the future, but fine for now.

Comment thread cpmpy/tools/datasets/core.py
Comment thread cpmpy/tools/datasets/core.py Outdated
Comment thread cpmpy/tools/datasets/core.py Outdated
@tias tias added this to the v0.20 milestone Jun 8, 2026
@ThomSerg

Copy link
Copy Markdown
Collaborator Author

Based on the feedback, I simplified the dataset classes. Also added it to the CPMpy API documentation.

I do feel that we should also add a "Datasets and Benchmarking" page in the advanced section of the docs, giving some examples on how to use the datasets in practice, loading them into CPMpy models and solving, translating to a different format, etc. But that's for a separate PR.

@ThomSerg ThomSerg requested a review from tias June 15, 2026 14:33

@tias tias left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, great!

  • docs: I agree. I do see good and extensive docs in datasets/core.py; I would not recommend writing a lot of new documentation that is already present in the files; rather I would suggest that you add a short motivating paragraph in the user docs and then point to these module docs?

  • I see XCSP3 dataset, but I don't see a change to the current tools/xcsp3; I would expect that the tool switches to using this dataset reader?
    or is that not unlinkable from the IO PR (in which case we can just continue merging this now)

@ThomSerg

Copy link
Copy Markdown
Collaborator Author

Resolved the two comments

  • Added minimal user docs with reference to extensive api docs
  • old xcsp3 tools now make use of new dataset

@tias tias left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a readme update and 2 'raise' fixes/changes in benchmark.py of xcsp3

Can you check them? If good, then please do merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants