Datasets core by ThomSerg · Pull Request #900 · CPMpy/cpmpy

ThomSerg · 2026-04-01T14:00:58Z

A first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from the branch benchmark_datasets into master. I tried to keep it as minimal as possible, with just a single dataset implementation: XCSP3Dataset. In some places you will notice placeholders or things that don't seem needed right now, but that future PRs will build upon. Some of these are labelled with "TODO".

OrestisLomis

I think it is alright as is now. See comment for more detailed discussion.

OrestisLomis · 2026-04-17T14:04:49Z

+        }
+
+
+class IndexedDataset(Dataset):


I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions they don't need is maybe not the cleanest design; however, I would argue it is also not necessary. Both cases you mentioned can naturally be iterated over. Indexing is a little trickier and perhaps not as clean, but still possible by naively starting from the beginning every time you want to index an instance. This can perhaps be improved with hashing/caching, but that is up to the implementor.
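For illustration, the naive indexing described in this comment could look roughly like the sketch below. `StreamDataset` and `_read_instances` are hypothetical names that do not come from the PR, and the dictionary cache is just one possible way to avoid re-scanning.

```python
from itertools import islice
from typing import Dict, Iterator


class StreamDataset:
    """Hypothetical dataset whose instances can only be read sequentially."""

    def __init__(self, n: int) -> None:
        self._n = n
        self._cache: Dict[int, str] = {}  # optional: remember instances already fetched

    def _read_instances(self) -> Iterator[str]:
        # Stand-in for streaming instances from an archive or a generator.
        for i in range(self._n):
            yield f"instance_{i}"

    def __iter__(self) -> Iterator[str]:
        return self._read_instances()

    def __len__(self) -> int:
        return self._n

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self._n:
            raise IndexError(index)
        if index not in self._cache:
            # Naive indexing: restart from the beginning and skip `index` items.
            self._cache[index] = next(islice(iter(self), index, None))
        return self._cache[index]
```

Each uncached lookup costs time linear in the index, which is the trade-off the comment leaves to the implementor.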

tias

great!

also nicely extensive test-suite.

Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming so responses very welcome

tias · 2026-04-21T07:39:04Z

+        bytes_num /= 1024.0
+
+
+class classproperty:


that seems like sugar coating... do we care?

things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...

(read other comments lower first)

My reasoning for this was to be able to define "abstract" on class fields, so that a user developing a new dataset gets immediate feedback when they forget to override one of these fields (at import time instead of construction time). With that decorator it acts just like an abstractmethod. But I do also fear that this will cause issues with mypyc. An alternative would be to use normal fields with placeholder values that should be replaced (or left empty), and then use the __init_subclass__ method to check, before construction, that the user did so.

    class Dataset(ABC):
        name: ClassVar[str]
        description: ClassVar[str]
        homepage: ClassVar[str]
        citation: ClassVar[List[str]] = []

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            if cls is Dataset:
                return
            for attr in ("name", "description", "homepage"):
                if attr not in cls.__dict__:
                    raise TypeError(f"{cls.__name__} must define class attribute {attr!r}")

So more or less what you also propose later on. And thinking about it, there is no need to use the special __init_subclass__; we could indeed just check in the regular constructor.

Actually, I do like __init_subclass__: it allows for definition-time exceptions instead of construction-time ones, just like the previous classproperty.
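To make the definition-time behaviour concrete, here is a small self-contained sketch based on the snippet in this comment. `GoodDataset`, `IncompleteDataset` and `define_incomplete_dataset` are hypothetical names used only for this example and are not part of the PR.

```python
from abc import ABC
from typing import ClassVar, List


class Dataset(ABC):
    name: ClassVar[str]
    description: ClassVar[str]
    homepage: ClassVar[str]
    citation: ClassVar[List[str]] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs when a subclass is *defined*, before any instance exists.
        for attr in ("name", "description", "homepage"):
            if attr not in cls.__dict__:
                raise TypeError(f"{cls.__name__} must define class attribute {attr!r}")


class GoodDataset(Dataset):  # hypothetical: defines every required attribute
    name = "good"
    description = "A complete example dataset."
    homepage = "https://example.org"


def define_incomplete_dataset():
    # The class statement itself raises; no constructor call is needed.
    class IncompleteDataset(Dataset):
        name = "incomplete"

    return IncompleteDataset


try:
    define_incomplete_dataset()
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

One consequence of checking `cls.__dict__` is that a subclass of `GoodDataset` that merely inherits the attributes would also be rejected. Whether that strictness is desired is a design choice; a `hasattr`-based check would be a more permissive alternative.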

tias · 2026-04-21T07:52:06Z

+                        fp = futures[future]
+                        print(f"Error collecting metadata for {fp.name}: {e}")
+
+    def _collect_one_metadata(self, file_path):


euh... we have 'instance_metadata()', why do we need this? It looks like a duplicate... (it's also untyped)

Cleaned it up a lot to remove all this duplicate code. I have also removed the parallel metadata extraction (it was not really used and added a lot of extra code complexity). Might revisit in the future, but fine for now.
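As a rough sketch of what sequential collection without silently swallowed errors could look like (not the code from the PR): `collect_metadata` is a hypothetical helper, and passing the per-file extractor in as a callable is an assumption made to keep the example self-contained.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def collect_metadata(
    files: Iterable[Path],
    instance_metadata: Callable[[Path], dict],  # assumed per-file extractor
) -> Dict[str, dict]:
    """Collect metadata one file at a time; failures propagate to the caller."""
    collected: Dict[str, dict] = {}
    for file_path in files:
        try:
            collected[file_path.name] = instance_metadata(file_path)
        except Exception as exc:
            # Re-raise with the file name attached instead of printing and continuing.
            raise RuntimeError(f"Error collecting metadata for {file_path.name}") from exc
    return collected
```

Compared with printing the error inside a thread-pool loop, a failure here stops the run and keeps the original exception available as `__cause__`.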

ThomSerg · 2026-06-15T14:09:09Z

Based on the feedback, I simplified the dataset classes. Also added it to the CPMpy API documentation.

I do feel that we should also add a "Datasets and Benchmarking" page in the advanced section of the docs, giving some examples on how to use the datasets in practice, loading them into CPMpy models and solving, translating to a different format, etc. But that's for a separate PR.

tias

Yes, great!

docs: I agree. I do see good and extensive docs in datasets/core.py; I would not recommend writing a lot of new documentation that is already present in the files; rather I would suggest that you add a short motivating paragraph in the user docs and then point to these module docs?
I see XCSP3 dataset, but I don't see a change to the current tools/xcsp3; I would expect that the tool switches to using this dataset reader?
Or can that not be separated from the IO PR (in which case we can just go ahead and merge this now)?

ThomSerg · 2026-06-22T15:55:34Z

Resolved the two comments

Added minimal user docs with reference to extensive api docs
old xcsp3 tools now make use of new dataset

tias

I made a readme update and 2 'raise' fixes/changes in benchmark.py of xcsp3

Can you check them? If good, then please do merge!

ThomSerg added 30 commits September 11, 2025 17:49

WCNF parser

feead09

Small docstring change

5ade48e

OPB parser

7f52f5f

Move parser out of init and add cli

548de8e

Add MSE and OPB datasets

4505025

Rename datasets to dataset

2b26034

Dataset specific 'open'

e238c29

Dataset module init file

669875a

Add benchmark runners

c1bd2fe

Formatting

83454e0

XCSP3 as dataset and benchmark

7f2d363

Parsers with changeable 'open'

9173c9f

Type-hints and docstrings

52b95de

Add TODOs

bf5ecd2

Missing helper functions

5dc3886

Print stacktrace of process

7209c62

Fix arguments

f66c8c5

Fix overwritten open

6ab8b32

Read as string instead of StringIO

34c8a9e

Read as text instead of binary

fd55b3a

Sigterm callbacks

2be9fa6

Attempt at fixing some nested memory exceptions

2e64623

Overwritable exit status

5b92680

Validate dataset arguments

8fff254

Check non-empty dataset

2b4a8f0

Add feedback finished downloading

b68144d

Small fixes

b08df43

Fix intermediate solutions and time tracking

431b065

Increase intermediate solution time resolution

7d98c35

Missing default return argument

4664051

ThomSerg added 4 commits April 17, 2026 09:50

Docstring changes

6e0aaa1

More robust instance name extraction

62b985c

Remove IndexedDataset

d01a076

Update docstring

049549c

ThomSerg requested a review from OrestisLomis April 17, 2026 08:51

OrestisLomis approved these changes Apr 20, 2026


tias reviewed Apr 21, 2026


tweaks

668e7f5

tias added this to the v0.20 milestone Jun 8, 2026

ThomSerg added 9 commits June 12, 2026 09:39

Merge branch 'master' into dataset_core

f1c71e7

Remove TODOs, for later PR

f75b4a6

Remove custom download pretty printer

6d46567

Convert raise to pass

1d4673e

Simplify

5c0510f

Remove parallel metadata extraction, merge helper functions, remove silent collection errors

Update file docstring

041b0c9

Move classproperty to definition-time check

f003a22

via __init_subclass__

Add typing

8dbc763

API documentation datasets

5dd6176

ThomSerg requested a review from tias June 15, 2026 14:33

tias approved these changes Jun 19, 2026


ThomSerg added 5 commits June 22, 2026 10:54

Convert xcsp3 tools to new dataset

3e1a463

Docs structure

e6d02a6

Docs slight reordering

a4c0f6f

include from_files in docs

3d4375c

add datasets to user docs

bde2506

tias added 2 commits June 22, 2026 22:50

update xcsp3 README

dd25dd5

xcsp3 benchmark, 2 strange lines

80bdba5

tias approved these changes Jun 22, 2026
