Datasets core#900
Conversation
OrestisLomis
left a comment
There was a problem hiding this comment.
I think it is alright as is now. See comment for more detailed discussion.
| } | ||
|
|
||
|
|
||
| class IndexedDataset(Dataset): |
There was a problem hiding this comment.
I do agree that forcing someone who wants to implement a Dataset to throw errors for abstract functions he doesn't need is maybe not the cleanest design, however I would argue it is also not necessary. Both the cases you mentioned can naturally be iterated over, the indexing is perhaps a little trickier and maybe not so clean, but still possible by naively just starting from the startpoint every time you want to index an instance. This can be improved with hashing/caching perhaps, but up to the implementor.
tias
left a comment
There was a problem hiding this comment.
great!
also nicely extensive test-suite.
Some questions/remarks where I think it can be a bit cleaner/simpler, but you know better what is still coming so responses very welcome
| bytes_num /= 1024.0 | ||
|
|
||
|
|
||
| class classproperty: |
There was a problem hiding this comment.
that seems like sugar coating... do we care?
things like this can stop us from using mypyc in the future, while file parsing/loading is something where C code can really shine...
There was a problem hiding this comment.
(read other comments lower first)
There was a problem hiding this comment.
My reasoning for this was to be able to define "abstract" on class fields, so that when a user is developing a new dataset they get immediate feedback when they forgot to overwrite one of these fields (at import time instead of construction time). With that decorator it acts just like an abstractmethod. But I do also fear that this will cause issues with mypyc. An alternative would be to just use normal fields with placeholder values that should be replaced (or leave empty) and then use the __init_subclass__ method to check, before construction, that the user did so.
class Dataset(ABC):
name: ClassVar[str]
description: ClassVar[str]
homepage: ClassVar[str]
citation: ClassVar[List[str]] = []
def __init_subclass__(cls, **kwargs):
super().__init_subclass__(**kwargs)
if cls is Dataset:
return
for attr in ("name", "description", "homepage"):
if attr not in cls.__dict__:
raise TypeError(f"{cls.__name__} must define class attribute {attr!r}")There was a problem hiding this comment.
So +- what you also propose later on. And thinking of it, no need to use the special __init__subclass__, could indeed just check in the regular constructor.
There was a problem hiding this comment.
Actually I do like __init_subclass__, allows for definition-time exceptions instead of construction-time, just like the previous classproperty.
| fp = futures[future] | ||
| print(f"Error collecting metadata for {fp.name}: {e}") | ||
|
|
||
| def _collect_one_metadata(self, file_path): |
There was a problem hiding this comment.
euh... we have 'instance_metadata()', why do we need this? it looks duplicate... (its also untyped)
There was a problem hiding this comment.
Cleaned it up a lot to remove all this duplicate code. Have also removed parallel metadata extraction (it was not really used and added a lot of extra code complexity). Might revisit if the future, but fine for now.
via __init_subclass__
|
Based on the feedback, I simplified the dataset classes. Also added it to the CPMpy API documentation. I do feel that we should also add a "Datasets and Benchmarking" page in the advanced section of the docs, giving some examples on how to use the datasets in practice, loading them into CPMpy models and solving, translating to a different format, etc. But that's for a separate PR. |
tias
left a comment
There was a problem hiding this comment.
Yes, great!
-
docs: I agree. I do see good and extensive docs in datasets/core.py; I would not recommend writing a lot of new documentation that is already present in the files; rather I would suggest that you add a short motivating paragraph in the user docs and then point to these module docs?
-
I see XCSP3 dataset, but I don't see a change to the current tools/xcsp3; I would expect that the tool switches to using this dataset reader?
or is that not unlinkable from the IO PR (in which case we can just continue merging this now)
|
Resolved the two comments
|
tias
left a comment
There was a problem hiding this comment.
I made a readme update and 2 'raise' fixes/changes in benchmark.py of xcsp3
Can you check them? If good, then please do merge!
A first PR in a larger sequence of upcoming ones, bringing the work on datasets, IO and benchmarks from branch
benchmark_datasetsinto master. Tried to keep it as minimal as possible, also with just a single dataset implementation; XCSP3Dataset. In some places you will notice some placeholders / things that don't seem needed right now but that future PRs will use to build upon. Some of these are labelled with "TODO".