feat: Dataset#category #406
Conversation
Could you update the config files so we can see what the use of a category label would look like? Also, is a user now required to add a dataset category to each dataset they put in a config?
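For illustration, a labeled dataset entry might look something like the sketch below; the `category` key is what this PR proposes, the other field names are assumed from the existing dataset schema, and none of this is settled syntax.

```yaml
datasets:
  - label: egfr
    category: signaling        # hypothetical per-dataset label from this PR
    node_files: [node-prizes.txt, sources.txt, targets.txt]
    edge_files: [network.txt]
    other_files: []
    data_dir: input
```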
The
I'll avoid this in the main
I'm missing what we want to support with this category attribute. Do we want to list many datasets in a single config file and then somewhere else say to only run on datasets where category=X? That was part of #309, but I'm not seeing the advantage of doing all of that within a single config file.
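For concreteness, the filtering being described might look something like this within a single config; the `run_on_categories` key is invented for this sketch and does not exist in the current schema.

```yaml
algorithms:
  - name: pathlinker
    params:
      include: true
    # Invented key: only run this algorithm on datasets whose
    # category matches one of these values.
    run_on_categories: [signaling]
```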
This would be useful for cross-dataset-category analysis (i.e., unified statistics that check how well algorithms do on certain dataset categories compared to others). It is no longer useful for parameter tuning, though.
Would this PR's goal also be to compute summary statistics and ML outputs across all datasets and across algorithms? I'm not fully sure what the intended use case is for cross-dataset-category comparisons, other than it maybe being useful for the benchmarking study. Also, I think there may be a simpler way to implement this: instead of specifying dataset categories per algorithm, could we add an option in the ML step to automatically run over all datasets listed in the config and all the enabled algorithms by default? That would avoid needing extra per-algorithm category settings.
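The alternative being suggested might look something like this in the config; the `run_over_all_datasets` key is invented purely for illustration.

```yaml
analysis:
  ml:
    include: true
    # Invented key: run ML over every dataset in the config and every
    # enabled algorithm, with no per-algorithm category settings.
    run_over_all_datasets: true
```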
This PR's goal is motivated by both the benchmarking study and the CI for spras-benchmarking. We talked about this PR some time ago and decided that it was not needed, since we can accomplish the same goal with two configs, though this ends up being quite annoying in practice (the spras-benchmarking CI currently does this). I'm confused by your ML suggestion: we would still need a way to filter datasets so that certain algorithms do not run on them.
This is a short PR that just adds the Dataset#category parameter and parses and stores it without using it, as I want to make sure that we agree on `category` over `categories`: the latter, while more general, seems biologically useless.
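As a rough sketch of the parse-and-store behavior described here (the constructor shape and dict access are assumptions for illustration, not the actual diff):

```python
class Dataset:
    """Sketch only; the real class parses many more fields."""

    def __init__(self, dataset_dict: dict):
        self.label = dataset_dict["label"]
        # New in this PR: parse and store the optional category
        # without using it anywhere downstream yet.
        self.category = dataset_dict.get("category")
```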