feat: Dataset#category #406
Conversation
Could you update the config files so we can see what the use of a category label would look like? Also, is a user now required to add a dataset category to each dataset they put in a config?
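For illustration, a labeled dataset entry might look something like the sketch below; the `category` key is what this PR proposes, the other field names are assumed from the existing dataset schema, and none of this is settled syntax.

```yaml
datasets:
  - label: egfr
    category: signaling        # hypothetical per-dataset label from this PR
    node_files: [node-prizes.txt, sources.txt, targets.txt]
    edge_files: [network.txt]
    other_files: []
    data_dir: input
```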
The
I'll avoid this in the main
I'm missing what we want to support with this category attribute. Do we want to list many datasets in a single config file and then somewhere else say to only run on datasets where category=X? That was part of #309, but I'm not seeing the advantage of doing all of that within a single config file.
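For concreteness, the filtering being described might look something like this within a single config; the `run_on_categories` key is invented for this sketch and does not exist in the current schema.

```yaml
algorithms:
  - name: pathlinker
    params:
      include: true
    # Invented key: only run this algorithm on datasets whose
    # category matches one of these values.
    run_on_categories: [signaling]
```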
This would be useful for cross-dataset-category analysis (i.e., unified statistics that check how well algorithms do on certain dataset categories compared to others). It is no longer useful for parameter tuning, though.
Would this PR's goal also be to compute summary statistics and ML outputs across all datasets and across algorithms? I'm not fully sure what the intended use case is for cross-dataset-category comparisons, other than it maybe being useful for the benchmarking study. Also, I think there may be a simpler way to implement this: instead of specifying dataset categories per algorithm, could we add an option in the ML step to automatically run over all datasets listed in the config and all the enabled algorithms by default? That would avoid needing extra per-algorithm category settings.
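The alternative being suggested might look something like this in the config; the `run_over_all_datasets` key is invented purely for illustration.

```yaml
analysis:
  ml:
    include: true
    # Invented key: run ML over every dataset in the config and every
    # enabled algorithm, with no per-algorithm category settings.
    run_over_all_datasets: true
```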
This PR's goal is motivated by both the benchmarking study and the CI for spras-benchmarking. We talked about this PR some time ago and decided that it was not needed, since we can accomplish the same goal with two configs, though this ends up being quite annoying in practice (the spras-benchmarking CI currently does this). I'm confused by your ML suggestion: we would still need a way to filter datasets so that certain algorithms do not run on them.
This is a short PR that just adds the Dataset#category parameter and parses and stores it without using it, as I want to make sure that we agree on `category` over `categories`: the latter, while more general, seems biologically useless.
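As a rough sketch of the parse-and-store behavior described here (the constructor shape and dict access are assumptions for illustration, not the actual diff):

```python
class Dataset:
    """Sketch only; the real class parses many more fields."""

    def __init__(self, dataset_dict: dict):
        self.label = dataset_dict["label"]
        # New in this PR: parse and store the optional category
        # without using it anywhere downstream yet.
        self.category = dataset_dict.get("category")
```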