Refactor code into every_eval_ever namespace and add modal CLI by Erotemic · Pull Request #57 · evaleval/every_eval_ever

Erotemic · 2026-03-02T22:22:00Z

Here's an initial pass at a refactor to make this module easier to deploy to pypi and use as a standalone tool.

All library code and json schemas are now under the every_eval_ever namespace.
I did keep symlinks to the json schemas as I know external links reference them. They can be removed in the future.
Added a CLI entrypoint in pyproject.toml so that every_eval_ever becomes a command line tool.
Added modal CLIs using basic argparse for conversions and schema validation.
Made specific backend dependencies optional (the converter CLI will give you a helpful message if you don't have them installed and try to use them)
The validate and duplicate checks are moved out of utils and into the main library, but I kept the one-off scripts in utils. It would probably be best to rename that directory to "scripts", "dev", or "maintain", because right now it looks like something that belongs in the library even though it is just repo maintenance.
Removed a lot of unnecessary boilerplate from tests now that the library can just be imported by name.
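
For readers unfamiliar with the pattern, a modal CLI built on plain argparse can be wired up roughly as follows. This is a minimal sketch, not the code in this PR: `build_parser`, the two handler functions, and their placeholder bodies are assumptions made for illustration. Only the `validate` and `convert` subcommand names and the `main(argv) -> int` convention come from this thread.

```python
import argparse


def _cmd_validate(args: argparse.Namespace) -> int:
    # Placeholder: a real handler would validate args.path against the schemas.
    print(f"would validate {args.path}")
    return 0


def _cmd_convert(args: argparse.Namespace) -> int:
    # Placeholder: a real handler would dispatch to the chosen converter.
    print(f"would convert {args.log_path} using the {args.source} converter")
    return 0


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="every_eval_ever")
    # Each subcommand ("mode") gets its own sub-parser and handler function.
    sub = parser.add_subparsers(dest="command", required=True)

    p_validate = sub.add_parser("validate", help="validate a file or directory")
    p_validate.add_argument("path")
    p_validate.set_defaults(func=_cmd_validate)

    p_convert = sub.add_parser("convert", help="convert logs from another tool")
    p_convert.add_argument("source", choices=["inspect", "helm", "lm_eval"])
    p_convert.add_argument("--log_path", required=True)
    p_convert.set_defaults(func=_cmd_convert)
    return parser


def main(argv=None) -> int:
    # Returning an exit code instead of calling sys.exit() keeps main() testable.
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Calling `main(["validate", "some/path"])` from a test exercises the same code path as the console script, which is one reason to accept `argv` as a parameter.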

I didn't do github actions stuff, but you can at least `uv pip install -e .` the repo now and get a nice CLI.

On slack there was mention of github actions existing, but I didn't see a .github directory, and the lack of CI tests for the PR makes me nervous. Are there CI tests somewhere? I do see that actions are running in the actions tab, but I've never set them up without a .github folder.

Note: the script in utils/helm/parse_helm_leaderboards.sh seems to have an error which I don't think is introduced by this PR. I've left it alone for now, but in general uv run commands would need to be updated with uv run --extra <extra> for whichever extra they needed.
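
The optional-backend behaviour mentioned in the description can be approximated with a guarded import. This is a minimal sketch under stated assumptions: `require_backend` is a hypothetical helper, and the converters in this PR may gate their imports differently.

```python
import importlib


def require_backend(module_name: str, extra: str):
    """Import an optional backend, or fail with an actionable install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Point the user at the extra that provides the missing dependency.
        raise ImportError(
            f"The optional dependency '{module_name}' is not installed. "
            f"Re-run with: uv run --extra {extra} ..."
        ) from exc
```

A converter would call this helper at the point where the backend is first needed, so that importing the package itself never requires the extra.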

Erotemic · 2026-03-02T23:03:34Z

Also note the github diff reports that the schema files were deleted and re-added. I think this is a consequence of adding the symlinks to keep existing links alive as I did git mv them. Regardless, you can validate that there is no difference in those files with:

```bash
# show original hashes from the main branch
sha256sum <(git show main:instance_level_eval.schema.json) <(git show main:eval.schema.json)
# show updated hashes on this branch
sha256sum every_eval_ever/schemas/instance_level_eval.schema.json every_eval_ever/schemas/eval.schema.json
```

Which on my machine results in:

```
ab656c857373393c76eade428fa4b7b346438e90aaedafd6add9f1b0b6954639  /dev/fd/63
a8c534a87358c9f38e5a059c921f0c4ad51a6653f90c349e84e82c6a5bbc9719  /dev/fd/62
ab656c857373393c76eade428fa4b7b346438e90aaedafd6add9f1b0b6954639  every_eval_ever/schemas/instance_level_eval.schema.json
a8c534a87358c9f38e5a059c921f0c4ad51a6653f90c349e84e82c6a5bbc9719  every_eval_ever/schemas/eval.schema.json
```
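
The same comparison can be scripted where bash process substitution is unavailable. This is a minimal sketch using only the standard library; the helper names `sha256_of` and `same_content` are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def same_content(path_a, path_b) -> bool:
    """True if both files hash to the same SHA-256 digest."""
    digest_a = sha256_of(Path(path_a).read_bytes())
    digest_b = sha256_of(Path(path_b).read_bytes())
    return digest_a == digest_b
```

To compare against the main branch rather than a file on disk, the bytes printed by `git show main:eval.schema.json` could be captured with `subprocess.run(..., capture_output=True, check=True).stdout` and passed to `sha256_of`.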

damian1996 · 2026-03-04T16:30:34Z

every_eval_ever/converters/README.md


```diff
 ```bash
-uv run python3 -m eval_converters.inspect --log_path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
+uv run --extra inspect every_eval_ever convert inspect eee --log-path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
```


The full commands (in the examples below) are outdated after these changes. We can probably remove them.

Also, I’d like to simplify this command. What is the purpose of the word eee? I suspect every-eval-ever convert inspect would be clearer.

Erotemic · 2026-03-07

This is because there is a destination positional argument that indicates which format to convert to. It has a default, so it's not required. It would probably be cleaner to simply remove it, as there is only one destination format. I can do that.

Also as a note to myself, I see log_path got changed to log-path, and I should undo that.

As for the full commands that follow, I will clean those up.
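
As a sketch of the two CLI points discussed here: argparse can give a positional argument a default via `nargs="?"`, and a single option can accept both the `--log_path` and `--log-path` spellings. The parser below is illustrative only and is not the parser in this PR; `build_inspect_parser` and the positional name `destination` are hypothetical, while `eee` and the two flag spellings come from this thread.

```python
import argparse


def build_inspect_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="every_eval_ever convert inspect")
    # nargs="?" makes the positional optional, so the default applies when omitted.
    parser.add_argument("destination", nargs="?", default="eee", choices=["eee"])
    # Both spellings are accepted and stored on the same attribute, args.log_path.
    parser.add_argument("--log_path", "--log-path", dest="log_path", required=True)
    return parser
```

Accepting both spellings is one way to keep existing `--log_path` invocations working regardless of which form the documentation settles on.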

Also, if you are open to adding a dependency on my scriptconfig CLI library, we can organize the argument parsers in a much more declarative way that keeps each entrypoint much closer to the code it belongs to. The argparse code works, but I always think it looks ugly. With scriptconfig you also get several quality-of-life improvements, including opt-in autocomplete for free as well as nice colored CLIs based on its tie-in with rich-argparse and argcomplete. I'd recommend landing a baseline pure-stdlib approach first, but if you are interested I could make a follow-up PR.

damian1996 · 2026-03-18

A follow-up PR that adds the scriptconfig dependency would probably be the best option; let's keep things as simple as possible in this PR.

damian1996 · 2026-03-04T17:00:42Z

README.md

```diff
+Validation command:
+
+```bash
+every_eval_ever validate <path-or-dir>
```


We do not run the validation and duplicate checks on GitHub. They only apply on HF when new data is uploaded, so we don't need them here. The convert option is enough.

Erotemic · 2026-03-07

We probably should though, at least in a CI test suite, to cover more of the code with tests. We should also show how to do it in the README.

damian1996 · 2026-03-09

It's harder to check for duplicates now because the data isn't stored on GitHub anymore (and the check still runs when data is uploaded to HF, so I don't see the sense in doing it here). Validation of the data will be rewritten slightly (full validation via pydantic), so it isn't needed here, in my opinion.

Erotemic · 2026-03-16T17:35:35Z

@damian1996 I've resolved merge conflicts, cleaned up the diff, and addressed prior discussion points.

The schemas on main changed, so I reverified that the merge updated them correctly:

```bash
# show original hashes from the main branch
sha256sum <(git show main:instance_level_eval.schema.json) <(git show main:eval.schema.json)
# show updated hashes on this branch
sha256sum every_eval_ever/schemas/instance_level_eval.schema.json every_eval_ever/schemas/eval.schema.json
```

```
9aed691d0395e1ba35bfdd326ce693bd45a3a30c513bff0b3904e301e7db600f  /dev/fd/63
9a6154f41116cdf56d0b7968a24aa93900a9e6d4a9c799827a42ec079e3539c5  /dev/fd/62
9aed691d0395e1ba35bfdd326ce693bd45a3a30c513bff0b3904e301e7db600f  every_eval_ever/schemas/instance_level_eval.schema.json
9a6154f41116cdf56d0b7968a24aa93900a9e6d4a9c799827a42ec079e3539c5  every_eval_ever/schemas/eval.schema.json
```

Note: resolving the merge conflicts got a little messy, so I would recommend squashing this PR if it is accepted instead of creating a merge commit. That will keep the history of the repo cleaner. If desired, I could do a rebase here, but I'm punting on that unless you want it.

nelaturuharsha · 2026-03-18T10:50:46Z

Hi @Erotemic, the validation workflow is now going to go through pydantic. That and other changes will land once #69 is merged, and there are some dependency clean-ups in #71.

Could you check if the incoming changes affect your PR? I'd assume the validation changing might but I've not had a chance to look through yours. Will prioritize this PR once I am done with changes to the validation pipeline (I am working on it assuming this PR will be merged - but don't want to rush this one).

Erotemic · 2026-03-18T17:04:17Z

@nelaturuharsha

#71 did cause conflicts, but I merged in the latest main and addressed them.

For #69 there will be merge conflicts, but the core intent of that PR and this one are different and compatible. If you want to merge that one first, I can update my PR accordingly. Otherwise, if this is merged first, then the changes in #71 basically need to be moved into the correct new locations and integrated with the CLI.

damian1996 · 2026-03-18T17:22:24Z

It looks good to me but maybe wait until @nelaturuharsha will merge his PR and then let him make a pass through it as well.

nelaturuharsha · 2026-03-18T19:14:12Z

@Erotemic Thank you, #69 is now in, so if you could bring those changes into your PR it'd be awesome. We should also think about a GitHub action that publishes tagged releases to PyPI. I've already planned for the validator to install the CLI in the future.

Erotemic · 2026-03-20T00:50:49Z

@damian1996 ready for re-review

nelaturuharsha · 2026-03-20T15:47:59Z

@evijit can you trigger a Copilot review? Thanks!

Copilot

Pull request overview

Refactors the repository into an installable every_eval_ever Python package (suitable for PyPI), bundles schemas under the package namespace, and adds a top-level CLI (plus converter CLIs) alongside CI workflows that validate data via the package entrypoint.

Changes:

Moved library code + generated Pydantic models + JSON schemas under every_eval_ever/ and updated imports across utils/tests.
Added a unified CLI (every_eval_ever console script) with validate, check-duplicates, and convert subcommands, plus converter-specific CLIs.
Added GitHub Actions workflows for validating data and regenerating types based on package-local schemas.
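
Bundling the schemas inside the package means they can be located through `importlib.resources` rather than through repository-relative paths. This is a minimal sketch of that idea, assuming a hypothetical `load_bundled_schema` helper; the interface actually exposed by `every_eval_ever/schema.py` may differ.

```python
import json
from importlib import resources


def load_bundled_schema(package: str, filename: str) -> dict:
    """Load a JSON schema shipped as package data.

    resources.files() returns a Traversable, so this does not assume that
    the package is installed as plain files on disk.
    """
    resource = resources.files(package).joinpath(filename)
    return json.loads(resource.read_text(encoding="utf-8"))
```

With the layout described in this review, a call might look like `load_bundled_schema("every_eval_ever.schemas", "eval.schema.json")`.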

Reviewed changes

Copilot reviewed 45 out of 59 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `utils/rewardbench/adapter.py` | Switch imports to `every_eval_ever.*` and remove `sys.path` manipulation. |
| `utils/hfopenllm_v2/adapter.py` | Switch imports to `every_eval_ever.*` and remove `sys.path` manipulation. |
| `utils/helm/adapter.py` | Switch imports to `every_eval_ever.*` and remove `sys.path` manipulation. |
| `utils/global-mmlu-lite/adapter.py` | Switch imports to `every_eval_ever.*` and remove `sys.path` manipulation. |
| `tests/test_validate.py` | Update validate imports and fixture schema versions. |
| `tests/test_lm_eval_adapter.py` | Update imports to new package converter paths. |
| `tests/test_inspect_instance_level_adapter.py` | Update imports to new package converter paths. |
| `tests/test_inspect_adapter.py` | Update imports to new package converter paths. |
| `tests/test_helm_instance_level_adapter.py` | Update imports to new package converter paths. |
| `tests/test_helm_adapter.py` | Update imports to new package converter paths. |
| `tests/test_check_duplicate_entries.py` | Test against `every_eval_ever.check_duplicate_entries` and new `main(argv)` return code. |
| `pyproject.toml` | Add build-system + setuptools package discovery/data and CLI entrypoint. |
| `post_codegen.py` | Update paths/usage for package-local schemas and generated types. |
| `eval.schema.json` | Convert root schema file to a symlink target pointing into `every_eval_ever/schemas/`. |
| `instance_level_eval.schema.json` | Convert root schema file to a symlink target pointing into `every_eval_ever/schemas/`. |
| `every_eval_ever/validate.py` | Move validator into package and make `main(argv)->int` for CLI integration. |
| `every_eval_ever/check_duplicate_entries.py` | Move duplicate checker into package and make `main(argv)->int` for CLI integration. |
| `every_eval_ever/schemas/eval.schema.json` | Add bundled aggregate schema under package. |
| `every_eval_ever/schemas/instance_level_eval.schema.json` | Add bundled instance-level schema under package. |
| `every_eval_ever/schemas/__init__.py` | Mark schemas as a package for resource loading. |
| `every_eval_ever/schema.py` | Add helpers for loading bundled schemas via `importlib.resources`. |
| `every_eval_ever/eval_types.py` | Relocate generated aggregate Pydantic types under package. |
| `every_eval_ever/instance_level_types.py` | Relocate generated instance-level Pydantic types under package. |
| `every_eval_ever/helpers/schema.py` | Update imports and extend `make_source_metadata(..., additional_details=...)`. |
| `every_eval_ever/helpers/io.py` | Update imports to package namespace. |
| `every_eval_ever/helpers/fetch.py` | Add HTTP fetching utilities (requests-based). |
| `every_eval_ever/helpers/developer.py` | Add developer/model ID derivation helpers. |
| `every_eval_ever/helpers/__init__.py` | Export helper utilities via `__all__`. |
| `every_eval_ever/converters/__init__.py` | Reintroduce converter `SCHEMA_VERSION` constant under new namespace. |
| `every_eval_ever/converters/README.md` | Update docs/commands to use new package CLI + extras. |
| `every_eval_ever/converters/common/adapter.py` | Update imports to package namespace for shared adapter base. |
| `every_eval_ever/converters/common/error.py` | Add shared adapter exception types. |
| `every_eval_ever/converters/common/utils.py` | Add shared utils (timestamps, HF lookup, hashing). |
| `every_eval_ever/converters/common/__init__.py` | Package marker for `common` converters module. |
| `every_eval_ever/converters/inspect/adapter.py` | Update imports + make inspect dependency optional with runtime gating. |
| `every_eval_ever/converters/inspect/instance_level_adapter.py` | Make inspect dependency optional with runtime gating. |
| `every_eval_ever/converters/inspect/utils.py` | Update imports to new common utils / types. |
| `every_eval_ever/converters/inspect/main.py` | Update entrypoint imports to new namespace. |
| `every_eval_ever/converters/inspect/__init__.py` | Package marker for inspect converter module. |
| `every_eval_ever/converters/helm/adapter.py` | Make helm dependency optional with runtime gating; update imports. |
| `every_eval_ever/converters/helm/instance_level_adapter.py` | Make helm dependency optional with runtime gating; update imports. |
| `every_eval_ever/converters/helm/utils.py` | Make helm dependency optional for type import. |
| `every_eval_ever/converters/helm/main.py` | Update entrypoint imports to new namespace. |
| `every_eval_ever/converters/helm/__init__.py` | Package marker for helm converter module. |
| `every_eval_ever/converters/lm_eval/adapter.py` | Update imports to new namespace. |
| `every_eval_ever/converters/lm_eval/instance_level_adapter.py` | Update imports to new namespace. |
| `every_eval_ever/converters/lm_eval/utils.py` | Add lm-eval helper utilities for parsing args, locating samples, etc. |
| `every_eval_ever/converters/lm_eval/main.py` | Add CLI for lm-eval conversion under package namespace. |
| `every_eval_ever/converters/lm_eval/__init__.py` | Package marker for lm-eval converter module. |
| `every_eval_ever/cli.py` | Add top-level CLI: validate, check-duplicates, convert (lm_eval/inspect/helm). |
| `every_eval_ever/__init__.py` | Define lightweight public API via lazy module loading. |
| `every_eval_ever/__main__.py` | Add `python -m every_eval_ever` entrypoint wrapper. |
| `eval_converters/__init__.py` | Remove old namespace’s `SCHEMA_VERSION` constant. |
| `README.md` | Update documentation to reflect package install, new CLI, and new workflow path. |
| `.github/workflows/validate-data.yml` | Add data validation workflow invoking the package CLI. |
| `.github/workflows/regenerate_types.yml` | Update type regeneration workflow to use package-local schemas/outputs. |
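
The "lazy module loading" noted for the package's top-level init module is commonly implemented with a module-level `__getattr__` (PEP 562), which defers importing submodules until they are first accessed and can help keep import time low. This is a minimal sketch of the pattern, not the code in this PR; `make_lazy_getattr` is a hypothetical helper.

```python
import importlib


def make_lazy_getattr(package_name: str, lazy_submodules: set):
    """Build a module-level __getattr__ that imports submodules on first access."""

    def __getattr__(name: str):
        if name in lazy_submodules:
            # The import happens here, only when the attribute is requested.
            return importlib.import_module(f"{package_name}.{name}")
        raise AttributeError(f"module {package_name!r} has no attribute {name!r}")

    return __getattr__
```

Inside a package's init module this could be used as `__getattr__ = make_lazy_getattr(__name__, {"validate", "cli"})`, where the submodule names are taken from the file list above purely as an example.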

Comments suppressed due to low confidence (2)

every_eval_ever/converters/inspect/adapter.py:106

`AdapterMetadata` requires `supported_library_versions`, but `metadata()` constructs it with only `name`, `version`, and `description`. Accessing `InspectAIAdapter.metadata` will raise `TypeError: __init__() missing 1 required positional argument`. Populate `supported_library_versions` (or make the field optional/defaulted in the dataclass).

every_eval_ever/converters/helm/adapter.py:376

Several `HELMAdapter` method contracts are inconsistent with `BaseEvaluationAdapter`: `_transform_single()` is annotated as returning a tuple but returns only `EvaluationLog`, and `transform_from_directory()` takes an extra `output_path` parameter compared to the abstract base signature. These mismatches make it hard to call the adapter polymorphically and confuse static analysis. Additionally, `metadata()` constructs `AdapterMetadata` without the required `supported_library_versions` field, which will raise `TypeError` if accessed. Consider aligning these signatures/annotations with the base class and populating all required `AdapterMetadata` fields.
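
To illustrate the failure mode described in these comments: a dataclass field without a default becomes a required `__init__` argument, and omitting it raises `TypeError`. The sketch below uses simplified stand-in classes; the class names and field types are assumptions, and the real `AdapterMetadata` may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StrictMetadata:
    # Stand-in for the reported situation: the last field has no default,
    # so constructing the object without it raises TypeError.
    name: str
    version: str
    description: str
    supported_library_versions: List[str]


@dataclass
class DefaultedMetadata:
    # One of the suggested fixes: give the field a default so callers may omit it.
    name: str
    version: str
    description: str
    supported_library_versions: List[str] = field(default_factory=list)
```

Using `default_factory=list` rather than a bare `[]` matters here, because dataclasses reject mutable default values.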

Erotemic added 7 commits March 2, 2026 16:56

Refactor code into every_eval_ever namespace and add modal CLI

ee6bee3

Refactor code in utils into the main library

00a781a

Optimize response time for CLI

df25c85

Forgot to add check duplicates

e956a16

Add CLI descriptions

af24ec0

Fix validation workflow to point at correct schemas (and add instance level schema)

68fb190

remove unnecessary path munging

3a2bb4c

Fix uv invocations in converters README

fd537df

nelaturuharsha requested review from damian1996, janbatzner and nelaturuharsha and removed request for damian1996 and nelaturuharsha March 3, 2026 15:34

damian1996 reviewed Mar 4, 2026

Erotemic added 3 commits March 16, 2026 16:31

Resolve inspect merge conflicts and simplify converter CLI

1c41cb9

Clean up PR diff and move validation workflow

fad5c3b

Sync namespaced schemas with main

a6a1211

Erotemic requested a review from damian1996 March 16, 2026 17:32

merge into main

9781176

Erotemic added 4 commits March 19, 2026 20:37

Reconcile namespaced package with mainline validation

c6d861b

Merge branch 'main' into pypi-refactor

653f688

Remove root validate wrapper

0a32ca6

Reduce diff

a823cda

reduce diff

f350014

nelaturuharsha requested a review from evijit March 20, 2026 15:48

evijit requested a review from Copilot March 20, 2026 15:49

Copilot started reviewing on behalf of evijit March 20, 2026 15:49

Copilot AI reviewed Mar 20, 2026

every_eval_ever/__main__.py

every_eval_ever/schema.py

every_eval_ever/converters/__init__.py

Erotemic and others added 4 commits March 20, 2026 12:06

Update every_eval_ever/converters/__init__.py

f344d8b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update every_eval_ever/__main__.py

a0b7dc4

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Support zipped packages with importlib resources

4ddf725

Fix tests

e631024

nelaturuharsha merged commit 431a956 into evaleval:main Mar 20, 2026
