Skip to content

Refactor code into every_eval_ever namespace and add modal CLI#57

Merged
nelaturuharsha merged 21 commits intoevaleval:mainfrom
Erotemic:pypi-refactor
Mar 20, 2026
Merged

Refactor code into every_eval_ever namespace and add modal CLI#57
nelaturuharsha merged 21 commits intoevaleval:mainfrom
Erotemic:pypi-refactor

Conversation

@Erotemic
Copy link
Copy Markdown
Contributor

@Erotemic Erotemic commented Mar 2, 2026

Here's an initial pass at a refactor to make this module easier to deploy to pypi and use as a standalone tool.

  • All library code and json schemas are now under the every_eval_ever namespace.
  • I did keep symlinks to the json schemas as I know external links reference them. They can be removed in the future.
  • Added an CLI entrypoint in pyproject.toml so that every_eval_ever becomes a command line tool.
  • Added modal CLIs using basic argparse for conversions and schema validation.
  • Made specific backend dependencies optional (the converter CLI will give you a helpful message if you don't have them installed and try to use them)
  • The validate and duplicate checks are moved out of utils and into the main library, but the oneoff scripts I kept in utils. It probably would be best to rename that to "scripts", "dev", or "maintain" because right now it looks like its something that should belong in the library even though it is just repo maintenance.
  • Removed a lot of unnecessary boilerplate from tests now that the library can just be imported by name.

I didn't do github actions stuff, but you can at least uv pip install -e . the repo now and get a nice CLI.

On slack there was mention of github actions existing, but I didn't see a .github directory and lack of CI tests for the PR makes me nervous. Are there CI tests somewhere? I do see that actions are running in the actions tab, but I've never set them up without a .github folder.

Note: the script in utils/helm/parse_helm_leaderboards.sh seems to have an error which I don't think is introduced by this PR. I've left it alone for now, but in general uv run commands would need to be updated with uv run --extra <extra> for whichever extra they needed.

@Erotemic
Copy link
Copy Markdown
Contributor Author

Erotemic commented Mar 2, 2026

Also note the github diff reports that the schema files were deleted and re-added. I think this is a consequence of adding the symlinks to keep existing links alive as I did git mv them. Regardless, you can validate that there is no difference in those files with:

# show original hashes from the main branch
sha256sum <(git show main:instance_level_eval.schema.json) <(git show main:eval.schema.json)
# show updated hashes on this branch 
sha256sum every_eval_ever/schemas/instance_level_eval.schema.json every_eval_ever/schemas/eval.schema.json

Which on my machine results in:

ab656c857373393c76eade428fa4b7b346438e90aaedafd6add9f1b0b6954639  /dev/fd/63
a8c534a87358c9f38e5a059c921f0c4ad51a6653f90c349e84e82c6a5bbc9719  /dev/fd/62
ab656c857373393c76eade428fa4b7b346438e90aaedafd6add9f1b0b6954639  every_eval_ever/schemas/instance_level_eval.schema.json
a8c534a87358c9f38e5a059c921f0c4ad51a6653f90c349e84e82c6a5bbc9719  every_eval_ever/schemas/eval.schema.json

@nelaturuharsha nelaturuharsha requested review from damian1996, janbatzner and nelaturuharsha and removed request for damian1996 and nelaturuharsha March 3, 2026 15:34

```bash
uv run python3 -m eval_converters.inspect --log_path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
uv run --extra inspect every_eval_ever convert inspect eee --log-path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full commands (below examples) are outdated after these changes. We probably may remove them.

Also, I’d like to simplify this command. What is the purpose of the word eee? I suspect every-eval-ever convert inspect would be clearer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because their is a destination positional argument that indicates which format to convert to. It has a default so its not required to include it. It would probably be cleaner to simply remove it as there is only one destination format. I can do that.

Also as a note to myself, I see log_path got changed to log-path, and I should undo that.

WRT to full commands after, I will clean that up.

Also, if you are open to adding a dependency on my scriptconfig CLI library we can organize the argument parsers in a much more declarative way that keeps the entrypoint much closer to the code they belong to. The argparse code works, but I always think it looks ugly. With scriptconfig you also get several quality of life improvements including opt-in autocomplete for free as well as nice colored CLIs based on its tie in with rich-argparse and argcomplete. I'd recommend landing a baseline pure-stdlib approach first, but if you are interested I could make a follow on PR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a follow-up PR with dependency for scriptconfig CLI library would be best option, keep things as simple as possible in this PR.

README.md Outdated
Validation command:

```bash
every_eval_ever validate <path-or-dir>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not run validation and duplication checks in the github. It's only valid in the HF during uploading new data. So we don't need it here. Convert option is enough

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should though, at least in a CI test suite to cover more of the code with tests. We should also show how to do it in the README.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's harder to check duplicates now because data aren't stored in the github anymore (and it's still run when uploading data to HF, so don't see sense to do it). Validation of data will be rewritten slightly (full validation via pydantic) so it isn't needed imo.

@Erotemic Erotemic requested a review from damian1996 March 16, 2026 17:32
@Erotemic
Copy link
Copy Markdown
Contributor Author

Erotemic commented Mar 16, 2026

@damian1996 I've resolved merge conflicts, cleaned up the diff, and addressed prior discussion points.

The schemas on main changed, so I reverified that the merge updated them correctly:

# show original hashes from the main branch
sha256sum <(git show main:instance_level_eval.schema.json) <(git show main:eval.schema.json)
# show updated hashes on this branch 
sha256sum every_eval_ever/schemas/instance_level_eval.schema.json every_eval_ever/schemas/eval.schema.json
9aed691d0395e1ba35bfdd326ce693bd45a3a30c513bff0b3904e301e7db600f  /dev/fd/63
9a6154f41116cdf56d0b7968a24aa93900a9e6d4a9c799827a42ec079e3539c5  /dev/fd/62
9aed691d0395e1ba35bfdd326ce693bd45a3a30c513bff0b3904e301e7db600f  every_eval_ever/schemas/instance_level_eval.schema.json
9a6154f41116cdf56d0b7968a24aa93900a9e6d4a9c799827a42ec079e3539c5  every_eval_ever/schemas/eval.schema.json

Note: resolving the merge conflicts got a little messy, so I would recommend squashing this PR if it is accepted instead of creating a merge commit. That will keep the history of the repo cleaner. If desired, I could do a rebase here, but I'm punting on that unless you want it.

@nelaturuharsha
Copy link
Copy Markdown
Collaborator

nelaturuharsha commented Mar 18, 2026

Hi @Erotemic, the validation workflow is now going to be via pydantic + other changes will land once #69 is merged and there are some clean-ups regarding dependencies via #71.

Could you check if the incoming changes affect your PR? I'd assume the validation changing might but I've not had a chance to look through yours. Will prioritize this PR once I am done with changes to the validation pipeline (I am working on it assuming this PR will be merged - but don't want to rush this one).

@Erotemic
Copy link
Copy Markdown
Contributor Author

@nelaturuharsha

#71 did cause conflicts, but I merged in the latest main and addressed them.

For #69 there will be merge conflicts, but the core intent of it and this PR are different and compatible, if you want to merge that one first, I can update my PR accordingly. Otherwise, if this is merged first, then the changes in #71 basically need to be moved into the correct new locations and integrated with the CLI.

@damian1996
Copy link
Copy Markdown
Collaborator

@damian1996 I've resolved merge conflicts, cleaned up the diff, and addressed prior discussion points.

The schemas on main changed, so I reverified that the merge updated them correctly:

# show original hashes from the main branch
sha256sum <(git show main:instance_level_eval.schema.json) <(git show main:eval.schema.json)
# show updated hashes on this branch 
sha256sum every_eval_ever/schemas/instance_level_eval.schema.json every_eval_ever/schemas/eval.schema.json
9aed691d0395e1ba35bfdd326ce693bd45a3a30c513bff0b3904e301e7db600f  /dev/fd/63
9a6154f41116cdf56d0b7968a24aa93900a9e6d4a9c799827a42ec079e3539c5  /dev/fd/62
9aed691d0395e1ba35bfdd326ce693bd45a3a30c513bff0b3904e301e7db600f  every_eval_ever/schemas/instance_level_eval.schema.json
9a6154f41116cdf56d0b7968a24aa93900a9e6d4a9c799827a42ec079e3539c5  every_eval_ever/schemas/eval.schema.json

Note: resolving the merge conflicts got a little messy, so I would recommend squashing this PR if it is accepted instead of creating a merge commit. That will keep the history of the repo cleaner. If desired, I could do a rebase here, but I'm punting on that unless you want it.

It looks good to me but maybe wait until @nelaturuharsha will merge his PR and then let him make a pass through it as well.

@nelaturuharsha
Copy link
Copy Markdown
Collaborator

@Erotemic Thank you, #69 is now in - so if you could upstream those to your PR it'd be awesome. We should also think about a github action that publishes a release that has been tagged to pypi. I've already planned for the the validator to install the cli in the future.

@Erotemic
Copy link
Copy Markdown
Contributor Author

@damian1996 ready for re-review

@nelaturuharsha
Copy link
Copy Markdown
Collaborator

@evijit can you trigger co-pilot review? Thanks!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the repository into an installable every_eval_ever Python package (suitable for PyPI), bundles schemas under the package namespace, and adds a top-level CLI (plus converter CLIs) alongside CI workflows that validate data via the package entrypoint.

Changes:

  • Moved library code + generated Pydantic models + JSON schemas under every_eval_ever/ and updated imports across utils/tests.
  • Added a unified CLI (every_eval_ever console script) with validate, check-duplicates, and convert subcommands, plus converter-specific CLIs.
  • Added GitHub Actions workflows for validating data and regenerating types based on package-local schemas.

Reviewed changes

Copilot reviewed 45 out of 59 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
utils/rewardbench/adapter.py Switch imports to every_eval_ever.* and remove sys.path manipulation.
utils/hfopenllm_v2/adapter.py Switch imports to every_eval_ever.* and remove sys.path manipulation.
utils/helm/adapter.py Switch imports to every_eval_ever.* and remove sys.path manipulation.
utils/global-mmlu-lite/adapter.py Switch imports to every_eval_ever.* and remove sys.path manipulation.
tests/test_validate.py Update validate imports and fixture schema versions.
tests/test_lm_eval_adapter.py Update imports to new package converter paths.
tests/test_inspect_instance_level_adapter.py Update imports to new package converter paths.
tests/test_inspect_adapter.py Update imports to new package converter paths.
tests/test_helm_instance_level_adapter.py Update imports to new package converter paths.
tests/test_helm_adapter.py Update imports to new package converter paths.
tests/test_check_duplicate_entries.py Test against every_eval_ever.check_duplicate_entries and new main(argv) return code.
pyproject.toml Add build-system + setuptools package discovery/data and CLI entrypoint.
post_codegen.py Update paths/usage for package-local schemas and generated types.
eval.schema.json Convert root schema file to a symlink target pointing into every_eval_ever/schemas/.
instance_level_eval.schema.json Convert root schema file to a symlink target pointing into every_eval_ever/schemas/.
every_eval_ever/validate.py Move validator into package and make main(argv)->int for CLI integration.
every_eval_ever/check_duplicate_entries.py Move duplicate checker into package and make main(argv)->int for CLI integration.
every_eval_ever/schemas/eval.schema.json Add bundled aggregate schema under package.
every_eval_ever/schemas/instance_level_eval.schema.json Add bundled instance-level schema under package.
every_eval_ever/schemas/init.py Mark schemas as a package for resource loading.
every_eval_ever/schema.py Add helpers for loading bundled schemas via importlib.resources.
every_eval_ever/eval_types.py Relocate generated aggregate Pydantic types under package.
every_eval_ever/instance_level_types.py Relocate generated instance-level Pydantic types under package.
every_eval_ever/helpers/schema.py Update imports and extend make_source_metadata(..., additional_details=...).
every_eval_ever/helpers/io.py Update imports to package namespace.
every_eval_ever/helpers/fetch.py Add HTTP fetching utilities (requests-based).
every_eval_ever/helpers/developer.py Add developer/model ID derivation helpers.
every_eval_ever/helpers/init.py Export helper utilities via __all__.
every_eval_ever/converters/init.py Reintroduce converter SCHEMA_VERSION constant under new namespace.
every_eval_ever/converters/README.md Update docs/commands to use new package CLI + extras.
every_eval_ever/converters/common/adapter.py Update imports to package namespace for shared adapter base.
every_eval_ever/converters/common/error.py Add shared adapter exception types.
every_eval_ever/converters/common/utils.py Add shared utils (timestamps, HF lookup, hashing).
every_eval_ever/converters/common/init.py Package marker for common converters module.
every_eval_ever/converters/inspect/adapter.py Update imports + make inspect dependency optional with runtime gating.
every_eval_ever/converters/inspect/instance_level_adapter.py Make inspect dependency optional with runtime gating.
every_eval_ever/converters/inspect/utils.py Update imports to new common utils / types.
every_eval_ever/converters/inspect/main.py Update entrypoint imports to new namespace.
every_eval_ever/converters/inspect/init.py Package marker for inspect converter module.
every_eval_ever/converters/helm/adapter.py Make helm dependency optional with runtime gating; update imports.
every_eval_ever/converters/helm/instance_level_adapter.py Make helm dependency optional with runtime gating; update imports.
every_eval_ever/converters/helm/utils.py Make helm dependency optional for type import.
every_eval_ever/converters/helm/main.py Update entrypoint imports to new namespace.
every_eval_ever/converters/helm/init.py Package marker for helm converter module.
every_eval_ever/converters/lm_eval/adapter.py Update imports to new namespace.
every_eval_ever/converters/lm_eval/instance_level_adapter.py Update imports to new namespace.
every_eval_ever/converters/lm_eval/utils.py Add lm-eval helper utilities for parsing args, locating samples, etc.
every_eval_ever/converters/lm_eval/main.py Add CLI for lm-eval conversion under package namespace.
every_eval_ever/converters/lm_eval/init.py Package marker for lm-eval converter module.
every_eval_ever/cli.py Add top-level CLI: validate, check-duplicates, convert (lm_eval/inspect/helm).
every_eval_ever/init.py Define lightweight public API via lazy module loading.
every_eval_ever/main.py Add python -m every_eval_ever entrypoint wrapper.
eval_converters/init.py Remove old namespace’s SCHEMA_VERSION constant.
README.md Update documentation to reflect package install, new CLI, and new workflow path.
.github/workflows/validate-data.yml Add data validation workflow invoking the package CLI.
.github/workflows/regenerate_types.yml Update type regeneration workflow to use package-local schemas/outputs.
Comments suppressed due to low confidence (2)

every_eval_ever/converters/inspect/adapter.py:106

  • AdapterMetadata requires supported_library_versions, but metadata() constructs it with only name, version, and description. Accessing InspectAIAdapter.metadata will raise TypeError: __init__() missing 1 required positional argument. Populate supported_library_versions (or make the field optional/defaulted in the dataclass).
    every_eval_ever/converters/helm/adapter.py:376
  • Several HELMAdapter method contracts are inconsistent with BaseEvaluationAdapter: _transform_single() is annotated as returning a tuple but returns only EvaluationLog, and transform_from_directory() takes an extra output_path parameter compared to the abstract base signature. These mismatches make it hard to call the adapter polymorphically and confuse static analysis. Additionally, metadata() constructs AdapterMetadata without the required supported_library_versions field, which will raise TypeError if accessed. Consider aligning these signatures/annotations with the base class and populating all required AdapterMetadata fields.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Erotemic and others added 4 commits March 20, 2026 12:06
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@nelaturuharsha nelaturuharsha merged commit 431a956 into evaleval:main Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants