Emmy

Evaluation-invariant measurement for multi-agent AI systems.

Status: Pre-experiment. The program overview is the current canonical reference; additional documents will be added to docs/ as they are finalized for public release. A small runnable demonstration of the core idea is in demo/.

The gap

AI agents increasingly work in groups — language-model agents that call tools and coordinate with one another, reinforcement-learning agents acting in a shared environment. These collectives are already moving into real-world deployment.

We have little settled practice for measuring what such a group does collectively. The closest tools look elsewhere: single-model methods — evaluation, interpretability, AI control — inspect one model at a time, and reveal little about the behavior of the group. Multi-agent results, where reported, are expressed in terms tied to one setup — a score that depends on a particular environment's payoffs, a behavioral signal defined for a single experiment. Those numbers rarely carry from one paper to the next, or from a lab setup to a deployment. And the reporting conventions that downstream safety and evaluation work will inherit are being fixed now — before the measurement practice supporting them is sound.

The bridge

Emmy is a research program building the measurement foundations for groups of AI agents. Its approach is unusual: instead of starting from a human idea of what the agents are doing — cooperating, competing, deceiving — and reaching for a number to stand in for it, Emmy starts from quantities it can measure cleanly, then asks what they reveal about safety. The hypothesis is simple: the right unit of measurement is the observable — a quantity computed from what the agents do, their actions and observations, defined so that it does not change when you rescale rewards or relabel a setup. Get a small, standard set of these right, and two things follow. Claims about coordination, robustness, and failure become comparable across papers. And an external evaluator gains a way to inspect a deployed group of agents directly — the inspection layer that single-model methods do not provide.
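To make the notion of an observable concrete, here is a minimal sketch of how one might check a candidate quantity against a set of transformations. Everything in it is an illustrative assumption rather than part of the Emmy library: the `Step` record, the `coordination_rate` definition (fraction of steps on which all agents pick the same action), and the `is_invariant` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    """One joint time step: what each agent did and what it was paid (hypothetical record)."""
    actions: tuple
    rewards: tuple


def coordination_rate(traj):
    """Fraction of steps on which every agent chose the same action (assumed definition)."""
    return sum(len(set(s.actions)) == 1 for s in traj) / len(traj)


def mean_reward(traj):
    """Average total reward per step; reward-based, included only for contrast."""
    return sum(sum(s.rewards) for s in traj) / len(traj)


def rescale_rewards(c):
    """Transformation: multiply every reward by the positive constant c."""
    return lambda traj: [replace(s, rewards=tuple(c * r for r in s.rewards)) for s in traj]


def relabel_actions(mapping):
    """Transformation: rename actions with one shared one-to-one mapping."""
    return lambda traj: [replace(s, actions=tuple(mapping[a] for a in s.actions)) for s in traj]


def is_invariant(quantity, traj, transforms, tol=1e-12):
    """True if the quantity is unchanged (within tol) under every transformation."""
    base = quantity(traj)
    return all(abs(quantity(t(traj)) - base) <= tol for t in transforms)


# A tiny hand-written two-agent log, used only to exercise the check.
traj = [
    Step((0, 0), (3.0, 3.0)),
    Step((0, 1), (0.0, 5.0)),
    Step((1, 1), (1.0, 1.0)),
    Step((0, 0), (3.0, 3.0)),
]
transforms = [rescale_rewards(10.0), relabel_actions({0: 1, 1: 0})]

print(is_invariant(coordination_rate, traj, transforms))  # True
print(is_invariant(mean_reward, traj, transforms))        # False
```

The coordination rate passes because it never reads the rewards and depends only on whether actions match, not on what they are called. The mean reward fails as soon as the rewards are rescaled.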

Because the measurement depends only on behavior, it requires no privileged access to the underlying models, and the same instruments apply whether the agents are reinforcement-learning policies, language-model ensembles, or active-inference systems.

The first deliverable is a paper and an open-source library: a small, canonical set of these observables for cooperative multi-agent collectives, with pre-registered, falsifiable claims and a published null-result protocol.

Two limitations are worth noting. First, this is pre-experiment work: the claims above are not yet validated. Second, the framework measures behavior, not internal cognition. It is a complement to benchmark evaluation and interpretability, not a replacement.

Demo: invariance under reward rescaling

demo/ contains a small, runnable instance of the central claim: two tabular Q-learning agents play the iterated prisoner's dilemma, and the behavioral observables (coordination, action autocorrelation) stay invariant under reward rescaling while reward-based quantities do not. It is a smoke test of the measurement machinery, not a research result. See demo/README.md.
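The idea can be sketched in a few lines of standard-library Python. This is not the code in demo/; it is an independent illustration under stated assumptions: the conventional payoffs 5/3/1/0, a state equal to the previous joint action, zero-initialized Q-tables, epsilon-greedy exploration with a shared seed, and hypothetical definitions of coordination (fraction of steps with matching actions) and lag-1 action autocorrelation. The scale factor is a power of two so that floating-point scaling is exact. For other positive factors the argument holds in exact arithmetic, but rounding could in principle break a tie differently.

```python
import random

# Assumed prisoner's dilemma payoffs; action 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}


def run(scale, steps=5000, seed=0, alpha=0.1, gamma=0.9, eps=0.1):
    """Two independent tabular Q-learners; returns joint actions and total rewards."""
    rng = random.Random(seed)
    # q[agent][state][action]; states 0..3 encode the previous joint action, 4 is the start.
    q = [[[0.0, 0.0] for _ in range(5)] for _ in range(2)]
    state, actions, rewards = 4, [], []
    for _ in range(steps):
        joint = []
        for i in range(2):
            if rng.random() < eps:
                joint.append(rng.randrange(2))  # explore
            else:
                joint.append(0 if q[i][state][0] >= q[i][state][1] else 1)  # exploit
        a0, a1 = joint
        r = [scale * x for x in PAYOFF[(a0, a1)]]
        nxt = 2 * a0 + a1
        for i in range(2):
            # Standard Q-learning update toward the bootstrapped target.
            target = r[i] + gamma * max(q[i][nxt])
            q[i][state][joint[i]] += alpha * (target - q[i][state][joint[i]])
        actions.append((a0, a1))
        rewards.append(r[0] + r[1])
        state = nxt
    return actions, rewards


def coordination(actions):
    """Fraction of steps on which both agents chose the same action."""
    return sum(a0 == a1 for a0, a1 in actions) / len(actions)


def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a sequence; 0.0 if the sequence is constant."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((xs[t] - mean) * (xs[t + 1] - mean) for t in range(n - 1)) / var


acts_1, rew_1 = run(scale=1.0)
acts_4, rew_4 = run(scale=4.0)

print("same coordination:    ", coordination(acts_1) == coordination(acts_4))
print("same autocorrelation: ",
      lag1_autocorr([a for a, _ in acts_1]) == lag1_autocorr([a for a, _ in acts_4]))
print("same mean reward:     ", sum(rew_1) / len(rew_1) == sum(rew_4) / len(rew_4))
```

With zero-initialized tables the Q-values scale linearly with the rewards, so the greedy choice at every step is the same and both runs produce the same action sequence. The behavioral quantities therefore agree, while the mean reward differs by the scale factor.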

Emmy Noether

The program is named after Emmy Noether (1882–1935), whose foundational work connecting symmetries to invariants underlies the framing of evaluation-invariant measurement.

License

Apache 2.0 — see LICENSE.
