msyvr/emmy


Emmy

Evaluation-invariant measurement for multi-agent AI systems.

Status: Pre-experiment. The program overview is the current canonical reference; additional documents will be added to docs/ as they are finalized for public release. A small runnable demonstration of the core idea is in demo/.

The gap

AI agents increasingly work in groups — language-model agents that call tools and coordinate with one another, reinforcement-learning agents acting in a shared environment. These collectives are already moving into real-world deployment.

We have little settled practice for measuring what such a group does collectively. The closest tools look elsewhere: single-model methods — evaluation, interpretability, AI control — inspect one model at a time, and reveal little about the behavior of the group. Multi-agent results, where reported, are expressed in terms tied to one setup — a score that depends on a particular environment's payoffs, a behavioral signal defined for a single experiment. Those numbers rarely carry from one paper to the next, or from a lab setup to a deployment. And the reporting conventions that downstream safety and evaluation work will inherit are being fixed now — before the measurement practice supporting them is sound.

The bridge

Emmy is a research program building the measurement foundations for groups of AI agents. Its approach is unusual: instead of starting from a human idea of what the agents are doing — cooperating, competing, deceiving — and reaching for a number to stand in for it, Emmy starts from quantities it can measure cleanly, then asks what they reveal about safety. The hypothesis is simple: the right unit of measurement is the observable — a quantity computed from what the agents do, their actions and observations, defined so that it does not change when you rescale rewards or relabel a setup. Get a small, standard set of these right, and two things follow. Claims about coordination, robustness, and failure become comparable across papers. And an external evaluator gains a way to inspect a deployed group of agents directly — the inspection layer that single-model methods do not provide.

Because the measurement depends only on behavior, it requires no privileged access to the underlying models, and the same instruments apply whether the agents are reinforcement-learning policies, language-model ensembles, or active-inference systems.

The first deliverable is a paper and an open-source library: a small, canonical set of these observables for cooperative multi-agent collectives, with pre-registered, falsifiable claims and a published null-result protocol.

Two limits of note. First, this is pre-experiment work — these claims are not yet validated. Second, the framework measures behavior, not internal cognition — a complement to benchmark evaluation and interpretability, not a replacement.

Demo: invariance under reward rescaling

demo/ contains a small, runnable instance of the central claim: two tabular Q-learning agents play the iterated prisoner's dilemma, and the behavioral observables (coordination, action autocorrelation) stay invariant under reward rescaling while reward-based quantities do not. It is a smoke test of the measurement machinery, not a research result. See demo/README.md.
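The idea can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code in demo/: the payoff values, the state representation (previous joint action), the hyperparameters, and the definitions of `coordination` and `lag1_autocorr` are all assumptions made for this example. It relies on one fact about tabular Q-learning: with zero-initialised Q-values and greedy action selection, multiplying every reward by a positive constant multiplies every Q-value by the same constant, so the argmax, and therefore the behavior, is unchanged.

```python
import random

# Assumed payoffs for illustration (T > R > P > S); action 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

def run(scale, steps=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Two epsilon-greedy tabular Q-learners; rewards are multiplied by `scale`."""
    rng = random.Random(seed)
    states = [(a, b) for a in (0, 1) for b in (0, 1)]
    # q[agent][state][action], initialised to zero; state = previous joint action.
    q = [{s: [0.0, 0.0] for s in states} for _ in range(2)]
    state = (0, 0)
    actions, total_reward = [], 0.0
    for _ in range(steps):
        joint = []
        for i in range(2):
            if rng.random() < eps:
                joint.append(rng.randrange(2))            # explore
            else:
                qa = q[i][state]
                joint.append(0 if qa[0] >= qa[1] else 1)  # greedy, ties -> cooperate
        joint = tuple(joint)
        rewards = [scale * r for r in PAYOFF[joint]]
        for i in range(2):
            target = rewards[i] + gamma * max(q[i][joint])
            q[i][state][joint[i]] += alpha * (target - q[i][state][joint[i]])
        state = joint
        actions.append(joint)
        total_reward += sum(rewards)
    return actions, total_reward

def coordination(actions):
    """Fraction of steps on which both agents chose the same action."""
    return sum(a == b for a, b in actions) / len(actions)

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of one agent's action sequence (0.0 if constant)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[t] - mean) * (xs[t + 1] - mean) for t in range(n - 1))
    return cov / var

# The scale is a power of two so the floating-point Q-values scale exactly.
# For other positive scales the argument holds in exact arithmetic, but
# rounding could break near-ties differently.
base_actions, base_reward = run(scale=1.0)
scaled_actions, scaled_reward = run(scale=4.0)

print("coordination:   ", coordination(base_actions), coordination(scaled_actions))
print("autocorrelation:", lag1_autocorr([a for a, _ in base_actions]),
      lag1_autocorr([a for a, _ in scaled_actions]))
print("total reward:   ", base_reward, scaled_reward)
```

Under these assumptions the two runs produce the same action sequence, so both behavioral observables agree, while the total reward differs by the scale factor. Only positive scaling is covered here; adding a constant offset to the rewards changes the learning dynamics from a zero initialisation and is not addressed by this sketch.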

Emmy Noether

The program is named after Emmy Noether (1882–1935), whose foundational work connecting symmetries to invariants underlies the framing of evaluation-invariant measurement.

License

Apache 2.0 — see LICENSE.
