BenchMAC is a benchmark for evaluating AI agents on their ability to perform complex, real-world migration tasks on Angular codebases.
Developed as part of my Master's thesis in collaboration with onepoint.
Migrating Angular applications across major versions is a nuanced task that goes beyond simple dependency updates. It requires adapting to breaking API changes, refactoring code and tests, and ensuring the entire project remains buildable, lint-free, and functionally correct.
BenchMAC provides a standardized, automated, and reproducible way to measure the performance of any AI system on these tasks.
BenchMAC separates migration generation from evaluation so any system under test (SUT) can plug in while results stay comparable.
- Patch Generation: An SUT works inside an instance-specific Docker image. It can run commands, iterate on errors, and make arbitrary edits. Its only obligation is to output a unified diff describing the final migration.
- Patch Evaluation: The harness resets the repository to the instance baseline, applies the diff, and runs the canonical install/build/test command sequence. All outputs are collected to compute metrics and capture failure evidence.
This architecture lets researchers innovate on agents, prompts, or rules without touching the evaluator, and keeps scoring deterministic across submissions.
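The SUT's only deliverable is a unified diff. As a minimal illustration of that contract, here is a sample diff and a helper that enumerates the files a patch modifies — note that both the sample and `touched_files` are hypothetical sketches, not part of BenchMAC's actual API:

```python
# Hypothetical example: neither the sample diff nor `touched_files`
# is part of BenchMAC's API; this only illustrates the
# "output a unified diff" contract described above.

SAMPLE_DIFF = """\
diff --git a/package.json b/package.json
--- a/package.json
+++ b/package.json
@@ -10,7 +10,7 @@
-    "@angular/core": "~11.2.14",
+    "@angular/core": "~12.2.0",
"""

def touched_files(diff: str) -> list[str]:
    """List files modified by a unified diff, via its '+++ b/' headers."""
    prefix = "+++ b/"
    return [line[len(prefix):] for line in diff.splitlines()
            if line.startswith(prefix)]

print(touched_files(SAMPLE_DIFF))  # -> ['package.json']
```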
| Metric | What it checks | Why it matters |
|---|---|---|
| Patch application | Diff applies cleanly | Detects syntactic conflicts and missing files before deeper checks |
| Target version attainment | Framework versions match the goal | Ensures the upgrade actually lands on the requested release |
| Build success | Canonical build command exits 0 | Confirms the migrated project compiles in CI-like conditions |
Each run also archives command logs and agent output so teams can trace failure modes.
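As an example, the target version attainment check could be approximated by inspecting the migrated `package.json`. This is a sketch under assumed semantics (checking the declared `@angular/core` major version); the real harness may apply a different rule:

```python
import json

# Sketch only: the exact rule BenchMAC uses to verify target version
# attainment may differ; checking the @angular/core major version in
# package.json is an assumption for illustration.

def attains_target(package_json: str, target_major: int) -> bool:
    """True if the declared @angular/core major version meets the target."""
    deps = json.loads(package_json).get("dependencies", {})
    spec = deps.get("@angular/core", "")          # e.g. "~12.2.0"
    major = "".join(ch for ch in spec.split(".")[0] if ch.isdigit())
    return major.isdigit() and int(major) >= target_major

print(attains_target('{"dependencies": {"@angular/core": "~12.2.0"}}', 12))
```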
BenchMAC v1.0 ships nine instances drawn from gothinkster/angular-realworld-example-app, covering consecutive upgrades from Angular 11→12 through 19→20. The dataset definition lives in `data/instances.jsonl` with matching Dockerfiles under `data/dockerfiles/`.
Every instance has a pinned Docker image that:
- Downloads a history-free archive of the baseline commit
- Pins Node.js and npm by digest, then freezes Debian packages via snapshot.debian.org
- Tags a baseline Git commit
The harness and the agents both rely on these images, ensuring that patches generated today will be evaluated the same way tomorrow.
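The dataset file is plain JSON Lines, so loading it is straightforward. In the sketch below, the field names (`instance_id`, `target_version`, `docker_image`) are illustrative guesses rather than the benchmark's actual schema; consult `data/instances.jsonl` for the real fields:

```python
import json
from io import StringIO

# Field names here are hypothetical; see data/instances.jsonl for
# the real schema.
SAMPLE = StringIO(
    '{"instance_id": "realworld-11-to-12", '
    '"target_version": "12", '
    '"docker_image": "benchmac/realworld:11-to-12"}\n'
)

def load_instances(fp) -> list[dict]:
    """Parse one JSON object per non-empty line (JSON Lines)."""
    return [json.loads(line) for line in fp if line.strip()]

instances = load_instances(SAMPLE)
print(instances[0]["docker_image"])  # -> benchmac/realworld:11-to-12
```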
The inaugural v1.0 dataset consists of nine consecutive migration tasks (Angular v11→v20) from the gothinkster/angular-realworld-example-app repository. An initial study from the Master's thesis produced the following comparison of agent performance versus cost.
Findings:
- Multiple methods, including the rule-based angular-schematics tool, achieved a perfect 100% success rate.
- The benchmark successfully highlighted that several LLMs underperformed compared to the non-AI, rule-based approach.
- However, the current dataset is relatively simple: because many methods achieve perfect scores, the benchmark does not effectively distinguish between their capabilities.
- To address this, future work should focus on raising the benchmark's difficulty by introducing more complex instances (see ./docs/add-new-instance.md).
For detailed instructions on running experiments and evaluating AI agents, see the Usage Guide.
Quick overview:
- Install dependencies: `uv sync`
- Run experiments: `uv run experiments/run_experiment.py`
- Evaluate results: `uv run benchmac eval`
- Explore results: `uv run explorer`
For a comprehensive understanding of the research methodology and experimental results, please refer to the Master's thesis, particularly the Methodology and Experiments sections.
This project is licensed under the MIT License. See the LICENSE file for details.
@misc{mallet2024benchmac,
author = {Samuel Mallet},
title = {BenchMAC: A Benchmark for Evaluating AI-Assisted Angular Version Migration},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/SuperMuel/BenchMAC}}
}
