BenchMAC is a benchmark for evaluating AI agents on their ability to perform complex, real-world migration tasks on Angular codebases.
Developed as part of my Master's thesis in collaboration with onepoint.
Migrating Angular applications across major versions is a nuanced task that goes beyond simple dependency updates. It requires adapting to breaking API changes, refactoring code and tests, and ensuring the entire project remains buildable, lint-free, and functionally correct.
BenchMAC provides a standardized, automated, and reproducible way to measure the performance of any AI system on these tasks.
BenchMAC separates migration generation from evaluation so any system under test (SUT) can plug in while results stay comparable.
- Patch Generation: An SUT works inside an instance-specific Docker image. It can run commands, iterate on errors, and make arbitrary edits. Its only obligation is to output a unified diff describing the final migration.
- Patch Evaluation: The harness resets the repository to the instance baseline, applies the diff, and runs the canonical install/build/test command sequence. All outputs are collected to compute metrics and capture failure evidence.
This architecture lets researchers innovate on agents, prompts, or rules without touching the evaluator, and keeps scoring deterministic across submissions.
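The SUT's only deliverable is a unified diff. As a minimal illustration of that contract, here is a sample diff and a helper that enumerates the files a patch modifies — note that both the sample and `touched_files` are hypothetical sketches, not part of BenchMAC's actual API:

```python
# Hypothetical example: neither the sample diff nor `touched_files`
# is part of BenchMAC's API; this only illustrates the
# "output a unified diff" contract described above.

SAMPLE_DIFF = """\
diff --git a/package.json b/package.json
--- a/package.json
+++ b/package.json
@@ -10,7 +10,7 @@
-    "@angular/core": "~11.2.14",
+    "@angular/core": "~12.2.0",
"""

def touched_files(diff: str) -> list[str]:
    """List files modified by a unified diff, via its '+++ b/' headers."""
    prefix = "+++ b/"
    return [line[len(prefix):] for line in diff.splitlines()
            if line.startswith(prefix)]

print(touched_files(SAMPLE_DIFF))  # -> ['package.json']
```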
| Metric | What it checks | Why it matters |
|---|---|---|
| Patch application | Diff applies cleanly | Detects syntactic conflicts and missing files before deeper checks |
| Target version attainment | Framework versions match the goal | Ensures the upgrade actually lands on the requested release |
| Build success | Canonical build command exits 0 | Confirms the migrated project compiles in CI-like conditions |
Each run also archives command logs and agent output so teams can trace failure modes.
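As an example, the target version attainment check could be approximated by inspecting the migrated `package.json`. This is a sketch under assumed semantics (checking the declared `@angular/core` major version); the real harness may apply a different rule:

```python
import json

# Sketch only: the exact rule BenchMAC uses to verify target version
# attainment may differ; checking the @angular/core major version in
# package.json is an assumption for illustration.

def attains_target(package_json: str, target_major: int) -> bool:
    """True if the declared @angular/core major version meets the target."""
    deps = json.loads(package_json).get("dependencies", {})
    spec = deps.get("@angular/core", "")          # e.g. "~12.2.0"
    major = "".join(ch for ch in spec.split(".")[0] if ch.isdigit())
    return major.isdigit() and int(major) >= target_major

print(attains_target('{"dependencies": {"@angular/core": "~12.2.0"}}', 12))
```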
BenchMAC v1.0 ships nine instances drawn from gothinkster/angular-realworld-example-app, covering consecutive upgrades from Angular 11→12 through 19→20. The dataset definition lives in `data/instances.jsonl` with matching Dockerfiles under `data/dockerfiles/`.
Every instance has a pinned Docker image that:
- Downloads a history-free archive of the baseline commit
- Pins Node.js and npm by digest, then freezes Debian packages via snapshot.debian.org
- Tags a baseline Git commit
The harness and the agents both rely on these images, ensuring that patches generated today will be evaluated the same way tomorrow.
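The dataset file is plain JSON Lines, so loading it is straightforward. In the sketch below, the field names (`instance_id`, `target_version`, `docker_image`) are illustrative guesses rather than the benchmark's actual schema; consult `data/instances.jsonl` for the real fields:

```python
import json
from io import StringIO

# Field names here are hypothetical; see data/instances.jsonl for
# the real schema.
SAMPLE = StringIO(
    '{"instance_id": "realworld-11-to-12", '
    '"target_version": "12", '
    '"docker_image": "benchmac/realworld:11-to-12"}\n'
)

def load_instances(fp) -> list[dict]:
    """Parse one JSON object per non-empty line (JSON Lines)."""
    return [json.loads(line) for line in fp if line.strip()]

instances = load_instances(SAMPLE)
print(instances[0]["docker_image"])  # -> benchmac/realworld:11-to-12
```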
The inaugural v1.0 dataset consists of nine consecutive migration tasks (Angular v11→v20) from the gothinkster/angular-realworld-example-app repository. An initial study from the Master's thesis produced the following comparison of agent performance versus cost.
Findings:
- Multiple methods, including the rule-based angular-schematics tool, achieved a perfect 100% success rate.
- The benchmark successfully highlighted that several LLMs underperformed compared to the non-AI, rule-based approach.
- However, the current dataset is relatively simple: because many methods achieve perfect scores, the benchmark does not effectively distinguish between their capabilities.
- To address this, future work should focus on raising the benchmark's difficulty by introducing more complex instances (see ./docs/add-new-instance.md).
For detailed instructions on running experiments and evaluating AI agents, see the Usage Guide.
Quick overview:
- Install dependencies: `uv sync`
- Run experiments: `uv run experiments/run_experiment.py`
- Evaluate results: `uv run benchmac eval`
- Explore results: `uv run explorer`
For a comprehensive understanding of the research methodology and experimental results, please refer to the Master's thesis, particularly the Methodology and Experiments sections.
This project is licensed under the MIT License. See the LICENSE file for details.
@misc{mallet2024benchmac,
author = {Samuel Mallet},
title = {BenchMAC: A Benchmark for Evaluating AI-Assisted Angular Version Migration},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/SuperMuel/BenchMAC}}
}
