
BenchMAC


![BenchMAC banner](docs/images/banner.png)

BenchMAC is a benchmark for evaluating AI agents on their ability to perform complex, real-world migration tasks for Angular codebases.

Developed as part of my Master's thesis in collaboration with onepoint.

🚀 Introduction

Migrating Angular applications across major versions is a nuanced task that goes beyond simple dependency updates. It requires adapting to breaking API changes, refactoring code and tests, and ensuring the entire project remains buildable, lint-free, and functionally correct.

BenchMAC provides a standardized, automated, and reproducible way to measure the performance of any AI system on these tasks.

🛠️ How It Works

BenchMAC separates migration generation from evaluation so any system under test (SUT) can plug in while results stay comparable.

  1. Patch Generation: An SUT works inside an instance-specific Docker image. It can run commands, iterate on errors, and make arbitrary edits. Its only obligation is to output a unified diff describing the final migration.
  2. Patch Evaluation: The harness resets the repository to the instance baseline, applies the diff, and runs the canonical install/build/test command sequence. All outputs are collected to compute metrics and capture failure evidence.
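The evaluation phase above can be sketched in a few lines of Python. The function and result names here are illustrative assumptions, not BenchMAC's actual API, and the npm commands stand in for whatever canonical sequence an instance defines:

```python
import subprocess
from dataclasses import dataclass


@dataclass
class StepResult:
    """Outcome of one canonical command (hypothetical shape)."""
    command: str
    exit_code: int


def evaluate_patch(repo_dir: str, patch_path: str) -> list[StepResult]:
    """Reset to the baseline commit, apply the SUT's unified diff,
    then run the install/build/test sequence, collecting exit codes."""
    # Reset the working tree to the pinned baseline (tag name assumed).
    subprocess.run(["git", "-C", repo_dir, "reset", "--hard", "baseline"], check=True)
    # Apply the diff produced by the system under test.
    subprocess.run(["git", "-C", repo_dir, "apply", patch_path], check=True)
    results = []
    for cmd in (["npm", "ci"], ["npm", "run", "build"], ["npm", "test"]):
        proc = subprocess.run(cmd, cwd=repo_dir, capture_output=True)
        results.append(StepResult(" ".join(cmd), proc.returncode))
    return results
```

Because every step's exit code is recorded rather than aborting on first failure, the harness can attribute a failure to a specific stage.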

This architecture lets researchers innovate on agents, prompts, or rules without touching the evaluator, and keeps scoring deterministic across submissions.

📊 Evaluation Metrics

| Metric | What it checks | Why it matters |
| --- | --- | --- |
| Patch application | Diff applies cleanly | Detects syntactic conflicts and missing files before deeper checks |
| Target version attainment | Framework versions match the goal | Ensures the upgrade actually lands on the requested release |
| Build success | Canonical build command exits 0 | Confirms the migrated project compiles in CI-like conditions |

Each run also archives command logs and agent output so teams can trace failure modes.

🗃️ Dataset

BenchMAC v1.0 ships nine instances drawn from gothinkster/angular-realworld-example-app, covering consecutive upgrades from Angular 11→12 through 19→20. The dataset definition lives in data/instances.jsonl with matching Dockerfiles under data/dockerfiles/.
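JSON Lines files like data/instances.jsonl hold one JSON object per line, so the dataset can be loaded with the standard library alone. This is a generic sketch; the actual instance fields (e.g. an `instance_id` key) are assumptions, not BenchMAC's documented schema:

```python
import json
from pathlib import Path


def load_instances(path: str = "data/instances.jsonl") -> list[dict]:
    """Parse a JSONL dataset: one JSON object per non-blank line."""
    instances = []
    for line in Path(path).read_text().splitlines():
        if line.strip():  # skip blank lines
            instances.append(json.loads(line))
    return instances
```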

🧊 Reproducible Environments

Every instance has a pinned Docker image that:

  • Downloads a history-free archive of the baseline commit
  • Pins Node.js and npm by digest, then freezes Debian packages via snapshot.debian.org
  • Tags a baseline Git commit

The harness and the agents both rely on these images, ensuring that patches generated today will be evaluated the same way tomorrow.

🔬 BenchMAC v1.0: A Case Study

The inaugural v1.0 dataset consists of nine consecutive migration tasks (Angular v11→v20) from the gothinkster/angular-realworld-example-app repository. An initial study from the Master's thesis produced the following comparison of agent performance versus cost.

Figure: Agent Performance, Build Success Rate vs. Average Cost

Findings:

Multiple methods, including the rule-based angular-schematics tool, achieved a perfect 100% success rate. The benchmark successfully highlighted that several LLMs underperformed compared to the non-AI, rule-based approach. However, we also observed that the current dataset is relatively simple: since many methods achieve perfect scores, the benchmark does not effectively distinguish between their capabilities.

To address this, future work should focus on raising the benchmark difficulty by introducing more complex instances. (See ./docs/add-new-instance.md)

🚀 Getting Started

For detailed instructions on running experiments and evaluating AI agents, see the Usage Guide.

Quick overview:

  • Install dependencies with uv sync
  • Run experiments: uv run experiments/run_experiment.py
  • Evaluate results: uv run benchmac eval
  • Explore results: uv run explorer

📘 Documentation

For a comprehensive understanding of the research methodology and experimental results, please refer to the Master's thesis, particularly the Methodology and Experiments sections.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

📚 Citation

@misc{mallet2024benchmac,
  author       = {Samuel Mallet},
  title        = {BenchMAC: A Benchmark for Evaluating AI-Assisted Angular Version Migration},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/SuperMuel/BenchMAC}}
}
