# The Future of Atomistic Modeling and Simulation Benchmarks for MLIPs
Foundation machine learning interatomic potentials (MLIPs), trained on extensive databases containing millions of density functional theory (DFT) calculations, have revolutionized molecular and materials modeling. However, existing benchmarks often suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific DFT references.
MLIP Arena introduces a unified, cutting-edge benchmark platform for evaluating foundation MLIP performance far beyond conventional error metrics. It focuses on revealing the physical soundness learned by MLIPs and assessing their practical utility, remaining completely agnostic to the underlying model architectures and training datasets.
By moving beyond static DFT references and revealing the critical failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide next-generation MLIP development. We aim to drive improvements in predictive accuracy and runtime efficiency while maintaining robust physical consistency!
MLIP Arena leverages modern Pythonic workflow orchestration with Prefect to enable advanced task/flow chaining, scaling, and caching.
> [!NOTE]
> Contributions of new tasks via PRs are highly welcome! See our Project Page for outstanding tasks, or propose new feature requests in Discussions.
For comprehensive guides, API references, and advanced usage, please visit our Official Documentation Site!
- [Sep 18, 2025] MLIP Arena is accepted as a Spotlight (top 3.5%) at NeurIPS!
- [Apr 8, 2025] MLIP Arena is accepted as an ICLR AI4Mat Spotlight! Huge thanks to all co-authors for their contributions!
```bash
pip install mlip-arena
```

> [!CAUTION]
> We strongly recommend a clean build in a new virtual environment due to compatibility issues between multiple popular MLIPs. We provide a single installation script using uv for minimal package conflicts and blazing-fast installation!
> [!IMPORTANT]
> To automatically download Fairchem model checkpoints, please ensure you have download access to their Hugging Face model repo (e.g., OMAT24) (not the dataset repo). You must also log in locally on your machine via `hf auth login` (see HF Hub authentication).
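Assuming you use the Hugging Face Hub client (recent `huggingface_hub` releases ship the `hf` entry point), the one-time setup looks like the following sketch:

```shell
# Install or upgrade the Hugging Face Hub client (provides the `hf` CLI)
pip install -U huggingface_hub

# Log in with a token that has read access to the gated model repos
hf auth login
```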
### Linux
```bash
# (Optional) Install uv (it's much faster than pip!)
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

git clone https://github.com/atomind-ai/mlip-arena.git
cd mlip-arena

# One-script uv pip installation
bash scripts/install.sh
```

> [!TIP]
> Installing all compiled models can consume significant local storage. Passing the `--no-cache` flag to pip and running `uv cache clean` afterwards are extremely helpful for freeing up space.
### macOS
```bash
# (Optional) Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# One-script uv pip installation
bash scripts/install-macosx.sh
```

Instructions for individual benchmarks are provided in the README within each corresponding folder under `/benchmarks`.
For a complete benchmark sweep using HPC resources, see the benchmarks/submit.py script. Refer to the Run Benchmarks and Submit Model section for usage instructions.
MLIP Arena provides a unified interface to run all compiled MLIPs; simply iterate over `MLIPEnum`:
```python
from mlip_arena.models import MLIPEnum
from mlip_arena.tasks import MD
from mlip_arena.tasks.utils import get_calculator

from ase import units
from ase.build import bulk

atoms = bulk("Cu", "fcc", a=3.6) * (5, 5, 5)

results = []
for model in MLIPEnum:
    result = MD(
        atoms=atoms,
        calculator=get_calculator(
            model,
            calculator_kwargs=dict(),  # directly passed to the calculator
            dispersion=True,
            dispersion_kwargs=dict(
                damping="bj", xc="pbe", cutoff=40.0 * units.Bohr
            ),  # passed to TorchDFTD3Calculator
        ),  # compatible with custom ASE Calculators
        ensemble="nve",  # nvt and npt are also available
        dynamics="velocityverlet",  # compatible with any ASE Dynamics objects and their class names
        total_time=1e3,  # 1 ps = 1e3 fs
        time_step=2,  # fs
    )
    results.append(result)
```

To run multiple benchmarks in parallel, append `.submit` to the task function and wrap your tasks in a flow. This dispatches them to a local or remote worker for concurrent execution. See the Prefect documentation on tasks and flows for more details.
```python
from prefect import flow


@flow
def run_all_tasks():
    futures = []
    for model in MLIPEnum:
        future = MD.submit(
            atoms=atoms,
            ...
        )
        futures.append(future)
    return [f.result(raise_on_failure=False) for f in futures]
```

For a more practical example using HPC resources, please refer to the submission script or our MD stability benchmark.
The implemented tasks are available under `mlip_arena.tasks.<module>.run` or via `from mlip_arena.tasks import *` for convenient imports (note: the latter currently requires `phonopy` to be installed).
- OPT: Structure optimization
- EOS: Equation of state (energy-volume scan)
- MD: Molecular dynamics with flexible dynamics (NVE, NVT, NPT) and temperature/pressure scheduling (annealing, shearing, etc.)
- PHONON: Phonon calculation driven by phonopy
- NEB: Nudged elastic band
- NEB_FROM_ENDPOINTS: Nudged elastic band with convenient image interpolation (linear or IDPP)
- ELASTICITY: Elastic tensor calculation
PRs are welcome! Please clone the repo and submit PRs with your changes.
To make changes to the Hugging Face Space, fetch large files from git LFS first, and then run Streamlit:
```bash
git lfs fetch --all
git lfs pull
streamlit run serve/app.py
```

If you have pretrained MLIP models that you would like to contribute to MLIP Arena and evaluate in real-time benchmarks, you have two options:
**Option 1: External ASE Calculator**

- Implement a new ASE Calculator class in `mlip_arena/models/externals`.
- Name your class with your awesome model name and add the exact same name to the registry with your metadata.
> [!CAUTION]
> Remove unnecessary outputs from the `results` class attributes to avoid errors during MD simulations. Please refer to CHGNet as an example.
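The intent of the caution above can be illustrated with a plain-Python sketch (the key names below are hypothetical; real calculators subclass `ase.calculators.calculator.Calculator` and populate `self.results` inside `calculate`):

```python
# Hypothetical sketch: keep only the outputs ASE dynamics actually need.
# Extra entries (e.g. per-atom embeddings) can break MD drivers that
# iterate over the results dict.
KEEP = {"energy", "free_energy", "forces", "stress"}


def prune_results(raw_results: dict) -> dict:
    """Drop any outputs not in the standard ASE property set."""
    return {k: v for k, v in raw_results.items() if k in KEEP}


raw = {
    "energy": -3.74,
    "forces": [[0.0, 0.0, 0.0]],
    "node_features": [0.1, 0.2],  # unnecessary output -> removed
}
print(prune_results(raw))  # → {'energy': -3.74, 'forces': [[0.0, 0.0, 0.0]]}
```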
**Option 2: Hugging Face model**

- Inherit the Hugging Face `ModelHubMixin` class in your model class definition. We recommend `PytorchModelHubMixin`.
- Create a new Hugging Face Model repository and upload the model file using the `push_to_hub` function.
- Follow the template to code the I/O interface for your model here.
- Update the model registry with the necessary metadata.
Once your model is ready (either registered or initialized as a custom ASE Calculator), you can run the core benchmark suite on a SLURM cluster:
- Move into the `benchmarks/` directory:

  ```bash
  cd benchmarks
  ```

- Open and modify the `submit.py` template script. Under the USER CONFIGURATION section:
  - Provide your `MODEL` (as a registered string or a custom ASE Calculator instance).
  - Adjust the `SLURM_CONFIG` parameters for your specific HPC allocation (including any conda environments or module loads in the `job_script_prologue`).
- Submit the pipeline:

  ```bash
  python submit.py
  ```

  This will dynamically distribute and run the core benchmarks (diatomics, EOS bulk, and E-V scans) via Dask-Jobqueue on your SLURM cluster.
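For orientation, a `SLURM_CONFIG` might look like the following sketch. The field names mirror `dask_jobqueue.SLURMCluster` keyword arguments; the exact keys expected by `submit.py`, and the environment name, are assumptions here:

```python
# Hypothetical SLURM_CONFIG sketch, modeled on dask_jobqueue.SLURMCluster kwargs
SLURM_CONFIG = dict(
    account="my-allocation",   # your HPC allocation (placeholder)
    cores=8,                   # cores per SLURM job
    memory="32GB",             # memory per SLURM job
    walltime="02:00:00",       # HH:MM:SS wall-clock limit
    job_script_prologue=[
        # conda environments / module loads go here (example env name)
        "source activate mlip-arena",
    ],
)
```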
> [!NOTE]
> Please reuse, extend, or chain the general tasks defined above and add your new folder and scripts under `/benchmarks`.
If you find this work and platform useful, please consider citing the following:
```bibtex
@inproceedings{
  chiang2025mlip,
  title={{MLIP} Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform},
  author={Yuan Chiang and Tobias Kreiman and Christine Zhang and Matthew C. Kuner and Elizabeth Jin Weaver and Ishan Amin and Hyunsoo Park and Yunsung Lim and Jihan Kim and Daryl Chrzan and Aron Walsh and Samuel M Blau and Mark Asta and Aditi S. Krishnapriyan},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025},
  url={https://openreview.net/forum?id=SAT0KPA5UO}
}
```
