This repository contains a suite of codes to train a Machine Learning potential on a Molecular Dynamics trajectory, run with either ab initio MD or classical MD using a universal ML potential or a classical force field.
The installation steps reported here partially refer to the Alps infrastructure of CSCS. You will have to adapt them to your needs.
- Repository codes: Move to a safe permanent folder (such as `$STORE` or `$HOME`) and type
$ git clone https://github.com/croncagl/ML-potential-training---CP2K-LAMMPS-MACE.git
$ cd ML-potential-training---CP2K-LAMMPS-MACE
- CP2K: the software is already available on Alps via uenv. You can find CP2K images with the command
$ uenv image find cp2k
and then pull the latest (as of now, 2026.1:v1) image via the command
$ uenv image pull cp2k/2026.1:v1
Then, download the `mps-wrapper.sh` script available here and place it in the repository folder. More information about CP2K on Alps can be found on this page.
- LAMMPS+MACE: Following the section "LAMMPS with MACE" at the end of this page, you will have to create a dedicated virtual environment. To do so, run the following commands in the repository folder:
$ uenv image pull lammps/20251210:v2
$ uenv start --view kokkos lammps/20251210:v2
$ python -m venv --system-site-packages venv-lammps-mace-cp2k
$ source ./venv-lammps-mace-cp2k/bin/activate
$ pip install --upgrade pip
$ pip install torch --index-url https://download.pytorch.org/whl/cu129
$ pip install mace-torch cuequivariance-torch cuequivariance cuequivariance-ops-torch-cu12 cupy-cuda12x p7zip-full
Warning: Some scripts in the repository source this virtual environment. Before running them, check that the venv name and path are adjusted accordingly.
To train a ML interatomic potential on a DFT-labeled MD trajectory, these steps are needed:
- MD simulation
- SCF labelling
- Training
- Active Learning & Fine-tuning
The first step is to run a MD simulation. There are two options:
1a) AIMD-CP2K (ab initio MD with CP2K)
1b) MD-LAMMPS (classical MD with LAMMPS)
This folder contains these scripts and files:
- `md.inp` is the CP2K input file for the ab initio molecular dynamics simulation
- `md_scf.inp` is the CP2K input file for the single point calculation needed by the ab initio molecular dynamics simulation
- `run_md.sh` is the SLURM script for the MD simulation
- `run_md_scf.sh` is the SLURM script for the single point calculation
- `water.xyz` is a file containing the coordinates of 64 water molecules in a cubic box, that you can use as a test
These scripts run a Second-Generation Car-Parrinello Molecular Dynamics.
If you are not running the test with water molecules, you must change all the settings in the input files `md.inp` and `md_scf.inp` according to your system (i.e. coordinates file, cell parameters, temperature, timestep, pseudopotentials, cutoffs, ..., more info here), and then run the calculation with
$ sbatch run_md.sh
After the simulation runs a certain number of steps, specified in the input file with STEPS in the &MD section, the job ends, and the single point set in md_scf.inp is automatically launched via run_md_scf.sh.
Then, the MD simulation restarts using the new wavefunction obtained after the precise SCF, and so on until stopped manually.
Note: Ensure the single point calculation directed by the input file `md_scf.inp` has stringent convergence criteria. This helps the simulation stay on a reasonable trajectory, pulling the electron density down to the ground state every `STEPS` steps.
The simulation will produce a trajectory file in a standard .xyz format:
192
i = 1424, time = 712.000, E = -1105.6736896936
O 3.5337107104 -3.6887187243 5.0888985501
H 3.0680039652 -4.1967370795 5.7867724170
H 4.4603980763 -3.5235828804 5.3435138791
O -3.4599963468 1.7260212500 4.6924344265
H -3.9690158895 0.8943871846 4.4571245996
H -4.0114445010 2.0629840997 5.4279543292
O 3.8132541531 3.6767210714 -7.6092879240
H 2.8077912161 3.9505080695 -7.5088943994
H 3.9071992980 2.7312972355 -8.0119714168
O -2.9465112092 -0.0914352224 -4.1691567879
H -2.3078910518 0.2395511044 -3.4738171345
H -2.3887447414 -0.7205677884 -4.6293317215
O -4.1349490504 -7.5162633190 5.6720183394
H -4.6028577825 -8.2738420126 6.1002481780
H -4.2329192090 -7.5961932706 4.6786529282
O -6.3643099029 0.4460789029 2.1358396511
H -5.7454751591 -0.2304906767 2.6165417730
H -5.8308345325 1.3052175224 2.0273909523
O 5.5066525280 -3.8891911196 0.6268247593
...
where the various snapshots i of the simulation are printed sequentially in the file.
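The frame layout above (atom count, comment line with step/time/energy, then one line per atom) can be parsed with a few lines of plain Python. The helper below is a hypothetical sketch, not part of the repository scripts:

```python
# Hypothetical helper (not part of this repository): split a CP2K-style
# multi-frame .xyz trajectory into (comment, atoms) tuples.
def read_xyz_frames(text):
    """Parse a multi-frame .xyz file: each frame is (comment, list of atoms)."""
    lines = text.strip().splitlines()
    frames, i = [], 0
    while i < len(lines):
        natoms = int(lines[i])            # first line of a frame: atom count
        comment = lines[i + 1]            # second line: i = ..., time = ..., E = ...
        atoms = []
        for line in lines[i + 2 : i + 2 + natoms]:
            sym, x, y, z = line.split()
            atoms.append((sym, float(x), float(y), float(z)))
        frames.append((comment, atoms))
        i += 2 + natoms                   # jump to the next frame header
    return frames

# Tiny single-frame example in the same layout as the trajectory above.
sample = """3
 i =     1424, time =      712.000, E =     -1105.6736896936
 O   3.5337107104  -3.6887187243   5.0888985501
 H   3.0680039652  -4.1967370795   5.7867724170
 H   4.4603980763  -3.5235828804   5.3435138791
"""
frames = read_xyz_frames(sample)
print(len(frames), len(frames[0][1]))  # 1 3
```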
Since ab initio MD simulations (such as the one performed with CP2K) can be computationally demanding (in both time and cost), an alternative is to run a classical molecular dynamics simulation with a so-called universal model, or foundation model, which describes fairly accurately the interactions between atoms of many different species. Here we refer to the MACE foundation models, which are extensively described here.
If you plan to use one of these models, first make sure to convert it to the right LAMMPS format. To do this, exit any uenv/venv loaded on a login node, and then run these commands:
$ salloc -A <your_account> -C gpu -N 1 -t 00:05:00
$ srun --pty /bin/bash
$ uenv start --view kokkos lammps/20251210:v2
$ source /path_to_your/venv-lammps-mace-cp2k/bin/activate
$ python -m mace.cli.create_lammps_model some_mace.model --format=mliap
This script will create a some_mace-mliap_lammps.pt model that can be used with the following LAMMPS instructions.
The MD-LAMMPS folder contains these scripts and files:
- `input.lammps`: example LAMMPS input file (here other details) for a MD simulation with a MACE potential
- `run_lammps_mace_slurm`: SLURM script for launching the LAMMPS MD input file
- `water.lmp`: a file containing the cell vectors as well as the masses and coordinates of 64 water molecules in a cubic box, that you can use as a test (same as `water.xyz` used in the CP2K MD)
- `mace-mh-1.model-mliap_lammps.pt.7z`: a compressed MACE-MH-1 universal model already converted into MLIAP format
To run the test with water molecules, first extract the MACE potential with the command:
$ 7z x mace-mh-1.model-mliap_lammps.pt.7z
and type Y when prompted.
In general, you must create your own `geom.lmp` file, containing the information about the system under investigation (coordinates, masses, cell, ...; here the documentation), and modify the input file `input.lammps` according to your system properties and simulation details. Then run the MD simulation with
$ sbatch run_lammps_mace_slurm
As before, the simulation will produce a trajectory file in a standard .xyz format, as requested by the dump command in the input.lammps input file.
When the MD is done, the next step is to "label" a subset of the trajectory with single point calculations to calculate DFT energies and forces. This will ultimately produce the training set file that will be used by MACE to create the bespoke ML interatomic potential.
The folder SCF contains the following scripts and files:
- `sp.inp`: CP2K input file for a single point calculation. This must have the desired level of theory for the ML potential.
- `select_configs.py`: python script to select random configurations from a trajectory and set up the filesystem. It also contains a couple of functions to delete atoms or molecules above a certain height, and to modify the multiplicity of `sp.inp` according to the number of electrons.
- `orig_submit.sh`: SLURM script to launch an array of single point calculations
- `copy_submit.py`: python script to copy the original SLURM script to each folder, and adjust the array numbers
- `make_extended_mace.sh`: shell script to extract energy, coordinates, and forces from each selected configuration, and compile them in a single file ready for training.
The labelling must be done following these instructions:
- Put your `.xyz` trajectory file in the `SCF` folder.
- Adjust borders, number of configurations, and number of folders in the `select_configs.py` script. To run it, you have to (unless you already did) load the uenv and source the venv:
$ uenv start lammps/20251210:v2 --view=kokkos
$ source /path_to_your/venv-lammps-mace-cp2k/bin/activate
$ python select_configs.py
This will create a folder named sp_conf, which contains a number of folders equal to that specified in the script, each containing the same number of subfolders. Each subfolder, numbered progressively from 0000 to the total number, contains the .xyz file of the randomly extracted configuration and a copy of the sp.inp file.
An example of the directory tree, for 3 folders with 50 configurations each (150 configurations total) would be
sp_conf/
├── indices.npy
├── 0/
│ ├── 0000/
│ │ ├── geom.xyz
│ │ └── sp.inp
│ ├── ⋮
│ ├── 0049/
│ └── submit_0_00.sh
├── 1/
│ ├── 0050/
│ ├── ⋮
│ ├── 0099/
│ └── submit_1_00.sh
└── 2/
├── 0100/
├── ⋮
├── 0149/
└── submit_2_00.sh
- Adjust the number of folders and configurations in the `copy_submit.py` script, then run it with:
$ python copy_submit.py
- This will put a copy of the `orig_submit.sh` file in each of the folders of `sp_conf` (renamed as in the tree above, e.g. `submit_0_00.sh`), and adjust the array numbers accordingly. It also creates the `submit_all.sh` script. Run the `submit_all.sh` script with
$ sbatch submit_all.sh
to launch all the single points at once. Since the single points are all launched on different nodes, there should be no issue with crashes due to memory problems.
- When all the SCF calculations have been completed, adjust the number of folders in the `make_extended_mace.sh` script, and run it with
$ sbatch make_extended_mace.sh
to automatically extract energy, coordinates, and forces from each configuration, converting energies from Hartree to eV and forces from Hartree/Bohr to eV/Angstrom. The resulting output file can be used as a training set, or added to an existing training set.
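For reference, the unit conversions performed at this step correspond to the standard factors below. This is a minimal sketch; the exact constants hard-coded in `make_extended_mace.sh` may be rounded differently:

```python
# CODATA conversion factors (assumption: make_extended_mace.sh uses values
# equivalent to these, possibly with fewer digits).
HARTREE_TO_EV = 27.211386245988          # 1 Hartree in eV
BOHR_TO_ANGSTROM = 0.529177210903        # 1 Bohr in Angstrom
HARTREE_PER_BOHR_TO_EV_PER_ANG = HARTREE_TO_EV / BOHR_TO_ANGSTROM  # ~51.422

# Example: convert the CP2K total energy from the trajectory header above.
energy_ha = -1105.6736896936             # Hartree, as printed by CP2K
energy_ev = energy_ha * HARTREE_TO_EV    # eV, as expected by MACE

# Forces convert with the combined factor (Hartree/Bohr -> eV/Angstrom).
force_ev_ang = 0.01 * HARTREE_PER_BOHR_TO_EV_PER_ANG
```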
Warning: Make sure to adjust the cell in all the files that require it: `sp.inp`, `select_configs.py`, `make_extended_mace.sh`.
Once the make_extended_mace.sh script has constructed the training data file, the training can start.
If you run the script with the water example, the produced training file is named h2o_training.xyz.
Put this or your training set file in the TRAINING folder, where the run_mace.sh script is.
This script contains the parameters that control the MACE training:
fixed_args=(
--name="h2o"
--seed=881311935
--device='cuda'
--enable_cueq=True
--model='MACE'
--error_table='PerAtomRMSE'
--default_dtype='float32'
--num_workers=$OMP_NUM_THREADS
--r_max=6.0
--num_channels=256
--max_L=2
--train_file="h2o_training.xyz"
#--valid_file="val.xyz" # either change this if you manually split your file, or use --valid_fraction
#--test_file="test.xyz" # same as validation
--loss='weighted'
--energy_key='REF_energy'
--forces_key='REF_forces'
--E0s="average" # better if isolated atom energies are in the training file. In this case use "isolated"
--restart_latest
--distributed
--forces_weight=100.0
--energy_weight=1.0
--batch_size=1
--valid_batch_size=2
--lr=0.01
--max_num_epochs=100
--eval_interval=1
--swa
--swa_lr=0.001
--start_swa=75
)
Start the training with:
$ sbatch run_mace.sh
This particular setting will first run 75 epochs with a learning rate of 0.01, and then the last 25 epochs with a reduced learning rate of 0.001. During these last 25 epochs, the weight of the energy error in the loss is also increased. The training is therefore split into two "stages".
Once the training is finished, there will be different models created. In the example of water you will find:
- `h2o.model`
- `h2o_compiled.model`
- `h2o_stagetwo.model`
- `h2o_stagetwo_compiled.model`
And similarly for your training data and project name.
Convert the created `h2o_stagetwo.model` file (or your own `your_stagetwo.model`) to the LAMMPS format. To do this, exit any uenv/venv loaded on a login node, and then run these commands:
$ salloc -A <your_account> -C gpu -N 1 -t 00:05:00
$ srun --pty /bin/bash
$ uenv start --view kokkos lammps/20251210:v2
$ source /path_to_your/venv-lammps-mace-cp2k/bin/activate
$ python -m mace.cli.create_lammps_model your_stagetwo.model --format=mliap
This script will create a model your_stagetwo.model-mliap_lammps.pt that can be used now to run MD simulations with LAMMPS as described in section 1b).
The MD simulation that will run with this model will have the desired DFT level of theory selected at stage 2), but a notably reduced computational cost.
ML interatomic potentials are subject to errors that come mainly from two sources: insufficient sampling of configuration space and an inaccurate fit on the training data. Active learning tries to compensate for the first error. Here we implement active learning by query by committee of models, as described here.
The idea is to first train four different models on the same training set, as described in section 3), by only changing --name="h2o_i" for different i and setting --seed=881311935 to four different values. Then run a molecular dynamics simulation using one of them. Once the simulation is finished, the other three potentials evaluate the configurations that have been produced, i.e. the models calculate energy and forces starting from the atomic coordinates in the trajectory. Finally, a histogram of maximum force deviation is built, and a subset of the trajectory is sampled in a region of uncertainty. In other words, the new configurations will be those on which the four potentials "do not agree" in their force estimates. These configurations are then labelled again by DFT calculations as described in section 2), producing a new training set to further train and improve the quality of the ML potential.
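The committee deviation described above can be sketched as follows. This is an assumption of the quantity `std_max.py` reports (the actual script may reduce over atoms and components differently):

```python
import numpy as np

# Hypothetical sketch (not the repository's std_max.py): for one
# configuration, the largest per-atom standard deviation of the predicted
# forces across the committee of models.
def max_force_std(forces):
    """forces: array of shape (n_models, n_atoms, 3) for one configuration."""
    std = np.std(forces, axis=0)            # deviation per atom and component
    per_atom = np.linalg.norm(std, axis=1)  # one deviation magnitude per atom
    return per_atom.max()                   # worst-case atom in the frame

# Identical committee predictions give zero deviation.
agree = np.zeros((4, 192, 3))               # 4 models, 192 atoms (64 waters)
print(max_force_std(agree))                 # 0.0
```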
The ACTIVE_LEARNING folder contains the following scripts:
- `std_max.py` is the python script which calculates the maximum standard deviation of the forces, among all atoms for each configuration, and among all models.
- `run_stdev` is the SLURM script to run the `std_max.py` script
- `conf_selection.ipynb` is a Jupyter notebook template to analyze the maximum standard deviation histograms, and select a number of configurations based on standard deviation.
Insert the path to each model in the std_max.py script, along with the path to the trajectory (in .xyz format), then run
$ sbatch run_stdev
After this job is completed, you can run the conf_selection.ipynb notebook locally.
Create a folder and put there the trajectory .xyz and the file model_devi_summary.out that has been produced by std_max.py. Put the path of this folder (in this example named ACTIVE_LEARNING_DATA) in the notebook, in the third cell
folders = {
"my_unique_dataset": "/path_to_some/ACTIVE_LEARNING_DATA"
}
and use the notebook to analyze the force deviation histogram. This will create a histogram such as this one:

The histogram is divided into five regions by four vertical lines, corresponding to specific values of force deviation, as described here. The first region (from left to right) is discarded because it contains configurations on which the models agree well with each other; the last region is also discarded, as it contains configurations with high force deviation (usually denoting non-physical configurations, but check carefully, as this ultimately depends on the values on the x-axis).
The notebook will randomly sample a number of configurations (defined by the user in the notebook) from each of the three remaining intervals. These are collected separately in three different `.xyz` files (useful for inspection). The file containing all the new configurations is `geom_all.xyz`. This has to be used as a new "trajectory" to label as described in step 2).
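The selection step can be sketched like this. It is a hypothetical illustration of the notebook's logic; the function and variable names are not those used in `conf_selection.ipynb`:

```python
import random

# Hypothetical sketch: four thresholds split the deviation axis into five
# regions; the first and last are discarded, and n_per_region frame indices
# are drawn from each of the three middle regions.
def select_by_deviation(devs, thresholds, n_per_region, seed=0):
    """devs: one max-force-deviation value per configuration."""
    t1, t2, t3, t4 = sorted(thresholds)
    regions = [
        [i for i, d in enumerate(devs) if t1 <= d < t2],
        [i for i, d in enumerate(devs) if t2 <= d < t3],
        [i for i, d in enumerate(devs) if t3 <= d < t4],
    ]
    rng = random.Random(seed)  # fixed seed for reproducible selection
    return [rng.sample(r, min(n_per_region, len(r))) for r in regions]

# Frames 0 and 4 fall outside [0.03, 0.5) and are discarded.
picked = select_by_deviation(
    devs=[0.01, 0.05, 0.12, 0.25, 0.9],
    thresholds=(0.03, 0.1, 0.2, 0.5),
    n_per_region=1)
print(picked)  # [[1], [2], [3]]
```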
Finally, once this new set of configurations has been relabelled, the original model can be trained again. In particular, MACE supports multihead replay fine-tuning, so that the new model will not forget what it has already learned. To do so, simply modify the MACE training arguments as shown here:
fixed_args=(
--name "mace_model_finetuned"
--train_file "path_to_your/geom_all_relabelled.xyz"
--foundation_model "path_to_your/mace.model"
--pt_train_file "path_to_your_original_training_data.xyz"
--multiheads_finetuning True
--E0s="{1: -188.0294448423205, 8: -94.01472242116016}" # values relative to H and O for the water example
...
)
where `geom_all_relabelled.xyz` is the output of `make_extended_mace.sh` of section 2) on the newly labelled `geom_all.xyz` data, and `mace.model` is your MACE model trained previously in section 3). Note that we now have to specify the E0s values explicitly, since multihead fine-tuning does not support "average". These values can be found in the output of the first training, in a line such as
...
INFO: Isolated Atomic Energies (E0s) not in training file, using command line argument
INFO: Computing average Atomic Energies using least squares regression
INFO: Atomic Energies used (z: eV) for head Default: {1: -188.0294448423205, 8: -94.01472242116016}
...
More info about the multihead replay fine-tuning of MACE can be found here.
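That log line can also be extracted programmatically; below is a small hypothetical helper (not part of the repository) that pulls the E0s dictionary out of the training output so it can be pasted into `--E0s`:

```python
import ast
import re

# Hypothetical helper: extract the E0s dictionary for the Default head
# from a MACE training log.
def extract_e0s(log_text):
    m = re.search(
        r"Atomic Energies used \(z: eV\) for head Default: (\{[^}]*\})",
        log_text)
    return ast.literal_eval(m.group(1)) if m else None

# The example log line from the water training above.
log_line = ("INFO: Atomic Energies used (z: eV) for head Default: "
            "{1: -188.0294448423205, 8: -94.01472242116016}")
e0s = extract_e0s(log_line)
print(e0s)  # {1: -188.0294448423205, 8: -94.01472242116016}
```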
The fine-tuning procedure can be repeated four times to create a committee of four fine-tuned models, from which another histogram of maximum force deviation can be calculated. In case of successful active learning, the new histogram should shrink and move to the left (smaller values of deviation) with respect to the first one.
This corresponds to a first cycle of active learning. The user may repeat further active learning cycles, for example including specific ill-behaving dynamics, depending on their needs.