Interpretable GNN-based molecular property prediction via group-contribution fragmentation.
GC-GNN is a framework for interpretable molecular property prediction that combines the classical group-contribution (GC) concept with graph neural networks (GNNs). Rather than treating a molecule as a single atom-level graph, GC-GNN encodes it as a multi-modal hierarchical representation spanning three levels of chemical abstraction:
- L1 — Full atom graph: AttentiveFP over all atoms and bonds of the molecule
- L2 — Fragment subgraphs: AttentiveFP within each functional group (intra-fragment connectivity)
- L3 — Junction tree: AttentiveFP over the fragment-level graph where nodes are functional groups and edges represent inter-fragment bonds
Together, L1–L3 form a hypergraph-structured encoding of the molecule that captures atomic detail, intra-group chemistry, and inter-group topology simultaneously. The three representations are concatenated and passed through an MLP to produce property predictions.
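To make the three levels concrete, here is a minimal sketch in plain Python that derives the L2 and L3 structures from an atom graph plus a fragment assignment. The molecule (acetic acid), the fragment names, and the `build_hierarchy` helper are illustrative assumptions written by hand; they are not the output of GC-GNN's own fragmentation code, which may group atoms differently.

```python
# Hand-written toy example: acetic acid, SMILES "CC(=O)O".
# Atom indices follow SMILES order. The fragment assignment below is an
# illustrative assumption, not the result of GC-GNN's fragmentation scheme.
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1), (1, 2), (1, 3)]                      # L1: full atom graph
fragments = {"methyl": {0}, "carboxyl": {1, 2, 3}}    # atom indices per group


def build_hierarchy(bonds, fragments):
    """Split atom-level bonds into intra-fragment (L2) and inter-fragment (L3) edges."""
    # Map every atom index to the fragment that owns it.
    owner = {atom: name for name, members in fragments.items() for atom in members}
    intra = {name: [] for name in fragments}   # L2: bonds inside each fragment
    junction = set()                           # L3: edges between fragments
    for a, b in bonds:
        if owner[a] == owner[b]:
            intra[owner[a]].append((a, b))
        else:
            junction.add(tuple(sorted((owner[a], owner[b]))))
    return intra, sorted(junction)


intra_bonds, junction_edges = build_hierarchy(bonds, fragments)
print(intra_bonds)      # bonds grouped per fragment (L2)
print(junction_edges)   # fragment-to-fragment edges (L3)
```

In this sketch a bond belongs to L2 when both of its atoms sit in the same fragment, and otherwise contributes an edge to the L3 fragment-level graph.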
Two model architectures are provided:
- GroupGAT (Group Graph ATtention): the full three-level model described above
- AGC (Attentive Group-Contribution): a two-level variant (L2 + L3 only) focused on group-contribution-style predictions

A key feature of GC-GNN is that attention operates at the fragment level rather than the atom level. Because L3 runs AttentiveFP over the junction tree — where each node is a functional group — the model produces attention weights over chemically meaningful substructures (e.g. hydroxyl, carbonyl, aromatic ring) rather than individual atoms. This makes predictions directly interpretable in the language of group-contribution chemistry: which functional groups drove the model's estimate?
Fragment attention can be extracted and visualised for any molecule after training. See notebooks/03_inference.ipynb for a worked example.
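Once per-fragment attention values are available, reading them in group-contribution terms mostly means ranking the fragments. The sketch below assumes the extracted values are unnormalised scores keyed by fragment name; the `rank_fragments` helper and the example numbers are hypothetical and are not part of the GC-GNN API. The notebook referenced above shows the project's actual extraction workflow.

```python
import math


def rank_fragments(raw_scores):
    """Softmax-normalise raw scores and return (fragment, weight) pairs, largest first."""
    peak = max(raw_scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - peak) for name, score in raw_scores.items()}
    total = sum(exps.values())
    weights = {name: value / total for name, value in exps.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for three fragments, for illustration only (not model output).
example_scores = {"hydroxyl": 2.0, "methyl": 0.5, "methylene": 1.0}
ranking = rank_fragments(example_scores)
for name, weight in ranking:
    print(f"{name:10s} {weight:.3f}")
```

The normalised weights sum to one, so each value can be read as the share of attention assigned to that functional group.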
1. Create environment
conda create -n gnn python=3.11 -y && conda activate gnn

2. Install RDKit
pip install rdkit

3. Install PyTorch 2.8
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

The cu126 suffix targets CUDA 12.6. If your driver differs, find the correct command at pytorch.org/get-started.

4. Install PyTorch Geometric 2.7
pip install torch-geometric==2.7.0

5. Install GC-GNN
pip install -e ".[dev]"

# Train
python scripts/train.py configs/example/groupgat_esol.yaml
# Evaluate on train/val/test splits
python scripts/evaluate.py checkpoints/example/groupgat_esol_example/
# Predict on new molecules
python scripts/predict.py checkpoints/example/groupgat_esol_example/ --smiles "CCO" "c1ccccc1" "CC(=O)O"
# Predict from a CSV file
python scripts/predict.py checkpoints/example/groupgat_esol_example/ --input molecules.csv --smiles-col SMILES --output predictions.csv

Generate parity plots, residual plots, and loss curves:
python scripts/plot.py checkpoints/example/groupgat_esol_example/
python scripts/plot.py checkpoints/example/groupgat_esol_example/ --splits test --unit "log(mol/L)"

Plots are saved to checkpoints/example/groupgat_esol_example/plots/.
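A parity plot compares predicted against reference values, and a residual plot shows their difference. The sketch below computes those quantities together with common summary statistics for a list of value pairs. The `parity_metrics` helper and the numbers are illustrative assumptions; which metrics scripts/plot.py and scripts/evaluate.py actually report is not specified here.

```python
import math


def parity_metrics(y_true, y_pred):
    """Return residuals plus MAE, RMSE and R^2 for paired reference/predicted values."""
    residuals = [pred - true for true, pred in zip(y_true, y_pred)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)  # total variance of the reference
    r2 = 1.0 - sum(r * r for r in residuals) / ss_tot
    return {"residuals": residuals, "mae": mae, "rmse": rmse, "r2": r2}


# Illustrative numbers only, not model output.
metrics = parity_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

Points on the parity diagonal correspond to zero residuals, so a systematic trend in the residuals is a quick sign of bias in a particular value range.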
Train a single model on multiple targets simultaneously. Targets with missing values (NaN) are automatically masked in the loss:
data:
  target_col: ["y1", "y2", "y3"]
  task_weights: [1.0, 1.0, 1.0]  # optional per-task loss weights
  log_transform: true            # apply log(1+y) to targets (useful for skewed distributions)
model:
  multihead: true                # separate MLP head per task, shared encoder

Fine-tune a pre-trained model on a new dataset:
# Fine-tune all layers
python scripts/train.py configs/example/my_experiment.yaml --pretrained checkpoints/source_model/best.pt
# Fine-tune prediction head only (freeze encoder)
python scripts/train.py configs/example/my_experiment.yaml --pretrained checkpoints/source_model/best.pt --freeze-encoder

Run hyperparameter optimisation:
python scripts/hpo.py configs/hpo/groupgat_hpo.yaml

Results are saved to checkpoints/hpo_results/<study_name>/results.json.
Retrain with best parameters using:
python scripts/train.py configs/example/my_experiment.yaml --from-hpo checkpoints/hpo_results/my_experiment/results.json

Input data must be a CSV with at least a SMILES column and a target column:
SMILES,Target
CCO,-0.31
c1ccccc1,2.13

- Prepare a CSV with SMILES and target columns
- Create a config (see configs/example/groupgat_esol.yaml as template)
- Provide a split file or let the code generate one:
data:
  csv_path: data/my_property.csv
  smiles_col: SMILES
  target_col: Target
  split:
    val_frac: 0.1
    test_frac: 0.1
    seed: 42

gc-gnn/
├── configs/ # YAML experiment configs
│ ├── example/ # ESOL worked example
│ └── hpo/ # HPO configs
├── data/
│ └── properties/ # CSV datasets and split files
├── scripts/
│ ├── train.py # Training
│ ├── evaluate.py # Evaluation on splits
│ ├── predict.py # Inference
│ ├── plot.py # Diagnostic plots
│ ├── hpo.py # Hyperparameter optimisation
│ ├── embed.py # Molecule embedding extraction
│ └── merge_datasets.py # Combine datasets from multiple sources
├── src/
│ ├── chem/ # Featurizers, fragmentation, junction tree
│ ├── data/ # Dataset, datamodule, splits
│ ├── models/ # GroupGAT, AGC, AFP
│ ├── training/ # Lightning module, pipeline, inference
│ └── config.py # Config dataclasses
└── checkpoints/ # Saved model checkpoints
Pre-trained checkpoints, datasets, and full reproduction results for the original GroupGAT papers are maintained in a separate repository:
gc-gnn-papers — benchmarks for Aouichaoui et al. CACE 2023 and JCIM 2023.
If you use this code or the pre-trained models in your research, please cite the relevant papers:
CACE 2023
@article{aouichaoui2023application,
title={Application of interpretable group-embedded graph neural networks for pure compound properties},
author={Aouichaoui, Adem R. N. and Fan, Fan and Mansouri, Seyed S. and Abildskov, Jens and Sin, G{\"u}rkan},
journal={Computers \& Chemical Engineering},
volume={176},
pages={108291},
year={2023},
publisher={Elsevier}
}

JCIM 2023
@article{aouichaoui2023groupgat,
title={Combining Group-Contribution Concept and Graph Neural Networks Toward Interpretable Molecular Property Models},
author={Aouichaoui, Adem R. N. and Fan, Fan and Abildskov, Jens and Sin, G{\"u}rkan},
journal={Journal of Chemical Information and Modeling},
volume={63},
number={3},
pages={725--744},
year={2023},
publisher={ACS Publications}
}

This project is licensed under the MIT License - see the LICENSE file for details.
We would like to thank the following master's students for their contributions to this project:
- Fan Fan — Technical University of Denmark
- Paul Seghers — Technical University of Denmark
For questions, issues, or collaborations:
Adem R. N. Aouichaoui
Section of Autonomous Materials Discovery (AMD)
Department of Energy Conversion and Storage
Technical University of Denmark, DTU