GC-GNN: Interpretable Group-Contribution Graph Neural Networks

Interpretable GNN-based molecular property prediction via group-contribution fragmentation.

Overview

GC-GNN is a framework for interpretable molecular property prediction that combines the classical group-contribution (GC) concept with graph neural networks (GNNs). Rather than treating a molecule as a single atom-level graph, GC-GNN encodes it as a multi-modal hierarchical representation spanning three levels of chemical abstraction:

L1 — Full atom graph: AttentiveFP over all atoms and bonds of the molecule
L2 — Fragment subgraphs: AttentiveFP within each functional group (intra-fragment connectivity)
L3 — Junction tree: AttentiveFP over the fragment-level graph where nodes are functional groups and edges represent inter-fragment bonds

Together, L2–L3 form a hypergraph-structured encoding of the molecule — capturing atomic detail, intra-group chemistry, and inter-group topology simultaneously. The three representations are concatenated and passed through an MLP to produce property predictions.
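To make the relationship between the levels concrete, the sketch below partitions the bonds of a molecule into intra-fragment edges (the input to L2) and fragment-level edges (the input to L3). It is an illustration only: the function name `split_levels`, the hand-written atom-to-fragment mapping, and the fragment labels are assumptions for this example and are not taken from the GC-GNN code base, which derives fragments from the molecule itself.

```python
from collections import defaultdict


def split_levels(bonds, atom_to_fragment):
    """Partition atom-level bonds into L2 (intra-fragment) and L3 (inter-fragment) edges.

    bonds            -- list of (atom_i, atom_j) index pairs from the full atom graph (L1)
    atom_to_fragment -- dict mapping each atom index to a fragment label (hypothetical)
    """
    intra = defaultdict(list)  # fragment label -> bonds inside that fragment (L2)
    inter = set()              # pairs of fragment labels joined by a bond (L3)
    for a, b in bonds:
        fa, fb = atom_to_fragment[a], atom_to_fragment[b]
        if fa == fb:
            intra[fa].append((a, b))
        else:
            # Store each fragment pair once, in a canonical order.
            inter.add((min(fa, fb), max(fa, fb)))
    return dict(intra), sorted(inter)


# Acetic acid, CC(=O)O. Heavy atoms: 0 = methyl C, 1 = carboxyl C, 2 = =O, 3 = OH oxygen.
bonds = [(0, 1), (1, 2), (1, 3)]
# Hand-assigned fragmentation, for illustration only.
atom_to_fragment = {0: "CH3", 1: "COOH", 2: "COOH", 3: "COOH"}

intra_edges, junction_edges = split_levels(bonds, atom_to_fragment)
print("L2 intra-fragment bonds:", intra_edges)
print("L3 junction-tree edges:", junction_edges)
```

In this toy case the methyl fragment has a single heavy atom and therefore no internal bonds, while the single bond between atoms 0 and 1 becomes the only edge of the fragment-level graph.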

Two model architectures are provided:

GroupGAT (Group Graph ATtention) — the full three-level model described above
AGC (Attentive Group-Contribution) — a two-level variant (L2 + L3 only) focused on group-contribution-style predictions

Interpretability

A key feature of GC-GNN is that attention operates at the fragment level rather than the atom level. Because L3 runs AttentiveFP over the junction tree — where each node is a functional group — the model produces attention weights over chemically meaningful substructures (e.g. hydroxyl, carbonyl, aromatic ring) rather than individual atoms. This makes predictions directly interpretable in the language of group-contribution chemistry: which functional groups drove the model's estimate?

Fragment attention can be extracted and visualised for any molecule after training. See notebooks/03_inference.ipynb for a worked example.
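As a rough illustration of how fragment-level attention can be read, the sketch below turns raw per-fragment scores into normalised weights and ranks the fragments. It assumes the scores are softmax-normalised over the fragments of one molecule; the fragment names, the score values, and the helper `rank_fragments` are made up for this example and do not reflect the repository's extraction API (see the notebook for that).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_fragments(names, scores):
    """Return (fragment, weight) pairs sorted from most to least attended."""
    weights = softmax(scores)
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)


# Hypothetical fragmentation of ethanol (CCO) with made-up raw attention scores.
fragment_names = ["CH3", "CH2", "OH"]
raw_scores = [0.2, 0.1, 1.5]

ranking = rank_fragments(fragment_names, raw_scores)
for name, weight in ranking:
    print(f"{name}: {weight:.2f}")
```

Because the weights sum to one per molecule, they can be compared across fragments of the same molecule, but not directly across molecules with different numbers of fragments.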

Installation

1. Create environment

conda create -n gnn python=3.11 -y && conda activate gnn

2. Install RDKit

pip install rdkit

3. Install PyTorch 2.8

pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

The cu126 suffix targets CUDA 12.6. If your driver differs, find the correct command at pytorch.org/get-started.

4. Install PyTorch Geometric 2.7

pip install torch-geometric==2.7.0

5. Install GC-GNN

pip install -e ".[dev]"

Usage

Quick Start

# Train
python scripts/train.py configs/example/groupgat_esol.yaml

# Evaluate on train/val/test splits
python scripts/evaluate.py checkpoints/example/groupgat_esol_example/

# Predict on new molecules
python scripts/predict.py checkpoints/example/groupgat_esol_example/ --smiles "CCO" "c1ccccc1" "CC(=O)O"

# Predict from a CSV file
python scripts/predict.py checkpoints/example/groupgat_esol_example/ --input molecules.csv --smiles-col SMILES --output predictions.csv

Diagnostic Plots

Generate parity plots, residual plots, and loss curves:

python scripts/plot.py checkpoints/example/groupgat_esol_example/
python scripts/plot.py checkpoints/example/groupgat_esol_example/ --splits test --unit "log(mol/L)"

Plots are saved to checkpoints/example/groupgat_esol_example/plots/.

Multi-Task Learning

Train a single model on multiple targets simultaneously. Targets with missing values (NaN) are automatically masked in the loss:

data:
  target_col: ["y1", "y2", "y3"]
  task_weights: [1.0, 1.0, 1.0]   # optional per-task loss weights
  log_transform: true              # apply log(1+y) to targets (useful for skewed distributions)
model:
  multihead: true                  # separate MLP head per task, shared encoder
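The masking and weighting described above can be sketched in plain Python as follows. This is a minimal illustration, not the repository's loss: it assumes a squared-error loss averaged per task over the non-missing targets and then combined with `task_weights`; the function name `masked_weighted_mse` is hypothetical. The last lines show the log(1+y) transform and its inverse using the standard library.

```python
import math


def masked_weighted_mse(preds, targets, task_weights):
    """Weighted sum of per-task MSEs, skipping targets that are NaN.

    preds, targets -- lists of rows, one value per task in each row
    task_weights   -- one weight per task
    """
    total = 0.0
    for t, weight in enumerate(task_weights):
        # Keep only the samples whose target for task t is present.
        errors = [
            (p[t] - y[t]) ** 2
            for p, y in zip(preds, targets)
            if not math.isnan(y[t])
        ]
        if errors:  # a task with no labels at all contributes nothing
            total += weight * sum(errors) / len(errors)
    return total


nan = float("nan")
preds = [[1.0, 2.0], [3.0, 4.0]]
targets = [[1.0, nan], [2.0, 4.0]]  # first sample has no label for task 2
loss = masked_weighted_mse(preds, targets, task_weights=[1.0, 0.5])
print(loss)

# log(1+y) transform and its inverse, as used for skewed targets.
y = 3.0
y_transformed = math.log1p(y)
y_restored = math.expm1(y_transformed)
```

Predictions made on the transformed scale need the inverse transform before they are compared with raw target values.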

Transfer Learning

Fine-tune a pre-trained model on a new dataset:

# Fine-tune all layers
python scripts/train.py configs/example/my_experiment.yaml --pretrained checkpoints/source_model/best.pt

# Fine-tune prediction head only (freeze encoder)
python scripts/train.py configs/example/my_experiment.yaml --pretrained checkpoints/source_model/best.pt --freeze-encoder

Hyperparameter Optimisation

python scripts/hpo.py configs/hpo/groupgat_hpo.yaml

Results are saved to checkpoints/hpo_results/<study_name>/results.json.

Retrain with best parameters using:

python scripts/train.py configs/example/my_experiment.yaml --from-hpo checkpoints/hpo_results/my_experiment/results.json

Data

Data Format

Input data must be a CSV with at least a SMILES column and a target column:

SMILES,Target
CCO,-0.31
c1ccccc1,2.13
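Before training, it can help to check that a file matches this layout. The sketch below uses only the standard library; the function name `load_smiles_targets` and the choice to read empty targets as NaN are assumptions for this example, not part of the GC-GNN code.

```python
import csv
import io


def load_smiles_targets(handle, smiles_col="SMILES", target_col="Target"):
    """Read (smiles, target) pairs from a CSV handle, checking the required columns."""
    reader = csv.DictReader(handle)
    missing = {smiles_col, target_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        raw = (row[target_col] or "").strip()
        # Assumption: an empty target cell is treated as a missing value (NaN).
        target = float(raw) if raw else float("nan")
        rows.append(((row[smiles_col] or "").strip(), target))
    return rows


# In-memory stand-in for a file such as data/my_property.csv.
sample = io.StringIO("SMILES,Target\nCCO,-0.31\nc1ccccc1,2.13\n")
rows = load_smiles_targets(sample)
print(rows)
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.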

Adding a Custom Dataset

1. Prepare a CSV with SMILES and target columns.
2. Create a config (see configs/example/groupgat_esol.yaml as a template).
3. Provide a split file or let the code generate one:

data:
  csv_path: data/my_property.csv
  smiles_col: SMILES
  target_col: Target
  split:
    val_frac: 0.1
    test_frac: 0.1
    seed: 42
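What the `val_frac`, `test_frac`, and `seed` keys imply can be sketched as a seeded random split. This is only one plausible reading: the repository may use a different strategy (for example a scaffold-based split), and the function `random_split` below is a hypothetical helper, not its API.

```python
import random


def random_split(n, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into train/val/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, so global state is untouched
    n_val = int(round(n * val_frac))
    n_test = int(round(n * test_frac))
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train, val, test = random_split(100)
print(len(train), len(val), len(test))
```

Fixing the seed makes the split reproducible across runs, which matters when comparing models trained from different configs on the same data.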

Project Structure

gc-gnn/
├── configs/            # YAML experiment configs
│   ├── example/        # ESOL worked example
│   └── hpo/            # HPO configs
├── data/
│   └── properties/     # CSV datasets and split files
├── scripts/
│   ├── train.py        # Training
│   ├── evaluate.py     # Evaluation on splits
│   ├── predict.py      # Inference
│   ├── plot.py         # Diagnostic plots
│   ├── hpo.py          # Hyperparameter optimisation
│   ├── embed.py        # Molecule embedding extraction
│   └── merge_datasets.py # Combine datasets from multiple sources
├── src/
│   ├── chem/           # Featurizers, fragmentation, junction tree
│   ├── data/           # Dataset, datamodule, splits
│   ├── models/         # GroupGAT, AGC, AFP
│   ├── training/       # Lightning module, pipeline, inference
│   └── config.py       # Config dataclasses
└── checkpoints/        # Saved model checkpoints

Case Studies

Pre-trained checkpoints, datasets, and full reproduction results for the original GroupGAT papers are maintained in a separate repository:

gc-gnn-papers — benchmarks for Aouichaoui et al. CACE 2023 and JCIM 2023.

Citation

If you use this code or the pre-trained models in your research, please cite the relevant papers:

CACE 2023

@article{aouichaoui2023application,
  title={Application of interpretable group-embedded graph neural networks for pure compound properties},
  author={Aouichaoui, Adem R. N. and Fan, Fan and Mansouri, Seyed S. and Abildskov, Jens and Sin, G{\"u}rkan},
  journal={Computers \& Chemical Engineering},
  volume={176},
  pages={108291},
  year={2023},
  publisher={Elsevier}
}

JCIM 2023

@article{aouichaoui2023groupgat,
  title={Combining Group-Contribution Concept and Graph Neural Networks Toward Interpretable Molecular Property Models},
  author={Aouichaoui, Adem R. N. and Fan, Fan and Abildskov, Jens and Sin, G{\"u}rkan},
  journal={Journal of Chemical Information and Modeling},
  volume={63},
  number={3},
  pages={725--744},
  year={2023},
  publisher={ACS Publications}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

We would like to thank the following master's students for their contributions to this project:

Fan Fan — Technical University of Denmark
Paul Seghers — Technical University of Denmark

Contact

For questions, issues, or collaborations: Adem R. N. Aouichaoui
Section of Autonomous Materials Discovery (AMD)
Department of Energy Conversion and Storage
Technical University of Denmark, DTU

arnaou@dtu.dk | ORCID
