arnaou/gc-gnn

GC-GNN: Interpretable Group-Contribution Graph Neural Networks

License: MIT | Python 3.11 | PyTorch

Interpretable GNN-based molecular property prediction via group-contribution fragmentation.

Overview

GC-GNN is a framework for interpretable molecular property prediction that combines the classical group-contribution (GC) concept with graph neural networks (GNNs). Rather than treating a molecule as a single atom-level graph, GC-GNN encodes it as a multi-modal hierarchical representation spanning three levels of chemical abstraction:

  • L1 — Full atom graph: AttentiveFP over all atoms and bonds of the molecule
  • L2 — Fragment subgraphs: AttentiveFP within each functional group (intra-fragment connectivity)
  • L3 — Junction tree: AttentiveFP over the fragment-level graph where nodes are functional groups and edges represent inter-fragment bonds

Together, L1–L3 form a hypergraph-structured encoding of the molecule that captures atomic detail, intra-group chemistry, and inter-group topology simultaneously. The three representations are concatenated and passed through an MLP to produce property predictions.
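To make the three levels concrete, the sketch below derives them from a toy bond list using plain Python data structures. The atom-to-fragment assignment is written by hand and the function name `build_levels` is hypothetical; the repository's own fragmentation and junction-tree code (under src/chem/) may work quite differently.

```python
from collections import defaultdict


def build_levels(bonds, atom_to_fragment):
    """Split an atom-level bond list into the three levels described above.

    bonds: list of (atom_i, atom_j) index pairs (the L1 atom graph).
    atom_to_fragment: dict mapping atom index -> fragment id (assumed given).
    """
    l1_edges = sorted(tuple(sorted(b)) for b in bonds)
    l2_edges = defaultdict(list)   # fragment id -> bonds inside that fragment
    l3_edges = set()               # fragment-level (junction tree) edges

    for i, j in l1_edges:
        fi, fj = atom_to_fragment[i], atom_to_fragment[j]
        if fi == fj:
            l2_edges[fi].append((i, j))             # intra-fragment bond
        else:
            l3_edges.add(tuple(sorted((fi, fj))))   # inter-fragment bond

    return l1_edges, dict(l2_edges), sorted(l3_edges)


# Toy example: heavy atoms of acetic acid, CC(=O)O -> atoms 0..3.
# Hand-made assignment: atom 0 is a "CH3" fragment, atoms 1-3 a "COOH" fragment.
bonds = [(0, 1), (1, 2), (1, 3)]
atom_to_fragment = {0: "CH3", 1: "COOH", 2: "COOH", 3: "COOH"}

l1, l2, l3 = build_levels(bonds, atom_to_fragment)
print(l1)  # every bond in the molecule
print(l2)  # bonds grouped by the fragment that contains them
print(l3)  # edges between fragments
```

A single-atom fragment such as "CH3" has no internal bonds, so it appears only as a node in the fragment-level graph.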

Two model architectures are provided:

  • GroupGAT (Group Graph ATtention): the full three-level model described above
    [Figure: GroupGAT architecture]

  • AGC (Attentive Group-Contribution): a two-level variant (L2 + L3 only) focused on group-contribution-style predictions
    [Figure: AGC]

Interpretability

A key feature of GC-GNN is that attention operates at the fragment level rather than the atom level. Because L3 runs AttentiveFP over the junction tree — where each node is a functional group — the model produces attention weights over chemically meaningful substructures (e.g. hydroxyl, carbonyl, aromatic ring) rather than individual atoms. This makes predictions directly interpretable in the language of group-contribution chemistry: which functional groups drove the model's estimate?

Fragment attention can be extracted and visualised for any molecule after training. See notebooks/03_inference.ipynb for a worked example.
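As a rough illustration of how fragment-level attention can be read, the sketch below turns raw per-fragment scores into normalised weights and ranks them. It assumes softmax normalisation, which is the common choice for attention; the scores are invented for the example and `rank_fragments` is a hypothetical helper, not part of the repository's API.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_fragments(fragment_scores):
    """Return (fragment, weight) pairs sorted by descending attention weight."""
    names = list(fragment_scores)
    weights = softmax([fragment_scores[n] for n in names])
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)


# Hypothetical raw scores for three fragments of one molecule.
scores = {"hydroxyl": 2.0, "methyl": 0.5, "aromatic ring": 1.0}
ranking = rank_fragments(scores)
for name, weight in ranking:
    print(f"{name:15s} {weight:.3f}")
```

The weights sum to one, so each value can be read as the share of attention assigned to that functional group.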

[Figure: attention example]

Installation

1. Create environment

conda create -n gnn python=3.11 -y && conda activate gnn

2. Install RDKit

pip install rdkit

3. Install PyTorch 2.8

pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

The cu126 suffix targets CUDA 12.6. If your driver differs, find the correct command at pytorch.org/get-started.

4. Install PyTorch Geometric 2.7

pip install torch-geometric==2.7.0

5. Install GC-GNN

pip install -e ".[dev]"
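After installation, a quick sanity check can confirm that the main dependencies are visible to the interpreter. The sketch below only tests whether the modules can be located (it does not import them or check versions); `check_dependencies` is a hypothetical helper written for this example.

```python
import importlib.util


def check_dependencies(modules=("rdkit", "torch", "torch_geometric")):
    """Map each top-level module name to True if it is on the import path."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_dependencies()
for name, found in status.items():
    print(f"{name:16s} {'ok' if found else 'MISSING'}")
```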

Usage

Quick Start

# Train
python scripts/train.py configs/example/groupgat_esol.yaml

# Evaluate on train/val/test splits
python scripts/evaluate.py checkpoints/example/groupgat_esol_example/

# Predict on new molecules
python scripts/predict.py checkpoints/example/groupgat_esol_example/ --smiles "CCO" "c1ccccc1" "CC(=O)O"

# Predict from a CSV file
python scripts/predict.py checkpoints/example/groupgat_esol_example/ --input molecules.csv --smiles-col SMILES --output predictions.csv

Diagnostic Plots

Generate parity plots, residual plots, and loss curves:

python scripts/plot.py checkpoints/example/groupgat_esol_example/
python scripts/plot.py checkpoints/example/groupgat_esol_example/ --splits test --unit "log(mol/L)"

Plots are saved to checkpoints/example/groupgat_esol_example/plots/.

Multi-Task Learning

Train a single model on multiple targets simultaneously. Targets with missing values (NaN) are automatically masked in the loss:

data:
  target_col: ["y1", "y2", "y3"]
  task_weights: [1.0, 1.0, 1.0]   # optional per-task loss weights
  log_transform: true              # apply log(1+y) to targets (useful for skewed distributions)
model:
  multihead: true                  # separate MLP head per task, shared encoder
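The masking idea can be sketched in plain Python: squared errors are accumulated per task only where a label exists, then combined with the task weights. This is an illustration under assumptions (mean squared error, per-task averaging); the loss actually used by the training code may be normalised differently.

```python
import math


def masked_weighted_mse(preds, targets, task_weights):
    """Weighted MSE over tasks, skipping NaN targets.

    preds, targets: list of rows, one float per task in each row.
    task_weights: one weight per task.
    """
    total = 0.0
    for t, w in enumerate(task_weights):
        errors = [
            (p[t] - y[t]) ** 2
            for p, y in zip(preds, targets)
            if not math.isnan(y[t])   # masked: missing labels are ignored
        ]
        if errors:  # a task with no labels in the batch contributes nothing
            total += w * sum(errors) / len(errors)
    return total


nan = float("nan")
targets = [[1.0, nan, 3.0], [2.0, 0.5, nan]]
preds = [[1.5, 9.9, 3.0], [2.0, 1.5, 9.9]]
loss = masked_weighted_mse(preds, targets, task_weights=[1.0, 1.0, 1.0])
print(loss)  # 1.125
```

The predictions of 9.9 sit at positions whose targets are NaN, so they have no effect on the result.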

Transfer Learning

Fine-tune a pre-trained model on a new dataset:

# Fine-tune all layers
python scripts/train.py configs/example/my_experiment.yaml --pretrained checkpoints/source_model/best.pt

# Fine-tune prediction head only (freeze encoder)
python scripts/train.py configs/example/my_experiment.yaml --pretrained checkpoints/source_model/best.pt --freeze-encoder

Hyperparameter Optimisation

python scripts/hpo.py configs/hpo/groupgat_hpo.yaml

Results are saved to checkpoints/hpo_results/<study_name>/results.json.

Retrain with best parameters using:

python scripts/train.py configs/example/my_experiment.yaml --from-hpo checkpoints/hpo_results/my_experiment/results.json

Data

Data Format

Input data must be a CSV with at least a SMILES column and a target column:

SMILES,Target
CCO,-0.31
c1ccccc1,2.13
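Before training, it can help to verify that a file matches this layout. The following sketch uses only the standard library; `load_rows` is a hypothetical helper, and treating empty target cells as NaN is an assumption made for the example rather than documented behaviour of the repository.

```python
import csv
import io
import math


def load_rows(handle, smiles_col="SMILES", target_col="Target"):
    """Read (smiles, target) pairs; empty target cells become NaN."""
    reader = csv.DictReader(handle)
    missing = {smiles_col, target_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for record in reader:
        raw = (record[target_col] or "").strip()
        rows.append((record[smiles_col].strip(), float(raw) if raw else math.nan))
    return rows


sample = "SMILES,Target\nCCO,-0.31\nc1ccccc1,2.13\n"
rows = load_rows(io.StringIO(sample))
print(rows)
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.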

Adding a Custom Dataset

  1. Prepare a CSV with SMILES and target columns
  2. Create a config (see configs/example/groupgat_esol.yaml as template)
  3. Provide a split file or let the code generate one:

data:
  csv_path: data/my_property.csv
  smiles_col: SMILES
  target_col: Target
  split:
    val_frac: 0.1
    test_frac: 0.1
    seed: 42
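To show how `val_frac`, `test_frac`, and `seed` could interact, here is a minimal sketch of a seeded random split. It assumes a plain random shuffle; the repository may generate splits by another strategy, and `random_split` is a hypothetical name.

```python
import random


def random_split(n_samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle indices with a fixed seed and cut them into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(val_frac * n_samples))
    n_test = int(round(test_frac * n_samples))
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train, val, test = random_split(100)
print(len(train), len(val), len(test))  # 80 10 10
```

Because the shuffle uses its own seeded generator, calling the function again with the same arguments returns the same split.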

Project Structure

gc-gnn/
├── configs/            # YAML experiment configs
│   ├── example/        # ESOL worked example
│   └── hpo/            # HPO configs
├── data/
│   └── properties/     # CSV datasets and split files
├── scripts/
│   ├── train.py        # Training
│   ├── evaluate.py     # Evaluation on splits
│   ├── predict.py      # Inference
│   ├── plot.py         # Diagnostic plots
│   ├── hpo.py          # Hyperparameter optimisation
│   ├── embed.py        # Molecule embedding extraction
│   └── merge_datasets.py # Combine datasets from multiple sources
├── src/
│   ├── chem/           # Featurizers, fragmentation, junction tree
│   ├── data/           # Dataset, datamodule, splits
│   ├── models/         # GroupGAT, AGC, AFP
│   ├── training/       # Lightning module, pipeline, inference
│   └── config.py       # Config dataclasses
└── checkpoints/        # Saved model checkpoints

Case Studies

Pre-trained checkpoints, datasets, and full reproduction results for the original GroupGAT papers are maintained in a separate repository:

gc-gnn-papers — benchmarks for Aouichaoui et al. CACE 2023 and JCIM 2023.

Citation

If you use this code or the pre-trained models in your research, please cite the relevant papers:

CACE 2023

@article{aouichaoui2023application,
  title={Application of interpretable group-embedded graph neural networks for pure compound properties},
  author={Aouichaoui, Adem R. N. and Fan, Fan and Mansouri, Seyed S. and Abildskov, Jens and Sin, G{\"u}rkan},
  journal={Computers \& Chemical Engineering},
  volume={176},
  pages={108291},
  year={2023},
  publisher={Elsevier}
}

JCIM 2023

@article{aouichaoui2023groupgat,
  title={Combining Group-Contribution Concept and Graph Neural Networks Toward Interpretable Molecular Property Models},
  author={Aouichaoui, Adem R. N. and Fan, Fan and Abildskov, Jens and Sin, G{\"u}rkan},
  journal={Journal of Chemical Information and Modeling},
  volume={63},
  number={3},
  pages={725--744},
  year={2023},
  publisher={ACS Publications}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

We thank the following master's students for their contributions to this project:

  • Fan Fan — Technical University of Denmark
  • Paul Seghers — Technical University of Denmark

Contact

For questions, issues, or collaborations:

Adem R. N. Aouichaoui
Section of Autonomous Materials Discovery (AMD)
Department of Energy Conversion and Storage
Technical University of Denmark, DTU

arnaou@dtu.dk | ORCID
