MechGraph is a neuro-symbolic architecture designed to predict complex organic reaction mechanisms. It bridges the gap between molecular graph representations and large language models (LLMs) to generate step-by-step mechanistic explanations.
Standard LLMs trained on text representations of molecules (like SMILES) often fail to grasp 2D/3D topological information crucial for understanding reactivity. MechGraph addresses this by using a Graph Neural Network (GNN) to encode molecular structures and a trainable Projector to map these graph embeddings directly into the LLM's token embedding space.
- Multimodal Input: Processes 2D molecular graphs (via PyTorch Geometric) and text prompts simultaneously
- GIN Encoder: Graph Isomorphism Network extracts topology-aware molecular features
- Modality Projector: Learnable adapter aligns graph features with LLM token dimensions
- LLM Integration: Compatible with HuggingFace models (Llama-2, Mistral, TinyLlama, etc.)
- Comprehensive Evaluation: Built-in metrics for accuracy, validity, BLEU-4, and more
```
MechGraph/
├── mechgraph/                  # Main Python package
│   ├── __init__.py
│   ├── models/                 # Neural network components
│   │   ├── __init__.py
│   │   ├── graph_encoder.py    # GIN encoder for molecular graphs
│   │   └── mechgraph_model.py  # Main multimodal model
│   ├── data/                   # Data processing
│   │   ├── __init__.py
│   │   ├── processor.py        # SMILES to graph conversion
│   │   └── dataset.py          # Dataset loaders
│   ├── evaluation/             # Evaluation tools
│   │   ├── __init__.py
│   │   ├── metrics.py          # Accuracy, BLEU, validity metrics
│   │   └── logger.py           # Experiment logging
│   └── utils/                  # Utilities
│       ├── __init__.py
│       └── visualization.py    # Training plots
├── scripts/                    # Executable scripts
│   ├── train.py                # Training script
│   ├── inference.py            # Run predictions
│   ├── evaluate.py             # Model evaluation
│   ├── save_model.py           # Save checkpoints
│   └── generate_tables.py     # Generate paper tables
├── configs/                    # Configuration files
│   └── default.yaml            # Default hyperparameters
├── data/                       # Data files
│   ├── pubchem-10m.txt         # PubChem molecules
│   └── pmechdb_data/           # PMechDB reaction data
├── tables/                     # Generated evaluation tables
├── notebooks/
│   └── MechGraph.ipynb         # Interactive notebook
├── requirements.txt            # Python dependencies
├── setup.py                    # Package installation
└── README.md                   # This file
```
- Python 3.9+
- CUDA 11.8+ (recommended for GPU training)
- 8GB+ GPU memory (for default LLM)
```bash
# Clone the repository
git clone https://github.com/whyujjwal/MechGraph.git
cd MechGraph

# Create a virtual environment
conda create -n mechgraph python=3.10
conda activate mechgraph

# Install dependencies
pip install -r requirements.txt

# Install the package in development mode
pip install -e .
```

If you encounter issues with PyG, install it manually:
```bash
# For CUDA 11.8
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip install torch-geometric
```

See the PyG installation guide for other CUDA versions.
For gated models (e.g., Llama-2), authenticate with HuggingFace:
```bash
huggingface-cli login
```

MechGraph consists of three differentiable components:
```
Input: Node features (atomic numbers) + Edge indices
        ↓
[GIN Conv] → [BatchNorm] → [ReLU] → [Dropout]   (× N layers)
        ↓
[Global Mean Pooling]
        ↓
Output: Graph embedding [B, Graph_Hidden_Dim]
```

```
Input: Graph embedding [B, Graph_Hidden_Dim]
        ↓
[Linear Layer]
        ↓
Output: LLM-compatible token [B, 1, LLM_Hidden_Dim]
```
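The encoder and projector can be sketched in plain PyTorch. This is an illustrative stand-in, not the package's actual implementation: the class names `TinyGINEncoder` and `Projector` are hypothetical, and the real encoder uses PyTorch Geometric's `GINConv` with sparse edge indices rather than a dense adjacency matrix.

```python
import torch
import torch.nn as nn

class TinyGINEncoder(nn.Module):
    """Minimal GIN-style encoder: sum-aggregate neighbors, then an MLP.
    A dense-adjacency sketch of what torch_geometric.nn.GINConv does."""
    def __init__(self, in_dim: int, hidden_dim: int, num_layers: int = 3):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(num_layers))
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(in_dim if i == 0 else hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, in_dim], adj: [num_nodes, num_nodes] with 0/1 entries
        for i, mlp in enumerate(self.mlps):
            # GIN update: h' = MLP((1 + eps) * h + sum over neighbors)
            x = mlp((1 + self.eps[i]) * x + adj @ x)
        return x.mean(dim=0)  # global mean pooling -> [hidden_dim]

class Projector(nn.Module):
    """Maps a graph embedding into the LLM's token-embedding space."""
    def __init__(self, graph_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(graph_dim, llm_dim)

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        return self.proj(g).unsqueeze(0).unsqueeze(0)  # [1, 1, llm_dim]

# Benzene as a toy input: 6 carbons in a ring, atomic number as the only feature
adj = torch.zeros(6, 6)
for i in range(6):
    adj[i, (i + 1) % 6] = adj[(i + 1) % 6, i] = 1.0
x = torch.full((6, 1), 6.0)

g = TinyGINEncoder(in_dim=1, hidden_dim=128)(x, adj)  # shape [128]
tok = Projector(graph_dim=128, llm_dim=2048)(g)       # shape [1, 1, 2048]
```

The key design point this illustrates: the projector's output has the same last dimension as the LLM's token embeddings, so the graph can be spliced into the prompt as if it were a single extra token.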
```
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│  Molecule Graph ──► [GraphEncoder] ──► [Projector] ──┐        │
│                                                      │        │
│                                                      ▼        │
│  Text Prompt ──► [Tokenizer] ──► [Embeddings] ──► [Concat]    │
│                                                      │        │
│                                                      ▼        │
│                                               [Frozen LLM]    │
│                                                      │        │
│                                                      ▼        │
│                                             Mechanism Output  │
└───────────────────────────────────────────────────────────────┘
```
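Concretely, the projected graph token is concatenated with the text embeddings before they reach the frozen LLM. A shape-only sketch (the variable names and the choice to prepend the graph token are assumptions for illustration, not the package's actual code):

```python
import torch

batch, seq_len, llm_dim = 2, 16, 2048

graph_token = torch.randn(batch, 1, llm_dim)        # output of the Projector
text_embeds = torch.randn(batch, seq_len, llm_dim)  # from the LLM's embedding layer

# Prepend the graph token so the LLM attends to it like any other token
inputs_embeds = torch.cat([graph_token, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([2, 17, 2048])
```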
```python
from mechgraph import MechGraphModel, MoleculeProcessor

# Initialize
processor = MoleculeProcessor()
model = MechGraphModel(llm_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Convert SMILES to graph
graph = processor.smiles_to_graph("c1ccccc1")  # Benzene

# Get graph embedding
embedding = model.get_graph_embedding(graph)
```

```bash
# Stage 1: Alignment (molecule-description pairs)
python scripts/train.py --stage alignment --epochs 2

# Stage 2: Instruction tuning (reaction-mechanism pairs)
python scripts/train.py --stage instruction --epochs 5

# With custom settings
python scripts/train.py \
    --stage instruction \
    --llm_path "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --batch_size 4 \
    --learning_rate 1e-4 \
    --max_samples 1000
```

```bash
# Run inference on a SMILES string
python scripts/inference.py --smiles "CC(=O)O"

# With a trained checkpoint
python scripts/inference.py \
    --smiles "c1ccccc1" \
    --checkpoint checkpoints/mechgraph_epoch_2.pt
```

```bash
# Evaluate model
python scripts/evaluate.py --checkpoint checkpoints/mechgraph_epoch_2.pt

# Generate paper tables
python scripts/generate_tables.py
```

The alignment stage uses molecule-description pairs; the included `data/pubchem-10m.txt` provides SMILES strings from PubChem.
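The SMILES-to-graph step can be sketched with RDKit. This is a conceptual mirror of what `processor.smiles_to_graph` does, not the package's actual code; the real processor in `mechgraph/data/processor.py` returns PyTorch Geometric objects and may use richer features.

```python
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Parse a SMILES string into node features (atomic numbers)
    and PyG-style edge indices, returned here as plain lists."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    # One node feature per atom: its atomic number
    x = [[atom.GetAtomicNum()] for atom in mol.GetAtoms()]
    # Each undirected bond becomes two directed edges, [sources, targets]
    src, dst = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        src += [i, j]
        dst += [j, i]
    return x, [src, dst]

# Acetic acid: 4 heavy atoms (C, C, O, O), 3 bonds -> 6 directed edges
x, edge_index = smiles_to_graph("CC(=O)O")
```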
The instruction-tuning stage requires the PMechDB dataset:

- Visit the PMechDB download page
- Download the dataset
- Place the CSV files in `data/pmechdb_data/`
Expected structure:
```
data/pmechdb_data/
├── manually_curated_train.csv
├── manually_curated_test_challenging.csv
├── combinatorial_train.csv
├── combinatorial_test.csv
└── combinatorial_all.csv
```
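A small pre-flight check can confirm those files are in place before training. This helper is hypothetical, not part of the package:

```python
from pathlib import Path

# File names taken from the expected PMechDB layout above
EXPECTED = [
    "manually_curated_train.csv",
    "manually_curated_test_challenging.csv",
    "combinatorial_train.csv",
    "combinatorial_test.csv",
    "combinatorial_all.csv",
]

def missing_pmechdb_files(root: str = "data/pmechdb_data") -> list[str]:
    """Return the expected PMechDB CSVs that are not present under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_pmechdb_files()
    if missing:
        print("Missing PMechDB files:", ", ".join(missing))
```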
| Model | Top-1 Accuracy | Mechanism Validity | BLEU-4 |
|---|---|---|---|
| BioBERT | 42.5% | 68.2% | 0.35 |
| GPT-4 (Zero-shot) | 58.1% | 84.3% | 0.61 |
| MolRAG | 65.3% | 78.9% | 0.72 |
| MechGraph (Ours) | 82.4% | 91.5% | 0.88 |
Edit `configs/default.yaml` to customize:

```yaml
model:
  llm_path: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
  freeze_llm: true
  graph_hidden_dim: 128
  num_gnn_layers: 3

training:
  epochs: 2
  batch_size: 2
  learning_rate: 1.0e-4
```

`MechGraphModel` constructor:

```python
MechGraphModel(
    llm_path: str = "meta-llama/Llama-2-7b-hf",
    freeze_llm: bool = True,
    node_feature_dim: int = 1,
    graph_hidden_dim: int = 128,
    num_gnn_layers: int = 3
)
```

`MoleculeProcessor` usage:

```python
processor = MoleculeProcessor(add_hydrogens=True)
graph = processor.smiles_to_graph("CCO")  # Ethanol
graphs = processor.batch_smiles_to_graphs(["CCO", "CC(=O)O"])
```

Evaluation metrics:

```python
from mechgraph.evaluation import (
    calculate_top1_accuracy,
    calculate_bleu4,
    calculate_mechanism_validity,
    calculate_levenshtein_distance
)
```

Contributions are welcome! Please:
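Two of the metrics above can be approximated with the standard library: top-1 accuracy as an exact-match rate, and Levenshtein distance via dynamic programming. These are illustrative re-implementations, not the package's versions in `mechgraph/evaluation/metrics.py`:

```python
def calculate_top1_accuracy(preds: list[str], refs: list[str]) -> float:
    """Fraction of predictions that exactly match the reference after stripping."""
    if not refs:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(preds, refs))
    return hits / len(refs)

def calculate_levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if chars match)
            ))
        prev = curr
    return prev[len(b)]
```

For example, `calculate_levenshtein_distance("kitten", "sitting")` gives the textbook answer of 3 edits.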
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use MechGraph in your research, please cite:
```bibtex
@article{mechgraph2024,
  title={MechGraph: Neuro-Symbolic Molecular Mechanism Prediction},
  author={MechGraph Team},
  journal={arXiv preprint},
  year={2024}
}
```

- PyTorch Geometric for graph neural network implementations
- HuggingFace Transformers for LLM integration
- RDKit for cheminformatics utilities
- PMechDB for reaction mechanism data