The scikit-learn compatible library for molecular fingerprints and chemoinformatics.
Easily and efficiently compute molecular fingerprints, molecular filters, distances & similarity measures, and more.
Go from SMILES to production-grade chemoinformatics ML pipelines in a few lines of code.
You can install from PyPI, using pip or uv.
pip install scikit-fingerprints
If you need bleeding-edge features and don't mind potentially unstable or undocumented functionalities, you can also install directly from GitHub:
pip install git+https://github.com/MLCIL/scikit-fingerprints.git
Python versions from 3.10 to 3.13 are supported on all major operating systems. Tests are run on Linux Ubuntu, Windows, and macOS.
Simply input SMILES strings into the molecular fingerprint instance:
from skfp.fingerprints import ECFPFingerprint
smiles = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]
fp = ECFPFingerprint()
X = fp.transform(smiles)  # SMILES in, NumPy array out
Build a full molecular ML pipeline with scikit-learn:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from skfp.datasets.moleculenet import load_clintox
from skfp.fingerprints import ECFPFingerprint, MACCSFingerprint
from skfp.metrics import extract_pos_proba, multioutput_auroc_score
from skfp.model_selection import scaffold_train_test_split
from skfp.preprocessing import MolFromSmilesTransformer
smiles, y = load_clintox()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
    smiles, y, test_size=0.2
)
pipeline = make_pipeline(
    MolFromSmilesTransformer(),
    make_union(ECFPFingerprint(count=True), MACCSFingerprint()),
    RandomForestClassifier(random_state=0),
)
pipeline.fit(smiles_train, y_train)
y_pred_proba = extract_pos_proba(pipeline.predict_proba(smiles_test))
print(f"AUROC: {multioutput_auroc_score(y_test, y_pred_proba):.2%}")
- Molecular fingerprints
  - over 30, e.g. ECFP, Avalon, MACCS, Mordred, PubChem
  - all with a uniform .transform() API
- Molecular filters
  - over 30, e.g. Lipinski Rule of 5, PAINS, REOS
  - both substructural and physicochemical
- Similarity & distance measures
  - 14 measures, e.g. Tanimoto, Dice, MCS
  - compatible with kNN, UMAP, HDBSCAN, and other distance-based models
  - efficient bulk similarity distribution computation
- Applicability domain
  - 11 methods, e.g. kNN, centroid distance, TOPKAT
  - evaluate the reliability of algorithms for new molecules
- Datasets
  - MoleculeNet, Therapeutics Data Commons, MoleculeACE, and LRGB
  - train-test splits built-in
- Native scikit-learn integration
  - use Pipeline, FeatureUnion, GridSearchCV, and more
  - build, save, and deploy ML pipelines for chemoinformatics
- Other features
  - fast and efficient: parallelized, sparse matrix support, C++ RDKit under the hood
  - efficient hyperparameter tuning with fingerprint caching
  - MIT licensed, permitting academic and commercial use
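To make the Tanimoto and Dice measures listed among the features concrete, here is a minimal pure-Python sketch of their textbook definitions for binary fingerprints. It is an illustration only, not the library's implementation: the function names `tanimoto_binary` and `dice_binary`, the toy 8-bit fingerprints, and the convention that two all-zero fingerprints have similarity 1.0 are assumptions made for this example.

```python
def tanimoto_binary(a, b):
    """Tanimoto (Jaccard) similarity of two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Assumed convention: two all-zero fingerprints count as identical.
    return 1.0 if either == 0 else both / either


def dice_binary(a, b):
    """Dice similarity of two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Same assumed convention for the all-zero case.
    return 1.0 if total == 0 else 2 * both / total


# Toy 8-bit fingerprints (hypothetical, not produced by any real fingerprint).
fp_a = [1, 1, 0, 0, 1, 0, 1, 0]
fp_b = [1, 0, 0, 0, 1, 1, 1, 0]

tanimoto_sim = tanimoto_binary(fp_a, fp_b)   # 3 shared bits / 5 bits in union
dice_sim = dice_binary(fp_a, fp_b)           # 2 * 3 shared bits / (4 + 4) set bits
tanimoto_dist = 1.0 - tanimoto_sim           # distance form used by kNN-style models
```

The distance form (one minus the similarity) is what distance-based models such as kNN consume. For real workloads, the library's own vectorized measures are the better choice over a Python loop like this one.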
Step-by-step Jupyter notebooks, both for learning and deploying production-grade features:
- Introduction to scikit-fingerprints
- Fingerprint types
- Molecular pipelines
- Conformers and 3D fingerprints
- Hyperparameter tuning
- Dataset splits
- Datasets and benchmarking
- Similarity and distance metrics
- Molecular filters
- Molecular clustering
Publications using scikit-fingerprints:
- J. Adamczyk, W. Czech "Molecular Topological Profile (MOLTOP) -- Simple and Strong Baseline for Molecular Graph Classification" ECAI 2024
- J. Adamczyk, P. Ludynia "Scikit-fingerprints: easy and efficient computation of molecular fingerprints in Python" SoftwareX
- J. Adamczyk, P. Ludynia, W. Czech "Molecular Fingerprints Are Strong Models for Peptide Function Prediction" ArXiv preprint
- J. Adamczyk "Towards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology" CIKM 2025
- J. Adamczyk, J. Poziemski, F. Job, M. Król, M. Makowski "MolPILE - large-scale, diverse dataset for molecular representation learning" ArXiv preprint
- J. Adamczyk, J. Poziemski, P. Siedlecki "Evaluating machine learning models for predicting pesticide toxicity to honey bees" Ecotoxicology and Environmental Safety 2026
- M. Fitzner et al. "BayBE: a Bayesian Back End for experimental planning in the low-to-no-data regime" RSC Digital Discovery
- J. Xiong et al. "Bridging 3D Molecular Structures and Artificial Intelligence by a Conformation Description Language"
- S. Mavlonazarova et al. "Untargeted Metabolomics Reveals Organ-Specific and Extraction-Dependent Metabolite Profiles in Endemic Tajik Species Ferula violacea Korovin" bioRxiv preprint
If you use scikit-fingerprints in your work, please cite our publication in SoftwareX (open access):
@article{scikit_fingerprints,
title = {Scikit-fingerprints: Easy and efficient computation of molecular fingerprints in Python},
author = {Jakub Adamczyk and Piotr Ludynia},
journal = {SoftwareX},
volume = {28},
pages = {101944},
year = {2024},
issn = {2352-7110},
doi = {https://doi.org/10.1016/j.softx.2024.101944},
url = {https://www.sciencedirect.com/science/article/pii/S2352711024003145},
}
Also available as a preprint on ArXiv.
Contributions are welcome! Please read CONTRIBUTING.md and CODE_OF_CONDUCT.md for details.
MIT -- see LICENSE.md for details.