scikit-fingerprints

The scikit-learn compatible library for molecular fingerprints and chemoinformatics.

Easily and efficiently compute molecular fingerprints, molecular filters, distances & similarity measures, and more.

Go from SMILES to production-grade chemoinformatics ML pipelines in a few lines of code.

Documentation · Examples & tutorials · API Reference · Publication



Install

You can install from PyPI, using pip or uv.

pip install scikit-fingerprints

If you need bleeding-edge features and don't mind potentially unstable or undocumented functionality, you can also install directly from GitHub:

pip install git+https://github.com/MLCIL/scikit-fingerprints.git

Python 3.10 through 3.13 is supported on all major operating systems. Tests run on Ubuntu Linux, Windows, and macOS.

Quickstart

Pass SMILES strings directly to a molecular fingerprint instance:

from skfp.fingerprints import ECFPFingerprint

smiles = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]

fp = ECFPFingerprint()
X = fp.transform(smiles)  # SMILES in, NumPy array out

Build a full molecular ML pipeline with scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union

from skfp.datasets.moleculenet import load_clintox
from skfp.fingerprints import ECFPFingerprint, MACCSFingerprint
from skfp.metrics import extract_pos_proba, multioutput_auroc_score
from skfp.model_selection import scaffold_train_test_split
from skfp.preprocessing import MolFromSmilesTransformer

smiles, y = load_clintox()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
    smiles, y, test_size=0.2
)

pipeline = make_pipeline(
    MolFromSmilesTransformer(),
    make_union(ECFPFingerprint(count=True), MACCSFingerprint()),
    RandomForestClassifier(random_state=0),
)
pipeline.fit(smiles_train, y_train)

y_pred_proba = extract_pos_proba(pipeline.predict_proba(smiles_test))
print(f"AUROC: {multioutput_auroc_score(y_test, y_pred_proba):.2%}")
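
The pipeline above evaluates on a scaffold split, which keeps molecules that share a core scaffold on the same side of the train/test boundary. The sketch below illustrates that idea in plain Python. It is not the library's implementation: the function name `scaffold_split_sketch` is hypothetical, the scaffolds are assumed to be precomputed and given as strings, and the greedy "largest groups fill the training set first" assignment is one common variant of the technique.

```python
from collections import defaultdict


def scaffold_split_sketch(scaffolds, test_size=0.2):
    """Return (train_idx, test_idx) such that no scaffold occurs in both.

    `scaffolds` holds one precomputed scaffold identifier per molecule
    (an assumption of this sketch; a real workflow derives them from
    the molecular structures).
    """
    # Group molecule indices by their scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Visit groups from largest to smallest (ties keep first-seen order).
    ordered = sorted(groups.values(), key=len, reverse=True)

    n_total = len(scaffolds)
    n_train_target = n_total - int(round(test_size * n_total))

    train, test = [], []
    for group in ordered:
        # A whole group goes to train if it still fits, otherwise to test.
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Toy data: scaffold labels stand in for real scaffold structures.
scaffolds = [
    "benzene", "benzene", "benzene", "pyridine", "pyridine",
    "indole", "furan", "benzene", "pyridine", "indole",
]
train_idx, test_idx = scaffold_split_sketch(scaffolds, test_size=0.2)
print("train:", train_idx)
print("test:", test_idx)
```

Because whole scaffold groups are assigned together, the test set contains structures unlike those seen in training, which usually gives a more demanding estimate than a random split.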

Key features

  • Molecular fingerprints

    • over 30, e.g. ECFP, Avalon, MACCS, Mordred, PubChem
    • all with a uniform .transform() API
  • Molecular filters

    • over 30, e.g. Lipinski Rule of 5, PAINS, REOS
    • both substructural and physicochemical
  • Similarity & distance measures

    • 14 measures, e.g. Tanimoto, Dice, MCS
    • compatible with kNN, UMAP, HDBSCAN, and other distance-based models
    • efficient bulk similarity distribution computation
  • Applicability domain checks

    • 11 methods, e.g. kNN, centroid distance, TOPKAT
    • evaluate the reliability of algorithms for new molecules
  • Benchmark datasets

    • MoleculeNet, Therapeutics Data Commons, MoleculeACE, and LRGB
    • train-test splits built-in
  • Native scikit-learn integration

    • use Pipeline, FeatureUnion, GridSearchCV, and more
    • build, save, and deploy ML pipelines for chemoinformatics
  • Other features

    • fast and efficient: parallelized computation, sparse matrix support, and RDKit's C++ core under the hood
    • efficient hyperparameter tuning with fingerprint caching
    • MIT licensed, permitting both academic and commercial use
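
To make the similarity measures named above concrete, here is a minimal pure-Python sketch of the Tanimoto and Dice coefficients for binary fingerprints. It follows the standard textbook definitions rather than the library's code: the function names are hypothetical, and returning 1.0 for two all-zero vectors is a convention assumed for this sketch.

```python
def _on_bits(fingerprint):
    """Indices of the bits that are set in a binary fingerprint."""
    return {i for i, bit in enumerate(fingerprint) if bit}


def tanimoto_similarity_sketch(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B|."""
    a, b = _on_bits(fp_a), _on_bits(fp_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


def dice_similarity_sketch(fp_a, fp_b):
    """Dice similarity: 2 * |A & B| / (|A| + |B|)."""
    a, b = _on_bits(fp_a), _on_bits(fp_b)
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return 2 * len(a & b) / total


# Two toy 8-bit fingerprints sharing three of their four set bits.
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]

tanimoto = tanimoto_similarity_sketch(fp_a, fp_b)
dice = dice_similarity_sketch(fp_a, fp_b)
print(f"Tanimoto similarity: {tanimoto:.2f}")  # 3 shared / 5 in union = 0.60
print(f"Dice similarity: {dice:.2f}")          # 2 * 3 / (4 + 4) = 0.75
print(f"Tanimoto distance: {1 - tanimoto:.2f}")
```

Subtracting a similarity from 1 gives the corresponding distance, which is the form that distance-based models such as kNN expect.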

Tutorials

Step-by-step Jupyter notebooks, both for learning the library and for deploying production-grade features:

  1. Introduction to scikit-fingerprints
  2. Fingerprint types
  3. Molecular pipelines
  4. Conformers and 3D fingerprints
  5. Hyperparameter tuning
  6. Dataset splits
  7. Datasets and benchmarking
  8. Similarity and distance metrics
  9. Molecular filters
  10. Molecular clustering

Publications and citing

Publications using scikit-fingerprints:

  1. J. Adamczyk, W. Czech "Molecular Topological Profile (MOLTOP) -- Simple and Strong Baseline for Molecular Graph Classification" ECAI 2024
  2. J. Adamczyk, P. Ludynia "Scikit-fingerprints: easy and efficient computation of molecular fingerprints in Python" SoftwareX
  3. J. Adamczyk, P. Ludynia, W. Czech "Molecular Fingerprints Are Strong Models for Peptide Function Prediction" arXiv preprint
  4. J. Adamczyk "Towards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology" CIKM 2025
  5. J. Adamczyk, J. Poziemski, F. Job, M. Król, M. Makowski "MolPILE - large-scale, diverse dataset for molecular representation learning" arXiv preprint
  6. J. Adamczyk, J. Poziemski, P. Siedlecki "Evaluating machine learning models for predicting pesticide toxicity to honey bees" Ecotoxicology and Environmental Safety 2026
  7. M. Fitzner et al. "BayBE: a Bayesian Back End for experimental planning in the low-to-no-data regime" RSC Digital Discovery
  8. J. Xiong et al. "Bridging 3D Molecular Structures and Artificial Intelligence by a Conformation Description Language"
  9. S. Mavlonazarova et al. "Untargeted Metabolomics Reveals Organ-Specific and Extraction-Dependent Metabolite Profiles in Endemic Tajik Species Ferula violacea Korovin" bioRxiv preprint

If you use scikit-fingerprints in your work, please cite our publication in SoftwareX (open access):

@article{scikit_fingerprints,
   title = {Scikit-fingerprints: Easy and efficient computation of molecular fingerprints in Python},
   author = {Jakub Adamczyk and Piotr Ludynia},
   journal = {SoftwareX},
   volume = {28},
   pages = {101944},
   year = {2024},
   issn = {2352-7110},
   doi = {10.1016/j.softx.2024.101944},
   url = {https://www.sciencedirect.com/science/article/pii/S2352711024003145},
}

Also available as a preprint on arXiv.

Contributing

Contributions are welcome! Please read CONTRIBUTING.md and CODE_OF_CONDUCT.md for details.

License

MIT -- see LICENSE.md for details.