GitHub - MLCIL/scikit-fingerprints: Scikit-learn compatible library for molecular fingerprints and chemoinformatics

The scikit-learn compatible library for molecular fingerprints and chemoinformatics.

Easily and efficiently compute molecular fingerprints, molecular filters, distances & similarity measures, and more.

Go from SMILES to production-grade chemoinformatics ML pipelines in a few lines of code.

Documentation · Examples & tutorials · API Reference · Publication

Install

You can install from PyPI, using pip or uv.

pip install scikit-fingerprints

If you need bleeding-edge features and don't mind potentially unstable or undocumented functionality, you can also install directly from GitHub:

pip install git+https://github.com/MLCIL/scikit-fingerprints.git

Python versions from 3.10 to 3.13 are supported on all major operating systems. Tests are run on Linux Ubuntu, Windows, and macOS.

Quickstart

Simply input SMILES strings into the molecular fingerprint instance:

from skfp.fingerprints import ECFPFingerprint

smiles = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]

fp = ECFPFingerprint()
X = fp.transform(smiles)  # SMILES in, NumPy array out

Build a full molecular ML pipeline with scikit-learn:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union

from skfp.datasets.moleculenet import load_clintox
from skfp.fingerprints import ECFPFingerprint, MACCSFingerprint
from skfp.metrics import extract_pos_proba, multioutput_auroc_score
from skfp.model_selection import scaffold_train_test_split
from skfp.preprocessing import MolFromSmilesTransformer

smiles, y = load_clintox()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
    smiles, y, test_size=0.2
)

pipeline = make_pipeline(
    MolFromSmilesTransformer(),
    make_union(ECFPFingerprint(count=True), MACCSFingerprint()),
    RandomForestClassifier(random_state=0),
)
pipeline.fit(smiles_train, y_train)

y_pred_proba = extract_pos_proba(pipeline.predict_proba(smiles_test))
print(f"AUROC: {multioutput_auroc_score(y_test, y_pred_proba):.2%}")
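The pipeline above uses scaffold_train_test_split so that structurally related molecules do not end up on both sides of the split. The sketch below illustrates the general idea of such a group-aware split in plain Python. It is not the library's implementation: the function name grouped_train_test_split is hypothetical, and the scaffold keys are assumed to be precomputed strings (in practice they would be derived from the molecules, e.g. with RDKit).

```python
from collections import defaultdict


def grouped_train_test_split(groups, test_size=0.2):
    """Split sample indices so that no group lands in both subsets.

    `groups` holds one precomputed key per sample (here, a hypothetical
    scaffold string). The largest groups are assigned to training first,
    so small, rare groups tend to end up in the test set.
    """
    members = defaultdict(list)
    for idx, key in enumerate(groups):
        members[key].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(members.values(), key=lambda m: (-len(m), m[0]))

    n_samples = len(groups)
    train_cutoff = n_samples - round(test_size * n_samples)

    train_idx, test_idx = [], []
    for group in ordered:
        # A group is never divided: it goes entirely to one side.
        if len(train_idx) + len(group) <= train_cutoff:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return sorted(train_idx), sorted(test_idx)


# Toy data: 10 molecules sharing 4 made-up scaffold keys.
scaffolds = ["c1ccccc1"] * 5 + ["C1CCCCC1"] * 3 + ["c1ccncc1", "C1CCNCC1"]
train_idx, test_idx = grouped_train_test_split(scaffolds, test_size=0.2)
print(train_idx, test_idx)
```

Because whole groups are moved at once, the realized test fraction only approximates test_size on real data. This kind of split usually gives a more pessimistic, and arguably more realistic, performance estimate than a random split.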

Key features

Molecular fingerprints
- over 30, e.g. ECFP, Avalon, MACCS, Mordred, PubChem
- all with a uniform .transform() API
Molecular filters
- over 30, e.g. Lipinski Rule of 5, PAINS, REOS
- both substructural and physicochemical
Similarity & distance measures
- 14 measures, e.g. Tanimoto, Dice, MCS
- compatible with kNN, UMAP, HDBSCAN, and other distance-based models
- efficient bulk similarity distribution computation
Applicability domain checks
- 11 methods, e.g. kNN, centroid distance, TOPKAT
- evaluate the reliability of algorithms for new molecules
Benchmark datasets
- MoleculeNet, Therapeutics Data Commons, MoleculeACE, and LRGB
- train-test splits built-in
Native scikit-learn integration
- use Pipeline, FeatureUnion, GridSearchCV, and more
- build, save, and deploy ML pipelines for chemoinformatics
Other features
- fast and efficient: parallelized, sparse matrix support, C++ RDKit under the hood
- efficient hyperparameter tuning with fingerprint caching
- MIT licensed, permitting academic and commercial use
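
To make the similarity measures listed above concrete, here is a minimal pure-Python sketch of what Tanimoto and Dice similarity compute on binary fingerprints. It only illustrates the formulas and is not the library's implementation. Fingerprints are represented here as sets of "on" bit indices, and returning 1.0 for two empty fingerprints is a convention assumed for this example.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B|."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return len(a & b) / union


def dice_similarity(bits_a, bits_b):
    """Dice similarity: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(bits_a), set(bits_b)
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # same assumed convention as above
    return 2 * len(a & b) / total


# Two toy fingerprints given as indices of set bits (made-up values).
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 20}

tanimoto = tanimoto_similarity(fp_a, fp_b)  # 2 shared bits / 5 bits in union
dice = dice_similarity(fp_a, fp_b)          # 2 * 2 shared bits / (4 + 3) bits
tanimoto_distance = 1.0 - tanimoto          # distance form for kNN-style models

print(f"Tanimoto: {tanimoto:.3f}, Dice: {dice:.3f}")
```

The distance form (one minus the similarity) is what distance-based models such as kNN typically consume.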

Tutorials

Step-by-step Jupyter notebooks, both for learning and deploying production-grade features:

Publications and citing

Publications using scikit-fingerprints:

If you use scikit-fingerprints in your work, please cite our publication in SoftwareX (open access):

@article{scikit_fingerprints,
   title = {Scikit-fingerprints: Easy and efficient computation of molecular fingerprints in Python},
   author = {Jakub Adamczyk and Piotr Ludynia},
   journal = {SoftwareX},
   volume = {28},
   pages = {101944},
   year = {2024},
   issn = {2352-7110},
   doi = {https://doi.org/10.1016/j.softx.2024.101944},
   url = {https://www.sciencedirect.com/science/article/pii/S2352711024003145},
}

Also available as a preprint on arXiv.

Contributing

Contributions are welcome! Please read CONTRIBUTING.md and CODE_OF_CONDUCT.md for details.

License

MIT -- see LICENSE.md for details.
