scDblFinderPy

Python implementation of the scDblFinder workflow for doublet detection in single-cell RNA-seq data, designed to run on AnnData/Scanpy objects.

This module mirrors the core ideas of the R package:

optional pre-clustering
artificial doublet generation
iterative classifier training
threshold optimization based on expected doublet rate

What this package does

Given a count matrix in an AnnData object, scDblFinderPy estimates a doublet score for each real cell and returns a final class (doublet or singlet).

At a high level, the pipeline is:

Optional clustering of real cells (clustered mode).
Feature selection and artificial doublet generation.
Combined real + artificial embedding and KNN feature extraction.
Iterative XGBoost training and score refinement.
Final thresholding to obtain doublet calls.

Repository layout

scDblFinderPy/              ← repo root (this directory is the Python package)
├── scDblFinder.py          main pipeline — contains compute_doublet_score()
├── clustering.py
├── doublet_generation.py
├── misc.py
├── thresholding.py
├── rng.py
├── graph.py
├── biocneighbors_kmknn.py
├── louvain_controlled.py
├── hw_kmeans.py
├── r_mt19937.py
├── r_sample_emulation.py

Setup

1. Clone the repository

git clone <repo-url>
cd scDblFinderPy

2. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

3. Install dependencies

pip install numpy pandas scipy anndata scanpy scikit-learn xgboost

Optional — GPU acceleration (requires a CUDA-capable machine):

pip install rapids-singlecell cuml

Using the package in your own scripts

Because the repo root itself is the Python package, you need to add its parent directory to sys.path before importing. Assuming you cloned into /path/to/scDblFinderPy:

import sys
sys.path.insert(0, "/path/to")   # parent of the scDblFinderPy/ directory
from scDblFinderPy.scDblFinder import compute_doublet_score

Input expectations

compute_doublet_score(...) expects an AnnData object where:

adata.X contains raw counts (preferred), or
adata.layers['counts'] contains raw counts.

Random mode (no clustering)

import scanpy as sc
import sys
sys.path.insert(0, "/path/to")
from scDblFinderPy.scDblFinder import compute_doublet_score

adata = sc.read_h5ad("your_data.h5ad")
adata_out = compute_doublet_score(
    adata,
    clusters_col=None,   # random mode — no clustering step
    n_iters=3,
    random_state=42,
    verbose=True,
)

print(adata_out.obs[["scDblFinder_score", "scDblFinder_class"]].head())
print("Threshold:", adata_out.uns.get("scDblFinder_threshold"))

Clustered mode (auto clustering)

adata_out = compute_doublet_score(
    adata,
    clusters_col="clusters",  # column is computed and stored here if absent
    n_iters=3,
    random_state=42,
    verbose=True,
)

Clustered mode (precomputed clusters)

adata.obs["my_clusters"] = ...   # your own cluster labels
adata_out = compute_doublet_score(adata, clusters_col="my_clusters")

Outputs

In adata.obs:

scDblFinder_score — continuous doublet score (higher = more likely doublet)
scDblFinder_class — final call: doublet or singlet

In adata.uns:

scDblFinder_threshold — the score threshold used for the final classification

If return_type='full', the returned object also includes artificial doublets.

Key parameters

Parameter	Default	Description
`clusters_col`	`None`	`None` for random mode; column name for clustered mode
`n_features`	`1352`	number of genes used for feature selection
`n_components`	`20`	number of PCA components
`n_artificial`	`None`	override number of artificial doublets (auto if `None`)
`prop_random`	`0.1`	fraction of artificial doublets generated randomly
`n_iters`	`3`	iterative classifier refinement rounds
`dbr_per1k`	`0.008`	expected doublet rate per 1k cells
`stringency`	`0.5`	threshold optimisation aggressiveness
`random_state`	`42`	reproducibility seed
`use_gpu`	`False`	enable GPU-accelerated steps (requires rapids/cuml)
`verbose`	`True`	print progress at each stage

Running the benchmarks

The benchmark scripts live in benchmarking/ and must be run from inside that directory so that relative dataset paths resolve correctly.

Run all datasets:

cd benchmarking
python run_python_benchmark.py

Results are saved to benchmarking/python_benchmark_metrics.csv.

Run a single dataset (e.g. hm-6k):

cd benchmarking
python run_dataset.py hm-6k
# optionally pass a repeat count: python run_dataset.py hm-6k 3

Results are saved to benchmarking/python_benchmark_hm-6k.csv.

Datasets must be present as benchmarking/datasets/<name>.h5ad and must contain a truth column in adata.obs with values doublet / singlet.

Reproducibility tips

Fix random_state when comparing runs.
Keep package versions stable (especially scanpy, scikit-learn, xgboost).
Use the same preprocessing assumptions (counts in adata.X or adata.layers['counts']).

Notes and current limitations

samples_col is accepted but currently ignored in the Python pipeline.
Some low-level numerical differences from the R package are expected due to library backend differences.

Troubleshooting

Results look unstable or weak:

Confirm counts are raw (not log-normalised or otherwise transformed).
Try both modes (clusters_col=None and a clustered mode).
Check xgboost version is compatible with your Python version.
Run with verbose=True to inspect each stage.

Clustering looks poor:

Pass your own precomputed cluster labels via clusters_col instead of relying on the built-in fast clustering.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

scDblFinderPy

What this package does

Repository layout

Setup

1. Clone the repository

2. Create and activate a virtual environment

3. Install dependencies

Using the package in your own scripts

Input expectations

Random mode (no clustering)

Clustered mode (auto clustering)

Clustered mode (precomputed clusters)

Outputs

Key parameters

Running the benchmarks

Reproducibility tips

Notes and current limitations

Troubleshooting

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
biocneighbors_kmknn.py		biocneighbors_kmknn.py
clustering.py		clustering.py
doublet_generation.py		doublet_generation.py
graph.py		graph.py
hw_kmeans.py		hw_kmeans.py
louvain_controlled.py		louvain_controlled.py
misc.py		misc.py
r_mt19937.py		r_mt19937.py
r_sample_emulation.py		r_sample_emulation.py
rng.py		rng.py
scDblFinder.py		scDblFinder.py
thresholding.py		thresholding.py

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

scDblFinderPy

What this package does

Repository layout

Setup

1. Clone the repository

2. Create and activate a virtual environment

3. Install dependencies

Using the package in your own scripts

Input expectations

Random mode (no clustering)

Clustered mode (auto clustering)

Clustered mode (precomputed clusters)

Outputs

Key parameters

Running the benchmarks

Reproducibility tips

Notes and current limitations

Troubleshooting

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages