Seqwin is a lightning‑fast, memory‑efficient toolkit for discovering signature sequences (genomic markers) that balance high sensitivity with high specificity. It builds a minimizer‑based pan‑genome graph across target and neighboring non‑target genomes and extracts signature sequences using a novel graph algorithm. Signatures can be used for downstream assay design such as qPCR, ddPCR, amplicon sequencing and hybrid capture probes.
Seqwin computes minimizers with ntHash, using code adopted from btllib (licensed under the GNU General Public License v3.0).
See the Seqwin Wiki for full documentation.
Seqwin is supported on Linux, macOS, and Windows (via WSL), on x86-64 and AArch64 systems.
If Conda is not installed, install it with miniforge or miniconda.
1. Create a new Conda environment "seqwin" and install Seqwin via Bioconda
conda create -n seqwin seqwin \
--channel conda-forge \
--channel bioconda \
--strict-channel-priority
Tip
Setting channel priority is important for Bioconda packages to function properly. You can also persist the channel priority settings for all package installations by modifying your ~/.condarc file. For more information, see the Bioconda documentation.
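One way to persist these settings is with `conda config`, which writes to ~/.condarc. The sketch below uses the commands given in the Bioconda documentation; the function name `persist_channels` is hypothetical. It changes your global Conda configuration, so review it before running.

```shell
# Persist the channel setup in ~/.condarc.
# `--add` prepends to the channel list, so conda-forge (added last)
# ends up with the highest priority, matching the command above.
persist_channels() {
    conda config --add channels bioconda &&
    conda config --add channels conda-forge &&
    conda config --set channel_priority strict
}

if command -v conda >/dev/null 2>&1; then
    # Apply the settings, then print them for a quick visual check.
    persist_channels && conda config --show channels channel_priority
else
    echo "conda not found; install miniforge or miniconda first"
fi
```

With these settings persisted, the `--channel` and `--strict-channel-priority` flags should no longer be needed on each install command.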
2. Activate the environment and verify the install
conda activate seqwin
seqwin --help
Prerequisites
- Python >=3.10 (with pip and development headers; usually included with official installers)
- A C++17 compiler (GCC, Clang)
- zlib development headers/library (zlib)
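The prerequisites above can be checked before building from source. The following is a minimal sketch, assuming `python3`, `g++` and `clang++` are the executable names on your system; all function names are hypothetical. The zlib check only confirms that the header is found by the compiler, not that the library links, and the Python development headers are not checked.

```shell
# True if the named executable is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Python >= 3.10
check_python() {
    have python3 &&
    python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'
}

# pip available for that interpreter
check_pip() {
    have python3 && python3 -m pip --version >/dev/null 2>&1
}

# A C++ compiler (GCC or Clang)
check_cxx() { have g++ || have clang++; }

# zlib header visible to the compiler in C++17 mode (preprocess only)
check_zlib() {
    cxx=$(command -v g++ || command -v clang++) || return 1
    echo '#include <zlib.h>' | "$cxx" -std=c++17 -x c++ -E - >/dev/null 2>&1
}

for check in check_python check_pip check_cxx check_zlib; do
    if "$check"; then
        echo "ok:      $check"
    else
        echo "missing: $check"
    fi
done
```

Any line reported as missing points to a prerequisite to install before running the pip build.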
1. Clone this repository and install with pip
This will build the C++ extension/wrapper and install the required Python dependencies.
git clone https://github.com/treangenlab/Seqwin.git
cd Seqwin
pip install . -v
seqwin --help
2. Install non-Python dependencies
Seqwin can still run without these tools, but some features will be unavailable or skipped. See the Command Line Parameters for details.
- Mash (see the publication)
- NCBI BLAST+
- NCBI Datasets CLI
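To see which of these optional tools are available, a quick PATH check can help. This is a minimal sketch: the executable names `mash`, `blastn` and `datasets` are the ones these packages normally install, but which executables Seqwin actually calls is an assumption here, and `check_tools` is a hypothetical helper.

```shell
# Report, for each name given, whether an executable is on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Assumed executable names for Mash, NCBI BLAST+ and NCBI Datasets CLI.
check_tools mash blastn datasets ||
    echo "Some optional tools are missing; related features may be unavailable or skipped."
```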
Identify signatures by providing one or more target taxa and non-target neighboring taxa.
seqwin \
-t "Salmonella enterica subsp. diarizonae" \
-n "Salmonella enterica subsp. salamae" \
-n "Salmonella bongori" \
--threads 8
Taxa names must be exact matches to NCBI Taxonomy.
Outputs are written to seqwin-out/ in your working directory (see Description of Outputs).
Alternatively, a list of target or non-target genomes can be provided as a text file of file paths. Each line should be the path to a genome FASTA file (plain text or gzipped).
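Such list files can be generated rather than written by hand. The sketch below assumes a hypothetical layout with one directory per group (`genomes/targets` and `genomes/non-targets`) and FASTA files ending in `.fna` or `.fna.gz`; the helper names are also hypothetical. Paths are checked relative to the current directory.

```shell
# Print one path per line for every .fna or .fna.gz file under directory $1.
list_genomes() {
    find "$1" -type f \( -name '*.fna' -o -name '*.fna.gz' \) | sort
}

# Print every non-empty line of list file $1 that is not an existing file.
missing_paths() {
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue
        [ -f "$path" ] || echo "$path"
    done < "$1"
}

# Hypothetical layout: one directory of genomes per group.
if [ -d genomes/targets ] && [ -d genomes/non-targets ]; then
    list_genomes genomes/targets > targets.txt
    list_genomes genomes/non-targets > non-targets.txt
    # Any output here is a listed path that does not exist.
    missing_paths targets.txt
    missing_paths non-targets.txt
fi
```

The resulting files can then be passed to the command that follows.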
seqwin --tar-paths targets.txt --neg-paths non-targets.txt
Below is an example of targets.txt or non-targets.txt:
./genomes/GCA_003718275.1_ASM371827v1_genomic.fna
/data/genomes/GCA_000389055.1_46.E.09_genomic.fna
/data/genomes/GCA_008363955.1_ASM836395v1_genomic.fna.gz
Expected runtime (with --threads 8 or -p 8):
- ~5min and 2.5GB peak RAM for ~500 bacterial genomes with default settings.
- ~5min and 23GB peak RAM for ~15k bacterial genomes with --no-blast and --no-mash.
Run seqwin --help or seqwin -h to see the full command line interface.
If you use Seqwin in your research, please cite:
Michael X. Wang, Bryce Kille, Michael G. Nute, Siyi Zhou, Lauren B. Stadler, and Todd J. Treangen "Seqwin: Ultrafast identification of signature sequences in microbial genomes". Proceedings of ISMB 2026, accepted (2026).
Benchmarking datasets, outputs and scripts are available on Zenodo.