smufin: Mutation Finder and Genome Comparison Tool

smufin is a mutation finder and side-by-side genome comparison and manipulation toolset based on k-mers. It is both the current reference implementation and a redesign of the original SMUFIN, a reference-free method that identifies mutations in tumor genomes by comparing them directly against the corresponding normal genome of the same patient.

smufin has been designed as a reconfigurable set of checkpointable stages, and supports different modes of execution to adapt to the characteristics of the hardware where it's running: from scale-out executions in large data centers to scale-up solutions that take advantage of accelerators and storage class memory in a single machine.

Compile

Compiling smufin requires make, a compiler such as gcc with C++11 support (>= 4.8), and the following libraries:

sparsehash (>= 2.0)
boost (>= 1.55): Property trees and string algorithms
ConcurrentQueue and ReaderWriterQueue: MPMC and SPSC queues
libbf: Bloom filters
RocksDB (>= 4.9): Key-value store for flash storage
htslib: Parse BAM files
msgpack: Serialization

The paths for each library can be configured using a custom make.conf file; see make.conf.sample for an example. On Debian-based systems, packages for the first two and last three libraries are available as: libsparsehash-dev libboost1.55-dev librocksdb-dev libhts-dev libmsgpack-dev.

smufin's makefile defaults to a minimal output. For a more verbose output, use the following:

VERBOSE=1 make

Run

Running smufin requires a configuration file such as the sample smufin.conf. The following command line options can be used to override the configuration file.

Usage: sm -c CONFIG [OPTIONS]
Options:
 -p, --partitions NUM_PARTITIONS
 --pid PARTITION_ID
 -l, --loaders NUM_LOADER_THREADS
 -s, --storers NUM_STORER_THREADS
 -f, --filters NUM_FILTER_THREADS
 -m, --mergers NUM_MERGE_THREADS
 -g, --groupers NUM_GROUP_THREADS
 --input-normal INPUT_FILES
 --input-tumor INPUT_FILES
 -o, --output OUTPUT_PATH
 -x, --exec COMMANDS
 -h, --help

Input file arguments given to --input-{normal,tumor} must be either a single file or a quoted wildcard pattern, e.g. "file-*.fq.gz" or "file-[12].fq.gz".
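As a rough illustration of what such patterns select, the sketch below uses Python's fnmatch module on a handful of hypothetical file names. It is not how smufin itself expands the pattern (that happens inside the sm binary and may differ in detail); it only shows the wildcard syntax. Quoting the pattern keeps the shell from expanding it before sm receives it.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the pattern syntax.
names = ["file-1.fq.gz", "file-2.fq.gz", "file-3.fq.gz", "other.fq.gz"]

def select(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern."""
    return sorted(n for n in candidates if fnmatchcase(n, pattern))

# '*' matches any run of characters; '[12]' matches a single '1' or '2'.
star = select("file-*.fq.gz", names)
bracket = select("file-[12].fq.gz", names)
print(star)
print(bracket)
```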

Commands

The argument passed to the --exec flag, or to the core.exec configuration option, must be a list of stage commands separated by semicolons. Each entry consists of a stage name followed by a colon and a comma-separated list of commands to run in that stage, e.g. count:run,dump or count:restore;filter:run,dump. The following list contains all available stages and commands:

prune
- run: generates a Bloom filter of stems that have been observed in the input more than once; optional stage that can be run first to save memory during count.
count: build frequency table.
- run: counts frequency of normal and tumoral kmers in input sequence, ignoring kmers whose stem is only seen once; counters hold values up to 2^16.
- dump: serialize kmer frequency as sparsehash tables indexed by stem, for checkpointing and/or later analysis.
- restore: unserialize dumped frequency tables from disk.
- stats: display frequency stats, including size of different tables, and histograms for normal and tumoral counts.
- export: serialize frequencies as plain CSV table files containing kmers along with normal and tumoral counters; rows can be limited to kmers that meet certain criteria through configuration options export-{min,max}.
filter: select breakpoint candidates and build indexes.
- run: build normal and tumoral (mutated and non-mutated) filter indexes containing candidate reads, along with their IDs and the positions of candidate kmers.
- dump: finalize writing filter indexes to disk; when using RocksDB indexes, force a compaction.
- stats: display sizes of the different filters.
merge: combine multiple filter indexes.
- run: read and combine filter indexes from different partitions into a single, unified index in RocksDB. Merges all possible indexes, sequentially one at a time.
- run_{seq,k2i,i2p}_{nn,tn,tm}: read and combine specific filter indexes from different partitions into a single RocksDB instance.
- stats: display sizes of the merged filters.
- to_fastq: convert indexed reads to FASTQ format.
group: match candidates that belong to the same region.
- run: window-based group leader selection and retrieval of related reads.
- dump: write groups and set collections to disk (group_rocks only).
- stats: display number of groups generated by each thread.
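
To make the --exec syntax concrete, here is a minimal sketch of how such a string could be split into stages and commands and checked against the list above. It is an illustration only, not smufin's actual parser: the function name parse_exec, the tolerance for surrounding whitespace and trailing semicolons, and the reading of run_{seq,k2i,i2p}_{nn,tn,tm} as shorthand for nine separate commands are all assumptions. It also does not check the ordering constraints between stages.

```python
# Stages and commands as documented in the list above. The parser is only
# an illustrative sketch, not smufin's actual implementation.
STAGES = {
    "prune": {"run"},
    "count": {"run", "dump", "restore", "stats", "export"},
    "filter": {"run", "dump", "stats"},
    "merge": {"run", "stats", "to_fastq"},
    "group": {"run", "dump", "stats"},
}
# Assumption: run_{seq,k2i,i2p}_{nn,tn,tm} expands to nine merge commands.
STAGES["merge"] |= {
    f"run_{index}_{kind}"
    for index in ("seq", "k2i", "i2p")
    for kind in ("nn", "tn", "tm")
}

def parse_exec(spec):
    """Split 'stage:cmd,cmd;stage:cmd' into [(stage, [cmd, ...]), ...]."""
    plan = []
    for chunk in spec.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty entries such as a trailing semicolon
        stage, sep, commands = chunk.partition(":")
        stage = stage.strip()
        if not sep or stage not in STAGES:
            raise ValueError(f"unknown or malformed stage: {chunk!r}")
        cmds = [c.strip() for c in commands.split(",")]
        for cmd in cmds:
            if cmd not in STAGES[stage]:
                raise ValueError(f"unknown command {cmd!r} for stage {stage!r}")
        plan.append((stage, cmds))
    return plan

plan = parse_exec("count:restore;filter:run,dump")
print(plan)
```

With the second example from the text, the result pairs count with restore, followed by filter with run and dump, in the order given.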

Note that commands need to follow a certain order, and some stages can't be executed without running earlier stages first. The following graph shows the dependencies between smufin commands:

Additional Documentation

Maintainers

Jordà Polo <jorda.polo@bsc.es>, 2015-2019.
