Alignment-free estimation of CAG / CTG repeat lengths from targeted amplicon sequencing data (Illumina MiSeq short reads or Oxford Nanopore long reads).
GitHub page: https://github.com/hchetia/TripletLogger
TripletLogger scans FASTQ reads directly with an N-aware regular expression, counts pure target triplets within each accepted match, and reports per-read repeat lengths, a raw repeat-length frequency table, summary metrics, and an estimated allele call (kernel-density peak picking). No alignment to a reference is required.
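
The scanning step described above can be sketched in base R. This is only an illustration under stated assumptions: the pattern, the function name `scan_read`, and the choice to count only exact `CAG` triplets as pure are hypothetical and are not taken from TripletLogger's source.

```r
# Hypothetical N-aware pattern (an assumption, not TripletLogger's regex):
# runs of CAG-like triplets in which any base may be N, optionally joined
# by single interrupting triplets.
cag_like <- "(?:[CN][AN][GN])"
pattern  <- paste0(cag_like, "{2,}(?:[ACGTN]{3}", cag_like, "{2,})*")

scan_read <- function(read, target = "CAG") {
  read <- toupper(read)
  hits <- regmatches(read, gregexpr(pattern, read, perl = TRUE))[[1]]
  if (length(hits) == 0) return(NULL)        # no repeat tract found

  best     <- hits[which.max(nchar(hits))]   # keep the longest match
  starts   <- seq(1, nchar(best), by = 3)
  triplets <- substring(best, starts, starts + 2)

  list(
    match      = best,
    pure       = sum(triplets == target),    # exact target triplets only
    non_target = sum(triplets != target)     # interruptions and N-containing triplets
  )
}

res <- scan_read("TTGACAGCAGCAGCAACAGCAGCCGCCGTT")
print(res$pure)
```

Because every element of the pattern is three bases wide, the match length is always a multiple of three, so the tract can be split into triplets without tracking the reading frame separately.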
- Supports both CAG and CTG repeats with triplet-specific logic (regex, counting rule, flank trimming, Type B handling).
- Works on short reads (MiSeq) and long reads (ONT) via a single `--readType` switch with sensible defaults for each.
- Tolerates sequencing errors through configurable Type A (total non-target) and Type B (consecutive non-target) thresholds, with a length-adaptive Type A floor.
- Trims absorbed downstream flank tracts (CCG-like for CAG, CGG-like for CTG).
- Streams FASTQ in chunks — memory-safe for large ONT runs.
- Outputs per-sample CSVs: repeat metrics, reads-per-repeat-length, and rejected-match sizes.
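
To make the Type A and Type B thresholds from the list above concrete, here is a minimal base R sketch. The function name `passes_thresholds` and, in particular, the adaptive rule (the larger of the fixed floor and `rate * tract length`) are assumptions made for illustration; the README does not spell out the exact formula TripletLogger uses.

```r
# Apply Type A (total) and Type B (consecutive) limits to one matched tract.
passes_thresholds <- function(tract, target = "CAG",
                              max_type_a = 5, max_type_b = 1,
                              type_a_rate = 0.05) {
  starts   <- seq(1, nchar(tract), by = 3)
  triplets <- substring(tract, starts, starts + 2)
  is_error <- triplets != target

  type_a <- sum(is_error)                          # total non-target triplets
  runs   <- rle(is_error)
  type_b <- max(c(0, runs$lengths[runs$values]))   # longest run of non-target triplets

  # ASSUMPTION: the length-adaptive limit is the fixed floor or
  # rate * number of triplets, whichever is larger.
  allowed_a <- max(max_type_a, ceiling(type_a_rate * length(triplets)))

  list(type_a = type_a, type_b = type_b, allowed_a = allowed_a,
       accepted = type_a <= allowed_a && type_b <= max_type_b)
}

tract <- paste0(strrep("CAG", 10), "CAA", strrep("CAG", 5))
print(passes_thresholds(tract)$accepted)
```

Under this assumed rule, short tracts are governed by the fixed floor, while long ONT tracts are allowed proportionally more interruptions.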
- R ≥ 4.0
- Bioconductor: `ShortRead`, `Biostrings`
- CRAN: `dplyr`, `optparse`, `tictoc`

Install:

    install.packages(c("dplyr", "optparse", "tictoc"))
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install(c("ShortRead", "Biostrings"))

Usage:

    Rscript TripletLogger_v1_0.R -i <input.fastq.gz> [options]

| Flag | Description |
|---|---|
| `-i`, `--input` | Input FASTQ file (plain or gzipped) |
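
The input may be plain or gzipped, and the tool reads it in chunks rather than loading it whole. The idea can be sketched in base R as below. This is not TripletLogger's reader (the script depends on `ShortRead` for that); `stream_fastq` and `on_chunk` are hypothetical names, and the sketch assumes strict four-line FASTQ records.

```r
# Read a FASTQ file chunk by chunk and hand each chunk to a callback.
stream_fastq <- function(path, chunk_size = 20000L, on_chunk) {
  con <- gzfile(path, open = "rt")   # gzfile() also reads uncompressed files
  on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = 4L * chunk_size)
    if (length(lines) == 0L) break
    n_rec <- length(lines) %/% 4L                 # complete records in this chunk
    idx   <- (seq_len(n_rec) - 1L) * 4L + 1L      # header line of each record
    on_chunk(seqs = lines[idx + 1L], quals = lines[idx + 3L])
    total <- total + n_rec
  }
  total                                           # number of records processed
}

# Small demonstration on a temporary file with five records.
tmp  <- tempfile(fileext = ".fastq")
recs <- unlist(lapply(1:5, function(i) c(paste0("@r", i), "CAGCAG", "+", "IIIIII")))
writeLines(recs, tmp)
n_reads <- stream_fastq(tmp, chunk_size = 2L,
                        on_chunk = function(seqs, quals) invisible(NULL))
print(n_reads)
```

Only one chunk is held in memory at a time, which is what keeps large ONT runs manageable.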

| Flag | Default | Description |
|---|---|---|
| `-o`, `--output` | `.` | Output directory |
| `-t`, `--tripletType` | `CAG` | Repeat type: `CAG` or `CTG` |
| `-r`, `--readType` | `long` | `long` (ONT) or `short` (MiSeq) |
| `-q`, `--qThreshold` | `20` | Minimum mean base quality per read |
| `-s`, `--sampleId` | filename | Override sample ID |
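
As an illustration of the `--qThreshold` filter, the sketch below computes a read's mean base quality from its FASTQ quality string. It assumes Phred+33 encoding, a plain arithmetic mean of the Phred scores, and an inclusive (`>=`) comparison; the function names are hypothetical, and TripletLogger's own calculation may differ in any of these details.

```r
# Mean Phred score of one quality string (Phred+33 encoding assumed).
mean_phred <- function(qual_string, offset = 33L) {
  mean(utf8ToInt(qual_string) - offset)
}

# Keep a read if its mean quality reaches the threshold (inclusive is an assumption).
keep_read <- function(qual_string, q_threshold = 20) {
  mean_phred(qual_string) >= q_threshold
}

print(mean_phred("IIII"))   # 'I' encodes Q40
print(keep_read("####"))    # '#' encodes Q2
```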

| Flag | Default | Description |
|---|---|---|
| `-m`, `--minRepeatLength` | `1` | Minimum repeat length to report |
| `--maxTypeA` | `5` | Fixed floor for Type A errors (total non-target triplets) |
| `--maxTypeB` | `1` (CAG) / `2` (CTG) | Max consecutive non-target triplets |
| `--typeArate` | `0.05` | Per-triplet error tolerance for adaptive Type A |
| `--chunkSize` | `20000` | FASTQ streaming chunk size |
| `--alleleBW` | `1.5` | Kernel density bandwidth for allele calling |
| `--freqRangeMin` | `1` | Minimum repeat length for frequency table |
| `--freqRangeMax` | `1000` | Maximum repeat length (short reads capped at 120) |
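
The role of `--alleleBW` can be illustrated with a small kernel-density peak picker built on base R's `density()`. This is a sketch of the general technique, not TripletLogger's allele caller: the function name `call_alleles`, the relative-height cutoff `min_rel_height`, and the limit of two reported peaks are all assumptions.

```r
# Estimate allele sizes as the tallest local maxima of a Gaussian KDE
# over per-read repeat lengths.
call_alleles <- function(repeat_lengths, bw = 1.5,
                         min_rel_height = 0.1, max_alleles = 2) {
  d <- density(repeat_lengths, bw = bw, n = 2048)
  y <- d$y

  # Interior local maxima: slope changes from positive to negative.
  peaks <- which(diff(sign(diff(y))) == -2) + 1

  # Drop negligible bumps, then rank the rest by height.
  peaks <- peaks[y[peaks] >= min_rel_height * max(y)]
  peaks <- peaks[order(y[peaks], decreasing = TRUE)]

  round(d$x[head(peaks, max_alleles)])
}

lengths <- c(rep(20, 50), 19, 21, rep(45, 30), 44, 46)
print(call_alleles(lengths))
```

A larger bandwidth merges neighbouring stutter peaks into one call, while a smaller one can split a single allele into several peaks, which is why the bandwidth is exposed as an option.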

CAG repeats from a Nanopore run:

    Rscript TripletLogger_v1_0.R -i sample01.fastq.gz -o results/ -t CAG -r long

CTG repeats from a MiSeq run:

    Rscript TripletLogger_v1_0.R -i sample02.fastq.gz -o results/ -t CTG -r short

For each input FASTQ, TripletLogger writes three CSVs to the output directory:

- `<sample>_<TYPE>TRIPLETLogger.v1.0.<readtype>_RepeatMetrics.csv`: summary metrics and estimated allele call(s).
- `<sample>_<TYPE>TRIPLETLogger.v1.0.<readtype>_NumReadsPerRepeat.csv`: raw repeat-length frequency distribution.
- `<sample>_<TYPE>TRIPLETLogger.v1.0.<readtype>_ThresholdRejected.csv`: sizes of matches rejected by error thresholds.
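
For downstream scripting it can help to build the expected output paths from the naming pattern above. The helper below is a hypothetical convenience, not part of TripletLogger, and it assumes that `<TYPE>` is the triplet type (`CAG` or `CTG`) and `<readtype>` is the literal `long` or `short` value passed to `--readType`.

```r
# Build the three expected CSV paths for one sample (naming pattern from the list above).
expected_outputs <- function(sample_id, triplet_type = "CAG",
                             read_type = "long", out_dir = ".") {
  suffixes <- c("RepeatMetrics", "NumReadsPerRepeat", "ThresholdRejected")
  file.path(out_dir,
            sprintf("%s_%sTRIPLETLogger.v1.0.%s_%s.csv",
                    sample_id, triplet_type, read_type, suffixes))
}

paths <- expected_outputs("sample01", "CAG", "long", "results")
print(paths)
print(file.exists(paths))   # check which outputs are already present
```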

Version: v1.0

License: MIT