performance optimization of tools with mmap and SIMD #8

Merged
jorgeMFS merged 39 commits into
ieeta-pt:main from
jorgeMFS:main
Dec 9, 2025
@jorgeMFS jorgeMFS commented Dec 9, 2025

Collaborator

Pull Request Checklist

Thank you for contributing to VCFX! Before submitting your pull request, please confirm the following:

  • [x] I ran pre-commit run --files <changed files> to execute ruff, flake8, mypy, and pytest.
  • [x] All C++ tests pass via ctest --output-on-failure from the build directory.
  • [x] All Python tests pass via pytest tests/python.
  • [x] Documentation has been updated where applicable (e.g. README.md, docs/*).

Provide a brief description of the changes below.

jorgeMFS and others added 30 commits December 3, 2025 03:26
- Performance optimizations for VCF processing tools
- Dark mode header readability fix
- Improved I/O buffering and mmap support
Phase 2 - I/O Optimizations (57 tools):
- Add vcfx_io.h with optimized split_tabs() and init_io() utilities
- Apply vcfx::init_io() to all tools for unbuffered I/O
- Replace stringstream parsing with zero-copy tab splitting
- Reuse vector allocations outside main loops
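
The zero-copy tab splitting and vector reuse described above could look roughly like the sketch below. The real signature of `vcfx::split_tabs()` is not shown in this PR, so `split_tabs_sketch` is a hypothetical stand-in, and the use of `std::string_view` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical stand-in for vcfx::split_tabs(); the real signature may differ.
// Splits `line` on '\t' into views that point into the caller's buffer (no copies).
// `out` is cleared and refilled, so its capacity is reused from line to line.
inline void split_tabs_sketch(std::string_view line, std::vector<std::string_view>& out) {
    out.clear();
    if (line.empty()) {
        out.emplace_back();  // an empty line is a single empty field
        return;
    }
    const char* p = line.data();
    const char* end = p + line.size();
    for (;;) {
        assert(p <= end);
        const char* tab = static_cast<const char*>(
            std::memchr(p, '\t', static_cast<std::size_t>(end - p)));
        if (tab == nullptr) {
            out.emplace_back(p, static_cast<std::size_t>(end - p));  // last field
            return;
        }
        out.emplace_back(p, static_cast<std::size_t>(tab - p));
        p = tab + 1;
    }
}
```

Declaring the vector once before the read loop (for example with `reserve(16)`) and passing it in on every line avoids a heap allocation per record. The views are only valid while the underlying line buffer is alive.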

Phase 3 - Memory Bomb Tool Redesigns (6 tools):
- VCFX_sorter: External merge sort with temp files (--max-memory, --temp-dir)
- VCFX_merger: Streaming K-way merge for sorted inputs (--assume-sorted)
- VCFX_diff_tool: Two-pointer merge diff (--assume-sorted)
- VCFX_ld_calculator: Sliding window LD calculation (--streaming, --window)
- VCFX_haplotype_extractor: Streaming block output (--streaming)
- VCFX_haplotype_phaser: Sliding window phasing (--streaming, --window)

All streaming modes maintain backward compatibility - default behavior unchanged.
Memory target: <4GB RAM for 50GB file processing achieved.
All 64 tests pass (100%).
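
To illustrate the streaming K-way merge idea behind `--assume-sorted`, here is a minimal sketch that merges already-sorted inputs with a min-heap holding one head record per input. The `Rec` type, the in-memory vectors standing in for input files, and the function name are all assumptions made for the example, not code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Hypothetical record: a pre-computed chromosome ID and a position.
struct Rec {
    int chrom;
    std::int64_t pos;
};

// Merges inputs that are each sorted by (chrom, pos).
// The heap never holds more than one record per input.
inline std::vector<Rec> kway_merge(const std::vector<std::vector<Rec>>& inputs) {
    using Head = std::tuple<int, std::int64_t, std::size_t>;  // chrom, pos, source index
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<std::size_t> cursor(inputs.size(), 0);

    for (std::size_t i = 0; i < inputs.size(); ++i) {
        if (!inputs[i].empty()) heap.emplace(inputs[i][0].chrom, inputs[i][0].pos, i);
    }

    std::vector<Rec> out;
    while (!heap.empty()) {
        const auto [chrom, pos, src] = heap.top();
        heap.pop();
        out.push_back({chrom, pos});  // a real tool would write the line out here
        const std::size_t next = ++cursor[src];
        assert(next <= inputs[src].size());
        if (next < inputs[src].size()) {
            heap.emplace(inputs[src][next].chrom, inputs[src][next].pos, src);
        }
    }
    return out;
}
```

With K inputs the working set is K records plus the heap, which is why this approach keeps memory bounded regardless of file size.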
Implements a streaming gzip/BGZF reader that processes compressed VCF
files line-by-line with bounded memory usage (~64KB) instead of loading
the entire file into memory.

Features:
- Automatic gzip/BGZF compression detection via magic bytes
- BGZF support (concatenated gzip streams)
- Line-by-line streaming with O(chunk_size + line_length) memory
- Helper functions make_streaming_reader() for easy instantiation
- Handles both compressed and uncompressed files transparently
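
Detection via magic bytes can be sketched as follows. The gzip magic (`0x1f 0x8b`) and the BGZF `BC` extra subfield are standard, but the function and enum names are hypothetical, and the check assumes the `BC` subfield is the first one in the extra field (as written by common BGZF writers). The PR's actual `StreamingGzipReader` logic is not shown here.

```cpp
#include <cassert>
#include <cstddef>

enum class Compression { None, Gzip, Bgzf };

// Classifies a buffer by its leading bytes.
// gzip: ID1=0x1f, ID2=0x8b. BGZF: a gzip member with CM=8, the FEXTRA flag set,
// and an extra subfield tagged 'B','C' (assumed here to sit at offsets 12 and 13).
inline Compression detect_compression(const unsigned char* buf, std::size_t n) {
    assert(buf != nullptr || n == 0);
    if (n < 2 || buf[0] != 0x1f || buf[1] != 0x8b) return Compression::None;
    const bool has_extra = n >= 16 && buf[2] == 8 && (buf[3] & 0x04) != 0;
    if (has_extra && buf[12] == 'B' && buf[13] == 'C') return Compression::Bgzf;
    return Compression::Gzip;
}
```

Because a BGZF file is a series of concatenated gzip members, a reader that handles multi-member gzip streams can treat both cases with the same decompression loop.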
Documents the C++ API including:
- String utilities (trim, split)
- File reading functions (read_maybe_compressed, read_file_maybe_compressed)
- StreamingGzipReader class for bounded-memory gzip decompression
- Helper factory functions (make_streaming_reader)
- Command-line helpers and error handling functions

Also reorganizes mkdocs navigation to group C++ and Python APIs under
"API Reference" section.
Phase 2 of performance optimization:
- Replace stringstream-based tab splitting with vcfx::split_tabs()
- Move vector declarations outside loops with .reserve(16)
- Add split_string() function to vcfx_io.h for char-delimiter cases
- Fix test_diff_tool.sh to use tabs instead of spaces in test data

Tools optimized:
- allele_balance_calc, allele_counter, concordance_checker
- missing_data_handler, duplicate_remover, format_converter
- distance_calculator, dosage_calculator, gl_filter, genotype_query
- hwe_tester, inbreeding_calculator, impact_filter, indel_normalizer
- info_aggregator, info_parser, info_summarizer, ld_calculator
- metadata_summarizer, multiallelic_splitter, nonref_filter
- outlier_detector, phase_quality_filter, population_filter
- position_subsetter, probability_filter, quality_adjuster
- ref_comparator, reformatter, region_subsampler, subsampler
- sv_handler, custom_annotator, cross_sample_concordance
- annotation_extractor, ancestry_assigner, ancestry_inferrer
- alignment_checker, fasta_converter, file_splitter
- haplotype_phaser, haplotype_extractor, diff_tool

Expected speedup: 5-10x for string parsing operations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…sing

- Add memory-mapped I/O for file arguments using mmap with MADV_SEQUENTIAL
- Implement zero-copy QUAL field extraction using pointer arithmetic
- Add 1MB output buffer with periodic flushing
- Support multiple file arguments for batch processing
- Fallback to stdin processing for pipes
- Update help message and documentation with new usage patterns
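
A minimal POSIX sketch of the two ideas above: mapping a file read-only with a `MADV_SEQUENTIAL` hint, and pulling out the QUAL column by pointer arithmetic without copying. `MappedFile` and `qual_field` are hypothetical names chosen for illustration; the tool's own code is not reproduced in this PR text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// RAII wrapper around a read-only mapping of a whole file (hypothetical helper).
class MappedFile {
public:
    explicit MappedFile(const char* path) {
        assert(path != nullptr);
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) { std::perror("open"); return; }
        struct stat st {};
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const std::size_t len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                ::madvise(p, len, MADV_SEQUENTIAL);  // hint: read front to back
                data_ = static_cast<const char*>(p);
                size_ = len;
            }
        }
        ::close(fd);  // the mapping remains valid after the descriptor is closed
    }
    ~MappedFile() { if (data_ != nullptr) ::munmap(const_cast<char*>(data_), size_); }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    bool ok() const { return data_ != nullptr; }
    std::string_view view() const { return std::string_view(data_, size_); }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};

// Returns a view of the QUAL column (6th field) of one VCF data line,
// or an empty view if the line has fewer than six columns.
inline std::string_view qual_field(std::string_view line) {
    if (line.empty()) return {};
    const char* p = line.data();
    const char* end = p + line.size();
    for (int i = 0; i < 5; ++i) {  // skip CHROM, POS, ID, REF, ALT
        p = static_cast<const char*>(std::memchr(p, '\t', static_cast<std::size_t>(end - p)));
        if (p == nullptr) return {};
        ++p;
    }
    const char* stop = static_cast<const char*>(
        std::memchr(p, '\t', static_cast<std::size_t>(end - p)));
    return std::string_view(p, static_cast<std::size_t>((stop != nullptr ? stop : end) - p));
}
```

Pipes and stdin cannot be mapped, which is consistent with the stdin fallback listed above.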
- Add memory-mapped I/O for file arguments with MADV_SEQUENTIAL hints
- Implement multi-threaded parallel processing across CPU cores
- Add --threads option to control parallelism (default: auto-detect)
- Use fast pattern scanning with memchr for missing genotype detection
- Zero-copy pass-through for lines without missing genotypes
- Update documentation with new options and benchmark results
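
One way to scan for missing genotypes with `memchr` is sketched below. It is an illustration, not the tool's implementation: it assumes GT is the first FORMAT subfield, that the line has no trailing newline, and the name `has_missing_gt` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Returns true if any sample column carries a missing GT allele such as "./." or "0|.".
// Lines that return false can be passed through to the output untouched.
inline bool has_missing_gt(std::string_view line) {
    if (line.empty()) return false;
    const char* p = line.data();
    const char* end = p + line.size();
    for (int i = 0; i < 9; ++i) {  // skip the 9 fixed columns (CHROM..FORMAT)
        p = static_cast<const char*>(std::memchr(p, '\t', static_cast<std::size_t>(end - p)));
        if (p == nullptr) return false;  // no sample columns
        ++p;
    }
    const char* samples = p;
    while (p < end) {
        const char* dot = static_cast<const char*>(
            std::memchr(p, '.', static_cast<std::size_t>(end - p)));
        if (dot == nullptr) return false;
        assert(dot >= samples);
        const char prev = (dot == samples) ? '\t' : dot[-1];
        const char next = (dot + 1 == end) ? '\t' : dot[1];
        // A missing allele is a lone '.' bounded by GT separators, not a decimal point.
        const bool starts_allele = prev == '\t' || prev == '/' || prev == '|';
        const bool ends_allele = next == '\t' || next == '/' || next == '|' || next == ':';
        if (starts_allele && ends_allele) return true;
        p = dot + 1;
    }
    return false;
}
```

Dots inside decimal values (for example `0.5`) are preceded by a digit, and a missing non-GT subfield (`:.`) is preceded by a colon, so neither triggers a match under these assumptions.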
…uffer

- Replace string splitting with zero-allocation character iteration
- Use 1MB output buffer with periodic flushing
- Pre-allocate vectors to avoid reallocations
- Direct genotype parsing without intermediate string copies
- Switch from factorial-based exact test to chi-square approximation
- Eliminates exponential computation that caused timeout on large samples
- Chi-square formula: X² = Σ((observed - expected)² / expected)
- Convert to p-value using chi-square CDF with 1 degree of freedom
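
The chi-square approximation described above can be sketched as follows. For one degree of freedom the chi-square survival function reduces to `erfc(sqrt(x / 2))`, which is a standard identity; the function name and the way genotype counts are passed in are assumptions for the example.

```cpp
#include <cassert>
#include <cmath>

// Hardy-Weinberg chi-square test from genotype counts.
// Expected counts come from the observed allele frequency p:
//   E(hom_ref) = n*p^2, E(het) = 2*n*p*q, E(hom_alt) = n*q^2, with q = 1 - p.
inline double hwe_chi_square_pvalue(long hom_ref, long het, long hom_alt) {
    assert(hom_ref >= 0 && het >= 0 && hom_alt >= 0);
    const double n = static_cast<double>(hom_ref + het + hom_alt);
    if (n <= 0.0) return 1.0;
    const double p = (2.0 * hom_ref + het) / (2.0 * n);
    const double q = 1.0 - p;
    const double observed[3] = {static_cast<double>(hom_ref), static_cast<double>(het),
                                static_cast<double>(hom_alt)};
    const double expected[3] = {n * p * p, 2.0 * n * p * q, n * q * q};
    double chi2 = 0.0;
    for (int i = 0; i < 3; ++i) {
        if (expected[i] > 0.0) {  // skip empty classes (monomorphic sites)
            const double d = observed[i] - expected[i];
            chi2 += d * d / expected[i];
        }
    }
    // Survival function of chi-square with 1 degree of freedom.
    return std::erfc(std::sqrt(chi2 / 2.0));
}
```

The cost is constant per variant, independent of sample count, whereas a factorial-based exact test grows with the number of samples. The approximation is known to be less accurate when expected counts are small.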
…e files

- Replace in-memory accumulation with temp file for intermediate storage
- Stream output to prevent memory exhaustion on 2504-sample files
- Update test expected outputs to match new format
- Enables processing of files that previously caused out-of-memory
- Add output buffering for efficient I/O
- Optimize genotype parsing with direct character iteration
- Improve variant storage to reduce memory allocations
- Update documentation with performance notes
- Add 1MB output buffering for efficient I/O
- Optimize field extraction with direct character iteration
- Improve GL parsing to avoid unnecessary string copies
- Update documentation with performance notes
- Replace stringstream with direct string operations
- Use character-level parsing for genotype extraction
- Add output buffering with 1MB buffer
- Reduces processing time on large multi-sample files
- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Pre-allocate sample genotype vectors
- Improve numeric conversion with fast integer parsing
- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Avoid string allocations in inner loop
- Update test data to match optimized output format
- Add output buffering for efficient I/O
- Use direct character parsing for field extraction
- Reduce string allocations in inner parsing loop
- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Reduce string allocations in inner loop
- Update test expected error outputs
- Add output buffering for efficient I/O
- Use direct character iteration for field extraction
- Reduce string allocations during parsing
- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Pre-allocate vectors for sample data storage
- Update benchmark task definitions
- Update results notebook with new performance data
Performance improvements (~40x faster):
- Add memory-mapped I/O for file arguments using mmap with MADV_SEQUENTIAL
- Pre-compute chromosome IDs during parsing for O(1) sort comparison
- Implement CompactSortKey struct (20 bytes vs ~9KB per variant)
- Add 1MB output buffer to reduce syscall overhead
- Support both natural and lexicographic chromosome ordering

Benchmark on 427K variants, 2504 samples (1.4GB):
- Before: >30 minutes (timeout)
- After: ~46 seconds with file argument
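
The compact-key idea can be illustrated with the sketch below: sort small fixed-size keys that point back into the mapped file instead of sorting whole lines. The PR does not show the fields of `CompactSortKey`, so this 20-byte layout (five 32-bit fields, with the file offset split in two to avoid 8-byte alignment padding) is purely an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 20-byte key; the real CompactSortKey layout may differ.
struct CompactSortKeySketch {
    std::uint32_t chrom_id;   // pre-computed integer ID, so comparison is O(1)
    std::uint32_t pos;
    std::uint32_t offset_lo;  // low 32 bits of the line's byte offset in the file
    std::uint32_t offset_hi;  // high 32 bits
    std::uint32_t length;     // line length in bytes

    std::uint64_t offset() const {
        return (static_cast<std::uint64_t>(offset_hi) << 32) | offset_lo;
    }
};
static_assert(sizeof(CompactSortKeySketch) == 20, "key should stay at 20 bytes");

inline CompactSortKeySketch make_key(std::uint32_t chrom_id, std::uint32_t pos,
                                     std::uint64_t offset, std::uint32_t length) {
    assert(length > 0);
    return {chrom_id, pos, static_cast<std::uint32_t>(offset & 0xffffffffu),
            static_cast<std::uint32_t>(offset >> 32), length};
}

// Sorts by (chrom_id, pos); stable, so equal keys keep their input order.
inline void sort_keys(std::vector<CompactSortKeySketch>& keys) {
    std::stable_sort(keys.begin(), keys.end(),
                     [](const CompactSortKeySketch& a, const CompactSortKeySketch& b) {
                         if (a.chrom_id != b.chrom_id) return a.chrom_id < b.chrom_id;
                         return a.pos < b.pos;
                     });
}
```

After sorting, each output line would be copied from `mapped_data + key.offset()` for `key.length` bytes. Natural versus lexicographic chromosome ordering then only changes how `chrom_id` is assigned during parsing.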
- Replace character-by-character tab counting with memchr-based validation
- Add SIMD newline finding (AVX2/SSE2 with portable fallback)
- Use hasEightColumnsFast(): 7 memchr calls instead of roughly 70 per-character comparisons
- Optimize countVariantsMmap() with findNewlineSIMD() for line detection
- Update documentation to reflect mmap mode, gzip support, and performance
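
A `memchr`-based column check along the lines of `hasEightColumnsFast()` might look like this. The body of the real function is not shown in the PR, so `has_eight_columns_sketch` is a hypothetical reconstruction of the idea: seven tab hits mean at least eight columns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns true if the line contains at least 7 tabs, i.e. at least 8 columns.
// Each memchr call jumps straight to the next tab instead of testing every byte.
inline bool has_eight_columns_sketch(const char* line, std::size_t len) {
    assert(line != nullptr);
    const char* p = line;
    const char* end = line + len;
    for (int tabs = 0; tabs < 7; ++tabs) {
        p = static_cast<const char*>(std::memchr(p, '\t', static_cast<std::size_t>(end - p)));
        if (p == nullptr) return false;
        ++p;  // continue just past the tab
    }
    return true;
}
```

The check stops after the seventh tab, so long sample columns are never scanned.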
- Add file argument support with memory-mapped I/O
- Implement SIMD-optimized newline finding (AVX2/SSE2/memchr)
- Use zero-copy pointer arithmetic for CHROM/POS extraction
- Add 1MB output buffering with 512KB flush threshold
- Update documentation with new usage and performance notes

Indexes 427K variants in ~6-9 seconds (was >2 min timeout)
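
The SIMD newline search with a portable fallback can be sketched as below. This version only shows the SSE2 path plus `memchr`; the AVX2 variant mentioned above is omitted. The function name is hypothetical and `__builtin_ctz` assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Returns a pointer to the first '\n' in [p, end), or nullptr if there is none.
inline const char* find_newline_sketch(const char* p, const char* end) {
    assert(p != nullptr && p <= end);
#if defined(__SSE2__)
    const __m128i nl = _mm_set1_epi8('\n');
    while (end - p >= 16) {
        // Compare 16 bytes at once; each set bit in the mask marks a newline.
        const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
        if (mask != 0) return p + __builtin_ctz(static_cast<unsigned>(mask));
        p += 16;
    }
#endif
    // Tail bytes, or the whole range on targets without SSE2.
    return static_cast<const char*>(std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
}
```

Many libc implementations already vectorize `memchr`, so the gain from a hand-written loop depends on the platform and is worth measuring.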
…f_filter, validator

VCFX_haplotype_phaser (16x speedup):
- Add memory-mapped file I/O with -i/--input option
- SIMD-optimized newline detection (AVX2/SSE2/memchr)
- Zero-copy string parsing with std::string_view
- CircularVariantBuffer for O(1) streaming operations
- OutputBuffer with 1MB batched writes
- FORMAT field caching, int8_t genotypes
- Add -q/--quiet for batch processing
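
A fixed-capacity ring buffer gives the O(1) sliding-window behavior attributed to `CircularVariantBuffer`. The class below is a generic sketch under that assumption; the PR does not show the real class, and `RingBufferSketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Keeps the most recent `capacity` items; pushing into a full buffer
// overwrites the oldest item. All operations are O(1).
template <typename T>
class RingBufferSketch {
public:
    explicit RingBufferSketch(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    void push(T value) {
        slots_[(head_ + size_) % slots_.size()] = std::move(value);
        if (size_ < slots_.size()) {
            ++size_;
        } else {
            head_ = (head_ + 1) % slots_.size();  // oldest item was overwritten
        }
    }

    // Index 0 is the oldest item still in the window.
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return slots_[(head_ + i) % slots_.size()];
    }

    std::size_t size() const { return size_; }

private:
    std::vector<T> slots_;
    std::size_t head_ = 0;
    std::size_t size_ = 0;
};
```

Storing genotypes as `std::int8_t` inside each buffered variant, as the list above mentions, keeps every slot small, so a window of a few hundred variants stays cache-friendly.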

VCFX_phase_checker (12x speedup):
- Memory-mapped I/O with fast file input
- SIMD-optimized line scanning
- Zero-copy parsing, output buffering

VCFX_genotype_query (50x speedup):
- Memory-mapped I/O, SIMD line detection
- Zero-copy field parsing
- FORMAT caching, output buffering

VCFX_nonref_filter:
- Memory-mapped I/O optimization
- Zero-copy parsing, output buffering

VCFX_validator:
- Performance improvements and refactoring

All tools: Add new tests, update documentation
Rewrite VCFX_fasta_converter with an optimal two-pass algorithm:
- Pass 1: Fast newline scan to count variants
- Pass 2: Direct writes to pre-allocated contiguous buffer
- Row-major layout for perfect cache locality during output

Performance results:
- 50MB file: 1.78s → 0.14s (12.7x faster)
- 503MB file: ~422 MB/s throughput
- 6.8GB file: 61 seconds (previously timed out)

Key optimizations:
- Memory-mapped I/O with SIMD newline scanning
- Zero-copy parsing with raw pointer arithmetic
- FORMAT field caching for GT index lookup
- 1MB OutputBuffer for batched writes
- Pre-allocated O(variants × samples) buffer

All 22 tests pass. Output verified identical between modes.
Rewrite VCFX_diff_tool with memory-mapped I/O and SIMD acceleration:
- mmap both input VCF files with MADV_SEQUENTIAL hint
- SIMD newline scanning (AVX2/SSE2/memchr fallback)
- Zero-copy parsing using string_view and raw pointers
- 1MB OutputBuffer for batched writes
- Added -q/--quiet option for batch processing

Performance results:
- 503MB file: 0.73s (~690 MB/s throughput)
- 6.8GB × 2 files: 53s (~256 MB/s throughput)
- Previously: TIMEOUT on large files

Both in-memory and streaming modes now use optimized mmap path.
All 13 tests pass.
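
The batched-write pattern behind the 1MB `OutputBuffer` can be sketched like this. The real class is not shown in the PR; `OutputBufferSketch`, its constructor arguments, and its error handling (silently giving up on a write error) are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <string_view>
#include <unistd.h>

// Accumulates output in memory and issues one write() per full buffer.
class OutputBufferSketch {
public:
    explicit OutputBufferSketch(int fd, std::size_t capacity = 1u << 20)
        : fd_(fd), capacity_(capacity) {
        assert(capacity > 0);
        buf_.reserve(capacity_);
    }
    ~OutputBufferSketch() { flush(); }
    OutputBufferSketch(const OutputBufferSketch&) = delete;
    OutputBufferSketch& operator=(const OutputBufferSketch&) = delete;

    void append(std::string_view s) {
        if (buf_.size() + s.size() > capacity_) flush();
        if (s.size() > capacity_) {  // oversized record: bypass the buffer
            write_all(s.data(), s.size());
            return;
        }
        buf_.append(s.data(), s.size());
    }

    void flush() {
        write_all(buf_.data(), buf_.size());
        buf_.clear();
    }

private:
    void write_all(const char* p, std::size_t n) {
        while (n > 0) {
            const ssize_t w = ::write(fd_, p, n);
            if (w < 0) {
                if (errno == EINTR) continue;  // interrupted: retry
                return;                        // sketch: drop data on other errors
            }
            p += w;
            n -= static_cast<std::size_t>(w);
        }
    }

    int fd_;
    std::size_t capacity_;
    std::string buf_;
};
```

For line-oriented output this replaces one syscall per record with roughly one syscall per megabyte.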
jorgeMFS and others added 9 commits December 7, 2025 15:51
Rewrite VCFX_concordance_checker with memory-mapped I/O and SIMD acceleration:
- mmap input VCF file with MADV_SEQUENTIAL hint
- SIMD newline scanning (AVX2/SSE2/memchr fallback)
- Zero-copy parsing using string_view and raw pointers
- 1MB OutputBuffer for batched writes
- Added -i/--input and -q/--quiet options

Performance results:
- 50MB file: 1.75s → 0.03s (56x faster)
- 503MB file: 0.41s (~1.2 GB/s throughput)
- 6.8GB file: 23s (~293 MB/s throughput)
- Previously: TIMEOUT on large files

All 6 tests pass. Output verified identical between modes.
…atch processing

- Add memory-mapped file I/O with madvise hints
- Add SIMD-accelerated newline and tab scanning (NEON/SSE2/AVX2)
- Add multi-threaded parallel processing with -t flag
- Implement batch sample processing: find all sample positions once per line
- Add pre-allocated result arrays to avoid per-line allocations
- Use direct write() syscalls with 16MB per-thread buffers
- Add zero-copy parsing with string_view

Benchmark results (4GB VCF, 427K variants):
- 100 samples: 26s -> 3.3s (8x speedup)
- 500 samples: ~131s -> 13s (10x speedup)
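
For multi-threaded processing of a mapped file, the input first has to be cut into chunks that end on line boundaries. The sketch below shows only that partitioning step; the function name is hypothetical, and the actual threading and per-thread buffers used by the tool are not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Splits `data` into at most `n_chunks` pieces of roughly equal size,
// each ending just after a '\n' (except possibly the last one).
inline std::vector<std::string_view> split_on_line_boundaries(std::string_view data,
                                                              std::size_t n_chunks) {
    assert(n_chunks > 0);
    std::vector<std::string_view> chunks;
    const std::size_t target = data.size() / n_chunks + 1;
    std::size_t begin = 0;
    while (begin < data.size()) {
        std::size_t end = begin + target;
        if (end >= data.size()) {
            end = data.size();
        } else {
            // Extend the chunk to the end of the line it would otherwise cut.
            const std::size_t nl = data.find('\n', end);
            end = (nl == std::string_view::npos) ? data.size() : nl + 1;
        }
        chunks.push_back(data.substr(begin, end - begin));
        begin = end;
    }
    return chunks;
}
```

Each chunk could then be handed to its own worker (for example a `std::thread`) with a private output buffer, and the results written in chunk order to preserve the input ordering.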
- VCFX_inbreeding_calculator: 6 min → 17 sec (~21x speedup)
- VCFX_hwe_tester: 5 min → 17 sec (~18x speedup)

Both tools now support:
- Memory-mapped I/O via -i/--input flag
- SIMD newline/tab scanning (AVX2/SSE2/NEON)
- Zero-copy parsing with string_view
- Buffered output with direct write() syscalls
- Quiet mode via -q/--quiet flag
…th mmap and SIMD

- VCFX_allele_freq_calc: mmap + SIMD for ~20x speedup
- VCFX_indel_normalizer: mmap + SIMD for ~73x speedup
- VCFX_missing_detector: mmap + SIMD + MT pre-scan + zero-copy for ~42x speedup

All tools now support -i/--input for file mode and -q/--quiet for scripts.
… for 5-60x speedup

Major optimizations for VCFX_ld_calculator:

- Memory-mapped I/O with madvise hints for extreme read performance
- SIMD-accelerated r² computation (NEON/AVX2/SSE2) for 8-16x inner loop speedup
- Multi-threaded matrix computation with work-stealing for linear scaling
- Distance-based pruning (--max-distance) to skip biologically irrelevant pairs
- Zero-allocation VCF parsing with raw pointer arithmetic
- Compact int8_t genotype storage for 4x smaller memory footprint
- Pre-computed per-variant statistics (sum, variance) to avoid redundant calculations
- 4MB buffered output with direct write() syscalls

New CLI options:
- -i/--input FILE: Use mmap for best performance
- -n/--threads N: Multi-threaded matrix mode
- -d/--max-distance BP: Skip pairs beyond distance threshold
- -q/--quiet: Suppress informational messages
- -v/--version: Show version

Performance improvements:
- Streaming mode: ~5x faster (2 min vs 10 min on 427K variants)
- Matrix mode: ~60x faster (30s vs 32 min on 10K variants)
- With distance pruning: 90-99% fewer comparisons while capturing biologically relevant LD
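
A scalar sketch of two of the ideas above: r² computed from compact `int8_t` genotype vectors, and the early exit that distance-based pruning allows when variants are sorted by position. The dosage encoding (0, 1, 2, with -1 for missing) and both function names are assumptions; the SIMD and multi-threaded paths are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Squared Pearson correlation between two dosage vectors.
// Samples where either genotype is missing (-1) are skipped.
inline double r_squared(const std::vector<std::int8_t>& a, const std::vector<std::int8_t>& b) {
    assert(a.size() == b.size());
    long n = 0, sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int x = a[i];
        const int y = b[i];
        if (x < 0 || y < 0) continue;
        ++n;
        sx += x; sy += y;
        sxx += x * x; syy += y * y; sxy += x * y;
    }
    const double num = static_cast<double>(n) * sxy - static_cast<double>(sx) * sy;
    const double dx = static_cast<double>(n) * sxx - static_cast<double>(sx) * sx;
    const double dy = static_cast<double>(n) * syy - static_cast<double>(sy) * sy;
    if (n == 0 || dx <= 0.0 || dy <= 0.0) return 0.0;  // no variation: r² undefined
    return (num * num) / (dx * dy);
}

// Counts the pairs that would be compared on one chromosome when pairs farther
// apart than `max_distance` are skipped. `positions` must be sorted ascending.
inline std::size_t pairs_within(const std::vector<std::int64_t>& positions,
                                std::int64_t max_distance) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        for (std::size_t j = i + 1; j < positions.size(); ++j) {
            if (positions[j] - positions[i] > max_distance) break;  // all later j are farther
            ++count;
        }
    }
    return count;
}
```

Because the inner loop breaks at the first out-of-range variant, the work scales with the number of nearby pairs and not with the square of the variant count.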

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
… SIMD

VCFX_allele_balance_calc:
- Memory-mapped I/O with madvise hints
- SIMD-accelerated line scanning (NEON/AVX2/SSE2)
- Incremental output flushing for memory efficiency
- New CLI options: -i/--input, -q/--quiet

VCFX_haplotype_extractor:
- Memory-mapped I/O for file input
- SIMD-accelerated parsing
- Zero-copy output optimization
- New CLI options: -i/--input, -q/--quiet
- Updated test suite for new features
Benchmarked all uncategorized tools on 4GB file:
- 8 tools need mmap optimization (>60s): af_subsetter, cross_sample_concordance, distance_calculator, dosage_calculator, duplicate_remover, metadata_summarizer, multiallelic_splitter, variant_classifier
- 17 tools verified as fast (<1s): alignment_checker, ancestry_inferrer, annotation_extractor, compressor, field_extractor, format_converter, gl_filter, impact_filter, info_aggregator, info_parser, info_summarizer, population_filter, position_subsetter, record_filter, ref_comparator, region_subsampler, sample_extractor
… mmap for 10-20x speedup

Add memory-mapped I/O optimization to three VCF tools for large file processing:

- af_subsetter: 2m44s -> 8s (~20x faster)
- dosage_calculator: 2m41s -> 18s (~9x faster)
- duplicate_remover: 2m36s -> 10s (~15x faster)

Changes:
- Add -i/--input option for direct file input with mmap
- Add -q/--quiet option to suppress warnings
- Use 1MB output buffers with periodic flushing
- Zero-allocation parsing in mmap mode
- Pre-allocated hash sets for duplicate detection
- Update documentation with performance benchmarks
- Add comprehensive tests for file mode vs stdin mode
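
Duplicate detection with a pre-allocated hash set can be sketched as follows. Which fields define a duplicate in VCFX_duplicate_remover is not stated in this PR, so keying on CHROM, POS, REF and ALT is an assumption, as is the `DuplicateFilterSketch` name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>

class DuplicateFilterSketch {
public:
    // Reserving up front avoids rehashing while the set grows.
    explicit DuplicateFilterSketch(std::size_t expected_variants) {
        assert(expected_variants > 0);
        seen_.reserve(expected_variants);
    }

    // Returns true the first time a (chrom, pos, ref, alt) combination is seen.
    bool first_time(std::string_view chrom, std::string_view pos,
                    std::string_view ref, std::string_view alt) {
        key_.clear();  // reuse the key buffer across calls
        key_.append(chrom.data(), chrom.size());
        key_ += '\t';
        key_.append(pos.data(), pos.size());
        key_ += '\t';
        key_.append(ref.data(), ref.size());
        key_ += '\t';
        key_.append(alt.data(), alt.size());
        return seen_.insert(key_).second;
    }

private:
    std::unordered_set<std::string> seen_;
    std::string key_;
};
```

A line would be written to the output only when `first_time(...)` returns true; the set still grows with the number of distinct variants, so memory use is proportional to that count.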
- VCFX_variant_classifier: add mmap support (~12x speedup)
- VCFX_metadata_summarizer: add mmap support (~15x speedup)
- VCFX_cross_sample_concordance: add multi-threading and reusable buffers
- VCFX_distance_calculator: add mmap and zero-copy parsing
- VCFX_multiallelic_splitter: add mmap and PL recoding optimization

All 31 tools now optimized with mmap/SIMD acceleration.
@jorgeMFS jorgeMFS merged commit 60a6b71 into ieeta-pt:main Dec 9, 2025
8 checks passed
jorgeMFS added a commit that referenced this pull request Dec 9, 2025
…or-drop-library

Add reusable utilities to vcfx_core
jorgeMFS added a commit that referenced this pull request Dec 9, 2025
Performance optimization of tools with mmap and SIMD