performance optimization of tools with mmap and SIMD by jorgeMFS · Pull Request #8 · ieeta-pt/VCFX

jorgeMFS · 2025-12-09T03:01:05Z

Pull Request Checklist

Thank you for contributing to VCFX! Before submitting your pull request, please confirm the following:

[x] I ran pre-commit run --files <changed files> to execute ruff, flake8, mypy, and pytest.
[x] All C++ tests pass via ctest --output-on-failure from the build directory.
[x] All Python tests pass via pytest tests/python.
[x] Documentation has been updated where applicable (e.g. README.md, docs/*).

Provide a brief description of the changes below.

- Performance optimizations for VCF processing tools
- Dark mode header readability fix
- Improved I/O buffering and mmap support

Phase 2 - I/O Optimizations (57 tools):
- Add vcfx_io.h with optimized split_tabs() and init_io() utilities
- Apply vcfx::init_io() to all tools for unbuffered I/O
- Replace stringstream parsing with zero-copy tab splitting
- Reuse vector allocations outside main loops

Phase 3 - Memory Bomb Tool Redesigns (6 tools):
- VCFX_sorter: External merge sort with temp files (--max-memory, --temp-dir)
- VCFX_merger: Streaming K-way merge for sorted inputs (--assume-sorted)
- VCFX_diff_tool: Two-pointer merge diff (--assume-sorted)
- VCFX_ld_calculator: Sliding window LD calculation (--streaming, --window)
- VCFX_haplotype_extractor: Streaming block output (--streaming)
- VCFX_haplotype_phaser: Sliding window phasing (--streaming, --window)

All streaming modes maintain backward compatibility; default behavior is unchanged.
Memory target: <4GB RAM for 50GB file processing achieved.
All 64 tests pass (100%).

Implements a streaming gzip/BGZF reader that processes compressed VCF files line-by-line with bounded memory usage (~64KB) instead of loading the entire file into memory.

Features:
- Automatic gzip/BGZF compression detection via magic bytes
- BGZF support (concatenated gzip streams)
- Line-by-line streaming with O(chunk_size + line_length) memory
- Helper functions make_streaming_reader() for easy instantiation
- Handles both compressed and uncompressed files transparently
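Detection via magic bytes can be sketched as below. The function names are hypothetical and not taken from the PR. The gzip check follows the RFC 1952 header (0x1f 0x8b); the BGZF check assumes the 'B','C' extra subfield is the first one in the header, as in typical bgzip output.

```cpp
#include <cstddef>

// True if the buffer starts with the gzip magic bytes 0x1f 0x8b.
inline bool looks_like_gzip(const unsigned char* data, std::size_t n) {
    return n >= 2 && data[0] == 0x1f && data[1] == 0x8b;
}

// True if the buffer starts with a BGZF block header: a gzip member using
// deflate (CM == 8) with FLG.FEXTRA set and a 'B','C' extra subfield.
// Assumption: the 'BC' subfield sits directly after XLEN, at offset 12.
inline bool looks_like_bgzf(const unsigned char* data, std::size_t n) {
    return n >= 18 && looks_like_gzip(data, n) && data[2] == 8 &&
           (data[3] & 0x04) != 0 && data[12] == 'B' && data[13] == 'C';
}
```

A reader built on checks like these can peek at the first bytes of a file and fall back to plain-text line reading when neither one matches.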

Documents the C++ API including:
- String utilities (trim, split)
- File reading functions (read_maybe_compressed, read_file_maybe_compressed)
- StreamingGzipReader class for bounded-memory gzip decompression
- Helper factory functions (make_streaming_reader)
- Command-line helpers and error handling functions

Also reorganizes mkdocs navigation to group C++ and Python APIs under "API Reference" section.

Phase 2 of performance optimization:
- Replace stringstream-based tab splitting with vcfx::split_tabs()
- Move vector declarations outside loops with .reserve(16)
- Add split_string() function to vcfx_io.h for char-delimiter cases
- Fix test_diff_tool.sh to use tabs instead of spaces in test data

Tools optimized:
- allele_balance_calc, allele_counter, concordance_checker
- missing_data_handler, duplicate_remover, format_converter
- distance_calculator, dosage_calculator, gl_filter, genotype_query
- hwe_tester, inbreeding_calculator, impact_filter, indel_normalizer
- info_aggregator, info_parser, info_summarizer, ld_calculator
- metadata_summarizer, multiallelic_splitter, nonref_filter
- outlier_detector, phase_quality_filter, population_filter
- position_subsetter, probability_filter, quality_adjuster
- ref_comparator, reformatter, region_subsampler, subsampler
- sv_handler, custom_annotator, cross_sample_concordance
- annotation_extractor, ancestry_assigner, ancestry_inferrer
- alignment_checker, fasta_converter, file_splitter
- haplotype_phaser, haplotype_extractor, diff_tool

Expected speedup: 5-10x for string parsing operations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
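The PR does not show the signature of vcfx::split_tabs(), so the following is only a sketch of what zero-copy tab splitting with a reused vector can look like. The name `split_tabs_sketch` and its parameters are assumptions. Each field is a `std::string_view` into the original line, so nothing is copied, and `clear()` keeps the vector's capacity between lines.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a zero-copy tab splitter.
// The views in `out` point into `line` and are valid only while `line` is alive.
inline void split_tabs_sketch(std::string_view line, std::vector<std::string_view>& out) {
    out.clear();  // keeps capacity, so a vector declared outside the loop is reused
    std::size_t start = 0;
    while (true) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string_view::npos) {
            out.push_back(line.substr(start));  // last field
            break;
        }
        out.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
}
```

A caller would declare the vector once before its read loop, call `reserve(16)` on it as the commit describes, and pass it to every call.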

…sing
- Add memory-mapped I/O for file arguments using mmap with MADV_SEQUENTIAL
- Implement zero-copy QUAL field extraction using pointer arithmetic
- Add 1MB output buffer with periodic flushing
- Support multiple file arguments for batch processing
- Fallback to stdin processing for pipes
- Update help message and documentation with new usage patterns
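As a rough illustration of the mmap pattern named above, the sketch below maps a file read-only on a POSIX system, passes the MADV_SEQUENTIAL hint, and scans lines with `memchr`. `MappedFileSketch` and `count_lines_sketch` are hypothetical names, not identifiers from the PR, and error handling is kept minimal.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical RAII wrapper around a read-only mapping (POSIX only).
// Intended for a single map_file() call per object.
struct MappedFileSketch {
    const char* data = nullptr;
    std::size_t size = 0;

    MappedFileSketch() = default;
    MappedFileSketch(const MappedFileSketch&) = delete;
    MappedFileSketch& operator=(const MappedFileSketch&) = delete;

    bool map_file(const char* path) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) return false;
        struct stat st;
        if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
            ::close(fd);
            return false;
        }
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) return false;
        ::madvise(p, len, MADV_SEQUENTIAL);  // advisory hint; failure is ignored
        data = static_cast<const char*>(p);
        size = len;
        return true;
    }

    ~MappedFileSketch() {
        if (data != nullptr) ::munmap(const_cast<char*>(data), size);
    }
};

// Counts lines in a buffer; a final line without '\n' is counted too.
inline std::size_t count_lines_sketch(const char* data, std::size_t size) {
    std::size_t lines = 0;
    const char* p = data;
    const char* end = data + size;
    while (p < end) {
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', static_cast<std::size_t>(end - p)));
        ++lines;
        if (nl == nullptr) break;
        p = nl + 1;
    }
    return lines;
}
```

Pipes cannot be mapped this way, which fits the stdin fallback listed in the commit.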

- Add memory-mapped I/O for file arguments with MADV_SEQUENTIAL hints
- Implement multi-threaded parallel processing across CPU cores
- Add --threads option to control parallelism (default: auto-detect)
- Use fast pattern scanning with memchr for missing genotype detection
- Zero-copy pass-through for lines without missing genotypes
- Update documentation with new options and benchmark results

…uffer
- Replace string splitting with zero-allocation character iteration
- Use 1MB output buffer with periodic flushing
- Pre-allocate vectors to avoid reallocations
- Direct genotype parsing without intermediate string copies

- Switch from factorial-based exact test to chi-square approximation
- Eliminates exponential computation that caused timeout on large samples
- Chi-square formula: X² = Σ((observed - expected)² / expected)
- Convert to p-value using chi-square CDF with 1 degree of freedom
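The chi-square approximation above can be worked through in a few lines. This is a sketch under stated assumptions, not the tool's code: the function name, the biallelic genotype-count inputs, and the choice to return 1.0 for empty or monomorphic sites are all illustrative. For 1 degree of freedom the upper-tail probability is erfc(sqrt(X²/2)), which avoids needing a general chi-square CDF.

```cpp
#include <cmath>

// Hypothetical helper: chi-square goodness-of-fit test for Hardy-Weinberg
// equilibrium at a biallelic site, given observed genotype counts.
inline double hwe_chi_square_p_sketch(long homRef, long het, long homAlt) {
    const double n = static_cast<double>(homRef + het + homAlt);
    if (n <= 0.0) return 1.0;  // assumption: no data means no evidence against HWE
    const double p = (2.0 * static_cast<double>(homRef) + static_cast<double>(het)) / (2.0 * n);
    const double q = 1.0 - p;
    if (p <= 0.0 || q <= 0.0) return 1.0;  // monomorphic site

    const double observed[3] = {static_cast<double>(homRef), static_cast<double>(het),
                                static_cast<double>(homAlt)};
    const double expected[3] = {p * p * n, 2.0 * p * q * n, q * q * n};

    double x2 = 0.0;
    for (int i = 0; i < 3; ++i) {
        const double d = observed[i] - expected[i];
        x2 += d * d / expected[i];
    }
    // Upper tail of the chi-square distribution with 1 degree of freedom.
    return std::erfc(std::sqrt(x2 / 2.0));
}
```

For example, counts of 25/50/25 match the expected proportions exactly and give a p-value of 1, while 50/0/50 gives X² = 100 and a p-value close to zero.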

…e files
- Replace in-memory accumulation with temp file for intermediate storage
- Stream output to prevent memory exhaustion on 2504-sample files
- Update test expected outputs to match new format
- Enables processing of files that previously caused out-of-memory

Performance improvements (~40x faster):
- Add memory-mapped I/O for file arguments using mmap with MADV_SEQUENTIAL
- Pre-compute chromosome IDs during parsing for O(1) sort comparison
- Implement CompactSortKey struct (20 bytes vs ~9KB per variant)
- Add 1MB output buffer to reduce syscall overhead
- Support both natural and lexicographic chromosome ordering

Benchmark on 427K variants, 2504 samples (1.4GB):
- Before: >30 minutes (timeout)
- After: ~46 seconds with file argument

- Replace character-by-character tab counting with memchr-based validation
- Add SIMD newline finding (AVX2/SSE2 with portable fallback)
- Use hasEightColumnsFast() for O(7) memchr calls vs O(70) char comparisons
- Optimize countVariantsMmap() with findNewlineSIMD() for line detection
- Update documentation to reflect mmap mode, gzip support, and performance
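The commit names hasEightColumnsFast() without showing it. A minimal sketch of the idea, under the assumption that "eight columns" means at least seven tab separators, could look like this (`has_eight_columns_sketch` is a hypothetical name):

```cpp
#include <cstddef>
#include <cstring>

// Returns true if [line, line + len) contains at least 7 tabs, i.e. at least
// 8 tab-separated columns. Uses at most 7 memchr calls per line.
inline bool has_eight_columns_sketch(const char* line, std::size_t len) {
    const char* p = line;
    const char* end = line + len;
    for (int i = 0; i < 7; ++i) {
        p = static_cast<const char*>(
            std::memchr(p, '\t', static_cast<std::size_t>(end - p)));
        if (p == nullptr) return false;  // fewer than 7 tabs
        ++p;                             // continue after the tab just found
    }
    return true;
}
```

Each `memchr` call jumps straight to the next tab, and the loop stops as soon as the seventh tab is found, so the rest of a long sample section is never scanned.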

- Add file argument support with memory-mapped I/O
- Implement SIMD-optimized newline finding (AVX2/SSE2/memchr)
- Use zero-copy pointer arithmetic for CHROM/POS extraction
- Add 1MB output buffering with 512KB flush threshold
- Update documentation with new usage and performance notes

Indexes 427K variants in ~6-9 seconds (was >2 min timeout)

…f_filter, validator

VCFX_haplotype_phaser (16x speedup):
- Add memory-mapped file I/O with -i/--input option
- SIMD-optimized newline detection (AVX2/SSE2/memchr)
- Zero-copy string parsing with std::string_view
- CircularVariantBuffer for O(1) streaming operations
- OutputBuffer with 1MB batched writes
- FORMAT field caching, int8_t genotypes
- Add -q/--quiet for batch processing

VCFX_phase_checker (12x speedup):
- Memory-mapped I/O with fast file input
- SIMD-optimized line scanning
- Zero-copy parsing, output buffering

VCFX_genotype_query (50x speedup):
- Memory-mapped I/O, SIMD line detection
- Zero-copy field parsing
- FORMAT caching, output buffering

VCFX_nonref_filter:
- Memory-mapped I/O optimization
- Zero-copy parsing, output buffering

VCFX_validator:
- Performance improvements and refactoring

All tools: Add new tests, update documentation

Rewrite VCFX_fasta_converter with an optimal two-pass algorithm:
- Pass 1: Fast newline scan to count variants
- Pass 2: Direct writes to pre-allocated contiguous buffer
- Row-major layout for perfect cache locality during output

Performance results:
- 50MB file: 1.78s → 0.14s (12.7x faster)
- 503MB file: ~422 MB/s throughput
- 6.8GB file: 61 seconds (previously timed out)

Key optimizations:
- Memory-mapped I/O with SIMD newline scanning
- Zero-copy parsing with raw pointer arithmetic
- FORMAT field caching for GT index lookup
- 1MB OutputBuffer for batched writes
- Pre-allocated O(variants × samples) buffer

All 22 tests pass. Output verified identical between modes.

Rewrite VCFX_diff_tool with memory-mapped I/O and SIMD acceleration:
- mmap both input VCF files with MADV_SEQUENTIAL hint
- SIMD newline scanning (AVX2/SSE2/memchr fallback)
- Zero-copy parsing using string_view and raw pointers
- 1MB OutputBuffer for batched writes
- Added -q/--quiet option for batch processing

Performance results:
- 503MB file: 0.73s (~690 MB/s throughput)
- 6.8GB × 2 files: 53s (~256 MB/s throughput)
- Previously: TIMEOUT on large files

Both in-memory and streaming modes now use optimized mmap path. All 13 tests pass.

Rewrite VCFX_concordance_checker with memory-mapped I/O and SIMD acceleration:
- mmap input VCF file with MADV_SEQUENTIAL hint
- SIMD newline scanning (AVX2/SSE2/memchr fallback)
- Zero-copy parsing using string_view and raw pointers
- 1MB OutputBuffer for batched writes
- Added -i/--input and -q/--quiet options

Performance results:
- 50MB file: 1.75s → 0.03s (56x faster)
- 503MB file: 0.41s (~1.2 GB/s throughput)
- 6.8GB file: 23s (~293 MB/s throughput)
- Previously: TIMEOUT on large files

All 6 tests pass. Output verified identical between modes.

…atch processing
- Add memory-mapped file I/O with madvise hints
- Add SIMD-accelerated newline and tab scanning (NEON/SSE2/AVX2)
- Add multi-threaded parallel processing with -t flag
- Implement batch sample processing: find all sample positions once per line
- Add pre-allocated result arrays to avoid per-line allocations
- Use direct write() syscalls with 16MB per-thread buffers
- Add zero-copy parsing with string_view

Benchmark results (4GB VCF, 427K variants):
- 100 samples: 26s -> 3.3s (8x speedup)
- 500 samples: ~131s -> 13s (10x speedup)

- VCFX_inbreeding_calculator: 6 min → 17 sec (~21x speedup)
- VCFX_hwe_tester: 5 min → 17 sec (~18x speedup)

Both tools now support:
- Memory-mapped I/O via -i/--input flag
- SIMD newline/tab scanning (AVX2/SSE2/NEON)
- Zero-copy parsing with string_view
- Buffered output with direct write() syscalls
- Quiet mode via -q/--quiet flag

…th mmap and SIMD
- VCFX_allele_freq_calc: mmap + SIMD for ~20x speedup
- VCFX_indel_normalizer: mmap + SIMD for ~73x speedup
- VCFX_missing_detector: mmap + SIMD + MT pre-scan + zero-copy for ~42x speedup

All tools now support -i/--input for file mode and -q/--quiet for scripts.

… for 5-60x speedup

Major optimizations for VCFX_ld_calculator:
- Memory-mapped I/O with madvise hints for extreme read performance
- SIMD-accelerated r² computation (NEON/AVX2/SSE2) for 8-16x inner loop speedup
- Multi-threaded matrix computation with work-stealing for linear scaling
- Distance-based pruning (--max-distance) to skip biologically irrelevant pairs
- Zero-allocation VCF parsing with raw pointer arithmetic
- Compact int8_t genotype storage for 4x smaller memory footprint
- Pre-computed per-variant statistics (sum, variance) to avoid redundant calculations
- 4MB buffered output with direct write() syscalls

New CLI options:
- -i/--input FILE: Use mmap for best performance
- -n/--threads N: Multi-threaded matrix mode
- -d/--max-distance BP: Skip pairs beyond distance threshold
- -q/--quiet: Suppress informational messages
- -v/--version: Show version

Performance improvements:
- Streaming mode: ~5x faster (2 min vs 10 min on 427K variants)
- Matrix mode: ~60x faster (30s vs 32 min on 10K variants)
- With distance pruning: 90-99% fewer comparisons while capturing biologically relevant LD

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
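For reference, the quantity being accelerated can be written as a scalar baseline. This sketch is not the tool's SIMD kernel. It assumes genotypes are stored as int8_t alternate-allele counts (0, 1, 2) with -1 for missing, and it computes r² as the squared Pearson correlation over samples where both variants are called. The function name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar r² between two variants' genotype vectors.
// Assumed encoding: 0/1/2 = alt allele count, negative = missing.
inline double r_squared_sketch(const std::vector<std::int8_t>& a,
                               const std::vector<std::int8_t>& b) {
    long n = 0, sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    const std::size_t m = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < m; ++i) {
        const int x = a[i];
        const int y = b[i];
        if (x < 0 || y < 0) continue;  // skip samples missing at either variant
        ++n;
        sx += x;
        sy += y;
        sxx += x * x;
        syy += y * y;
        sxy += x * y;
    }
    if (n == 0) return 0.0;
    // Scaled covariance and variances (each multiplied by n²).
    const double cov = static_cast<double>(n * sxy - sx * sy);
    const double varx = static_cast<double>(n * sxx - sx * sx);
    const double vary = static_cast<double>(n * syy - sy * sy);
    if (varx <= 0.0 || vary <= 0.0) return 0.0;  // monomorphic variant
    return (cov * cov) / (varx * vary);
}
```

When no genotypes are missing, the per-variant sums and sums of squares depend on one variant only, which is where pre-computing per-variant statistics saves work: only the cross term `sxy` has to be computed per pair.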

… SIMD

VCFX_allele_balance_calc:
- Memory-mapped I/O with madvise hints
- SIMD-accelerated line scanning (NEON/AVX2/SSE2)
- Incremental output flushing for memory efficiency
- New CLI options: -i/--input, -q/--quiet

VCFX_haplotype_extractor:
- Memory-mapped I/O for file input
- SIMD-accelerated parsing
- Zero-copy output optimization
- New CLI options: -i/--input, -q/--quiet
- Updated test suite for new features

Benchmarked all uncategorized tools on 4GB file:
- 8 tools need mmap optimization (>60s): af_subsetter, cross_sample_concordance, distance_calculator, dosage_calculator, duplicate_remover, metadata_summarizer, multiallelic_splitter, variant_classifier
- 17 tools verified as fast (<1s): alignment_checker, ancestry_inferrer, annotation_extractor, compressor, field_extractor, format_converter, gl_filter, impact_filter, info_aggregator, info_parser, info_summarizer, population_filter, position_subsetter, record_filter, ref_comparator, region_subsampler, sample_extractor

… mmap for 10-20x speedup

Add memory-mapped I/O optimization to three VCF tools for large file processing:
- af_subsetter: 2m44s -> 8s (~20x faster)
- dosage_calculator: 2m41s -> 18s (~9x faster)
- duplicate_remover: 2m36s -> 10s (~15x faster)

Changes:
- Add -i/--input option for direct file input with mmap
- Add -q/--quiet option to suppress warnings
- Use 1MB output buffers with periodic flushing
- Zero-allocation parsing in mmap mode
- Pre-allocated hash sets for duplicate detection
- Update documentation with performance benchmarks
- Add comprehensive tests for file mode vs stdin mode

- VCFX_variant_classifier: add mmap support (~12x speedup)
- VCFX_metadata_summarizer: add mmap support (~15x speedup)
- VCFX_cross_sample_concordance: add multi-threading and reusable buffers
- VCFX_distance_calculator: add mmap and zero-copy parsing
- VCFX_multiallelic_splitter: add mmap and PL recoding optimization

All 31 tools now optimized with mmap/SIMD acceleration.

jorgeMFS and others added 30 commits December 3, 2025 03:26

Bump version to 1.0.4

6585600

- Performance optimizations for VCF processing tools
- Dark mode header readability fix
- Improved I/O buffering and mmap support

Merge branch 'ieeta-pt:main' into main

a1060cc

perf(ld_calculator): optimize memory and I/O for large sample processing

b2884de

- Add output buffering for efficient I/O
- Optimize genotype parsing with direct character iteration
- Improve variant storage to reduce memory allocations
- Update documentation with performance notes

perf(gl_filter): optimize genotype likelihood parsing and output

2a71781

- Add 1MB output buffering for efficient I/O
- Optimize field extraction with direct character iteration
- Improve GL parsing to avoid unnecessary string copies
- Update documentation with performance notes

perf(cross_sample_concordance): remove stringstream overhead

3002405

- Replace stringstream with direct string operations
- Use character-level parsing for genotype extraction
- Add output buffering with 1MB buffer
- Reduces processing time on large multi-sample files

perf(distance_calculator): optimize pairwise distance computation

e5fcb36

- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Pre-allocate sample genotype vectors
- Improve numeric conversion with fast integer parsing

perf(dosage_calculator): optimize dosage computation for large samples

b2321a0

- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Avoid string allocations in inner loop
- Update test data to match optimized output format

perf(multiallelic_splitter): optimize allele splitting for large files

ad37f1b

- Add output buffering for efficient I/O
- Use direct character parsing for field extraction
- Reduce string allocations in inner parsing loop

perf(inbreeding_calculator): optimize F coefficient computation

a4b0593

- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Reduce string allocations in inner loop
- Update test expected error outputs

perf(field_extractor): optimize field parsing and output

e229c0b

- Add output buffering for efficient I/O
- Use direct character iteration for field extraction
- Reduce string allocations during parsing

perf(ancestry_assigner): optimize ancestry computation for large samples

6f6b02e

- Add output buffering for efficient I/O
- Use direct character iteration for genotype parsing
- Pre-allocate vectors for sample data storage

chore(benchmarks): update benchmark tasks and results

3bf19b3

- Update benchmark task definitions
- Update results notebook with new performance data

test(fasta_converter): update large test VCF file

47b050f

chore: update .gitignore

a541ab4

docs(diff_tool): update with v1.2 performance improvements

2c9be1e

jorgeMFS and others added 9 commits December 7, 2025 15:51

jorgeMFS merged commit 60a6b71 into ieeta-pt:main Dec 9, 2025
8 checks passed

jorgeMFS had a problem deploying to pypi December 9, 2025 03:20 — with GitHub Actions Failure

jorgeMFS had a problem deploying to pypi December 9, 2025 03:35 — with GitHub Actions Failure

jorgeMFS had a problem deploying to pypi December 9, 2025 03:38 — with GitHub Actions Failure

jorgeMFS had a problem deploying to testpypi December 9, 2025 09:48 — with GitHub Actions Failure

jorgeMFS had a problem deploying to pypi December 9, 2025 09:53 — with GitHub Actions Failure

jorgeMFS added a commit that referenced this pull request Dec 9, 2025

Merge pull request #8 from jorgeMFS/codex/provide-reusable-utilities-…

8e43767

…or-drop-library Add reusable utilities to vcfx_core

jorgeMFS added a commit that referenced this pull request Dec 9, 2025

Merge pull request #8 from jorgeMFS/main

2e814d8

Performance optimization of tools with mmap and SIMD

performance optimization of tools with mmap and SIMD #8
jorgeMFS merged 39 commits into ieeta-pt:main from jorgeMFS:main

1 participant
