Improved validation tool and documentation #6
Merged
…nd-newlines Update docs closing text and fix newlines
…and-documentation Add benchmarking harness
- Fixed critical bug where validator was outputting entire VCF to stdout
- Added detailed validation report with statistics and checks performed
- Implemented memory-mapped I/O (mmap) for improved large file performance
- Added file path argument support for direct file validation
- Enhanced validation to include sample count, line statistics, and checks list
- Validator now properly reports PASSED/FAILED status with clear formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Added 21 new validator tests (tests 22-42) covering:
  - File path mode (mmap) validation
  - Multi-allelic variants
  - Phased genotypes
  - Edge cases (large POS, empty file, CRLF line endings)
  - Output report verification
  - Various invalid input scenarios
- Added new test data files for comprehensive coverage
- Fixed Python binding tests to use correct Python version from CMake cache
- Tests now auto-detect Python executable matching the built extension
- Updated VCFX_validator documentation with new features:
  - File path argument support
  - Detailed validation report output format
  - Performance improvements via mmap
- Fixed and improved tools overview documentation
- Consolidated Docker documentation into main README
- Updated changelog
- Fixed benchmark harness to handle cross-platform compatibility
- Updated task generation script for proper tool invocation
- Improved run_task.sh for more reliable timing
- Updated tasks.yaml with correct benchmark configurations
- Fixed Makefile targets and dependencies
- Updated Jupyter notebook for result visualization
- VCFX_variant_counter: Add memory-mapped I/O for file arguments (~20x faster)
  - Implement mmap-based file reading for direct file access
  - Replace stringstream parsing with fast tab counting
  - Add larger I/O buffers (1MB) for stdin mode
  - Now supports file argument: VCFX_variant_counter input.vcf
- VCFX_allele_freq_calc: Optimize with string_view for zero-copy parsing
  - Replace std::string allocations with string_view
  - Add efficient genotype parsing without memory allocation
  - Use larger I/O buffers and pre-allocated line buffers
- VCFX_missing_detector: Optimize with string_view parsing
  - Zero-copy field extraction using string_view
  - Efficient GT field extraction without stringstream
  - Optimized missing genotype detection
- VCFX_variant_classifier: Fix chromosome format compatibility
  - Remove requirement for 'chr' prefix in chromosome names
  - Now accepts both 'chr20' and '20' formats (1000 Genomes compatible)

Performance improvement on chr21 (427K variants, 4.3GB):
- Before: ~2 minutes (stdin mode)
- After: ~6 seconds (mmap mode)
- bcftools baseline: ~1.5 seconds

All 64 tests pass.
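The zero-copy parsing described in this commit can be illustrated with a short sketch. It is not taken from the VCFX sources: `nth_field`, `add_genotype`, `AlleleCounts`, and `alt_allele_frequency` are hypothetical names, and the frequency definition used here (non-reference alleles divided by called alleles) is an assumption. The sketch relies on the VCF rule that GT, when present, is the first FORMAT key.

```cpp
#include <cstddef>
#include <string_view>

// Return the n-th (0-based) tab-separated field of `line` without copying.
// An empty view is returned when the line has fewer than n + 1 fields.
std::string_view nth_field(std::string_view line, std::size_t n) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string_view::npos) return {};
        start = tab + 1;
    }
    const std::size_t end = line.find('\t', start);
    return line.substr(start, end == std::string_view::npos ? end : end - start);
}

struct AlleleCounts {
    std::size_t alt = 0;    // called alleles other than "0"
    std::size_t total = 0;  // all called (non-missing) alleles
};

// Add one sample's genotype ("0/1", "1|1", "./.") to the running counts.
// Assumes GT is the first colon-separated subfield of the sample column.
void add_genotype(std::string_view sample, AlleleCounts& c) {
    const std::string_view gt = sample.substr(0, sample.find(':'));
    std::size_t i = 0;
    while (i < gt.size()) {
        const std::size_t sep = gt.find_first_of("/|", i);
        const std::string_view allele =
            gt.substr(i, sep == std::string_view::npos ? sep : sep - i);
        if (!allele.empty() && allele != ".") {
            ++c.total;
            if (allele != "0") ++c.alt;
        }
        if (sep == std::string_view::npos) break;
        i = sep + 1;
    }
}

// Alternate-allele frequency over the sample columns (fields 10 onward)
// of one VCF data line; returns 0.0 when no allele is called.
double alt_allele_frequency(std::string_view line) {
    std::size_t pos = 0;
    for (int i = 0; i < 9; ++i) {  // skip the 9 fixed columns by counting tabs
        const std::size_t tab = line.find('\t', pos);
        if (tab == std::string_view::npos) return 0.0;
        pos = tab + 1;
    }
    AlleleCounts c;
    while (true) {
        const std::size_t tab = line.find('\t', pos);
        add_genotype(
            line.substr(pos, tab == std::string_view::npos ? tab : tab - pos), c);
        if (tab == std::string_view::npos) break;
        pos = tab + 1;
    }
    return c.total ? static_cast<double>(c.alt) / static_cast<double>(c.total) : 0.0;
}
```

Every `string_view` here points into the caller's line buffer, so the views are only valid while that buffer is alive and unmodified. That constraint is the usual trade-off for avoiding per-field `std::string` allocations.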
- Break long lines to comply with 88 character limit
- Remove trailing whitespace from blank lines
- Refactor tool variable definitions using a loop for cleaner code

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Performance improvements for tools showing slowness in benchmarks:

1. VCFX_variant_classifier:
   - Replace slow stringstream split with find-based approach
   - Replace ostringstream with string concatenation
   - Add 1MB I/O buffers
   - Add sync_with_stdio(false)
2. VCFX_allele_freq_calc:
   - Add 1MB output buffer
   - Cache FORMAT field GT index (avoids re-parsing same format)
   - Add sync_with_stdio(false)
3. VCFX_missing_detector:
   - Add 1MB output buffer
   - Cache FORMAT field GT index
   - Add sync_with_stdio(false)
4. VCFX_variant_counter:
   - Optimize gzip path with offset tracking instead of O(n) erase
   - Use larger 64KB chunks for decompression
   - Use string_view for zero-copy line processing

Expected improvements: 2-10x speedup per tool.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
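The "offset tracking instead of O(n) erase" item can be sketched as follows. This is an illustration under stated assumptions, not the VCFX_variant_counter code: `ChunkedLineCounter` is a hypothetical class, the decompression step is left out, and chunks are assumed to arrive as plain text. The point is that a cursor advances past each complete line, and the buffer is compacted once per chunk instead of once per line.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: accepts decompressed chunks and counts variant lines.
class ChunkedLineCounter {
public:
    // Append a chunk, process every complete line, keep the partial tail.
    void feed(std::string_view chunk) {
        buffer_.append(chunk.data(), chunk.size());
        std::size_t offset = 0;  // start of the first unprocessed line
        std::size_t nl;
        while ((nl = buffer_.find('\n', offset)) != std::string::npos) {
            handle_line(std::string_view(buffer_.data() + offset, nl - offset));
            offset = nl + 1;  // advance the cursor; no erase per line
        }
        buffer_.erase(0, offset);  // single compaction per chunk
    }

    // Process a final line that has no trailing newline, if any.
    void finish() {
        if (!buffer_.empty()) handle_line(buffer_);
        buffer_.clear();
    }

    std::size_t variants() const { return variants_; }

private:
    // Count non-empty lines that are not header lines.
    void handle_line(std::string_view line) {
        if (!line.empty() && line.front() != '#') ++variants_;
    }

    std::string buffer_;
    std::size_t variants_ = 0;
};
```

Erasing the consumed prefix after every line shifts the rest of the buffer each time, which is quadratic in the chunk size in the worst case. With the cursor, each chunk costs one pass for the newline search plus one move of the leftover partial line.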
jorgeMFS added a commit that referenced this pull request Dec 9, 2025
…-output Make haplotype extractor debug optional
jorgeMFS added a commit that referenced this pull request Dec 9, 2025
Improved validation tool and documentation.
Pull Request Checklist
Thank you for contributing to VCFX! Before submitting your pull request, please confirm the following:
- `pre-commit run --files <changed files>` to execute `ruff`, `flake8`, `mypy`, and `pytest`.
- `ctest --output-on-failure` from the build directory.
- `pytest tests/python`.
- … (`README.md`, `docs/*`).

Provide a brief description of the changes below.