improved validation tool and documentation #6

Merged: jorgeMFS merged 12 commits into ieeta-pt:main from jorgeMFS:main on Dec 3, 2025

jorgeMFS (Collaborator) commented Dec 3, 2025

Pull Request Checklist

Thank you for contributing to VCFX! Before submitting your pull request, please confirm the following:

  • I ran pre-commit run --files <changed files> to execute ruff, flake8, mypy, and pytest.
  • All C++ tests pass via ctest --output-on-failure from the build directory.
  • All Python tests pass via pytest tests/python.
  • Documentation has been updated where applicable (e.g. README.md, docs/*).

Provide a brief description of the changes below.

jorgeMFS and others added 12 commits June 10, 2025 10:34
…nd-newlines

Update docs closing text and fix newlines
…and-documentation

Add benchmarking harness
- Fixed critical bug where validator was outputting entire VCF to stdout
- Added detailed validation report with statistics and checks performed
- Implemented memory-mapped I/O (mmap) for improved large file performance
- Added file path argument support for direct file validation
- Enhanced validation to include sample count, line statistics, and checks list
- Validator now properly reports PASSED/FAILED status with clear formatting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Added 21 new validator tests (tests 22-42) covering:
  - File path mode (mmap) validation
  - Multi-allelic variants
  - Phased genotypes
  - Edge cases (large POS, empty file, CRLF line endings)
  - Output report verification
  - Various invalid input scenarios
- Added new test data files for comprehensive coverage
- Fixed Python binding tests to use correct Python version from CMake cache
- Tests now auto-detect Python executable matching the built extension
- Updated VCFX_validator documentation with new features:
  - File path argument support
  - Detailed validation report output format
  - Performance improvements via mmap
- Fixed and improved tools overview documentation
- Consolidated Docker documentation into main README
- Updated changelog
- Fixed benchmark harness to handle cross-platform compatibility
- Updated task generation script for proper tool invocation
- Improved run_task.sh for more reliable timing
- Updated tasks.yaml with correct benchmark configurations
- Fixed Makefile targets and dependencies
- Updated Jupyter notebook for result visualization
- VCFX_variant_counter: Add memory-mapped I/O for file arguments (~20x faster)
  - Implement mmap-based file reading for direct file access
  - Replace stringstream parsing with fast tab counting
  - Add larger I/O buffers (1MB) for stdin mode
  - Now supports file argument: VCFX_variant_counter input.vcf
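
A rough illustration of the "fast tab counting" idea, not the code from this commit: instead of splitting every line through a stringstream, count tab characters and accept the line if it has enough columns. The threshold of 8 columns (7 tabs) and the function names are assumptions made for this sketch. It requires C++17.

```cpp
#include <cstddef>
#include <string_view>

// Counts tab characters; a line with N tab-separated fields has N-1 tabs.
std::size_t count_tabs(std::string_view line) {
    std::size_t n = 0;
    for (char c : line) {
        if (c == '\t') ++n;
    }
    return n;
}

// Assumed rule: a variant is a non-header line with at least 8 columns.
bool is_variant_line(std::string_view line) {
    if (line.empty() || line.front() == '#') return false;
    return count_tabs(line) >= 7;
}

// Walks a whole buffer (for example an mmap'ed file) and counts variant lines.
std::size_t count_variants(std::string_view text) {
    std::size_t count = 0;
    while (!text.empty()) {
        const std::size_t nl = text.find('\n');
        if (is_variant_line(text.substr(0, nl))) ++count;
        if (nl == std::string_view::npos) break;
        text.remove_prefix(nl + 1);
    }
    return count;
}
```

No field is ever materialized as a `std::string`, so the per-line cost is a single pass over the bytes.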

- VCFX_allele_freq_calc: Optimize with string_view for zero-copy parsing
  - Replace std::string allocations with string_view
  - Add efficient genotype parsing without memory allocation
  - Use larger I/O buffers and pre-allocated line buffers
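
To make the "genotype parsing without memory allocation" bullet concrete, here is a small sketch under assumptions: `AlleleCounts`, `add_genotype`, and `allele_frequency` are hypothetical names, and the frequency is taken as non-reference alleles over called alleles, which may differ in detail from what the tool does. It requires C++17.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical accumulator for one variant line.
struct AlleleCounts {
    std::size_t alt = 0;     // alleles with a non-zero index
    std::size_t called = 0;  // alleles that are numeric (not missing)
};

// Parses genotypes such as "0/1", "1|1", "./." or "0/2" in place.
// Tokens that are empty or non-numeric (for example ".") are skipped.
void add_genotype(std::string_view gt, AlleleCounts& counts) {
    std::size_t i = 0;
    while (i < gt.size()) {
        std::size_t j = i;
        while (j < gt.size() && gt[j] != '/' && gt[j] != '|') ++j;
        const std::string_view tok = gt.substr(i, j - i);
        bool numeric = !tok.empty();
        bool nonzero = false;
        for (char c : tok) {
            if (c < '0' || c > '9') { numeric = false; break; }
            if (c != '0') nonzero = true;
        }
        if (numeric) {
            ++counts.called;
            if (nonzero) ++counts.alt;
        }
        i = j + 1;
    }
}

// Returns 0.0 when no allele was called, to avoid dividing by zero.
double allele_frequency(const AlleleCounts& c) {
    if (c.called == 0) return 0.0;
    return static_cast<double>(c.alt) / static_cast<double>(c.called);
}
```

For the genotypes `0/1`, `1|1`, and `./.`, this counts four called alleles of which three are non-reference.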

- VCFX_missing_detector: Optimize with string_view parsing
  - Zero-copy field extraction using string_view
  - Efficient GT field extraction without stringstream
  - Optimized missing genotype detection
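
The zero-copy field extraction described above could look roughly like the sketch below. The helper names are hypothetical, and the rule that any `.` allele marks the genotype as missing is an assumption for illustration rather than a statement of the tool's exact behavior. It requires C++17.

```cpp
#include <cstddef>
#include <string_view>

// Returns the idx-th ':'-separated subfield of a sample column,
// or an empty view if the column has fewer subfields.
std::string_view subfield(std::string_view sample, std::size_t idx) {
    std::size_t start = 0;
    for (std::size_t k = 0; k < idx; ++k) {
        const std::size_t colon = sample.find(':', start);
        if (colon == std::string_view::npos) return {};
        start = colon + 1;
    }
    const std::size_t end = sample.find(':', start);
    if (end == std::string_view::npos) return sample.substr(start);
    return sample.substr(start, end - start);
}

// Assumed rule: a genotype is missing if it is empty or contains a '.' allele.
bool is_missing_gt(std::string_view gt) {
    return gt.empty() || gt.find('.') != std::string_view::npos;
}
```

Both functions return views into the original line buffer, so checking every sample on a line needs no heap allocation.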

- VCFX_variant_classifier: Fix chromosome format compatibility
  - Remove requirement for 'chr' prefix in chromosome names
  - Now accepts both 'chr20' and '20' formats (1000 Genomes compatible)
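
One way to accept both naming styles is to normalize the chromosome name before comparing or classifying. This is only a sketch of the idea with hypothetical helper names; the commit itself may simply have dropped the prefix check. It requires C++17.

```cpp
#include <string_view>

// Strips an optional leading "chr" (any letter case) when something follows it.
std::string_view strip_chr_prefix(std::string_view chrom) {
    if (chrom.size() > 3 &&
        (chrom[0] == 'c' || chrom[0] == 'C') &&
        (chrom[1] == 'h' || chrom[1] == 'H') &&
        (chrom[2] == 'r' || chrom[2] == 'R')) {
        chrom.remove_prefix(3);
    }
    return chrom;
}

// Treats "chr20" and "20" as the same chromosome.
bool same_chromosome(std::string_view a, std::string_view b) {
    return strip_chr_prefix(a) == strip_chr_prefix(b);
}
```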

Performance improvement on chr21 (427K variants, 4.3GB):
- Before: ~2 minutes (stdin mode)
- After: ~6 seconds (mmap mode)
- bcftools baseline: ~1.5 seconds

All 64 tests pass.
- Break long lines to comply with 88 character limit
- Remove trailing whitespace from blank lines
- Refactor tool variable definitions using a loop for cleaner code

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Performance improvements for tools showing slowness in benchmarks:

1. VCFX_variant_classifier:
   - Replace slow stringstream split with find-based approach
   - Replace ostringstream with string concatenation
   - Add 1MB I/O buffers
   - Add sync_with_stdio(false)

2. VCFX_allele_freq_calc:
   - Add 1MB output buffer
   - Cache FORMAT field GT index (avoids re-parsing same format)
   - Add sync_with_stdio(false)

3. VCFX_missing_detector:
   - Add 1MB output buffer
   - Cache FORMAT field GT index
   - Add sync_with_stdio(false)

4. VCFX_variant_counter:
   - Optimize gzip path with offset tracking instead of O(n) erase
   - Use larger 64KB chunks for decompression
   - Use string_view for zero-copy line processing

Expected improvements: 2-10x speedup per tool.
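
As an illustration of the "cache FORMAT field GT index" item, the sketch below remembers the last FORMAT string and reuses its GT position when the next line carries the same FORMAT, which is the common case in a VCF. `GtIndexCache` is a hypothetical class written for this note, not the code in the commit. It requires C++17.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

class GtIndexCache {
public:
    // Returns the position of "GT" in a ':'-separated FORMAT string, or -1.
    // The FORMAT is re-parsed only when it differs from the previous call.
    int lookup(std::string_view format) {
        if (valid_ && format == last_format_) return last_index_;
        last_format_.assign(format.data(), format.size());
        last_index_ = find_gt(format);
        valid_ = true;
        return last_index_;
    }

private:
    static int find_gt(std::string_view format) {
        int idx = 0;
        std::size_t start = 0;
        while (start <= format.size()) {
            const std::size_t colon = format.find(':', start);
            const std::string_view key =
                (colon == std::string_view::npos)
                    ? format.substr(start)
                    : format.substr(start, colon - start);
            if (key == "GT") return idx;
            if (colon == std::string_view::npos) break;
            start = colon + 1;
            ++idx;
        }
        return -1;
    }

    std::string last_format_;  // owned copy, since the line buffer gets reused
    int last_index_ = -1;
    bool valid_ = false;
};
```

The cache keeps its own copy of the FORMAT string because the line buffer it was read from is typically overwritten by the next line.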

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jorgeMFS jorgeMFS merged commit f3f4e00 into ieeta-pt:main Dec 3, 2025
8 checks passed
jorgeMFS added a commit that referenced this pull request Dec 9, 2025
…-output

Make haplotype extractor debug optional
jorgeMFS added a commit that referenced this pull request Dec 9, 2025
Improved validation tool and documentation.