Summary
Add async support for VCF file scanning, enabling non-blocking, streaming conversion of VCF records into Arrow RecordBatch via futures::Stream. This would allow consumers to integrate oxbow into async runtimes (e.g., async Python servers, concurrent multi-region queries, or network-backed storage).
Motivation
The current sync Scanner API works well for single-threaded, local-file workflows. However, several use cases benefit from async I/O:
- Network/remote storage: async reads from object storage (S3, GCS) via noodles' async I/O support.
- Concurrent multi-region queries: scanning multiple genomic regions in parallel without blocking threads.
- Server environments: embedding oxbow in async web services (e.g., a genomics data API backed by async handlers).
- Non-blocking integration: Python users running oxbow inside asyncio or trio event loops.
The noodles crate already provides async readers for VCF (via its async feature), making this a natural extension.
Proposed Design
Core Types
/// A stream of RecordBatch results.
pub type AsyncBatchReader = Pin<Box<dyn Stream<Item = Result<RecordBatch>> + Send>>;
/// Trait for async scanners (analogous to the sync `Scanner` pattern).
pub trait AsyncScanner {
    fn scan(
        &self,
        columns: Option<Vec<String>>,
        batch_size: Option<usize>,
        limit: Option<usize>,
    ) -> Pin<Box<dyn Future<Output = AsyncBatchReader> + Send + '_>>;
}
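The boxed-future return type keeps the trait usable as a trait object and does not tie it to a particular runtime. Below is a minimal, std-only sketch of that shape, not oxbow's actual code: `AsyncScan`, `ToyScanner`, `BoxFuture`, and the small `block_on` helper are hypothetical stand-ins, and a plain `Vec<u32>` stands in for the `AsyncBatchReader` stream.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical alias mirroring the boxed future returned by `scan`.
pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical, simplified version of the proposed trait.
pub trait AsyncScan {
    fn scan(&self, limit: Option<usize>) -> BoxFuture<'_, Vec<u32>>;
}

/// Stand-in scanner holding already-decoded "records".
pub struct ToyScanner {
    pub records: Vec<u32>,
}

impl AsyncScan for ToyScanner {
    fn scan(&self, limit: Option<usize>) -> BoxFuture<'_, Vec<u32>> {
        // The future borrows `self`, which is why the return type carries `'_`.
        Box::pin(async move {
            let n = limit.unwrap_or(self.records.len()).min(self.records.len());
            self.records[..n].to_vec()
        })
    }
}

/// Waker that does nothing; enough for futures that never return `Pending`.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny polling loop for illustration only; a real application would use
/// its runtime's executor instead.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => std::thread::yield_now(),
        }
    }
}
```

Because nothing in the trait signature names a runtime, the same trait could be driven by tokio, async-std, or any other executor; only the I/O inside an implementation would need to pick one.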
AsyncVcfScanner
- Internally wraps the existing sync Scanner to reuse Model, schema inference, and column projection logic.
- AsyncVcfScannerBuilder mirrors the sync Scanner::new() parameters (fields, info_fields, genotype_fields, samples, etc.).
- Initial implementation focuses on full-file async scan; region query (scan_query) can follow in a subsequent PR.
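To illustrate the "wrap the sync scanner" idea, here is a rough std-only sketch of exposing a synchronous batch iterator through a poll-based interface. Everything in it is an assumption made for illustration: the local `Stream` trait only mirrors the shape of `futures::Stream::poll_next` so the example compiles without external crates, and `IterStream`, `batched`, and `drain` are hypothetical names, with `Vec<u32>` standing in for a RecordBatch.

```rust
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Local stand-in with the same shape as `futures::Stream`.
pub trait Stream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Hypothetical adapter around any synchronous iterator of batches.
pub struct IterStream<I> {
    inner: I,
}

impl<I> IterStream<I> {
    pub fn new(inner: I) -> Self {
        Self { inner }
    }
}

impl<I: Iterator + Unpin> Stream for IterStream<I> {
    type Item = I::Item;

    fn poll_next(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<I::Item>> {
        // The wrapped iterator never suspends, so each poll is immediately ready.
        Poll::Ready(self.inner.next())
    }
}

/// Split records into batches of at most `batch_size` rows (toy "scanner").
pub fn batched(records: Vec<u32>, batch_size: usize) -> IterStream<std::vec::IntoIter<Vec<u32>>> {
    let batches: Vec<Vec<u32>> = records
        .chunks(batch_size.max(1)) // guard: `chunks(0)` would panic
        .map(|chunk| chunk.to_vec())
        .collect();
    IterStream::new(batches.into_iter())
}

/// Waker that does nothing; sufficient because `IterStream` never returns `Pending`.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Poll a stream to completion and collect its items (illustration only).
pub fn drain<S: Stream + Unpin>(mut stream: S) -> Vec<S::Item> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut out = Vec::new();
    loop {
        match Pin::new(&mut stream).poll_next(&mut cx) {
            Poll::Ready(Some(item)) => out.push(item),
            Poll::Ready(None) => break,
            Poll::Pending => std::thread::yield_now(),
        }
    }
    out
}
```

Note that an adapter like this still does its work synchronously inside `poll_next`, so it only changes the interface. Actual non-blocking behaviour would have to come from the underlying reader (for example noodles' async readers) or from offloading the blocking work to another thread.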
Example Usage
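use futures::StreamExt; // provides `.next()` on streams

let scanner = AsyncVcfScanner::builder()
    .header(header)
    .fields(Select::All)
    .info_fields(Select::All)
    .genotype_fields(Select::All)
    .samples(Select::All)
    .coord_system(CoordSystem::OneClosed)
    .build()?;
// `scan` returns a future that resolves to the stream, so await it first.
// The binding is `mut` because `next()` takes `&mut self`.
let mut stream = scanner.scan(None, Some(1000), None).await;
while let Some(batch) = stream.next().await {
    let batch = batch?;
    // process RecordBatch...
}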
Feature Gating
The async functionality would be behind a Cargo feature flag:
[features]
async = ["noodles/async", "futures", "tokio"]
This ensures sync-only users don't pay the compile-time cost of async dependencies.
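On the Rust side, the gate could look roughly like the sketch below. The module name `async_scan` and both functions are hypothetical placeholders rather than oxbow's real layout (`async` itself is a keyword and cannot be used as a plain module name).

```rust
/// Compiled only when the crate is built with `--features async`.
#[cfg(feature = "async")]
pub mod async_scan {
    /// Placeholder for where `AsyncVcfScanner` and friends would live.
    pub fn describe() -> &'static str {
        "async scanning enabled"
    }
}

/// Reports at runtime whether the `async` feature was compiled in.
pub fn async_enabled() -> bool {
    cfg!(feature = "async")
}
```

For the feature list shown above to resolve, `futures` and `tokio` would also need to be declared with `optional = true` under `[dependencies]`.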
Open Questions
- Async runtime: This draft uses tokio. Would you prefer async-std, or a runtime-agnostic approach (using only futures without a specific runtime)?
- Scope: Should the AsyncScanner trait be defined at the top level as a generic interface for all formats, or start VCF-only and generalize later?
- API surface: Should the initial PR include region queries (scan_query), or ship full-file scan first and add query support incrementally?
- Python bindings: Is there interest in exposing the async API to Python (via pyo3-asyncio) in a follow-up?
Current Progress
I have a working skeleton with the trait definition, type alias, and AsyncVcfScanner/AsyncVcfScannerBuilder stubs. Happy to open a draft PR for reference if that would be helpful.