Skip to content

Feature: Async VCF Scanner support #178

Description

@wjixiang

Summary

Add async support for VCF file scanning, enabling non-blocking, streaming conversion of VCF records into Arrow RecordBatch via futures::Stream. This would allow consumers to integrate oxbow into async runtimes (e.g., async Python servers, concurrent multi-region queries, or network-backed storage).

Motivation

The current sync Scanner API works well for single-threaded, local-file workflows. However, several use cases benefit from async I/O:

  • Network/remote storage: async reads from object storage (S3, GCS) via noodles' async I/O support.
  • Concurrent multi-region queries: scanning multiple genomic regions in parallel without blocking threads.
  • Server environments: embedding oxbow in async web services (e.g., a genomics data API backed by async handlers).
  • Non-blocking integration: Python users running oxbow inside asyncio or trio event loops.

The noodles crate already provides async readers for VCF (via its async feature), making this a natural extension.

Proposed Design

Core Types

/// A stream of RecordBatch results.
pub type AsyncBatchReader = Pin<Box<dyn Stream<Item = Result<RecordBatch>> + Send>>;

/// Trait for async scanners (analogous to the sync `Scanner` pattern).
pub trait AsyncScanner {
    fn scan(&self, columns: Option<Vec<String>>, batch_size: Option<usize>, limit: Option<usize>) 
        -> Pin<Box<dyn Future<Output = AsyncBatchReader> + Send + '_>>;
}

AsyncVcfScanner

  • Internally wraps the existing sync Scanner to reuse Model, schema inference, and column projection logic.
  • AsyncVcfScannerBuilder mirrors the sync Scanner::new() parameters (fields, info_fields, genotype_fields, samples, etc.).
  • Initial implementation focuses on full-file async scan; region query (scan_query) can follow in a subsequent PR.

Example Usage

let scanner = AsyncVcfScanner::builder()
    .header(header)
    .fields(Select::All)
    .info_fields(Select::All)
    .genotype_fields(Select::All)
    .samples(Select::All)
    .coord_system(CoordSystem::OneClosed)
    .build()?;

let stream = scanner.scan(None, Some(1000), None);
while let Some(batch) = stream.next().await {
    let batch = batch?;
    // process RecordBatch...
}

Feature Gating

The async functionality would be behind a Cargo feature flag:

[features]
async = ["noodles/async", "futures", "tokio"]

This ensures sync-only users don't pay the compile-time cost of async dependencies.

Open Questions

  1. Async runtime: This draft uses tokio. Would you prefer async-std, or a runtime-agnostic approach (using only futures without a specific runtime)?
  2. Scope: Should the AsyncScanner trait be defined at the top level as a generic interface for all formats, or start VCF-only and generalize later?
  3. API surface: Should the initial PR include region queries (scan_query), or ship full-file scan first and add query support incrementally?
  4. Python bindings: Is there interest in exposing the async API to Python (via pyo3-asyncio) in a follow-up?

Current Progress

I have a working skeleton with the trait definition, type alias, and AsyncVcfScanner/AsyncVcfScannerBuilder stubs. Happy to open a draft PR for reference if that would be helpful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions