parquet-dotnet-query is a small query layer for Parquet files built on top of kiloOhm.Parquet.Net.
NuGet packages:
- kiloOhm.Parquet.Net.Query for the core query engine
- kiloOhm.Parquet.Net.Query.Extensions.Writing for write-side metadata helpers
- kiloOhm.Parquet.Net.Query.Extensions.Indexing for footer-backed equality indexes
- kiloOhm.Parquet.Net.Query.Extensions.Search for footer-backed text search
- kiloOhm.Parquet.Net.Query.Extensions.Pooling for reusable reader pooling
- kiloOhm.Parquet.Net.Query.Extensions as a convenience umbrella package for the full stack
It keeps Parquet-specific optimizations explicit:
- Pushdown(...) for predicates that should participate in parquet pruning
- Where(...) for normal LINQ predicates with residual in-memory evaluation
- Select(...) for projection and column pruning
- FromFiles(...) and FromDirectory(...) for dataset-style scans
- PlanAsync() and ExplainAsync() for visibility into what was pushed down
The current implementation focuses on file pruning, row-group pruning, page pruning, and selective materialization rather than pretending to be a full LINQ provider.
- Explicit pushdown filter DSL
- Partial extraction from Where(...) expressions
- Statistics-based row-group pruning
- Page-index-based page pruning within surviving row groups
- Partition-aware file pruning for directory layouts like Country=DE/...
- Bloom-filter-aware equality pruning when bloom filters are present
- Extensible predicate planners for custom footer or sidecar indexes
- Query-plan caching with a default bounded in-memory cache and pluggable custom caches
- Late materialization for projected queries
- Nested POCO materialization
- Nested projection with column pruning
- Residual predicate evaluation for unsupported logic
- Strict mode for "push down or fail"
- Encryption-friendly query options on top of kiloOhm.Parquet.Net
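To make the statistics-based row-group pruning item above concrete, here is a minimal, self-contained sketch of the general idea. It is not the library's implementation: `RowGroupStats` and `RowGroupPruner` are hypothetical names invented for this illustration, and the min/max values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up min/max statistics for an Age column, one entry per row group.
var rowGroups = new List<RowGroupStats>
{
    new(Index: 0, Min: 1, Max: 17),
    new(Index: 1, Min: 12, Max: 40),
    new(Index: 2, Min: 18, Max: 65),
};

// Predicate: Age >= 18. Row group 0 cannot contain a match, so it is skipped.
var survivors = RowGroupPruner.KeepForGreaterOrEqual(rowGroups, 18)
    .Select(group => group.Index)
    .ToList();

Console.WriteLine(string.Join(", ", survivors)); // prints "1, 2"

// Hypothetical type for this sketch; not part of the library.
public sealed record RowGroupStats(int Index, int Min, int Max);

// Hypothetical helper for this sketch; not part of the library.
public static class RowGroupPruner
{
    // A group can hold a value >= threshold only if its max is >= threshold.
    public static IEnumerable<RowGroupStats> KeepForGreaterOrEqual(
        IEnumerable<RowGroupStats> groups, int threshold) =>
        groups.Where(group => group.Max >= threshold);

    // A group can hold a value == target only if target lies within [Min, Max].
    public static IEnumerable<RowGroupStats> KeepForEquals(
        IEnumerable<RowGroupStats> groups, int target) =>
        groups.Where(group => group.Min <= target && target <= group.Max);
}
```

Row group 1 survives even though it also holds rows below 18. Statistics can only rule a group out, never confirm that every row matches, which is why row-level filtering is still needed after pruning.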
The pushdown subset is intentionally small and predictable.
Supported:
- Equality and inequality
- <, <=, >, >=
- Between(...)
- StartsWith(..., StringComparison.Ordinal)
- Conjunctive combinations with &&
Not pushed down:
- ||
- arbitrary method calls
- culture-sensitive string operations
- complex computed expressions
Unsupported parts are still evaluated correctly as residual predicates after reading matching row groups.
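As an illustration of how a predicate can be split into a pushable part and a residual part, the sketch below walks a LINQ expression tree, breaks it apart at `&&`, and classifies each conjunct. It uses only the standard `System.Linq.Expressions` API. `ConjunctSplitter` and this `Person` class are hypothetical, and the classification rule (member compared against a literal constant) is a simplifying assumption, not the library's actual extraction logic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

Expression<Func<Person, bool>> predicate =
    row => row.Age >= 18 && row.Country == "DE" && row.Name.EndsWith("n");

// The Age and Country comparisons are classified as pushdown candidates;
// the EndsWith call is a method call and is classified as residual.
foreach (var part in ConjunctSplitter.Split(predicate.Body))
{
    var kind = ConjunctSplitter.IsSimpleComparison(part) ? "pushdown" : "residual";
    Console.WriteLine($"{kind}: {part}");
}

// Hypothetical model for this sketch.
public sealed class Person
{
    public int Age { get; set; }
    public string Country { get; set; } = "";
    public string Name { get; set; } = "";
}

// Hypothetical helper for this sketch; not part of the library.
public static class ConjunctSplitter
{
    // Recursively flattens "a && b && c" into its individual conjuncts.
    public static IEnumerable<Expression> Split(Expression expression)
    {
        if (expression is BinaryExpression { NodeType: ExpressionType.AndAlso } conjunction)
        {
            foreach (var left in Split(conjunction.Left)) yield return left;
            foreach (var right in Split(conjunction.Right)) yield return right;
        }
        else
        {
            yield return expression;
        }
    }

    // Simplified rule: "row.Member <op> literal" counts as a pushdown candidate.
    // Captured variables and computed operands are deliberately not handled here.
    public static bool IsSimpleComparison(Expression expression) =>
        expression is BinaryExpression
        {
            NodeType: ExpressionType.Equal or ExpressionType.NotEqual
                or ExpressionType.LessThan or ExpressionType.LessThanOrEqual
                or ExpressionType.GreaterThan or ExpressionType.GreaterThanOrEqual,
            Left: MemberExpression { Expression: ParameterExpression },
            Right: ConstantExpression,
        };
}
```

Splitting only at `&&` keeps the result safe: dropping a conjunct from the pushed-down set can only make pruning less selective, never incorrect, as long as the dropped conjunct is still applied to the rows that are read.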
using Parquet.Query;
var rows = await ParquetQuery
    .FromDirectory<Person>("people")
    .Pushdown(filter => filter
        .Eq(row => row.Country, "DE")
        .Ge(row => row.Age, 18))
    .Where(row => row.Name.EndsWith("n"))
    .Select(row => new
    {
        row.Id,
        row.Name,
        City = row.Address.City
    })
    .ToListAsync();
Semantics:
- dataset queries can skip whole files based on partition values in the path
- Pushdown(...) is parquet-aware and plannable
- Where(...) can contain richer logic, but unsupported parts stay residual
- Select(...) drives projection and can reduce the columns read from the file
- projected queries can defer non-filter columns until after row filtering
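The first point, skipping whole files based on partition values in the path, can be sketched independently of the library. The example below assumes a Hive-style `key=value` directory layout such as `Country=DE/...`. `PartitionPaths` is a hypothetical helper and the file names are invented for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented file paths using a key=value directory layout.
var files = new[]
{
    "people/Country=DE/part-0.parquet",
    "people/Country=FR/part-0.parquet",
    "people/Country=DE/Year=2024/part-1.parquet",
};

// Keep only files whose path does not contradict Country == "DE".
var kept = files.Where(file => PartitionPaths.MayMatch(file, "Country", "DE")).ToList();
Console.WriteLine(kept.Count); // prints 2

// Hypothetical helper for this sketch; not part of the library.
public static class PartitionPaths
{
    // Collects key=value pairs from the directory segments of a path.
    public static Dictionary<string, string> Parse(string path)
    {
        var values = new Dictionary<string, string>(StringComparer.Ordinal);
        var segments = path.Split('/', '\\');

        // The last segment is the file name, so only the directories are inspected.
        foreach (var segment in segments[..^1])
        {
            var separator = segment.IndexOf('=');
            if (separator > 0)
            {
                values[segment[..separator]] = segment[(separator + 1)..];
            }
        }

        return values;
    }

    // A file is skipped only when its path states a different value for the column.
    // Files without that partition key are kept, because nothing rules them out.
    public static bool MayMatch(string path, string column, string expected) =>
        !Parse(path).TryGetValue(column, out var actual)
        || string.Equals(actual, expected, StringComparison.Ordinal);
}
```

Keeping files that lack the partition key is the conservative choice: the path alone gives no evidence either way, so the decision is left to later pruning stages and row-level filtering.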
Use PlanAsync() or ExplainAsync() to inspect what the query engine will do.
var query = ParquetQuery
    .FromFile<Person>("people.parquet")
    .Where(row => row.Age >= 18 && row.Name.StartsWith("Lu", StringComparison.Ordinal));
var explanation = await query.ExplainAsync();
Console.WriteLine(explanation);
You will see:
- selected files
- extracted pushdown predicates
- residual predicates
- selected row groups
- selected pages and candidate-row upper bounds when page pruning applies
- filter columns versus deferred columns
- whether pruning was based on partitions, statistics, bloom filters, persisted page indexes, or fallback page-index scans
- page-pruning source per row group (persisted, fallback, or unavailable)
If you want unsupported residual logic to fail fast instead of silently falling back, use:
var rows = await ParquetQuery
    .FromFile<Person>("people.parquet")
    .Where(row => row.Country == "DE" && CustomCheck(row))
    .StrictPushdown()
    .ToListAsync();
Custom pushdown predicates can be added without forking the core query engine:
- create a custom PushdownPredicate<T>
- add it through Pushdown(filter => filter.Add(...))
- register one or more IParquetPredicatePlanner<T> instances with WithPredicatePlanner(...) or WithPredicatePlanners(...)
This lets extension packages use footer metadata, sidecar indexes, or custom row-group/page pruning logic while keeping residual row-level verification in the main query pipeline.
The repository now includes a search-focused extension project at src/Parquet.Query.Extensions.Search with:
- a LuceneFooterIndexingStrategy for [ParquetLuceneIndex] string columns
- footer-resident analyzed term dictionaries per row group
- LuceneMatch(...) and LuceneFuzzy(...) query extensions backed by custom predicate planning
It also now includes an indexing-focused extension project at src/Parquet.Query.Extensions.Indexing with:
- a FooterBitmapIndexingStrategy for [ParquetFooterBitmapIndex] low-cardinality equality columns
- WithFooterIndexes() query extensions backed by footer-aware equality pruning
It also now includes a pooling-focused extension project at src/Parquet.Query.Extensions.Pooling with:
- a ParquetReaderPool that reuses open readers per file
- PrewarmAsync(...) helpers so pools can be filled before queries arrive
- BlockFileAsync(...) helpers for coordinated file replacement
- WithReaderPool() query extensions that route query execution through the pool
If you want the full extension set in one install, use kiloOhm.Parquet.Net.Query.Extensions, which brings in the core query package plus writing, indexing, search, and pooling.
The query layer forwards ParquetOptions and exposes convenience methods for common encrypted-read scenarios:
- WithParquetOptions(...)
- ConfigureParquetOptions(...)
- WithFooterKey(...)
- WithFooterSigningKey(...)
- UsePlaintextFooter(...)
- WithAadPrefix(...)
- UseCtrVariant(...)
- WithColumnKeyResolver(...)
Example:
var rows = await ParquetQuery
    .FromFile<Person>("encrypted.parquet")
    .WithFooterKey("0123456789ABCDEF")
    .WithColumnKeyResolver((path, metadata) =>
    {
        if (path.Count > 0 && path[^1] == "Name")
        {
            return "0011223344556677";
        }
        return null;
    })
    .ToListAsync();
Repeated queries automatically reuse cached planning metadata through a bounded in-memory cache in the core package.
You can disable it per query:
var uncachedRows = await ParquetQuery
    .FromFile<Person>("people.parquet")
    .WithoutQueryCache()
    .Where(row => row.Country == "DE")
    .ToListAsync();
Or attach your own cache implementation:
IParquetQueryCache cache = new LruParquetQueryCache(capacity: 512);
var rows = await ParquetQuery
    .FromDirectory<Person>("people")
    .WithQueryCache(cache)
    .Where(row => row.Country == "DE")
    .ToListAsync();
- Partial materialization is aimed at nested class graphs with scalar leaves
- More complex shapes fall back to full source materialization
- Page pruning depends on the public page reader/index APIs in kiloOhm.Parquet.Net
- When persisted page indexes are absent, the query layer can fall back to computed in-memory column indexes for supported types
- This is not a general-purpose IQueryable provider
That tradeoff is deliberate: the API stays storage-aware and keeps correctness simple.
Requirements:
- .NET 8 SDK
Restore, build, and test:
dotnet build Parquet.Query.slnx
dotnet test Parquet.Query.slnx
Run the included benchmarks:
dotnet run -c Release --project benchmarks/Parquet.Query.Benchmarks/Parquet.Query.Benchmarks.csproj
The benchmark project compares:
- full-file deserialize plus in-memory filtering
- query execution with row-group pushdown
- query execution with page pruning and projection