Context. reasonable is the OWL 2 RL reasoner behind pgRDF's pgrdf.materialize(). A LUBM
ingest/scale benchmark (native PostgreSQL 17 + pgRDF, on a 32 vCPU / 256 GiB AMD VM) found that once ingest was parallelised (down to ~3 µs/triple, 32 cores), materialization is the dominant wall and it is single-thread-bound in reasonable. This proposal pins the root cause in the code and lays out fixes.
The measurement (what motivates this)
| LUBM-N |
base triples |
materialize wall |
within materialize |
| 250 |
34.5 M |
~609 s |
~50 % reason / ~50 % write-back |
| 500 |
69.1 M |
2,578 s (43 min) |
81 % reason / 19 % write-back |
- Materialize was 88 % of the total LUBM-500 wall (ingest 195 s, index 47 s, materialize 2,578 s).
- The reasoning phase ran at ~1 core for 56–81 % of its wall (sampled at 1 s on a 32-core box); it
never approached saturation. RAM peaked ~146 / 256 GiB — not memory-bound; CPU-time-bound on one core.
- The reasoning cost is super-linear in triples (it grew from ~50 % to 81 % of materialize as N
doubled), so at scale it dominates — the write-back half is comparatively negligible.
Conclusion: the hardware is not the limit; the single-threaded reasoner is.
Root cause (in this repo)
lib/src/reasoner.rs:
- The engine is
datafrog (use datafrog::{Iteration, Relation, Variable}, lib/Cargo.toml) — a deliberately single-threaded semi-naive Datalog evaluator. It has no internal parallelism.
run_fixpoint_loop (line ~1242) is the hot path:
while changed {
while self.iter1.changed() {
… ~90 from_join / from_map / from_antijoin rule applications … // all OWL-RL rules
}
}
All ~90 rule applications run sequentially on one Iteration (self.iter1) on one core, every round, until fixpoint. For LUBM-500 that is tens of millions of tuples × many rounds = the 35-min wall.
- Two structural facts shape any fix:
- Shared
Variables — rules read/write a shared set (spo, pso, osp, rdf_type,
all_triples_input, …); outputs of one rule feed others (datafrog handles this across rounds).
RefCell side-effects inside join closures — e.g. self.instances.borrow_mut().insert(tup),
self.intersections, self.unions, self.rdf_type_inv.borrow_mut(). These are not thread-safe and
assume single-threaded evaluation.
Fixes (recommended order)
1. Surgical first — parallelise datafrog's join/sort kernel (no reasoner rewrite)
datafrog's hot cost is sorting Relations + the leapfrog/sort-merge from_join over large inputs.
Fork datafrog (or vendor a patched copy) and:
Relation::from* → rayon::par_sort_unstable instead of sort (the sorts on multi-M-tuple relations
are the bulk of each round and parallelise trivially);
from_join → partition the probe by key-range and join partitions in parallel, concatenating results.
This is transparent to reasoner.rs (drop-in datafrog), keeps all semantics, and should light up cores on exactly the sort/merge-heavy rounds that dominate LUBM. Lowest risk, fastest to prototype, and the right experiment to quantify the ceiling before a bigger rewrite. Expected: several× on the reason phase, bounded by Amdahl (the per-round changed() bookkeeping stays serial).
2. Scalable target — port the fixpoint to differential-dataflow
differential-dataflow (same author as datafrog) runs the same join model across N timely-dataflow workers (multi-thread, even multi-process / out-of-core). The mapping is close:
Variable+from_join → DD Collection+iterate+join; the RefCell side-effects become derived collections. This is the principled answer for billion-triple closures — true N-core scaling and a path past the in-RAM ceiling. Larger effort (the ~2,260-line reasoner.rs rules translate, but DD's arrange/iterate API is more involved). Recommended once pt1 has quantified the win.
3. Not recommended yet — naive per-rule parallelism on datafrog
Running the ~90 rules of one round on separate threads is blocked by the two structural facts above (shared Variables + RefCell side-effects). It would require first refactoring every rule body into a pure function over immutable Relation snapshots (no borrow_mut), then a serial merge + changed().
That refactor is most of the work of pt1/pt2 with less payoff — skip it.
Orthogonal, cheap — rule-set pruning
Semi-naive datafrog already makes rules with empty recent inputs nearly free, so most non-firing OWL-RL rules (functional/inverse-functional/asymmetric/disjoint/sameAs on data that lacks them, common for LUBM) cost little. Confirm with a per-rule timer; if the fixed per-round bookkeeping over ~90 Variables is material, gate unused rules by a one-time axiom presence check. Minor lever vs pt1/pt2.
Validation plan
- Correctness gate: closure triple count must match the current reasoner exactly (LUBM-N has a known closure; pgRDF's bench compares
total_quads and the standard LUBM query answer counts).
- Re-run LUBM-250 and LUBM-500 materialize; measure core utilisation (1 s sampler) and wall.
Target: reason phase utilisation from ~1 core → ≥ (cores−2), wall down proportionally.
- Bench harness + numbers live in the pgRDF benchmark workspace (LUBM-250/500 reports + a cross-run COMPARE). Hand the patched build there for an A/B against the 2,578 s LUBM-500 baseline.
@gtfierro - i'd be happy to serach for a solution if this is of interest?
flat materialize drops to single thread in the middle
And below earlier test where report did not include materialisation step

Context.
reasonableis the OWL 2 RL reasoner behind pgRDF'spgrdf.materialize(). A LUBMingest/scale benchmark (native PostgreSQL 17 + pgRDF, on a 32 vCPU / 256 GiB AMD VM) found that once ingest was parallelised (down to ~3 µs/triple, 32 cores), materialization is the dominant wall and it is single-thread-bound in
reasonable. This proposal pins the root cause in the code and lays out fixes.The measurement (what motivates this)
never approached saturation. RAM peaked ~146 / 256 GiB — not memory-bound; CPU-time-bound on one core.
doubled), so at scale it dominates — the write-back half is comparatively negligible.
Conclusion: the hardware is not the limit; the single-threaded reasoner is.
Root cause (in this repo)
lib/src/reasoner.rs:datafrog(use datafrog::{Iteration, Relation, Variable},lib/Cargo.toml) — a deliberately single-threaded semi-naive Datalog evaluator. It has no internal parallelism.run_fixpoint_loop(line ~1242) is the hot path:Iteration(self.iter1) on one core, every round, until fixpoint. For LUBM-500 that is tens of millions of tuples × many rounds = the 35-min wall.Variables — rules read/write a shared set (spo,pso,osp,rdf_type,all_triples_input, …); outputs of one rule feed others (datafrog handles this across rounds).RefCellside-effects inside join closures — e.g.self.instances.borrow_mut().insert(tup),self.intersections,self.unions,self.rdf_type_inv.borrow_mut(). These are not thread-safe andassume single-threaded evaluation.
Fixes (recommended order)
1. Surgical first — parallelise datafrog's join/sort kernel (no reasoner rewrite)
datafrog's hot cost is sorting
Relations + the leapfrog/sort-mergefrom_joinover large inputs.Fork datafrog (or vendor a patched copy) and:
Relation::from*→rayon::par_sort_unstableinstead ofsort(the sorts on multi-M-tuple relationsare the bulk of each round and parallelise trivially);
from_join→ partition the probe by key-range and join partitions in parallel, concatenating results.This is transparent to
reasoner.rs(drop-in datafrog), keeps all semantics, and should light up cores on exactly the sort/merge-heavy rounds that dominate LUBM. Lowest risk, fastest to prototype, and the right experiment to quantify the ceiling before a bigger rewrite. Expected: several× on the reason phase, bounded by Amdahl (the per-roundchanged()bookkeeping stays serial).2. Scalable target — port the fixpoint to
differential-dataflowdifferential-dataflow(same author as datafrog) runs the same join model across N timely-dataflow workers (multi-thread, even multi-process / out-of-core). The mapping is close:Variable+from_join→ DDCollection+iterate+join; theRefCellside-effects become derived collections. This is the principled answer for billion-triple closures — true N-core scaling and a path past the in-RAM ceiling. Larger effort (the ~2,260-linereasoner.rsrules translate, but DD's arrange/iterate API is more involved). Recommended once pt1 has quantified the win.3. Not recommended yet — naive per-rule parallelism on datafrog
Running the ~90 rules of one round on separate threads is blocked by the two structural facts above (shared
Variables +RefCellside-effects). It would require first refactoring every rule body into a pure function over immutableRelationsnapshots (noborrow_mut), then a serial merge +changed().That refactor is most of the work of pt1/pt2 with less payoff — skip it.
Orthogonal, cheap — rule-set pruning
Semi-naive datafrog already makes rules with empty
recentinputs nearly free, so most non-firing OWL-RL rules (functional/inverse-functional/asymmetric/disjoint/sameAs on data that lacks them, common for LUBM) cost little. Confirm with a per-rule timer; if the fixed per-round bookkeeping over ~90Variables is material, gate unused rules by a one-time axiom presence check. Minor lever vs pt1/pt2.Validation plan
total_quadsand the standard LUBM query answer counts).Target: reason phase utilisation from ~1 core → ≥ (cores−2), wall down proportionally.
@gtfierro - i'd be happy to serach for a solution if this is of interest?
flat materialize drops to single thread in the middle
And below earlier test where report did not include materialisation step