Skip to content

Proposal: parallelise the reasonable OWL-RL fixpoint #57

@styk-tv

Description

@styk-tv

Context. reasonable is the OWL 2 RL reasoner behind pgRDF's pgrdf.materialize(). A LUBM
ingest/scale benchmark (native PostgreSQL 17 + pgRDF, on a 32 vCPU / 256 GiB AMD VM) found that once ingest was parallelised (down to ~3 µs/triple, 32 cores), materialization is the dominant wall and it is single-thread-bound in reasonable. This proposal pins the root cause in the code and lays out fixes.

The measurement (what motivates this)

LUBM-N base triples materialize wall within materialize
250 34.5 M ~609 s ~50 % reason / ~50 % write-back
500 69.1 M 2,578 s (43 min) 81 % reason / 19 % write-back
  • Materialize was 88 % of the total LUBM-500 wall (ingest 195 s, index 47 s, materialize 2,578 s).
  • The reasoning phase ran at ~1 core for 56–81 % of its wall (sampled at 1 s on a 32-core box); it
    never approached saturation. RAM peaked ~146 / 256 GiB — not memory-bound; CPU-time-bound on one core.
  • The reasoning cost is super-linear in triples (it grew from ~50 % to 81 % of materialize as N
    doubled), so at scale it dominates — the write-back half is comparatively negligible.

Conclusion: the hardware is not the limit; the single-threaded reasoner is.

Root cause (in this repo)

lib/src/reasoner.rs:

  • The engine is datafrog (use datafrog::{Iteration, Relation, Variable}, lib/Cargo.toml) — a deliberately single-threaded semi-naive Datalog evaluator. It has no internal parallelism.
  • run_fixpoint_loop (line ~1242) is the hot path:
    while changed {
        while self.iter1.changed() {
            … ~90 from_join / from_map / from_antijoin rule applications …   // all OWL-RL rules
        }
    }
    
    All ~90 rule applications run sequentially on one Iteration (self.iter1) on one core, every round, until fixpoint. For LUBM-500 that is tens of millions of tuples × many rounds = the 35-min wall.
  • Two structural facts shape any fix:
    1. Shared Variables — rules read/write a shared set (spo, pso, osp, rdf_type,
      all_triples_input, …); outputs of one rule feed others (datafrog handles this across rounds).
    2. RefCell side-effects inside join closures — e.g. self.instances.borrow_mut().insert(tup),
      self.intersections, self.unions, self.rdf_type_inv.borrow_mut(). These are not thread-safe and
      assume single-threaded evaluation.

Fixes (recommended order)

1. Surgical first — parallelise datafrog's join/sort kernel (no reasoner rewrite)

datafrog's hot cost is sorting Relations + the leapfrog/sort-merge from_join over large inputs.
Fork datafrog (or vendor a patched copy) and:

  • Relation::from*rayon::par_sort_unstable instead of sort (the sorts on multi-M-tuple relations
    are the bulk of each round and parallelise trivially);
  • from_join → partition the probe by key-range and join partitions in parallel, concatenating results.

This is transparent to reasoner.rs (drop-in datafrog), keeps all semantics, and should light up cores on exactly the sort/merge-heavy rounds that dominate LUBM. Lowest risk, fastest to prototype, and the right experiment to quantify the ceiling before a bigger rewrite. Expected: several× on the reason phase, bounded by Amdahl (the per-round changed() bookkeeping stays serial).

2. Scalable target — port the fixpoint to differential-dataflow

differential-dataflow (same author as datafrog) runs the same join model across N timely-dataflow workers (multi-thread, even multi-process / out-of-core). The mapping is close:
Variable+from_join → DD Collection+iterate+join; the RefCell side-effects become derived collections. This is the principled answer for billion-triple closures — true N-core scaling and a path past the in-RAM ceiling. Larger effort (the ~2,260-line reasoner.rs rules translate, but DD's arrange/iterate API is more involved). Recommended once pt1 has quantified the win.

3. Not recommended yet — naive per-rule parallelism on datafrog

Running the ~90 rules of one round on separate threads is blocked by the two structural facts above (shared Variables + RefCell side-effects). It would require first refactoring every rule body into a pure function over immutable Relation snapshots (no borrow_mut), then a serial merge + changed().
That refactor is most of the work of pt1/pt2 with less payoff — skip it.

Orthogonal, cheap — rule-set pruning

Semi-naive datafrog already makes rules with empty recent inputs nearly free, so most non-firing OWL-RL rules (functional/inverse-functional/asymmetric/disjoint/sameAs on data that lacks them, common for LUBM) cost little. Confirm with a per-rule timer; if the fixed per-round bookkeeping over ~90 Variables is material, gate unused rules by a one-time axiom presence check. Minor lever vs pt1/pt2.

Validation plan

  1. Correctness gate: closure triple count must match the current reasoner exactly (LUBM-N has a known closure; pgRDF's bench compares total_quads and the standard LUBM query answer counts).
  2. Re-run LUBM-250 and LUBM-500 materialize; measure core utilisation (1 s sampler) and wall.
    Target: reason phase utilisation from ~1 core → ≥ (cores−2), wall down proportionally.
  3. Bench harness + numbers live in the pgRDF benchmark workspace (LUBM-250/500 reports + a cross-run COMPARE). Hand the patched build there for an A/B against the 2,578 s LUBM-500 baseline.

@gtfierro - i'd be happy to serach for a solution if this is of interest?

flat materialize drops to single thread in the middle

Image

And below earlier test where report did not include materialisation step

Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions