Proposal: parallelise the `reasonable` OWL-RL fixpoint

**Context.** `reasonable` is the OWL 2 RL reasoner behind pgRDF's `pgrdf.materialize()`. A LUBM
ingest/scale benchmark (native PostgreSQL 17 + pgRDF, on a 32 vCPU / 256 GiB AMD VM) found that once ingest was parallelised (down to ~3 µs/triple, 32 cores), **materialization is the dominant wall and it is single-thread-bound** in `reasonable`. This proposal pins the root cause in the code and lays out fixes.

## The measurement (what motivates this)

| LUBM-N | base triples | materialize wall | within materialize |
|---|---|---|---|
| 250 | 34.5 M | ~609 s | ~50 % reason / ~50 % write-back |
| 500 | 69.1 M | **2,578 s (43 min)** | **81 % reason / 19 % write-back** |

- Materialize was **88 % of the total LUBM-500 wall** (ingest 195 s, index 47 s, materialize 2,578 s).
- The reasoning phase ran at **~1 core for 56–81 % of its wall** (sampled at 1 s on a 32-core box); it
  **never approached saturation**. RAM peaked ~146 / 256 GiB — *not* memory-bound; **CPU-time-bound on one core**.
- The reasoning cost is **super-linear** in triples (it grew from ~50 % to 81 % of materialize as N
  doubled), so at scale it dominates — the write-back half is comparatively negligible.

**Conclusion: the hardware is not the limit; the single-threaded reasoner is.**

## Root cause (in this repo)

`lib/src/reasoner.rs`:
- The engine is **`datafrog`** (`use datafrog::{Iteration, Relation, Variable}`, `lib/Cargo.toml`) — a deliberately *single-threaded* semi-naive Datalog evaluator. It has no internal parallelism.
- `run_fixpoint_loop` (line ~1242) is the hot path:
  ```
  while changed {
      while self.iter1.changed() {
          … ~90 from_join / from_map / from_antijoin rule applications …   // all OWL-RL rules
      }
  }
  ```
  All ~90 rule applications run **sequentially on one `Iteration` (`self.iter1`) on one core**, every round, until fixpoint. For LUBM-500 that is tens of millions of tuples × many rounds = the 35-min wall.
- Two structural facts shape any fix:
  1. **Shared `Variable`s** — rules read/write a shared set (`spo`, `pso`, `osp`, `rdf_type`,
     `all_triples_input`, …); outputs of one rule feed others (datafrog handles this across rounds).
  2. **`RefCell` side-effects inside join closures** — e.g. `self.instances.borrow_mut().insert(tup)`,
     `self.intersections`, `self.unions`, `self.rdf_type_inv.borrow_mut()`. These are not thread-safe and
     assume single-threaded evaluation.

## Fixes (recommended order)

### 1. Surgical first — parallelise datafrog's join/sort kernel (no reasoner rewrite)
datafrog's hot cost is **sorting `Relation`s + the leapfrog/sort-merge `from_join`** over large inputs.
Fork datafrog (or vendor a patched copy) and:
- `Relation::from*` → `rayon::par_sort_unstable` instead of `sort` (the sorts on multi-M-tuple relations
  are the bulk of each round and parallelise trivially);
- `from_join` → partition the probe by key-range and join partitions in parallel, concatenating results.

This is **transparent to `reasoner.rs`** (drop-in datafrog), keeps all semantics, and should light up cores on exactly the sort/merge-heavy rounds that dominate LUBM. Lowest risk, fastest to prototype, and the right experiment to *quantify the ceiling* before a bigger rewrite. **Expected: several× on the reason phase**, bounded by Amdahl (the per-round `changed()` bookkeeping stays serial).

### 2. Scalable target — port the fixpoint to `differential-dataflow`
`differential-dataflow` (same author as datafrog) runs the **same join model across N timely-dataflow workers** (multi-thread, even multi-process / out-of-core). The mapping is close:
`Variable`+`from_join` → DD `Collection`+`iterate`+`join`; the `RefCell` side-effects become derived collections. This is the principled answer for **billion-triple** closures — true N-core scaling and a path past the in-RAM ceiling. Larger effort (the ~2,260-line `reasoner.rs` rules translate, but DD's arrange/iterate API is more involved). Recommended once pt1 has quantified the win.

### 3. Not recommended yet — naive per-rule parallelism on datafrog
Running the ~90 rules of one round on separate threads is blocked by the two structural facts above (shared `Variable`s + `RefCell` side-effects). It would require first refactoring every rule body into a **pure function over immutable `Relation` snapshots** (no `borrow_mut`), then a serial merge + `changed()`.
That refactor is most of the work of pt1/pt2 with less payoff — skip it.

### Orthogonal, cheap — rule-set pruning
Semi-naive datafrog already makes rules with empty `recent` inputs nearly free, so most non-firing OWL-RL rules (functional/inverse-functional/asymmetric/disjoint/sameAs on data that lacks them, common for LUBM) cost little. Confirm with a per-rule timer; if the fixed per-round bookkeeping over ~90 `Variable`s is material, gate unused rules by a one-time axiom presence check. Minor lever vs pt1/pt2.

## Validation plan
1. Correctness gate: closure triple count must match the current reasoner exactly (LUBM-N has a known closure; pgRDF's bench compares `total_quads` and the standard LUBM query answer counts).
2. Re-run LUBM-250 and LUBM-500 materialize; measure **core utilisation (1 s sampler)** and **wall**.
   Target: reason phase utilisation from ~1 core → ≥ (cores−2), wall down proportionally.
3. Bench harness + numbers live in the pgRDF benchmark workspace (LUBM-250/500 reports + a cross-run  COMPARE). Hand the patched build there for an A/B against the 2,578 s LUBM-500 baseline.


@gtfierro - i'd be happy to serach for a solution if this is of interest?

flat materialize drops to single thread in the middle

<img width="1012" height="1255" alt="Image" src="https://github.com/user-attachments/assets/18d6eeec-c2d5-481c-8f38-d5cbe5ff07da" />


And below earlier test where report did not include materialisation step

<img width="901" height="1208" alt="Image" src="https://github.com/user-attachments/assets/013b1deb-a3e2-401d-be26-454861b3e3d8" />

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: parallelise the `reasonable` OWL-RL fixpoint #57

The measurement (what motivates this)

Root cause (in this repo)

Fixes (recommended order)

1. Surgical first — parallelise datafrog's join/sort kernel (no reasoner rewrite)

2. Scalable target — port the fixpoint to `differential-dataflow`

3. Not recommended yet — naive per-rule parallelism on datafrog

Orthogonal, cheap — rule-set pruning

Validation plan

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

LUBM-N	base triples	materialize wall	within materialize
250	34.5 M	~609 s	~50 % reason / ~50 % write-back
500	69.1 M	2,578 s (43 min)	81 % reason / 19 % write-back

Proposal: parallelise the reasonable OWL-RL fixpoint #57

Description

The measurement (what motivates this)

Root cause (in this repo)

Fixes (recommended order)

1. Surgical first — parallelise datafrog's join/sort kernel (no reasoner rewrite)

2. Scalable target — port the fixpoint to differential-dataflow

3. Not recommended yet — naive per-rule parallelism on datafrog

Orthogonal, cheap — rule-set pruning

Validation plan

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Proposal: parallelise the `reasonable` OWL-RL fixpoint #57

2. Scalable target — port the fixpoint to `differential-dataflow`