| 1 | +# PRD: Semantic Data Pipeline Remediation |
| 2 | + |
| 3 | +**Date**: 2026-03-25 |
| 4 | +**Status**: APPROVED FOR EXECUTION |
| 5 | +**Priority**: P0 — Last major refactor |
| 6 | +**Principle**: Markdown is the source of truth. Neo4j is speed middleware. The GPU visualises the semantic structure that already exists in the data. |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## 1. Problem Statement |
| 11 | + |
| 12 | +The Logseq markdown files contain rich semantic relationships (9+ types: is-subclass-of, has-part, requires, enables, depends-on, relates-to, bridges-to/from, explicit_link, namespace). The parsing pipeline extracts all of them, but **only wikilinks survive** to the client graph (the `explicit_link` type): 8 of 9 relationship types are lost at various pipeline stages, and 110,209 OWL axiom nodes sit isolated in Neo4j. |
| 13 | + |
| 14 | +**Quantitative gap**: 490 edges reach the client, about 19% of the roughly 2,600 potential relationships in the dataset (980 EDGE + 623 SUBCLASS_OF + ~1,000 from axiom materialisation). |
| 15 | + |
| 16 | +--- |
| 17 | + |
| 18 | +## 2. Root Cause Analysis (from 3-agent audit) |
| 19 | + |
| 20 | +### 7 Data Loss Points |
| 21 | + |
| 22 | +| # | Stage | What's Lost | Root Cause | Severity | |
| 23 | +|---|-------|-------------|------------|----------| |
| 24 | +| **DL1** | OwlClass storage | 8/9 relationship types | `add_owl_class()` stores only `parent_classes` (as SUBCLASS_OF); has-part, requires, enables, etc. are dropped | **CRITICAL** |
| 25 | +| **DL2** | Graph load (OwlClass path) | Non-hierarchical edges | `load_graph()` only queries SUBCLASS_OF between OwlClasses | HIGH | |
| 26 | +| **DL3** | CSR construction | Edge type metadata | ForceComputeActor flattens edges to `(target, weight)`, discarding edge_type | HIGH | |
| 27 | +| **DL4** | GPU analytics → AppState | cluster_id, anomaly_score, community_id | ClusteringActor/AnomalyDetectionActor compute but never write to `app_state.node_analytics` | HIGH | |
| 28 | +| **DL5** | Binary protocol | Analytics fields | Wire format V3 carries the fields, but the TypeScript client never reads them | MEDIUM |
| 29 | +| **DL6** | OwlAxiom → edges | 110K axioms | Stored as isolated nodes, never materialised as graph edges | HIGH | |
| 30 | +| **DL7** | Constraint pipeline | OWL axiom → physics forces | OntologyConstraintTranslator exists, but `apply_ontology_constraints()` is never called | MEDIUM |
| 31 | + |
| 32 | +### Existing Code That Works (But Is Disconnected) |
| 33 | + |
| 34 | +| Component | File | Status | |
| 35 | +|-----------|------|--------| |
| 36 | +| OntologyParser extracts 9+ relationship types | `parsers/ontology_parser.rs:354-397` | **Working** | |
| 37 | +| GitHubSyncService creates typed edges with OWL IRIs | `github_sync_service.rs:382-490` | **Working** | |
| 38 | +| GraphNode EDGE relationships in Neo4j | `neo4j_adapter.rs:582` | **Working** | |
| 39 | +| SemanticForcesActor (DAG, type clustering, collision) | `gpu/semantic_forces_actor.rs` | **Spawned, no data** | |
| 40 | +| OntologyConstraintActor (axiom → forces) | `gpu/ontology_constraint_actor.rs` | **Spawned, no data** | |
| 41 | +| OntologyConstraintTranslator (5 constraint types) | `physics/ontology_constraints.rs` | **Implemented, never called** | |
| 42 | +| WhelkInferenceEngine (transitive closure) | `adapters/whelk_inference_engine.rs` | **Working, output unused** | |
| 43 | +| semantic_forces.cu CUDA kernel | `utils/semantic_forces.cu` | **Compiled, never invoked** | |
| 44 | +| Binary protocol V3 analytics fields | `utils/binary_protocol.rs:40-50` | **Declared, always zero** | |
| 45 | +| ClusterHulls component | `graph/components/ClusterHulls.tsx` | **Renders, no cluster data** | |
| 46 | + |
| 47 | +--- |
| 48 | + |
| 49 | +## 3. Design Principle |
| 50 | + |
| 51 | +**Do not create new systems. Wire the existing ones together.** |
| 52 | + |
| 53 | +The architecture is sound. Every component exists. The problem is 7 broken wires between them. |
| 54 | + |
| 55 | +--- |
| 56 | + |
| 57 | +## 4. Remediation Plan |
| 58 | + |
| 59 | +### Phase 1: Data Integrity (Fix DL1, DL2, DL6) |
| 60 | +*Goal: All markdown relationships reach Neo4j as edges* |
| 61 | + |
| 62 | +#### 1.1 Store ALL relationship types as Neo4j edges |
| 63 | +**File**: `src/adapters/neo4j_ontology_repository.rs` — `add_owl_class()` |
| 64 | +**Change**: After storing SUBCLASS_OF for parent_classes, also store: |
| 65 | +- `has_part` → `:RELATES {relationship_type: "has_part", owl_property_iri: "mv:hasPart"}` |
| 66 | +- `requires` → `:RELATES {relationship_type: "requires"}` |
| 67 | +- `depends_on`, `enables`, `relates_to`, `bridges_to`, `bridges_from` → `:RELATES` with the matching `relationship_type` |
| 68 | +**Impact**: ~500+ new Neo4j edges from existing parsed data |
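The mapping in 1.1 could be sketched as follows. This is a minimal illustration under stated assumptions, not the existing `add_owl_class()` code: `RelationKind`, `MERGE_RELATES`, `RelatesParams` and `relates_params` are hypothetical names, and only the `mv:hasPart` IRI comes from this PRD, so the other kinds carry no IRI here.

```rust
/// Relationship kinds from section 1.1 that would be stored as `:RELATES` edges.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelationKind {
    HasPart,
    Requires,
    DependsOn,
    Enables,
    RelatesTo,
    BridgesTo,
    BridgesFrom,
}

impl RelationKind {
    /// Value for the `relationship_type` edge property.
    pub fn relationship_type(self) -> &'static str {
        match self {
            RelationKind::HasPart => "has_part",
            RelationKind::Requires => "requires",
            RelationKind::DependsOn => "depends_on",
            RelationKind::Enables => "enables",
            RelationKind::RelatesTo => "relates_to",
            RelationKind::BridgesTo => "bridges_to",
            RelationKind::BridgesFrom => "bridges_from",
        }
    }

    /// OWL property IRI. The PRD names only `mv:hasPart`; the rest stay `None`
    /// here rather than being guessed.
    pub fn owl_property_iri(self) -> Option<&'static str> {
        match self {
            RelationKind::HasPart => Some("mv:hasPart"),
            _ => None,
        }
    }
}

/// Hypothetical parameterised statement; MERGE keeps repeated syncs idempotent.
/// Setting a property to null in Cypher removes it, so a `None` IRI is harmless.
pub const MERGE_RELATES: &str = "\
MATCH (s:OwlClass {iri: $source}), (t:OwlClass {iri: $target}) \
MERGE (s)-[r:RELATES {relationship_type: $relationship_type}]->(t) \
SET r.owl_property_iri = $owl_property_iri";

/// One edge write: (source IRI, target IRI, relationship_type, optional property IRI).
pub type RelatesParams = (String, String, &'static str, Option<&'static str>);

/// Expands one class's parsed relationships into per-edge parameter tuples.
pub fn relates_params(
    source_iri: &str,
    relations: &[(RelationKind, Vec<String>)],
) -> Vec<RelatesParams> {
    relations
        .iter()
        .flat_map(|(kind, targets)| {
            targets.iter().map(move |target| {
                (
                    source_iri.to_string(),
                    target.clone(),
                    kind.relationship_type(),
                    kind.owl_property_iri(),
                )
            })
        })
        .collect()
}
```

Each tuple would be bound to `MERGE_RELATES` by whichever Neo4j driver call the repository already uses; that binding is deliberately left out because it depends on the existing adapter code.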
| 69 | + |
| 70 | +#### 1.2 Materialise SubClassOf axioms as SUBCLASS_OF edges |
| 71 | +**File**: `src/adapters/neo4j_ontology_repository.rs` |
| 72 | +**Change**: After whelk reasoning, run: |
| 73 | +```cypher |
| 74 | +MATCH (a:OwlAxiom {axiom_type: "SubClassOf"}) |
| 75 | +MATCH (s:OwlClass {iri: a.subject}) |
| 76 | +MATCH (o:OwlClass {iri: a.object}) |
| 77 | +MERGE (s)-[r:SUBCLASS_OF]->(o) ON CREATE SET r.is_inferred = true // reuse an asserted edge instead of duplicating it |
| 78 | +``` |
| 79 | +**Impact**: Transitive closure edges from 110K axioms |
| 80 | + |
| 81 | +#### 1.3 Load ALL relationship types in load_graph() |
| 82 | +**File**: `src/adapters/neo4j_adapter.rs` — `load_graph()` |
| 83 | +**Change**: After loading EDGE relationships, also query: |
| 84 | +```cypher |
| 85 | +MATCH (s)-[r:RELATES|SUBCLASS_OF]->(t) WHERE ... |
| 86 | +``` |
| 87 | +Map to Edge objects with appropriate edge_type and weight. |
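One way the relationship-to-`edge_type` mapping might look is sketched below. The numeric codes are the ones listed in 2.1; the grouping of individual relationship types into those categories is an assumption made for illustration, and `classify_edge` is a hypothetical helper. Weights are left to the caller because this PRD does not specify them.

```rust
/// Edge type codes as listed in section 2.1 of this PRD.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EdgeType {
    Explicit = 0,
    Subclass = 1,
    Structural = 2,
    Dependency = 3,
    Associative = 4,
    Bridge = 5,
}

/// Classifies a Neo4j relationship by its label and, for `:RELATES`,
/// its `relationship_type` property.
/// ASSUMPTION: which relationship falls into which category is a guess;
/// unknown combinations return `None` so they can be logged, not dropped silently.
pub fn classify_edge(label: &str, relationship_type: Option<&str>) -> Option<EdgeType> {
    match (label, relationship_type) {
        ("EDGE", _) => Some(EdgeType::Explicit),
        ("SUBCLASS_OF", _) => Some(EdgeType::Subclass),
        ("RELATES", Some("has_part")) => Some(EdgeType::Structural),
        ("RELATES", Some("requires" | "depends_on" | "enables")) => Some(EdgeType::Dependency),
        ("RELATES", Some("relates_to")) => Some(EdgeType::Associative),
        ("RELATES", Some("bridges_to" | "bridges_from")) => Some(EdgeType::Bridge),
        _ => None,
    }
}
```

Returning `Option` rather than defaulting to `Explicit` keeps an unmapped relationship type visible during verification of the edge counts.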
| 88 | + |
| 89 | +### Phase 2: GPU Pipeline (Fix DL3, DL4) |
| 90 | +*Goal: Edge types and analytics reach the GPU and flow back* |
| 91 | + |
| 92 | +#### 2.1 Extend CSR with edge type buffer |
| 93 | +**File**: `src/utils/unified_gpu_compute/construction.rs` |
| 94 | +**Change**: Add `edge_types: DeviceBuffer<u8>` parallel to `edge_col_indices`. Upload edge type enum (0=explicit, 1=subclass, 2=structural, 3=dependency, 4=associative, 5=bridge). |
| 95 | +**Impact**: `semantic_forces.cu` can read edge types for weighted springs |
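A host-side sketch of the parallel buffer idea follows. It assumes the CSR arrays are staged in `Vec`s before upload; the `Edge` and `Csr` types and `build_csr` are hypothetical and do not mirror the real `construction.rs`. The point it illustrates is that `edge_types` must be written at the same slot as `col_indices` so the two stay index-aligned.

```rust
/// Hypothetical host-side edge record; `edge_type` uses the 0..=5 codes from 2.1.
pub struct Edge {
    pub source: u32,
    pub target: u32,
    pub weight: f32,
    pub edge_type: u8,
}

/// CSR arrays staged on the host before upload to the device.
pub struct Csr {
    pub row_offsets: Vec<u32>,
    pub col_indices: Vec<u32>,
    pub weights: Vec<f32>,
    /// Parallel to `col_indices`: `edge_types[i]` describes edge `i`.
    pub edge_types: Vec<u8>,
}

/// Builds CSR with a counting sort over source nodes.
/// ASSUMPTION: every `source` is `< num_nodes`; otherwise indexing panics.
pub fn build_csr(num_nodes: usize, edges: &[Edge]) -> Csr {
    // Pass 1: count the out-degree of each node, shifted by one slot.
    let mut row_offsets = vec![0u32; num_nodes + 1];
    for e in edges {
        row_offsets[e.source as usize + 1] += 1;
    }
    // Prefix sum turns the counts into row start offsets.
    for i in 0..num_nodes {
        row_offsets[i + 1] += row_offsets[i];
    }
    // Pass 2: scatter each edge into its row, writing all three
    // per-edge arrays at the same slot so they stay aligned.
    let mut cursor: Vec<u32> = row_offsets[..num_nodes].to_vec();
    let mut col_indices = vec![0u32; edges.len()];
    let mut weights = vec![0f32; edges.len()];
    let mut edge_types = vec![0u8; edges.len()];
    for e in edges {
        let row = e.source as usize;
        let slot = cursor[row] as usize;
        col_indices[slot] = e.target;
        weights[slot] = e.weight;
        edge_types[slot] = e.edge_type;
        cursor[row] += 1;
    }
    Csr { row_offsets, col_indices, weights, edge_types }
}
```

A `u8` per edge adds one byte per entry to the upload, which is small next to the existing index and weight arrays, so the medium risk noted for this phase lies mainly in keeping every consumer of the layout in step.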
| 96 | + |
| 97 | +#### 2.2 Wire ClusteringActor → app_state.node_analytics |
| 98 | +**File**: `src/actors/gpu/clustering_actor.rs` |
| 99 | +**Change**: After computing cluster assignments, send results to `ClientCoordinatorActor` or write directly to `app_state.node_analytics`. |
| 100 | +**Impact**: Binary protocol V3 carries real cluster_id; anomaly_score additionally requires the same wiring for AnomalyDetectionActor (see DL4) |
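The "write directly" option might look like the sketch below. The shape of `app_state.node_analytics` is not given in this PRD, so the `NodeAnalytics` struct, its field types, the `SharedAnalytics` alias and `publish_cluster_assignments` are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical per-node analytics record; field types are assumed.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct NodeAnalytics {
    pub cluster_id: u32,
    pub anomaly_score: f32,
    pub community_id: u32,
}

/// Assumed shape of the shared analytics map, keyed by node id.
pub type SharedAnalytics = Arc<RwLock<HashMap<u32, NodeAnalytics>>>;

/// Writes cluster ids into the shared map and returns how many nodes were updated.
/// Only `cluster_id` is touched, so scores written by another actor survive.
pub fn publish_cluster_assignments(
    analytics: &SharedAnalytics,
    node_ids: &[u32],
    cluster_ids: &[u32],
) -> usize {
    let mut map = analytics.write().expect("analytics lock poisoned");
    let mut written = 0;
    // `zip` stops at the shorter slice, so a length mismatch cannot index out of bounds.
    for (&node_id, &cluster_id) in node_ids.iter().zip(cluster_ids) {
        map.entry(node_id).or_default().cluster_id = cluster_id;
        written += 1;
    }
    written
}
```

Updating a single field per actor, rather than replacing the whole record, lets clustering and anomaly detection publish independently without overwriting each other.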
| 101 | + |
| 102 | +### Phase 3: Semantic Forces Activation (Fix DL7) |
| 103 | +*Goal: Existing CUDA kernels compute forces from semantic structure* |
| 104 | + |
| 105 | +#### 3.1 Feed OntologyConstraintActor with axiom data |
| 106 | +**File**: `src/actors/gpu/ontology_constraint_actor.rs` |
| 107 | +**Change**: On graph reload, query OwlAxioms from Neo4j, run through `OntologyConstraintTranslator`, upload constraint buffer to GPU. |
| 108 | +**Impact**: DisjointClasses push apart, SubClassOf clusters together, SameAs merges |
| 109 | + |
| 110 | +#### 3.2 Activate SemanticForcesActor type clustering |
| 111 | +**File**: `src/actors/gpu/semantic_forces_actor.rs` |
| 112 | +**Change**: Forward `source_domain` from node metadata as `type_id` to the GPU kernel. Configure `TypeClusterConfig` with per-domain centroids. |
| 113 | +**Impact**: Nodes cluster by domain (AI/BC/MV/RB) in 3D space |
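As a rough illustration of 3.2, the sketch below derives a `type_id` from `source_domain` and places one centroid per domain. The domain codes come from this PRD, but their ordering, the case-insensitive match, and the ring placement in the z = 0 plane are assumptions; `DOMAINS`, `type_id_for` and `ring_centroids` are hypothetical names, and the real `TypeClusterConfig` fields may differ.

```rust
/// Domain codes named in this PRD. The array index doubles as the `type_id`
/// (ASSUMPTION: the real ordering may differ).
pub const DOMAINS: [&str; 4] = ["AI", "BC", "MV", "RB"];

/// Maps a node's `source_domain` to a `type_id`, ignoring ASCII case.
/// Unknown domains return `None` so they can be left unclustered.
pub fn type_id_for(source_domain: &str) -> Option<u32> {
    DOMAINS
        .iter()
        .position(|d| d.eq_ignore_ascii_case(source_domain))
        .map(|i| i as u32)
}

/// Places `count` centroids evenly on a circle of the given radius in the
/// z = 0 plane. A ring is only one possible layout, chosen here for simplicity.
pub fn ring_centroids(count: usize, radius: f32) -> Vec<[f32; 3]> {
    (0..count)
        .map(|i| {
            let angle = std::f32::consts::TAU * i as f32 / count as f32;
            [radius * angle.cos(), radius * angle.sin(), 0.0]
        })
        .collect()
}
```

With four domains and a radius of 100, this yields centroids near (100, 0, 0), (0, 100, 0), (-100, 0, 0) and (0, -100, 0), which would be passed to the kernel configuration alongside the per-node `type_id` values.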
| 114 | + |
| 115 | +### Phase 4: Client Integration (Fix DL5) |
| 116 | +*Goal: Client renders semantic structure visually* |
| 117 | + |
| 118 | +#### 4.1 Parse V3 analytics fields in TypeScript |
| 119 | +**File**: `client/src/types/binaryProtocol.ts` |
| 120 | +**Change**: Expose `cluster_id`, `anomaly_score`, `community_id` in `BinaryNodeData`. |
| 121 | + |
| 122 | +#### 4.2 Colour nodes by cluster, edges by type |
| 123 | +**File**: `client/src/features/graph/components/GraphManager.tsx` |
| 124 | +**Change**: Use `cluster_id` for node colouring, `edge_type` for edge colour/width. |
| 125 | + |
| 126 | +--- |
| 127 | + |
| 128 | +## 5. Execution Order |
| 129 | + |
| 130 | +``` |
| 131 | +Phase 1.1 → Phase 1.3 → Phase 1.2 → rebuild → verify edge counts |
| 132 | +Phase 2.1 → Phase 2.2 → rebuild → verify analytics flow |
| 133 | +Phase 3.1 → Phase 3.2 → rebuild → verify spatial clustering |
| 134 | +Phase 4.1 → Phase 4.2 → verify visual output |
| 135 | +``` |
| 136 | + |
| 137 | +Each phase is independently testable, and each rebuild verifies that the previous phase works before the next one is added. |
| 138 | + |
| 139 | +--- |
| 140 | + |
| 141 | +## 6. Success Criteria |
| 142 | + |
| 143 | +| Metric | Current | Target | |
| 144 | +|--------|---------|--------| |
| 145 | +| Client edges | 490 | 1,500+ | |
| 146 | +| Node isolation | 62% | <15% | |
| 147 | +| Edge types in graph | 1 (explicit_link) | 9+ | |
| 148 | +| GPU cluster_id populated | 0% | 100% | |
| 149 | +| Spatial domain clustering | None | Visible AI/BC/MV/RB groups |
| 150 | +| Ontology constraints active | 0 | SubClassOf + DisjointClasses |
| 151 | +| Cluster hulls meaningful | 1 blob | 4-6 distinct domain hulls | |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +## 7. Files Modified (Estimated) |
| 156 | + |
| 157 | +| Phase | Files | Lines Changed | |
| 158 | +|-------|-------|---------------| |
| 159 | +| 1.1 | neo4j_ontology_repository.rs | ~50 | |
| 160 | +| 1.2 | neo4j_ontology_repository.rs | ~30 | |
| 161 | +| 1.3 | neo4j_adapter.rs | ~40 | |
| 162 | +| 2.1 | construction.rs, execution.rs, memory.rs | ~80 | |
| 163 | +| 2.2 | clustering_actor.rs, app_state.rs | ~40 | |
| 164 | +| 3.1 | ontology_constraint_actor.rs, graph_state_actor.rs | ~60 | |
| 165 | +| 3.2 | semantic_forces_actor.rs, settings propagation | ~40 | |
| 166 | +| 4.1 | binaryProtocol.ts, graph.worker.ts | ~30 | |
| 167 | +| 4.2 | GraphManager.tsx, ClusterHulls.tsx | ~40 | |
| 168 | +| **Total** | **~15 files** | **~410 lines** | |
| 169 | + |
| 170 | +--- |
| 171 | + |
| 172 | +## 8. Risk Assessment |
| 173 | + |
| 174 | +- **Low risk**: Phases 1.x are additive (more edges, no removal) |
| 175 | +- **Medium risk**: Phase 2.1 (CSR extension) touches GPU memory layout |
| 176 | +- **Low risk**: Phase 3.x activates existing code paths |
| 177 | +- **Low risk**: Phase 4.x is client-only changes |
| 178 | + |
| 179 | +No destructive changes. Each phase adds capability without removing existing functionality. |