neo4j2owl CSV import collapses parallel relationships — drops xref accessions and contaminates synonym/definition references #407

@Robbie1977

Target repo for issue: VirtualFlyBrain/VFB2
Fix lives in: VirtualFlyBrain/neo4j2owl (owl2neo4jcsv.jar), consumed by VirtualFlyBrain/vfb-pipeline-dumps
Type: bug · data integrity
Severity: high — silent data loss in pdb (dropped xref accessions) and incorrect synonym typing on references


Summary

The neo4j2owl CSV importer loads every relationship with a property-less MERGE:

MERGE (s)-[r:<type>]->(e) SET r.<col> = ...

The key is only (start node, relationship type, end node). Any two relationships of the same type between the same pair of nodes therefore collapse onto a single edge, and the per-row SET clauses overwrite each other (last CSV row wins). This silently:

  1. Drops database_cross_reference accessions — a term with several xrefs to the same Site keeps only one.
  2. Contaminates has_reference edges — where a term's definition and one or more synonyms cite the same publication, the references merge into one edge that mixes typ, scope, value and has_synonym_type from different sources.

It is not stale data and not a re-run artefact: it happens deterministically on every build.
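
The collapse can be illustrated outside Neo4j with a toy model of the load step. This is a minimal sketch, not the importer's code: the class `MergeCollapseDemo`, its `Row` type and the placeholder end-node id `"DoOR"` are assumptions made for illustration. It keys edges on (start, type, end) only, as the MERGE does, and applies each row's properties on top of the previous row's, as the SET does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the load step; not the importer's real code.
class MergeCollapseDemo {

    // One CSV row: endpoints, relationship type, and that row's property columns.
    static final class Row {
        final String start, type, end;
        final Map<String, String> props;

        Row(String start, String type, String end, Map<String, String> props) {
            this.start = start;
            this.type = type;
            this.end = end;
            this.props = props;
        }
    }

    // Loads rows the way a property-less MERGE + SET does and returns the surviving edges.
    static Map<String, Map<String, String>> load(List<Row> rows) {
        Map<String, Map<String, String>> edges = new LinkedHashMap<>();
        for (Row row : rows) {
            // MERGE key: (start node, relationship type, end node) only.
            String key = row.start + "|" + row.type + "|" + row.end;
            Map<String, String> edge = edges.computeIfAbsent(key, k -> new LinkedHashMap<>());
            // SET r.<col> = ...: each row overwrites what the previous row wrote.
            edge.putAll(row.props);
        }
        return edges;
    }

    // Three xrefs from one term to one Site ("DoOR" is a placeholder end-node id).
    static List<Row> xrefRows() {
        List<Row> rows = new ArrayList<>();
        for (String acc : List.of("Or65a", "Or65b", "Or65c")) {
            rows.add(new Row("FBbt_00067011", "database_cross_reference", "DoOR",
                    Map.of("accession", acc)));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> edges = load(xrefRows());
        System.out.println(edges.size() + " edge(s): " + edges);
    }
}
```

Three input rows leave one edge, and its accession is the one from the last row, which is the behaviour described above.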


Affected component and version (verified)

  • Repo: github.com/VirtualFlyBrain/neo4j2owl
  • Deployed version: 1.2.3.9-PRE — pinned in vfb-pipeline-dumps/Dockerfile:
    ENV NEO4J2OWL_VERSION 1.2.3.9-PRE
    ARG OWL2NEO4J_JAR=https://github.com/VirtualFlyBrain/neo4j2owl/releases/download/$NEO4J2OWL_VERSION/owl2neo4jcsv.jar
  • Active development line: branch migrate_neo4j_hk. Tags 1.2.3.9-PRE (2024-10-08) and the newest 1.2.3.10-PRE (2024-10-14) both sit on this branch; their N2OCSVWriter is byte-identical.
  • ⚠️ master is the wrong base. master HEAD (c823d8e) is the old 1.2.4-PRE tag (2022). It refactored N2OCSVWriter and removed the csvPostfix / only_nodes|only_edges|all export API that the current dumps Makefile still depends on (pdb_sideloads calls the jar with … <var_part> only_edges, dumps.Makefile:174). A jar built from master would be incompatible with the live pipeline. 1.2.3.9-PRE is not an ancestor of master.
  • Base any fix on migrate_neo4j_hk at 1.2.3.10-PRE.

Location of the defect

src/main/java/ebi/spot/neo4j2owl/importer/N2OCSVWriter.java, method constructCypherQuery(...), RELATIONSHIPS branch (≈ line 80–81 at 1.2.3.9-PRE, ≈ line 101 at 1.2.3.10-PRE):

case RELATIONSHIPS:
    cypher += "MATCH (s:Entity { iri: cl.start}),(e:Entity { iri: cl.end})\n" +
              "MERGE (s)-[r:" + type + "]->(e) " +
              uncomposedSetClauses("cl", "r", manager.getHeadersForRelationships(type));
    break;

Why it is purely a load-query problem (so fixing the query is sufficient)

The in-memory model does not deduplicate — it emits one CSV row per asserted relationship:

  • N2OOWLRelationship (the map key in relationship_properties, N2OImportManager.java:22,57–66) has no equals()/hashCode() override, so it keys on object identity. Every updateRelation(...) inserts a distinct entry.
  • prepareRelationCSVsForExport(...) (N2OCSVWriter.java:184) writes one row per N2OOWLRelationship.

So the CSV genuinely contains all parallel edges; they are only merged at load time by the MERGE above. No upstream change is needed.
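
The identity-keying argument can be checked with a small stand-alone example. The classes below are hypothetical stand-ins, not N2OOWLRelationship itself: `IdentityRel` has no equals()/hashCode() override, while `ValueRel` compares by field values. Only the first keeps one map entry per inserted object, which is the behaviour the paragraph above relies on.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Stand-alone illustration; the key classes are stand-ins for the real model class.
class IdentityKeyDemo {

    // No equals()/hashCode() override: HashMap falls back to object identity.
    static final class IdentityRel {
        final String start, type, end;

        IdentityRel(String start, String type, String end) {
            this.start = start;
            this.type = type;
            this.end = end;
        }
    }

    // Same fields, but value-based equality: equal triples are the same key.
    static final class ValueRel {
        final String start, type, end;

        ValueRel(String start, String type, String end) {
            this.start = start;
            this.type = type;
            this.end = end;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ValueRel)) return false;
            ValueRel other = (ValueRel) o;
            return start.equals(other.start) && type.equals(other.type) && end.equals(other.end);
        }

        @Override
        public int hashCode() {
            return Objects.hash(start, type, end);
        }
    }

    // Inserts n field-identical keys and reports how many entries the map holds.
    static int countIdentityEntries(int n) {
        Map<IdentityRel, String> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new IdentityRel("term", "has_reference", "pub"), "row " + i);
        }
        return map.size();
    }

    static int countValueEntries(int n) {
        Map<ValueRel, String> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new ValueRel("term", "has_reference", "pub"), "row " + i);
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("identity-keyed entries: " + countIdentityEntries(3));
        System.out.println("value-keyed entries:    " + countValueEntries(3));
    }
}
```

With identity keys, three field-identical relationships give three entries; with value keys they give one. The model behaves like the first case, so no rows are lost before export.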


Reproduction

Case A — dropped xref accessions (reported by Clare)

FBbt_00067011 has four database_cross_reference annotations in source: DoOR:Or65a, DoOR:Or65b, DoOR:Or65c, FlyBrain_NDB:10412. In pdb:

MATCH (n:Neuron:Class {short_form:'FBbt_00067011'})-[:database_cross_reference]->(s) RETURN n, s

Returns a single edge/accession; expected four: the three DoOR accessions plus FlyBrain_NDB:10412.

Case B — contaminated synonym/definition references

FBbt_00049999 ("adult anterior pars intercerebralis"). Source synonyms (authoritative, from the ontology):

synonym                                                               scope               has_synonym_type       reference
rSMPma                                                                has_exact_synonym   …fbbt#BRAIN_NAME_ABV   Ito et al., 2014 (FBrf0224194)
cell body rind of adult medioanterior superior medial protocerebrum   has_exact_synonym   (none)                 Ito et al., 2014 (FBrf0224194)
PIa                                                                   has_exact_synonym   (none)                 de Velasco et al., 2007 (FBrf0193772)

The term's definition also cites both FBrf0224194 and FBrf0193772.

Querying references typed BRAIN_NAME_ABV returns two edges instead of one:

MATCH (n)-[r:has_reference {has_synonym_type:['http://purl.obolibrary.org/obo/fbbt#BRAIN_NAME_ABV']}]->(p:pub)
WHERE n.label = "adult anterior pars intercerebralis"
RETURN n.short_form AS sf, r.typ AS typ, r.value AS value, p.short_form AS pub, id(r) AS rid;

Observed:

sf              typ    value                                                       pub           rid
FBbt_00049999   syn    ["rSMPma"]                                                  FBrf0224194   531402   ← correct (from polishing finalStep)
FBbt_00049999   def    ["cell body rind of adult medioanterior superior medial …"] FBrf0224194   3094     ← WRONG: typ:'def' edge wearing a synonym value + another synonym's has_synonym_type

Full properties of the bad edge:

type=has_reference
{ iri:"http://purl.org/dc/terms/references", short_form:"references", label:"has_reference",
  type:"Annotation", typ:"def", scope:"has_exact_synonym",
  value:["cell body rind of adult medioanterior superior medial protocerebrum"],
  has_synonym_type:["http://purl.obolibrary.org/obo/fbbt#BRAIN_NAME_ABV"] }
from=FBbt_00049999  to=FBrf0224194

Three references to FBrf0224194 collapsed into one edge: the definition reference (typ:'def'), the rSMPma synonym reference and the cell body rind… synonym reference. typ ended up as 'def'; value came from one synonym; has_synonym_type (BRAIN_NAME_ABV) bled in from rSMPma.

Systemic scale

MATCH ()-[r:has_reference]->()
WHERE r.has_synonym_type IS NOT NULL AND coalesce(r.typ,'') <> 'syn'
RETURN count(r) AS stale_typed_edges, collect(DISTINCT r.typ)[..8] AS typ_values,
       collect(DISTINCT r.scope)[..8] AS scopes;
-- 12128 | ["def"] | ["has_exact_synonym","has_broad_synonym","has_narrow_synonym"]

12,128 has_reference edges carry synonym typing on a non-'syn' edge; these arise wherever a term's definition and one or more of its synonyms cite the same publication. The database_cross_reference loss is not bounded to particular terms: it occurs anywhere in the graph where a term has multiple xrefs to the same Site.


Pipeline path (where these edges are produced)

  1. KB → triplestore (ts.p2.virtualflybrain.org): FBbt synonym/definition axioms carry OBO axiom annotations (database_cross_reference, oboInOwl#has_synonym_type).
  2. vfb-pipeline-dumps: sparql/_dump_all.sparql CONSTRUCTs all triples → pdb.owl (dumps.Makefile:165). Published at https://virtualflybrain.org/data/VFB/OWL/; CSV imports at https://virtualflybrain.org/data/VFB/OWL/dumps/csv_imports/.
  3. pdb_csvs (dumps.Makefile:177–178): runs owl2neo4jcsv.jar (= neo4j2owl 1.2.3.9-PRE) over pdb.owl with VFB_CONFIG = http://virtualflybrain.org/config/neo4j2owl-config.yaml. neo4j2owl reifies each IRI-valued annotation assertion into a has_reference / database_cross_reference relationship (N2OOntologyLoader.java:243–273), one row each, then loads them with the collapsing MERGE.

Why current workarounds don't cover it

vfb-pipeline-polishing/finalStep.py runs after import:

  • finalStep.py:276 creates clean per-synonym edges with MERGE (primary)-[r:has_reference {typ:'syn', value:[syn.value]}]->(pub). Because it keys on typ:'syn', it neither matches nor cleans the collapsed import edge; it only adds a parallel correct edge, so the term ends up with both (hence the two BRAIN_NAME_ABV edges in Case B).
  • finalStep.py:314 splits multi-accession database_cross_reference edges (SIZE(r.accession) > 1). This is a band-aid for the same collapse, but it cannot recover accessions the load SET already overwrote — which is why Case A still shows only one.

These should be reviewed/removed once the importer is fixed.


Proposed fix

Add a deterministic per-row edge signature column at export and include it in the MERGE key, so identical rows dedup but distinct parallel edges are preserved.

N2OCSVWriter.java, constructCypherQuery, RELATIONSHIPS branch:

"MERGE (s)-[r:" + type + " { edge_sig: cl.edge_sig }]->(e) " +
uncomposedSetClauses("cl", "r", manager.getHeadersForRelationships(type));

N2OCSVWriter.java, prepareRelationCSVsForExport / writeCSVRowFromColumns + header: write an edge_sig column per row, computed as a stable hash (e.g. SHA-1) of start | end | type | <sorted property values>.

Properties of this approach:

  • Structural edges (SUBCLASSOF, INSTANCEOF, part_of, …) with identical/empty props → identical signature → dedup to one edge, exactly as today.
  • xref / reference edges with differing props → distinct signatures → preserved as separate edges (4 xrefs stay 4; def vs syn references stay distinct).
  • Idempotent on re-import; type-agnostic (no per-type special-casing).
  • Scalar key avoids the null-cell / list-equality pitfalls of "MERGE on all columns".
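
One way the signature could be computed is sketched below. This is an illustration under stated assumptions, not the importer's implementation: the class `EdgeSigDemo` and method `edgeSig` are hypothetical, properties are hashed as `column=value` pairs in sorted column order (one reading of "sorted property values"), and empty cells are treated as absent so that they do not change the hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper showing one way to derive a stable per-row edge signature.
class EdgeSigDemo {

    // SHA-1 hex digest of start | end | type | sorted column=value pairs.
    static String edgeSig(String start, String end, String type, Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        sb.append(start).append('|').append(end).append('|').append(type);
        // TreeMap iterates in key order, so CSV column order cannot change the hash.
        for (Map.Entry<String, String> e : new TreeMap<>(props).entrySet()) {
            String value = e.getValue();
            if (value == null || value.isEmpty()) {
                continue; // assumption: an empty cell means the property is absent
            }
            sb.append('|').append(e.getKey()).append('=').append(value);
        }
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-1 not available", ex);
        }
    }

    public static void main(String[] args) {
        // "DoOR" is a placeholder end-node id, as in the Case A example.
        String a = edgeSig("FBbt_00067011", "DoOR", "database_cross_reference",
                Map.of("accession", "Or65a"));
        String b = edgeSig("FBbt_00067011", "DoOR", "database_cross_reference",
                Map.of("accession", "Or65b"));
        System.out.println(a);
        System.out.println(b);
    }
}
```

Rows that differ in any property value get different signatures and so survive as separate edges, while identical rows hash to the same value and still merge. A real implementation would also need to escape the `|` and `=` delimiters if property values can contain them.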

Alternatives considered (and rejected)

  • MERGE → CREATE: would fix the collapse (one CSV row = one edge, and the model never dedups so no combinatorial blow-up) but: (a) non-idempotent — duplicates every edge if ever imported into a non-empty graph; (b) over-produces structural edges where the same logical axiom exists in annotated and plain form, or arrives from multiple merged sources. Unsafe as a global change.
  • MERGE on all property columns: correct in principle but fragile — empty cells, type coercion (split/toX) inside the MERGE map, and order-sensitive list equality make it error-prone.

Fix / release checklist

  • Branch off migrate_neo4j_hk @ 1.2.3.10-PRE (not master).
  • Implement edge_sig column + MERGE key change in N2OCSVWriter.
  • Add an N2OProcedureTest case asserting parallel edges survive: a term with ≥2 database_cross_reference to one Site keeps all; a term whose definition + a synonym cite one pub yields distinct typ:'def' and typ:'syn' edges with no cross-contamination.
  • Confirm structural-edge counts are unchanged on a sample ontology (no regression / no duplication).
  • Tag e.g. 1.2.3.11-PRE on migrate_neo4j_hk; .github/workflows/release.yml builds and publishes the owl2neo4jcsv.jar asset.
  • Bump NEO4J2OWL_VERSION to the new tag in vfb-pipeline-dumps/Dockerfile; rebuild the dumps image.
  • Review/remove the now-redundant finalStep.py:314 accession-split and reconcile the finalStep.py:276 synonym-reference step with the corrected importer.

Post-fix verification (run against rebuilt pdb)

-- A: all xref accessions present
MATCH (n:Class {short_form:'FBbt_00067011'})-[r:database_cross_reference]->(s) RETURN s.short_form, r.accession;
-- expect DoOR (Or65a/Or65b/Or65c) + FlyBrain_NDB:10412

-- B: single correctly-typed reference
MATCH (n:Class {short_form:'FBbt_00049999'})-[r:has_reference]->(p:pub)
RETURN r.typ, r.scope, r.value, r.has_synonym_type, p.short_form ORDER BY r.typ;
-- expect: one typ:'syn' rSMPma edge with BRAIN_NAME_ABV; def edges with NO has_synonym_type/synonym value

-- C: systemic contamination cleared
MATCH ()-[r:has_reference]->() WHERE r.has_synonym_type IS NOT NULL AND coalesce(r.typ,'') <> 'syn'
RETURN count(r);   -- expect 0

References

  • neo4j2owl: src/main/java/ebi/spot/neo4j2owl/importer/N2OCSVWriter.java (constructCypherQuery, prepareRelationCSVsForExport); N2OImportManager.java:22,57–66; N2OOWLRelationship.java (no equals/hashCode); N2OOntologyLoader.java:243–273 (reification).
  • vfb-pipeline-dumps: Dockerfile (NEO4J2OWL_VERSION), dumps.Makefile:165,174,177–178, sparql/_dump_all.sparql.
  • vfb-pipeline-polishing: finalStep.py:276 (synonym reference edges), finalStep.py:314 (accession-split band-aid).
  • Worked terms: FBbt_00067011, FBbt_00049999. Artefacts: https://virtualflybrain.org/data/VFB/OWL/ and .../dumps/csv_imports/.
