Target repo for issue: VirtualFlyBrain/VFB2
Fix lives in: VirtualFlyBrain/neo4j2owl (owl2neo4jcsv.jar), consumed by VirtualFlyBrain/vfb-pipeline-dumps
Type: bug · data integrity
Severity: high — silent data loss in pdb (dropped xref accessions) and incorrect synonym typing on references
Summary
The neo4j2owl CSV importer loads every relationship with a property-less MERGE:
MERGE (s)-[r:<type>]->(e) SET r.<col> = ...
The key is only (start node, relationship type, end node). Any two relationships of the same type between the same pair of nodes therefore collapse onto a single edge, and the per-row SET clauses overwrite each other (last CSV row wins). This silently:
- Drops database_cross_reference accessions: a term with several xrefs to the same Site keeps only one.
- Contaminates has_reference edges: where a term's definition and one or more synonyms cite the same publication, the references merge into one edge that mixes typ, scope, value and has_synonym_type from different sources.
It is not stale data and not a re-run artefact: it happens deterministically on every build.
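The collapse can be reproduced outside Neo4j. The sketch below is a minimal in-memory model, not the importer's code: a map keyed on (start, type, end) stands in for the property-less MERGE, and `put` stands in for the per-row SET. The class name, the Site identifiers `DoOR` and `FlyBrain_NDB` used as end nodes, and the row layout are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the loader; not neo4j2owl code.
final class MergeCollapseDemo {

    // Each row is {start, type, end, accession}, mirroring one CSV line.
    // Emulates MERGE (s)-[r:type]->(e) SET r.accession = ...: the edge key
    // carries no properties, so rows sharing (start, type, end) hit the same
    // edge and each later row overwrites the value set by the earlier ones.
    static Map<String, String> load(List<String[]> rows) {
        Map<String, String> accessionByEdge = new LinkedHashMap<>();
        for (String[] row : rows) {
            String edgeKey = row[0] + "|" + row[1] + "|" + row[2];
            accessionByEdge.put(edgeKey, row[3]); // last row wins
        }
        return accessionByEdge;
    }

    public static void main(String[] args) {
        String term = "FBbt_00067011";
        String type = "database_cross_reference";
        List<String[]> rows = Arrays.asList(
            new String[] {term, type, "DoOR", "Or65a"},
            new String[] {term, type, "DoOR", "Or65b"},
            new String[] {term, type, "DoOR", "Or65c"},
            new String[] {term, type, "FlyBrain_NDB", "10412"});
        // Four rows go in; the three DoOR rows share one edge key.
        System.out.println(rows.size() + " rows -> " + load(rows));
    }
}
```

In this model four rows produce two edges, and the DoOR edge keeps only the accession from the last row written.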
Affected component and version (verified)
- Repo: github.com/VirtualFlyBrain/neo4j2owl
- Deployed version: 1.2.3.9-PRE, pinned in vfb-pipeline-dumps/Dockerfile:
ENV NEO4J2OWL_VERSION 1.2.3.9-PRE
ARG OWL2NEO4J_JAR=https://github.com/VirtualFlyBrain/neo4j2owl/releases/download/$NEO4J2OWL_VERSION/owl2neo4jcsv.jar
- Active development line: branch migrate_neo4j_hk. Tags 1.2.3.9-PRE (2024-10-08) and the newest 1.2.3.10-PRE (2024-10-14) both sit on this branch; their N2OCSVWriter is byte-identical.
- ⚠️ master is the wrong base. master HEAD (c823d8e) is the old 1.2.4-PRE tag (2022). It refactored N2OCSVWriter and removed the csvPostfix / only_nodes|only_edges|all export API that the current dumps Makefile still depends on (pdb_sideloads calls the jar with … <var_part> only_edges, dumps.Makefile:174). A jar built from master would be incompatible with the live pipeline. 1.2.3.9-PRE is not an ancestor of master.
- Base any fix on migrate_neo4j_hk at 1.2.3.10-PRE.
Location of the defect
src/main/java/ebi/spot/neo4j2owl/importer/N2OCSVWriter.java, method constructCypherQuery(...), RELATIONSHIPS branch (≈ line 80–81 at 1.2.3.9-PRE, ≈ line 101 at 1.2.3.10-PRE):
case RELATIONSHIPS:
    cypher += "MATCH (s:Entity { iri: cl.start}),(e:Entity { iri: cl.end})\n" +
            "MERGE (s)-[r:" + type + "]->(e) " +
            uncomposedSetClauses("cl", "r", manager.getHeadersForRelationships(type));
    break;
Why it is purely a load-query problem (so fixing the query is sufficient)
The in-memory model does not deduplicate — it emits one CSV row per asserted relationship:
- N2OOWLRelationship (the map key in relationship_properties, N2OImportManager.java:22,57–66) has no equals()/hashCode() override, so it keys on object identity. Every updateRelation(...) inserts a distinct entry.
- prepareRelationCSVsForExport(...) (N2OCSVWriter.java:184) writes one row per N2OOWLRelationship.
So the CSV genuinely contains all parallel edges; they are only merged at load time by the MERGE above. No upstream change is needed.
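The identity-keying behaviour is standard Java: a class that does not override equals()/hashCode() inherits the identity-based versions from Object, so two instances with the same field values are distinct HashMap keys. The `Rel` class below is a hypothetical stand-in for N2OOWLRelationship, not its actual definition, and the field values are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

final class IdentityKeyDemo {

    // Hypothetical stand-in: fields only, no equals()/hashCode() override.
    static final class Rel {
        final String start;
        final String type;
        final String end;

        Rel(String start, String type, String end) {
            this.start = start;
            this.type = type;
            this.end = end;
        }
    }

    // Inserts two keys with identical field values and reports how many
    // entries the map ends up holding.
    static int distinctKeys() {
        Map<Rel, String> props = new HashMap<>();
        props.put(new Rel("FBbt_00067011", "database_cross_reference", "DoOR"), "Or65a");
        props.put(new Rel("FBbt_00067011", "database_cross_reference", "DoOR"), "Or65b");
        return props.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctKeys()); // prints 2: the keys never compare equal
    }
}
```

Both entries survive in the map, which is consistent with the CSV holding one row per asserted relationship.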
Reproduction
Case A — dropped xref accessions (reported by Clare)
FBbt_00067011 has four database_cross_reference annotations in source: DoOR:Or65a, DoOR:Or65b, DoOR:Or65c, FlyBrain_NDB:10412. In pdb:
MATCH (n:Neuron:Class {short_form:'FBbt_00067011'})-[:database_cross_reference]->(s) RETURN n, s
The query returns a single edge/accession; the expected result is the three DoOR accessions plus the FlyBrain_NDB one.
Case B — contaminated synonym/definition references
FBbt_00049999 ("adult anterior pars intercerebralis"). Source synonyms (authoritative, from the ontology):
| synonym | scope | has_synonym_type | reference |
| --- | --- | --- | --- |
| rSMPma | has_exact_synonym | …fbbt#BRAIN_NAME_ABV | Ito et al., 2014 (FBrf0224194) |
| cell body rind of adult medioanterior superior medial protocerebrum | has_exact_synonym | (none) | Ito et al., 2014 (FBrf0224194) |
| PIa | has_exact_synonym | (none) | de Velasco et al., 2007 (FBrf0193772) |
The term's definition also cites both FBrf0224194 and FBrf0193772.
Querying references typed BRAIN_NAME_ABV returns two edges instead of one:
MATCH (n)-[r:has_reference {has_synonym_type:['http://purl.obolibrary.org/obo/fbbt#BRAIN_NAME_ABV']}]->(p:pub)
WHERE n.label = "adult anterior pars intercerebralis"
RETURN n.short_form AS sf, r.typ AS typ, r.value AS value, p.short_form AS pub, id(r) AS rid;
Observed:
sf typ value pub rid
FBbt_00049999 syn ["rSMPma"] FBrf0224194 531402 ← correct (from polishing finalStep)
FBbt_00049999 def ["cell body rind of adult medioanterior superior medial …"] FBrf0224194 3094 ← WRONG: typ:'def' edge wearing a synonym value + another synonym's has_synonym_type
Full properties of the bad edge:
type=has_reference
{ iri:"http://purl.org/dc/terms/references", short_form:"references", label:"has_reference",
type:"Annotation", typ:"def", scope:"has_exact_synonym",
value:["cell body rind of adult medioanterior superior medial protocerebrum"],
has_synonym_type:["http://purl.obolibrary.org/obo/fbbt#BRAIN_NAME_ABV"] }
from=FBbt_00049999 to=FBrf0224194
The definition reference (typ:'def', → FBrf0224194), the rSMPma synonym reference and the cell body rind… synonym reference — all to FBrf0224194 — collapsed into one edge. typ ended on 'def'; value came from one synonym; has_synonym_type (BRAIN_NAME_ABV) bled in from rSMPma.
Systemic scale
MATCH ()-[r:has_reference]->()
WHERE r.has_synonym_type IS NOT NULL AND coalesce(r.typ,'') <> 'syn'
RETURN count(r) AS stale_typed_edges, collect(DISTINCT r.typ)[..8] AS typ_values,
collect(DISTINCT r.scope)[..8] AS scopes;
-- 12128 | ["def"] | ["has_exact_synonym","has_broad_synonym","has_narrow_synonym"]
12,128 has_reference edges carry synonym typing on a non-'syn' edge. These arise wherever a term's definition and a typed synonym cite the same publication. The database_cross_reference loss is not counted here; it affects any term in the graph that has multiple xrefs to the same Site.
Pipeline path (where these edges are produced)
- KB → triplestore (ts.p2.virtualflybrain.org): FBbt synonym/definition axioms carry OBO axiom annotations (database_cross_reference, oboInOwl#has_synonym_type).
- vfb-pipeline-dumps: sparql/_dump_all.sparql CONSTRUCTs all triples → pdb.owl (dumps.Makefile:165). Published at https://virtualflybrain.org/data/VFB/OWL/; CSV imports at https://virtualflybrain.org/data/VFB/OWL/dumps/csv_imports/.
- pdb_csvs (dumps.Makefile:177–178): runs owl2neo4jcsv.jar (= neo4j2owl 1.2.3.9-PRE) over pdb.owl with VFB_CONFIG = http://virtualflybrain.org/config/neo4j2owl-config.yaml. neo4j2owl reifies each IRI-valued annotation assertion into a has_reference / database_cross_reference relationship (N2OOntologyLoader.java:243–273), one row each, then loads them with the collapsing MERGE.
Why current workarounds don't cover it
vfb-pipeline-polishing/finalStep.py runs after import:
- finalStep.py:276 creates clean per-synonym edges with MERGE (primary)-[r:has_reference {typ:'syn', value:[syn.value]}]->(pub). Because it keys on typ:'syn', it neither matches nor cleans the collapsed import edge; it just adds a parallel correct edge, so the term ends up with both (hence the two BRAIN_NAME_ABV edges in Case B).
- finalStep.py:314 splits multi-accession database_cross_reference edges (SIZE(r.accession) > 1). This is a band-aid for the same collapse, but it cannot recover accessions the load SET already overwrote, which is why Case A still shows only one.
Both workarounds should be reviewed/removed once the importer is fixed.
Proposed fix
Add a deterministic per-row edge signature column at export and include it in the MERGE key, so identical rows dedup but distinct parallel edges are preserved.
N2OCSVWriter.java — constructCypherQuery, RELATIONSHIPS branch:
"MERGE (s)-[r:" + type + " { edge_sig: cl.edge_sig }]->(e) " +
uncomposedSetClauses("cl", "r", manager.getHeadersForRelationships(type));
N2OCSVWriter.java — prepareRelationCSVsForExport / writeCSVRowFromColumns + header: write an edge_sig column per row, computed as a stable hash (e.g. SHA-1) of start | end | type | <sorted property values>.
Properties of this approach:
- Structural edges (SUBCLASSOF, INSTANCEOF, part_of, …) with identical/empty props → identical signature → dedup to one edge, exactly as today.
- xref / reference edges with differing props → distinct signatures → preserved as separate edges (4 xrefs stay 4; def vs syn references stay distinct).
- Idempotent on re-import; type-agnostic (no per-type special-casing).
- Scalar key avoids the null-cell / list-equality pitfalls of "MERGE on all columns".
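A minimal sketch of how such a signature could be computed, assuming each row is available as a start IRI, an end IRI, a relationship type and a column-to-value map. `EdgeSig`, its method name and the separator character are hypothetical. The sketch hashes sorted column=value pairs rather than bare values, so that the same value appearing in two different columns cannot yield the same signature, and it treats empty cells as absent; both details are assumptions layered on the description above, not part of the proposal.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of neo4j2owl.
final class EdgeSig {

    // ASCII unit separator: assumed not to occur in IRIs or property values.
    private static final char SEP = '\u001f';

    static String of(String start, String end, String type, Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        sb.append(start).append(SEP).append(end).append(SEP).append(type);
        // TreeMap iterates in key order, so CSV column order does not matter.
        for (Map.Entry<String, String> e : new TreeMap<>(props).entrySet()) {
            String value = e.getValue();
            if (value == null || value.isEmpty()) {
                continue; // an empty cell contributes nothing to the signature
            }
            sb.append(SEP).append(e.getKey()).append('=').append(value);
        }
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 unavailable", ex);
        }
    }
}
```

Rows that differ only in column order or in empty cells get the same 40-character signature, while rows that differ in any populated value, such as the accession, get different ones. The hash is used here only as a deduplication key, not for security.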
Alternatives considered (and rejected)
- MERGE → CREATE: would fix the collapse (one CSV row = one edge, and the model never dedups so no combinatorial blow-up) but: (a) non-idempotent, since it duplicates every edge if ever imported into a non-empty graph; (b) over-produces structural edges where the same logical axiom exists in annotated and plain form, or arrives from multiple merged sources. Unsafe as a global change.
- MERGE on all property columns: correct in principle but fragile. Empty cells, type coercion (split/toX) inside the MERGE map, and order-sensitive list equality make it error-prone.
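The idempotency difference between CREATE and a signature-keyed MERGE can be illustrated with a toy model: a list stands in for CREATE (every row appends an edge) and a set of signatures stands in for the keyed MERGE. The class, method names and signature strings are hypothetical placeholders, and the model ignores everything about Neo4j except edge identity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of two load strategies; names are hypothetical.
final class ReimportDemo {

    // CREATE: every row appends an edge, whatever the graph already holds.
    static int edgesAfterCreate(List<String> rowSignatures, int imports) {
        List<String> graph = new ArrayList<>();
        for (int i = 0; i < imports; i++) {
            graph.addAll(rowSignatures);
        }
        return graph.size();
    }

    // MERGE keyed on the signature: a row adds an edge only if no edge with
    // that signature exists yet.
    static int edgesAfterKeyedMerge(List<String> rowSignatures, int imports) {
        Set<String> graph = new LinkedHashSet<>();
        for (int i = 0; i < imports; i++) {
            graph.addAll(rowSignatures);
        }
        return graph.size();
    }

    public static void main(String[] args) {
        // Two distinct xref rows plus one structural axiom that arrives twice.
        List<String> rows = Arrays.asList("xref:Or65a", "xref:Or65b", "subclassof", "subclassof");
        System.out.println("CREATE, 2 imports: " + edgesAfterCreate(rows, 2));
        System.out.println("keyed MERGE, 2 imports: " + edgesAfterKeyedMerge(rows, 2));
    }
}
```

With these four rows imported twice, the CREATE model ends up with eight edges and the keyed model with three: the repeated structural row and the second import both dedup, while the two distinct xref rows stay separate.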
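Fix / release checklist
- Base the fix on migrate_neo4j_hk @ 1.2.3.10-PRE (not master).
- edge_sig column + MERGE key change in N2OCSVWriter.
- N2OProcedureTest case asserting parallel edges survive: a term with ≥2 database_cross_reference to one Site keeps all; a term whose definition + a synonym cite one pub yields distinct typ:'def' and typ:'syn' edges with no cross-contamination.
- New tag 1.2.3.11-PRE on migrate_neo4j_hk; .github/workflows/release.yml builds and publishes the owl2neo4jcsv.jar asset.
- Set NEO4J2OWL_VERSION to the new tag in vfb-pipeline-dumps/Dockerfile; rebuild the dumps image.
- Review/remove the finalStep.py:314 accession-split and reconcile the finalStep.py:276 synonym-reference step with the corrected importer.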
Post-fix verification (run against rebuilt pdb)
-- A: all xref accessions present
MATCH (n:Class {short_form:'FBbt_00067011'})-[r:database_cross_reference]->(s) RETURN s.short_form, r.accession;
-- expect DoOR (Or65a/Or65b/Or65c) + FlyBrain_NDB:10412
-- B: single correctly-typed reference
MATCH (n:Class {short_form:'FBbt_00049999'})-[r:has_reference]->(p:pub)
RETURN r.typ, r.scope, r.value, r.has_synonym_type, p.short_form ORDER BY r.typ;
-- expect: one typ:'syn' rSMPma edge with BRAIN_NAME_ABV; def edges with NO has_synonym_type/synonym value
-- C: systemic contamination cleared
MATCH ()-[r:has_reference]->() WHERE r.has_synonym_type IS NOT NULL AND coalesce(r.typ,'') <> 'syn'
RETURN count(r); -- expect 0
References
- neo4j2owl: src/main/java/ebi/spot/neo4j2owl/importer/N2OCSVWriter.java (constructCypherQuery, prepareRelationCSVsForExport); N2OImportManager.java:22,57–66; N2OOWLRelationship.java (no equals/hashCode); N2OOntologyLoader.java:243–273 (reification).
- vfb-pipeline-dumps: Dockerfile (NEO4J2OWL_VERSION), dumps.Makefile:165,174,177–178, sparql/_dump_all.sparql.
- vfb-pipeline-polishing: finalStep.py:276 (synonym reference edges), finalStep.py:314 (accession-split band-aid).
- Worked terms: FBbt_00067011, FBbt_00049999. Artefacts: https://virtualflybrain.org/data/VFB/OWL/ and .../dumps/csv_imports/.