Add multilingual entity alias guard #422
|
@algora-pbc /claim #17
|
Hardening update pushed in 1c90584: multilingual alias normalization now applies Unicode NFKC normalization and collapses internal whitespace before alias lookup. This prevents trusted translated aliases from being suppressed only because an ontology export used decomposed accents or a manuscript had extra spacing. I added a regression test that failed before the fix with
Validation refreshed locally:
|
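For reviewers who want to see the 1c90584 normalization concretely, here is a minimal sketch. `normalizeAliasTerm` is a hypothetical name and not necessarily the helper in `index.js`; only the NFKC step and the whitespace collapse come from the update above.

```javascript
// Hypothetical helper: the PR's actual function name is not shown in this thread.
function normalizeAliasTerm(term) {
  return term
    .normalize('NFKC')     // compose decomposed accents and fold compatibility forms
    .replace(/\s+/g, ' ')  // collapse runs of whitespace into a single space
    .trim();
}

// "á" written as "a" + combining acute accent, plus extra internal spacing.
const decomposed = 'a\u0301cido   desoxirribonucleico';
// The same alias with a precomposed "á" and single spaces.
const composed = '\u00e1cido desoxirribonucleico';

console.log(normalizeAliasTerm(decomposed) === normalizeAliasTerm(composed)); // true
```

Both spellings reduce to the same lookup key, so the alias is no longer missed because of how the export encoded the accent.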
|
Hardening update pushed in b912229: language tags are now normalized for alias lookup while the original mention language tag is still preserved in the decision packet. This prevents trusted translated aliases from being suppressed just because a corpus/export used
Verification refreshed:
|
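A rough sketch of the b912229 tag handling, combined with the base-language fallback described in the PR summary. The helper names and the `Map`-based alias index are assumptions made for illustration; the real data shapes in `index.js` may differ.

```javascript
// Hypothetical helpers: only the behavior is taken from the PR description.
function normalizeLanguageTag(tag) {
  // Lookup form: trimmed, underscores turned into hyphens, lowercased.
  const lookupTag = String(tag).trim().replace(/_/g, '-').toLowerCase();
  const baseLanguage = lookupTag.split('-')[0];
  // The original tag is carried along untouched for the decision packet.
  return { originalTag: tag, lookupTag, baseLanguage };
}

// Try the full regional tag first, then fall back to the base language.
function lookupAliases(aliasIndex, tag) {
  const { originalTag, lookupTag, baseLanguage } = normalizeLanguageTag(tag);
  const aliases = aliasIndex.get(lookupTag) ?? aliasIndex.get(baseLanguage) ?? [];
  return { language: originalTag, aliases };
}

const exampleIndex = new Map([['es', ['ácido desoxirribonucleico']]]);
console.log(lookupAliases(exampleIndex, 'es_MX'));
```

Here `es_MX`, `es-MX`, and `ES-mx` all resolve through the same lookup key, while the returned `language` field still reports the tag exactly as the corpus supplied it.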
|
Follow-up competitive hardening pass for the multilingual entity alias guard. What changed in
Validation refreshed locally:
|
|
Follow-up competitive hardening pass for the multilingual entity alias guard. What changed in
Why this matters:
Validation refreshed locally:
|
|
Hardening update pushed in
This closes a graph-safety gap where the extractor could propose one canonical entity while multilingual alias lookup resolved the text to a different canonical entity. The guard now holds that disagreement as
Fresh validation from
This keeps #422 distinct from #379: #422 protects multilingual alias/entity recommendation correctness, while #379 covers geospatial field-sample provenance and safe location-edge publication.
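To illustrate the extractor-candidate/alias conflict hold, a minimal sketch follows. The function name, status strings, and packet fields are all hypothetical, since the actual decision name is not visible in this thread.

```javascript
// Hypothetical decision shape: the PR's real decision and reason names are not shown here.
function reconcileCandidate(extractorCandidateId, aliasEntityId) {
  if (!aliasEntityId) {
    // Alias lookup found nothing, so there is nothing to reconcile.
    return { status: 'no-alias-match', entityId: null };
  }
  if (extractorCandidateId && extractorCandidateId !== aliasEntityId) {
    // The two sources disagree, so neither ID is treated as recommendation-safe.
    return {
      status: 'hold-for-curator',
      entityId: null,
      conflict: { extractorCandidateId, aliasEntityId },
    };
  }
  return { status: 'accept', entityId: aliasEntityId };
}

console.log(reconcileCandidate('entity:cas9', 'entity:cas12'));
```

Returning `entityId: null` on a hold is the design point in this sketch: downstream graph code cannot pick either side by accident while the disagreement is unresolved.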
|
Hardening update pushed in
This closes two malformed-evidence crash paths in the multilingual alias guard:
Fresh validation from
This keeps #422 focused on multilingual alias/entity recommendation correctness and distinct from #379 geospatial provenance and #515 organism/strain boundary work.
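The alias-indexing crash path can be sketched as below. `buildAliasIndex` and `aliasEvidenceIssues` are names from this PR, but the body, the record fields (`term`, `entityId`), and the issue shape are assumptions, not the code in `index.js`.

```javascript
// Sketch only: the record fields and issue shape are assumptions.
function buildAliasIndex(aliasRecords) {
  const index = new Map();
  const aliasEvidenceIssues = [];
  const records = Array.isArray(aliasRecords) ? aliasRecords : [];

  for (const [position, record] of records.entries()) {
    const term = record?.term;
    if (typeof term !== 'string' || term.trim() === '') {
      // Calling term.normalize() on a non-string would throw a TypeError,
      // so the problem is recorded for reviewers and the record is skipped.
      aliasEvidenceIssues.push({ position, reason: 'missing-or-non-string-term' });
      continue;
    }
    const key = term.normalize('NFKC').replace(/\s+/g, ' ').trim();
    index.set(key, record.entityId);
  }
  return { index, aliasEvidenceIssues };
}

console.log(buildAliasIndex([{ term: 'ADN', entityId: 'E1' }, { term: 42, entityId: 'E2' }]));
```

The type check runs before any string method is called, which is what turns the `TypeError` into a reviewable issue entry.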
/claim #17
@algora-pbc /claim #17
Summary
Adds a distinct `multilingual-entity-alias-guard/` slice for Scientific Knowledge Graph Integration.

The guard normalizes multilingual scientific mentions before they become graph nodes, entity-page aliases, or recommendation signals. It accepts trusted translated aliases only when numeric confidence evidence is present, preserves original language tags, and normalizes language-tag casing plus underscore and hyphen regional subtags for lookup. It emits JSON-LD/schema.org-style entity packets. It holds the following for curator review: homographs, false friends, same-language alias collisions, extractor-candidate/alias conflicts, regional-tag homographs, malformed alias evidence, malformed mention text, and mixed-script Latin-language lookalikes, including lowercase Greek or Cyrillic confusables. It suppresses low-confidence or missing-confidence aliases before graph recommendations are shown, and it handles sparse or malformed ontology/corpus exports without runtime failures when localized names, mentions, or homograph policies are omitted or malformed.
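The accept/hold/suppress triage in this summary can be sketched roughly as follows. The threshold value, function names, status strings, and the order of the checks are all assumptions for illustration; the description does not state them.

```javascript
// Hypothetical threshold and names: the PR description does not state them.
const MIN_CONFIDENCE = 0.8;

// True when a term mixes Latin letters with Greek or Cyrillic letters.
function hasMixedScriptLookalike(term) {
  const hasLatin = /\p{Script=Latin}/u.test(term);
  const hasGreekOrCyrillic = /[\p{Script=Greek}\p{Script=Cyrillic}]/u.test(term);
  return hasLatin && hasGreekOrCyrillic;
}

function triageAlias(alias) {
  if (typeof alias.confidence !== 'number' || Number.isNaN(alias.confidence)) {
    return 'suppress'; // missing or non-numeric confidence evidence
  }
  if (alias.confidence < MIN_CONFIDENCE) {
    return 'suppress'; // low confidence
  }
  if (hasMixedScriptLookalike(alias.term)) {
    return 'hold-for-curator'; // e.g. a Greek alpha standing in for a Latin "a"
  }
  return 'accept';
}

console.log(triageAlias({ term: 'CRISPR-Cαs9', confidence: 0.95 })); // hold-for-curator
```

This script test is deliberately blunt. It would also hold a legitimate name such as "β-catenin", so a real guard presumably needs a narrower confusables policy than this sketch applies.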
Hardening Updates
- Malformed alias evidence is now recorded as `aliasEvidenceIssues` preserved in the reviewer packet instead of crashing alias indexing.
- Malformed mention text now produces `malformed-mention-text` curator holds with high-priority `review-multilingual-malformed-mention` actions instead of crashing normalization or reaching recommendation-safe IDs.
- Demo reports now include `malformed-alias-evidence-packet.json` and `malformed-mention-text-packet.json` so reviewers can inspect both malformed-evidence paths.
- Regional language tags such as `es-MX` now fall back to base-language alias and homograph policy while preserving the original regional tag.
- Underscore-separated regional tags such as `es_MX` normalize for lookup like hyphenated regional tags while preserving the original tag, so alias and homograph policies cannot be bypassed by corpus/export separator differences.
- Mixed-script lookalikes such as `CRISPR-Cαs9` are now covered, closing a realistic scientific-name spoof bypass in the alias guard.
- Sparse exports are handled when `localizedNames`, `mentions`, or `homographs` are omitted, instead of throwing before curator policy can run.

Non-overlap
This is scoped to multilingual scientific alias quality before graph nodes and recommendations are produced. It does not duplicate broad entity extraction/navigation, ontology deprecation or synonym migration, recommendation visibility/diversity, geospatial provenance, organism/strain boundaries, clinical trial, biological accession, software runtime, or temporal validity guards.
Validation
- `npm test` failed with `TypeError: term.normalize is not a function` in `buildAliasIndex` before implementation.
- `npm test` failed with `TypeError: term.normalize is not a function` in `mentionDecision` before implementation.
- Regression tests now cover `aliasEvidenceIssues`, `malformed-mention-text` decisions, and high-priority malformed mention curator actions.
- `cd multilingual-entity-alias-guard && npm test` passed: 20 tests.
- `cd multilingual-entity-alias-guard && npm run check` passed test, demo, and video generation.
- `node --check` passed for `index.js`, `demo.js`, and `test.js`.
- `ffprobe` verified `multilingual-entity-alias-guard/reports/demo.mp4` as H.264, 1280x720, 4s, 30fps, 50,413 bytes.
- `git diff --check` and `git diff --cached --check` passed; Git only reported Windows line-ending normalization warnings.
- Merge state reported as `CLEAN`; no check contexts were reported for this branch.

Demo Artifacts
- `multilingual-entity-alias-guard/reports/alias-guard-packet.json`
- `multilingual-entity-alias-guard/reports/sparse-alias-guard-packet.json`
- `multilingual-entity-alias-guard/reports/candidate-alias-conflict-packet.json`
- `multilingual-entity-alias-guard/reports/malformed-alias-evidence-packet.json`
- `multilingual-entity-alias-guard/reports/malformed-mention-text-packet.json`
- `multilingual-entity-alias-guard/reports/alias-guard-report.md`
- `multilingual-entity-alias-guard/reports/summary.svg`
- `multilingual-entity-alias-guard/reports/demo.mp4`

Synthetic data only. No credentials, private corpora, live ontology calls, search indexes, recommendation systems, or external APIs are used.
AI-assisted with OpenAI Codex; I reviewed and locally verified the diff before submitting.