Add multilingual entity alias guard#422

Open
KoiosSG wants to merge 12 commits into
SCIBASE-AI:mainfrom
KoiosSG:multilingual-entity-alias-17

@KoiosSG KoiosSG commented May 28, 2026

/claim #17
@algora-pbc /claim #17

Summary

Adds a distinct multilingual-entity-alias-guard/ slice for Scientific Knowledge Graph Integration.

The guard normalizes multilingual scientific mentions before they become graph nodes, entity-page aliases, or recommendation signals. It:

  • accepts trusted translated aliases only when numeric confidence evidence is present;
  • preserves original language tags while normalizing language-tag casing and underscore or hyphen regional subtags for lookup;
  • emits JSON-LD/schema.org-style entity packets;
  • holds the following for curator review: homographs, false friends, same-language alias collisions, extractor-candidate/alias conflicts, regional-tag homographs, malformed alias evidence, malformed mention text, and mixed-script Latin-language lookalikes, including lowercase Greek or Cyrillic confusables;
  • suppresses low-confidence or missing-confidence aliases before graph recommendations are shown;
  • handles sparse or malformed ontology/corpus exports without runtime failures when localized names, mentions, or homograph policies are omitted or malformed.
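
To make the JSON-LD/schema.org-style entity packet more concrete, here is a minimal sketch of what such a packet builder might look like. The function name `toEntityPacket`, the choice of `DefinedTerm`, the `id`/`name` fields, and the assumption that `localizedNames` maps a language tag to a list of strings are all illustrative and not taken from the PR's actual schema.

```javascript
// Hypothetical sketch: build a schema.org-style JSON-LD packet for one entity.
// Assumes localizedNames looks like { [languageTag]: string[] }.
function toEntityPacket(entity) {
  const localizedNames = entity.localizedNames ?? {};
  const alternateName = [];
  for (const [language, names] of Object.entries(localizedNames)) {
    for (const name of Array.isArray(names) ? names : []) {
      if (typeof name !== "string") continue; // non-string entries are skipped here
      // JSON-LD language-tagged string; the original tag is kept as written.
      alternateName.push({ "@value": name, "@language": language });
    }
  }
  return {
    "@context": "https://schema.org",
    "@type": "DefinedTerm",
    "@id": entity.id,
    name: entity.name,
    alternateName,
  };
}

const packet = toEntityPacket({
  id: "entity:gene-therapy",
  name: "gene therapy",
  localizedNames: { es: ["terapia génica"], "es-MX": ["terapia genética"] },
});
console.log(JSON.stringify(packet, null, 2));
```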

Hardening Updates

  • Malformed localized-name evidence is now omitted from alias lookup and JSON-LD alternate names, with aliasEvidenceIssues preserved in the reviewer packet instead of crashing alias indexing.
  • Malformed mention text values now emit malformed-mention-text curator holds with high-priority review-multilingual-malformed-mention actions instead of crashing normalization or reaching recommendation-safe IDs.
  • Added malformed-alias-evidence-packet.json and malformed-mention-text-packet.json so reviewers can inspect both malformed-evidence paths.
  • Extractor candidate IDs that disagree with trusted multilingual alias lookup are held for curator review instead of silently overriding either signal.
  • Same-language translated alias collisions are held for curator review instead of silently attaching a mention to the wrong canonical entity.
  • Unicode NFKC normalization and whitespace collapsing prevent trusted translated aliases from being suppressed because of composed/decomposed accents or spacing differences.
  • Language-tag casing is normalized for alias lookup while the original tag remains preserved in the decision packet.
  • Regional language tags such as es-MX now fall back to base-language alias and homograph policy while preserving the original regional tag.
  • Underscore regional tags such as es_MX normalize for lookup like hyphenated regional tags while preserving the original tag, so alias and homograph policies cannot be bypassed by corpus/export separator differences.
  • Missing or non-numeric confidence evidence now suppresses candidate alias recommendations instead of allowing an unproven canonical entity mapping.
  • Mixed-script Latin-language aliases with Cyrillic or Greek lookalike characters are held for curator review before graph edges or recommendations are produced.
  • Lowercase Greek confusables such as Greek alpha in CRISPR-Cαs9 are now covered, closing a realistic scientific-name spoof bypass in the alias guard.
  • Sparse ontology/corpus exports now emit deterministic empty review or entity-alias evidence when localizedNames, mentions, or homographs are omitted, instead of throwing before curator policy can run.
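
The same-language alias collision hold can be sketched roughly as follows. This is not the PR's `buildAliasIndex`; the helper names `buildCollisionAwareIndex` and `lookupAlias`, the `"lang|term"` key format, and the entity shape are assumptions, and confidence handling is left out to keep the example short.

```javascript
// Hypothetical sketch of an alias index that keeps every entity ID per alias,
// so same-language collisions stay visible instead of the last write winning.
// Entities are assumed to look like { id, localizedNames: { [lang]: string[] } }.
function aliasKey(lang, term) {
  return `${lang.toLowerCase()}|${term.normalize("NFKC").trim().toLowerCase()}`;
}

function buildCollisionAwareIndex(entities) {
  const index = new Map(); // "lang|term" -> Set of entity IDs
  for (const entity of entities ?? []) {
    for (const [lang, names] of Object.entries(entity.localizedNames ?? {})) {
      for (const name of Array.isArray(names) ? names : []) {
        if (typeof name !== "string") continue;
        const key = aliasKey(lang, name);
        if (!index.has(key)) index.set(key, new Set());
        index.get(key).add(entity.id);
      }
    }
  }
  return index;
}

function lookupAlias(index, lang, text) {
  const ids = [...(index.get(aliasKey(lang, text)) ?? [])];
  if (ids.length === 0) return { decision: "suppress-recommendation", candidates: [] };
  if (ids.length > 1) return { decision: "hold-for-curator-review", candidates: ids };
  return { decision: "accept-canonical-entity", candidates: ids };
}

const aliasIndex = buildCollisionAwareIndex([
  { id: "entity:experimental-control", localizedNames: { es: ["control"] } },
  { id: "entity:disease-control", localizedNames: { es: ["Control"] } },
  { id: "entity:cell-therapy", localizedNames: { es: ["terapia celular"] } },
]);
console.log(lookupAlias(aliasIndex, "es", "control").decision);
console.log(lookupAlias(aliasIndex, "es", "terapia celular").decision);
```

Storing a set of IDs per key, rather than a single ID, is what lets the lookup distinguish a unique match from a collision.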

Non-overlap

This is scoped to multilingual scientific alias quality before graph nodes and recommendations are produced. It does not duplicate broad entity extraction/navigation, ontology deprecation or synonym migration, recommendation visibility/diversity, geospatial provenance, organism/strain boundaries, clinical trial, biological accession, software runtime, or temporal validity guards.

Validation

  • Red regression first for malformed localized-name evidence: npm test failed with TypeError: term.normalize is not a function in buildAliasIndex before implementation.
  • Red regression first for malformed mention text: npm test failed with TypeError: term.normalize is not a function in mentionDecision before implementation.
  • Added sanitized localized-name evidence, aliasEvidenceIssues, malformed-mention-text decisions, and high-priority malformed mention curator actions.
  • cd multilingual-entity-alias-guard && npm test passed: 20 tests.
  • cd multilingual-entity-alias-guard && npm run check passed test, demo, and video generation.
  • node --check passed for index.js, demo.js, and test.js.
  • Parsed all generated JSON packets successfully: main, sparse, candidate-conflict, malformed-alias-evidence, and malformed-mention-text.
  • ffprobe verified multilingual-entity-alias-guard/reports/demo.mp4 as H.264, 1280x720, 4s, 30fps, 50,413 bytes.
  • git diff --check and git diff --cached --check passed; Git only reported Windows line-ending normalization warnings.
  • Focused payout/contact, credential, and token scan returned no matches.
  • GitHub PR merge state before this hardening was CLEAN; no check contexts were reported for this branch.

Demo Artifacts

  • multilingual-entity-alias-guard/reports/alias-guard-packet.json
  • multilingual-entity-alias-guard/reports/sparse-alias-guard-packet.json
  • multilingual-entity-alias-guard/reports/candidate-alias-conflict-packet.json
  • multilingual-entity-alias-guard/reports/malformed-alias-evidence-packet.json
  • multilingual-entity-alias-guard/reports/malformed-mention-text-packet.json
  • multilingual-entity-alias-guard/reports/alias-guard-report.md
  • multilingual-entity-alias-guard/reports/summary.svg
  • multilingual-entity-alias-guard/reports/demo.mp4

Synthetic data only. No credentials, private corpora, live ontology calls, search indexes, recommendation systems, or external APIs are used.

AI-assisted with OpenAI Codex; I reviewed and locally verified the diff before submitting.

KoiosSG commented May 28, 2026

@algora-pbc /claim #17

KoiosSG commented May 28, 2026

Hardening update pushed in 1c90584: multilingual alias normalization now applies Unicode NFKC normalization and collapses internal whitespace before alias lookup. This prevents trusted translated aliases from being suppressed only because an ontology export used decomposed accents or a manuscript had extra spacing.

I added a regression for a Spanish gene-therapy mention containing composed/decomposed accent and whitespace differences. Before the fix it failed because the guard returned suppress-recommendation where accept-canonical-entity was expected; it now passes.
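
A minimal sketch of this normalization step, assuming a hypothetical helper named `normalizeTerm` (the lowercase folding is an extra assumption of the sketch, not something the PR states):

```javascript
// Hypothetical helper: canonical lookup form for an alias or mention.
function normalizeTerm(term) {
  return term
    .normalize("NFKC")      // compose decomposed accents, fold compatibility forms
    .replace(/\s+/gu, " ")  // collapse runs of whitespace to a single space
    .trim()
    .toLowerCase();         // case folding is an assumption of this sketch
}

const composed = "terapia g\u00E9nica";        // e-acute as a single code point
const decomposed = "terapia  ge\u0301nica ";   // e + combining acute, extra spacing
console.log(composed === decomposed);                               // false
console.log(normalizeTerm(composed) === normalizeTerm(decomposed)); // true
```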

Validation refreshed locally:

  • npm test -> 7 multilingual entity alias guard tests passed
  • npm run check -> tests, demo, and demo video regenerated successfully
  • ffprobe on reports/demo.mp4 -> H.264, 1280x720, 4s, 30fps, 46,481 bytes
  • git diff --check
  • sensitive-term scan with rg -n "(password|secret|wallet|paypal|bank|passport|private key|api key)" multilingual-entity-alias-guard returned no matches

KoiosSG commented May 29, 2026

Hardening update pushed in b912229: language tags are now normalized for alias lookup while the original mention language tag is still preserved in the decision packet. This prevents trusted translated aliases from being suppressed just because a corpus/export used ES while ontology aliases were keyed as es.
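
One way to keep the original tag while looking up with a normalized key is sketched below. The function name `decideWithLanguage`, the `lookupLanguage` field, and the nested `Map` layout are hypothetical; only the decision labels come from this PR.

```javascript
// Hypothetical sketch: look up with a lowercased key, report the original tag.
// aliasesByLanguage is assumed to be Map<language, Map<term, entityId>>.
function decideWithLanguage(mention, aliasesByLanguage) {
  const lookupLanguage = String(mention.language ?? "").toLowerCase();
  const aliases = aliasesByLanguage.get(lookupLanguage) ?? new Map();
  const entityId = aliases.get(mention.text.toLowerCase()) ?? null;
  return {
    language: mention.language, // original tag, untouched
    lookupLanguage,             // normalized key used only for lookup
    decision: entityId ? "accept-canonical-entity" : "suppress-recommendation",
    entityId,
  };
}

const aliasesByLanguage = new Map([
  ["es", new Map([["terapia celular", "entity:cell-therapy"]])],
]);
const tagDecision = decideWithLanguage(
  { language: "ES", text: "Terapia celular" },
  aliasesByLanguage
);
console.log(tagDecision);
```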

Verification refreshed:

  • Red regression first: npm test failed on the uppercase language-tag alias case (got suppress-recommendation, expected accept-canonical-entity).
  • Green: npm test passes with 8 multilingual entity alias guard tests.
  • npm run check passes: tests, demo packet/report/SVG, and demo MP4 generation.
  • ffprobe confirms reports/demo.mp4 is H.264, 1280x720, 30fps, 4s, 46,481 bytes.
  • git diff --check and git diff --cached --check pass.
  • Credential/payout-focused scan across changed code/docs/reports returned no matches.

KoiosSG commented May 29, 2026

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 90f1648:

  • Added coverage for regional language tags such as es-MX falling back to base-language alias lookup (es) while preserving the original regional tag in decision output.
  • The same base-language fallback is used for homograph/false-friend policy, so es-MX:control is still held for curator review under the existing Spanish homograph rule instead of bypassing it.
  • Updated README, requirements map, and acceptance notes to make regional-subtag handling part of the reviewer-visible contract.
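
The base-language fallback described above might be sketched like this. The helper names `languageLookupKeys` and `isHeldHomograph`, and the `"language:term"` policy key modeled on the `es-MX:control` example, are assumptions for illustration.

```javascript
// Hypothetical sketch: candidate lookup keys, most specific first.
function languageLookupKeys(tag) {
  const normalized = String(tag ?? "").toLowerCase();
  const base = normalized.split("-")[0];
  return normalized === base ? [normalized] : [normalized, base];
}

// Homograph policy assumed to be a Set of "language:term" keys.
function isHeldHomograph(tag, term, homographs) {
  return languageLookupKeys(tag).some((lang) =>
    homographs.has(`${lang}:${term.toLowerCase()}`)
  );
}

const homographPolicy = new Set(["es:control"]);
console.log(languageLookupKeys("es-MX"));                          // ["es-mx", "es"]
console.log(isHeldHomograph("es-MX", "control", homographPolicy)); // true
console.log(isHeldHomograph("en", "control", homographPolicy));    // false
```

Because the same key list drives both alias lookup and homograph policy, a regional tag cannot pass one check while skipping the other.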

Validation refreshed locally:

  • Confirmed the regional-alias regression failed before implementation with suppress-recommendation instead of accept-canonical-entity.
  • npm test -> 10 multilingual entity alias guard tests passed.
  • npm run demo -> regenerated JSON/Markdown/SVG artifacts.
  • npm run video -> regenerated reports/demo.mp4.
  • npm run check -> test, demo, and video generation passed.
  • node --check on index/demo/test passed.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
  • git diff --check and git diff --cached --check passed; only Git line-ending normalization warnings appeared on Windows.
  • Credential/payout-focused scan returned no matches.

KoiosSG commented May 29, 2026

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 8011d06:

  • Added a regression for trusted translated aliases that omit confidence evidence.
  • Missing or non-numeric confidence now suppresses recommendations instead of accepting a canonical entity mapping.
  • Candidate entity evidence remains auditable, but the entity is not included in safe recommendation IDs or entity-page mentions until confidence evidence is present.
  • README, requirements map, and acceptance notes now explicitly cover missing-confidence alias suppression.
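
A rough sketch of the confidence gate follows. The threshold value, the function name `aliasDecision`, and the `candidateEntityId`/`safeEntityIds` field names are hypothetical; the PR does not state the actual cut-off or packet layout.

```javascript
// Hypothetical threshold; the PR does not state the actual cut-off.
const MIN_CONFIDENCE = 0.8;

function aliasDecision(alias) {
  const { confidence, entityId } = alias;
  const hasNumericConfidence =
    typeof confidence === "number" && Number.isFinite(confidence);
  if (!hasNumericConfidence || confidence < MIN_CONFIDENCE) {
    return {
      decision: "suppress-recommendation",
      candidateEntityId: entityId, // kept so the evidence stays auditable
      safeEntityIds: [],           // nothing is exposed to recommendations
    };
  }
  return {
    decision: "accept-canonical-entity",
    candidateEntityId: entityId,
    safeEntityIds: [entityId],
  };
}

console.log(aliasDecision({ entityId: "entity:cell-therapy", confidence: 0.95 }).decision);
console.log(aliasDecision({ entityId: "entity:cell-therapy" }).decision);
console.log(aliasDecision({ entityId: "entity:cell-therapy", confidence: "0.95" }).decision);
```

Checking `typeof` and `Number.isFinite` together means strings such as `"0.95"`, `NaN`, and missing values all fall on the suppress side rather than being coerced.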

Validation refreshed locally:

  • Confirmed the new regression failed before implementation with accept-canonical-entity instead of suppress-recommendation.
  • npm test -> 11 multilingual entity alias guard tests passed.
  • npm run check -> test, demo, and video generation passed.
  • npm run demo -> regenerated alias packet/report/SVG artifacts.
  • npm run video -> regenerated reports/demo.mp4.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
  • git diff --check and git diff --cached --check passed; only Git line-ending normalization warnings appeared on Windows.
  • Sensitive-term scan returned no payout or credential strings.
  • GitHub PR merge state after push: CLEAN.

KoiosSG commented May 29, 2026

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in f90337e:

  • Normalized underscore regional language tags such as es_MX to the same lookup path as hyphenated tags such as es-MX.
  • Preserved the original mention language tag in decision output while applying the normalized lookup key internally.
  • Added regression coverage showing both trusted alias acceptance and homograph/false-friend holds still apply under underscore regional tags.
  • Updated README, requirements map, and acceptance notes so this separator normalization is reviewer-visible contract, not incidental behavior.
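
Separator normalization could be as small as the sketch below. `normalizeLanguageTag` and `decisionLanguageFields` are hypothetical names, and the `lookupLanguage` field is an assumption about how the normalized key might be surfaced.

```javascript
// Hypothetical sketch: treat "_" and "-" as the same subtag separator.
function normalizeLanguageTag(tag) {
  return String(tag ?? "").trim().replace(/_/g, "-").toLowerCase();
}

// The decision keeps the tag exactly as the corpus wrote it.
function decisionLanguageFields(originalTag) {
  return {
    language: originalTag,
    lookupLanguage: normalizeLanguageTag(originalTag),
  };
}

console.log(decisionLanguageFields("es_MX")); // language "es_MX", lookupLanguage "es-mx"
console.log(normalizeLanguageTag("es_MX") === normalizeLanguageTag("es-MX")); // true
```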

Validation refreshed locally:

  • Confirmed the new regression failed before implementation with suppress-recommendation instead of accept-canonical-entity for es_MX.
  • npm test -> 12 multilingual entity alias guard tests passed.
  • npm run check -> test, demo, and video generation passed.
  • node --check on index/demo/test passed.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
  • git diff --check and git diff --cached --check passed.
  • Sensitive-term scan returned no payout or credential strings.

KoiosSG commented May 29, 2026

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in e7ffd6c:

  • Added detection for Latin-language scientific aliases that mix normal Latin text with Cyrillic or Greek lookalike characters.
  • These visually confusable aliases are now held for curator review instead of becoming quiet unknowns or trusted graph mappings.
  • The candidate entity evidence remains visible in the decision packet, but the mention is blocked from entity-page/recommendation outputs until reviewed.
  • README, requirements map, acceptance notes, and generated demo artifacts now make this mixed-script guard part of the reviewer-visible contract.
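
A coarse version of this mixed-script check can be written with Unicode script property escapes. The language list and the function name `isMixedScriptLookalike` are assumptions; the PR's detector may instead use an explicit confusables table.

```javascript
// Hypothetical list of languages written in Latin script.
const LATIN_LANGUAGES = new Set(["en", "es", "fr", "de", "pt", "it"]);
const LATIN = /\p{Script=Latin}/u;
const GREEK_OR_CYRILLIC = /[\p{Script=Greek}\p{Script=Cyrillic}]/u;

// True when a Latin-language mention mixes Latin letters with Greek or Cyrillic ones.
function isMixedScriptLookalike(text, language) {
  const base = String(language ?? "").toLowerCase().split(/[-_]/)[0];
  if (!LATIN_LANGUAGES.has(base)) return false;
  return LATIN.test(text) && GREEK_OR_CYRILLIC.test(text);
}

console.log(isMixedScriptLookalike("\u0421as9", "en")); // true: U+0421 is Cyrillic Es
console.log(isMixedScriptLookalike("Cas9", "en"));      // false: Latin only
console.log(isMixedScriptLookalike("\u0421as9", "ru")); // false: not a Latin-script language
```

This script-level test is deliberately broad: a legitimate name such as "β-catenin" would also be held, which may be acceptable when the outcome is curator review rather than rejection.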

Validation refreshed locally:

  • Confirmed the new regression failed before implementation with suppress-recommendation instead of hold-for-curator-review.
  • npm test -> 13 multilingual entity alias guard tests passed.
  • npm run check -> test, demo, and video generation passed.
  • node --check on index/demo/test passed.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
  • git diff --check and git diff --cached --check passed.
  • Sensitive-term scan returned no payout, credential, or token strings.
  • GitHub PR merge state after push: CLEAN.

KoiosSG commented May 30, 2026

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 2e4c822:

  • Added a regression for Latin-language aliases using lowercase Greek lookalike characters, e.g. CRISPR-Cαs9 with Greek alpha.
  • The script-confusable detector now covers lowercase Greek confusables, so visually spoofed Latin-language scientific aliases are held for curator review instead of quietly becoming suppressed/unknown aliases.
  • The demo corpus now includes mention-crispr-greek-alpha-spoof, and README/requirements/acceptance notes plus reviewer artifacts were refreshed so this gate is visible.

Why this matters:

  • Lowercase Greek letters are common visual spoofing characters in scientific names and identifiers. A graph alias guard that catches only uppercase Greek leaves a realistic mixed-script bypass.
  • This keeps PR #422 focused on multilingual alias quality while strengthening the reviewer-facing graph/recommendation safety contract.
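
The gap is easy to reproduce in isolation. The uppercase-only range below is a hypothetical narrow check used for contrast, not necessarily how the detector was or is written.

```javascript
// Hypothetical narrow check: uppercase Greek block only (U+0391..U+03A9).
const UPPERCASE_GREEK_ONLY = /[\u0391-\u03A9]/;
// Broader check: any code point whose Unicode script is Greek.
const ANY_GREEK = /\p{Script=Greek}/u;

// U+03B1 GREEK SMALL LETTER ALPHA stands in for Latin "a".
const spoofed = "CRISPR-C\u03B1s9";

console.log(UPPERCASE_GREEK_ONLY.test(spoofed)); // false: the spoof slips through
console.log(ANY_GREEK.test(spoofed));            // true: lowercase alpha is caught
console.log(ANY_GREEK.test("CRISPR-Cas9"));      // false: genuine Latin spelling
```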

Validation refreshed locally:

  • Confirmed the new regression failed before implementation with suppress-recommendation instead of hold-for-curator-review.
  • npm test -> 14 multilingual entity alias guard tests passed.
  • npm run check -> test, demo, and video generation passed.
  • npm run demo -> regenerated alias packet/report/SVG; the packet now reports 3 mentions held for curator review.
  • npm run video -> regenerated reports/demo.mp4.
  • node --check passed for index.js, demo.js, and test.js.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 46,481 bytes.
  • git diff --check and git diff --cached --check passed; Git only reported Windows line-ending normalization warnings.
  • Focused sensitive scan returned no payout, credential, or token strings.
  • Expanded private-term scan only matched explicit safety-boundary wording in docs/report.
  • GitHub PR merge state after push: CLEAN; no checks are reported for this branch.

KoiosSG commented May 30, 2026

Follow-up competitive hardening pass for the multilingual entity alias guard.

What changed in 2b9700d:

  • Added sparse ontology/corpus regressions for omitted localizedNames, mentions, and homographs.
  • Entity packets now emit empty localized-name/alternate-name evidence instead of crashing on partial ontology exports.
  • Missing mention lists produce deterministic empty review evidence, and missing homograph policy defaults to an empty policy.
  • Demo/docs now include reports/sparse-alias-guard-packet.json for reviewer inspection.
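
Defaulting sparse inputs before any policy runs might look like the sketch below. Only `localizedNames`, `mentions`, and `homographs` are names from this PR; the function name `normalizeInputs`, the `entities` field, and where each field lives on the input objects are assumptions.

```javascript
// Hypothetical sketch: fill in empty defaults so later steps never see undefined.
function normalizeInputs(ontology, corpus) {
  const entities = Array.isArray(ontology?.entities) ? ontology.entities : [];
  return {
    entities: entities.map((entity) => ({
      ...entity,
      localizedNames: entity?.localizedNames ?? {}, // safe for Object.entries
    })),
    homographs: ontology?.homographs ?? {},         // missing policy = empty policy
    mentions: Array.isArray(corpus?.mentions) ? corpus.mentions : [],
  };
}

const sparse = normalizeInputs({ entities: [{ id: "entity:cell-therapy" }] }, {});
console.log(Object.entries(sparse.entities[0].localizedNames).length); // 0
console.log(sparse.mentions.length);                                   // 0
console.log(Object.keys(sparse.homographs).length);                    // 0
```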

Why this matters:

  • Knowledge graph imports often arrive from partial ontology exports or incremental corpus batches. The alias guard should emit auditable graph evidence rather than crash before curator/recommendation policy can run.
  • This keeps PR #422 focused on multilingual alias quality while making it more robust than a normalization-only slice.

Validation refreshed locally:

  • Confirmed localized-name regression failed before implementation with TypeError: Cannot convert undefined or null to object at Object.entries(entity.localizedNames).
  • Confirmed missing-homograph regression failed before implementation with TypeError: Cannot read properties of undefined.
  • npm test -> 17 multilingual entity alias guard tests passed.
  • npm run demo, npm run video, and npm run check passed.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 30 fps, 4.0s, 50,395 bytes.
  • git diff --check and git diff --cached --check passed; only Windows line-ending normalization warnings appeared.
  • Sensitive-term scan returned no payout, credential, or token strings.
  • GitHub PR merge state after push: CLEAN; no checks are reported for this branch.

KoiosSG commented May 30, 2026

Hardening update pushed in 449d7d8.

This closes a graph-safety gap where the extractor could propose one canonical entity while multilingual alias lookup resolved the text to a different canonical entity. The guard now holds that disagreement as candidate-alias-conflict, emits a high-priority review-multilingual-candidate-alias-conflict curator action, and keeps the conflicted mention out of recommendation-safe entity IDs.
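
The conflict rule can be sketched as a small decision function. The function name `resolveMention`, the packet field names, and the behavior when only one of the two signals is present are assumptions; the decision, reason, and action labels are the ones named above.

```javascript
// Hypothetical sketch: compare the extractor's candidate with the alias lookup result.
function resolveMention(candidateEntityId, aliasEntityId) {
  if (candidateEntityId && aliasEntityId && candidateEntityId !== aliasEntityId) {
    return {
      decision: "hold-for-curator-review",
      reason: "candidate-alias-conflict",
      curatorAction: {
        type: "review-multilingual-candidate-alias-conflict",
        priority: "high",
      },
      evidence: { candidateEntityId, aliasEntityId }, // both signals stay visible
      safeEntityIds: [],                              // neither ID is recommendation-safe
    };
  }
  if (aliasEntityId) {
    return { decision: "accept-canonical-entity", safeEntityIds: [aliasEntityId] };
  }
  // Assumption of this sketch: no trusted alias means nothing is recommended.
  return { decision: "suppress-recommendation", safeEntityIds: [] };
}

console.log(resolveMention("entity:a", "entity:b").reason);   // "candidate-alias-conflict"
console.log(resolveMention("entity:a", "entity:a").decision); // "accept-canonical-entity"
```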

Fresh validation from multilingual-entity-alias-guard/:

  • Red/green regression: before implementation, the conflicting candidate/alias fixture returned accept-canonical-entity; after implementation it returns hold-for-curator-review.
  • npm test passed: 18 tests.
  • npm run check passed, including demo and video generation.
  • npm run demo added reports/candidate-alias-conflict-packet.json for reviewer inspection.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 4s, 30fps, 50,395 bytes.
  • Parsed all JSON reports successfully: main packet 6 accepted/3 held/1 suppressed, conflict packet 0 accepted/1 held/0 suppressed, sparse packet 0/0/0.
  • git diff --check and git diff --cached --check passed; only Windows line-ending normalization warnings appeared.
  • Restricted-term scan of the module returned no matches, and report scanning found no unexpected private fixture terms.

This keeps #422 distinct from #379: #422 protects multilingual alias/entity recommendation correctness, while #379 covers geospatial field-sample provenance and safe location-edge publication.

KoiosSG commented May 31, 2026

Hardening update pushed in 58d6051.

This closes two malformed-evidence crash paths in the multilingual alias guard:

  • malformed localized-name entries are omitted from alias lookup and JSON-LD alternate names, with aliasEvidenceIssues preserved for review instead of crashing alias indexing;
  • malformed mention text now emits a malformed-mention-text curator hold with high-priority review-multilingual-malformed-mention, keeping it out of recommendation-safe IDs.
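
Both paths come down to a type check before calling `String.prototype.normalize`. A minimal sketch, assuming hypothetical helper names `sanitizeLocalizedNames` and `mentionTextDecision` and an illustrative shape for each `aliasEvidenceIssues` entry:

```javascript
// Hypothetical sketch: keep well-formed names, record the rest as issues.
function sanitizeLocalizedNames(localizedNames) {
  const clean = {};
  const aliasEvidenceIssues = [];
  for (const [language, names] of Object.entries(localizedNames ?? {})) {
    for (const name of Array.isArray(names) ? names : [names]) {
      if (typeof name === "string" && name.trim() !== "") {
        (clean[language] ??= []).push(name);
      } else {
        // The issue record layout is an assumption of this sketch.
        aliasEvidenceIssues.push({ language, issue: "malformed-localized-name", valueType: typeof name });
      }
    }
  }
  return { clean, aliasEvidenceIssues };
}

// Returns a hold decision for malformed text, or null to continue normal lookup.
function mentionTextDecision(mention) {
  if (typeof mention?.text !== "string") {
    return {
      decision: "hold-for-curator-review",
      reason: "malformed-mention-text",
      curatorAction: { type: "review-multilingual-malformed-mention", priority: "high" },
      safeEntityIds: [],
    };
  }
  return null;
}

const sanitized = sanitizeLocalizedNames({ es: ["terapia celular", 42, null] });
console.log(sanitized.clean.es.length);            // 1
console.log(sanitized.aliasEvidenceIssues.length); // 2
console.log(mentionTextDecision({ text: 42 }).reason); // "malformed-mention-text"
```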

Fresh validation from multilingual-entity-alias-guard/:

  • Red regressions first: malformed localized-name and malformed mention-text fixtures both failed before the fix with TypeError: term.normalize is not a function.
  • npm test passed: 20 tests.
  • npm run check passed, including demo and video generation.
  • Added reviewer packets: reports/malformed-alias-evidence-packet.json and reports/malformed-mention-text-packet.json.
  • Parsed all JSON reports successfully.
  • ffprobe verified reports/demo.mp4 as H.264, 1280x720, 4s, 30fps, 50,413 bytes.
  • node --check, git diff --check, git diff --cached --check, and the focused credential/payout/token scan passed.

This keeps #422 focused on multilingual alias/entity recommendation correctness and distinct from #379 geospatial provenance and #515 organism/strain boundary work.
