fix(mining): add tech-term stopwords and sentence-start filter to entity detector #780

Open
arnoldwender wants to merge 2 commits into MemPalace:develop from arnoldwender:fix/entity-detector-tech-stopwords

@arnoldwender
Contributor

Summary

Fixes #476. entity_detector.py was flagging common technical terms (e.g. Handler, Service, Node, Client) as entity candidates, and capturing sentence-opening words such as One that are capitalized only because of their position.

Changes

mempalace/entity_detector.py

  • Added _TECH_STOPWORDS frozenset (~60 architecture/runtime/doc terms) merged into STOPWORDS via STOPWORDS = STOPWORDS | _TECH_STOPWORDS
  • Replaced re.findall loop in extract_candidates() with re.finditer that checks the 50-character window before each match; words at position 0 or immediately following ., !, ?, or \n are skipped (sentence-start filter)
  • Added _SENTENCE_END_RE compiled regex constant for the sentence-end check
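
The two detector changes above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the stopword sets are tiny stand-ins, and the candidate pattern, the exact _SENTENCE_END_RE pattern, the lowercase stopword comparison, and the _WINDOW name are all assumptions.

```python
import re
from collections import Counter

# Stand-in sets; the real lists in entity_detector.py are much longer.
STOPWORDS = frozenset({"the", "and", "today", "later"})
_TECH_STOPWORDS = frozenset({"handler", "service", "node", "client"})
STOPWORDS = STOPWORDS | _TECH_STOPWORDS

# Assumed pattern: sentence-ending punctuation or a newline, then optional
# whitespace, anchored at the end of the look-behind window.
_SENTENCE_END_RE = re.compile(r"[.!?\n]\s*$")
# Assumed candidate pattern: a single capitalized word.
_CANDIDATE_RE = re.compile(r"\b[A-Z][a-z]+\b")
_WINDOW = 50  # characters inspected before each match


def extract_candidates(text: str) -> Counter:
    """Count capitalized words that are neither stopwords nor sentence openers."""
    counts = Counter()
    for match in _CANDIDATE_RE.finditer(text):
        start = match.start()
        window = text[max(0, start - _WINDOW):start]
        # Skip words at the start of the text or right after ., !, ?, or \n.
        if not window.strip() or _SENTENCE_END_RE.search(window):
            continue
        word = match.group()
        # Assumption: stopwords are stored and compared in lowercase.
        if word.lower() in STOPWORDS:
            continue
        counts[word] += 1
    return counts


sample = (
    "One day passed. Today Alice met Bob. Later Alice called the Handler service.\n"
    "Service was down, so Alice pinged Bob!"
)
print(extract_candidates(sample))  # Alice counted 3 times, Bob 2 times
```

In this sketch One, Today, Later, and the second Service are dropped by the sentence-start check, while the mid-sentence Handler is dropped by the stopword check. A name that only ever opens a sentence would also be dropped, which is why the tests place names mid-sentence.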

tests/test_entity_detector.py

  • Updated test_extract_candidates_finds_frequent_names and test_detect_entities_with_person_file to place names mid-sentence (after an opener like "Today", "Later", etc.) so the sentence-start filter does not discard them — reflects correct expected behavior
  • All 32 tests pass

Test plan

  • Run uv run pytest tests/test_entity_detector.py -v — 32/32 passed
  • Verify Handler, Service, Node, Client no longer appear in detect_entities() output on a typical codebase README
  • Verify One, Two at sentence start are no longer captured

Closes #476

@arnoldwender force-pushed the fix/entity-detector-tech-stopwords branch from cbdfcdb to 886ce52 on April 13, 2026 10:04
@igorls added the area/kg (Knowledge graph), area/mining (File and conversation mining), and bug (Something isn't working) labels on Apr 14, 2026
Labels

  • area/kg: Knowledge graph
  • area/mining: File and conversation mining
  • bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

BUG: Entity detector flags common code terms as projects (Handler, Node, One, Service)