fix(mining): add tech-term stopwords and sentence-start filter to entity detector by arnoldwender · Pull Request #780 · MemPalace/mempalace

arnoldwender · 2026-04-13T09:57:18Z

Summary

Fixes #476 — entity_detector.py was flagging common technical terms (e.g. Handler, Service, Node, Client) as entity candidates, and capturing sentence-opening words like One that happen to be capitalized.

Changes

mempalace/entity_detector.py

Added _TECH_STOPWORDS frozenset (~60 architecture/runtime/doc terms) merged into STOPWORDS via STOPWORDS = STOPWORDS | _TECH_STOPWORDS
Replaced re.findall loop in extract_candidates() with re.finditer that checks the 50-character window before each match; words at position 0 or immediately following ., !, ?, or \n are skipped (sentence-start filter)
Added _SENTENCE_END_RE compiled regex constant for the sentence-end check
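The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the base STOPWORDS contents, the lowercase comparison, the candidate pattern `_CANDIDATE_RE`, the `_WINDOW` constant, the `min_count` parameter, and the exact `_SENTENCE_END_RE` pattern (which here also tolerates whitespace after the punctuation) are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed base set; the real STOPWORDS in entity_detector.py is larger.
STOPWORDS = frozenset({"the", "and", "one", "two", "today", "later"})

# Illustrative subset of the ~60 tech terms described in the PR.
_TECH_STOPWORDS = frozenset({"handler", "service", "node", "client"})

# Merge as described in the PR summary.
STOPWORDS = STOPWORDS | _TECH_STOPWORDS

# Assumed pattern: sentence-ending punctuation or a newline, followed only
# by whitespace, at the very end of the lookback window.
_SENTENCE_END_RE = re.compile(r"[.!?\n]\s*$")

# Assumed candidate pattern: a single capitalized word.
_CANDIDATE_RE = re.compile(r"\b[A-Z][a-z]+\b")

_WINDOW = 50  # characters inspected before each match


def extract_candidates(text, min_count=2):
    """Count capitalized words that are neither stopwords nor sentence openers."""
    counts = Counter()
    for match in _CANDIDATE_RE.finditer(text):
        word = match.group()
        if word.lower() in STOPWORDS:
            continue
        start = match.start()
        before = text[max(0, start - _WINDOW):start]
        # Sentence-start filter: skip words at position 0 or right after
        # ., !, ?, or a newline.
        if start == 0 or _SENTENCE_END_RE.search(before):
            continue
        counts[word] += 1
    return {w: c for w, c in counts.items() if c >= min_count}


sample = (
    "Today Alice met Bob. Later Alice called Bob again. "
    "One Handler failed. The Service Node restarted. "
    "The Handler logged Alice."
)
print(extract_candidates(sample))  # {'Alice': 3, 'Bob': 2}
```

Under these assumptions, a name that only ever opens a sentence is never counted, which is why the tests had to move names mid-sentence.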

tests/test_entity_detector.py

Updated test_extract_candidates_finds_frequent_names and test_detect_entities_with_person_file to place names mid-sentence (after an opener like "Today", "Later", etc.) so the sentence-start filter does not discard them. This reflects the intended behavior.
All 32 tests pass

Test plan

Run uv run pytest tests/test_entity_detector.py -v — 32/32 passed
Verify Handler, Service, Node, Client no longer appear in detect_entities() output on a typical codebase README
Verify One, Two at sentence start are no longer captured

Closes #476


arnoldwender requested review from bensig, igorls and milla-jovovich as code owners April 13, 2026 09:57

arnoldwender added 2 commits April 13, 2026 12:03

fix(mining): add TECH_STOPWORDS and sentence-start filter to entity d…

c33bd0e

…etector

style: ruff format + sync version.py to 3.2.0

886ce52

arnoldwender force-pushed the fix/entity-detector-tech-stopwords branch from cbdfcdb to 886ce52 April 13, 2026 10:04

igorls added the area/kg (Knowledge graph), area/mining (File and conversation mining), and bug (Something isn't working) labels Apr 14, 2026

arnoldwender wants to merge 2 commits into MemPalace:develop from arnoldwender:fix/entity-detector-tech-stopwords
