Commit efe6193

ner.fast
1 parent bc228f7 commit efe6193

19 files changed

Lines changed: 2050 additions & 3 deletions

notebooks/02_fast_ner.ipynb

Lines changed: 416 additions & 0 deletions
Large diffs are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ dependencies = [

 [project.optional-dependencies]
 dataframe = ["pandas>=1.5"]
+fast = ["rapidfuzz>=3.0", "PyYAML>=6.0"]

 [dependency-groups]
 dev = [

structflo/ner/__init__.py

Lines changed: 9 additions & 2 deletions
@@ -49,6 +49,12 @@
     TargetEntity,
 )
 from structflo.ner.extractor import NERExtractor
+
+try:
+    from structflo.ner.fast import FastNERExtractor
+except ImportError:  # rapidfuzz / PyYAML not installed
+    FastNERExtractor = None  # type: ignore[assignment,misc]
+
 from structflo.ner.profiles import (
     BIOACTIVITY,
     BIOLOGY,
@@ -61,11 +67,12 @@
     EntityProfile,
 )

-__version__ = "0.2.1"
+__version__ = "0.2.2"

 __all__ = [
-    # Main class
+    # Main classes
     "NERExtractor",
+    "FastNERExtractor",
     # Profile system
     "EntityProfile",
     "FULL",

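Because the guarded import above leaves `FastNERExtractor` set to `None` when the `fast` extra is missing, code that imports it from the top-level package receives `None` rather than an `ImportError`. A caller-side check might look like the sketch below. `require_extra` is a hypothetical helper and is not part of structflo.

```python
def require_extra(obj, name: str, extra: str):
    """Return obj, or raise a clear error if an optional import fell back to None."""
    if obj is None:
        raise RuntimeError(
            f"{name} is unavailable; install the optional dependencies "
            f'with: uv add "structflo-ner[{extra}]"'
        )
    return obj


try:
    from structflo.ner import FastNERExtractor
except ImportError:  # structflo-ner itself is not installed
    FastNERExtractor = None

# Typical use (raises RuntimeError if the extra is missing):
# extractor_cls = require_extra(FastNERExtractor, "FastNERExtractor", "fast")
```

Failing at the call site with an install hint tends to be easier to diagnose than a later `TypeError: 'NoneType' object is not callable`.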
structflo/ner/fast/README.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
# structflo.ner.fast — Dictionary-Based NER for TB Drug Discovery

Fast, deterministic entity extraction using curated YAML gazetteers. No LLM, API key, or network access is required, and extraction runs in milliseconds.

## Install

```bash
uv add "structflo-ner[fast]"

# with DataFrame support
uv add "structflo-ner[fast,dataframe]"
```

## Quick Start

```python
from structflo.ner.fast import FastNERExtractor

fast = FastNERExtractor()
result = fast.extract("Bedaquiline inhibits AtpE (Rv1305) in MDR-TB.")

print(result.compounds)   # [ChemicalEntity(text='Bedaquiline', ...)]
print(result.targets)     # [TargetEntity(text='AtpE', ...)]
print(result.accessions)  # [AccessionEntity(text='Rv1305', ...)]
print(result.diseases)    # [DiseaseEntity(text='MDR-TB', ...)]

df = result.to_dataframe()
result.display()          # interactive HTML in Jupyter
```

## How It Works

Three-phase matching, all without an LLM:

### Phase 1 — Exact Dictionary Match
Looks up every text span against a normalized dictionary built from the YAML gazetteers. Auto-derived variants include:
- **Case variants**: InhA, inha, INHA
- **Hyphen-optional**: DprE-1 ↔ DprE1, MDR-TB ↔ MDRTB
- **Period-optional**: M. tuberculosis ↔ M tuberculosis
- **Greek letters**: β-lactam ↔ beta-lactam

Word boundaries are enforced, so "Rho" won't match inside "Rhodamine".
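
The variant rules above can be pictured as one normalization step applied to both gazetteer terms and candidate text spans. The following is a minimal sketch of that idea, not the package's actual implementation: `normalize`, `build_index`, `find_exact`, the small `GREEK` table, and the three-word span limit are all assumptions made for illustration.

```python
import re

# Hypothetical Greek-letter table; the real mapping is not shown in this README.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def normalize(term: str) -> str:
    """Collapse case, hyphens, periods, and Greek letters into one lookup key."""
    for letter, name in GREEK.items():
        term = term.replace(letter, name)
    term = term.lower().replace("-", "").replace(".", "")
    return " ".join(term.split())  # squeeze repeated whitespace


def build_index(terms: list[str]) -> dict[str, str]:
    """Map each normalized key back to its canonical gazetteer spelling."""
    return {normalize(t): t for t in terms}


def find_exact(text: str, index: dict[str, str], max_words: int = 3):
    """Greedy longest-first lookup over whole-token spans.

    Matching whole tokens is what enforces word boundaries here:
    'Rhodamine' normalizes to 'rhodamine', which is not the key 'rho'.
    """
    tokens = re.findall(r"[^\s()\[\],;:]+", text)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i : i + n])
            canonical = index.get(normalize(span))
            if canonical is not None:
                hits.append((span.rstrip("."), canonical))
                i += n
                break
        else:
            i += 1  # no span starting at this token matched
    return hits


index = build_index(["DprE1", "beta-lactam", "M. tuberculosis", "Rho"])
hits = find_exact(
    "DPRE-1 and β-lactam resistance in M tuberculosis; Rhodamine is not Rho.",
    index,
)
print(hits)
```

Each hit pairs the surface form found in the text with the canonical gazetteer spelling, which is the same distinction the `canonical` attribute described under Output Compatibility carries.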

### Phase 1b — Regex Patterns (Accession Numbers)
Seed entries in `accession_number.yml` auto-derive regex patterns for entire ID families:

| Seed | Auto-derived Pattern | Matches |
|---|---|---|
| `Rv0005` | `Rv\d{4}[c]?` | All Rv locus tags |
| `MT0005` | `MT\w+` | Mycobrowser IDs |
| `P9WGR1` | `[OPQ][0-9][A-Z0-9]{3}[0-9]` | UniProt accessions |
| `4TZK` | `[0-9][A-Z0-9]{3}` | PDB codes |
| `WP_003407354` | `WP_\d+` | NCBI RefSeq proteins |
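
To see what these patterns do in practice, the sketch below wraps three of them in `\b` word boundaries, as `_loader.py` in this commit does, and runs them over a sentence. The `find_accessions` helper and the sample sentence are illustrative assumptions and are not part of the package.

```python
import re

# Patterns copied from the table above, wrapped in \b word boundaries.
PATTERNS = [
    (re.compile(r"\bRv\d{4}[c]?\b"), "Rv locus tag"),
    (re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"), "UniProt accession"),
    (re.compile(r"\bWP_\d+\b"), "NCBI RefSeq"),
]


def find_accessions(text: str) -> list[tuple[str, str, int]]:
    """Return (matched_text, family, start_offset) tuples sorted by position."""
    hits = [
        (m.group(), family, m.start())
        for pattern, family in PATTERNS
        for m in pattern.finditer(text)
    ]
    return sorted(hits, key=lambda hit: hit[2])


sentence = "Rv1305 and Rv3854c are locus tags; P9WGR1 and WP_003407354 are protein IDs."
for matched, family, start in find_accessions(sentence):
    print(f"{start:>3}  {matched:<14} {family}")
```

The PDB pattern is left out of this sketch because, as written, `[0-9][A-Z0-9]{3}` also matches any four-digit number such as a year.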

### Phase 2 — Fuzzy Match
Unmatched "entity-like" tokens (capitalized, contain digits, length ≥ 4) are compared against the dictionary using rapidfuzz. Catches typos and minor variants.

```python
# Configurable threshold (0–100, default 85)
strict = FastNERExtractor(fuzzy_threshold=0)    # disable fuzzy
lenient = FastNERExtractor(fuzzy_threshold=75)  # more permissive
```
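
The threshold acts as a cutoff on a 0 to 100 similarity score. The sketch below illustrates that cutoff using the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz, so it runs without extra dependencies; its scores will not be identical to rapidfuzz's. `is_entity_like` and `best_fuzzy_match` are hypothetical helpers, and reading the "entity-like" rule as "length ≥ 4 and either capitalized or containing a digit" is an assumption.

```python
from difflib import SequenceMatcher


def is_entity_like(token: str) -> bool:
    """Assumed reading of the rule: long enough, and capitalized or digit-bearing."""
    return len(token) >= 4 and (token[0].isupper() or any(c.isdigit() for c in token))


def best_fuzzy_match(token: str, terms: list[str], threshold: float = 85.0):
    """Return (canonical_term, score) for the closest term, or None below the cutoff."""
    if threshold <= 0 or not terms or not is_entity_like(token):
        return None  # mirrors the README: fuzzy_threshold=0 disables fuzzy matching
    scored = [
        (term, 100.0 * SequenceMatcher(None, token.lower(), term.lower()).ratio())
        for term in terms
    ]
    term, score = max(scored, key=lambda pair: pair[1])
    return (term, score) if score >= threshold else None


terms = ["Bedaquiline", "Isoniazid", "Rifampicin"]
print(best_fuzzy_match("Bedaquilline", terms))            # typo: doubled "l"
print(best_fuzzy_match("Rifampin", terms, threshold=95))  # too strict to match
```

Lowering the threshold trades precision for recall: more spelling variants are caught, at the cost of more unrelated tokens being pulled toward a dictionary term.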

## Gazetteers

YAML files live in `structflo/ner/fast/gazetteers/`. Each file is a plain list of names and **nothing else**:

```yaml
# target.yml
- InhA
- DprE1
- MmpL3
- AtpE
```

The filename (without `.yml`) becomes the `entity_type`. Built-in gazetteers:

| File | Entity Type | Coverage |
|---|---|---|
| `target.yml` | target → `TargetEntity` | ~80 TB drug targets |
| `gene_name.yml` | gene_name → `TargetEntity` | ~75 Mtb gene names |
| `compound_name.yml` | compound_name → `ChemicalEntity` | ~50 TB compounds & abbreviations |
| `disease.yml` | disease → `DiseaseEntity` | TB disease variants |
| `accession_number.yml` | accession_number → `AccessionEntity` | Seed entries → regex patterns |
| `screening_method.yml` | screening_method → `ScreeningMethodEntity` | ~35 screening approaches |
| `functional_category.yml` | functional_category → `FunctionalCategoryEntity` | ~25 Mtb functional categories |
| `product.yml` | product → `ProductEntity` | ~35 gene product descriptions |
89+
## Adding New Gazetteers
90+
91+
### Option 1: Add to existing files
92+
Edit a YAML file and add names:
93+
94+
```yaml
95+
# target.yml
96+
- InhA
97+
- DprE1
98+
- MyNewTarget # just add it
99+
```
100+
101+
### Option 2: Create a new YAML file
102+
Drop a new `.yml` file into any directory:
103+
104+
```yaml
105+
# my_gazetteers/assay.yml
106+
- resazurin assay
107+
- luciferase reporter assay
108+
- disk diffusion assay
109+
```
110+
111+
```python
112+
fast = FastNERExtractor(gazetteer_dir="my_gazetteers/")
113+
```
114+
115+
### Option 3: Add terms programmatically
116+
117+
```python
118+
fast = FastNERExtractor(
119+
extra_gazetteers={
120+
"target": ["NovelTarget1", "NovelTarget2"],
121+
"compound_name": ["CompoundXYZ"],
122+
}
123+
)
124+
```
125+
126+
## Output Compatibility
127+
128+
`FastNERExtractor` produces identical `NERResult` objects as the LLM-based `NERExtractor`. Everything downstream works the same:
129+
130+
```python
131+
result.all_entities() # flat list
132+
result.to_dict() # serializable dict
133+
result.to_dataframe() # pandas DataFrame
134+
result.display() # interactive HTML
135+
```
136+
137+
Each entity includes `match_method` ("exact", "regex", or "fuzzy") and `canonical` (the gazetteer term it matched) in its `attributes` dict.
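
One use of these attributes is to separate exact and regex hits from fuzzy ones that may deserve a second look. The sketch below shows the idea with a small stand-in `Entity` dataclass that has only `text` and `attributes`. Both that class and `group_by_method` are hypothetical, and the sample entities are hand-built rather than real extractor output.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Stand-in for the package's entity classes; only the fields used here."""
    text: str
    attributes: dict = field(default_factory=dict)


def group_by_method(entities):
    """Bucket entities by attributes['match_method'] ('exact', 'regex', 'fuzzy')."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity.attributes.get("match_method", "unknown")].append(entity)
    return dict(groups)


entities = [
    Entity("Bedaquiline", {"match_method": "exact", "canonical": "Bedaquiline"}),
    Entity("Rv1305", {"match_method": "regex"}),
    Entity("Bedaquilline", {"match_method": "fuzzy", "canonical": "Bedaquiline"}),
]
groups = group_by_method(entities)

# Fuzzy hits paired with the gazetteer term they were mapped to.
needs_review = [(e.text, e.attributes["canonical"]) for e in groups.get("fuzzy", [])]
print(needs_review)
```

With a real result, the hand-built list would presumably be replaced by `result.all_entities()`.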
138+
139+
## Fast vs LLM
140+
141+
| | `FastNERExtractor` | `NERExtractor` |
142+
|---|---|---|
143+
| Speed | ~1–5 ms per abstract | ~2–5 s per abstract |
144+
| Novel entities | Only known terms | Discovers new entities |
145+
| Context | String matching | Full contextual understanding |
146+
| Cost | Free | API calls or GPU |
147+
| Setup | Zero config | API key or Ollama |
148+
149+
**Recommended workflow**: Fast extractor as first pass (bulk screening), LLM extractor as second pass (deep analysis on interesting papers).

structflo/ner/fast/__init__.py

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
"""Fast dictionary-based NER for TB drug discovery — no LLM required.
2+
3+
Quick start::
4+
5+
from structflo.ner.fast import FastNERExtractor
6+
7+
extractor = FastNERExtractor()
8+
result = extractor.extract(
9+
"Bedaquiline inhibits AtpE (Rv1305) in M. tuberculosis."
10+
)
11+
print(result.compounds)
12+
print(result.targets)
13+
df = result.to_dataframe()
14+
15+
Custom gazetteers::
16+
17+
extractor = FastNERExtractor(gazetteer_dir="/path/to/my/gazetteers")
18+
"""
19+
20+
from structflo.ner.fast.extractor import FastNERExtractor
21+
22+
__all__ = ["FastNERExtractor"]

structflo/ner/fast/_loader.py

Lines changed: 116 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,116 @@
1+
"""Load YAML gazetteer files and auto-derive regex patterns for accession numbers."""
2+
3+
from __future__ import annotations
4+
5+
import logging
6+
import re
7+
from pathlib import Path
8+
9+
import yaml
10+
11+
from structflo.ner._entities import _ENTITY_CLASS_MAP
12+
13+
logger = logging.getLogger(__name__)
14+
15+
# Default gazetteer directory (shipped with the package)
16+
_DEFAULT_GAZETTEER_DIR = Path(__file__).parent / "gazetteers"
17+
18+
# Known accession-number patterns: (regex_to_detect_seed, full_pattern_with_word_boundaries)
19+
_ACCESSION_PATTERNS: list[tuple[re.Pattern[str], re.Pattern[str], str]] = [
20+
# Rv locus tags: Rv0005, Rv3854c
21+
(re.compile(r"^Rv\d{4}[c]?$"), re.compile(r"\bRv\d{4}[c]?\b"), "Rv locus tag"),
22+
# Mycobrowser MT IDs: MT0005, MTCI00.01
23+
(re.compile(r"^MT\w+$"), re.compile(r"\bMT\w+\b"), "Mycobrowser ID"),
24+
# UniProt accessions: P9WGR1, O53617
25+
(
26+
re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$"),
27+
re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"),
28+
"UniProt accession",
29+
),
30+
# PDB codes: 4TZK, 1P44
31+
(re.compile(r"^[0-9][A-Z0-9]{3}$"), re.compile(r"\b[0-9][A-Z0-9]{3}\b"), "PDB code"),
32+
# NCBI RefSeq protein: WP_003407354
33+
(re.compile(r"^WP_\d+$"), re.compile(r"\bWP_\d+\b"), "NCBI RefSeq"),
34+
]
35+
36+
def load_gazetteer(path: Path) -> tuple[str, list[str]]:
    """Load a single YAML gazetteer file.

    Returns:
        Tuple of (entity_type, list_of_terms) where entity_type is derived
        from the filename stem.
    """
    entity_type = path.stem
    with open(path) as f:
        terms = yaml.safe_load(f)

    if not isinstance(terms, list):
        msg = f"Gazetteer {path.name} must be a YAML list, got {type(terms).__name__}"
        raise ValueError(msg)

    # Coerce all entries to strings
    terms = [str(t).strip() for t in terms if t is not None and str(t).strip()]
    return entity_type, terms


def load_all_gazetteers(
    directory: Path | str | None = None,
) -> dict[str, list[str]]:
    """Load all YAML gazetteer files from a directory.

    Args:
        directory: Path to gazetteer directory. Defaults to the built-in
            gazetteers shipped with the package.

    Returns:
        Dict mapping entity_type → list of canonical terms.
    """
    dirpath = Path(directory) if directory is not None else _DEFAULT_GAZETTEER_DIR

    if not dirpath.is_dir():
        msg = f"Gazetteer directory does not exist: {dirpath}"
        raise FileNotFoundError(msg)

    gazetteers: dict[str, list[str]] = {}

    for yml_path in sorted(dirpath.glob("*.yml")):
        entity_type, terms = load_gazetteer(yml_path)

        if entity_type not in _ENTITY_CLASS_MAP:
            logger.warning(
                "Gazetteer %s maps to unknown entity_type %r — entities will be unclassified",
                yml_path.name,
                entity_type,
            )

        gazetteers[entity_type] = terms
        logger.debug("Loaded %d terms for %s from %s", len(terms), entity_type, yml_path.name)

    return gazetteers


def derive_accession_patterns(terms: list[str]) -> list[tuple[re.Pattern[str], str]]:
    """Auto-derive regex patterns from accession number seed entries.

    Examines each term against known ID formats and returns compiled regex
    patterns that will match the entire family (not just the listed seeds).

    Returns:
        List of (compiled_pattern, description) tuples.
    """
    detected: list[tuple[re.Pattern[str], str]] = []
    seen_descriptions: set[str] = set()

    for term in terms:
        for seed_re, full_re, description in _ACCESSION_PATTERNS:
            if description not in seen_descriptions and seed_re.match(term):
                detected.append((full_re, description))
                seen_descriptions.add(description)
                logger.debug(
                    "Auto-derived %s pattern from seed %r",
                    description,
                    term,
                )

    return detected
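
The seed-driven behaviour of `derive_accession_patterns` can be checked in isolation. The sketch below restates its loop with two of the entries from `_ACCESSION_PATTERNS` so that it runs without the package installed. `FAMILIES` and `derive` are shortened, hypothetical names, and the sample text is made up for illustration.

```python
import re

# Two (seed_detector, family_pattern, description) entries copied from _ACCESSION_PATTERNS.
FAMILIES = [
    (re.compile(r"^Rv\d{4}[c]?$"), re.compile(r"\bRv\d{4}[c]?\b"), "Rv locus tag"),
    (re.compile(r"^WP_\d+$"), re.compile(r"\bWP_\d+\b"), "NCBI RefSeq"),
]


def derive(terms):
    """Same loop as derive_accession_patterns, minus logging."""
    detected, seen = [], set()
    for term in terms:
        for seed_re, full_re, description in FAMILIES:
            if description not in seen and seed_re.match(term):
                detected.append((full_re, description))
                seen.add(description)
    return detected


# Two Rv seeds activate the Rv family once; no WP_ seed means no RefSeq pattern.
patterns = derive(["Rv0005", "Rv3854c", "not-an-id"])
families = [description for _, description in patterns]
print(families)  # ['Rv locus tag']

text = "Rv1305 is covered even though only Rv0005 was listed; WP_003407354 is not."
found = [m.group() for pattern, _ in patterns for m in pattern.finditer(text)]
print(found)  # ['Rv1305', 'Rv0005']
```

A family is therefore enabled by listing at least one seed of that shape in `accession_number.yml`, and extra seeds of the same shape add nothing.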
