Skip to content

feat(extraction): add TrapPruner, MissingnessRecognizer, TargetLeakageAuditor#70

Merged
nxank4 merged 3 commits intomainfrom
feat/extraction-auditors
Feb 27, 2026
Merged

feat(extraction): add TrapPruner, MissingnessRecognizer, TargetLeakageAuditor#70
nxank4 merged 3 commits intomainfrom
feat/extraction-auditors

Conversation

@nxank4
Copy link
Copy Markdown
Collaborator

@nxank4 nxank4 commented Feb 26, 2026

Summary

Adds three new LLM-driven extraction modules for automated data quality analysis:

New Modules

Module API Purpose
TrapPruner loclean.prune_traps() Detects uncorrelated Gaussian noise columns via statistical profiling + LLM verification
MissingnessRecognizer loclean.recognize_missingness() Identifies MNAR patterns and encodes as boolean {col}_mnar feature flags
TargetLeakageAuditor loclean.audit_leakage() Semantic timeline evaluation to detect features that leak the target variable

Design

  • Backend agnostic — all DataFrame operations use Narwhals
  • Sandbox execution — LLM-generated code compiled via compile_sandboxed
  • Semantic agnosticism — TrapPruner anonymises column names before LLM evaluation
  • Graceful degradation — all modules keep data intact if LLM fails

Changes

  • src/loclean/extraction/trap_pruner.py [NEW]
  • src/loclean/extraction/missingness_recognizer.py [NEW]
  • src/loclean/extraction/leakage_auditor.py [NEW]
  • src/loclean/extraction/__init__.py — lazy imports
  • src/loclean/__init__.py — Loclean class methods + module-level functions + __all__

Tests

39 unit tests (13 per module) covering profiling, prompt construction, verdict parsing, and mock-LLM integration.

uv run pytest tests/unit/extraction/test_trap_pruner.py tests/unit/extraction/test_missingness_recognizer.py tests/unit/extraction/test_leakage_auditor.py -v --no-cov

@nxank4 nxank4 force-pushed the feat/extraction-auditors branch 2 times, most recently from 69a140a to 2ccb620 Compare February 27, 2026 17:39
…eAuditor

- TrapPruner: statistical profiling + LLM verification of Gaussian noise columns
- MissingnessRecognizer: MNAR pattern detection with sandbox-compiled encoders
- TargetLeakageAuditor: semantic timeline evaluation for target leakage
…o public API

- Add all three to extraction/__init__.py lazy imports
- Add Loclean class methods + module-level convenience functions
- Update __all__ in loclean/__init__.py
…r, TargetLeakageAuditor

- 13 tests each (39 total) covering profiling, prompt construction,
  verdict parsing, verification, and mock-LLM integration
@nxank4 nxank4 force-pushed the feat/extraction-auditors branch from 2ccb620 to 67c5e49 Compare February 27, 2026 17:44
@nxank4 nxank4 merged commit 900d6a0 into main Feb 27, 2026
21 checks passed
@devactivity-app
Copy link
Copy Markdown

Pull Request Summary by devActivity

Metrics

Cycle Time: 3d 15h 30m

Achievements

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant