Part of #336 (usability epic).
Problem
CPP's feature filter ranks features by their individual mean_dif/abs_auc, so it is blind to
distributed signals — feature blocks that are individually weak but jointly decisive. Concrete case
from this project (iBCE-EL linear epitopes):
- Amino-acid composition (AAC): each amino acid's abs_auc ≈ 0.03 (≈ random), yet the 20 together give
ROC-AUC ≈ 0.75.
- When CPP was given a combined identity + physicochemical scale set, the filter selected 0%
identity features (physicochemical scales score higher individually) and performance collapsed
0.75 → 0.57 — the winning signal was filtered out.
Diagnostic that catches it: the marginal-vs-joint "lift" = full-block model AUC − best-single-feature
AUC (AAC +0.21 vs physicochemical +0.04).
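A minimal sketch of how this lift could be computed, not CPP's or AAanalysis' actual code. Everything here is an assumption made for illustration: the helper names (`auc`, `best_single_auc`, `block_cv_auc`, `block_lift`) are hypothetical, the block model is a simple centroid-difference scorer standing in for whatever classifier is really used, and the data is synthetic. In practice the same idea could be built on scikit-learn's `cross_val_score(..., scoring="roc_auc")`.

```python
import random

def auc(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_single_auc(X, y):
    """Best direction-agnostic AUC of any one column, i.e. 0.5 + max |AUC - 0.5|."""
    best = 0.5
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], y)
        best = max(best, a, 1.0 - a)
    return best

def centroid_weights(X, y):
    """Stand-in block model (assumption): positive centroid minus negative centroid."""
    pos = [row for row, lab in zip(X, y) if lab == 1]
    neg = [row for row, lab in zip(X, y) if lab == 0]
    return [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(len(X[0]))]

def block_cv_auc(X, y, k=5, seed=0):
    """Mean held-out AUC of the stand-in model over k shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    fold_aucs = []
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        w = centroid_weights([X[i] for i in train], [y[i] for i in train])
        scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for i in test]
        fold_aucs.append(auc(scores, [y[i] for i in test]))
    return sum(fold_aucs) / k

def block_lift(X, y, k=5, seed=0):
    """Marginal-vs-joint lift = full-block CV AUC minus best single-feature AUC."""
    return block_cv_auc(X, y, k, seed) - best_single_auc(X, y)

# Synthetic block (NOT the iBCE-EL data): 20 columns, each only slightly shifted for positives.
rng = random.Random(42)
y = [1] * 300 + [0] * 300
X = [[rng.gauss(0.25 * lab, 1.0) for _ in range(20)] for lab in y]

joint = block_cv_auc(X, y)
single = best_single_auc(X, y)
lift = block_lift(X, y)
print(f"joint AUC={joint:.2f}  best single AUC={single:.2f}  lift={lift:+.2f}")
```

Note that the single-feature AUC is computed on all samples while the block AUC is cross-validated, so the lift reported by this sketch is, if anything, conservative.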
Suggestion
Add block-level evaluation (ties to #309, "feat: aa.eval_features", a model + CV + metric feature-set
scorer incl. PU mask-CV), and/or warn when a scale group has high joint-vs-marginal lift but low
per-feature scores.
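One way the warning could look, as a rough sketch only. The function name `check_block`, both thresholds (`lift_min`, `abs_auc_max`), and the per-group numbers in the demo are hypothetical placeholders chosen for illustration; they are not existing CPP parameters or measured results, and sensible defaults would need to be tuned.

```python
import warnings

def check_block(name, lift, max_abs_auc, lift_min=0.10, abs_auc_max=0.05):
    """Warn and return True if a scale group looks like a distributed signal.

    lift        : full-block model AUC minus best single-feature AUC
    max_abs_auc : highest per-feature abs_auc within the group
    Both thresholds are illustrative assumptions, not tuned values.
    """
    distributed = lift >= lift_min and max_abs_auc <= abs_auc_max
    if distributed:
        warnings.warn(
            f"Scale group '{name}': joint-vs-marginal lift {lift:+.2f} but best "
            f"per-feature abs_auc {max_abs_auc:.2f}; a per-feature filter may drop it.",
            UserWarning,
            stacklevel=2,
        )
    return distributed

# Illustrative inputs only: (group name, lift, max abs_auc).
groups = [("AAC", 0.21, 0.04), ("physicochemical", 0.04, 0.15)]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = {name: check_block(name, lift, best) for name, lift, best in groups}

for w in caught:
    print(w.message)
```

Returning a boolean alongside the warning lets a caller decide whether to exempt the flagged group from the per-feature filter or route it to block-level evaluation instead.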