feat(sf): SequenceFeature.scale_composition — scale-based baseline featurizer (#307)#319
…#307) Add a first-class no-positional-split scale-average featurizer: SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None, return_df=False) returns a (n_seq, n_scales) matrix by averaging each scale over a sequence span. list_parts=None uses the whole TMD-JMD span (jmd_n + tmd + jmd_c). Missing / non-canonical residues (gaps, 'X', anything not in df_scales.index) are dropped before averaging; an all-non-canonical span yields an all-NaN row (verbose warn). Matches the gamma-secretase notebook cell 27 `scale_X` comprehension within float tolerance. Backend: get_scale_mean_ in _backend/cpp/sequence_feature.py. Adds numpydoc with Examples include, an executed example notebook, 18 unit tests (positive+negative per param, golden vs the manual comprehension, empty/all-non-canonical edge case), and an Unreleased release-notes entry. No __init__ change (method on existing class). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
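The per-sequence averaging described in this commit can be sketched roughly as follows. This is a minimal illustration, not the library's code: `scale_mean_naive` and the toy `df_scales` values are hypothetical, and the sketch takes plain sequence strings rather than a `df_seq` with `jmd_n`/`tmd`/`jmd_c` parts.

```python
import numpy as np
import pandas as pd


def scale_mean_naive(list_seq, df_scales):
    """Average every scale over the residues of each sequence (hypothetical helper).

    Residues that are not in df_scales.index (gaps, 'X', ...) are dropped
    before averaging; a sequence with no usable residue gives an all-NaN row.
    """
    known = set(df_scales.index)
    rows = []
    for seq in list_seq:
        residues = [aa for aa in seq if aa in known]
        if residues:
            # .loc with a list of labels repeats rows for repeated residues
            rows.append(df_scales.loc[residues].mean(axis=0).to_numpy())
        else:
            rows.append(np.full(df_scales.shape[1], np.nan))
    return np.vstack(rows)


# Toy scale table (illustrative values only): 3 letters x 2 scales
df_scales = pd.DataFrame(
    {"scale_1": [1.0, 0.0, 0.5], "scale_2": [0.2, 0.8, 0.4]},
    index=list("ALG"),
)
X = scale_mean_naive(["AAL", "A-XG", "XX-"], df_scales)
print(X.shape)  # (n_seq, n_scales)
```

The third toy sequence contains no residue from the scale table, so its row is all NaN, which mirrors the edge case the commit message describes.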
…atmul Replace the per-sequence Python comprehension (DataFrame.loc + mean per row) with a fully vectorized backend: flatten all residues into one byte array, map to scale rows through a 256-entry lookup, tally a small (n_seq, n_letters) residue-count matrix with a single np.bincount, and obtain per-sequence sums via one BLAS matmul against the scale matrix. ~15x faster on DOM_GSEC-scale inputs across scale widths; output matches the notebook comprehension within float tolerance (max abs diff ~5e-16, golden test green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
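A rough sketch of the vectorized scheme this commit describes (byte array, 256-entry lookup, one `np.bincount`, one matmul) might look like the following. The function name, its signature, and the toy values are assumptions made for illustration; the actual backend may handle encoding and edge cases differently.

```python
import numpy as np


def scale_mean_vectorized(list_seq, letters, scales):
    """Hypothetical sketch: mean scale profile per sequence without a Python loop over residues.

    letters: string of residue letters, one per row of `scales`
    scales:  (n_letters, n_scales) float array
    """
    n_seq, n_letters = len(list_seq), len(letters)
    # 256-entry lookup: byte value -> scale row; n_letters is the "unknown" bucket
    lookup = np.full(256, n_letters, dtype=np.intp)
    lookup[np.frombuffer(letters.encode("latin-1"), dtype=np.uint8)] = np.arange(n_letters)
    # Flatten all residues into one byte array (unencodable characters become '?')
    lengths = np.array([len(s) for s in list_seq], dtype=np.intp)
    flat = np.frombuffer("".join(list_seq).encode("latin-1", errors="replace"), dtype=np.uint8)
    rows = lookup[flat]
    seq_ids = np.repeat(np.arange(n_seq, dtype=np.intp), lengths)
    # One bincount over a combined (sequence, letter) index gives the count matrix
    counts = np.bincount(seq_ids * (n_letters + 1) + rows, minlength=n_seq * (n_letters + 1))
    counts = counts.reshape(n_seq, n_letters + 1)[:, :n_letters].astype(np.float64)
    # Per-sequence sums via a single matmul, then divide by the number of known residues
    sums = counts @ scales
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / totals  # 0 / 0 -> NaN for sequences with no known residue


# Toy inputs (illustrative values only)
letters = "ALG"
scales = np.array([[1.0, 0.2], [0.0, 0.8], [0.5, 0.4]])
X_vec = scale_mean_vectorized(["AAL", "A-XG", "XX-"], letters, scales)
print(X_vec.shape)  # (n_seq, n_scales)
```

The extra bucket at index `n_letters` collects unknown residues so they can be sliced off before the matmul, which keeps the dropping of non-canonical residues inside the same vectorized pass.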
…+ non-latin1 residue) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… trusts frontend) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eline # Conflicts: # docs/source/index/release_notes.rst
breimanntools
left a comment
Why do we need this? This is already done in CPP.run(split_n_max=1). However, this is indeed a simplification of the process without filtering redundant features. So, it is worth integrating. Okay from my side.
breimanntools
left a comment
This is more a baseline prediction method, isn't it? Should we simply make a baseline predictor and integrate this there? Then we can also add AAC, DPC, CKSAAP and other approaches to get simple baseline predictors? Perhaps in AAPred we have a baseline method. This can load the representations from iFeature +
…ine application Rename SequenceFeature.scale_mean -> scale_composition (and backend get_scale_mean_ -> get_scale_composition_) to fit the compositional-descriptor family (sibling to a future aa_composition/AAC; see the baseline issue), and to avoid reading like df_scales (the letters x scales matrix). The output is the sequence's mean profile in scale-space -- the scale-based analogue of amino-acid composition. Docstring + example notebook now lead with the clear **application**: build a baseline feature set and compare the same model on it vs a CPP feature_matrix to show what the positional Part-Split-Scale features add. Renames the example notebook + test file to match; updates the docstring include and release notes. 195 sequence_feature + api-meta tests green; notebook re-executed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…backend, docstring, notebook, tests, release notes) Follow-up to the rename commit which only moved the files: this applies the actual content changes -- SequenceFeature.scale_mean -> scale_composition, backend get_scale_mean_ -> get_scale_composition_, the docstring 'Application' note + include path, the re-executed example notebook, the renamed tests, and the release-notes entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #319 +/- ##
==========================================
+ Coverage 94.93% 94.97% +0.03%
==========================================
Files 185 187 +2
Lines 17883 17994 +111
Branches 3038 3051 +13
==========================================
+ Hits 16978 17089 +111
+ Misses 598 597 -1
- Partials 307 308 +1
... and 38 files with indirect coverage changes
Adds `SequenceFeature.scale_composition`, a no-positional-split baseline featurizer. For each sequence it averages every scale over the residues of a span, giving the `(n_seq, n_scales)` matrix `X`: the sequence's mean profile in scale-space (the scale-based analogue of amino-acid composition), with no positional information.

**Application.** Build a baseline feature set for a prediction model and compare the same classifier on this `X` versus a `CPP` `feature_matrix`, to show how much the positional Part-Split-Scale features add over a plain scale average (the "scale baseline vs CPP" comparison). It is not a positional feature set.

Notes:

- Named `scale_composition` (not `scale_mean`) to fit the compositional-descriptor family (see the follow-up issue #335, "Compositional baseline featurizers (scale, AAC, ACC, …) + baseline comparison in AAPred", which folds AAC, ACC, and the baseline comparison into `AAPred`) and to avoid reading like `df_scales` (the letters × scales matrix).
- Vectorized backend (`np.bincount` + a single BLAS matmul; no per-sequence Python loop). Missing/non-canonical residues are dropped before averaging; empty spans give all-NaN rows with a verbose warning.
- `return_df=True` for a labeled frame.
- `list_parts=None` → whole `jmd_n` + `tmd` + `jmd_c` span.

Tests + api-meta green (195); example notebook re-executed with outputs.
🤖 Generated with Claude Code