feat(sf): SequenceFeature.scale_composition — scale-based baseline featurizer (#307) by breimanntools · Pull Request #319 · breimanntools/aaanalysis

breimanntools · 2026-06-30T23:32:07Z

Adds SequenceFeature.scale_composition — a no-positional-split baseline featurizer. For each sequence it averages every scale over the residues of a span, giving the (n_seq, n_scales) matrix X: the sequence's mean profile in scale-space (the scale-based analogue of amino-acid composition), with no positional information.
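As a minimal reference sketch of the averaging described above (not the PR's backend code): the helper `scale_composition_naive` and the toy `SCALES` values are hypothetical and only illustrate the idea of a per-sequence mean over known residues, with an all-NaN row for an empty span.

```python
import math

# Hypothetical toy scales (letter -> one value per scale); values are made up.
SCALES = {
    "A": [0.5, 0.1],
    "L": [1.0, 0.3],
    "K": [0.0, 0.9],
}

def scale_composition_naive(sequences, scales):
    """Return one row per sequence: the mean of each scale over its residues."""
    n_scales = len(next(iter(scales.values())))
    rows = []
    for seq in sequences:
        # Drop residues that have no scale entry (gaps, 'X', ...).
        vals = [scales[aa] for aa in seq if aa in scales]
        if not vals:
            rows.append([math.nan] * n_scales)  # empty span -> all-NaN row
            continue
        # Average column-wise: one mean per scale.
        rows.append([sum(col) / len(vals) for col in zip(*vals)])
    return rows

# Three sequences -> a 3 x 2 matrix (n_seq, n_scales); the last row is all NaN.
X = scale_composition_naive(["ALK", "AAXL-", "XX"], SCALES)
```

Residue order never enters the computation, which is why the result carries no positional information.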

Application. Build a baseline feature set for a prediction model and compare the same classifier on this X versus a CPP feature_matrix, to show how much the positional Part-Split-Scale features add over a plain scale average (the "scale baseline vs CPP" comparison). It is not a positional feature set.

Notes:

- Named scale_composition (not scale_mean) to fit the compositional-descriptor family and to avoid reading like df_scales (the letters × scales matrix). See the follow-up issue "Compositional baseline featurizers (scale, AAC, ACC, …) + baseline comparison in AAPred" #335 (AAC, ACC, and baseline comparison folded into AAPred).
- Fully vectorized backend (byte→scale lookup + one bincount + a single BLAS matmul; no per-sequence Python loop). Missing and non-canonical residues are dropped before averaging; empty spans give all-NaN rows with a verbose warning. return_df=True returns a labeled DataFrame.
- list_parts=None → the whole jmd_n + tmd + jmd_c span.
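The list_parts default can be pictured with a small sketch. The helper `build_span` and the plain-dict row are hypothetical stand-ins, assumed here only to show how a None default could fall back to the three parts named above; the actual method works on DataFrames.

```python
# Part names follow the PR text: jmd_n, tmd, jmd_c.
DEFAULT_PARTS = ("jmd_n", "tmd", "jmd_c")

def build_span(parts, list_parts=None):
    """Concatenate the requested sequence parts into one span string."""
    names = DEFAULT_PARTS if list_parts is None else list_parts
    return "".join(parts[name] for name in names)

# Toy row with made-up part sequences.
row = {"jmd_n": "KKR", "tmd": "LLAAL", "jmd_c": "RK"}
whole = build_span(row)                         # all three parts, in order
tmd_only = build_span(row, list_parts=["tmd"])  # a single part
```

The scale average is then taken over whichever span is selected.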

Tests + api-meta green (195); example notebook re-executed with outputs.

🤖 Generated with Claude Code

…#307) Add a first-class no-positional-split scale-average featurizer: SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None, return_df=False) returns a (n_seq, n_scales) matrix by averaging each scale over a sequence span. list_parts=None uses the whole TMD-JMD span (jmd_n + tmd + jmd_c). Missing / non-canonical residues (gaps, 'X', anything not in df_scales.index) are dropped before averaging; an all-non-canonical span yields an all-NaN row (verbose warn). Matches the gamma-secretase notebook cell 27 `scale_X` comprehension within float tolerance. Backend: get_scale_mean_ in _backend/cpp/sequence_feature.py. Adds numpydoc with Examples include, an executed example notebook, 18 unit tests (positive+negative per param, golden vs the manual comprehension, empty/all-non-canonical edge case), and an Unreleased release-notes entry. No __init__ change (method on existing class). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…atmul Replace the per-sequence Python comprehension (DataFrame.loc + mean per row) with a fully vectorized backend: flatten all residues into one byte array, map to scale rows through a 256-entry lookup, tally a small (n_seq, n_letters) residue-count matrix with a single np.bincount, and obtain per-sequence sums via one BLAS matmul against the scale matrix. ~15x faster on DOM_GSEC-scale inputs across scale widths; output matches the notebook comprehension within float tolerance (max abs diff ~5e-16, golden test green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
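The steps named in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in _backend/cpp/sequence_feature.py: the function name `scale_composition_vectorized`, its arguments, and the toy scale values are hypothetical, and the letters are assumed to be single latin-1 characters.

```python
import numpy as np

def scale_composition_vectorized(sequences, letters, scale_matrix):
    """Mean scale profile per sequence via lookup table + bincount + matmul.

    letters: single-character labels, in the row order of scale_matrix.
    scale_matrix: float array of shape (n_letters, n_scales).
    """
    n_seq, n_letters = len(sequences), len(letters)
    # 256-entry lookup: byte value -> scale row index, -1 for unknown residues.
    lut = np.full(256, -1, dtype=np.int64)
    for i, aa in enumerate(letters):
        lut[ord(aa)] = i
    # Flatten all residues into one byte array. errors="replace" turns any
    # non-latin1 character into '?', which is dropped unless '?' is a label.
    raw = "".join(sequences).encode("latin-1", errors="replace")
    flat = np.frombuffer(raw, dtype=np.uint8)
    # Remember which sequence each flattened residue came from.
    seq_ids = np.repeat(np.arange(n_seq), [len(s) for s in sequences])
    idx = lut[flat]
    keep = idx >= 0
    # One bincount over combined (sequence, letter) keys -> residue counts.
    counts = np.bincount(
        seq_ids[keep] * n_letters + idx[keep], minlength=n_seq * n_letters
    ).reshape(n_seq, n_letters)
    # One matmul gives the per-sequence scale sums.
    sums = counts.astype(float) @ scale_matrix
    n_valid = counts.sum(axis=1, keepdims=True).astype(float)
    n_valid[n_valid == 0] = np.nan  # empty span -> all-NaN row
    return sums / n_valid

# Toy inputs with made-up scale values.
letters = ["A", "L", "K"]
scale_matrix = np.array([[0.5, 0.1], [1.0, 0.3], [0.0, 0.9]])
X = scale_composition_vectorized(["ALK", "AAXL-", "XX"], letters, scale_matrix)
```

Because the count matrix is only (n_seq, n_letters), the matmul stays small even for wide scale sets, which is consistent with the speedup the commit reports.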

breimanntools

Why do we need this? This is already done by CPP.run(split_n_max=1). However, this is indeed a simplification of the process, without filtering redundant features, so it is worth integrating. Okay from my side.

breimanntools

This is more of a baseline prediction method, isn't it? Should we simply make a baseline predictor and integrate this there? Then we could also add AAC, DPC, CKSAAP, and other approaches to get simple baseline predictors. Perhaps AAPred could have a baseline method. This can load the representations from iFeature +

…ine application Rename SequenceFeature.scale_mean -> scale_composition (and backend get_scale_mean_ -> get_scale_composition_) to fit the compositional-descriptor family (sibling to a future aa_composition/AAC; see the baseline issue), and to avoid reading like df_scales (the letters x scales matrix). The output is the sequence's mean profile in scale-space -- the scale-based analogue of amino-acid composition. Docstring + example notebook now lead with the clear **application**: build a baseline feature set and compare the same model on it vs a CPP feature_matrix to show what the positional Part-Split-Scale features add. Renames the example notebook + test file to match; updates the docstring include and release notes. 195 sequence_feature + api-meta tests green; notebook re-executed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…backend, docstring, notebook, tests, release notes) Follow-up to the rename commit which only moved the files: this applies the actual content changes -- SequenceFeature.scale_mean -> scale_composition, backend get_scale_mean_ -> get_scale_composition_, the docstring 'Application' note + include path, the re-executed example notebook, the renamed tests, and the release-notes entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-07-03T20:02:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.97%. Comparing base (1a152de) to head (a008c3e).
⚠️ Report is 30 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #319      +/-   ##
==========================================
+ Coverage   94.93%   94.97%   +0.03%     
==========================================
  Files         185      187       +2     
  Lines       17883    17994     +111     
  Branches     3038     3051      +13     
==========================================
+ Hits        16978    17089     +111     
+ Misses        598      597       -1     
- Partials      307      308       +1

Files with missing lines	Coverage Δ
...ature_engineering/_backend/cpp/sequence_feature.py	`100.00% <100.00%> (ø)`
...aanalysis/feature_engineering/_sequence_feature.py	`98.32% <100.00%> (+0.05%)`	⬆️

... and 38 files with indirect coverage changes

Components	Coverage Δ
cpp_core	`94.95% <ø> (ø)`

breimanntools and others added 5 commits July 1, 2026 01:31

round2(scale_mean): lock single-char-LUT invariant (multi-char label …

06160a0

…+ non-latin1 residue) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

round4(scale_mean): drop redundant df_parts.astype(str) copy (backend…

69bca17

… trusts frontend) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into feat/scale-mean-bas…

2316f0b

…eline # Conflicts: # docs/source/index/release_notes.rst

breimanntools commented Jul 2, 2026

Merge remote-tracking branch 'origin/master' into feat/scale-mean-bas…

ec37263

…eline

breimanntools marked this pull request as ready for review July 2, 2026 12:48

breimanntools commented Jul 3, 2026

breimanntools mentioned this pull request Jul 3, 2026

Compositional baseline featurizers (scale, AAC, ACC, …) + baseline comparison in AAPred #335

Open

breimanntools and others added 2 commits July 3, 2026 21:33

breimanntools changed the title ~~feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)~~ feat(sf): SequenceFeature.scale_composition — scale-based baseline featurizer (#307) Jul 3, 2026

breimanntools merged commit 69103b4 into master Jul 3, 2026
16 checks passed

breimanntools deleted the feat/scale-mean-baseline branch July 3, 2026 20:54

breimanntools merged 8 commits into master from feat/scale-mean-baseline
