feat(sf): SequenceFeature.scale_composition — scale-based baseline featurizer (#307) #319

Merged
breimanntools merged 8 commits into master from feat/scale-mean-baseline
Jul 3, 2026

Conversation

@breimanntools breimanntools commented Jun 30, 2026

Owner

Adds SequenceFeature.scale_composition — a no-positional-split baseline featurizer. For each sequence it averages every scale over the residues of a span, giving the (n_seq, n_scales) matrix X: the sequence's mean profile in scale-space (the scale-based analogue of amino-acid composition), with no positional information.
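
Written out, and assuming (as the notes in this description state) that residues without a row in df_scales are dropped before averaging, the entry for sequence i and scale j can be read as below. The symbols are illustrative notation, not taken from the code: R_i is the collection of canonical residues in the chosen span of sequence i (with repeats), and s_j(r) is the value of scale j for residue r.

```latex
X_{ij} \;=\; \frac{1}{|R_i|} \sum_{r \in R_i} s_j(r),
\qquad i = 1, \dots, n_{\text{seq}}, \quad j = 1, \dots, n_{\text{scales}}
```

When R_i is empty the mean is undefined, which corresponds to the all-NaN row mentioned in the notes.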

Application. Build a baseline feature set for a prediction model and compare the same classifier on this X versus a CPP feature_matrix, to show how much the positional Part-Split-Scale features add over a plain scale average (the "scale baseline vs CPP" comparison). It is not a positional feature set.

Notes:

  • Named scale_composition (not scale_mean) to fit the compositional-descriptor family (see the follow-up issue "Compositional baseline featurizers (scale, AAC, ACC, …) + baseline comparison in AAPred", #335, which folds AAC, ACC, and the baseline comparison into AAPred) and to avoid reading like df_scales (the letters × scales matrix).
  • Fully vectorized backend (byte→scale lookup + one bincount + a single BLAS matmul; no per-sequence Python loop). Missing and non-canonical residues are dropped before averaging; empty spans yield all-NaN rows with a verbose warning. return_df=True returns a labeled frame.
  • list_parts=None → whole jmd_n + tmd + jmd_c span.

Tests + api-meta green (195); example notebook re-executed with outputs.

🤖 Generated with Claude Code

breimanntools and others added 5 commits July 1, 2026 01:31
…#307)

Add a first-class no-positional-split scale-average featurizer:
SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None, return_df=False)
returns a (n_seq, n_scales) matrix by averaging each scale over a sequence span.
list_parts=None uses the whole TMD-JMD span (jmd_n + tmd + jmd_c). Missing /
non-canonical residues (gaps, 'X', anything not in df_scales.index) are dropped
before averaging; an all-non-canonical span yields an all-NaN row (verbose warn).
Matches the gamma-secretase notebook cell 27 `scale_X` comprehension within float
tolerance.

Backend: get_scale_mean_ in _backend/cpp/sequence_feature.py. Adds numpydoc with
Examples include, an executed example notebook, 18 unit tests (positive+negative
per param, golden vs the manual comprehension, empty/all-non-canonical edge case),
and an Unreleased release-notes entry. No __init__ change (method on existing class).
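
For orientation, the per-sequence averaging described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's `scale_X` comprehension or the library backend: the function name `scale_mean_naive`, the three-letter toy alphabet, and the scale values are all hypothetical.

```python
import numpy as np
import pandas as pd


def scale_mean_naive(list_seq, df_scales):
    """Average every scale over the canonical residues of each sequence.

    Hypothetical helper for illustration; df_scales is a letters x scales frame.
    """
    rows = []
    for seq in list_seq:
        # Keep only residues that have a row in df_scales (drops gaps, 'X', ...)
        residues = [aa for aa in seq if aa in df_scales.index]
        if residues:
            # Repeated residues appear repeatedly, so the mean is count-weighted
            rows.append(df_scales.loc[residues].mean(axis=0).to_numpy())
        else:
            # Nothing left to average: emit an all-NaN row
            rows.append(np.full(df_scales.shape[1], np.nan))
    return np.vstack(rows)


# Toy alphabet with two made-up scales (illustrative values only)
df_scales = pd.DataFrame(
    {"scale_a": [0.0, 0.5, 1.0], "scale_b": [1.0, 0.0, 0.5]},
    index=list("ACD"),
)
X = scale_mean_naive(["AACD", "AX-D", "XX"], df_scales)
print(X.shape)  # (n_seq, n_scales)
```

The second sequence averages only over `A` and `D`, and the third, which has no canonical residue, becomes an all-NaN row.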

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atmul

Replace the per-sequence Python comprehension (DataFrame.loc + mean per row)
with a fully vectorized backend: flatten all residues into one byte array,
map to scale rows through a 256-entry lookup, tally a small (n_seq, n_letters)
residue-count matrix with a single np.bincount, and obtain per-sequence sums
via one BLAS matmul against the scale matrix. ~15x faster on DOM_GSEC-scale
inputs across scale widths; output matches the notebook comprehension within
float tolerance (max abs diff ~5e-16, golden test green).
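
The steps named in this commit message (256-entry lookup, one bincount, one matmul) can be sketched roughly as below. This is an assumption-laden illustration, not the actual `get_scale_mean_` backend: the function name `scale_composition_sketch`, its arguments, the toy data, and the choice to treat non-latin-1 characters as non-canonical via `errors="replace"` are all hypothetical.

```python
import numpy as np


def scale_composition_sketch(list_seq, letters, scale_matrix):
    """Hypothetical vectorized scale average.

    letters: str of canonical one-letter residues (must not contain '?')
    scale_matrix: (n_letters, n_scales) array, rows ordered like `letters`
    """
    n_seq, n_letters = len(list_seq), len(letters)
    # 256-entry lookup: byte value -> row in scale_matrix, -1 if non-canonical
    lookup = np.full(256, -1, dtype=np.int64)
    lookup[np.frombuffer(letters.encode("latin-1"), dtype=np.uint8)] = np.arange(n_letters)
    # Flatten all residues into one byte array; characters outside latin-1
    # become '?', which is assumed not to be a canonical letter
    lengths = np.array([len(s) for s in list_seq], dtype=np.int64)
    flat = np.frombuffer(
        "".join(list_seq).encode("latin-1", errors="replace"), dtype=np.uint8
    )
    seq_id = np.repeat(np.arange(n_seq), lengths)  # owning sequence per residue
    rows = lookup[flat]
    keep = rows >= 0  # drop gaps, 'X', and anything else non-canonical
    # One bincount over the combined (sequence, letter) index
    counts = np.bincount(
        seq_id[keep] * n_letters + rows[keep], minlength=n_seq * n_letters
    ).reshape(n_seq, n_letters).astype(float)
    sums = counts @ scale_matrix  # per-sequence scale sums in a single matmul
    n_valid = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / n_valid  # 0/0 gives the all-NaN row for empty spans


scale_matrix = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.5]])  # made-up values
X = scale_composition_sketch(["AACD", "AX-D", "XX", "Aλ"], "ACD", scale_matrix)
```

Because the count matrix is only (n_seq, n_letters), the matmul stays small regardless of sequence length, which is one plausible reason the approach scales well across scale widths.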

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…+ non-latin1 residue)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… trusts frontend)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eline

# Conflicts:
#	docs/source/index/release_notes.rst

@breimanntools breimanntools left a comment

Owner Author

Why do we need this? This is already done in CPP.run(split_n_max=1). However, this is indeed a simplification of the process without filtering redundant features, so it is worth integrating. Okay from my side.

@breimanntools breimanntools marked this pull request as ready for review July 2, 2026 12:48

@breimanntools breimanntools left a comment

Owner Author

This is more of a baseline prediction method, isn't it? Should we simply make a baseline predictor and integrate this there? Then we could also add AAC, DPC, CKSAAP, and other approaches to get simple baseline predictors. Perhaps AAPred could have a baseline method. This could load the representations from iFeature +

breimanntools and others added 2 commits July 3, 2026 21:33
…ine application

Rename SequenceFeature.scale_mean -> scale_composition (and backend get_scale_mean_ ->
get_scale_composition_) to fit the compositional-descriptor family (sibling to a future
aa_composition/AAC; see the baseline issue), and to avoid reading like df_scales (the
letters x scales matrix). The output is the sequence's mean profile in scale-space -- the
scale-based analogue of amino-acid composition.

Docstring + example notebook now lead with the clear **application**: build a baseline
feature set and compare the same model on it vs a CPP feature_matrix to show what the
positional Part-Split-Scale features add. Renames the example notebook + test file to
match; updates the docstring include and release notes. 195 sequence_feature + api-meta
tests green; notebook re-executed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…backend, docstring, notebook, tests, release notes)

Follow-up to the rename commit which only moved the files: this applies the actual
content changes -- SequenceFeature.scale_mean -> scale_composition, backend
get_scale_mean_ -> get_scale_composition_, the docstring 'Application' note + include
path, the re-executed example notebook, the renamed tests, and the release-notes entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@breimanntools breimanntools changed the title from "feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)" to "feat(sf): SequenceFeature.scale_composition — scale-based baseline featurizer (#307)" on Jul 3, 2026

codecov Bot commented Jul 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.97%. Comparing base (1a152de) to head (a008c3e).
⚠️ Report is 30 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #319      +/-   ##
==========================================
+ Coverage   94.93%   94.97%   +0.03%     
==========================================
  Files         185      187       +2     
  Lines       17883    17994     +111     
  Branches     3038     3051      +13     
==========================================
+ Hits        16978    17089     +111     
+ Misses        598      597       -1     
- Partials      307      308       +1     
Files with missing lines                                  Coverage Δ
...ature_engineering/_backend/cpp/sequence_feature.py     100.00% <100.00%> (ø)
...aanalysis/feature_engineering/_sequence_feature.py     98.32% <100.00%> (+0.05%) ⬆️

... and 38 files with indirect coverage changes

Components Coverage Δ
cpp_core 94.95% <ø> (ø)

@breimanntools breimanntools merged commit 69103b4 into master Jul 3, 2026
16 checks passed
@breimanntools breimanntools deleted the feat/scale-mean-baseline branch July 3, 2026 20:54