feat(num_feat): add NumericalFeature.feature_matrix (#337) by breimanntools · Pull Request #347 · breimanntools/aaanalysis

breimanntools · 2026-07-04T21:01:40Z

Closes #337.

Summary

Adds NumericalFeature.feature_matrix(features, dict_num_parts, df_parts, df_scales=..., n_jobs=1),
the numerical analog of SequenceFeature.feature_matrix: it turns CPP.run_num-selected features back
into a model matrix X, preserving the per-residue context that per-AA-averaged sequence features
discard. run_num returns df_feat (selected feature ids + stats), never X; this is the missing
materializer that turns those ids into the model matrix — the numerical-mode counterpart of the
sequence-mode SequenceFeature.feature_matrix step used throughout the aapred_* / seqopt_* examples.

Details

Values are reconstructed exactly the way CPP.run_num does — the SPLIT in each feature id is
re-applied to the part's 0-based residue axis (arange(L_part)), the SCALE selects the column, and
the selected residues are nanmean-averaged (round 5).
Per-part length L_part comes from df_parts via the same helper run_num uses internally
(_derive_dict_part_lens, non-gap character count), not inferred from the tensor's NaN padding. So
X is byte-identical to run_num's value reconstruction in every case — including when a
genuine residue is all-NaN across D (an unresolved structure position or masked embedding), which
a NaN-based length would have mis-counted as padding and shifted the split boundaries. _cpp.py is
untouched, so run_num itself is unaffected.
The df_feat positions column is a JMD-offset display numbering, not a tensor index, so it is
deliberately not used for value lookup (documented in the method Notes).
The frontend validates df_parts (row count, part-column coverage, real length ≤ padded tensor
length) before dispatch.
Heavy lifting in _backend/num_feat/feature_matrix.py. @staticmethod, no __init__.py change.
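The reconstruction steps and the length issue described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the code in _backend/num_feat/feature_matrix.py: segment_indices, feature_value and nan_inferred_length are hypothetical helpers, the equal-width segment rule is assumed and may differ from the library's split logic, and the split is passed as (i_th, n_split) instead of being parsed from a feature id.

```python
import numpy as np

def segment_indices(length, i_th, n_split):
    """Hypothetical equal-width Segment(i_th, n_split) over arange(length)."""
    positions = np.arange(length)          # 0-based residue axis of the part
    seg_len = length / n_split
    start, stop = int(seg_len * (i_th - 1)), int(seg_len * i_th)
    return positions[start:stop]

def feature_value(part_tensor, length, i_th, n_split, scale_col):
    """nanmean of one scale column over the residues selected by the split."""
    idx = segment_indices(length, i_th, n_split)
    return round(float(np.nanmean(part_tensor[idx, scale_col])), 5)

def nan_inferred_length(part_tensor):
    """Length guessed from padding: number of rows that are not all-NaN."""
    return int((~np.isnan(part_tensor).all(axis=1)).sum())

# One part padded to 6 rows with D = 2 columns; the real length is 4.
part = np.array([
    [1.0, 10.0],
    [np.nan, np.nan],   # genuine residue without values (e.g. unresolved position)
    [3.0, 30.0],
    [4.0, 40.0],
    [np.nan, np.nan],   # padding
    [np.nan, np.nan],   # padding
])

part_seq = "AC-DE"                              # hypothetical df_parts entry
real_len = sum(ch != "-" for ch in part_seq)    # non-gap character count
guessed_len = nan_inferred_length(part)         # drops the all-NaN real residue

# Second half of the part, first scale column.
value_real = feature_value(part, real_len, 2, 2, 0)      # rows 2..3
value_guess = feature_value(part, guessed_len, 2, 2, 0)  # rows 1..2 (shifted)
print(real_len, guessed_len, value_real, value_guess)
```

With the sequence-derived length the segment covers rows 2 and 3; with the NaN-inferred length the boundary moves to rows 1 and 2, so the two values differ. That is the divergence the df_parts-based length avoids.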

Verification

Byte-identical to run_num's engine (recompute_feature_matrix) for uniform, ragged/variable-length,
different-D, and the previously-divergent all-NaN-real-residue inputs.
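One way to express a byte-identity check, as opposed to a tolerance-based comparison, is sketched below. The helper is_byte_identical and the sample arrays are illustrative assumptions; the PR's own tests may compare the matrices differently.

```python
import numpy as np

def is_byte_identical(a, b):
    """True only if shape, dtype and raw buffer contents all match."""
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    return a.shape == b.shape and a.dtype == b.dtype and a.tobytes() == b.tobytes()

x_ref = np.array([[0.5, np.nan], [1.25, 3.0]])   # stand-in for the reference matrix
x_new = x_ref.copy()                              # stand-in for the new matrix
x_close = x_ref + 1e-12                           # numerically close, not identical

same_bytes = is_byte_identical(x_ref, x_new)
close_bytes = is_byte_identical(x_ref, x_close)
close_tol = np.allclose(x_ref, x_close, equal_nan=True)
print(same_bytes, close_bytes, close_tol)
```

A tolerance check such as np.allclose accepts x_close, while the byte comparison rejects it, which is the stricter guarantee the verification above claims.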

Ripple

numpydoc docstring (named Returns / Raises / Examples include)
executed example notebook examples/nf_feature_matrix.ipynb (every public parameter, display_df tables)
unit tests (per-parameter positive+negative including the new df_parts arg, golden hand-computed
means, run_num consistency incl. all-NaN-real-residue parity, ragged parts)
release-notes Unreleased entry

Part of epic #336.

🤖 Generated with Claude Code

codecov · 2026-07-04T21:52:02Z

Codecov Report

❌ Patch coverage is 92.13483% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.84%. Comparing base (7dcc8d8) to head (38366b9).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
...analysis/feature_engineering/_numerical_feature.py	91.30%	3 Missing and 1 partial ⚠️
...re_engineering/_backend/num_feat/feature_matrix.py	93.02%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #347      +/-   ##
==========================================
+ Coverage   94.83%   94.84%   +0.01%     
==========================================
  Files         196      197       +1     
  Lines       18767    18871     +104     
  Branches     3175     3198      +23     
==========================================
+ Hits        17797    17898     +101     
- Misses        633      636       +3     
  Partials      337      337

Files with missing lines	Coverage Δ
...re_engineering/_backend/num_feat/feature_matrix.py	`93.02% <93.02%> (ø)`
...analysis/feature_engineering/_numerical_feature.py	`96.52% <91.30%> (-3.48%)`	⬇️

... and 13 files with indirect coverage changes

Components	Coverage Δ
cpp_core	`94.95% <ø> (ø)`


breimanntools

Is this really necessary? dict_num_parts goes into CPP.run_num and we then get X, not dict_num_parts to X. Why did you add this here? Was there any use case that required this?

Add NumericalFeature.feature_matrix(features, dict_num_parts, df_parts, df_scales=..., n_jobs=1), the numerical analog of SequenceFeature.feature_matrix: it turns CPP.run_num-selected features back into a model matrix X while preserving the per-residue context that per-AA-averaged sequence features discard.

Values are reconstructed exactly the way CPP.run_num does: the SPLIT in each feature id is re-applied to the part's 0-based residue axis (arange(L_part)), the SCALE selects the D column, and the selected residues are nanmean-averaged (round 5).

Crucially, the per-part real length L_part comes from df_parts via the SAME helper run_num uses internally (_derive_dict_part_lens, non-gap character count) rather than being inferred from the tensor's NaN padding, so X is byte-identical to run_num's value reconstruction in every case, including when a genuine residue is all-NaN across D (an unresolved structure position or masked embedding), which NaN-inference would have mis-counted as padding and shifted the split boundaries. Verified against recompute_feature_matrix for uniform, ragged, and all-NaN-real-residue inputs.

The df_feat 'positions' column is a JMD-offset display numbering (e.g. 21..30 for a TMD), NOT a tensor index, so it is deliberately not used for value lookup; this is documented in the method Notes. The frontend validates df_parts (row count, part coverage, real length <= padded tensor length) before dispatch. Heavy lifting lives in NumericalFeature's own _backend/num_feat/feature_matrix.py (reusing the shared cpp split/parse helpers).

Ripple: numpydoc docstring with named Returns / Raises / Examples include; executed examples notebook nf_feature_matrix.ipynb (every public parameter, display_df tables); unit tests (per-parameter positive+negative, golden hand-computed means, run_num consistency incl. the all-NaN-real-residue case, ragged parts); release-notes Unreleased entry. No __init__.py change (method on an already-exported class).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

breimanntools force-pushed the feat/337-numericalfeature-feature-matrix branch from 840b3af to a2d28f9 Compare July 4, 2026 21:24

breimanntools commented Jul 5, 2026


breimanntools force-pushed the feat/337-numericalfeature-feature-matrix branch from a2d28f9 to 38366b9 Compare July 5, 2026 15:35

breimanntools wants to merge 1 commit into master from feat/337-numericalfeature-feature-matrix
