feat(num_feat): add NumericalFeature.feature_matrix (#337) #347

Draft
breimanntools wants to merge 1 commit into master from feat/337-numericalfeature-feature-matrix

@breimanntools breimanntools commented Jul 4, 2026

Closes #337.

Summary

Adds NumericalFeature.feature_matrix(features, dict_num_parts, df_parts, df_scales=..., n_jobs=1),
the numerical analog of SequenceFeature.feature_matrix. CPP.run_num returns df_feat (selected
feature ids plus stats), never X; this method is the missing materializer that turns those ids into
the model matrix X, preserving the per-residue context that per-AA-averaged sequence features
discard. It plays the same role in numerical mode that the SequenceFeature.feature_matrix step plays
in the sequence-mode aapred_* / seqopt_* examples.

Details

  • Values are reconstructed exactly the way CPP.run_num does — the SPLIT in each feature id is
    re-applied to the part's 0-based residue axis (arange(L_part)), the SCALE selects the column, and
    the selected residues are nanmean-averaged (round 5).
  • Per-part length L_part comes from df_parts via the same helper run_num uses internally
    (_derive_dict_part_lens, non-gap character count), not inferred from the tensor's NaN padding. So
    X is byte-identical to run_num's value reconstruction in every case — including when a
    genuine residue is all-NaN across D (an unresolved structure position or masked embedding), which
    a NaN-based length would have mis-counted as padding and shifted the split boundaries. _cpp.py is
    untouched, so run_num itself is unaffected.
  • The df_feat positions column is a JMD-offset display numbering, not a tensor index, so it is
    deliberately not used for value lookup (documented in the method Notes).
  • The frontend validates df_parts (row count, part-column coverage, real length ≤ padded tensor
    length) before dispatch.
  • The heavy lifting lives in _backend/num_feat/feature_matrix.py. The method is a @staticmethod on an
    already-exported class, so there is no __init__.py change.
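
The effect of taking L_part from df_parts rather than from NaN padding can be illustrated with a small NumPy sketch. This is not the PR's backend code: segment_indices and feature_value are hypothetical helpers, and the equal-width Segment(i_th, n_split) rule is an assumed stand-in for the real split parser. Only the overall recipe (split over arange(L_part), scale selects the column, nanmean, round to 5) is taken from the description above.

```python
import numpy as np

def segment_indices(part_len, i_th, n_split):
    """Residue indices of the i_th of n_split equal segments (assumed split rule)."""
    positions = np.arange(part_len)  # 0-based residue axis of the part
    seg_len = part_len / n_split
    start, end = int(seg_len * (i_th - 1)), int(seg_len * i_th)
    return positions[start:end]

def feature_value(part_tensor, part_len, i_th, n_split, scale_col):
    """nanmean of one scale column over the selected residues, rounded to 5 places."""
    idx = segment_indices(part_len, i_th, n_split)
    return round(float(np.nanmean(part_tensor[idx, scale_col])), 5)

# One part padded to 6 rows, D = 2. Residue 3 is real but all-NaN across D;
# rows 4 and 5 are padding.
nan = np.nan
part = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                 [nan, nan], [nan, nan], [nan, nan]])

true_len = 4  # real length, e.g. a non-gap character count
nan_len = int((~np.isnan(part).all(axis=1)).sum())  # 3: residue 3 miscounted as padding

value_true = feature_value(part, true_len, 2, 2, 0)  # rows 2..3 -> nanmean([3.0, nan])
value_nan = feature_value(part, nan_len, 2, 2, 0)    # rows 1..2 -> nanmean([2.0, 3.0])
print(value_true, value_nan)  # 3.0 2.5
```

With the real length the second half covers residues 2 and 3; with the NaN-inferred length the boundary shifts to residues 1 and 2, so the two values disagree.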

Verification

Byte-identical to run_num's engine (recompute_feature_matrix) for uniform, ragged/variable-length,
different-D, and the previously-divergent all-NaN-real-residue inputs.
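
One way such a parity check could be expressed is sketched below, assuming both matrices are available as NumPy arrays. is_byte_identical is a hypothetical helper, not the test code from this PR, and the arrays are made-up stand-ins for the two reconstructions.

```python
import numpy as np

def is_byte_identical(a, b):
    """True if both arrays share shape, dtype, and raw bytes (NaNs included)."""
    a, b = np.ascontiguousarray(a), np.ascontiguousarray(b)
    return a.shape == b.shape and a.dtype == b.dtype and a.tobytes() == b.tobytes()

X_ref = np.array([[0.12345, np.nan], [1.5, -2.0]])  # stand-in for the reference matrix
X_new = X_ref.copy()                                # stand-in for the new reconstruction
X_off = X_ref.copy()
X_off[1, 0] += 1e-9                                 # tiny drift that must be detected

same = is_byte_identical(X_ref, X_new)
differs = is_byte_identical(X_ref, X_off)
print(same, differs)  # True False
```

A plain np.array_equal would report a mismatch for the first pair because NaN != NaN; np.array_equal(..., equal_nan=True) or a byte comparison avoids that.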

Ripple

  • numpydoc docstring (named Returns / Raises / Examples include)
  • executed example notebook examples/nf_feature_matrix.ipynb (every public parameter, display_df tables)
  • unit tests (per-parameter positive+negative including the new df_parts arg, golden hand-computed
    means, run_num consistency incl. all-NaN-real-residue parity, ragged parts)
  • release-notes Unreleased entry

Part of epic #336.

🤖 Generated with Claude Code

breimanntools force-pushed the feat/337-numericalfeature-feature-matrix branch from 840b3af to a2d28f9 on July 4, 2026 21:24

codecov Bot commented Jul 4, 2026

Codecov Report

❌ Patch coverage is 92.13483% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.84%. Comparing base (7dcc8d8) to head (38366b9).
⚠️ Report is 9 commits behind head on master.

Files with missing lines                                 Patch %   Lines
...analysis/feature_engineering/_numerical_feature.py    91.30%    3 Missing and 1 partial ⚠️
...re_engineering/_backend/num_feat/feature_matrix.py    93.02%    2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #347      +/-   ##
==========================================
+ Coverage   94.83%   94.84%   +0.01%     
==========================================
  Files         196      197       +1     
  Lines       18767    18871     +104     
  Branches     3175     3198      +23     
==========================================
+ Hits        17797    17898     +101     
- Misses        633      636       +3     
  Partials      337      337

Files with missing lines                                 Coverage Δ
...re_engineering/_backend/num_feat/feature_matrix.py    93.02% <93.02%> (ø)
...analysis/feature_engineering/_numerical_feature.py    96.52% <91.30%> (-3.48%) ⬇️

... and 13 files with indirect coverage changes

Components    Coverage Δ
cpp_core      94.95% <ø> (ø)

@breimanntools breimanntools left a comment

Is this really necessary? dict_num_parts goes into CPP.run_num and we then get X, rather than going from dict_num_parts to X. Why did you add this here? Was there any use case that required this?

Add NumericalFeature.feature_matrix(features, dict_num_parts, df_parts,
df_scales=..., n_jobs=1), the numerical analog of SequenceFeature.feature_matrix:
it turns CPP.run_num-selected features back into a model matrix X while preserving
the per-residue context that per-AA-averaged sequence features discard.

Values are reconstructed exactly the way CPP.run_num does — the SPLIT in each
feature id is re-applied to the part's 0-based residue axis (arange(L_part)), the
SCALE selects the D column, and the selected residues are nanmean-averaged (round 5).
Crucially, the per-part real length L_part comes from df_parts via the SAME helper
run_num uses internally (_derive_dict_part_lens, non-gap character count) rather than
being inferred from the tensor's NaN padding, so X is byte-identical to run_num's
value reconstruction in every case — including when a genuine residue is all-NaN
across D (an unresolved structure position or masked embedding), which NaN-inference
would have mis-counted as padding and shifted the split boundaries. Verified against
recompute_feature_matrix for uniform, ragged, and all-NaN-real-residue inputs.

The df_feat 'positions' column is a JMD-offset display numbering (e.g. 21..30 for a
TMD), NOT a tensor index, so it is deliberately not used for value lookup; this is
documented in the method Notes. The frontend validates df_parts (row count, part
coverage, real length <= padded tensor length) before dispatch.

Heavy lifting lives in NumericalFeature's own _backend/num_feat/feature_matrix.py
(reusing the shared cpp split/parse helpers). Ripple: numpydoc docstring with named
Returns / Raises / Examples include; executed examples notebook nf_feature_matrix.ipynb
(every public parameter, display_df tables); unit tests (per-parameter positive+negative,
golden hand-computed means, run_num consistency incl. the all-NaN-real-residue case,
ragged parts); release-notes Unreleased entry. No __init__.py change (method on an
already-exported class).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
breimanntools force-pushed the feat/337-numericalfeature-feature-matrix branch from a2d28f9 to 38366b9 on July 5, 2026 15:35
Development

Successfully merging this pull request may close these issues.

feat: NumericalFeature.feature_matrix for numeric CPP (run_num) outputs
