feat: NumericalFeature.feature_matrix for numeric CPP (run_num) outputs #337

Description

@breimanntools

Part of #336 (usability epic).

Problem

Numeric CPP on embeddings/structure works for feature discovery but not for downstream use:
EmbeddingPreprocessor/StructurePreprocessor + NumericalFeature.get_parts + CPP.run_num return a
df_feat, but there is no NumericalFeature.feature_matrix to turn selected features into a model
matrix. Worse, run_num's positions column uses a JMD-offset numbering (e.g. 21,22,…,30 for a
jmd=0 TMD) that doesn't map to the (L, D) array indices, so I couldn't reconstruct feature values
myself. I abandoned run_num and fell back to build_scales → SequenceFeature.feature_matrix
(per-AA-averaged), which discards the per-residue context that motivates using embeddings/structure.
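
A minimal sketch of the index mismatch, using plain NumPy only. The helper `positions_to_indices` and the constant `offset=20` are assumptions inferred from the 21,22,…,30 example above (read as a 10-residue TMD); neither is part of the aaanalysis API, and the real offset presumably depends on the JMD and start settings.

```python
import numpy as np

def positions_to_indices(positions, offset):
    """Hypothetical helper (not in aaanalysis): map 1-based, offset
    positions from df_feat to 0-based row indices of an (L, D) array."""
    if isinstance(positions, str):
        # Accept a comma-separated string such as "21,22,23"
        positions = [int(p) for p in positions.split(",")]
    idx = np.asarray(positions, dtype=int) - offset - 1
    if (idx < 0).any():
        raise ValueError("position is smaller than offset + 1")
    return idx

# Toy (L, D) array: 10 residues, 4 embedding dimensions
part = np.arange(40, dtype=float).reshape(10, 4)

# Positions as in the example above; offset=20 is an assumed constant
idx = positions_to_indices("21,22,23,24,25,26,27,28,29,30", offset=20)
rows = part[idx]  # rows of the (L, D) array covered by the feature
```

Without a documented offset, the subtraction in `positions_to_indices` is guesswork, which is the gap this issue asks to close.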

Suggestion

  • Add NumericalFeature.feature_matrix(features, dict_num_parts, df_scales=…) mirroring
    SequenceFeature.feature_matrix, or let CPP.run_num(..., return_X=True) return the matrix.
  • Document how df_feat.positions from run_num maps back to dict_num_parts indices.
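
A rough sketch of what the requested matrix construction could look like, again in plain NumPy. Everything here is hypothetical: `feature_matrix_sketch`, the `(rows, dim)` feature encoding, and the choice to average over positions are assumptions for illustration, not the existing or planned aaanalysis behavior.

```python
import numpy as np

def feature_matrix_sketch(feature_specs, parts):
    """Hypothetical stand-in for the proposed NumericalFeature.feature_matrix.

    feature_specs: list of (row_indices, dim) tuples, 0-based into each (L, D) array
    parts: list of (L, D) arrays, one per sample
    Returns X with shape (n_samples, n_features); each value is the mean of
    one dimension over the selected rows (averaging is an assumption).
    """
    X = np.empty((len(parts), len(feature_specs)), dtype=float)
    for i, arr in enumerate(parts):
        for j, (rows, dim) in enumerate(feature_specs):
            X[i, j] = arr[np.asarray(rows), dim].mean()
    return X

# Toy data: 3 samples, each a (10, 4) per-residue array
rng = np.random.default_rng(0)
parts = [rng.normal(size=(10, 4)) for _ in range(3)]

# Two illustrative features: rows 0-2 of dim 0, rows 5-9 of dim 3
feature_specs = [([0, 1, 2], 0), ([5, 6, 7, 8, 9], 3)]
X = feature_matrix_sketch(feature_specs, parts)  # shape (3, 2)
```

The resulting `X` can be passed to any estimator that accepts a 2D array, which is the role SequenceFeature.feature_matrix plays on the sequence side.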

Why it matters

Without this, the embedding/structure branches of CPP can find interpretable features but can't feed
them to a downstream classifier without bespoke glue.
