Part of #336 (usability epic).
Problem
Numeric CPP on embeddings/structure works for discovery but not for use:
EmbeddingPreprocessor/StructurePreprocessor + NumericalFeature.get_parts + CPP.run_num return a
df_feat, but there is no NumericalFeature.feature_matrix to turn selected features into a model
matrix. Worse, run_num's positions column uses a JMD-offset numbering (e.g. 21,22,…,30 for a
jmd=0 TMD) that doesn't map to the (L, D) array indices, so I couldn't reconstruct feature values
myself. I abandoned run_num and fell back to build_scales → SequenceFeature.feature_matrix
(per-AA-averaged), which discards the per-residue context that motivates using embeddings/structure.
Suggestion
- Add NumericalFeature.feature_matrix(features, dict_num_parts, df_scales=…) mirroring
  SequenceFeature.feature_matrix, or let CPP.run_num(..., return_X=True) return the matrix.
- Document how df_feat.positions from run_num maps back to dict_num_parts indices.
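To make the request concrete, here is a rough NumPy-only sketch of the glue such a helper could replace. It is not AAanalysis code. The names feature_matrix_sketch, positions_to_indices and first_position are hypothetical, the dict of (L, D) arrays is only a stand-in for dict_num_parts, and the constant-shift mapping from position labels to row indices is an assumption; whether the real numbering works that way is exactly what the documentation would need to confirm.

```python
import numpy as np


def positions_to_indices(positions, first_position):
    """Map offset position labels to 0-based row indices.

    ASSUMPTION: labels are a constant shift from array rows, so the label
    `first_position` corresponds to row 0. Not confirmed for run_num.
    """
    return [p - first_position for p in positions]


def feature_matrix_sketch(parts, features, first_position):
    """Build an (n_samples, n_features) matrix from per-residue arrays.

    parts: dict mapping an entry name to an (L, D) array (hypothetical
        stand-in for dict_num_parts).
    features: list of (positions, dim) pairs; each feature is the mean of
        column `dim` over the rows selected by `positions`.
    """
    names = sorted(parts)
    X = np.empty((len(names), len(features)))
    for i, name in enumerate(names):
        arr = np.asarray(parts[name])
        for j, (positions, dim) in enumerate(features):
            idx = positions_to_indices(positions, first_position)
            if min(idx) < 0 or max(idx) >= arr.shape[0]:
                raise IndexError(f"positions {positions} fall outside {name}")
            X[i, j] = arr[idx, dim].mean()
    return names, X


# Toy data: one entry with L=10 residues and D=2 numeric dimensions.
parts = {"P1": np.arange(20, dtype=float).reshape(10, 2)}
# Two toy features, using position labels 21..30 as in the example above.
features = [([21, 22, 23], 0), ([30], 1)]
names, X = feature_matrix_sketch(parts, features, first_position=21)
```

Averaging over the selected positions is just one possible aggregation; the point is that the label-to-index step has to be guessed today.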
Why it matters
Without this, the embedding/structure branches of CPP can find interpretable features but can't feed
them to a downstream classifier without bespoke glue.