Part of #305.
Problem
The standard "scale-based vs CPP" comparison baseline — averaging each scale
over a whole sequence (no positional split) — has no library API, so the
γ-secretase notebook (cell 27) hand-rolls it with a raw residue comprehension:
    def scale_X(df):
        seqs = (df["jmd_n"] + df["tmd"] + df["jmd_c"]).to_list()
        return np.array([df_scales_red.loc[[a for a in s if a in df_scales_red.index]].mean(axis=0).values
                         for s in seqs])
This is a legitimate, commonly-needed baseline featurization, but every user
reinvents it (and gets the missing-residue filtering subtly wrong).
Goal
Add a first-class scale-average featurizer that turns sequences + scales into a
(n_seq, n_scales) matrix, with the name matching its output noun.
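As a rough illustration of the goal, here is a minimal standalone sketch of such a featurizer. The name scale_mean_sketch, the toy scale values, and the missing-residue policy (skip residues absent from the scale index, return a NaN row when nothing is left) are assumptions made for this example, not the package's API or its agreed rule.

```python
import numpy as np
import pandas as pd


def scale_mean_sketch(seqs, df_scales, return_df=False):
    """Average each scale over every whole sequence (no positional split).

    Hypothetical helper, not an existing library API. Assumed policy:
    residues missing from df_scales.index are skipped, and a sequence
    with no known residues yields a row of NaN.
    """
    known = set(df_scales.index)
    n_scales = df_scales.shape[1]
    rows = []
    for seq in seqs:
        residues = [aa for aa in seq if aa in known]
        if residues:
            # .loc with a list keeps duplicates, so repeated residues
            # contribute once per occurrence to the mean.
            rows.append(df_scales.loc[residues].to_numpy(dtype=float).mean(axis=0))
        else:
            rows.append(np.full(n_scales, np.nan))
    X = np.vstack(rows) if rows else np.empty((0, n_scales))
    if return_df:
        return pd.DataFrame(X, columns=df_scales.columns)
    return X


# Toy scales: two made-up scales over three residues.
df_scales = pd.DataFrame(
    {"scale_a": [0.0, 1.0, 0.5], "scale_b": [1.0, 0.0, 0.5]},
    index=["A", "L", "G"],
)
# "X" is not in the scale index, so it is skipped under the assumed policy.
X = scale_mean_sketch(["AAL", "GXG", "XX"], df_scales)
print(X.shape)  # (3, 2)
```

Making the all-unknown case explicit (NaN row rather than an exception or a silent zero) is one possible choice; whichever rule the package adopts, stating it in the docstring removes the ambiguity of the hand-rolled version.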
Requirements
- SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None) (or NumericalFeature if a better fit) returning the per-sequence scale-average matrix.
- list_parts=None → whole sequence.
- [...] rest of the package; document the rule.
- return_df=True for a labeled DataFrame (matches the package's return_df convention).
- Examples include.
KPIs / Acceptance criteria
- [...] (within float tolerance).
- [...] non-canonical sequence edge case.
Scope / non-goals
- A no-positional-split mean only; positional splits remain CPP's job.
Dependencies
Standards checklist