feat: SequenceFeature.scale_mean — scale-average baseline featurizer #307

Description

@breimanntools

Part of #305.

Problem

The standard "scale-based vs CPP" comparison baseline — averaging each scale
over a whole sequence (no positional split) — has no library API, so the
γ-secretase notebook (cell 27) hand-rolls it with a raw residue comprehension:

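import numpy as np

def scale_X(df):
    # Concatenate the three parts into one full sequence per row
    seqs = (df["jmd_n"] + df["tmd"] + df["jmd_c"]).to_list()
    # Average every scale over the residues present in df_scales_red (a notebook global)
    return np.array([df_scales_red.loc[[a for a in s if a in df_scales_red.index]].mean(axis=0).values
                     for s in seqs])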

This is a legitimate, commonly-needed baseline featurization, but every user
reinvents it (and gets the missing-residue filtering subtly wrong).

Goal

Add a first-class scale-average featurizer that turns sequences + scales into a
(n_seq, n_scales) matrix, with the name matching its output noun.

Requirements

  • Add SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None)
    (or NumericalFeature if a better fit) returning the per-sequence scale
    average matrix. list_parts=None → whole sequence.
  • Handle non-canonical / missing-in-scale residues consistently with the
    rest of the package; document the rule.
  • Optional return_df=True for a labeled DataFrame (matches the package's
    return_df convention).
  • numpydoc docstring + per-method Examples include.
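
A minimal sketch of what these requirements could look like as a standalone function. This is not the package's implementation. The default part columns (jmd_n, tmd, jmd_c, taken from the notebook snippet), the "skip residues missing from df_scales" rule, and the NaN row for sequences with no usable residue are all assumptions that the final API would need to confirm and document.

```python
import numpy as np
import pandas as pd


def scale_mean(df_seq, df_scales, list_parts=None, return_df=False):
    """Per-sequence mean of every scale, with no positional split (illustrative sketch)."""
    # Assumption: list_parts=None means the whole jmd_n + tmd + jmd_c sequence
    parts = list_parts if list_parts is not None else ["jmd_n", "tmd", "jmd_c"]
    seqs = ["".join(row) for row in df_seq[parts].fillna("").astype(str).to_numpy()]
    n_scales = df_scales.shape[1]
    X = np.full((len(seqs), n_scales), np.nan)
    for i, seq in enumerate(seqs):
        # Assumed rule: residues absent from df_scales.index are skipped
        known = [aa for aa in seq if aa in df_scales.index]
        if known:
            X[i] = df_scales.loc[known].to_numpy(dtype=float).mean(axis=0)
        # Assumed rule: a sequence with no usable residue keeps its NaN row
    if return_df:
        return pd.DataFrame(X, index=df_seq.index, columns=df_scales.columns)
    return X


# Made-up two-scale, two-residue table, for illustration only
df_scales = pd.DataFrame({"s1": [0.0, 1.0], "s2": [0.5, 0.25]}, index=["A", "L"])
df_seq = pd.DataFrame({"jmd_n": ["AA", "XX"], "tmd": ["LL", "X"], "jmd_c": ["AX", ""]})
X = scale_mean(df_seq, df_scales)                     # shape (2, 2); second row is all NaN
df_X = scale_mean(df_seq, df_scales, return_df=True)  # same values, labeled by scale name
```

Preallocating a NaN matrix keeps the output shape fixed at (n_seq, n_scales) even when df_seq is empty or a sequence contains no residue from the scale table.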

KPIs / Acceptance criteria

  • Output equals the notebook's manual comprehension on the canonical fixture
    (within float tolerance).
  • ≥1 unit test per parameter (positive + negative), plus an edge-case test
    for an empty or all-non-canonical sequence.
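
One way the first criterion could be checked, sketched under assumptions: assert_matches_notebook is a hypothetical helper, count_based is only a stand-in for whatever featurizer ends up being implemented, and the fixture is made up rather than the canonical one.

```python
import numpy as np
import pandas as pd


def notebook_baseline(df, df_scales):
    """The notebook's comprehension, with the scale table passed in explicitly."""
    seqs = (df["jmd_n"] + df["tmd"] + df["jmd_c"]).to_list()
    return np.array([df_scales.loc[[a for a in s if a in df_scales.index]].mean(axis=0).values
                     for s in seqs])


def count_based(df, df_scales):
    """Stand-in candidate: residue counts times scales, divided by the counted residues."""
    seqs = (df["jmd_n"] + df["tmd"] + df["jmd_c"]).to_list()
    counts = np.array([[s.count(a) for a in df_scales.index] for s in seqs], dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        # 0 / 0 yields NaN for sequences without any residue from the scale table
        return counts @ df_scales.to_numpy(dtype=float) / totals


def assert_matches_notebook(featurizer, df, df_scales, rtol=1e-9):
    """Hypothetical helper: compare a candidate against the notebook baseline."""
    expected = notebook_baseline(df, df_scales)
    actual = np.asarray(featurizer(df, df_scales), dtype=float)
    assert actual.shape == (len(df), df_scales.shape[1])
    # equal_nan=True because all-non-canonical rows are NaN in the baseline
    np.testing.assert_allclose(actual, expected, rtol=rtol, equal_nan=True)


# Made-up fixture; the second row has no residue from the scale table
df_scales = pd.DataFrame({"s1": [0.0, 1.0], "s2": [0.5, 0.25]}, index=["A", "L"])
df = pd.DataFrame({"jmd_n": ["AA", "XX"], "tmd": ["LL", "X"], "jmd_c": ["AX", ""]})
assert_matches_notebook(count_based, df, df_scales)
```

Passing the featurizer as a callable lets the same check run unchanged against the final method once its signature is settled.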

Scope / non-goals

  • A no-positional-split mean only; positional splits remain CPP's job.

Dependencies

Standards checklist

  • frontend/backend · validation block · numpydoc · tests · no-print
