Skip to content

Fix batch of low-risk correctness bugs: defensive DataFrame copies, config/plotting validation gaps, validator messages #342

Description

@breimanntools

Problem

A full read-only audit (July 2026) surfaced a cluster of independent, low-risk
correctness defects that share one property: none change published numeric
output
, so all can be fixed and regression-tested without a regolden. Left in
place they cause real harm:

  • Silent data corruption (High). load_dataset(non_canonical_aa="gap") and
    load_features() return the shared @lru_cache DataFrame without a defensive
    copy, so an in-place edit — or the gap substitution itself — poisons the cache
    for every later caller in the process. SequenceFeature.get_df_feat(accept_gaps=True)
    similarly rewrites the caller's df_parts in place.
  • Cryptic crashes on valid edge inputs. read_fasta raises
    KeyError('sequence') on a leading blank line; SeqMut.mutate(df_feat=…)
    raises a length-mismatch ValueError when two entries share a mutation label;
    CPP.eval raises KMeans(n_clusters=0) on a single-feature set.
  • Dead validation / wrong output. Several options/plotting guards never
    fire (or always fire) due to bool-vs-str, off-by-one, or reachability bugs, and
    several validators report the wrong parameter name.

Goal

Fix all items below with a regression test each; default output byte-identical
(regression-tested), no regolden, no new dependencies.

Requirements

Backend-trusts-frontend and the validation-block conventions apply throughout;
add an explicit category to every touched warnings.warn.

Defensive copies (shared-object mutation)

  • data_handling/_load_dataset.py — copy the cached frame before the
    non-canonical-AA gap substitution (mirror the Overview branch's .copy()).
  • data_handling/_load_features.py — return df_feat.copy().
  • feature_engineering/_backend/check_feature.py (check_match_df_parts_df_scales)
    — copy df_parts before the gap str.replace, so get_df_feat no longer
    mutates the caller.

Crash-on-edge-input

  • data_handling/_backend/parse_fasta.py — skip blank lines / raise a clear
    ValueError when a sequence line precedes the first header.
  • protein_design/_seqmut.py (mutate) — re-join scored results on
    (entry, mutation) (or original row order), not the entry-less mutation label.
  • feature_engineering/_backend/cpp/cpp_eval.py (get_best_n_clusters) —
    clamp max_n = max(1, min(100, n_features-1)) and short-circuit n_features<=1.

Numeric

  • seq_analysis_pro/_backend/comp_seq_sim.py — self-similarity diagonal → 100
    (matches the [0,100] scale of every off-diagonal cell), not 1.

config.options validation (CONFIRM-FIRST — see checklist)

  • config.py — validate the incoming candidate for verbose/random_state
    (not the current global); validate name_tmd; stop leaving
    LOKY_MAX_CPU_COUNT=1 set permanently once allow_multiprocessing is re-enabled.

Plotting / display dead-branch & label bugs

  • _utils/plotting.py (_check_marker_size) — move the length check into the
    list branch so a wrong-length list raises ValueError, not a bare IndexError.
  • show_html/_display_df.py (_check_show) — max_val = n-1 for the row/col selector.
  • plotting/_plot_settings.py — set axes.linewidth/tick widths in the
    weight_bold branch too (bold must not render thinner than normal).
  • _backend/cpp/cpp_plot_feature_map.pyshow_only_max = imp_bar_label_type != "long"
    (currently a bool-vs-str compare, always True).
  • _backend/cpp/cpp_plot_profile.pymax(0, …) y-padding (currently min(0, …), always 0).
  • plotting/_plot_legend.py — fix the name= / dict keys so validation errors
    report the real parameter.

Validator messages / warnings hygiene

  • _aaclust.py (check_match_X_n_clusters) — correct the reversed inequality in the message.
  • _backend/check_aaclust.py (check_metric) — drop "None or" from the message
    (or accept None; see the AAclust decision issue).
  • data_handling/_seq_preproc.pymin_window_size = slide_stop - slide_start
    (typo slide_stop - slide_stop makes the guard dead).
  • data_handling/_load_scales.pyname="unclassified_out" (currently unclassified_in).
  • seq_analysis_pro/_comp_seq_sim.py — close the paren in the name= f-string.
  • Add explicit categories to the bare warnings.warn in _load_dataset.py,
    _read_fasta.py.

KPIs / Acceptance criteria

  • Each fix carries ≥1 regression test (identity / mutation-leak test for the
    copy fixes; crash→clean-error tests for the edge inputs; guard-fires tests
    for the dead branches).
  • Default numeric output byte-identical to current (regression-tested); no
    regolden; full pytest tests green on the Linux + Windows floors.
  • No print(); bare ValueError/RuntimeError; no aaanalysis._utils.*
    imports outside utils.py.

Scope / non-goals

  • Output-changing defects are excluded and tracked separately (CPP redundancy
    filter, TreeModel per-round seeding, BH p-value monotonicity) — those need an
    ADR-0032 tier + regolden.
  • No performance changes — the performance program is closed (ADR-0033).

Standards checklist

  • Frontend/backend split honored; backend trusts frontend
  • CONFIRM-FIRST: config.py is touched — the options validation fixes
    require maintainer sign-off before edit.
  • tests (unit; identity/golden where apt); reproducibility contract respected
  • no print(); bare exceptions; no cross-package _utils imports
  • no public-symbol / __all__ change

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions