Problem
A full read-only audit (July 2026) surfaced a cluster of independent, low-risk
correctness defects that share one property: none change published numeric
output, so all can be fixed and regression-tested without a regolden. Left in
place they cause real harm:
- Silent data corruption (High).
load_dataset(non_canonical_aa="gap") and
load_features() return the shared @lru_cache DataFrame without a defensive
copy, so an in-place edit — or the gap substitution itself — poisons the cache
for every later caller in the process. SequenceFeature.get_df_feat(accept_gaps=True)
similarly rewrites the caller's df_parts in place.
- Cryptic crashes on valid edge inputs.
read_fasta raises
KeyError('sequence') on a leading blank line; SeqMut.mutate(df_feat=…)
raises a length-mismatch ValueError when two entries share a mutation label;
CPP.eval raises KMeans(n_clusters=0) on a single-feature set.
- Dead validation / wrong output. Several
options/plotting guards never
fire (or always fire) due to bool-vs-str, off-by-one, or reachability bugs, and
several validators report the wrong parameter name.
Goal
Fix all items below with a regression test each; default output byte-identical
(regression-tested), no regolden, no new dependencies.
Requirements
Backend-trusts-frontend and the validation-block conventions apply throughout;
add an explicit category to every touched warnings.warn.
Defensive copies (shared-object mutation)
Crash-on-edge-input
Numeric
config.options validation (CONFIRM-FIRST — see checklist)
Plotting / display dead-branch & label bugs
Validator messages / warnings hygiene
KPIs / Acceptance criteria
Scope / non-goals
- Output-changing defects are excluded and tracked separately (CPP redundancy
filter, TreeModel per-round seeding, BH p-value monotonicity) — those need an
ADR-0032 tier + regolden.
- No performance changes — the performance program is closed (ADR-0033).
Standards checklist
Problem
A full read-only audit (July 2026) surfaced a cluster of independent, low-risk
correctness defects that share one property: none change published numeric
output, so all can be fixed and regression-tested without a regolden. Left in
place they cause real harm:
load_dataset(non_canonical_aa="gap")andload_features()return the shared@lru_cacheDataFrame without a defensivecopy, so an in-place edit — or the gap substitution itself — poisons the cache
for every later caller in the process.
SequenceFeature.get_df_feat(accept_gaps=True)similarly rewrites the caller's
df_partsin place.read_fastaraisesKeyError('sequence')on a leading blank line;SeqMut.mutate(df_feat=…)raises a length-mismatch
ValueErrorwhen two entries share a mutation label;CPP.evalraisesKMeans(n_clusters=0)on a single-feature set.options/plotting guards neverfire (or always fire) due to bool-vs-str, off-by-one, or reachability bugs, and
several validators report the wrong parameter name.
Goal
Fix all items below with a regression test each; default output byte-identical
(regression-tested), no regolden, no new dependencies.
Requirements
Backend-trusts-frontend and the validation-block conventions apply throughout;
add an explicit category to every touched
warnings.warn.Defensive copies (shared-object mutation)
data_handling/_load_dataset.py— copy the cached frame before thenon-canonical-AA gap substitution (mirror the Overview branch's
.copy()).data_handling/_load_features.py— returndf_feat.copy().feature_engineering/_backend/check_feature.py(check_match_df_parts_df_scales)— copy
df_partsbefore the gapstr.replace, soget_df_featno longermutates the caller.
Crash-on-edge-input
data_handling/_backend/parse_fasta.py— skip blank lines / raise a clearValueErrorwhen a sequence line precedes the first header.protein_design/_seqmut.py(mutate) — re-join scored results on(entry, mutation)(or original row order), not the entry-less mutation label.feature_engineering/_backend/cpp/cpp_eval.py(get_best_n_clusters) —clamp
max_n = max(1, min(100, n_features-1))and short-circuitn_features<=1.Numeric
seq_analysis_pro/_backend/comp_seq_sim.py— self-similarity diagonal →100(matches the
[0,100]scale of every off-diagonal cell), not1.config.options validation (CONFIRM-FIRST — see checklist)
config.py— validate the incoming candidate forverbose/random_state(not the current global); validate
name_tmd; stop leavingLOKY_MAX_CPU_COUNT=1set permanently onceallow_multiprocessingis re-enabled.Plotting / display dead-branch & label bugs
_utils/plotting.py(_check_marker_size) — move the length check into thelist branch so a wrong-length list raises
ValueError, not a bareIndexError.show_html/_display_df.py(_check_show) —max_val = n-1for the row/col selector.plotting/_plot_settings.py— setaxes.linewidth/tick widths in theweight_boldbranch too (bold must not render thinner than normal)._backend/cpp/cpp_plot_feature_map.py—show_only_max = imp_bar_label_type != "long"(currently a bool-vs-str compare, always True).
_backend/cpp/cpp_plot_profile.py—max(0, …)y-padding (currentlymin(0, …), always 0).plotting/_plot_legend.py— fix thename=/ dict keys so validation errorsreport the real parameter.
Validator messages / warnings hygiene
_aaclust.py(check_match_X_n_clusters) — correct the reversed inequality in the message._backend/check_aaclust.py(check_metric) — drop "None or" from the message(or accept None; see the AAclust decision issue).
data_handling/_seq_preproc.py—min_window_size = slide_stop - slide_start(typo
slide_stop - slide_stopmakes the guard dead).data_handling/_load_scales.py—name="unclassified_out"(currentlyunclassified_in).seq_analysis_pro/_comp_seq_sim.py— close the paren in thename=f-string.warnings.warnin_load_dataset.py,_read_fasta.py.KPIs / Acceptance criteria
copy fixes; crash→clean-error tests for the edge inputs; guard-fires tests
for the dead branches).
regolden; full
pytest testsgreen on the Linux + Windows floors.print(); bareValueError/RuntimeError; noaaanalysis._utils.*imports outside
utils.py.Scope / non-goals
filter, TreeModel per-round seeding, BH p-value monotonicity) — those need an
ADR-0032 tier + regolden.
Standards checklist
config.pyis touched — theoptionsvalidation fixesrequire maintainer sign-off before edit.
print(); bare exceptions; no cross-package_utilsimports__all__change