ms609 · Mar 28, 2026
diff --git a/.AGENTS/memory/architecture.md b/.AGENTS/memory/architecture.md
@@ -0,0 +1,198 @@
+# Architecture Reference
+
+Load this when: editing `src/ts_*.cpp`/`.h`, adding Rcpp exports, reading
+the R-level API, or reviewing design decisions.
+
+---
+
+## R-level API
+
+| Function | Engine | Purpose |
+|----------|--------|---------|
+| `MaximizeParsimony()` | C++ driven search | Primary search (EW, IW, profile, constraints) |
+| `Morphy()` | R-loop + MorphyLib | Legacy search (custom stopping, per-iteration callbacks) |
+| `MaximizeParsimony2()` | — | Deprecated alias for `MaximizeParsimony()` |
+| `Resample()` | C++ | Jackknife/bootstrap resampling |
+| `SuccessiveApproximations()` | C++ | Successive approximations weighting |
+| `TreeLength()` | C++ `ts_fitch_score` | Score one or more trees |
+| `FastCharacterLength()` | C++ `ts_char_steps` | Per-character step counts |
+| `AdditionTree()` | C++ `ts_wagner_tree` | Wagner tree construction |
+| `RandomTreeScore()` | C++ (phyDat) or MorphyLib (morphyPtr) | Score a random tree |
+| `TaxonInfluence()` | C++ via `MaximizeParsimony()` | Per-taxon search |
+| `SearchControl()` | — | Expert parameter constructor for `MaximizeParsimony()` |
+| `ParsSim()` | Pure R | Simulate datasets under parsimony (EW/IW/profile) |
+
+`MaximizeParsimony()` has a backward-compatibility shim: passing old
+Morphy-style parameters (`ratchIter`, `tbrIter`, etc.) triggers a deprecation
+warning and delegates to `Morphy()`. Scheduled for removal in 2028.
+
+---
+
+## C++ module map
+
+| Module | Header/Source | Purpose |
+|--------|--------------|---------|
+| Fitch scoring | `ts_fitch.h/.cpp` | Downpass, uppass, incremental, indirect |
+| NA scoring | `ts_fitch_na.h` | Three-pass inapplicable algorithm (Brazeau et al. 2019) |
+| NA incremental | `ts_fitch_na_incr.h` | Incremental NA-aware scoring for TBR/drift |
+| SIMD | `ts_simd.h` | SSE2/NEON portability layer for bit-parallel ops |
+| Data | `ts_data.h/.cpp` | `DataSet`, `CharBlock`, `build_dataset`, simplification |
+| Tree | `ts_tree.h/.cpp` | `TreeState`, topology manipulation, `PreallocUndo` |
+| Constraint | `ts_constraint.h/.cpp` | Topological constraint enforcement |
+| TBR | `ts_tbr.h/.cpp` | TBR search (with sector_mask for CSS) |
+| SPR/NNI | `ts_search.h/.cpp` | SPR and NNI search (standalone, not in driven pipeline) |
+| Ratchet | `ts_ratchet.h/.cpp` | Perturbation (zero/upweight/mixed, adaptive) |
+| Drift | `ts_drift.h/.cpp` | Accept suboptimal moves within AFD/RFD limits |
+| Wagner | `ts_wagner.h/.cpp` | Greedy addition tree (incremental scoring, NA-aware) |
+| Sectorial | `ts_sector.h/.cpp` | RSS (conflict-guided), XSS, CSS; from-above HTU |
+| Fuse | `ts_fuse.h/.cpp` | Tree fusing (in-place exchange) |
+| Pool | `ts_pool.h/.cpp` | Dedup, eviction, consensus hash, split frequency table |
+| Splits | `ts_splits.h/.cpp` | Bipartition computation, comparison, `hash_single_split()` |
+| Driven | `ts_driven.h/.cpp` | Multi-replicate orchestrator |
+| Resample | `ts_resample.h/.cpp` | Jackknife, bootstrap, successive approximations |
+| Parallel | `ts_parallel.h/.cpp` | `std::thread` inter-replicate parallelism |
+| RNG | `ts_rng.h/.cpp` | Thread-safe RNG (`thread_local` dispatch) |
+| Simplify | `ts_simplify.h/.cpp` | Character compression and uninformativeness checks |
+| Collapsed | `ts_collapsed.h/.cpp` | Zero-length edge detection for clip skipping |
+| NNI perturb | `ts_nni_perturb.h/.cpp` | Stochastic NNI-perturbation (IQ-TREE-style topology escape) |
+| HSJ scoring | `ts_hsj.h/.cpp` | Hopkins & St. John hierarchy scoring |
+| Sankoff | `ts_sankoff.h/.cpp` | Sankoff step-matrix scoring (x-transform) |
+| Rcpp bridge | `ts_rcpp.cpp` | All Rcpp-exported functions |
+
+---
+
+## Scoring modes
+
+`ScoringMode` enum in `ts_data.h`: `EW`, `IW`, `PROFILE`, `XFORM`.
+- **EW**: standard Fitch parsimony
+- **IW**: implied weights via `e/(k+e)` where `e = steps - min_steps`
+- **PROFILE**: lookup in `info_amounts` table (structurally identical to IW pipeline)
+- **XFORM**: Fitch(non-hierarchy) + Sankoff(recoded composite characters)
+
+Profile mode sets `ds.concavity = 1.0` (finite sentinel) so existing
+`isfinite()` checks activate the weighted pipeline without code duplication.
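A minimal sketch of the IW cost and the `isfinite()` gate described above. The helper name `char_cost` and the convention that an infinite concavity means equal weights are assumptions for illustration; the profile-mode `info_amounts` lookup is not shown.

```cpp
#include <cmath>

// Hypothetical helper, not the package's own function.
// `concavity` plays the role of k; an infinite k is assumed to mean EW.
double char_cost(int steps, int min_steps, double concavity) {
  const double e = static_cast<double>(steps - min_steps);  // extra steps
  if (!std::isfinite(concavity)) {
    return static_cast<double>(steps);  // EW path: raw step count
  }
  return e / (concavity + e);  // IW path: e / (k + e)
}
```

With `k = 3`, a character with 5 steps and a minimum of 2 has `e = 3` and costs `3 / (3 + 3) = 0.5`. Any finite `concavity`, including the profile sentinel `1.0`, takes the weighted branch.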
+
+---
+
+## Parallelism design
+
+- `std::thread` (not OpenMP) to avoid R memory allocator conflicts
+- Per-thread: `DataSet` copy, `ConstraintData` copy, `std::mt19937` RNG
+- Shared: `ThreadSafePool` (mutex-guarded), atomic stop flag
+- Main thread: pre-generates seeds from R's RNG, polls
+  `R_CheckUserInterrupt()` and timeout every 200ms
+- Worker threads make no R API calls — `ts_rng.h` provides `thread_local`
+  dispatch (null → R API for serial; set → thread-local for parallel)
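The `thread_local` dispatch in the last bullet could look roughly like the sketch below. The names `tl_rng`, `serial_source`, and `next_random` are hypothetical, and `serial_source()` is only a stand-in for the serial path, which per the notes would go through R's RNG.

```cpp
#include <cstdint>
#include <random>

// Hypothetical per-thread pointer: null in serial mode, set by each worker.
thread_local std::mt19937* tl_rng = nullptr;

// Stand-in for the serial path (the real code would call the R API here).
std::uint32_t serial_source() { return 42u; }

// Dispatch: use the thread-local generator when one is installed,
// otherwise fall back to the serial source.
std::uint32_t next_random() {
  if (tl_rng != nullptr) {
    return static_cast<std::uint32_t>((*tl_rng)());
  }
  return serial_source();
}
```

A worker would point `tl_rng` at its own `std::mt19937`, seeded from the values the main thread pre-generated, before starting its replicate, so no R API call happens off the main thread.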
+
+---
+
+## Scoring notes
+
+- `.h` file changes (`ts_fitch_na.h`, `ts_fitch_na_incr.h`) may require
+  `touch src/ts_fitch.cpp` before rebuild if the build system doesn't track
+  header dependencies.
+- Incremental scoring is a **screening heuristic** for candidate selection;
+  `full_rescore()` / `score_tree()` is always authoritative.
+- See `.positai/expertise/fitch-scoring.md` for detailed invariants:
+  uppass correctness proof, NA staleness analysis, `upweight_mask` audit.
+
+---
+
+## Constraint enforcement
+
+- `build_constraint()` reads R split matrix with **column-major** indexing:
+  `split_matrix[s + n_splits * t]`.
+- Wagner uses LCA-based constraint mapping (`wagner_map_constraint_nodes`)
+  since splits aren't fully present during incremental construction.
+- Wagner has a posthoc retry loop (up to 100 random addition orders) as a
+  safety net for edge cases.
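For reference, the column-major read in the first bullet can be sketched as follows. The function name and the `std::vector<int>` storage are assumptions; only the index expression comes from the notes.

```cpp
#include <cstddef>
#include <vector>

// Read entry (split s, tip t) of an n_splits x n_tips matrix that R
// stores column by column. Hypothetical helper for illustration only.
inline int split_entry(const std::vector<int>& split_matrix,
                       std::size_t n_splits,
                       std::size_t s,
                       std::size_t t) {
  return split_matrix[s + n_splits * t];
}
```

For a 2 × 3 matrix stored as `{1, 0, 1, 1, 0, 1}`, split 0 covers tips 0 and 1 and split 1 covers tips 1 and 2. Swapping the roles of `s` and `t` in the index would silently read the transpose.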
+
+---
+
+## Exported Rcpp functions
+
+All registered in `ts_rcpp.cpp` and `TreeSearch-init.c`. Run
+`Rscript check_init.R` to verify consistency.
+
+| Function | Module | Purpose |
+|----------|--------|---------|
+| `ts_fitch_score` | ts_fitch | Score a tree |
+| `ts_char_steps` | ts_rcpp | Per-pattern step counts (with simplification offsets) |
+| `ts_na_debug_char` | ts_fitch_na | Per-node debug for a single pattern |
+| `ts_na_char_steps` | ts_fitch_na | Per-pattern step counts (raw, no offsets) |
+| `ts_debug_clip` | ts_fitch | Debug SPR clip/regraft |
+| `ts_test_indirect` | ts_fitch | Debug indirect length |
+| `ts_nni_search` | ts_search | NNI hill-climbing |
+| `ts_spr_search` | ts_search | SPR hill-climbing |
+| `ts_tbr_search` | ts_tbr | TBR with plateau exploration |
+| `ts_ratchet_search` | ts_ratchet | Ratchet perturbation |
+| `ts_drift_search` | ts_drift | Drift search |
+| `ts_wagner_tree` | ts_wagner | Wagner tree (specified addition order) |
+| `ts_random_wagner_tree` | ts_wagner | Wagner tree (random order) |
+| `ts_compute_splits` | ts_splits | Bipartition splits from edge matrix |
+| `ts_trees_equal` | ts_splits | Compare two trees |
+| `ts_pool_test` | ts_pool | Pool deduplication test |
+| `ts_tree_fuse` | ts_fuse | Fuse two trees |
+| `ts_sector_diag` | ts_sector | Sectorial search diagnostics |
+| `ts_rss_search` | ts_sector | Random Sectorial Search |
+| `ts_xss_search` | ts_sector | Exclusive Sectorial Search |
+| `ts_driven_search` | ts_driven | Full driven search |
+| `ts_resample_search` | ts_resample | One jackknife/bootstrap replicate |
+| `ts_successive_approx` | ts_resample | Successive approximations |
+| `ts_parallel_resample` | ts_parallel | Batch resample with parallelism |
+| `ts_bench_tbr_phases` | ts_rcpp | TBR phase timing diagnostic |
+| `ts_hsj_score` | ts_hsj | HSJ hierarchy scoring |
+
+---
+
+## Key design decisions
+
+1. **PreallocUndo** (`ts_tree.h`): Pre-allocated flat buffers for the TBR/drift
+   undo stack. Calls `grow()` to expand dynamically when capacity is exceeded
+   (the NA uppass saves both internal nodes and tips). Initial capacity: `3 * n_node`.
+
+2. **TBR symmetry breaking** (`ts_tbr.cpp`): FNV-1a hash deduplication of
+   `virtual_prelim` vectors to skip redundant rerooting evaluations.
+
+3. **Bounded indirect scoring**: All search modules use `_bounded` variants
+   that bail out once the accumulated score exceeds the best candidate's score.
+
+4. **Profile parsimony**: Reuses IW indirect pipeline unchanged; only delta
+   precomputation differs. `ds.concavity = 1.0` sentinel activates weighted
+   path. Max 2 informative states per character; inapplicable → ambiguous.
+
+5. **MPT enumeration**: Post-search TBR plateau walk from all pool seeds.
+   `tbr_search()` accepts optional `TreePool* collect_pool` parameter.
+
+6. **All-ambiguous phyDat guard**: `TreeLength()` and `MaximizeParsimony()`
+   check for `levels = NULL` / 0-column contrast matrix before calling C++.
+
+7. **From-above HTU for sectorial search** (`ts_sector.cpp`):
+   `compute_from_above_for_sector()` computes `from_above[sector_root]` —
+   the Fitch state-set the rest of the tree sends *down* to the sector
+   boundary, excluding the sector's own contribution. Used instead of
+   `final_[parent]` in `build_reduced_dataset()`. O(depth × total_words).
+
+8. **Split frequency table** (`ts_pool.h/.cpp`): `SplitFrequencyTable` maps
+   per-split FNV-1a hash → occurrence count across best-score pool trees.
+   Used by conflict-guided RSS to weight sector selection. The same FNV-1a
+   hash (`hash_single_split()` in `ts_splits.h`) is used by consensus
+   hashing and split frequency counting — must stay consistent.
+
+9. **Consensus-stability hash** (`ts_pool.cpp`): XOR of FNV-1a hashes of
+   splits present in ALL best-score trees. Updated after each replicate.
+   Hash collision false-matches are conservative (over-count stability).
+
+10. **Diversity-aware pool eviction** (`ts_pool.cpp`): When the pool is full
+    and a new tree ties the worst score, the entry most similar to the new
+    tree (most shared splits, counted via per-split FNV-1a hash set
+    membership) is evicted. Falls back to arbitrary worst entry when the
+    new tree is strictly better.
+
+11. **Cross-replicate consensus constraint tightening** (`ts_driven.cpp`):
+    When `consensus_constrain = true` and no user constraint is supplied,
+    after ≥5 replicates, unanimous pool splits are extracted and enforced
+    as topological constraints via `build_constraint_from_bitsets()`. The
+    TBR/SPR search then avoids breaking established consensus clades.
+    Constraints are cleared and rebuilt whenever the best score changes.
+    Sector/fuse operations do not enforce auto-constraints.
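Items 8 and 9 both rest on a per-split FNV-1a hash. The sketch below uses the standard 64-bit FNV-1a constants; the `Split` layout (a vector of 64-bit words), the byte order, and both function names are assumptions, not the actual `hash_single_split()` implementation.

```cpp
#include <cstdint>
#include <set>
#include <vector>

using Split = std::vector<std::uint64_t>;  // assumed bitset layout

// 64-bit FNV-1a over the bytes of each word, low byte first.
std::uint64_t fnv1a_words(const Split& words) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (std::uint64_t w : words) {
    for (int i = 0; i < 8; ++i) {
      h ^= (w >> (8 * i)) & 0xFFu;
      h *= 1099511628211ULL;  // FNV prime
    }
  }
  return h;
}

// XOR of the hashes of splits present in every tree of the pool.
std::uint64_t consensus_hash(const std::vector<std::set<Split>>& trees) {
  if (trees.empty()) return 0;
  std::uint64_t acc = 0;
  for (const Split& s : trees.front()) {
    bool in_all = true;
    for (const auto& tree : trees) {
      if (tree.count(s) == 0) { in_all = false; break; }
    }
    if (in_all) acc ^= fnv1a_words(s);
  }
  return acc;
}
```

Because XOR is commutative, the result does not depend on the order in which trees or splits are visited, which is what makes it usable as a stability fingerprint between replicates.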
diff --git a/.AGENTS/memory/benchmarking.md b/.AGENTS/memory/benchmarking.md
@@ -0,0 +1,174 @@
+# Benchmarks and Profiling
+
+Load this when: running benchmarks, interpreting benchmark results,
+doing VTune profiling, or selecting datasets for strategy validation.
+
+See also: `search-algorithms.md` (NNI, biased Wagner, outer cycles results),
+`search_strategy.md` (presets, ratchet tuning).
+
+---
+
+## VTune driver scripts — dry-run first
+
+**Always test a VTune driver script with plain `Rscript` before launching
+VTune.** Software-sampling overhead can be 5–20×; if the bare script takes
+30s, VTune may need 10 min. Target < 5s bare run for a lite driver.
+
+MaddisonSlatkin is exponential in tip count — even n=20 with k=3 can take
+seconds per call. Use small n (≤15 for k=3, ≤12 for k=4, ≤9 for k=5)
+and few iterations for VTune drivers.
+
+---
+
+## MorphoBank external benchmark corpus
+
+The neotrans repo (`../neotrans/inst/matrices/`) contains ~800 MorphoBank
+NEXUS matrices, which complement the 14 bundled datasets and 1 large-tree dataset.
+
+**Catalogue:** `dev/benchmarks/mbank_catalogue.csv` (659 usable matrices
+after ntax≥20 filter and dedup). Regenerate with
+`Rscript dev/benchmarks/build_mbank_catalogue.R`.
+
+**Train/validation split:** Matrices whose MorphoBank project number is
+divisible by 5 are **validation** (124 matrices, ~19%). All others are
+**training** (535 matrices). The 7 `syab*` files are always training.
+
+**Dedup:** Multi-file projects with ≥95% character identity on shared taxa
+(≥80% taxon overlap) are flagged `dedup_drop = TRUE`. 24 near-duplicates excluded.
+
+**IMPORTANT:** Validation results must **never** be used to guide strategy
+tuning. They confirm generalization only. This is a one-way door.
+
+**Fixed 25-matrix training sample:** `MBANK_FIXED_SAMPLE` in
+`bench_datasets.R` — 7 small, 7 medium, 7 large, 4 xlarge. Selected via
+max-min distance on standardized features. **Do not modify.** Used by
+`benchmark_mbank_sample()`. Fitch track only.
+
+**Fixed 20-matrix Brazeau-track sample:** `MBANK_BRAZEAU_SAMPLE` in
+`bench_datasets.R` — 5 small, 6 medium, 6 large, 3 xlarge. Restricted to
+training matrices with **pct_inapp ≥ 4%**. **Do not modify.**
+
+**Key functions** (in `dev/benchmarks/bench_datasets.R`):
+- `load_mbank_catalogue()` — loads metadata CSV (excludes dedup by default)
+- `load_mbank_sample(cat, n, seed, split)` — stratified random sample
+- `load_mbank_datasets(cat, keys)` — load specific matrices by key
+- `load_mbank_brazeau_sample(cat)` — fixed 20-matrix Brazeau sample
+- `has_meaningful_inapp(cat, threshold)` — filter to pct_inapp ≥ threshold
+
+**Benchmark runners** (in `dev/benchmarks/bench_framework.R`):
+- `benchmark_mbank_sample()` — fixed 25-matrix training sample (routine)
+- `benchmark_mbank_sweep(split)` — full training or validation sweep
+- `benchmark_mbank_validation()` — validation sweep with prominent warning
+
+**Benchmark tracks:**
+
+| Track | Scoring | Datasets | Purpose |
+|-------|---------|----------|---------|
+| **Fitch** | `fitch_mode()` | 14 bundled + `MBANK_FIXED_SAMPLE` | TNT comparison, core search quality |
+| **Brazeau** | Default (Brazeau 2019) | `MBANK_BRAZEAU_SAMPLE` + bundled | NA-algorithm-specific strategy tuning |
+
+TNT comparisons are Fitch track only.
+
+**TNT comparison suite** lives in `../TS-TNT-bench/`. Key files:
+- `dev/benchmarks/bench_tnt_compare.R` — runner (smoke/medium/full)
+- `dev/benchmarks/tnt_comparison.qmd` — Quarto report
+- Requires TNT 1.6 at `C:/Programs/Phylogeny/tnt/TNT-bin/tnt.exe`
+
+Benchmark scripts in `dev/benchmarks/`. Key files:
+- `bench_regression.R` — CI regression test (score quality + timing bounds)
+- `bench_framework.R` — Dataset × strategy × replicate grid
+- `strategies.md` — Strategy space documentation
+
+---
+
+## Benchmarking methodology notes
+
+**Metric:** When comparing strategies with different time costs (e.g.
+NNI→TBR vs TBR), use **time-adjusted expected best** (TAEB) — the expected
+minimum score from k = budget / time_per_rep independent replicates. Median
+per-replicate score is adequate only when comparing parameter changes on a
+fixed pipeline (same time-per-rep). Bootstrap estimation: sample k scores
+with replacement, take the min, repeat 5000×, take the mean.
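The bootstrap estimate described above can be sketched as below. The function name `taeb`, flooring `k`, and clamping it to at least 1 are assumptions; the benchmark scripts themselves are in R and may differ in detail. `scores` must be non-empty.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Time-adjusted expected best: mean over bootstrap draws of the minimum
// of k scores sampled with replacement, where k = budget / time_per_rep.
double taeb(const std::vector<double>& scores,
            double budget,
            double time_per_rep,
            std::mt19937& rng,
            int n_boot = 5000) {
  // Assumption: fractional replicates are dropped, with at least one kept.
  const std::size_t k = std::max<std::size_t>(
      1, static_cast<std::size_t>(budget / time_per_rep));
  std::uniform_int_distribution<std::size_t> pick(0, scores.size() - 1);
  double total = 0.0;
  for (int b = 0; b < n_boot; ++b) {
    double best = scores[pick(rng)];
    for (std::size_t i = 1; i < k; ++i) {
      best = std::min(best, scores[pick(rng)]);
    }
    total += best;
  }
  return total / n_boot;
}
```

A slower strategy gets a smaller `k` for the same budget, so it has to produce better per-replicate scores to win on this metric.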
+
+**Brazeau vs EW scoring confound (T-265, 2026-03-26):** TreeSearch uses the
+Brazeau et al. (2019) inapplicable algorithm by default, which penalizes
+inapplicable-to-applicable transitions. TNT treats `-` as `?` (standard EW
+Fitch). On 11 gap datasets, the apparent mean gap was +17.8 steps; the
+actual EW-vs-EW gap is only +2.2 steps (5 datasets at 0 gap). **All TNT
+comparisons MUST use `fitch_mode()` to convert inapplicable to missing**
+for apples-to-apples scoring. `fitch_mode()` is defined in
+`bench_intra_fuse.R` and `bench_t265_regression.R`.
+
+**`maxTime` confound (2026-03-23):** `maxTime` (legacy Morphy parameter)
+silently delegates to the R-loop `Morphy()` engine. Use `maxSeconds` for
+the C++ driven search, which is ~10× faster at 180 tips.
+
+**Early vs late search:** Early replicates are dominated by initial descent
+quality (Wagner → local optimum); late replicates test ratchet/drift escape.
+At ≤88 tips, 20s gives 10–40 replicates spanning both regimes. At 180 tips,
+20s doesn't complete one replicate.
+
+---
+
+## Phase distribution baselines
+
+**T-290b (2026-03-28, Brazeau-sample datasets, 30s, post-T-255 no-drift presets):**
+
+| Phase | Fitch/EW/default | Fitch/EW/thorough | Brazeau/EW/default | Brazeau/EW/thorough |
+|-------|:---:|:---:|:---:|:---:|
+| Ratchet | 76% | 65% | 74% | 63% |
+| TBR | 8% | 5% | 7% | 4% |
+| XSS | 6% | 7% | 5% | 6% |
+| RSS | 3% | 10% | 3% | 10% |
+| CSS | — | 7% | — | 7% |
+| Wagner | 4% | 3% | 9% | 7% |
+| Final TBR | 2% | 2% | 2% | 2% |
+
+*(Drift has been 0% in all presets since T-255.)*
+
+**Brazeau / Fitch per-phase cost ratios (T-290b, EW):**
+
+| Phase | default | thorough |
+|-------|:-------:|:--------:|
+| Wagner | **3.6×** | **3.9×** |
+| Ratchet | 1.3× | 1.3× |
+| RSS/CSS | 1.3× | 1.3× |
+| TBR | 0.9× | 0.9× |
+
+Wagner is the outlier. All other phases are within 0.9–1.4× of Fitch cost.
+
+**wagnerStarts under Brazeau (T-290b/c, 2026-03-28):**
+- *Multiple reps/budget*: wagnerStarts=1 and 3 are near-equivalent, with w3 marginally better.
+- *~1 rep/budget* (60s at 86t/3660c): wagnerStarts=3 is better by 564 steps.
+- *0 reps/budget* (30s at 86t/3660c): wagnerStarts=1 is **better**, because Brazeau
+  Wagner is expensive (~4×) and 3 starts consume the budget.
+Current presets are correct: thorough (w3, gets ≥1 rep at 65–119t) ✓; large (w1) ✓.
+
+Per-candidate indirect scoring is at memory-throughput limit (~23 ns at 75 tips).
+
+---
+
+## Ratchet tuning validation (2026-03-22)
+
+Full 14-dataset comparison, optimized vs original defaults (10s budget, 3 seeds).
+
+| Dataset | Tips | Original | Optimized | Delta (Original − Optimized) |
+|---------|:---:|:---:|:---:|:---:|
+| Longrich2010 | 20 | 131 | 131 | 0 |
+| Vinther2008 | 23 | 79 | 79 | 0 |
+| Sansom2010 | 23 | 189 | 189 | 0 |
+| DeAssis2011 | 33 | 64 | 64 | 0 |
+| Aria2015 | 35 | 143 | 143 | 0 |
+| Wortley2006 | 37 | 494 | 491 | +3 |
+| Griswold1999 | 43 | 408 | 407 | +1 |
+| Schulze2007 | 52 | 165 | 164 | +1 |
+| Eklund2004 | 54 | 442 | 441 | +1 |
+| Agnarsson2004 | 62 | 778 | 778 | 0 |
+| Zanol2014 | 74 | 1338 | 1331 | +7 |
+| Zhu2013 | 75 | 649 | 650 | −1 |
+| Giles2015 | 78 | 720 | 716 | +4 |
+| Dikow2009 | 88 | 1614 | 1614 | 0 |
+
+Zhu2013 marginal regression at 10s resolves at 20s (median 649→644).
+At 20s with 5 seeds: Zhu2013 645/643, Giles2015 712/710, Dikow2009
+1611/1611 (all improvements).