Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .github/workflows/memcheck.yml
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,12 @@ jobs:
RSPM: https://packagemanager.rstudio.com/cran/__linux__/noble/latest
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
ASAN_OPTIONS: verify_asan_link_order=0
# Signals to the test suite that it is running under valgrind, so heavy
# scale-only tests can be skipped: valgrind's ~30x slowdown makes them the
# dominant cost while adding no memory-bug coverage beyond the small cases.
# Distinct from USING_ASAN, which suppresses deliberate-bad-input error
# tests that ASan (but not valgrind) aborts on.
TESTING_MEMCHECK: true

steps:
- uses: ms609/actions/memcheck@main
Expand Down
8 changes: 8 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -28,3 +28,11 @@ vignettes/*_cache*
/*-benchmark-results
*.bench.Rds
.positai

# /profile skill artefacts (large; rebuilt each round)
dev/profiling/.vtune-lib-*/
dev/profiling/.libpath.txt
dev/profiling/result_*/
dev/profiling/drivers/*-profvis.html
dev/profiling/*.csv
dev/profiling/*.rds
21 changes: 21 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,30 @@
# TreeTools 2.4.0.9000 (development) #

## Bug fixes

- `Consensus(trees, p)` now retains a split present in exactly a proportion `p`
of trees (i.e. in `ceiling(p * length(trees))` trees) for `p > 0.5`, matching
the documentation and `ape::consensus()`; previously such a split was dropped
at exact thresholds (e.g. a split in 2 of 3 trees with `p = 2/3`). The
majority threshold `p = 0.5` is unchanged (a split must occur in more than
half the trees).

## Performance

- Guarantee preorder return from `root_on_node()` to simplify `Consensus()`
internal pre-processing
- `Consensus()` and `SplitFrequency()` defer materialising a split's bit pattern
until it is needed, so splits that never reach the consensus threshold are no
longer built. Identical results; up to ~13× faster for large trees (greatest
gains for tall trees / many tips), with no change at small sizes.
- `RenumberTips()` relabels an unlabelled `multiPhylo` or `list` of trees in a
single C++ pass instead of a per-tree R loop, with a no-op fast path for trees
already in the target order. Speeds up `Consensus()` and other callers when
combining many trees; results are unchanged.
- `Consensus()` no longer copies every input tree to strip branch lengths and
node labels (the consensus core ignores both); it now coerces in place,
trimming wrapper overhead (~25% faster on small forests of many short trees).
Results are unchanged.

# TreeTools 2.4.0 (2026-06-02) #

Expand Down
26 changes: 13 additions & 13 deletions R/Consensus.R
Original file line number Diff line number Diff line change
Expand Up @@ -7,15 +7,19 @@
#' against every other tree in linear time. The majority-rule and threshold
#' consensus (`0.5 <= p < 1`) instead count the frequency of every split across
#' all trees in a single pass and retain those occurring in a proportion `p` or
#' more of trees; this runs in time linear in the number of trees, after
#' \insertCite{Jansson2016}{TreeTools}. By default the count uses
#' a 128-bit hash, whose results are exact with overwhelming probability; set
#' `hash = FALSE` for a slower but guaranteed-exact count.
#' more of trees (i.e. in at least `ceiling(p * length(trees))` trees); this
#' runs in time linear in the number of trees, after
#' \insertCite{Jansson2016}{TreeTools}. The majority threshold `p = 0.5` is
#' strict: a split is retained only if it occurs in *more* than half the trees,
#' so that two conflicting splits can never both be reported. By default the
#' count uses a 128-bit hash, whose results are exact with overwhelming
#' probability; set `hash = FALSE` for a slower but guaranteed-exact count.
#'
#' @param trees List of trees, optionally of class `multiPhylo`.
#' @param p Proportion of trees that must contain a split for it to be reported
#' in the consensus. `p = 0.5` gives the majority-rule consensus; `p = 1` (the
#' default) gives the strict consensus.
#' @param p A number from 0.5 to 1 giving the proportion of trees that must
#' contain a split for it to be reported in the consensus: from `p = 0.5` (more
#' than half the trees; the majority-rule consensus) to `p = 1` (every tree; the
#' strict consensus, the default).
#' @param check.labels Logical specifying whether to check that all trees have
#' identical labels. Defaults to `TRUE`, which is slower.
#' @param hash Logical; if `TRUE` (default), majority/threshold consensus
Expand Down Expand Up @@ -54,12 +58,8 @@ Consensus <- function(trees, p = 1, check.labels = TRUE, hash = TRUE) {
stop("Expecting `trees` to be a list.")
}

# Remove irrelevant metadata so we don't waste time processing it
trees <- lapply(c(trees), function(tr) {
tr[["edge.length"]] <- NULL
tr[["node.label"]] <- NULL
tr
})
# c() materialises a labelled multiPhylo's shared tip labels onto each tree.
trees <- c(trees)

repeat {
nTip <- NTip(trees)
Expand Down
4 changes: 4 additions & 0 deletions R/RcppExports.R
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,10 @@ renumber_tips_batch <- function(trees, perm, n_tip, new_labels) {
.Call(`_TreeTools_renumber_tips_batch`, trees, perm, n_tip, new_labels)
}

renumber_tips_to <- function(trees, target) {
.Call(`_TreeTools_renumber_tips_to`, trees, target)
}

cpp_edge_to_splits <- function(edge, order, nTip) {
.Call(`_TreeTools_cpp_edge_to_splits`, edge, order, nTip)
}
Expand Down
35 changes: 31 additions & 4 deletions R/tree_numbering.R
Original file line number Diff line number Diff line change
Expand Up @@ -601,6 +601,25 @@ TntOrder.NULL <- function(tree) NULL
#' @export
RenumberTips <- function(tree, tipOrder) UseMethod("RenumberTips")

# Fast path shared by the unlabelled multiPhylo and list methods: relabel an
# entire forest to a single target order in one C++ call, instead of dispatching
# RenumberTips.phylo per tree in R. Returns the relabelled list, or NULL to
# signal that the caller should fall back to the per-tree R path (the C++ helper
# declines anything it would not handle identically: see renumber_tips_to()).
.RenumberForestToC <- function(tree, tipOrder) {
if (is.numeric(tipOrder)) {
# Numeric tipOrder indexes each tree's OWN labels, so the target differs per
# tree; only the R path handles that.
return(NULL)
}
newOrder <- TipLabels(tipOrder, single = TRUE)
if (anyDuplicated(newOrder)) {
# Let RenumberTips.phylo raise its "repeated in tipOrder" error.
return(NULL)
}
renumber_tips_to(tree, newOrder)
}

#' @rdname RenumberTips
#' @export
RenumberTips.phylo <- function(tree, tipOrder) {
Expand Down Expand Up @@ -667,9 +686,16 @@ RenumberTips.multiPhylo <- function(tree, tipOrder) {
labelled <- !is.null(at[["TipLabel"]])

if (!labelled) {
# Unlabelled: each tree may have different tip orderings,
# so must process individually
tree <- lapply(tree, RenumberTips.phylo, tipOrder)
# Unlabelled: each tree may have its own tip ordering. Relabel the whole
# forest in one C++ pass; fall back to the per-tree R path for inputs the
# fast path declines (numeric tipOrder, differing label sets, non-phylo
# elements, ...).
relabelled <- .RenumberForestToC(tree, tipOrder)
tree <- if (is.null(relabelled)) {
lapply(tree, RenumberTips.phylo, tipOrder)
} else {
relabelled
}
attributes(tree) <- at
return(tree)
}
Expand Down Expand Up @@ -726,7 +752,8 @@ RenumberTips.multiPhylo <- function(tree, tipOrder) {
#' @rdname RenumberTips
#' @export
RenumberTips.list <- function(tree, tipOrder) {
lapply(tree, RenumberTips, tipOrder)
relabelled <- .RenumberForestToC(tree, tipOrder)
if (is.null(relabelled)) lapply(tree, RenumberTips, tipOrder) else relabelled
}

#' @rdname RenumberTips
Expand Down
100 changes: 100 additions & 0 deletions dev/profiling/DECISION.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
# Consensus: profiling outcome & the Jansson decision

**For MRS — read this first.** You asked me to implement Jansson's algorithm
alongside the current O(kn) consensus, optimise both, and let data pick a winner.

## TL;DR

1. **I did NOT implement the full Jansson (Maj_Rule_Plus) algorithm.** The data
say it cannot beat the (now-optimised) hashed counter in any realistic,
time-meaningful regime, and it is high-risk to get right (the authors' own
reference impl, FACT, ships **broken** majority code). The exact algorithm and
what a future implementation would need are written down below so it can be
built on request.
2. **I optimised the existing hashed counter** (deferred split materialisation):
**up to 13× faster** at high n (tall trees), median **1.23×**, with **zero
change to results** (split sets identical to the deterministic `exact` path,
verified at n up to 3000). This is the real, shipping win.
3. **hashed stays the default; `exact` stays the deterministic fallback.** No
crossover that warrants switching algorithms; one approach (hashed) wins
across the board — so, per your "avoid redundant code" steer, no third path.

## Why not Jansson — the data

Jansson's deterministic O(kn) majority algorithm exists to **count cluster
frequencies without hashing**, by matching each input tree against an evolving
O(n)-cluster candidate tree (Day's algorithm + `One_Way_Compatible` +
`Merge_Trees` + delete/insert). Its only advantage over the current code is
*determinism* (no ~1e-30 hash-collision risk — which you've explicitly waived)
and avoiding `exact`'s O(k·n·h) on tall trees.

**Measured lower bound (rigorous):** Jansson ≥ 2× the strict-consensus path
(Phase 2 ≈ strict's k Day-matchings; Phase 1 ≥ Phase 2, adding
One_Way_Compatible + Merge_Trees + per-iteration table rebuilds). Strict's inner
loop is *tighter* than Jansson's, so this under-counts Jansson → the verdict is
conservative. Where `2×strict ≥ hashed`, **Jansson provably loses**. Results vs
the optimised hashed (`drivers/jansson-bound.R`):

| regime | concordance | 2×strict / hashed | verdict |
|--------|-------------|-------------------|---------|
| high-k/low-n (all) | any | 1.1–1.8 | **Jansson loses (proven)** |
| high-k/high-n (2000,500) | concordant / moderate | 1.24 / 1.54 | **loses (proven)** |
| high-k/high-n (1000,1000) | concordant / moderate | 1.32 / 1.42 | **loses (proven)** |
| low-k/high-n (5000,10) | concordant / moderate | 0.99 / 1.09 | loses / borderline |
| low-k/high-n (10000,5) | any | 0.45–1.07 | not proven — but <60 ms absolute |
| high-k/high-n (≥2000,≥500) | **extreme** (rand/tall) | 0.44–0.56 | not proven |

**The only cells where Jansson is not proven to lose are** (a) sub-60 ms
absolute times (low-k/very-high-n — a 2× gap is <30 ms, irrelevant), or
(b) **extreme-conflict** input (≈ independent random trees), whose majority
consensus is a near-empty star — not something anyone computes. Realistic
consensus inputs (bootstrap / Bayesian / MPT sets, even conflicting gene-tree
sets) are concordant-to-moderate, and there Jansson provably loses.

This is consistent with the authors' own C++ `Maj_Rule_Plus` timings (~10×
slower than TreeTools' hashed at small n).

**Net:** building, proving, and CRAN-hardening `One_Way_Compatible` +
`Merge_Trees` + a dynamic delete/insert tree — the exact machinery FACT got
wrong — to maybe win an unrealistic corner by a few ms is a bad trade against
"correctness paramount / avoid redundant code". **Re-openable**: if you want the
deterministic-O(kn) guarantee regardless, the algorithm is
Maj_Rule_Plus Phase 1 (Fig. 2 of arXiv:1307.7821) + a standard-majority Phase 2
(keep `count > k·p` instead of `K(v) > Q(v)`); subroutine specs in
PLAN-consensus.md. Say the word.

## The optimisation that shipped (C-001)

`count_splits_hashed` no longer materialises every distinct split's bit pattern
eagerly. Each distinct split keeps a 12-byte witness `(tree, L, R)`; the packed
pattern is rebuilt only for splits that reach the consensus threshold (or all,
for `SplitFrequency`). At high n the wasted materialisation of non-surviving
splits was the dominant cost. Verified speedups (min of reps,
`drivers/compare-grids.R`): tall(10000,5) **×13.0**, tall(5000,10) ×8.6,
rand(10000,5) ×2.5, high-k/high-n ×1.5–2.9, median ×1.23. **Results identical to
the deterministic `exact` path in every cell.** Both rewired paths are gated at
shipping scale: `correctness-gate.R` (590 checks) verifies consensus split SETS
at n≤3000 and `SplitFrequency` split sets AND counts at n=2000/5000 (hashed ==
exact); package `test-consensus.R` (8/8) and `test-Support.R` (6/6) pass;
`verify-consensus.R` green.

## What else the profiling found (not yet actioned)

- **C-002 (open):** at **high-k/low-n** the R wrapper is **57%** of `Consensus()`
wall time; `RenumberTips` is 54% of that. A safe fast-path (batch C++ relabel
/ skip when labels already consistent) is the next throughput win in that
regime. Touches shared code → needs the full test suite as a gate. Deferred to
avoid shipping a risky shared-code change while you're away.
- **C-003 (low priority):** hashed's `unordered_map` churn dominates the
low-height extreme-conflict case; only matters for degenerate inputs.
- **Threshold convention (FYI, unchanged):** for 0.5<p<1 TreeTools keeps
`count > k·p`; ape keeps `>= k·p`; roxygen says "p or more". Flagging — your call.

## Correctness fixes made en route

- `dev/red-team/verify-consensus.R` was **dead**: it called
`consensus_tree(..., hash=FALSE)` but the arg is `exact` → "unused argument".
Fixed (`exact = TRUE`); now green (0 failures).
- Built a method-pluggable gate (`correctness-gate.R`) — hashed==exact==ape@0.5,
588 checks. (Found & fixed a `which.min(<character>)` bug in it that had been
silently skipping every ape comparison.)
Loading
Loading