Clarification on `get_good_genes.py` sorting behavior and tree length direction

Hi,

First of all, thank you for making **SortaDate** available. I have been using it to explore a phylogenomic dataset for divergence-time analyses, and it has been very helpful.

I am opening this issue mostly as a request for clarification, because I may be misunderstanding either the intended behavior of the scripts or the relationship between the implementation and the description in **Smith et al. (2018)**. While applying SortaDate to my own dataset and inspecting the source code, I noticed two points that I could not fully reconcile.

## 1. Effect of secondary sorting criteria

My understanding from the README is that the `--order` argument defines the priority of the sorting criteria. For example, `--order 3,1,2` means sorting first by bipartition, then by root-to-tip variance, and then by tree length.

Looking at `get_good_genes.py`, it seems that the script builds tuples like:

```python
(name, root_to_tip_var, treelength, -bipartition)
```

and then applies:

```python
sorted(tuples, key=operator.itemgetter(first, second, third))
```

If I am reading this correctly, the sorting is a standard lexicographic sort. In that case, the second and third criteria only affect the ranking when there are exact ties in the previous criterion. For example, if the first criterion is root-to-tip variance, then tree length would only affect the order of genes/trees with exactly identical root-to-tip variance values.
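To illustrate what I mean, here is a minimal sketch with made-up values. It is not taken from the script itself. The gene names and numbers are hypothetical, and I am assuming the tuple layout shown above with the order root-to-tip variance, tree length, bipartition.

```python
import operator

# Hypothetical tuples: (name, root_to_tip_var, treelength, -bipartition)
genes = [
    ("geneA", 0.0010, 9.0, -0.9),
    ("geneB", 0.0011, 1.0, -0.9),
    ("geneC", 0.0010, 2.0, -0.9),
]

# Lexicographic sort on columns 1, 2, 3 (variance, then tree length, then bipartition)
ranked = sorted(genes, key=operator.itemgetter(1, 2, 3))
names = [g[0] for g in ranked]
print(names)  # ['geneC', 'geneA', 'geneB']
```

Tree length only separates `geneA` and `geneC`, which share exactly the same variance. `geneB` has the shortest tree but still ranks last, because its variance is marginally higher.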

In my dataset, root-to-tip variance values are essentially unique when enough decimal precision is retained. As a result, sorting first by root-to-tip variance appears to make tree length almost irrelevant to the final ranking, even when differences in tree length are large.

Is this the intended behavior of `get_good_genes.py`? Or was the sorting procedure intended to implement a more joint/multicriteria ranking, where later criteria can still influence the ranking among genes with similar, but not exactly identical, values for the first criterion?
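To make the distinction concrete, below is a rough sketch of one possible joint ranking (a sum of per-criterion ranks) next to the lexicographic sort. This is only an illustration of what I mean by "joint": as far as I can tell it is not what SortaDate implements, and I am not suggesting it is the right method. The function `rank_sum_order` and the data are hypothetical, and every column is ranked in ascending order, as in the tuples above.

```python
# Hypothetical tuples: (name, root_to_tip_var, treelength, -bipartition)
genes = [
    ("geneA", 0.0010, 9.0, -0.50),
    ("geneB", 0.0011, 1.0, -0.90),
    ("geneC", 0.0012, 2.0, -0.95),
]

def rank_sum_order(rows, columns=(1, 2, 3)):
    """Order names by the sum of their ascending ranks across the given columns."""
    totals = {row[0]: 0 for row in rows}
    for col in columns:
        # Rank 0 goes to the smallest value in this column
        for rank, row in enumerate(sorted(rows, key=lambda r: r[col])):
            totals[row[0]] += rank
    return sorted(totals, key=totals.get)

# Lexicographic: variance decides everything because no two values tie
lexicographic = [g[0] for g in sorted(genes, key=lambda g: (g[1], g[2], g[3]))]
# Joint: every criterion contributes to the final position
joint = rank_sum_order(genes)

print(lexicographic)  # ['geneA', 'geneB', 'geneC']
print(joint)          # ['geneB', 'geneC', 'geneA']
```

With these values the two approaches give opposite answers for `geneA`. It ranks first lexicographically because of its slightly lower variance, but last under the rank sum because it is the worst gene on the other two criteria.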

If the current behavior is intended, perhaps it would be useful to clarify in the documentation that `--order` implements a lexicographic priority order, rather than a weighted or joint ranking across criteria.

## 2. Direction of tree length sorting

The second point concerns the direction in which tree length is sorted.

In the paper, total tree length is described as representing **“discernible information content”**, and the empirical filtering procedure is described as selecting genes with **“discernible amounts of molecular evolution (i.e., greater tree length)”**. Based on this, I initially expected genes/trees with greater tree length to be preferred, after considering clock-likeness and/or bipartition concordance.

However, in `get_good_genes.py`, tree length appears to be included directly as:

```python
float(spls[2])
```

whereas bipartition is multiplied by `-1` so that higher bipartition concordance is sorted first. This seems to mean that, whenever tree length is used as a sorting criterion, smaller tree length values are preferred over larger ones.
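A small sketch of the effect I am describing, using hypothetical tree lengths rather than values from the script or from my dataset:

```python
# Hypothetical (name, treelength) pairs
trees = [("geneA", 0.5), ("geneB", 3.2), ("geneC", 1.7)]

# Raw value as sort key: shortest tree comes first
ascending = [name for name, tl in sorted(trees, key=lambda t: t[1])]

# Negated value as sort key (the trick applied to bipartition): longest tree comes first
descending = [name for name, tl in sorted(trees, key=lambda t: -t[1])]

print(ascending)   # ['geneA', 'geneC', 'geneB']
print(descending)  # ['geneB', 'geneC', 'geneA']
```

If my reading is correct, the current handling of tree length corresponds to the first case, whereas the wording in the paper led me to expect the second.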

I may be missing some intended rationale here. Is the ascending sorting of tree length intentional? For example, is the goal to avoid very long trees because they may reflect excessive rate heterogeneity, long-branch problems, or other issues? Or should tree length perhaps be sorted in descending order, consistent with the wording in the paper about **“greater tree length”**?

## Possible clarification

To summarize, my questions are:

1. Is `get_good_genes.py` intended to perform a strictly lexicographic sort, where secondary criteria only matter in the case of exact ties in previous criteria?
2. Is tree length intentionally sorted in ascending order?
3. If both behaviors are intentional, could the README perhaps be expanded to clarify the rationale?

Thank you again for developing and sharing this tool. I am asking because these details make a substantial difference in how I interpret the ranked gene/tree list from SortaDate, especially when trying to prioritize genes with low root-to-tip variance but still enough phylogenetic information for dating analyses.
