Clarification on get_good_genes.py sorting behavior and tree length direction #9

@ericopolo

Hi,

First of all, thank you for making SortaDate available. I have been using it to explore a phylogenomic dataset for divergence-time analyses, and it has been very helpful.

I am opening this issue mostly as a request for clarification, because I may be misunderstanding either the intended behavior of the scripts or the relationship between the implementation and the description in Smith et al. (2018). While applying SortaDate to my own dataset and inspecting the source code, I noticed two points that I could not fully reconcile.

1. Effect of secondary sorting criteria

My understanding from the README is that the --order argument defines the priority of the sorting criteria. For example, --order 3,1,2 means sorting first by bipartition, then by root-to-tip variance, and then by tree length.

Looking at get_good_genes.py, it seems that the script builds tuples like:

(name, root_to_tip_var, treelength, -bipartition)

and then applies:

sorted(tuples, key=operator.itemgetter(first, second, third))

If I am reading this correctly, the sorting is a standard lexicographic sort. In that case, the second and third criteria only affect the ranking when there are exact ties in the previous criterion. For example, if the first criterion is root-to-tip variance, then tree length would only affect the order of genes/trees with exactly identical root-to-tip variance values.

In my dataset, root-to-tip variance values are essentially unique when enough decimal precision is retained. As a result, sorting first by root-to-tip variance appears to make tree length almost irrelevant to the final ranking, even when differences in tree length are large.
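To make the point concrete, here is a minimal sketch of the behavior as I understand it. The gene names and values are made up for illustration (they are not from my dataset), and the tuple layout simply follows my reading of get_good_genes.py, so it may not match the script exactly.

```python
import operator

# Hypothetical values for illustration only.
# Tuple layout as I read it: (name, root_to_tip_var, treelength, -bipartition)
rows = [
    ("geneC", 0.0012, 9.00, -0.90),
    ("geneA", 0.0010, 0.20, -0.90),
    ("geneB", 0.0011, 5.00, -0.90),
]

# Variance first, then tree length, then bipartition (my reading of --order 1,2,3)
by_var_then_len = sorted(rows, key=operator.itemgetter(1, 2, 3))

# Variance alone, ignoring the other two criteria
by_var_only = sorted(rows, key=operator.itemgetter(1))

names_full = [r[0] for r in by_var_then_len]
names_var = [r[0] for r in by_var_only]

print(names_full)  # ['geneA', 'geneB', 'geneC']
print(names_var)   # ['geneA', 'geneB', 'geneC']
```

Because no two variance values are exactly equal, the two rankings are identical, even though tree length ranges from 0.20 to 9.00.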

Is this the intended behavior of get_good_genes.py? Or was the sorting procedure intended to implement a more joint/multicriteria ranking, where later criteria can still influence the ranking among genes with similar, but not exactly identical, values for the first criterion?
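For clarity, this is the kind of thing I mean by a "joint" ranking. It is only a sketch of one possible alternative (a simple rank sum), not something I am claiming the paper or the script describes. The values are hypothetical, the `ranks` helper is my own, and the choice to treat longer trees as better is an assumption tied to my second question below.

```python
# Hypothetical values: name -> (root_to_tip_var, treelength)
genes = {
    "geneA": (0.0010, 0.20),
    "geneB": (0.0011, 5.00),
    "geneC": (0.0030, 0.30),
}

def ranks(values, reverse=False):
    """Map each name to its 0-based rank (0 = best) for one criterion."""
    order = sorted(values, key=values.get, reverse=reverse)
    return {name: i for i, name in enumerate(order)}

# Lower variance is better; longer tree is ASSUMED better here.
var_rank = ranks({g: v[0] for g, v in genes.items()})
len_rank = ranks({g: v[1] for g, v in genes.items()}, reverse=True)

# Lexicographic order: variance decides everything unless there is an exact tie.
lexicographic = sorted(genes, key=lambda g: genes[g])

# Rank-sum order: both criteria contribute; variance rank breaks ties.
joint = sorted(genes, key=lambda g: (var_rank[g] + len_rank[g], var_rank[g]))

print(lexicographic)  # ['geneA', 'geneB', 'geneC']
print(joint)          # ['geneB', 'geneA', 'geneC']
```

In this toy case geneB moves ahead of geneA under the rank sum because its variance is only marginally higher while its tree is much longer. Under the lexicographic sort that difference in tree length has no effect.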

If the current behavior is intended, perhaps it would be useful to clarify in the documentation that --order implements a lexicographic priority order, rather than a weighted or joint ranking across criteria.

2. Direction of tree length sorting

The second point concerns the direction in which tree length is sorted.

In the paper, total tree length is described as representing “discernible information content”, and the empirical filtering procedure is described as selecting genes with “discernible amounts of molecular evolution (i.e., greater tree length)”. Based on this, I initially expected genes/trees with greater tree length to be preferred, after considering clock-likeness and/or bipartition concordance.

However, in get_good_genes.py, tree length appears to be included directly as:

float(spls[2])

whereas bipartition is multiplied by -1 so that higher bipartition concordance is sorted first. This seems to mean that, whenever tree length is used as a sorting criterion, smaller tree length values are preferred over larger ones.
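A small sketch of the difference, again with made-up values and the tuple layout as I read it from the script. The negated variant is a hypothetical change for comparison, not a claim about what the script should do.

```python
import operator

# Hypothetical values: (name, root_to_tip_var, treelength, -bipartition)
rows = [
    ("long_tree", 0.001, 4.0, -0.8),
    ("short_tree", 0.001, 0.5, -0.8),
]

# Tree length stored as-is (my reading of the current script): ascending sort
ascending = sorted(rows, key=operator.itemgetter(2))

# Hypothetical alternative: negate tree length, as is done for bipartition
negated = [(name, var, -tl, bp) for name, var, tl, bp in rows]
descending = sorted(negated, key=operator.itemgetter(2))

first_asc = ascending[0][0]
first_desc = descending[0][0]

print(first_asc)   # short_tree
print(first_desc)  # long_tree
```

With the value stored as-is, the shorter tree is ranked first. With the sign flipped, the longer tree is ranked first, which is what I had expected from the wording in the paper.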

I may be missing some intended rationale here. Is the ascending sorting of tree length intentional? For example, is the goal to avoid very long trees because they may reflect excessive rate heterogeneity, long-branch problems, or other issues? Or should tree length perhaps be sorted in descending order, consistent with the wording in the paper about “greater tree length”?

Possible clarification

To summarize, my questions are:

  1. Is get_good_genes.py intended to perform a strictly lexicographic sort, where secondary criteria only matter in the case of exact ties in previous criteria?
  2. Is tree length intentionally sorted in ascending order?
  3. If both behaviors are intentional, could the README perhaps be expanded to clarify the rationale?

Thank you again for developing and sharing this tool. I am asking because these details make a substantial difference in how I interpret the ranked gene/tree list from SortaDate, especially when trying to prioritize genes with low root-to-tip variance but still enough phylogenetic information for dating analyses.
