Rework tree parts of pipeline, using dated tree as source #131

@lentinj

Replacement issue for #120, which is ultimately not the right approach.

Preliminary stages

Both stages stop worrying about only working with OTTs on the tree, and just work for all OTTs:

  1. ID lookup resolution stage (build a table of OTT -> ID mappings, no trees involved):
  • For each (filtered wikipedia dump) entry, get the OTT ID and find the associated OT ID tables (GBIF, NCBI, ...)
  • If there's no overlap between the WD IDs and OT IDs, then bin the row. We should have at least one corresponding ID.
  • Merge sets of IDs. If WD has multiple IDs, take the intersection with OT. If OT doesn't have a corresponding entry, choose the first (or whatever, the point is we don't have multiple IDs beyond here)
  • Output table of all IDs
  2. Raw popularity resolution stage (OTT -> popularity):
  • For each OTT in (1), dig out the relevant page from the wikipedia dumps, do stuff, assign popularity
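
A minimal sketch of the per-row ID merge in stage 1, under stated assumptions: the function name `resolve_ids`, the dict-of-sources input shape, and the choice to leave out a source where WD and OT both have IDs but disagree are all hypothetical, not taken from the existing pipeline.

```python
def resolve_ids(wd_ids, ot_ids):
    """Merge WD and OT identifiers for one OTT entry.

    wd_ids: dict of source name -> list of IDs, in dump order (hypothetical shape)
    ot_ids: dict of source name -> set of IDs (hypothetical shape)
    Returns a dict of source name -> single ID, or None if the row is binned.
    """
    # Bin the row unless at least one source has an ID present in both WD and OT.
    has_overlap = any(
        set(wd_ids.get(src, [])) & ot_ids.get(src, set()) for src in wd_ids
    )
    if not has_overlap:
        return None

    resolved = {}
    for src in sorted(set(wd_ids) | set(ot_ids)):
        wd = wd_ids.get(src, [])
        ot = ot_ids.get(src, set())
        if wd and ot:
            # WD may list several IDs: keep only those OT agrees with.
            common = [i for i in wd if i in ot]
            if common:
                resolved[src] = common[0]
            # Assumption: if WD and OT disagree for this source, leave it out.
        elif wd:
            # OT has no corresponding entry: choose the first WD ID.
            resolved[src] = wd[0]
        elif ot:
            # Only OT knows this source: pick one deterministically.
            resolved[src] = min(ot)
    return resolved


row = resolve_ids({"ncbi": [9606, 1234], "gbif": [2436436]}, {"ncbi": {9606}})
```

Whatever the tie-breaking rule ends up being, the property that matters downstream is that every source maps to at most one ID.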

New tree pipeline

  1. Download trees that Jonathan needs using a similar pipeline to the current tree. tree_loading.load_metadata() will need to be fed the same taxonomy file as we're using later in the pipeline. Consider having the OT & chronosynth as one download step in DVC, so we can just re-run both and hope they use the same tree.
  2. Start with Jonathan's tree, run his pipeline bar the interpolation stage. Hope that polytomy nodes are already labelled as one of two types: virtual-nodes-to-be-ignored-in-polytomy-view polytomies (the current ones) or independently-reattached-nodes-that-remain-visible-but-have-a-grey-branch nodes (the new ones where nodes have been re-arranged). Stripping birds/turtles isn't a problem, because they get replaced anyway.
  3. Graft bespoke trees together
  4. Polytomy resolution using existing random resolution. Mark nodes as (virtual-nodes-to-be-split-in-polytomy-view) polytomies
  5. Resolve branch lengths to ages bottom-up. Remove (or not care about) branch lengths
  6. Top-down conflict resolution in bespoke tree, delete entries that conflict with higher ages
  7. (No chronosynth internal date-adding to bespoke tree. We either have NHX:age=x or branch length.)
  8. Graft OT subtrees onto our trees. These are already polytomy-resolved, with date pins from chronosynth applied.
  9. Popularity calculations for entire tree (including any remaining subspecies from bespoke tree, Jonathan's will have them already removed) (stop caring about polytomy vs. popularity calculations, and just apply them post-resolution). Apply popularity based on OTT -> popularity map, percolate using existing rules (which preserves popularity from removed subspecies)
  10. Remove subspecies
  11. Top-down conflict resolution. If there's conflict with higher ages, remove ages until conflict goes away
  12. Remove unary nodes (they are likely uninteresting, and make a mess of the tree rendering)
  13. Re-interpolate missing dates. Store in the DB only dates that are trusted, not interpolated. Interpolated dates only get used for sliding_window, etc. (NB: biota is hard-coded as interpolated, even though it isn't).
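
The age-handling steps (5, 11 and 12) could be sketched roughly as below. This is illustrative only: the `Node` class and all function names are hypothetical, leaves without a date are assumed to be extant (age 0), and "conflict" is taken to mean a node that is older than its nearest dated ancestor.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical in-memory tree node, not the pipeline's real structure."""
    name: str
    branch_length: Optional[float] = None  # length of the branch to the parent
    age: Optional[float] = None            # trusted date, e.g. from NHX:age=x
    children: List["Node"] = field(default_factory=list)


def ages_from_branch_lengths(node):
    """Step 5: resolve branch lengths to ages, bottom-up."""
    for child in node.children:
        ages_from_branch_lengths(child)
    if node.age is not None:
        return  # an explicit age wins over branch lengths
    if not node.children:
        node.age = 0.0  # assumption: undated leaves are extant
        return
    candidates = [
        c.age + c.branch_length
        for c in node.children
        if c.age is not None and c.branch_length is not None
    ]
    if candidates:
        node.age = max(candidates)


def drop_conflicting_ages(node, bound=None):
    """Step 11: top-down, remove any age older than its nearest dated ancestor."""
    if node.age is not None and bound is not None and node.age > bound:
        node.age = None
    if node.age is not None:
        bound = node.age
    for child in node.children:
        drop_conflicting_ages(child, bound)


def remove_unary(node):
    """Step 12: splice out nodes with exactly one child; returns the new subtree root."""
    node.children = [remove_unary(c) for c in node.children]
    if len(node.children) == 1:
        # Branch lengths are no longer cared about at this point (step 5).
        return node.children[0]
    return node


tree = Node("root", children=[
    Node("A", branch_length=2, children=[
        Node("A1", branch_length=3), Node("A2", branch_length=3)]),
    Node("B", branch_length=5, children=[Node("B1", branch_length=1)]),
])
ages_from_branch_lengths(tree)
drop_conflicting_ages(tree)
tree = remove_unary(tree)
```

Recursion is used here for brevity; a tree with very deep paths would need an iterative traversal to stay under Python's recursion limit.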
