Replacement issue for #120, which is ultimately not the right approach.
Preliminary stages
Both stages stop worrying about only working with OTTs on the tree, and just work for all OTTs:
- ID lookup resolution stage (build table of OTT -> ID mappings, no trees involved):
  - For each (filtered Wikipedia dump) entry, get the OTT ID and find the associated OT ID tables (GBIF, NCBI, ...)
  - If there's no overlap between the WD IDs and OT IDs, then bin the row. We should have at least one corresponding ID.
  - Merge the sets of IDs. If WD has multiple IDs, take the intersection with OT. If OT doesn't have a corresponding entry, choose the first (or whatever, the point is we don't have multiple IDs beyond here)
  - Output a table of all IDs
- Raw popularity resolution stage (OTT -> popularity):
  - For each OTT in the ID table from the lookup stage, dig out the relevant page from the Wikipedia dumps, do stuff, assign popularity
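
The merge rules in the ID lookup stage could be sketched roughly as below. This is only an illustration: `resolve_ids` and the shape of its inputs (a dict of candidate sets from WD, a dict of single IDs from OT) are hypothetical, not existing code. It also assumes that "choose first" can mean "smallest ID", that IDs are integers, and that IDs known only to OT are carried through.

```python
def resolve_ids(wd_ids, ot_ids):
    """Resolve one OTT row to a single ID per source, or None to bin the row.

    wd_ids: {source: set of candidate IDs from WD}, e.g. {"ncbi": {9606, 1}}
    ot_ids: {source: single ID from the OT taxonomy}, e.g. {"ncbi": 9606}
    (Both shapes are assumptions made for this sketch.)
    """
    resolved = {}
    has_overlap = False
    for source in sorted(set(wd_ids) | set(ot_ids)):
        candidates = wd_ids.get(source) or set()
        ot_id = ot_ids.get(source)
        if ot_id is not None and ot_id in candidates:
            # WD and OT agree: the intersection is the OT ID.
            resolved[source] = ot_id
            has_overlap = True
        elif ot_id is None and candidates:
            # OT has no entry for this source: pick one deterministically.
            # (Assumption: "first" is taken to mean the smallest ID.)
            resolved[source] = min(candidates)
        elif ot_id is not None and not candidates:
            # Only OT knows this source. (Assumption: keep its ID.)
            resolved[source] = ot_id
        # Otherwise WD and OT disagree for this source, so it is left out.
    # Bin the row unless at least one source had a corresponding ID.
    return resolved if has_overlap else None
```

Using `min` rather than an arbitrary set element keeps the output table reproducible between runs, which matters if the stage is cached in DVC.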
New tree pipeline
- Download the trees that Jonathan needs, using a pipeline similar to the one for the current tree.
tree_loading.load_metadata() will need to be fed the same taxonomy file as we're using later in the pipeline. Consider having the OT & chronosynth as one download step in DVC, so we can just re-run both and hope they use the same tree.
- Start with Jonathan's tree, run his pipeline bar the interpolation stage. Hope that polytomy nodes are already labelled as one of two different types: either virtual-nodes-to-be-ignored-in-polytomy-view polytomies (the current ones) or independently-reattached-nodes-that-remain-visible-but-have-a-grey-branch nodes (the new ones where nodes have been re-arranged). Stripping birds/turtles isn't a problem, because they get replaced anyway.
- Graft bespoke trees together
- Polytomy resolution using existing random resolution. Mark nodes as (virtual-nodes-to-be-split-in-polytomy-view) polytomies
- Resolve branch lengths to ages bottom-up, then remove (or stop caring about) branch lengths
- Top-down conflict resolution in bespoke tree, delete entries that conflict with higher ages
- (No chronosynth internal date-adding to bespoke tree. We either have NHX:age=x or branch length.)
- Graft OT subtrees onto our trees. These are already polytomy-resolved and have the date pins from chronosynth applied.
- Popularity calculations for the entire tree (including any remaining subspecies from the bespoke tree; Jonathan's will have them already removed). Stop caring about polytomy vs. popularity calculations, and just apply them post-resolution. Apply popularity based on the OTT -> popularity map, and percolate using the existing rules (which preserve popularity from removed subspecies)
- Remove subspecies
- Top-down conflict resolution. If there's conflict with higher ages, remove ages until conflict goes away
- Remove unary nodes (they are likely uninteresting, and make a mess of the tree rendering)
- Re-interpolate missing dates. Store in the DB only dates that are trusted, not interpolated. Interpolated dates only get used for sliding_window, etc. (NB: biota is hard-coded as interpolated, even though it isn't).
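
As a rough illustration of the top-down conflict resolution step, the sketch below walks from the root towards the tips and clears any age that is older than its nearest dated ancestor. The `Node` class and `drop_conflicting_ages` are hypothetical stand-ins, not the pipeline's real tree types. It assumes that an age equal to the ancestor's is not a conflict, and that the ancestor's age is the one to trust.

```python
class Node:
    """Minimal stand-in for a tree node (hypothetical, not the pipeline's class)."""

    def __init__(self, name, age=None, children=()):
        self.name = name
        self.age = age  # None means "no trusted date"
        self.children = list(children)


def drop_conflicting_ages(root):
    """Clear ages older than the nearest dated ancestor; return affected names."""
    removed = []
    # Explicit stack rather than recursion, so deep trees cannot hit
    # Python's recursion limit.
    stack = [(root, None)]
    while stack:
        node, ceiling = stack.pop()
        if node.age is not None and ceiling is not None and node.age > ceiling:
            # A descendant cannot be older than its ancestor: drop its age.
            removed.append(node.name)
            node.age = None
        if node.age is not None:
            # A surviving age becomes the new ceiling for everything below.
            ceiling = node.age
        for child in node.children:
            stack.append((child, ceiling))
    return removed
```

Nodes whose age was cleared inherit the ceiling from above, so a single pass is enough: their descendants are still checked against the nearest ancestor that kept its date. The cleared nodes are then candidates for the re-interpolation step.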