including full annotations at nodes and exporting to ncbi, json, newick formats by rbaldwin-bugseq · Pull Request #73 · onecodex/taxonomy

rbaldwin-bugseq · 2026-02-20T23:42:37Z

This PR should address the comments made on the previous (closed) PR.

When a taxonomy is imported in NCBI format (names.dmp, nodes.dmp), all annotations, including alternate names, are now stored on the nodes, with convenience methods for accessing them.

For output, exporting in ncbi, json, and newick formats should be straightforward. A new JSON export function, to_json_tree_streaming, was added as a replacement for the existing to_json_tree function: with the new, larger nodes, to_json_tree ran out of memory on an instance with 15 GB and should probably be replaced by the streaming version. to_json_node_link does not appear to have a memory problem. The newick format uses the taxonomy ID for node names (e.g., for E. coli the node name is 562), which was the existing behavior.
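To illustrate why a streaming writer keeps memory use low, here is a minimal Python sketch. It is not the Rust code in this PR: the function name `write_json_tree_streaming`, the `children` mapping (tax ID to child tax IDs), and the `data` mapping (tax ID to annotations) are hypothetical stand-ins. The idea is to write each node as soon as it is visited, so the only extra state is a stack as deep as the tree, rather than one nested object holding every node.

```python
import io
import json


def write_json_tree_streaming(root, children, data, out):
    """Write a nested {"id": ..., "children": [...]} tree to `out` piece by piece.

    `children` and `data` are hypothetical stand-ins for the taxonomy:
    tax ID -> list of child tax IDs, and tax ID -> annotation dict.
    """
    out.write(_open_node(root, data))
    # Each stack entry is (node_id, iterator over that node's children).
    stack = [(root, iter(children.get(root, [])))]
    # Parallel stack: True until the first child of that node has been written.
    first_child = [True]
    while stack:
        _, child_iter = stack[-1]
        child = next(child_iter, None)
        if child is None:
            # No children left: close this node's array and object.
            out.write("]}")
            stack.pop()
            first_child.pop()
            continue
        if not first_child[-1]:
            out.write(",")
        first_child[-1] = False
        out.write(_open_node(child, data))
        stack.append((child, iter(children.get(child, []))))
        first_child.append(True)


def _open_node(node_id, data):
    # Serialize the node's own fields, then leave the "children" array open.
    fields = {"id": node_id, **data.get(node_id, {})}
    return json.dumps(fields)[:-1] + ', "children": ['


# Illustrative three-level tree; the annotation value is example data.
buf = io.StringIO()
write_json_tree_streaming(
    "1",
    {"1": ["2"], "2": ["562"]},
    {"562": {"name_scientific_name": "Escherichia coli"}},
    buf,
)
print(buf.getvalue())
```

Writing to a real file handle instead of `io.StringIO` means the full document never has to exist in memory at once.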

More tests were also added.

Overview of Changes Made

TaxonomyNode Python Class Enhancements
Added new methods to the TaxonomyNode class in python.rs:

    get_data(): Returns all extra data fields as a Python dictionary
    get_data_keys(): Returns a list of all available data field keys
    get(key, default=None): Gets a data field with optional default value

    Property accessors for common NCBI fields:
        genetic_code_id
        embl_code
        division_id
        mitochondrial_genetic_code_id
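The accessors listed above can be pictured with a pure-Python stand-in. This is only a sketch of the described behavior, not the Rust binding itself: the class name `NodeSketch` is hypothetical, the field values are illustrative, and returning None from a property when the field is absent is an assumption.

```python
class NodeSketch:
    """Hypothetical pure-Python stand-in for the TaxonomyNode accessors."""

    def __init__(self, tax_id, data):
        self.id = tax_id
        self._data = dict(data)

    def get_data(self):
        # Return a copy so callers cannot mutate the node's fields.
        return dict(self._data)

    def get_data_keys(self):
        return list(self._data)

    def get(self, key, default=None):
        # Mirrors dict.get: never raises for a missing field.
        return self._data.get(key, default)

    def __getitem__(self, key):
        # Bracket access raises KeyError for a missing field.
        return self._data[key]

    @property
    def genetic_code_id(self):
        # Assumption: an absent field yields None rather than an error.
        return self._data.get("genetic_code_id")

    @property
    def embl_code(self):
        return self._data.get("embl_code")


# Illustrative values only.
node = NodeSketch("562", {"genetic_code_id": "11", "embl_code": "EC"})
print(node.genetic_code_id, node.get("name_common_name", "N/A"))
```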

Taxonomy Class Enhancements
Added new methods to the Taxonomy class:

    data(tax_id): Returns all extra data fields for a given tax ID as a dictionary
    get_field(tax_id, field, default=None): Gets a specific field for a tax ID with an optional default
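One use for get_field is pulling a single annotation across many tax IDs without per-ID error handling. The helper below is a hypothetical sketch: `collect_field` is not part of the library, and `StubTaxonomy` is a stand-in with illustrative records that only mimics the two methods described above.

```python
def collect_field(tax, tax_ids, field, default=None):
    """Map each tax ID to one annotation field, using `default` when absent."""
    return {tax_id: tax.get_field(tax_id, field, default) for tax_id in tax_ids}


class StubTaxonomy:
    """Hypothetical stand-in exposing only data() and get_field()."""

    def __init__(self, records):
        self._records = records

    def data(self, tax_id):
        return dict(self._records[tax_id])

    def get_field(self, tax_id, field, default=None):
        return self._records[tax_id].get(field, default)


# Illustrative records, not real output from the library.
tax = StubTaxonomy({
    "562": {"genetic_code_id": "11"},
    "9606": {"genetic_code_id": "1", "name_genbank_common_name": "human"},
})
print(collect_field(tax, ["562", "9606"], "name_genbank_common_name", "N/A"))
```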

Updated Documentation
Updated the from_ncbi docstring to document that all NCBI fields are now loaded.

How to Access Data from Python
Method 1: Via TaxonomyNode object

from taxonomy import Taxonomy

tax = Taxonomy.from_ncbi("/path/to/ncbi")
node = tax["562"] # E. coli

# Access via properties
print(node.genetic_code_id)
print(node.embl_code)

# Access all data
data = node.get_data()
print(data.keys())

# Access a specific field with a default
value = node.get("name_common_name", "N/A")

Method 2: Via dict-like interface

# Already existed; now includes all the new fields
node = tax["562"]
print(node["genetic_code_id"])
print(node["name_common_name"])

Method 3: Via Taxonomy object directly

# Get all data for a tax ID
data = tax.data("562")

# Get a specific field
genetic_code = tax.get_field("562", "genetic_code_id")

Available Data Fields

    From nodes.dmp:
        embl_code
        division_id
        inherited_div_flag
        genetic_code_id
        inherited_GC_flag
        mitochondrial_genetic_code_id
        inherited_MGC_flag
        GenBank_hidden_flag
        hidden_subtree_root_flag
        comments

    From names.dmp:
        name_scientific_name
        name_common_name
        name_synonym
        name_authority
        name_blast_name
        name_genbank_common_name
        unique_name_* (for any name type that has a unique name)

Other fields may also be present, depending on the name classes that appear in the names.dmp file.

All of these attributes are now accessible from Python through the interfaces described above.
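For readers unfamiliar with the dump files, the sketch below shows how one line of nodes.dmp or names.dmp could map onto the field names listed above. It relies on the NCBI taxdump layout (fields separated by tab, pipe, tab). The helper names are hypothetical, and the `name_` plus name-class key scheme is inferred from the field list in this PR rather than taken from the Rust code.

```python
def split_dmp_line(line):
    """Split one NCBI .dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


# nodes.dmp columns that follow tax_id, parent tax_id, and rank.
NODE_FIELDS = [
    "embl_code", "division_id", "inherited_div_flag", "genetic_code_id",
    "inherited_GC_flag", "mitochondrial_genetic_code_id", "inherited_MGC_flag",
    "GenBank_hidden_flag", "hidden_subtree_root_flag", "comments",
]


def node_annotations(nodes_line):
    """Return (tax_id, annotations) for one nodes.dmp line."""
    cols = split_dmp_line(nodes_line)
    return cols[0], dict(zip(NODE_FIELDS, cols[3:]))


def name_annotation(names_line):
    """Return (tax_id, key, value) for one names.dmp line."""
    # names.dmp columns: tax_id, name_txt, unique name, name class.
    tax_id, name_txt, _unique_name, name_class = split_dmp_line(names_line)
    # Assumption: keys are "name_" plus the name class with underscores.
    return tax_id, "name_" + name_class.replace(" ", "_"), name_txt


# Illustrative line in the names.dmp layout.
names_line = "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n"
print(name_annotation(names_line))
```

A tax ID can have several names of the same class (synonyms, for example); how the PR stores those is not covered by this sketch.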

The safest way to handle missing data fields is the .get() method rather than dict-style bracket notation, which raises KeyError when the field is absent.
This is already implemented in the Python bindings. The recommended approaches are:

Use .get() method with default value (Recommended)

# Safe: returns None or your default if the key doesn't exist
common_name = node.get("name_common_name", "N/A")
genetic_code = node.get("genetic_code_id", "1")

# Or check for None
common_name = node.get("name_common_name")
if common_name is not None:
    print(f"Common name: {common_name}")

Use try-except for bracket notation

try:
    common_name = node["name_common_name"]
except KeyError:
    common_name = "N/A"

Check data keys first

data = node.get_data()
if "name_common_name" in data:
    common_name = data["name_common_name"]
else:
    common_name = "N/A"

Use the Taxonomy object's get_field method

# Also supports defaults
common_name = tax.get_field("562", "name_common_name", "N/A")

Best Practice Summary
For optional fields (like alternative names):

✓ node.get("name_common_name", "N/A") # Good
✗ node["name_common_name"] # Bad - will error if missing

For required fields (like genetic_code_id):

✓ node["genetic_code_id"]               # OK - should always exist
✓ node.genetic_code_id                  # OK - convenience property
✓ node.get("genetic_code_id", "1")      # Also OK - extra safe with default
The .get() method follows Python's dict interface convention and is the idiomatic way to handle potentially missing keys.
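As a small application of this advice, the hypothetical helper below picks a display name by trying several optional fields in order. The name `display_name` and the preference order are assumptions made for illustration. It relies only on a `.get(key)` method, so it works with a plain dict as well as with a node object that exposes `.get()`.

```python
def display_name(node, fallback="N/A"):
    """Return the first available name field, preferring common names."""
    preferred = (
        "name_common_name",
        "name_genbank_common_name",
        "name_scientific_name",
    )
    for key in preferred:
        value = node.get(key)  # None when the field is missing
        if value is not None:
            return value
    return fallback


# A plain dict stands in for a node here; the values are illustrative.
print(display_name({"name_scientific_name": "Escherichia coli"}))
print(display_name({}))
```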

rbaldwin-bugseq added 8 commits February 20, 2026 18:25

    0d1746e  a
    e1fe38f  added helper
    00963e6  a
    bd9f14e  added json support
    374dcf1  comments
    b143758  resolved newick format
    03b308e  a
    4bd9740  fixed json teest

rbaldwin-bugseq wants to merge 8 commits into onecodex:master from rbaldwin-bugseq:add_data_to_taxonomy_nodes
