including full annotations at nodes and exporting to ncbi, json, newick formats #73

Open
rbaldwin-bugseq wants to merge 8 commits into onecodex:master from rbaldwin-bugseq:add_data_to_taxonomy_nodes

rbaldwin-bugseq commented Feb 20, 2026

This PR should address the comments made on the previous (now closed) PR.

When a taxonomy is imported in NCBI format (names.dmp, nodes.dmp), all annotations, including alternate names, are now stored on the nodes, with convenience methods for accessing them.

Exporting in ncbi, json, and newick formats should be straightforward. A new JSON export function, to_json_tree_streaming, was added because the existing to_json_tree ran out of memory on a 15G instance with the new, larger nodes; to_json_tree should probably be replaced by the streaming version. to_json_node_link does not appear to have a memory problem. The newick format uses the taxonomy ID for node names (e.g., for E. coli the node name is 562), which was the existing behavior.
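To illustrate why a streaming exporter can avoid the memory spike, here is a minimal pure-Python sketch of the idea. It is not the PR's Rust implementation: the function name `write_json_tree_streaming`, the `children` adjacency dict, and the `id`/`children` output layout are all assumptions made for this example. The point is only that each node is written as soon as it is visited, so the nested output is never held in memory as a whole.

```python
import io
import json


def write_json_tree_streaming(root, children, out):
    """Write a nested JSON tree to `out` one node at a time.

    `children` maps a node ID to a list of child IDs (hypothetical layout).
    Only one iterator per level of the current root-to-node path is kept,
    so the extra memory grows with tree depth, not with the node count.
    """
    out.write('{"id": %s, "children": [' % json.dumps(root))
    stack = [iter(children.get(root, []))]  # one child iterator per open node
    first = [True]                          # whether the next child is the first
    while stack:
        child = next(stack[-1], None)
        if child is None:
            # No children left at this level: close the node and go back up.
            out.write("]}")
            stack.pop()
            first.pop()
            continue
        if not first[-1]:
            out.write(", ")
        first[-1] = False
        out.write('{"id": %s, "children": [' % json.dumps(child))
        stack.append(iter(children.get(child, [])))
        first.append(True)


# Toy tree: real NCBI tax IDs, but intermediate ranks are collapsed for brevity.
children = {"1": ["2", "2157"], "2": ["561"], "561": ["562"]}
buf = io.StringIO()
write_json_tree_streaming("1", children, buf)
tree = json.loads(buf.getvalue())  # the streamed text is valid JSON
```

In this sketch the adjacency map itself is still in memory; the saving is in never building the fully nested structure (or its complete string) before writing. Passing an open file handle as `out` instead of `io.StringIO` would send the output straight to disk.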

More tests were also added.
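For reference, the Newick naming convention mentioned above (taxonomy IDs as node labels) looks like this in a small pure-Python sketch. The helper `to_newick` and the `children` adjacency dict are hypothetical and exist only for this illustration; the library's own exporter is implemented in Rust and may differ in details such as branch lengths.

```python
def to_newick(root, children):
    """Serialize a tree to Newick, labelling every node with its taxonomy ID."""

    def render(node):
        kids = children.get(node, [])
        if not kids:
            return str(node)  # a leaf is just its label
        # An internal node is "(child1,child2,...)label"
        return "(" + ",".join(render(k) for k in kids) + ")" + str(node)

    return render(root) + ";"


# Toy tree: real NCBI tax IDs, but intermediate ranks are collapsed for brevity.
children = {"1": ["2", "2157"], "2": ["561"], "561": ["562"]}
newick = to_newick("1", children)
print(newick)  # (((562)561)2,2157)1;
```

Here E. coli appears as the leaf `562`, nested inside its parent `561`.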

Overview of Changes Made

  1. TaxonomyNode Python Class Enhancements
    Added new methods to the TaxonomyNode class in python.rs:

    get_data(): Returns all extra data fields as a Python dictionary
    get_data_keys(): Returns a list of all available data field keys
    get(key, default=None): Gets a data field with optional default value

    Property accessors for common NCBI fields:
        genetic_code_id
        embl_code
        division_id
        mitochondrial_genetic_code_id

  2. Taxonomy Class Enhancements
    Added new methods to the Taxonomy class:

    data(tax_id): Returns all extra data fields for a given taxid as a dictionary
    get_field(tax_id, field, default=None): Gets a specific field for a taxid with optional default

  3. Updated Documentation
    Updated the from_ncbi docstring to document that all NCBI fields are now loaded.
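
The semantics of the methods listed in this overview can be tried without an NCBI dump using the stand-in classes below. This is only a sketch: `NodeSketch` and `TaxonomySketch` are hypothetical pure-Python classes, not the Rust-backed `TaxonomyNode` and `Taxonomy`, and details such as returning a copy from `get_data()` or `None` from a property for a missing field are assumptions. The field values are illustrative.

```python
class NodeSketch:
    """Hypothetical stand-in for TaxonomyNode (not the PR's implementation)."""

    def __init__(self, tax_id, data):
        self.id = tax_id
        self._data = dict(data)

    def get_data(self):
        # Assumption: hand back a copy so callers cannot mutate the node.
        return dict(self._data)

    def get_data_keys(self):
        return list(self._data)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def __getitem__(self, key):
        # Bracket access raises KeyError when the field is missing.
        return self._data[key]

    @property
    def genetic_code_id(self):
        # Assumption: a missing field yields None rather than an error.
        return self._data.get("genetic_code_id")


class TaxonomySketch:
    """Hypothetical stand-in for Taxonomy, backed by a dict of nodes."""

    def __init__(self, nodes):
        self._nodes = {node.id: node for node in nodes}

    def __getitem__(self, tax_id):
        return self._nodes[tax_id]

    def data(self, tax_id):
        return self._nodes[tax_id].get_data()

    def get_field(self, tax_id, field, default=None):
        return self._nodes[tax_id].get(field, default)


# Illustrative values only.
tax = TaxonomySketch([
    NodeSketch("562", {"genetic_code_id": "11", "embl_code": "EC",
                       "name_scientific_name": "Escherichia coli"}),
])
node = tax["562"]
print(node.genetic_code_id)                               # 11
print(sorted(node.get_data_keys()))
print(tax.get_field("562", "name_common_name", "N/A"))    # N/A
```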

How to Access Data from Python
Method 1: Via TaxonomyNode object

from taxonomy import Taxonomy

tax = Taxonomy.from_ncbi("/path/to/ncbi")
node = tax["562"]  # E. coli

# Access via properties
print(node.genetic_code_id)
print(node.embl_code)

# Access all data
data = node.get_data()
print(data.keys())

# Access a specific field with a default
value = node.get("name_common_name", "N/A")

Method 2: Via dict-like interface

# Already existed, now includes all new fields
node = tax["562"]
print(node["genetic_code_id"])
print(node["name_common_name"])

Method 3: Via Taxonomy object directly

# Get all data for a taxid
data = tax.data("562")

# Get a specific field
genetic_code = tax.get_field("562", "genetic_code_id")

Available Data Fields

    From nodes.dmp:
        embl_code
        division_id
        inherited_div_flag
        genetic_code_id
        inherited_GC_flag
        mitochondrial_genetic_code_id
        inherited_MGC_flag
        GenBank_hidden_flag
        hidden_subtree_root_flag
        comments

    From names.dmp:
        name_scientific_name
        name_common_name
        name_synonym
        name_authority
        name_blast_name
        name_genbank_common_name
        unique_name_* (for any name type that has a unique name)

Other name_* keys may appear, depending on the name classes present in the names.dmp file.
All of these attributes are accessible from Python through the interfaces above.
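
To show where these field names come from, here is a minimal sketch of how one row of each dump file could be mapped onto the keys listed above. The column order follows the standard NCBI taxdump layout (fields separated by tab, pipe, tab). The helper names and the rule for building `name_*` keys (prefix `name_` plus the name class with spaces replaced by underscores) are assumptions inferred from the key names in this PR, not taken from its Rust code. Handling of repeated name classes and of `unique_name_*` keys is left out.

```python
# nodes.dmp columns after tax_id, parent tax_id, and rank
NODE_FIELDS = [
    "embl_code", "division_id", "inherited_div_flag", "genetic_code_id",
    "inherited_GC_flag", "mitochondrial_genetic_code_id", "inherited_MGC_flag",
    "GenBank_hidden_flag", "hidden_subtree_root_flag", "comments",
]


def split_dmp_line(line):
    """Split one taxdump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return [field.strip() for field in line.split("\t|\t")]


def parse_nodes_line(line):
    """Return (tax_id, parent_id, rank, extra_fields) for a nodes.dmp row."""
    cols = split_dmp_line(line)
    return cols[0], cols[1], cols[2], dict(zip(NODE_FIELDS, cols[3:]))


def parse_names_line(line):
    """Return (tax_id, key, name) for a names.dmp row."""
    tax_id, name_txt, _unique_name, name_class = split_dmp_line(line)[:4]
    key = "name_" + name_class.replace(" ", "_")  # assumed key-naming rule
    return tax_id, key, name_txt


# Illustrative rows in taxdump layout.
node_row = "\t|\t".join(
    ["562", "561", "species", "EC", "0", "1", "11", "1", "0", "1", "1", "0", ""]
) + "\t|\n"
name_row = "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n"

tax_id, parent_id, rank, fields = parse_nodes_line(node_row)
name_entry = parse_names_line(name_row)
```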

The best way to handle missing data fields is to use the .get() method instead of dict-like bracket notation, which raises KeyError for a missing key. This is already implemented in the Python bindings. The recommended approaches are:

  1. Use the .get() method with a default value (Recommended)

# Safe: returns None, or your default, if the key doesn't exist
common_name = node.get("name_common_name", "N/A")
genetic_code = node.get("genetic_code_id", "1")

# Or check for None
common_name = node.get("name_common_name")
if common_name is not None:
    print(f"Common name: {common_name}")

  2. Use try-except for bracket notation

try:
    common_name = node["name_common_name"]
except KeyError:
    common_name = "N/A"

  3. Check data keys first

data = node.get_data()
if "name_common_name" in data:
    common_name = data["name_common_name"]
else:
    common_name = "N/A"

  4. Use the Taxonomy object's get_field method

# Also supports defaults
common_name = tax.get_field("562", "name_common_name", "N/A")

Best Practice Summary

For optional fields (like alternative names):

✓ node.get("name_common_name", "N/A")  # Good
✗ node["name_common_name"]              # Bad - will error if missing

For required fields (like genetic_code_id):

✓ node["genetic_code_id"]               # OK - should always exist
✓ node.genetic_code_id                  # OK - convenience property
✓ node.get("genetic_code_id", "1")      # Also OK - extra safe with default
The .get() method follows Python's dict interface convention and is the idiomatic way to handle potentially missing keys.
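
As one way to apply this in practice, the sketch below picks the first available name from a list of optional fields using only `.get()`. The helper `display_name` and its fallback order are hypothetical. It relies on nothing beyond a dict-style `.get(key)`, so a plain dict stands in for a node here.

```python
def display_name(node, fallbacks=("name_genbank_common_name",
                                  "name_common_name",
                                  "name_scientific_name")):
    """Return the first non-empty name field, or "N/A" if none is present."""
    for key in fallbacks:
        value = node.get(key)  # never raises for a missing key
        if value:
            return value
    return "N/A"


# Plain dicts standing in for nodes; values are illustrative.
with_common = {"name_scientific_name": "Homo sapiens",
               "name_genbank_common_name": "human"}
scientific_only = {"name_scientific_name": "Escherichia coli"}
no_names = {"genetic_code_id": "11"}

print(display_name(with_common))      # human
print(display_name(scientific_only))  # Escherichia coli
print(display_name(no_names))         # N/A
```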
