Including full annotations at nodes and exporting to ncbi, json, newick formats #73
Open
rbaldwin-bugseq wants to merge 8 commits into onecodex:master
This PR should address the comments made in the previous (closed) PR.
When a taxonomy is imported in ncbi format (names.dmp, nodes.dmp), all annotations, including alternate names, are now stored on the nodes, with convenience methods for accessing these attributes.
For output, exporting in ncbi, json, and newick formats should be straightforward. A new json export function, to_json_tree_streaming, was added because the existing to_json_tree function ran out of memory on a 15 GB instance with the new nodes; to_json_tree should probably be replaced by the streaming version. to_json_node_link does not appear to have a memory problem. The newick format uses the taxonomy ID for node names (e.g., for E. coli the node name is 562), which was the existing behavior.
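The difference between building the whole nested tree and streaming it can be illustrated with a small Python sketch. This is not the Rust code in this PR: the function names, the `children` and `data` dictionaries, and the output shape (`id`, the node's fields, then a nested `children` array) are all assumptions made for illustration.

```python
import io
import json


def _open_node(node_id, data):
    """Serialize one node's own fields and leave its children array open."""
    fields = {"id": node_id, **data.get(node_id, {})}
    # Drop the closing brace so the children array can be appended.
    return json.dumps(fields)[:-1] + ', "children": ['


def stream_json_tree(root, children, data, out):
    """Write a nested JSON tree to `out` one node at a time.

    Only the current root-to-node path is kept on the stack, so the
    full nested document is never held in memory at once.
    """
    out.write(_open_node(root, data))
    # Each frame is [iterator over a node's children, "no child written yet"].
    stack = [[iter(children.get(root, ())), True]]
    while stack:
        frame = stack[-1]
        child = next(frame[0], None)
        if child is None:
            # Every child has been written: close the array and the object.
            out.write("]}")
            stack.pop()
            continue
        if not frame[1]:
            out.write(", ")
        frame[1] = False
        out.write(_open_node(child, data))
        stack.append([iter(children.get(child, ())), True])


# Toy tree for illustration only; not the real NCBI lineage.
toy_children = {"1": ["2", "2157"], "2": ["562"]}
toy_data = {"562": {"name": "Escherichia coli", "rank": "species"}}

buffer = io.StringIO()
stream_json_tree("1", toy_children, toy_data, buffer)
print(buffer.getvalue())
```

In practice `out` would be an open file handle rather than a `StringIO`, so each node is flushed to disk as it is visited.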
More tests were also added.
Overview of Changes Made
Added new methods to the TaxonomyNode class in python.rs:
get_data(): Returns all extra data fields as a Python dictionary
get_data_keys(): Returns a list of all available data field keys
get(key, default=None): Gets a data field with optional default value
Property accessors for common NCBI fields:
genetic_code_id
embl_code
division_id
mitochondrial_genetic_code_id
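To make the intended semantics of the accessors listed above concrete, here is a pure-Python stand-in. `NodeSketch` is a hypothetical class, not the binding generated from python.rs; it assumes field values are strings and that a property returns `None` when its field is absent, neither of which is stated in this PR.

```python
class NodeSketch:
    """Hypothetical stand-in mirroring the TaxonomyNode accessors."""

    def __init__(self, tax_id, data):
        self.id = tax_id
        self._data = dict(data)

    def get_data(self):
        # Return a copy so callers cannot mutate the node's fields.
        return dict(self._data)

    def get_data_keys(self):
        return list(self._data)

    def get(self, key, default=None):
        # Same contract as dict.get: never raises for a missing key.
        return self._data.get(key, default)

    def __getitem__(self, key):
        # Bracket access raises KeyError when the field is missing.
        return self._data[key]

    @property
    def genetic_code_id(self):
        # Assumption: None when the field is absent.
        return self._data.get("genetic_code_id")

    @property
    def embl_code(self):
        return self._data.get("embl_code")


# Illustrative values only.
node = NodeSketch("562", {"genetic_code_id": "11", "embl_code": "EC"})
print(node.genetic_code_id)
print(node.get("name_common_name", "N/A"))
print(node.get_data_keys())
```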
Taxonomy Class Enhancements
Added new methods to the Taxonomy class:
data(tax_id): Returns all extra data fields for a given taxid as a dictionary
get_field(tax_id, field, default=None): Gets a specific field for a taxid with optional default
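The two Taxonomy-level methods can be read as shortcuts that skip fetching the node first. The sketch below uses a hypothetical `TaxonomySketch` class backed by a plain dictionary; raising `KeyError` for an unknown taxid is an assumption, since this PR does not say how the real binding handles that case.

```python
class TaxonomySketch:
    """Hypothetical stand-in for the taxid-keyed lookups."""

    def __init__(self, records):
        # records maps taxid -> {field name: value}
        self._records = {tid: dict(fields) for tid, fields in records.items()}

    def data(self, tax_id):
        # Assumption: an unknown taxid raises KeyError.
        return dict(self._records[tax_id])

    def get_field(self, tax_id, field, default=None):
        return self._records[tax_id].get(field, default)


# Illustrative values only.
tax = TaxonomySketch({"562": {"genetic_code_id": "11", "embl_code": "EC"}})
print(tax.data("562"))
print(tax.get_field("562", "name_common_name", "N/A"))
```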
Updated Documentation
Updated the from_ncbi docstring to document that all NCBI fields are now loaded.
How to Access Data from Python
Method 1: Via TaxonomyNode object
from taxonomy import Taxonomy

tax = Taxonomy.from_ncbi("/path/to/ncbi")
node = tax["562"]  # E. coli

# Access via properties
print(node.genetic_code_id)
print(node.embl_code)

# Access all data
data = node.get_data()
print(data.keys())

# Access specific field with default
value = node.get("name_common_name", "N/A")
Method 2: Via dict-like interface
node = tax["562"]
print(node["genetic_code_id"])
print(node["name_common_name"])
Method 3: Via Taxonomy object directly
# Get all data for a taxid
data = tax.data("562")

# Get specific field
genetic_code = tax.get_field("562", "genetic_code_id")
Available Data Fields
From nodes.dmp:
embl_code
division_id
inherited_div_flag
genetic_code_id
inherited_GC_flag
mitochondrial_genetic_code_id
inherited_MGC_flag
GenBank_hidden_flag
hidden_subtree_root_flag
comments
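For orientation, the nodes.dmp fields above correspond to columns of a delimited row. The sketch below assumes the classic NCBI taxdump layout (fields separated by tab, pipe, tab; each row terminated by tab, pipe) and the thirteen-column order from the taxdump readme. The sample row is illustrative, and `parse_nodes_line` is a hypothetical helper, not the parser in this PR.

```python
NODE_FIELDS = [
    "tax_id", "parent_tax_id", "rank",
    "embl_code", "division_id", "inherited_div_flag",
    "genetic_code_id", "inherited_GC_flag",
    "mitochondrial_genetic_code_id", "inherited_MGC_flag",
    "GenBank_hidden_flag", "hidden_subtree_root_flag", "comments",
]


def parse_nodes_line(line):
    """Split one nodes.dmp row into a {field name: string value} dict."""
    row = line.rstrip("\n")
    # Remove the row terminator before splitting on the field separator.
    if row.endswith("\t|"):
        row = row[:-2]
    values = row.split("\t|\t")
    return dict(zip(NODE_FIELDS, values))


# Illustrative row; the empty last value is the comments column.
sample_values = ["562", "561", "species", "EC", "0", "1",
                 "11", "1", "0", "1", "1", "0", ""]
sample_line = "\t|\t".join(sample_values) + "\t|\n"

record = parse_nodes_line(sample_line)
print(record["embl_code"], record["genetic_code_id"])
```

All values stay as strings here; whether the bindings expose them as strings or integers is not specified in this description.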
From names.dmp:
name_scientific_name
name_common_name
name_synonym
name_authority
name_blast_name
name_genbank_common_name
unique_name_* (for any name type that has a unique name)
And many others depending on what's in the names.dmp file
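The names.dmp keys above appear to be derived from each row's name class. The sketch below shows one way that mapping could work, assuming the four-column names.dmp layout (tax_id, name_txt, unique name, name class) and assuming the key is built by replacing spaces in the name class with underscores. The `unique_name_<class>` key format and the `names_line_to_fields` helper are guesses for illustration; how the PR combines several rows of the same class (for example multiple synonyms) is not covered here.

```python
def names_line_to_fields(line):
    """Map one names.dmp row to (tax_id, {field name: value})."""
    row = line.rstrip("\n")
    # Remove the row terminator before splitting on the field separator.
    if row.endswith("\t|"):
        row = row[:-2]
    tax_id, name_txt, unique_name, name_class = row.split("\t|\t")
    suffix = name_class.replace(" ", "_")
    fields = {f"name_{suffix}": name_txt}
    # Assumption: a unique name is only stored when the column is non-empty.
    if unique_name:
        fields[f"unique_name_{suffix}"] = unique_name
    return tax_id, fields


# Illustrative rows.
rows = [
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n",
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n",
]
for row in rows:
    print(names_line_to_fields(row))
```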
All attributes are now fully accessible from Python through multiple convenient interfaces!
The best way to handle missing data fields is to use the .get() method instead of the dict-like bracket notation.
I already implemented this in the Python bindings. Here are the recommended approaches:
# Safe - returns None or your default if key doesn't exist
common_name = node.get("name_common_name", "N/A")
genetic_code = node.get("genetic_code_id", "1")

# Or check if None
common_name = node.get("name_common_name")
if common_name is not None:
    print(f"Common name: {common_name}")

# Alternative: catch the KeyError raised by bracket access
try:
    common_name = node["name_common_name"]
except KeyError:
    common_name = "N/A"

# Alternative: check membership in the data dictionary
data = node.get_data()
if "name_common_name" in data:
    common_name = data["name_common_name"]
else:
    common_name = "N/A"

# Alternative: look the field up through the Taxonomy object
common_name = tax.get_field("562", "name_common_name", "N/A")
Best Practice Summary
For optional fields (like alternative names):
✓ node.get("name_common_name", "N/A") # Good
✗ node["name_common_name"] # Bad - will error if missing
For required fields (like genetic_code_id):
✓ node["genetic_code_id"] # OK - should always exist
✓ node.genetic_code_id # OK - convenience property
✓ node.get("genetic_code_id", "1") # Also OK - extra safe with default
The .get() method follows Python's dict interface convention and is the idiomatic way to handle potentially missing keys.
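One place the `.get()` convention pays off is when several optional name fields are tried in turn. The helper below is a hypothetical example, not part of this PR: `display_name` and its preference order are illustrative choices, and it works with anything exposing a dict-style `.get(key)`, so a plain dictionary stands in for a node here.

```python
def display_name(fields, fallback="N/A"):
    """Return the first available name, trying optional fields in order."""
    preferred_keys = (
        "name_common_name",
        "name_genbank_common_name",
        "name_scientific_name",
    )
    for key in preferred_keys:
        value = fields.get(key)
        if value is not None:
            return value
    return fallback


# A plain dict stands in for a node; values are illustrative.
print(display_name({"name_scientific_name": "Escherichia coli"}))
print(display_name({}))
```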