
Frequently Asked Questions

Connor Morgan-Lang edited this page Aug 8, 2022 · 2 revisions

Why is TreeSAPP reporting an error where an argument is unrecognized?

This could be due to several reasons. The most common is a spelling mistake in the argument name, like --fasta_input instead of the correct --fastx_input.

If the argument names all look correct, check the position of the backslashes at the end of each line if you are entering a multi-line command. There must be a space between the preceding argument and the backslash for the command to be interpreted correctly; without it, the shell joins that argument directly to the first word of the next line. For example, the following would fail with TreeSAPP complaining ‘output_directory’ is not recognized:

treesapp assign \
-i sample.fasta\
-o output_directory
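To see why the missing space matters, the sketch below swaps treesapp for a hypothetical stand-in function, show_args, which only prints each argument it receives in brackets. It is not part of TreeSAPP; it just makes the shell's line joining visible.

```shell
# Hypothetical stand-in for treesapp: prints every argument in brackets
show_args() { printf '[%s]' "$@"; printf '\n'; }

# No space before the second backslash: "sample.fasta" and "-o" are fused
broken=$(show_args assign \
-i sample.fasta\
-o output_directory)

# A space before each backslash keeps the arguments separate
fixed=$(show_args assign \
-i sample.fasta \
-o output_directory)

echo "broken: $broken"   # broken: [assign][-i][sample.fasta-o][output_directory]
echo "fixed:  $fixed"    # fixed:  [assign][-i][sample.fasta][-o][output_directory]
```

In the broken form the input file name becomes sample.fasta-o, which leaves output_directory as a stray argument that no option claims.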

Do I need to run commands like treesapp assign and treesapp abundance for each reference package separately?

Generally no, all the reference packages should be processed together. This can be accomplished by copying all the reference packages you’re analyzing into a single directory and pointing treesapp assign and any other commands to it with the --refpkg_dir option. Note that commands that have the argument --refpkg_path, such as treesapp update and treesapp colour, do need to be run separately for each reference package.
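A minimal sketch of gathering reference packages into one directory, assuming each reference package is a single .pkl file. The helper name collect_refpkgs and all paths are hypothetical placeholders, not part of TreeSAPP.

```shell
# Hypothetical helper: copy reference package files into one directory.
# Usage: collect_refpkgs DEST_DIR FILE.pkl [FILE.pkl ...]
collect_refpkgs() {
  dest=$1
  shift
  mkdir -p "$dest"
  for pkl in "$@"; do
    cp "$pkl" "$dest"/
  done
}

# Example (placeholder paths), followed by a single run over all packages:
# collect_refpkgs my_refpkgs path/to/first_refpkg.pkl path/to/second_refpkg.pkl
# treesapp assign \
#   -i sample.fasta \
#   -o output_directory \
#   --refpkg_dir my_refpkgs
```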

Why is TreeSAPP complaining that a file already exists?

This error crops up when the output directory already exists from a previous run. You should include the --overwrite flag in your TreeSAPP command and try again.

What do I need to know about the error “Clade exclusion analysis could not be performed for training the reference package models” from treesapp create?

This error can arise when the reference package being created is very small or if the reference sequences lack species-level taxonomic labels. The reference package .pkl file can still be used as normal, but treesapp assign will be more prone to over-classification, where the assigned taxonomic labels are more resolved (closer to species) than they should be for their evolutionary distance.

Is it okay if treesapp purity fails because no alignments were found?

Yes, this is fine. The TIGRFAM seed reference database used is small and not comprehensive. Therefore, it is entirely expected that reference packages for many protein families will not be able to recruit alignments. Still, this result also indicates that the reference package doesn’t contain misclassified sequences related to orthologs in the database, which is comforting to know.

What do the abundance values in the iTOL figures represent? Are they pulled from the "Abundance" column in the classification table or derived from other values?

The rendered abundance values mapping to each leaf in the iTOL phylogeny come from the abundance column in the classification table. If FASTQs weren't provided, the abundance in the table is set to 1.0 for each classified protein, indicating an observation. If FASTQs were provided to treesapp assign or treesapp abundance, the abundances are TPM or FPKM, depending on the options provided.

The values displayed in the iTOL histogram are the cumulative values for each leaf in the tree. These are cumulative values because the abundance values of sequences that map to an internal edge in the tree (i.e. not a leaf) are evenly divided among descendant leaves.
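The even division described above can be illustrated with a small worked example. The leaf names and abundance values below are made up, and the arithmetic is only a sketch of the idea for one internal edge, not TreeSAPP's own code.

```shell
# Made-up numbers: one internal edge carrying an abundance of 6.0, with three
# descendant leaves that have direct abundances of 1.0, 0.0 and 2.5.
internal_abundance=6.0

cumulative=$(
  printf '%s\n' "leafA 1.0" "leafB 0.0" "leafC 2.5" |
  LC_ALL=C awk -v internal="$internal_abundance" '
    { name[NR] = $1; direct[NR] = $2 }
    END {
      # Each descendant leaf receives an equal share of the internal edge value
      share = internal / NR
      for (i = 1; i <= NR; i++)
        printf "%s %.1f\n", name[i], direct[i] + share
    }'
)

echo "$cumulative"
```

Each leaf receives 6.0 / 3 = 2.0 from the internal edge, so the cumulative values printed are 3.0, 2.0 and 4.5.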

How do I guarantee sequences are included in a reference package if I'm dereplicating?

For treesapp create, there is the --guarantee argument. Its input is a FASTA file containing the sequences you would like to ensure are included in the reference package. These can be already present in the general input FASTA file intended for --fastx_input but do not need to be.

Once sequences are included in a reference package, they will never be excluded from it upon updating with treesapp update; the initial reference sequences are always retained from one version to the next. The only exception is when the --resolve flag is included, for example when updating a reference package with highly trusted complete genomes that may have a more resolved lineage than the existing set of reference sequences.

How do I ensure only full-length sequences are added to a reference package when running treesapp update?

First, full-length is tricky to define as there are many ways one could arrive at this number. Lengths will vary (sometimes greatly) between orthologs, but a more reliable reference point is the length inferred by a profile Hidden Markov Model (HMM). With this in mind, the --profile argument of treesapp create will accept a profile HMM (this could be downloaded from EggNOG, for example), and a side effect is that all sequences shorter than 66% of the profile length will be excluded. A more precise alternative is to use the -w/--min_seq_length argument and set a hard length threshold.
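When choosing a hard threshold, it can help to look at the sequence lengths first. The sketch below is an assumption-laden illustration: the helpers seq_lengths and short_seqs are hypothetical, the profile length of 300 is made up, and the integer rounding may not match how TreeSAPP applies its 66% cut-off.

```shell
# Hypothetical helper: print "name length" for every record in a FASTA file
seq_lengths() {
  awk '
    /^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (name != "") print name, len }
  ' "$1"
}

# Hypothetical helper: list the names of sequences shorter than a threshold
# Usage: short_seqs FASTA_FILE MIN_LENGTH
short_seqs() {
  seq_lengths "$1" | awk -v min="$2" '$2 < min { print $1 }'
}

# Made-up profile length; 66% of it gives a candidate minimum length
profile_len=300
min_len=$(( profile_len * 66 / 100 ))
echo "66% of $profile_len is $min_len"
```

Running short_seqs on a candidate FASTA file with $min_len shows which sequences would fall below the cut-off, and the same number could then be supplied to -w/--min_seq_length.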

How do I select specific lineages when creating a reference package?

Use the -s/--screen and -f/--filter arguments of treesapp create. See its usage for details.

How can I gather sequences to build a reference package?

This is a complicated task and we don't have a general answer; it truly depends on the target ortholog. That being said, EggNOG and UniProt are very comprehensive resources that can help with gathering an initial set.
