Write a tutorial that explains how nucleotide distance matrices are generated #12


## Aim

Write a tutorial (with pictures) that explains how nucleotide distance matrices are generated

## Background 

Since it is best to start simple and add complexity later, we start by understanding how the difference (aka distance) between two aligned nucleotide strings (from .fasta files) is calculated. The raw count of mismatched positions is the Hamming distance; dividing that count by the alignment length gives a proportional distance, which equals 1 minus the sequence identity.

Sequence identity is calculated by going along two aligned nucleotide strings and asking at each position, "are these nucleotides matched or mismatched?" A match scores 1 and a mismatch scores 0. The sum of the scores is then divided by the number of positions in the aligned sequence. The distance (or difference) is 1 minus the identity. Some packages take the square root of (1 - identity) instead.
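The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: the function names `sequence_identity` and `p_distance` and the `take_sqrt` flag are made up for this example, and it assumes the two strings are already aligned to the same length. Gaps and ambiguous bases are compared as ordinary characters here, which real packages may handle differently.

```python
import math


def sequence_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions where the two sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Score 1 for a match and 0 for a mismatch, then sum the scores.
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    # Divide by the number of positions in the alignment.
    return matches / len(seq_a)


def p_distance(seq_a: str, seq_b: str, take_sqrt: bool = False) -> float:
    """Distance = 1 - identity; optionally the square root of that value."""
    distance = 1.0 - sequence_identity(seq_a, seq_b)
    return math.sqrt(distance) if take_sqrt else distance


# Two aligned 8-nucleotide strings that differ at positions 5 and 8.
print(sequence_identity("ACGTACGT", "ACGTTCGA"))           # 0.75
print(p_distance("ACGTACGT", "ACGTTCGA"))                  # 0.25
print(p_distance("ACGTACGT", "ACGTTCGA", take_sqrt=True))  # 0.5
```

Here 6 of the 8 positions match, so the identity is 6/8 = 0.75 and the distance is 0.25. The square-root variant gives 0.5, which shows how much the choice of calculation can change the reported value.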

The calculation above is performed for every possible pair of sequences in an alignment and the values are put into a table (often referred to as a matrix by R people). This table is called a distance matrix. It is symmetric, with zeros on the diagonal, and it is used to generate lineage trees (aka hierarchical clusters or dendrograms).
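As a rough sketch of how such a table could be filled in, the Python below computes the distance for every pair of sequences in a small, made-up alignment. The names `p_distance` and `distance_matrix` and the three example sequences are hypothetical and chosen only for illustration; the distance used is the plain 1 minus identity, without the square root.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned positions that differ (1 - identity)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(alignment: dict[str, str]) -> tuple[list[str], list[list[float]]]:
    """Return the sequence names and a symmetric matrix of pairwise distances."""
    names = list(alignment)
    n = len(names)
    # Start with zeros: every sequence is at distance 0 from itself.
    matrix = [[0.0] * n for _ in range(n)]
    # Visit each unordered pair once and fill both halves of the matrix.
    for i, j in combinations(range(n), 2):
        d = p_distance(alignment[names[i]], alignment[names[j]])
        matrix[i][j] = d
        matrix[j][i] = d
    return names, matrix


# Made-up alignment of three sequences, all 8 nucleotides long.
alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTTCGA",
    "seq3": "ACGTACGA",
}
names, matrix = distance_matrix(alignment)

# Print the matrix as a labelled table.
print("\t" + "\t".join(names))
for name, row in zip(names, matrix):
    print(name + "\t" + "\t".join(f"{value:.3f}" for value in row))
```

In this example seq1 and seq2 differ at two positions (distance 0.25), while seq3 differs from each of the others at one position (distance 0.125). A tree-building or hierarchical clustering routine would take a matrix like this as its input.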

We need to check which calculations the commonly used packages perform and decide which one we want to use.

Eventually we will figure out a way to calculate distances between .vcf files. 
