Aim
Write a tutorial (with pictures) that explains how nucleotide distance matrices are generated
Background
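Since it is best to start simple and build up complexity, we begin by understanding how the difference (aka distance) between two nucleotide strings (.fasta files) is calculated. The number of mismatched positions between two aligned strings of equal length is the 'Hamming distance'. Dividing that count by the alignment length gives a proportion (sometimes called the p-distance), which is 1 minus the sequence identity.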
Sequence identity is calculated by going along two aligned nucleotide strings and asking, at each position, "are these nucleotides matched or mismatched?" A match scores 1 and a mismatch scores 0. The sum of scores is then divided by the number of positions in the alignment. The distance (or difference) is 1 minus the identity. Some packages take the square root of (1 - identity).
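The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: the function names identity and p_distance are made up for this example, the two sequences are assumed to be already aligned to the same length, and gap characters are treated like any other character (real packages differ on how they handle gaps and ambiguous bases).

```python
import math


def identity(seq_a, seq_b):
    """Proportion of aligned positions where the two sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Score 1 for a match and 0 for a mismatch, then sum the scores.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def p_distance(seq_a, seq_b):
    """Distance defined as 1 minus the identity."""
    return 1 - identity(seq_a, seq_b)


# Two toy aligned sequences that differ at positions 5 and 8.
a = "ACGTACGT"
b = "ACGTTCGA"

print(identity(a, b))               # 0.75 (6 matches out of 8)
print(p_distance(a, b))             # 0.25
print(math.sqrt(p_distance(a, b)))  # 0.5, the square-root variant
```

The last line shows the square-root variant mentioned above, so the same pair of sequences can give 0.25 or 0.5 depending on which convention a package follows.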
The calculation above is performed for every pair of sequences in an alignment and the values are put into a table (often referred to as a matrix by R users). This table is called a distance matrix, and it is used to generate lineage trees (aka hierarchical clusters or dendrograms).
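As a rough sketch of how such a table could be built, the Python below computes the distance for every pair of sequences in a toy alignment. The names distance_matrix and p_distance, the dictionary layout, and the three example sequences are all assumptions made for illustration. The distance here is the proportion of mismatched positions, which is the same as 1 minus the identity.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of mismatched positions (equals 1 minus the identity)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(alignment):
    """Return (names, matrix) for a dict mapping sequence name to aligned sequence."""
    names = list(alignment)
    n = len(names)
    # Start with zeros: every sequence is at distance 0 from itself.
    matrix = [[0.0] * n for _ in range(n)]
    # Visit each unordered pair once and fill both halves of the table.
    for i, j in combinations(range(n), 2):
        d = p_distance(alignment[names[i]], alignment[names[j]])
        matrix[i][j] = d
        matrix[j][i] = d
    return names, matrix


# Toy alignment of three sequences (hypothetical data).
alignment = {"seq1": "ACGT", "seq2": "ACGA", "seq3": "TCGA"}

names, matrix = distance_matrix(alignment)
for name, row in zip(names, matrix):
    print(name, row)
# seq1 [0.0, 0.25, 0.5]
# seq2 [0.25, 0.0, 0.25]
# seq3 [0.5, 0.25, 0.0]
```

The table is symmetric with zeros on the diagonal, which is why only one value per pair needs to be calculated.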
We need to check which of these calculations each commonly used package performs and decide which one we want to use.
Eventually we will figure out a way to calculate distances between .vcf files.