Write a tutorial that explains how nucleotide distance matrices are generated #12

@Deena-B

Description

Aim

Write a tutorial (with pictures) that explains how nucleotide distance matrices are generated

Background

Since it is best to start simple and build up complexity, we are starting by understanding how the difference (aka distance) between two aligned nucleotide strings (stored in .fasta files) is calculated. This is closely related to the Hamming distance (the number of mismatched positions) and to sequence identity (the fraction of matched positions).

Sequence identity is calculated by going along two aligned nucleotide strings and asking, at each position, "are these nucleotides matched or mismatched?" A match scores 1 and a mismatch scores 0. The sum of scores is then divided by the number of positions in the alignment. The distance (or difference) is 1 minus the identity. Some packages take the square root of (1 - identity).
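To make the calculation above concrete, here is a minimal Python sketch. The function names (`identity`, `distance`) and the `take_sqrt` flag are made up for illustration and do not come from any particular package. It assumes the two sequences are already aligned to the same length, and it treats gap characters and ambiguity codes like any other character, which real packages may handle differently.

```python
import math


def identity(seq_a, seq_b):
    """Fraction of aligned positions at which the two sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Score 1 for every matching position, 0 for every mismatch.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def distance(seq_a, seq_b, take_sqrt=False):
    """1 - identity, optionally square-rooted as some packages do."""
    d = 1.0 - identity(seq_a, seq_b)
    return math.sqrt(d) if take_sqrt else d


# Two aligned 8-nucleotide strings that differ at positions 5 and 8.
a = "ACGTACGT"
b = "ACGTTCGA"
print(identity(a, b))                  # 0.75
print(distance(a, b))                  # 0.25
print(distance(a, b, take_sqrt=True))  # 0.5
```

Multiplying the plain distance by the alignment length gives the raw mismatch count (here 0.25 * 8 = 2), which is the Hamming distance.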

The calculation above is performed between all possible sequence pairs in an alignment and the values are put into a table (often referred to as a matrix by R people). This table is called a distance matrix and it is used to generate lineage trees (aka hierarchical clusters or dendrograms).
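As a rough sketch of the all-pairs step, the snippet below fills a symmetric table from a small made-up alignment. The sequence names, the sequences themselves, and the helper names (`p_distance`, `distance_matrix`) are all hypothetical. It uses the plain 1 - identity distance with no square root and assumes every sequence has the same aligned length.

```python
def p_distance(seq_a, seq_b):
    """Proportion of mismatched positions (equivalent to 1 - identity)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(alignment):
    """Build a symmetric nested-dict matrix from {name: aligned_sequence}."""
    names = list(alignment)
    matrix = {name: {} for name in names}
    for i, name_i in enumerate(names):
        matrix[name_i][name_i] = 0.0  # a sequence is identical to itself
        for name_j in names[i + 1:]:
            d = p_distance(alignment[name_i], alignment[name_j])
            # The distance is symmetric, so fill both halves of the table.
            matrix[name_i][name_j] = d
            matrix[name_j][name_i] = d
    return names, matrix


# Made-up toy alignment, for illustration only.
alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTTCGA",
    "seq3": "ACGAACGT",
}

names, matrix = distance_matrix(alignment)

# Print the matrix as a tab-separated table.
print("\t".join([""] + names))
for row in names:
    print("\t".join([row] + [f"{matrix[row][col]:.3f}" for col in names]))
```

In this toy example seq1 and seq3 differ at one position out of eight (0.125), so a clustering method would be expected to join them before adding seq2.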

We need to see what calculations are used by commonly used packages and decide which one we want to use.

Eventually we will figure out a way to calculate distances between .vcf files.

Labels

4-12 hours: This task will probably take about 4-12 hours
basic bash or python: This problem can be solved with basic bash or python
good first issue: Good for newcomers
quirks of the field: You will learn things that most people outside the field don't grasp
