In this early version of the project, BioPython's global alingmnent function (pairwise2) as well as a handmade Needleman-Wuntsch algorithm are used to find the best matches for each gene. This was later replaced by BLASTn, which dramatically increased the speed of alignment.
The Needleman-Wunsch algorithm is an algorithm that performs a maximum global alignment of two DNA sequences. It is commonly used in bioinformatics to align protein or nucleotide sequences. The Needleman-Wunsch algorithm is an example of dynamic programming, it guarantees to find the maximum score alignment. To determine the maximum score alignment, a two-dimensional array, or matrix, is used. There is one row for each character in sequence A, and one column for each character in sequence B. So, if we align sequences of size n and m, the execution time of the algorithm is O(nm), and the memory space used is O(nm) too.
The algorithm is executed in three steps:
-
Calculation of the similarity matrix, matrix S. For each cell of the matrix we assign a score Z if the nucleotide of sequence A is equal to that of sequence B and a score Y if there is a substitution
-
Then we calculate the optimal matrix, matrix M. For each cell of M(i,j), we take the maximum between 3 cases: M(i,j) = max (M(i-1, j-1)+S(i,j), M(i-1, j)+g, M(i, j-1) + g) with g being the weight assigned to the gap.
- We look at the matrix M from the last cell to the first. For each cell we go to the best score between the 3 before. So for the cell M(i, j), we look at M(i, j-1), M(i-1, j) and M(i-1, j-1) and we keep only the best score.
Finally this gives us an optimal alignment of the two sequences with gaps and shifts.
Library: Only three library are used: Numpy, Pandas and Biopython. From Biopython we only use SeqIO to parse the fasta files and pairwise2 to align two DNA sequences.
utils.py : All the function use to preprocess the file :
- parse_sequence : Parse the database in a dictionnary, Sequence information are the keys and DNA sequence are the values.
- parse_database : Parse the sequence in a list.
- replace_nucleotide : Verify if the sequence provide is completely correct, if not change the nucleotide by a '-'. This choice is a personal choice to make the computation easier and reduce the runtime. Could be change to improve the accuracy.
- nucleotide_uppercase : Put all the sequence in uppercase
algo.py : All the function about the algorithm
- global_align : Needleman-Wunsch algorithm, compare two sequence to find the alignment and the matching score. Could be a bit long. Work with score_align() to compute the alignment score
- bioalign : Needleman-Wunsch algorithm coding in the library biopython, compare the sequence with all the database to find the sequence with the best alignment and the best matching score. It's the fastest method.
- align : Compute the alignment and the score using the different function above
Identification_GR.py : main function to run the programm.
To use this project you have to clone the main repository and use the following command line in the terminal.
\python Identification_GR.py [OPTION]
OPTION = 3 possible choices of algorithm :
- [1] : Sequence identification, identifies a sequence within the database, the 10 best matching scores will return.
- [2] : Sequence alignment, returns the best alignment and the matching score (in percent). Uses global_algo, a Needleman-Wunsch Algorithm coded internally.
- [3] : Sequence alignment, returns the best alignment and the matching score (in percent). Uses a Needleman-Wunsch Algorithm with the librart Biopython.
\python Identification_GR.py [option = 1] [sequence_path] [gene_name] [--family_name]
- SEQUENCE_PATH = The path of your sequence (must be a fasta file, with extension ".fasta")
- GENE_NAME = Name of the gene, choices: "matk", "rbcl", "psbA-trnh", "its"
- --FAMILY_NAME = Optional, name of the Family taxon (if known)
\python Identification_GR.py [option = 2 or 3] [template_path] [consensus_path]
- TEMPLATE_PATH = The path of your template sequence (must be a fasta file)
- CONSESUS_PATH = The path of your consensus sequence (must be a fasta file)
This is an example result using a test file.
Result for matK with option 1:

