Scientists are looking for new ways to store data efficiently, and one method of interest is the use of DNA as a storage medium. With the right encoding, one cubic centimeter of DNA can store 10^16 bits of data, which means that all the world's data could be stored in one pound of DNA.
But how can we achieve this? We take the contents of the file and encode them using Huffman coding, with trits (0, 1, 2) instead of bits. Each character's code is the path that connects the root to that character's leaf node in a ternary Huffman tree. If the file has an even number of unique characters, we add an imaginary character with a frequency of zero: every merge step combines three nodes into one, so the tree only comes out full when the number of leaves is odd. After converting the text to its ternary code, we encode the trits using the DNA bases A (adenine), C (cytosine), G (guanine), and T (thymine), according to the following table. Decoding an already encoded file reverses these steps.

| Previous base | Current trit 0 | Current trit 1 | Current trit 2 |
|---|---|---|---|
| A | C | G | T |
| C | G | T | A |
| G | T | A | C |
| T | A | C | G |
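
The two encoding steps described above (ternary Huffman coding, then mapping trits to bases with the table) can be sketched in Python as follows. This is a minimal illustration, not the code in `dna_store.py`: the function names `ternary_huffman`, `trits_to_dna`, and `dna_to_trits` are hypothetical, and the sketch assumes that the "previous base" for the very first trit is `A`, which the text above does not specify.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ACGT"  # cyclic order used by the table rows


def ternary_huffman(text):
    """Return {character: trit string} for a ternary Huffman code of `text`."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single symbol would otherwise get an empty code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(w, next(tiebreak), {ch: ""}) for ch, w in freq.items()]
    if len(heap) % 2 == 0:
        # Pad to an odd symbol count with a zero-frequency dummy (None),
        # so that every merge can combine exactly three nodes.
        heap.append((0, next(tiebreak), {None: ""}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        for trit in "012":
            weight, _, codes = heapq.heappop(heap)
            total += weight
            for sym, code in codes.items():
                merged[sym] = trit + code  # prepend the branch label
        heapq.heappush(heap, (total, next(tiebreak), merged))
    codes = heap[0][2]
    codes.pop(None, None)  # drop the dummy symbol if it was added
    return codes


def trits_to_dna(trits, start="A"):
    """Map a trit string to bases; `start` is an assumed initial previous base."""
    prev, out = start, []
    for t in trits:
        # Trits 0, 1, 2 select the three bases that follow `prev` cyclically.
        prev = BASES[(BASES.index(prev) + int(t) + 1) % 4]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, start="A"):
    """Invert trits_to_dna for a sequence produced with the same `start`."""
    prev, out = start, []
    for base in dna:
        out.append(str((BASES.index(base) - BASES.index(prev) - 1) % 4))
        prev = base
    return "".join(out)


text = "abracadabra"
codes = ternary_huffman(text)
trits = "".join(codes[ch] for ch in text)
dna = trits_to_dna(trits)
```

Because each row of the table never maps to its own base, the resulting sequence contains no two identical consecutive bases, and `dna_to_trits(dna)` recovers the original trit string.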

To run the program from the terminal, use the following command:

    python dna_store.py [-d] input output huffman

Depending on the system where the program is run, you may need to write `python3` instead of `python`.

The program receives four parameters:

- If `-d` is not given, the program encodes the `input` file to the `output` file using Huffman coding and saves the Huffman map to the given `huffman` CSV file. If `-d` is given, it does the opposite and decodes the `input` file using the given `huffman` CSV file.
- The parameter `input` is either an encoded input file or a plain text file, depending on whether `-d` is given.
- The parameter `output` is either the encoded output file or the decoded output file, depending on whether `-d` is given.
- The parameter `huffman` is a CSV file that maps each character to its ternary code.
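
For illustration, a command line of this shape could be parsed with the standard `argparse` module roughly as shown below. This is only a sketch under assumptions: the helper `build_parser`, the attribute name `decode`, and the example file names are hypothetical, and the actual `dna_store.py` may handle its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser for: dna_store.py [-d] input output huffman."""
    parser = argparse.ArgumentParser(prog="dna_store.py")
    # Optional flag: present means decode, absent means encode.
    parser.add_argument("-d", dest="decode", action="store_true",
                        help="decode the input file instead of encoding it")
    # Three required positional parameters, in the documented order.
    parser.add_argument("input", help="text file to encode, or encoded file to decode")
    parser.add_argument("output", help="file that receives the result")
    parser.add_argument("huffman", help="CSV file holding the character-to-trit map")
    return parser


# Example: parse an explicit argument list (file names are placeholders).
args = build_parser().parse_args(["-d", "coded.txt", "decoded.txt", "huffman.csv"])
mode = "decode" if args.decode else "encode"
```

In a real script, `parse_args()` would be called without a list so that it reads the arguments from the command line.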