MGCP — MGC+ coding for DNA and binary channels with insertion, deletion, and substitution (IDS) errors
MGCP is a Python package implementing the Marker Guess & Check Plus (MGC+) family of encoders and decoders for both binary and DNA sequences. It contains:
- Encoders/decoders for binary and DNA sequences (mgcp.binary, mgcp.dna).
- File-level codec that encodes a binary file into a collection of DNA sequences and decodes it back from noisy DNA reads (mgcp.dna.codec).
- Utility modules for simulation, error models, and plotting (mgcp.utils).
- Command-line interface (mgcp/cli) that exposes the main workflows.
- Demos under demo/ that show end-to-end examples (these require optional external tools).
This README documents installation, usage, CLI commands, demos and publishing guidance.
- Features
- Installation (Python + optional system deps)
- Quickstart (import & CLI examples)
- Detailed module overview
- Demo & external tools
- Citing this work
- License
- Encode/decode at the bit level and the DNA level using MGC+ codes.
- Plotting helpers to benchmark frame error rate (FER) vs code rate or channel error rate (both DNA and binary).
- A single CLI entrypoint (mgcp) with subcommands for DNA, binary, and file codec flows.
- Demo scripts showing how to run full pipelines, clustering and consensus building.
Prerequisites:
- Python 3.10 or newer.
- A working C/Python toolchain, needed only if you have to build optional native dependencies.
Clone and install locally:
git clone https://github.com/ramy-khabbaz/mgcp.git
cd mgcp
pip install -e .
The package declares all runtime dependencies in setup.cfg. Some demo scripts require external, non-Python tools (listed below). If you only want to use the core library/CLI you can install as above. To enable demo-only features (clustering, MSA), install the demo extras:
pip install -e '.[demo]'
What the demo extras install:
- kalign: Python wrapper for Kalign MSA
- pycdhit: lightweight wrapper for CD-HIT
- psutil, tqdm
Note: kalign and pycdhit require system packages or binaries (see Demo & external tools below).
import mgcp
print(mgcp.__version__)
# programmatic usage example (file codec)
from mgcp.dna.codec import encode as codec_encode
codec_encode(file_name='data.bin', max_length=120, inner_redundancy=4, outer_redundancy=200)
MGCP exposes a single CLI entrypoint mgcp that groups subcommands.
Top-level help:
mgcp --help
- DNA-level: mgcp dna ...
- Binary-level: mgcp binary ... (encode/decode and plotting)
- File-level codec: mgcp codec ... (encode/decode files to/from DNA sequences)
Output a detailed list of the MGC+ encoding parameters for different modules:
mgcp binary encode --help
mgcp dna encode --help
mgcp codec encode --help
Encode a single binary message into a binary codeword, with the block size (symbol length) set to 4 bits, 4 guess parities added, and the marker period set to 2:
mgcp binary encode "0101010011110110" 4 4 2
Recover the binary message from a corrupted sequence (the 6th and 7th bits of the codeword are deleted and the 20th is substituted):
mgcp binary decode "0101000011111011010111110111001010101100010010100101101100100011"
Encode a single binary message into a single DNA sequence, with the block size (symbol length) set to 4 bits, 4 guess parities added, and the marker period set to 0 (no markers):
mgcp dna encode "0101010011110110" 4 4 0
Recover the binary message from a corrupted DNA sequence (substitutions: 2nd (T->A) and 17th (C->T) bases, deletions: 10th and 11th bases, insertion: 'G' is inserted at the 4th position):
mgcp dna decode "TATGAGGTCGGTTTCTCTGATTGTGTT"
Encode a binary file into a collection of DNA sequences (file-level DNA-MGC+ codec). Here, the file is encoded with a target oligo length of 120, the inner code has 4 guess parities and doesn't include markers, and the outer code adds 200 redundant sequences:
mgcp codec encode "data.bin" 120 4 200 --input-path ./ --no-marker
Recover the binary file from noisy DNA reads (reads.txt) using 4 CPU cores for parallel decoding:
mgcp codec decode "reads.txt" --input-path ./ --processes 4
Both mgcp dna and mgcp binary include plot subcommands to generate FER vs code rate or FER vs channel error rate. Example:
mgcp dna plot fer-vs-coderate 256 8 2 6,8,10,12,14 --pe 0.01 --num-iterations 1000
- mgcp.binary: binary-level encoding/decoding primitives and utilities.
- mgcp.dna: binary input to DNA sequence encoding, decoding, and helper pipelines.
- mgcp.dna.codec: high-level file codec (binary file -> DNA sequence and noisy reads -> binary file).
- mgcp.utils: helper modules: tools.py (random file generation, error models), loader.py, binary_channel.py, and plotting utilities.
- mgcp.cli: main.py registers Typer application and subcommands implemented in dna_cli.py, binary_cli.py, codec_cli.py.
For programmatic use, import the submodule you need and call the functions directly. Examples can be found in demo/.
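When a script needs to drive the CLI rather than import the library, the documented commands can be assembled and launched from Python. The sketch below is only an illustration: `mgcp_command` and `run_mgcp` are hypothetical helper names that are not part of the package, and the arguments simply mirror the file-level encode example earlier in this README.

```python
import shlex
import shutil
import subprocess


def mgcp_command(*args):
    """Build an argv list for the mgcp CLI; every argument is converted to str."""
    return ["mgcp", *map(str, args)]


def run_mgcp(*args):
    """Run the command if the mgcp executable is on PATH; return stdout, else None."""
    if shutil.which("mgcp") is None:
        return None  # mgcp is not installed in this environment
    result = subprocess.run(
        mgcp_command(*args), capture_output=True, text=True, check=True
    )
    return result.stdout


# Same arguments as the file-level encode example in this README.
encode_cmd = mgcp_command(
    "codec", "encode", "data.bin", 120, 4, 200, "--input-path", "./", "--no-marker"
)
print(shlex.join(encode_cmd))
```

Passing the arguments as a list avoids shell quoting problems with long bit strings or paths; `run_mgcp("--help")` would then execute the top-level help when the package is installed.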
The demo/ folder demonstrates the full pipeline: encode a file, simulate sequencing errors, cluster reads (CD-HIT), align clusters (Kalign), generate consensus sequences, and decode back.
- CD-HIT (cd-hit-est): clustering. Install the executable and ensure cd-hit-est is on PATH. The demo extras install only Python wrappers; the native binary must be installed separately.
- Kalign: multiple sequence aligner. Install the Kalign binary (or Kalign3) and ensure it is on PATH. The Python wrapper may still require the Kalign executable.
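Before running the demos it can help to confirm that both executables are actually visible on PATH. The following is a minimal sketch using only the standard library; the executable names `cd-hit-est` and `kalign` are taken from this README, and `find_tools` and `missing_tools` are hypothetical helpers, not part of MGCP.

```python
import shutil

# Executable names as given in this README; adjust if your install differs.
REQUIRED_TOOLS = ("cd-hit-est", "kalign")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executable names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

An empty list from `missing_tools()` means both binaries were found; it does not verify their versions or that the Python wrappers can call them.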
References and links
- CD-HIT — Li, W. & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659. DOI: https://doi.org/10.1093/bioinformatics/btl158. Project: https://github.com/weizhongli/cdhit
- Kalign — Lassmann, T. & Sonnhammer, E. L. L. (2005). Kalign—an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6:298. DOI: https://doi.org/10.1186/1471-2105-6-298. Kalign homepage: https://msa.sbc.su.se/kalign/ · Kalign3: https://github.com/TimoLassmann/kalign3
Note: check each tool's README for platform-specific dependencies and recommended installation methods.
- Use mgcp.dna.codec.encode to generate encoded_file.txt (oligos list).
- Use mgcp.utils.tools.error_generator to generate reads with IDS errors.
- Run CD-HIT on the reads to cluster similar reads together (cd-hit-est or the pycdhit helper).
- For each cluster, run Kalign to apply multiple sequence alignment to the reads and produce a consensus sequence.
- Feed consensus sequences into mgcp.dna.codec.decode to recover the original file.
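To make the read-generation step of this workflow concrete, here is a small stand-alone sketch of an IDS channel. It is not the package's `mgcp.utils.tools.error_generator`, whose signature is not shown in this README; `ids_channel` and `make_reads` are hypothetical names, and the model assumed here (independent per-base insertion, deletion, and substitution probabilities, with uniformly random inserted or substituted bases) is only one common choice.

```python
import random

BASES = "ACGT"


def ids_channel(seq, p_ins=0.0, p_del=0.0, p_sub=0.0, rng=None):
    """Pass a DNA string through an i.i.d. insertion/deletion/substitution channel."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        # Insertion: a uniformly random base is placed before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            continue  # deletion: the current base is dropped
        if r < p_del + p_sub:
            # Substitution: replace with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_reads(oligo, n_reads, rng=None, **probs):
    """Generate n_reads independent noisy copies of one oligo."""
    return [ids_channel(oligo, rng=rng, **probs) for _ in range(n_reads)]


rng = random.Random(0)  # fixed seed so runs are repeatable
reads = make_reads("ACGTACGTACGTACGT", 5, rng=rng, p_ins=0.01, p_del=0.01, p_sub=0.01)
print(reads)
```

Reads produced this way can have different lengths, which is why the later steps cluster and align them before building a consensus.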
The demo scripts in demo/ show concrete invocations. To run demos, install the demo extras and ensure cd-hit-est and kalign are installed on your system.
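For the consensus step, one simple approach is a per-column majority vote over the aligned reads of a cluster. The sketch below assumes the reads have already been aligned to equal length with `-` as the gap symbol (for example by Kalign); `column_consensus` is a hypothetical helper, and the demo scripts may build their consensus differently.

```python
from collections import Counter

GAP = "-"


def column_consensus(aligned_reads):
    """Majority vote per alignment column; columns won by the gap symbol are dropped."""
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        # most_common(1) gives the top symbol; ties go to the symbol seen first.
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != GAP:
            consensus.append(symbol)
    return "".join(consensus)


aligned = ["ACG-TAC", "ACGGTAC", "AC--TAC", "ACG-TTC"]
print(column_consensus(aligned))  # prints ACGTAC
```

Here the fourth column is mostly gaps, so the single inserted G is voted out, and the lone T in the sixth column loses to A.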
If you use MGCP in your research, please cite the following paper.
BibTeX example:
@article{mgcp,
title = {{DNA-MGC+}: A versatile codec for reliable and resource-efficient data storage on synthetic {DNA}},
author = {Khabbaz, Ramy and Mateos, J{\'e}r{\'e}my and Antonini, Marc and {Kas Hanna}, Serge},
journal = {bioRxiv preprint},
publisher = {Cold Spring Harbor Laboratory},
year = {2026},
doi = {10.64898/2026.03.11.711016},
}
MIT; see LICENSE.