ramy-khabbaz/MGCP

MGCP — MGC+ coding for DNA and binary channels with insertion, deletion, and substitution (IDS) errors

MGCP is a Python package implementing the Marker Guess & Check Plus (MGC+) family of encoders and decoders for both binary and DNA sequences. It contains:

  • Encoders/decoders for binary and DNA sequences (mgcp.binary, mgcp.dna).
  • File-level codec that encodes a binary file into a collection of DNA sequences and decodes it back from noisy DNA reads (mgcp.dna.codec).
  • Utility modules for simulation, error models, and plotting (mgcp.utils).
  • Command-line interface (mgcp/cli) that exposes the main workflows.
  • Demos under demo/ that show end-to-end examples (these require optional external tools).

This README documents installation, usage, CLI commands, demos, and how to cite the work.

Table of contents

  • Features
  • Installation (Python + optional system deps)
  • Quickstart (import & CLI examples)
  • Detailed module overview
  • Demo & external tools
  • Citing this work
  • License

Features

  • Encode/decode at the bit level and the DNA level using MGC+ codes.
  • Plotting helpers to benchmark frame error rate (FER) vs code rate or channel error rate (both DNA and binary).
  • A single CLI entrypoint (mgcp) with subcommands for DNA, binary, and file codec flows.
  • Demo scripts showing how to run full pipelines, clustering and consensus building.

Installation

Prerequisites:

  • Python 3.10 or newer.
  • A working C/Python build toolchain, needed only if you have to build optional native dependencies.

Clone and install locally:

git clone https://github.com/ramy-khabbaz/mgcp.git
cd mgcp
pip install -e .

Optional demo/system dependencies

The package declares all runtime dependencies in setup.cfg. Some demo scripts require external, non-Python tools (listed below). If you only want to use the core library/CLI you can install as above. To enable demo-only features (clustering, MSA), install the demo extras:

pip install -e '.[demo]'

What the demo extras install:

  • kalign — Python wrapper for Kalign MSA
  • pycdhit — lightweight wrapper for CD-HIT
  • psutil, tqdm

Note: kalign and pycdhit require system packages or binaries (see Demo & external tools below).

Quickstart

Importing the library

import mgcp
print(mgcp.__version__)

# programmatic usage example (file codec)
from mgcp.dna.codec import encode as codec_encode
codec_encode(file_name='data.bin', max_length=120, inner_redundancy=4, outer_redundancy=200)

Command-line interface

MGCP exposes a single CLI entrypoint mgcp that groups subcommands.

Top-level help:

mgcp --help

Subcommands

  • DNA-level: mgcp dna ...
  • Binary-level: mgcp binary ... (encode/decode and plotting)
  • File-level codec: mgcp codec ... (encode/decode files to/from DNA sequences)

Examples

Output a detailed list of the MGC+ encoding parameters for different modules:

mgcp binary encode --help
mgcp dna encode --help
mgcp codec encode --help

Encode a single binary message into a binary codeword, with the block size (symbol length) set to 4 bits, 4 guess parities added, and the marker period set to 2:

mgcp binary encode "0101010011110110" 4 4 2

Recover the binary message from a corrupted sequence (the 6th and 7th bits of the codeword are deleted and the 20th is substituted):

mgcp binary decode "0101000011111011010111110111001010101100010010100101101100100011"

Encode a single binary message into a single DNA sequence, with the block size (symbol length) set to 4 bits, 4 guess parities added, and the marker period set to 0 (no markers):

mgcp dna encode "0101010011110110" 4 4 0

Recover the binary message from a corrupted DNA sequence (substitutions: 2nd (T->A) and 17th (C->T) bases, deletions: 10th and 11th bases, insertion: 'G' is inserted at the 4th position):

mgcp dna decode "TATGAGGTCGGTTTCTCTGATTGTGTT"

Encode a binary file into a collection of DNA sequences (file-level DNA-MGC+ codec). Here, the file is encoded with a target oligo length of 120, the inner code has 4 guess parities and doesn't include markers, and the outer code adds 200 redundant sequences:

mgcp codec encode "data.bin" 120 4 200 --input-path ./ --no-marker

Recover the binary file from noisy DNA reads (reads.txt) using 4 CPU cores for parallel decoding:

mgcp codec decode "reads.txt" --input-path ./ --processes 4
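
The two codec commands above can also be driven from a Python script. The sketch below only assembles the same argument lists shown in the examples and runs them when the mgcp executable is found on PATH. The helper names (codec_encode_cmd, codec_decode_cmd, run_if_available) are hypothetical and not part of the package, and the sketch assumes that omitting --no-marker leaves the CLI's default marker behaviour in place.

```python
import shlex
import shutil
import subprocess


def codec_encode_cmd(file_name, max_length, inner_redundancy, outer_redundancy,
                     input_path="./", no_marker=True):
    """Build the argument list for `mgcp codec encode` (hypothetical helper)."""
    cmd = ["mgcp", "codec", "encode", file_name,
           str(max_length), str(inner_redundancy), str(outer_redundancy),
           "--input-path", input_path]
    if no_marker:
        # Same flag as in the README example; assumption: leaving it out
        # keeps whatever marker setting the CLI uses by default.
        cmd.append("--no-marker")
    return cmd


def codec_decode_cmd(reads_file, input_path="./", processes=1):
    """Build the argument list for `mgcp codec decode` (hypothetical helper)."""
    return ["mgcp", "codec", "decode", reads_file,
            "--input-path", input_path, "--processes", str(processes)]


def run_if_available(cmd):
    """Run the command only if the mgcp executable is on PATH."""
    if shutil.which(cmd[0]) is None:
        return None  # mgcp is not installed in this environment
    return subprocess.run(cmd, check=True)


# Print the commands that would be executed.
print(shlex.join(codec_encode_cmd("data.bin", 120, 4, 200)))
print(shlex.join(codec_decode_cmd("reads.txt", processes=4)))
```

Passing the arguments as a list avoids shell quoting issues with file names; call run_if_available(...) on either list to actually execute it.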

Plotting

Both mgcp dna and mgcp binary include plot subcommands to generate FER vs code rate or FER vs channel error rate. Example:

mgcp dna plot fer-vs-coderate 256 8 2 6,8,10,12,14 --pe 0.01 --num-iterations 1000
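
For readers unfamiliar with the metric, the frame error rate is the fraction of trials in which the decoded message differs from the transmitted one. The sketch below estimates it by Monte Carlo simulation. It is not the package's plotting code: a toy 3x repetition code over a substitution-only binary channel stands in for MGC+, and every function name here is hypothetical.

```python
import random


def repetition_encode(bits, r=3):
    """Toy stand-in code: repeat every bit r times (code rate 1/r)."""
    return [b for b in bits for _ in range(r)]


def repetition_decode(received, r=3):
    """Majority vote over each group of r received bits."""
    return [int(2 * sum(received[i:i + r]) > r)
            for i in range(0, len(received), r)]


def substitution_channel(bits, pe, rng):
    """Flip each bit independently with probability pe."""
    return [b ^ (rng.random() < pe) for b in bits]


def estimate_fer(k, pe, num_iterations, seed=0):
    """Fraction of trials in which the decoded frame differs from the message."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(num_iterations):
        message = [rng.randint(0, 1) for _ in range(k)]
        received = substitution_channel(repetition_encode(message), pe, rng)
        if repetition_decode(received) != message:
            failures += 1
    return failures / num_iterations


# FER for a 16-bit message at a few channel error rates.
for pe in (0.0, 0.01, 0.05):
    print(pe, estimate_fer(k=16, pe=pe, num_iterations=1000))
```

A frame counts as failed if even one message bit is wrong, so FER is always at least as large as the bit error rate measured over the same trials.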

Detailed module overview

  • mgcp.binary — binary-level encoding/decoding primitives and utilities.
  • mgcp.dna — binary input to DNA sequence encoding, decoding, and helper pipelines.
  • mgcp.dna.codec — high-level file codec (binary file -> DNA sequences, and noisy reads -> binary file).
  • mgcp.utils — helper modules: tools.py (random file generation, error models), loader.py, binary_channel.py, and plotting utilities.
  • mgcp.cli (main.py) registers the Typer application and the subcommands implemented in dna_cli.py, binary_cli.py, and codec_cli.py.

For programmatic use, import the submodule you need and call the functions directly. Examples can be found in demo/.

Demo & external tools

The demo/ folder demonstrates the full pipeline: encode a file, simulate sequencing errors, cluster reads (CD-HIT), align clusters (Kalign), generate consensus sequences, and decode back.
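
To make the "simulate sequencing errors" step concrete, here is a minimal sketch of an IDS channel acting on a DNA string. It is not the package's own error generator, whose interface is not described in this README. The function name, the parameters, and the i.i.d. per-base error model are assumptions chosen for illustration.

```python
import random

BASES = "ACGT"


def ids_channel(seq, p_ins=0.01, p_del=0.01, p_sub=0.01, rng=None):
    """Apply independent insertion, deletion, and substitution errors per base."""
    rng = rng or random.Random()
    out = []
    for base in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))  # insert a random base before this one
        u = rng.random()
        if u < p_del:
            continue  # deletion: the base is dropped
        if u < p_del + p_sub:
            # substitution: replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)  # base passes through unchanged
    return "".join(out)


# Generate a few noisy reads of one oligo with a fixed seed.
rng = random.Random(42)
oligo = "ACGTTGCAACGTACGTTAGC"
reads = [ids_channel(oligo, 0.02, 0.02, 0.02, rng) for _ in range(5)]
for read in reads:
    print(read)
```

Because insertions and deletions change the read length, reads of the same oligo generally no longer line up position by position, which is why the demo pipeline clusters and aligns them before decoding.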

External tools used by demos

  • CD-HIT (cd-hit-est) — clustering. Install the executable and ensure cd-hit-est is on PATH. The demo extras install only Python wrappers; the native binary must be installed separately.
  • Kalign — multiple sequence aligner. Install the Kalign binary (or Kalign3) and ensure it is on PATH. The Python wrapper may still require the Kalign executable.

References and links

Note: check each tool's README for platform-specific dependencies and recommended installation methods.

Example demo outline

  1. mgcp.dna.codec.encode to generate encoded_file.txt (oligos list).
  2. mgcp.utils.tools.error_generator to generate reads with IDS errors.
  3. Run CD-HIT on the reads to cluster similar reads together (cd-hit-est or the pycdhit helper).
  4. For each cluster, run Kalign to apply multiple sequence alignment to the reads and produce a consensus sequence.
  5. Feed consensus sequences into mgcp.dna.codec.decode to recover the original file.
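
Step 4 can be pictured as a column-wise majority vote once the reads of a cluster have been aligned. The sketch below assumes the aligner has already padded the reads to equal length with "-" as the gap symbol. The consensus function is hypothetical and may differ from what the demo scripts actually do.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length aligned reads.

    Columns whose most common symbol is the gap are dropped from the output.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    out = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            out.append(symbol)
    return "".join(out)


# Three aligned reads: the second has one deletion and one insertion.
aligned = ["ACGT-A",
           "AC-TGA",
           "ACGT-A"]
print(consensus(aligned))  # ACGTA
```

With only a few reads per cluster, ties can occur in a column; this sketch resolves them in favour of the symbol seen first, which is an arbitrary choice.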

The demo scripts in demo/ show concrete invocations. To run demos, install the demo extras and ensure cd-hit-est and kalign are installed on your system.

Citing this work

If you use MGCP in your research, please cite the following paper.

BibTeX example:

@article{mgcp,
  title   = {{DNA-MGC+}: A versatile codec for reliable and resource-efficient data storage on synthetic {DNA}},
  author  = {Khabbaz, Ramy and Mateos, J{\'e}r{\'e}my and Antonini, Marc and {Kas Hanna}, Serge},
  journal = {bioRxiv preprint},
  publisher = {Cold Spring Harbor Laboratory},
  year    = {2026},
  doi     = {10.64898/2026.03.11.711016},
}

License

MIT — see LICENSE.

About

Python implementation of MGC+ channel coding for binary and DNA data storage, including encoding, decoding, simulations, and a DNA file storage codec.
