This project started off as a clone of https://github.com/shaobohou/madios. The goal was to experiment using a LLM to refactor and modernize the codebase, improving readability, maintainability, and performance while preserving the original functionality. The regression tests are passing and I think the end result is a more modern, readable, and maintainable codebase that still retains the original functionality of the ADIOS algorithm as intended by the original authors -- at least according to the LLM :).
Madios is a modern C++ implementation of the ADIOS grammar induction algorithm.
include/madios/— Public headers for library and CLIsrc/— Implementation files (C++ sources)tests/— Test sources, test data, and scriptstest/— Test data for legacy/test compatibilityext/— External dependenciesbuild/— Build artifacts (ignored by git)utils/— Utility implementationsmaths/— Math-specific code (now underinclude/madios/maths/)
mkdir -p build
cd build
cmake ..
make -j$(nproc)
ctest --output-on-failureSee CONTRIBUTING.md for guidelines and rationale.
Purpose: This project implements the ADIOS (Automatic DIstillation Of Structure) algorithm for unsupervised grammar induction from text corpora. It reads a set of sentences and induces grammatical patterns, which can be used for natural language processing and parser development.
src/– All implementation files (C++ sources)include/madios/– All public headersinclude/madios/utils/– Utility headersinclude/madios/maths/– Math headersinclude/madios/maths/tnt/– Math library headerstests/– All test code and test data
mkdir -p build
cd build
cmake ..
makeThe resulting executable will be build/madios.
./build/madios <filename> <eta> <alpha> <context_size> <coverage> [-o <outputfile>] [--format <format>] [number_of_new_sequences]
<filename>: Path to the input corpus file (see below for format).<eta>: Threshold for detecting divergence in the RDS graph (e.g., 0.9).<alpha>: Significance test threshold (e.g., 0.01 or 0.1).<context_size>: Size of the context window for equivalence class search (e.g., 5).<coverage>: Threshold for bootstrapping equivalence classes (e.g., 0.65).-o <outputfile>: (Optional) Write all output to the specified file instead of stdout.--format <format>: Output format:json,pcfg, ortext(default:text).[number_of_new_sequences]: (Optional) Number of new sequences to generate from the grammar.
| Option/Arg | Description | Default |
|---|---|---|
<filename> |
Input corpus file (required) | — |
<eta> |
Divergence threshold (e.g., 0.9, required) | 0.9 |
<alpha> |
Significance threshold (e.g., 0.01, required) | 0.01 |
<context_size> |
Context window size (e.g., 5, required) | 5 |
<coverage> |
Coverage threshold (e.g., 0.65, required) | 0.65 |
[number_of_new_sequences] |
Number of new sequences to generate from the grammar | 0 |
-o, --output |
Output file (writes all output to file instead of stdout) | stdout |
--format <format> |
Output format: json, pcfg, or text (default: text) | text |
--verbose |
Enable verbose progress/info output | off |
--quiet |
Suppress all non-error output (overrides --verbose) | off |
- Output is printed to stdout or the file specified by
-o/--output, regardless of format. - Progress and info messages are printed only if
--verboseis set and--quietis not set. - Errors are always printed to stderr.
./build/madios test/test_madios.txt 0.9 0.01 5 0.65 --format json
./build/madios test/test_madios.txt 0.9 0.01 5 0.65 --format pcfg -o grammar.txt
./build/madios test/test_madios.txt 0.9 0.01 5 0.65 --verboseEach line is a sentence. ADIOS-style input uses * and # as start/end markers, but plain space-separated text is also accepted. Example:
* Cindy believes that Joe believes that to please is easy #
* Cindy believes that Cindy believes that to read is tough #
* Pam thinks that Jim believes that to please is tough #
...
When run with --format pcfg, the output will look like:
E27 -> believes [0.730769]
E27 -> thinks [0.269231]
P28 -> E27 that [1]
E29 -> read [0.5]
E29 -> please [0.5]
P30 -> P28 to E29 is [1]
E31 -> Cindy [0.4]
E31 -> Jim [0.1]
E31 -> Beth [0.1]
E31 -> Joe [0.4]
P32 -> P28 E31 P30 [1]
E33 -> cat [0.3]
E33 -> cow [0.1]
E33 -> horse [0.4]
E33 -> dog [0.2]
P34 -> the E33 [1]
P35 -> to E29 [1]
E36 -> easy [0.666667]
E36 -> eager [0.333333]
P37 -> is E36 P35 [1]
S -> Beth P28 George P30 easy [0.05]
S -> Beth P32 tough [0.15]
S -> Cindy P28 George P30 easy [0.05]
S -> Cindy P32 easy [0.15]
S -> Cindy P32 tough [0.05]
S -> Pam P28 Pam P30 easy [0.05]
S -> Pam P32 easy [0.1]
S -> Pam P32 tough [0.05]
S -> that Cindy P37 annoys Cindy [0.05]
S -> that Cindy P37 bothers P34 [0.05]
S -> that P34 P37 annoys Jim [0.05]
S -> that P34 P37 annoys P34 [0.05]
S -> that P34 P37 disturbs P34 [0.05]
S -> that P34 P37 worries P34 [0.05]
S -> that P34 is tough P35 disturbs P34 [0.05]
Probabilities are normalized for each LHS (left-hand side).
- Default output: Human-readable, includes corpus, lexicon, search paths, and grammar.
- -o : All output is written to the specified file instead of stdout.
- --format json: Machine-readable JSON with all results.
- --format pcfg: Only the PCFG rules, suitable for downstream parsing tools.
This project uses Catch2 for unit testing. To set up and run tests:
- Download the Catch2 single-header file (if not already present):
curl -L -o tests/catch.hpp https://github.com/catchorg/Catch2/releases/download/v2.13.10/catch.hpp- Build and run tests:
./build_and_install.sh
./run_all_tests.shOr, to run the main C++ test binary directly:
./build/tests_basicOr, to run all tests with CTest:
cd build
ctest --output-on-failureA shell script test is provided to check the CLI --pcfg output:
cd tests
./test_cli_pcfg_output.shThis checks that the CLI produces valid PCFG output and that probabilities are normalized.
See License.txt for details. This project is provided under a permissive license (original and MIT). Use at your own risk. The code has not been extensively tested and may contain bugs or undiscovered regressions. I are not responsible for any issues that may arise from using this code.