MioFFAn (Math identifier-oriented Formula Formalization Annotator) is a tool for annotating symbolic code that represents specific mathematical expressions within STEM documents. The framework makes the process intuitive and fine-grained: first the meaning and properties of the symbols within the formula are annotated, then they are grounded in the rest of the document's content, and finally the symbolic code is written according to the annotated variables and a customizable grammar of operators. We refer to this complete task as Formula Formalization.
Customization capabilities, in terms of the taxonomy of concepts to annotate, their properties, and the operators to use, make this software applicable to field-specific, real-world research documents.
The framework supports automating certain tasks via an interface to a local LLM server. Provisional automation approaches are available, although they are still a work in progress.
MioFFAn builds on the MioGatto annotation tool (github.com/wtsnjp/MioGatto).
- Python3 (3.9 or later)
- A Web Browser with MathML support (for the GUI annotation system)
- Firefox is recommended
All dependencies can be installed in one shot:
python -m pip install -r requirements.txt
In case you don't want to install the dependencies into your system, please consider using venv.
The client is developed in TypeScript. To compile it, run:
cd client
npm install
npm run build
To obtain samples to work on (only ScienceDirect papers at the moment), find their PII identifiers and list them in sourcing_info/sources_config.json. Additionally, copy the file credentials.distr.json to credentials.json within the same directory and set its "key" field to the user's ScienceDirect API key. Then, from the root directory, run:
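Since the config and credentials files are plain JSON, they can also be generated programmatically. The sketch below assumes a minimal schema (a "pii" list in the config and a "key" field in the credentials file); the actual field names should be checked against the distributed files, and the PII shown is a placeholder.

```python
import json

def build_sources_config(piis):
    """Return a sources config dict listing ScienceDirect PII identifiers.

    The {"pii": [...]} shape is an assumption for illustration; consult
    sourcing_info/sources_config.json for the real schema.
    """
    return {"pii": list(piis)}

# Placeholder PII, not a real sample.
config = build_sources_config(["S0000000000000000"])
print(json.dumps(config, indent=2))
```

Writing the result with `json.dump(config, open(...), indent=2)` keeps the file diff-friendly.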
python -m tools.source_samples
You may check the complete options for this tool via:
python -m tools.source_samples -h
Finally, to start the MioFFAn server, run:
python -m server
Then access the client via a web browser at http://localhost:4100/
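A quick way to confirm the server came up before opening the browser is a small reachability check. This is a convenience sketch, not part of MioFFAn; it only assumes the default address above.

```python
import urllib.request

def server_alive(url="http://localhost:4100/", timeout=2.0):
    """Return True if the MioFFAn server answers with HTTP 200 at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False
```

Call `server_alive()` after starting `python -m server`; a `False` result usually means the server is still starting or listening on a different port.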
The sample preprocessing routine accepts papers in PDF format. It converts them to HTML via OCR (https://huggingface.co/datalab-to/chandra-ocr-2) and may require some manual tweaking of the resulting HTML file. To use this functionality, a vLLM server hosting the chandra model is needed. Run:
vllm serve datalab-to/chandra-ocr-2 --served-model-name=chandra
Then place the corresponding PDF file in the manual_sources/ directory and specify the sample info under the "manual_pdf" key of the sources_config.json file. Finally, run the source sampling tool normally.
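vLLM exposes an OpenAI-compatible chat completions API, so an OCR call for one page boils down to a chat request carrying the page image. The sketch below only builds the request payload; the prompt text is a placeholder, and MioFFAn's actual preprocessing prompt and parameters may differ.

```python
import base64
import json

def build_ocr_request(page_png_bytes, prompt="Convert this page to HTML."):
    """Build an OpenAI-compatible chat payload for the served chandra model.

    `page_png_bytes` is the raw PNG of one rendered PDF page; the prompt
    is a hypothetical placeholder.
    """
    image_b64 = base64.b64encode(page_png_bytes).decode("ascii")
    return {
        "model": "chandra",  # matches --served-model-name above
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }

payload = build_ocr_request(b"\x89PNG placeholder bytes")
print(json.dumps(payload)[:60])
```

Such a payload would be POSTed to the server's /v1/chat/completions route (by default vLLM listens on port 8000).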
To perform evaluation, the relevant annotation files must be checkpointed using the tags schema defined in EVALUATION_SCHEMA in ./tools/evaluate_llm.py. Once the checkpoints for the tasks to evaluate have been created, the user may proceed by running:
python -m tools.evaluate_llm
By default, this evaluates all possible tasks for all possible samples. Otherwise, the user may indicate a specific task and sample to evaluate (check the tool's help page with the -h option). The output is a stringified JSON with the different results.
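Because the evaluator prints stringified JSON, its output can be captured and post-processed directly. The inner structure used below (task, sample, metrics) is a hypothetical assumption for illustration, not the documented schema.

```python
import json

# `raw` stands in for the captured stdout of tools.evaluate_llm.
# Task name, sample name, and metric key are made-up placeholders.
raw = '{"grounding": {"sample_01": {"accuracy": 0.91}}}'
results = json.loads(raw)

for task, samples in results.items():
    for sample, metrics in samples.items():
        print(f"{task}/{sample}: {metrics}")
```

In practice the output could be piped to a file and loaded the same way, e.g. `python -m tools.evaluate_llm > results.json`.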
This project has been funded by the Joan Oró 2024 (2024 FI-1 00089) scholarship from AGAUR (Catalonia, Spain).
Copyright 2026 Nicolas Sibuet Ruiz (NicolasSR). This software is licensed under the MIT license.
- MioGatto: Copyright 2021 Takuto Asakura (wtsnjp). Licensed under the MIT license.
- jQuery: Copyright JS Foundation and other contributors. Licensed under the MIT license.
- jQuery UI: Copyright jQuery Foundation and other contributors. Licensed under the MIT license.