NicolasSR/MioFFAn

MioFFAn: An annotation software for Formula Formalization with partial automation.

MioFFAn (Math identifier-oriented Formula Formalization Annotator) is a tool for annotating symbolic code that represents specific mathematical expressions within STEM documents. The framework makes the process intuitive and fine-grained: the annotator first records the meaning and properties of the symbols in a formula, then grounds them in the rest of the document's content, and finally writes the symbolic code according to the annotated variables and a customizable grammar of operators. We refer to this complete task as Formula Formalization.

The taxonomy of concepts to annotate, their properties, and the operators to use are all customizable, which makes the software applicable to field-specific, real-world research documents.

The framework supports automation of certain tasks via an interface to a local LLM server. Provisional automation approaches are available, although they are still a work in progress.

MioFFAn builds on the MioGatto annotation tool (github.com/wtsnjp/MioGatto).

System requirements

  • Python 3 (3.9 or later)
  • A web browser with MathML support (for the GUI annotation system)

Installation

All dependencies can be installed in one step:

python -m pip install -r requirements.txt

If you don't want to install the dependencies system-wide, consider using venv.

Usage

The client is written in TypeScript. To compile it, run:

cd client
npm install
npm run build

To obtain samples to work on (only ScienceDirect papers are supported at the moment), find their PII identifiers and list them in sourcing_info/sources_config.json. Additionally, copy the file credentials.distr.json to credentials.json within the same directory and replace the value of the "key" field with your ScienceDirect API key. Then, from the root directory, run:

python -m tools.source_samples

To see all options for this tool, run:

python -m tools.source_samples -h
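The credentials step described above can also be scripted. The following is only a sketch under assumptions: it supposes that "key" is a top-level field of credentials.distr.json and that the directory passed in is the one holding that template. The helper name write_credentials is hypothetical and not part of MioFFAn.

```python
import json
from pathlib import Path


def write_credentials(directory, api_key):
    """Create credentials.json from credentials.distr.json with "key" filled in.

    Hypothetical helper: assumes "key" is a top-level field of the template.
    """
    directory = Path(directory)
    template_path = directory / "credentials.distr.json"
    credentials = json.loads(template_path.read_text(encoding="utf-8"))

    # Replace the placeholder value with the user's ScienceDirect API key.
    credentials["key"] = api_key

    target_path = directory / "credentials.json"
    target_path.write_text(json.dumps(credentials, indent=2) + "\n", encoding="utf-8")
    return target_path


# Example call (the directory is an assumption based on the text above):
# write_credentials("sourcing_info", "YOUR-SCIENCEDIRECT-API-KEY")
```

Writing the key through a script keeps the distributed template untouched, so credentials.json can stay out of version control.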

Finally, to start the MioFFAn server, run:

python -m server

Then open the client in a web browser at http://localhost:4100/
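If the page does not load, a quick reachability check can tell whether the server is listening at all. This is a minimal sketch using only the Python standard library. The function name server_is_reachable and the retry values are illustrative assumptions; only the URL comes from the instructions above.

```python
import time
import urllib.error
import urllib.request


def server_is_reachable(url="http://localhost:4100/", attempts=10, delay=1.0):
    """Return True as soon as anything answers HTTP at `url`, else False."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status code.
            return True
        except OSError:
            # Covers connection refused, timeouts and urllib.error.URLError.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


# Example call:
# print(server_is_reachable())
```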

OCR for PDF files

The sample preprocessing routine accepts papers in PDF format. It converts them to HTML via OCR (https://huggingface.co/datalab-to/chandra-ocr-2); the resulting HTML file may require some manual tweaking. To use this functionality, a vLLM server must be hosting the chandra model. Run:

vllm serve datalab-to/chandra-ocr-2 --served-model-name=chandra

Then, place the corresponding PDF file in the manual_sources/ directory and specify the sample info under the "manual_pdf" key of the sources_config.json file. Finally, run the source sampling tool (tools.source_samples) as usual.
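Before running the tool, it can help to confirm that the vLLM server really exposes a model named chandra. The sketch below queries the OpenAI-compatible /v1/models endpoint that vLLM serves. It assumes the server runs on vLLM's default address, http://localhost:8000, which MioFFAn may or may not use. The function names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def model_is_served(name="chandra", base_url="http://localhost:8000"):
    """Return True if `name` is among the models the server reports.

    Assumption: base_url points at the vLLM server started above.
    """
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as response:
        payload = json.load(response)
    return name in served_model_ids(payload)


# Example call (raises an OSError subclass if the server is not running):
# print(model_is_served())
```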

Evaluation

Before running an evaluation, the relevant annotation files must be checkpointed using the tag schema defined in EVALUATION_SCHEMA in ./tools/evaluate_llm.py. Once the checkpoints for the tasks to evaluate have been created, run:

python -m tools.evaluate_llm

By default, it evaluates all possible tasks for all possible samples. Alternatively, you may specify a particular task and sample to evaluate (see the tool's help page with the -h option). The output is a stringified JSON with the different results.
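To post-process the results, the output can be captured and decoded from Python. This sketch assumes that the JSON is the only content written to standard output and makes no assumption about the keys inside it. Since "stringified" could mean the document is itself a JSON-encoded string, the parser decodes a second time in that case. Both function names are hypothetical.

```python
import json
import subprocess
import sys


def parse_results(text):
    """Decode the evaluation output into Python objects."""
    results = json.loads(text)
    # If the decoded value is still a string, treat it as JSON once more.
    if isinstance(results, str):
        results = json.loads(results)
    return results


def run_evaluation():
    """Run the evaluation tool from the repository root and parse its output.

    Assumption: stdout contains nothing but the JSON results.
    """
    completed = subprocess.run(
        [sys.executable, "-m", "tools.evaluate_llm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_results(completed.stdout)


# Example call, from the repository root:
# results = run_evaluation()
```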

Acknowledgements

This project has been funded by the Joan Oró 2024 scholarship (2024 FI-1 00089) from AGAUR (Catalonia, Spain).

License

Copyright 2026 Nicolas Sibuet Ruiz (NicolasSR). This software is licensed under the MIT license.

Third-party software
