MioFFAn: An annotation software for Formula Formalization with partial automation.

MioFFAn (Math identifier-oriented Formula Formalization Annotator) is a tool for annotating the symbolic code that represents specific mathematical expressions within STEM documents. The framework makes the process intuitive and fine-grained: the annotator first records the meaning and properties of the symbols in the formula, then grounds them to the rest of the document's content, and finally writes the symbolic code according to the annotated variables and a customizable grammar of operators. We refer to this complete task as Formula Formalization.

The taxonomy of concepts to annotate, their properties, and the operators to use are all customizable, which makes this software applicable to field-specific, real-world research documents.

The framework supports automation of certain tasks via an interface to a local LLM server. Provisional automation approaches are available, although they are still a work in progress.

MioFFAn builds on the MioGatto annotation tool (github.com/wtsnjp/MioGatto).

System requirements

Python 3 (3.9 or later)
A Web Browser with MathML support (for the GUI annotation system)
- Firefox is recommended

Installation

All dependencies can be installed in one step:

python -m pip install -r requirements.txt

If you prefer not to install the dependencies system-wide, consider using a virtual environment (venv).
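As an illustration of the venv option, the following is a minimal sketch that uses Python's standard-library venv module to create an isolated environment and then runs the same pip command with that environment's interpreter. The function names and the `.venv` directory name are assumptions made for this example and are not part of MioFFAn.

```python
import subprocess
import sys
import venv
from pathlib import Path


def create_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment and return the path to its interpreter."""
    venv.create(env_dir, with_pip=with_pip)
    # The interpreter location differs between Windows and POSIX layouts.
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_requirements(python: Path, requirements: Path) -> None:
    """Run `pip install -r <requirements>` with the given interpreter."""
    subprocess.run(
        [str(python), "-m", "pip", "install", "-r", str(requirements)],
        check=True,  # raise if pip reports a failure
    )
```

From the repository root, calling `install_requirements(create_env(Path(".venv")), Path("requirements.txt"))` would install the dependencies into `.venv` instead of the system Python. The usual command-line route (`python -m venv .venv`, then activating the environment) achieves the same result.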

Usage

The client is written in TypeScript. To compile it, run:

cd client
npm install
npm run build

To obtain samples to work on (only ScienceDirect papers at the moment), find their PII identifiers and list them in sourcing_info/sources_config.json. In addition, copy the file credentials.distr.json to credentials.json within the same directory and set its "key" field to your ScienceDirect API key. Then, from the root directory, run:

python -m tools.source_samples

To see all options for this tool, run:

python -m tools.source_samples -h
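The credentials step described above can also be scripted. The sketch below copies the template to credentials.json in the same directory and fills in the "key" field. It assumes the template is a JSON object with a top-level "key" field. The function name `write_credentials` is hypothetical and is not provided by MioFFAn.

```python
import json
from pathlib import Path


def write_credentials(template: Path, api_key: str) -> Path:
    """Copy the template next to itself as credentials.json with "key" set."""
    data = json.loads(template.read_text(encoding="utf-8"))
    # Assumption: the template is a JSON object with a top-level "key" field.
    if not isinstance(data, dict) or "key" not in data:
        raise ValueError(f'{template} has no top-level "key" field')
    data["key"] = api_key
    # Write the result alongside the template, leaving the template untouched.
    target = template.with_name("credentials.json")
    target.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return target
```

Any other fields in the template are carried over unchanged, so only the API key differs between the two files. Keep the resulting credentials.json out of version control, since it contains a personal key.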

Finally, to start the MioFFAn server, run:

python -m server

Then access the client from a web browser at http://localhost:4100/

OCR for PDF files

The sample preprocessing routine accepts papers in PDF format. It converts them to HTML via OCR (https://huggingface.co/datalab-to/chandra-ocr-2), and the resulting HTML file may require some manual tweaking. To use this functionality, a vLLM server needs to be hosted with the chandra model. Run:

vllm serve datalab-to/chandra-ocr-2 --served-model-name=chandra

Then, place the corresponding PDF file in the manual_sources/ directory and specify the sample info under the "manual_pdf" key of the sources_config.json file. Finally, run the sample sourcing tool as usual.
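Before running the sourcing tool, it can help to confirm that the vLLM server is up and exposes the model under the name chandra. The sketch below queries the OpenAI-compatible /v1/models endpoint that vLLM serves. It assumes the server listens on vLLM's default address, http://localhost:8000, which may differ from your setup. The function names are illustrative and not part of MioFFAn.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def chandra_is_served(
    base_url: str = "http://localhost:8000",  # assumption: vLLM default port
    name: str = "chandra",  # matches --served-model-name above
) -> bool:
    """Query /v1/models and check whether the served model name is listed."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as response:
        payload = json.load(response)
    return name in served_model_ids(payload)
```

`chandra_is_served()` raises `urllib.error.URLError` when nothing is listening at the given address, which distinguishes "server not started" from "server started with a different model name" (a plain `False`).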

Evaluation

Before running an evaluation, the relevant annotation files must be checkpointed using the tag schema defined in EVALUATION_SCHEMA in ./tools/evaluate_llm.py. Once the checkpoints for the tasks to evaluate have been created, run:

python -m tools.evaluate_llm

By default, it evaluates all possible tasks for all possible samples. Alternatively, you may specify a particular task and sample to evaluate (see the tool's help page with the -h option). The output is a stringified JSON with the different results.
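Since the output is stringified JSON, it can be decoded for further processing. The sketch below runs the default evaluation and parses the result. It assumes the JSON is the only content written to standard output. It also decodes a second time in case the JSON value is itself a string. The structure of the decoded results is not specified here, and the function names are hypothetical.

```python
import json
import subprocess
import sys


def parse_results(stdout: str):
    """Decode the tool's output, decoding once more if the value is a string."""
    results = json.loads(stdout)
    if isinstance(results, str):
        # Covers the case where the JSON document was serialized twice.
        results = json.loads(results)
    return results


def run_evaluation():
    """Run the default evaluation from the repository root and decode it."""
    completed = subprocess.run(
        [sys.executable, "-m", "tools.evaluate_llm"],
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with an error
    )
    # Assumption: standard output contains only the JSON results.
    return parse_results(completed.stdout)
```

If the tool also prints log lines to standard output, `json.loads` raises `json.JSONDecodeError`, and the output would need to be filtered first.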

Acknowledgements

This project has been funded by the Joan Oró 2024 scholarship (2024 FI-1 00089) from AGAUR (Catalonia, Spain).

License

Third-party software

MioGatto: Copyright 2021 Takuto Asakura (wtsnjp). Licensed under the MIT license.
jQuery: Copyright JS Foundation and other contributors. Licensed under the MIT license.
jQuery UI: Copyright jQuery Foundation and other contributors. Licensed under the MIT license.
