Suppose we have the acronym CSS. It can mean "Cascading Style Sheets" or "Chirp Spread Spectrum" depending on the context. This work trains an ML model to disambiguate acronyms based on the context.
This work is derived from the work of Varma and Gardner (2017) with some updates done by Reddy (2020). In particular, this work is concerned with domains relevant to Devopedia (Computer Science, Electronics, Telecommunications).
If running in Google Colab, installation is part of the Jupyter notebook `Acronyms.ipynb`.
Otherwise, run `pip install -r requirements.txt` to set up the Python dependencies.
- Data Collection:
  - Wikipedia is used as the data source. We start with a seed in `data/seed.json`.
  - Articles in the seed have been pre-selected from a Wikipedia page on Computing & IT abbreviations.
  - From the seed, obtain Wikipedia page titles and URLs: `python get_urls.py`. Output is saved in `data/data.csv`.
  - For all URLs in `data/data.csv`, download and save content: `python download.py`. Downloaded files are saved in the `data/train` and `data/test` folders.
- Data Pre-processing:
  - Extract acronym definitions and context by treating this as a Constraint Satisfaction Problem (CSP). In folder `csp`, call `python main.py`.
  - Extracted data is saved in `data/definitions.csv`.
  - TODO: This extraction is not working very well at the moment. File `data/definitions.csv` has been manually edited.
  - DB is updated with `python add2db.py`. This uses the file `data/definitions.csv`.
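
To give a feel for what definition extraction involves, here is a minimal sketch that is not the code in the `csp` folder. It assumes definitions appear as "long form (ACRONYM)" and applies a single constraint: there must be one word per acronym letter, and each word must start with that letter. The function name `find_definitions` and the sample sentence are hypothetical.

```python
import re


def find_definitions(text):
    """Return (acronym, expansion) pairs for patterns like 'long form (LF)'."""
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        # Candidate expansion: as many trailing words as the acronym has letters.
        candidate = words[-len(acronym):]
        # Constraint: one word per letter, each starting with that letter.
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            pairs.append((acronym, " ".join(candidate)))
    return pairs


sample = "An arithmetic logic unit (ALU) works with the central processing unit (CPU)."
definitions = find_definitions(sample)
print(definitions)
```

Real definitions often break this constraint (stop words, letters taken from the middle of a word, plural forms), which is one reason a simple extractor needs manual clean-up of its output.
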
- Model Training and Validation:
  - Data is read from the database. Downloaded content is also used.
  - Train by calling `python train.py`. Multiple classifier models are saved as `trained_models/*.pkl` files.
- Model Use:
  - Call `python serve.py {model} {some string with acronym}`, such as `python serve.py svc 'ALU is an essential part of a computer along with memory and peripherals.'`
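
The idea of context-based disambiguation can be illustrated with a toy, standard-library-only sketch. It is not one of the classifiers saved in `trained_models/`: it simply picks the expansion whose example context shares the most words with the input sentence. The `EXAMPLES` data and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy training data: (acronym, expansion, context sentence).
EXAMPLES = [
    ("CSS", "Cascading Style Sheets",
     "CSS controls the layout and colours of a web page in the browser"),
    ("CSS", "Chirp Spread Spectrum",
     "CSS is a radio modulation technique that spreads a signal over a wide bandwidth"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(examples):
    """Map each acronym to {expansion: Counter of context words}."""
    profiles = {}
    for acronym, expansion, context in examples:
        words = profiles.setdefault(acronym, {}).setdefault(expansion, Counter())
        words.update(tokenize(context))
    return profiles


def disambiguate(profiles, acronym, sentence):
    """Return the expansion whose context words overlap most with the sentence."""
    candidates = profiles.get(acronym, {})
    if not candidates:
        return None  # Unknown acronym.
    tokens = set(tokenize(sentence))
    return max(
        candidates,
        key=lambda expansion: sum(1 for t in tokens if t in candidates[expansion]),
    )


profiles = build_profiles(EXAMPLES)
guess = disambiguate(profiles, "CSS", "The radio uses CSS modulation for long range links")
print(guess)
```

A trained classifier replaces the raw word overlap with learned weights, but the input and output have the same shape: a sentence containing an acronym goes in, and the most likely expansion comes out.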