Overview

Suppose we have the acronym CSS. It can mean "Cascading Style Sheets" or "Chirp Spread Spectrum" depending on the context. This work trains an ML model to disambiguate acronyms based on the context.

This work is derived from the work of Varma and Gardner (2017), with some updates by Reddy (2020). It focuses on the domains relevant to Devopedia: Computer Science, Electronics and Telecommunications.

Installation

If running in Google Colab, installation is part of the Jupyter Notebook Acronyms.ipynb.

If not running in Google Colab, run pip install -r requirements.txt to set up the Python dependencies.

Process

Data Collection:
- Wikipedia is used as the data source. We start with a seed in data/seed.json.
- Articles in the seed have been pre-selected from a Wikipedia page on Computing & IT abbreviations.
- From the seed, obtain Wikipedia page titles and URLs: python get_urls.py. Output is saved in data/data.csv.
- For all URLs in data/data.csv, download and save content: python download.py. Downloaded files are saved in data/train and data/test folders.
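
The title-to-URL step can be sketched as follows. This is only an illustration: the format of data/seed.json and the internals of get_urls.py are not described here, so the sketch assumes a plain list of page titles, and the function names and the title/url column names are hypothetical.

```python
import csv
import io
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def title_to_url(title: str) -> str:
    """Build an English Wikipedia URL from a page title (hypothetical helper)."""
    # Wikipedia URLs use underscores for spaces; other characters are percent-encoded.
    return WIKI_BASE + quote(title.strip().replace(" ", "_"))


def titles_to_csv(titles) -> str:
    """Return CSV text with assumed 'title' and 'url' columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "url"])
    for title in titles:
        writer.writerow([title, title_to_url(title)])
    return buffer.getvalue()


if __name__ == "__main__":
    print(titles_to_csv(["Cascading Style Sheets", "Arithmetic logic unit"]))
```
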
Data Pre-processing:
- Extract acronym definitions and context by treating this as a Constraint Satisfaction Problem (CSP). In folder csp, call python main.py.
- Extracted data is saved in data/definitions.csv.
- TODO: This extraction is not working very well at the moment. File data/definitions.csv has been manually edited.
- The database is updated with python add2db.py, which reads data/definitions.csv.
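
To give a feel for what definition extraction involves, here is a minimal sketch that uses a simple initial-letter heuristic. It is not the CSP formulation used in the csp folder: it only handles the pattern "long form (ACRONYM)" and checks that the initials of the preceding words spell the acronym. The function name is hypothetical.

```python
import re

# An all-caps token of 2 to 10 letters inside parentheses, e.g. "(ALU)".
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z]{2,10})\)")


def extract_definitions(text):
    """Return (acronym, long_form) pairs found via initial-letter matching."""
    results = []
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        # Words that appear before the parenthesised acronym.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        # Take as many preceding words as the acronym has letters.
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            results.append((acronym, " ".join(candidate)))
    return results


if __name__ == "__main__":
    sample = (
        "An arithmetic logic unit (ALU) is a circuit. "
        "Chirp Spread Spectrum (CSS) is a modulation technique."
    )
    for acronym, long_form in extract_definitions(sample):
        print(acronym, "=", long_form)
```

A heuristic like this misses long forms that contain stop words or that do not map one word to one letter, which is part of why a more flexible formulation is needed.
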
Model Training and Validation:
- Data is read from the database. Downloaded content is also used.
- Train by calling python train.py. Multiple classifier models are saved as trained_models/*.pkl files.
Model Use:
- Call python serve.py {model} {some string with acronym}, such as python serve.py svc 'ALU is an essential part of a computer along with memory and peripherals.'
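
The idea of picking an expansion from the surrounding words can be illustrated with a small standard-library sketch. This is not what train.py or serve.py do: the repository trains classifier models such as svc, whereas the sketch below swaps in a plain word-overlap scorer. The class name and the two training sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class ContextDisambiguator:
    """Toy disambiguator: picks the expansion whose training contexts
    share the most distinct words with the query context."""

    def __init__(self):
        self.profiles = {}  # expansion -> Counter of context words

    def train(self, examples):
        for context, expansion in examples:
            self.profiles.setdefault(expansion, Counter()).update(tokenize(context))

    def predict(self, context):
        tokens = set(tokenize(context))

        def score(expansion):
            profile = self.profiles[expansion]
            return sum(1 for token in tokens if token in profile)

        return max(self.profiles, key=score)


# Made-up training examples, one per expansion of CSS.
model = ContextDisambiguator()
model.train([
    ("CSS rules control the layout and fonts of a web page in the browser",
     "Cascading Style Sheets"),
    ("CSS modulation uses chirp signals for long range radio links",
     "Chirp Spread Spectrum"),
])

if __name__ == "__main__":
    print(model.predict("The browser applies CSS to style the web page"))
    print(model.predict("LoRa radio links rely on CSS chirp signals"))
```

A trained classifier generalizes far better than raw word overlap, but the input and output have the same shape: a sentence containing the acronym goes in, and the most likely expansion comes out.
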
