
DevopediaOrg/acronym-lookup

 
 


Overview

Suppose we have the acronym CSS. It can mean "Cascading Style Sheets" or "Chirp Spread Spectrum" depending on the context. This work trains an ML model to disambiguate acronyms based on the context.

This work is derived from the work of Varma and Gardner (2017), with updates by Reddy (2020). It focuses on the domains relevant to Devopedia: Computer Science, Electronics and Telecommunications.
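To make the idea of context-based disambiguation concrete, here is a minimal sketch using word overlap. It is not the model this repository trains: the sample sentences, the stopword list and the function names (`tokenize`, `build_profiles`, `disambiguate`) are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-written context samples for each expansion of "CSS".
# The project itself takes its training text from Wikipedia instead.
SAMPLES = {
    "Cascading Style Sheets": [
        "CSS rules control the layout, fonts and colours of a web page in the browser",
        "The HTML document links to a CSS file that styles every element",
    ],
    "Chirp Spread Spectrum": [
        "CSS modulation uses wideband chirp pulses for low power radio links",
        "LoRa is a radio technology derived from CSS with a chirp signal",
    ],
}

# Words ignored when scoring (an assumed, deliberately tiny list).
# The acronym itself is included because it appears in every sample.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "with",
             "that", "from", "css"}


def tokenize(text):
    """Lower-case the text and return its alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_profiles(samples):
    """Count word frequencies over the sample sentences of each expansion."""
    profiles = {}
    for expansion, sentences in samples.items():
        counts = Counter()
        for sentence in sentences:
            counts.update(tokenize(sentence))
        profiles[expansion] = counts
    return profiles


def disambiguate(sentence, profiles):
    """Return the expansion whose word profile overlaps most with the sentence."""
    tokens = set(tokenize(sentence))
    scores = {exp: sum(counts[t] for t in tokens)
              for exp, counts in profiles.items()}
    return max(scores, key=scores.get)


PROFILES = build_profiles(SAMPLES)
print(disambiguate("The browser applies CSS to style the page", PROFILES))
print(disambiguate("The radio receiver decodes each CSS chirp", PROFILES))
```

A trained classifier replaces the raw overlap count with learned weights, but the input is the same: the words surrounding the acronym.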

Installation

When running in Google Colab, installation is handled inside the Jupyter Notebook Acronyms.ipynb.

Outside Google Colab, run pip install -r requirements.txt to set up the Python dependencies.

Process

  • Data Collection:
    • Wikipedia is used as the data source. We start with a seed in data/seed.json.
    • Articles in the seed have been pre-selected from a Wikipedia page on Computing & IT abbreviations.
    • From the seed, obtain Wikipedia page titles and URLs: python get_urls.py. Output is saved in data/data.csv.
    • For all URLs in data/data.csv, download and save content: python download.py. Downloaded files are saved in data/train and data/test folders.
  • Data Pre-processing:
    • Extract acronym definitions and their context by treating extraction as a Constraint Satisfaction Problem (CSP). From the csp folder, run python main.py.
    • Extracted data is saved in data/definitions.csv.
    • TODO: This extraction does not yet work well. The file data/definitions.csv has been manually edited.
    • Update the database with python add2db.py, which reads data/definitions.csv.
  • Model Training and Validation:
    • Training data is read from the database, together with the downloaded content.
    • Train by calling python train.py. Multiple classifier models are saved as trained_models/*.pkl files.
  • Model Use:
    • Call python serve.py {model} {some string with acronym}. For example: python serve.py svc 'ALU is an essential part of a computer along with memory and peripherals.'
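The steps above, up to and including training, can be chained in a small driver script. The following is a rough sketch rather than part of the repository: the script names and the csp folder come from the process list, while `STEPS`, `build_commands` and `run_pipeline` are hypothetical helpers. It assumes it is launched from the repository root and defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# (working directory, script) pairs in the order given in the process list.
# Assumption: paths are relative to the repository root.
STEPS = [
    (".", "get_urls.py"),    # seed -> data/data.csv
    (".", "download.py"),    # data/data.csv -> data/train, data/test
    ("csp", "main.py"),      # extract definitions -> data/definitions.csv
    (".", "add2db.py"),      # data/definitions.csv -> database
    (".", "train.py"),       # database + content -> trained_models/*.pkl
]


def build_commands(steps=STEPS, python=sys.executable):
    """Turn each step into a (cwd, argv) pair without running anything."""
    return [(cwd, [python, script]) for cwd, script in steps]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cwd, argv in build_commands():
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Because data/definitions.csv has been manually edited, rerunning the csp step may overwrite those edits. Back up the file first, or drop that step from `STEPS`.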

About

A machine-learning classifier to identify acronym definitions from context
