BT4103 Business Analytics Capstone Project

Developing a Smart Search System for Document Retrieval and Information Extraction

This project aims to develop a smart search system for enhanced document retrieval and information extraction, by leveraging on and integrating state-of-the-art Natural Language Processing (NLP) techniques.

Dataset Used

The primary dataset of this project is the Stanford Question Answering Dataset 2.0 (SQuAD 2.0).

SQuAD is a reading comprehension dataset, consisting of questions queried to a set of Wikipedia compositions. The answer to each question is a snippet of text or span, from the corresponding context paragraph. Alternatively, the question might be unanswerable. SQuAD 2.0 combines the 107,786 questions in the first version of SQuAD 1.1 with over 50,000 unanswerable questions written to look identical to the answerable ones. There are a total of 19,035 context paragraphs to which the questions are being queried to.

An example of a context paragraph would look like this:
North Carolina consists of three main geographic sections: the Atlantic Coastal Plain, which occupies the eastern 45% of the state; the Piedmont region, which contains the middle 35%; and the Appalachian Mountains and foothills. The extreme eastern section of the state contains the Outer Banks, a string of sandy, narrow barrier islands between the Atlantic Ocean and two inland waterways or "sounds": Albemarle Sound in the north and Pamlico Sound in the south. They are the two largest landlocked sounds in the United States.

Some examples of the question-answer pair that are queried to this context paragraph are:

How many main geographical sections make up North Carolina? A: three
What section of North Carolina makes up 45% of the state? A: the Atlantic Coastal Plain
What is the section in the middle 35% of North Carolina called? A: the Piedmont region

Frontend Preview:

Home Page:

Changes to Home Page after asking a question:

Performance Page:

About Page:

Before running any files, kindly install the relevant dependencies

pip install -r requirements.txt

FrontEnd Workflow

Simply run app.py in the console

python app.py

BiDAF Workflow

All user-tunable parameters can be found in config.py

Firstly, ensure that the GloVe corpus has been downloaded and placed in the the BiDAF directory, the GloVe corpus can be downloaded from : https://nlp.stanford.edu/projects/glove/

Secondly, ensure that the data folder that has been submitted seperately (certain files had size exceeding the threshold on Github). Another way of getting the processed data would be to run the python script make_dataset.py, however since there are many implementations that deviates slightly, there is a need to made slight adjustments to return all data files. The make_dataset.py is now configured to produced default GloVe Embeddings and trained character embeddings.

python make_dataset.py

With regard to training the model, you may just run the python script train.py. Similar to the case of dataset creation, the train.py is configured to train the Hybrid Model (GloVe Embeddings + Trained Character Embeddings + BERT Embeddings)

python train.py

To test the performance of the model on the test/ Dev set ; the test set is not configured to test the the Hybrid Model (GloVe Embeddings + Trained Character Embeddings + BERT Embeddings)

python test.py

Lastly to use the model, you may want to take a look into eval.py. It has a function that takes in a context paragraph as well as a question and it returns the model's prediction. The eval.py will also be the Python script that is called by the Frontend

python eval.py

To create the Embeddings that combines the GloVe embeddings with the BERT embeddings : Kindly refer to combine_embeddings.ipynb

BERT Model Workflow

BERT_Train.ipynb contains the codes to train the BERT model with the tuned parameters.

BERT_Test.ipynb contains the codes to test the BERT model to produce an answer when passing a context and query, after the model weights has been inserted.

bert.h5 contains the model weights.

bert_optimizer_tunning_visualization.ipynb contains the visualization for the optimizer parameters tuning process.

Usage

Upload both Jupyter Notebooks to Google Collaboratory
Change the runtime type under the tab runtime to TPU

Run the BERT_Train notebook by clicking on Run all under runtime tab
Save the bert.h5 file from the google collaboratory files on the left of the webpage

Upload the bert.h5 file into google drive files by uploading to the session storage
Run the BERT_Test notebook by clicking on Run all under runtime

Name		Name	Last commit message	Last commit date
Latest commit History 167 Commits
.idea		.idea
.ipynb_checkpoints		.ipynb_checkpoints
.vscode		.vscode
BERT_Final		BERT_Final
BiDAF		BiDAF
__pycache__		__pycache__
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BT4103 Business Analytics Capstone Project

Developing a Smart Search System for Document Retrieval and Information Extraction

Dataset Used

Frontend Preview:

Before running any files, kindly install the relevant dependencies

FrontEnd Workflow

BiDAF Workflow

BERT Model Workflow

Usage

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BT4103 Business Analytics Capstone Project

Developing a Smart Search System for Document Retrieval and Information Extraction

Dataset Used

Frontend Preview:

Before running any files, kindly install the relevant dependencies

FrontEnd Workflow

BiDAF Workflow

BERT Model Workflow

Usage

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages