Aims to build a task-specific LLM that is helpful for QnA-based tasks.
This repo contains the code for research on Large Language Models applied to QnA-based data. The research uses the public dataset from the Taskmaster repo; the data is preprocessed and then used to fine-tune a Flan-T5 model.
The data for this research has been added to the `/data` folder.
- Python 3.11 or above
- PyTorch
- The remaining packages, installed from the `requirements.txt` file using:
```
pip install -r requirements.txt
```

In the present study, our focus was primarily on the assistant's (AI agent's) dialogue within conversations. As a result, the dataset was arranged with the following approach: in each window of three consecutive turns, the assistant's turn is masked.
Inside the `code` folder, the data is prepared for training in the following format.

If the conversation turns are C1, C2, C3, C4, C5, C6, C7 ...
| Instance | Inputs | Labels |
|---|---|---|
| First Instance | C1 `<mask>` C3 | C1 C2 C3 |
| Second Instance | C3 `<mask>` C5 | C3 C4 C5 |
| Third Instance | C5 `<mask>` C7 | C5 C6 C7 |
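The windowing above can be sketched in a few lines of Python. This is a minimal sketch, not the repo's actual `data.py`: the `make_instances` name is hypothetical, and the stride of 2 is inferred from the table (consecutive windows share their boundary turns).

```python
# Hypothetical helper illustrating the 3-turn windowing scheme:
# slide over the conversation in steps of 2 and mask the middle
# (assistant) turn of each 3-turn window.
def make_instances(turns, mask_token="<mask>"):
    instances = []
    for i in range(0, len(turns) - 2, 2):
        window = turns[i:i + 3]
        inputs = f"{window[0]} {mask_token} {window[2]}"
        labels = " ".join(window)
        instances.append((inputs, labels))
    return instances

turns = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
for inp, lab in make_instances(turns):
    print(inp, "->", lab)
# C1 <mask> C3 -> C1 C2 C3
# C3 <mask> C5 -> C3 C4 C5
# C5 <mask> C7 -> C5 C6 C7
```

Each input keeps the two human turns as context while the label restores the full window, so the model learns to fill in the masked assistant turn.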
📖 Note that only the bot's turns in the Taskmaster dataset are masked.
The training data used in the Dialog Inpainting paper is prepared by selecting a random turn from each dialogue and masking it; this prepared data is then fed to the T5 model. The example below illustrates a random turn being masked during training with the Taskmaster dataset.

If there are three conversations C1, C2, C3, C4, C5 ...; D1, D2, D3, D4, D5 ...; and E1, E2, E3, E4, E5 ...
| Instance | Inputs | Labels |
|---|---|---|
| First Instance | C1 `<mask>` C3 C4 C5 ... | C1 C2 C3 C4 C5 ... |
| Second Instance | D1 D2 D3 `<mask>` D5 ... | D1 D2 D3 D4 D5 ... |
| Third Instance | E1 `<mask>` E3 E4 E5 ... | E1 E2 E3 E4 E5 ... |
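The inpainting-style preparation above can be sketched as follows. This is a hedged sketch, not the paper's code: `inpaint_instance` is a hypothetical helper that masks one randomly chosen turn per conversation.

```python
import random

# Hypothetical helper illustrating the Dialog Inpainting-style scheme:
# replace one randomly chosen turn with the mask token in the input,
# while the label keeps the full conversation.
def inpaint_instance(turns, mask_token="<mask>", rng=random):
    idx = rng.randrange(len(turns))
    inputs = " ".join(mask_token if i == idx else t for i, t in enumerate(turns))
    labels = " ".join(turns)
    return inputs, labels

random.seed(0)
print(inpaint_instance(["D1", "D2", "D3", "D4", "D5"]))
```

Unlike the 3-turn windows, each instance here spans the whole conversation, with exactly one turn masked.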
- Navigate to the `code` folder and run the `data.py` file:
```
python data.py
```
- Notice that a `three_sentenced_data.csv` file should have been created.
- Next, run the `model.py` file:
```
python model.py
```
- Note the following files are created:
  - `saved_model` - this folder contains the fine-tuned Flan-T5 model
  - `results` - this folder contains the predictions
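Once training has produced `saved_model`, the fine-tuned model can be loaded for inference. This is a hedged sketch under stated assumptions: the `answer` helper and the generation settings are not part of the repo, and it requires the `transformers` and `torch` packages.

```python
# Hypothetical inference helper; `saved_model` is the directory name
# produced by model.py above.
def answer(prompt, model_dir="saved_model", max_new_tokens=64):
    # Lazy import so the sketch can be read without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For example, `answer("C1 <mask> C3")` would return the model's completion for the masked assistant turn.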
My sincere gratitude to my professor, Procheta Sen, for actively participating in my research. Her support, guidance, and expertise have shaped its success and inspired me to strive for academic excellence. I am deeply grateful for her invaluable mentorship, which has fostered my passion and personal growth.
This project uses code from the following sources:
- task2kb-resource by Xi Wang, licensed under the MIT License.

