Aims to build a task-specific LLM that is helpful for QnA-based tasks.
This repo contains the code for research on Large Language Models applied to QnA-based data. The research uses the public dataset from the Taskmaster repo; the data is preprocessed and then used to fine-tune a Flan-T5 model.
The data for this research has been added to the `/data` folder.
- Python 3.11 or above
- PyTorch
- The remaining packages, installed from the `requirements.txt` file using:
```
pip install -r requirements.txt
```

In the present study, our focus was primarily on the assistant's (AI agent's) dialogue within conversations. As a result, the dataset was arranged with the following approach: in each window of three consecutive turns, the assistant's turn is masked.
Inside the `code` folder, the data is prepared for training in the following format.

If the conversation turns are C1, C2, C3, C4, C5, C6, C7 ...
| Instance | Inputs | Labels |
|---|---|---|
| First Instance | C1 `<mask>` C3 | C1 C2 C3 |
| Second Instance | C3 `<mask>` C5 | C3 C4 C5 |
| Third Instance | C5 `<mask>` C7 | C5 C6 C7 |
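The windowing above can be sketched in a few lines of Python. This is a minimal sketch, not the repo's actual `data.py`: the `make_instances` name is hypothetical, and the stride of 2 is inferred from the table (consecutive windows share their boundary turns).

```python
# Hypothetical helper illustrating the 3-turn windowing scheme:
# slide over the conversation in steps of 2 and mask the middle
# (assistant) turn of each 3-turn window.
def make_instances(turns, mask_token="<mask>"):
    instances = []
    for i in range(0, len(turns) - 2, 2):
        window = turns[i:i + 3]
        inputs = f"{window[0]} {mask_token} {window[2]}"
        labels = " ".join(window)
        instances.append((inputs, labels))
    return instances

turns = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
for inp, lab in make_instances(turns):
    print(inp, "->", lab)
# C1 <mask> C3 -> C1 C2 C3
# C3 <mask> C5 -> C3 C4 C5
# C5 <mask> C7 -> C5 C6 C7
```

Each input keeps the two human turns as context while the label restores the full window, so the model learns to fill in the masked assistant turn.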
📖 Note that only the bot's turns in the Taskmaster dataset are masked.
The training data used in the Dialog Inpainting paper is prepared by selecting a random turn from each dialogue and masking it; this prepared data is then fed to the T5 model. The example below illustrates a random turn being masked during training with the Taskmaster dataset.

If there are three conversations C1, C2, C3, C4, C5 ...; D1, D2, D3, D4, D5 ...; and E1, E2, E3, E4, E5 ...
| Instance | Inputs | Labels |
|---|---|---|
| First Instance | C1 `<mask>` C3 C4 C5 ... | C1 C2 C3 C4 C5 ... |
| Second Instance | D1 D2 D3 `<mask>` D5 ... | D1 D2 D3 D4 D5 ... |
| Third Instance | E1 `<mask>` E3 E4 E5 ... | E1 E2 E3 E4 E5 ... |
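The inpainting-style preparation above can be sketched as follows. This is a hedged sketch, not the paper's code: `inpaint_instance` is a hypothetical helper that masks one randomly chosen turn per conversation.

```python
import random

# Hypothetical helper illustrating the Dialog Inpainting-style scheme:
# replace one randomly chosen turn with the mask token in the input,
# while the label keeps the full conversation.
def inpaint_instance(turns, mask_token="<mask>", rng=random):
    idx = rng.randrange(len(turns))
    inputs = " ".join(mask_token if i == idx else t for i, t in enumerate(turns))
    labels = " ".join(turns)
    return inputs, labels

random.seed(0)
print(inpaint_instance(["D1", "D2", "D3", "D4", "D5"]))
```

Unlike the 3-turn windows, each instance here spans the whole conversation, with exactly one turn masked.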
- Navigate to the `code` folder and run the `data.py` file:
```
python data.py
```
- Notice that a `three_sentenced_data.csv` file should have been created.
- Next, run the `model.py` file:
```
python model.py
```
- Note the following files are created:
  - `saved_model` - this folder contains the fine-tuned Flan-T5 model
  - `results` - this folder contains the predictions
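Once training has produced `saved_model`, the fine-tuned model can be loaded for inference. This is a hedged sketch under stated assumptions: the `answer` helper and the generation settings are not part of the repo, and it requires the `transformers` and `torch` packages.

```python
# Hypothetical inference helper; `saved_model` is the directory name
# produced by model.py above.
def answer(prompt, model_dir="saved_model", max_new_tokens=64):
    # Lazy import so the sketch can be read without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For example, `answer("C1 <mask> C3")` would return the model's completion for the masked assistant turn.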
My sincere gratitude to my professor, Procheta Sen, for actively participating in my research. Her support, guidance, and expertise have shaped its success and inspired me to strive for academic excellence. I am deeply grateful for her invaluable mentorship, which has fostered my passion and personal growth.
This project uses code from the following sources:
- task2kb-resource by Xi Wang, licensed under the MIT License.

