Directories:
- Data Eval- Data evaluation directory containing all of the scripts to measure data quality metrics on synthetic and real data
- Phi3_Generations- Directory contains the scripts to run inference on the Phi model and synthetic datasets
- Gemma- Directory contains code and generated datasets related to the Gemma model
- Tinyllama- Directory contains code and generated datasets related to the Tinyllama model
- Opt_finetuning- Code to initially finetune the Opt model. Additional scripts are hosted in the model folder
- Opt_inference- Code to generate outputs from the finetuned Opt model
- Qwen- Directory contains code to run inference on Qwen and some sample outputs
- synthetic_training_data- Contains some of our synthetic training data; most of it is hosted on Google Drive (see below)
Misc files:
- Data cleaning notebook for preprocessing the CNN dataset
- Loss calculation script for calculating cross-entropy loss on each iteration of our models
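
As a rough illustration of what the cross-entropy loss measures, here is a minimal pure-Python sketch. It is not the repository's loss script: the function name `cross_entropy` and the input layout (one list of raw vocabulary scores per position, plus the index of the correct token at each position) are assumptions made for this example. The actual script may instead rely on a deep-learning framework's built-in loss.

```python
import math


def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy (in nats) over one sequence.

    logits:     list of per-position lists of raw scores over the vocabulary
                (hypothetical layout, chosen for illustration).
    target_ids: index of the correct token at each position.
    """
    if len(logits) != len(target_ids):
        raise ValueError("logits and target_ids must have the same length")
    if not logits:
        raise ValueError("need at least one position")

    total = 0.0
    for scores, target in zip(logits, target_ids):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the correct token.
        total += log_z - scores[target]
    return total / len(target_ids)
```

A lower value means the model assigns higher probability to the reference tokens. Computing it with the same function for every model iteration keeps the numbers comparable.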
Models and Large Files:
- Models are stored on Google Drive here: https://drive.google.com/drive/folders/1Tt969nXSbSrpvLrTdMy5PI90ZEAvKXcw?usp=drive_link, https://drive.google.com/drive/folders/1vKgcGXMMV2Jr3s-72XFr3gfzjJ6UV8GI?usp=drive_link
- Synthetic training data is stored here: https://drive.google.com/drive/folders/1SEwU3maTAVwUAtTE6R2-x5bGxxJvK98P?usp=drive_link