Data and code of Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA
The WILOVA-QA dataset and the data used to generate prompts are compressed as password-protected .zip files to prevent direct leakage. The password for decompressing the files is: wilovaqa
Generate prompts -> Run the LLM generation -> Run the LLM-as-a-judge -> Evaluation
Run python generate_prompts.py <language_id> <source_type> to generate a .pkl file of prompts, which is a dictionary of the form dict[str, dict[str, dict]].
'<language_id>' can be 'deu' (for the Bavarian-German pair, deu-bar) or 'zho' (for the Cantonese-Mandarin pair, cmn-yue).
'<source_type>' can be 'dialectqa' or 'eclektic'. Manually edit the list of settings inside generate_prompts.py to select the desired prompt settings.
Usage example: python generate_prompts.py zho dialectqa
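The generated .pkl file can be inspected like any pickled nested dictionary. A minimal sketch of the dict[str, dict[str, dict]] structure (the key and field names below are illustrative, not the actual keys produced by generate_prompts.py):

```python
import pickle

# Illustrative structure: outer key = prompt setting, middle key =
# question id, inner dict = prompt fields. The actual keys are
# defined by generate_prompts.py.
prompts = {
    "zero_shot": {
        "q001": {"prompt": "...", "language": "yue"},
    },
}

# Round-trip through pickle, as the pipeline does.
with open("prompts_zho_dialectqa.pkl", "wb") as f:
    pickle.dump(prompts, f)

with open("prompts_zho_dialectqa.pkl", "rb") as f:
    loaded = pickle.load(f)

# Walk the nested dictionary, e.g. to count prompts per setting.
for setting, questions in loaded.items():
    print(setting, len(questions))
```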
Run python3 -u dialectqa.py <GPU_id(s)> <path_to_pkl_file_of_prompts> <model_name> <tokenizer_path> <model_path> to run the LLM generation. The results will be saved as a .pkl file, which is a dictionary of the form: dict[str, dict[str, dict]].
<GPU_id(s)> may be a single GPU id for smaller models, or four GPU ids for larger models such as llama3_70b and qwen2.5_72b.
After obtaining the results generated by the LLM, run python3 -u dialectqa.py <GPU_id(s)> <path_to_pkl_file_of_results> <model_name> <tokenizer_path> <model_path> to have another LLM evaluate the generated results. The LLM-as-a-judge evaluation results will be appended to the existing results and saved as a .pkl file, which is a dictionary of the form dict[str, dict[str, dict]].
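The "appended to the existing results" step can be pictured as adding a judge field to each inner result dict, so earlier fields survive and the overall type stays dict[str, dict[str, dict]]. A toy sketch (field names like "generation" and "judge_score" are hypothetical; the actual keys are defined in dialectqa.py):

```python
import pickle

# Hypothetical generation results: setting -> question id -> fields.
results = {
    "zero_shot": {
        "q001": {"prompt": "...", "generation": "model answer"},
    },
}

# The judge pass adds its verdict to each inner dict in place,
# preserving the existing generation fields.
for setting in results.values():
    for entry in setting.values():
        entry["judge_score"] = 1  # e.g. 1 = judged correct, 0 = incorrect

# The combined dictionary is then pickled back to disk.
with open("results_with_judge.pkl", "wb") as f:
    pickle.dump(results, f)
```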
After obtaining the LLM-as-a-judge results, run python evaluation.py <path_to_pkl_file_of_LLM-as-a-judge_results> to evaluate all the results (including metrics other than LLM-as-a-judge). The evaluation scores will be printed to stdout.
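Once the judge scores sit in the nested dictionary, a per-setting aggregate can be read off directly. A toy sketch of that final step, assuming a binary judge_score field (hypothetical name; evaluation.py computes additional metrics beyond this):

```python
# Hypothetical judge results: setting -> question id -> fields.
results = {
    "zero_shot": {
        "q001": {"generation": "...", "judge_score": 1},
        "q002": {"generation": "...", "judge_score": 0},
    },
}

# Average the binary judge scores per prompt setting.
for setting, entries in results.items():
    scores = [e["judge_score"] for e in entries.values()]
    print(f"{setting}: {sum(scores) / len(scores):.3f}")  # zero_shot: 0.500
```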