Large language models (LLMs) perform well on standard inputs but struggle with non-standard or transformed language found in real-world digital communication. This paper introduces a multilingual benchmark and evaluation framework designed to assess LLMs' ability to restore original forms and meanings from heavily modified or playful text in both English and Korean. Our dataset includes over 17,000 instances spanning visual and phonetic substitutions, abbreviations, and character-level manipulations. We conduct a comprehensive evaluation of multiple widely used LLMs under three prompting paradigms (Zero-Shot, CoT, CoT+ICL) and explore advanced verification approaches. Results reveal persistent challenges even with sophisticated methods, particularly in Korean tasks involving perceptual and cultural variations. This work establishes a challenging benchmark that remains difficult for current LLMs.
To run the experiment, install the required packages.
Python==3.12.3
pip install -r requirements.txt
Use the Augmentation folder inside Dataset_Building to augment data using the provided pre-augmentation dataset and related scripts. The final dataset is created by merging manually collected and augmented data.
The raw dataset may change depending on the words used, so it is not provided. The sources of the raw dataset are documented in the research paper.
Augmentation Tasks:
- Kor Letter Rotation
- Kor Visual Transform
Dataset_Building/Augmentation/kor_letter_rotation_rule.ipynb: Applies transformation rules for the Korean Letter Rotation Task dataset augmentation.
Dataset_Building/Augmentation/kor_visual_transform_rule.ipynb: Applies transformation rules for the Korean Visual Transform Task dataset augmentation. Manual inspection ensures that the same transformation rule is not applied redundantly.
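Rule-based augmentation of this kind can be sketched roughly as follows. The rule table `VISUAL_RULES` and the function `augment_once` are hypothetical names chosen for illustration, and the three substitutions are only examples of visually similar syllables; the actual rules are defined in the notebooks above. Each rule yields at most one variant per source word, and duplicate outputs are dropped, which approximates the redundancy check described above.

```python
# Hypothetical visual-substitution rules (illustrative only, not the repository's rule set).
VISUAL_RULES = {
    "대": "머",
    "귀": "커",
    "명": "띵",
}


def augment_once(word: str, rules: dict) -> list:
    """Return (rule_source, transformed_word) pairs, one per applicable rule."""
    results = []
    seen = set()
    for source, target in rules.items():
        if source not in word:
            continue  # rule does not apply to this word
        # Replace every occurrence of the source syllable with its look-alike.
        transformed = word.replace(source, target)
        if transformed in seen:
            continue  # skip variants already produced by another rule
        seen.add(transformed)
        results.append((source, transformed))
    return results


pairs = augment_once("명대사", VISUAL_RULES)
for source, transformed in pairs:
    print(source, "->", transformed)
```

Keeping the rule that produced each variant alongside the output makes a later manual inspection easier, since reviewers can see which substitution was applied to which word.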
The collected and augmented data are stored separately in the Dataset_Building/Augmentation/Data directory; together they form the final merged dataset.
To finalize the dataset for the experiment, preprocess the collected and augmented raw data.
Dataset_Building/eng_phonetic_preprocessing.ipynb: Filters phonetic similarity data for the English Phonetic Substitution task using the Double Metaphone algorithm.
Dataset_Building/eng_split_letters.ipynb: Splits collected English words into consonants and vowels for the English Consonant & Vowel Combination task.
Dataset_Building/kor_split_letters.ipynb: Splits collected Korean words into consonants and vowels for the Korean Consonant & Vowel Combination task.
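For the Korean case, splitting a word into consonants and vowels amounts to decomposing each precomposed Hangul syllable into its jamo. The sketch below uses the standard Unicode arithmetic for the Hangul Syllables block (U+AC00 to U+D7A3); the function name `split_hangul` and the choice to emit compatibility jamo are assumptions, and the notebook may use a different method or output format.

```python
# Compatibility jamo tables, listed in Unicode syllable-composition order.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 medial vowels
TAILS = [
    "", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
    "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
    "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ",
]  # 27 final consonants plus "no final"

HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable


def split_hangul(word: str) -> list:
    """Decompose each Hangul syllable into initial, medial, and optional final jamo."""
    jamo = []
    for ch in word:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            offset = code - HANGUL_BASE
            lead, rest = divmod(offset, len(VOWELS) * len(TAILS))
            vowel, tail = divmod(rest, len(TAILS))
            jamo.append(LEADS[lead])
            jamo.append(VOWELS[vowel])
            if tail:  # index 0 means the syllable has no final consonant
                jamo.append(TAILS[tail])
        else:
            jamo.append(ch)  # pass non-Hangul characters through unchanged
    return jamo


print(split_hangul("한글"))
```

Characters outside the syllable block are passed through unchanged, so mixed-script inputs do not raise errors.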
The constructed dataset is stored in the Dataset folder. Use this dataset along with the models in the Task folder to evaluate LLMs' word restoration capabilities.
📌 GPT o3-mini is only used for error case analysis in this study.
The prompts.json file contains all prompts for Zero-shot, CoT (Chain-of-Thought), and CoT+ICL (In-Context Learning) settings.
# Examples of prompts
"eng_visual_transform": {
"zrs_prompt": "The given words are leetspeak, which is transformed into a letter similar to the original alphabet.\n\nAnswer format:\nAnswer: [Original word]",
"cot_prompt": "The given words are leetspeak, which is transformed into a letter similar to the original alphabet. \nFollow these steps to decode the term and find the correct answer:\n1. Analyze each syllable.\n2. Guess visually similar alphabets.\n3. Combine the original words, including the syllables you guessed.\n\nAnswer format:\nProcessing: [Brief decoding steps]\nAnswer: [Original word]",
"icl_prompt": "The given words are leetspeak, which is transformed into a letter similar to the original alphabet. \nFollow these steps to decode the term and find the correct answer:\n1. Analyze each syllable.\n2. Guess visually similar alphabets.\n3. Combine the original words, including the syllables you guessed.\n\nExample : \"H3ll0, w0rld!\" \n- \"3\" is converted to \"e\". Reason: \"3\" is visually similar to \"E\". \n- \"0\" is converted to \"o.\" Reason: \"0\" is visually similar to \"o.\" \nOutput: \"Hello, world!\"\n\nAnswer format:\nProcessing: [Brief decoding steps]\nAnswer: [Original word]"
}
Task/gpt4o_batch.ipynb
Task/gemini.ipynb
Task/claude_batch.ipynb
Task/gpto3_mini.ipynb
Before evaluating LLM responses, preprocess them using the scripts in the Preprocessing folder.
Preprocessing/abbreviation_preprocessing.ipynb: For the English and Korean abbreviation tasks, LLMs generate five responses. This script selects the response most similar to the ground truth for evaluation.
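The selection step can be sketched as below. The similarity measure is an assumption: this example uses the character-level ratio from Python's `difflib.SequenceMatcher`, whereas the notebook may rely on a different metric. The names `similarity` and `select_best_response` are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_best_response(responses: list, ground_truth: str) -> str:
    """Pick the candidate response closest to the ground truth."""
    target = ground_truth.strip()
    return max(responses, key=lambda r: similarity(r.strip(), target))


# Five made-up candidate expansions for the abbreviation "ASAP".
responses = [
    "as slow as possible",
    "at some point",
    "as soon as possible",
    "as soon as probable",
    "always",
]
best = select_best_response(responses, "as soon as possible")
print(best)
```

When several candidates tie, `max` returns the first one in the list, so the order of the five responses matters only in that case.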
Assess the performance of different models and analyze the results.
Evaluating/evaluation_main_task.ipynb: Evaluates model performance for the main task.
Evaluating/evaluation_main_task_eng_consonant_vowel.ipynb: Uses a different evaluation standard for the English Consonant & Vowel Combination task.
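As a rough illustration of what such an evaluation involves, the sketch below pulls the final `Answer:` line out of each model response (the format requested in the prompts) and scores it by case-insensitive exact match. Exact match is an assumption made for this example; the actual evaluation standards are defined in the notebooks above, and `extract_answer` and `exact_match_accuracy` are hypothetical names.

```python
import re
from typing import Optional

# Matches a line of the form "Answer: <text>" anywhere in the response.
ANSWER_PATTERN = re.compile(r"^Answer:\s*(.+)$", re.MULTILINE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last 'Answer:' line, or None if there is none."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def exact_match_accuracy(responses: list, gold: list) -> float:
    """Fraction of responses whose extracted answer equals the gold word."""
    if not gold:
        return 0.0
    correct = 0
    for response, target in zip(responses, gold):
        answer = extract_answer(response)
        if answer is not None and answer.casefold() == target.casefold():
            correct += 1
    return correct / len(gold)


# Made-up responses for illustration.
responses = [
    "Processing: 3 -> e, 0 -> o\nAnswer: hello",
    "Answer: wrld",
    "I am not sure.",
]
gold = ["hello", "world", "test"]
accuracy = exact_match_accuracy(responses, gold)
print(accuracy)
```

Responses without a parsable `Answer:` line count as incorrect here, which keeps formatting failures visible in the score.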
Failure Case Analysis can be performed using the following scripts:
Evaluating/evaluation_failure_case.ipynb: Analyzes GPT o3-mini failure cases and compares them with other models.
Evaluating/evaluation_failure_case_eng_consonant_vowel.ipynb: Uses a different evaluation standard for failure cases in the English Consonant & Vowel Combination task.