Evaluating LLMs Beyond Standard Text: A Benchmark on Non-Traditional Text Variations

Large language models (LLMs) perform well on standard inputs but struggle with non-standard or transformed language found in real-world digital communication. This paper introduces a multilingual benchmark and evaluation framework designed to assess LLMs' ability to restore original forms and meanings from heavily modified or playful text in both English and Korean. Our dataset includes over 17,000 instances spanning visual and phonetic substitutions, abbreviations, and character-level manipulations. We conduct a comprehensive evaluation of multiple widely used LLMs under three prompting paradigms (Zero-Shot, CoT, CoT+ICL) and explore advanced verification approaches. Results reveal persistent challenges even with sophisticated methods, particularly in Korean tasks involving perceptual and cultural variations. This work establishes a challenging benchmark that remains difficult for current LLMs.

Setup

To run the experiment, install the required packages.

Python==3.12.3

pip install -r requirements.txt

Data preparation

1) Augmentation

Use the Augmentation folder inside Dataset Building to augment data using the provided pre-augmentation dataset and related scripts. The final dataset is created by merging manually collected and augmented data.
The raw dataset may change depending on the words used, so it is not provided. The sources of the raw dataset are documented in the research paper.

Augmentation Tasks:

Kor Letter Rotation
Kor Visual Transform

Dataset_Building/Augmentation/kor_letter_rotation_rule.ipynb

Applies transformation rules for the Korean Letter Rotation Task dataset augmentation.

Dataset_Building/Augmentation/kor_visual_transform_rule.ipynb

Applies transformation rules for the Korean Visual Transform Task dataset augmentation. Manual inspection ensures that the same transformation rule is not applied redundantly.
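As a rough illustration of rule-based augmentation without redundant rule application, the sketch below applies each matching substitution rule to a word once and skips variants that were already produced. The rule table `VISUAL_RULES` and the function `augment` are hypothetical names, and the two substitution pairs are illustrative only; the notebook's actual rules may differ.

```python
# Hypothetical rule table: each key is replaced by a visually similar string.
# These pairs are illustrative assumptions, not the notebook's actual rules.
VISUAL_RULES = {"대": "머", "귀": "커"}


def augment(word, rules=VISUAL_RULES):
    """Return (variant, rule) pairs, applying each matching rule once."""
    variants = []
    seen = set()
    for source, target in rules.items():
        if source not in word:
            continue  # rule does not apply to this word
        variant = word.replace(source, target)
        if variant in seen:
            continue  # skip variants already produced by another rule
        seen.add(variant)
        variants.append((variant, f"{source}->{target}"))
    return variants


if __name__ == "__main__":
    for variant, rule in augment("대전"):
        print(variant, rule)
```

Recording the rule next to each variant makes it easier to check by hand that no rule was applied twice to the same word.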

The collected and augmented data are stored separately in the Dataset Building/Augmentation/Data directory; their combination forms the final merged dataset.

2) Preprocessing

To finalize the dataset for the experiment, preprocess the collected and augmented raw data. The sources of the raw dataset are documented in the research paper.

Dataset_Building/eng_phonetic_preprocessing.ipynb

Filters phonetic similarity data for the English Phonetic Substitution task using the Double Metaphone algorithm.

Dataset_Building/eng_split_letters.ipynb

Splits collected English words into consonants and vowels for the English Consonant & Vowel Combination task.

Dataset_Building/kor_split_letters.ipynb

Splits collected Korean words into consonants and vowels for the Korean Consonant & Vowel Combination task.
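One common way to split Korean words into consonants and vowels is to decompose each precomposed Hangul syllable into its jamo using Unicode arithmetic. The sketch below assumes that approach; the function names `split_syllable` and `split_word` are hypothetical, and the notebook may use a different method or library.

```python
# Jamo tables in Unicode order for precomposed Hangul syllables (U+AC00..U+D7A3).
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")        # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")     # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals, first is "none"

HANGUL_BASE = 0xAC00
HANGUL_COUNT = 19 * 21 * 28  # 11172 precomposed syllables


def split_syllable(ch):
    """Split one Hangul syllable into jamo; return other characters unchanged."""
    code = ord(ch) - HANGUL_BASE
    if not 0 <= code < HANGUL_COUNT:
        return [ch]
    initial, rest = divmod(code, 21 * 28)
    medial, final = divmod(rest, 28)
    parts = [CHOSEONG[initial], JUNGSEONG[medial]]
    if JONGSEONG[final]:
        parts.append(JONGSEONG[final])
    return parts


def split_word(word):
    """Split every character of a word into its consonants and vowels."""
    return [jamo for ch in word for jamo in split_syllable(ch)]


if __name__ == "__main__":
    print(split_word("한글"))
```

Characters outside the precomposed Hangul range (Latin letters, digits, standalone jamo) pass through unchanged, so mixed input does not raise an error.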

Run

The constructed dataset is stored in the Dataset folder. Use this dataset with the model notebooks in the Task folder to evaluate LLMs' word restoration capabilities.

📌 GPT o3-mini is used only for error case analysis in this study.

Available Prompts

The prompts.json file contains all prompts for Zero-shot, CoT (Chain-of-Thought), and CoT+ICL (In-Context Learning) settings.
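A minimal sketch of how a prompt might be selected from prompts.json for a given task and setting. The key names (`zrs_prompt`, `cot_prompt`, `icl_prompt`) follow the prompts.json excerpt in this section; the helper names, the setting labels, and the way the input word is appended are assumptions, not the notebooks' actual code.

```python
import json

# Maps a setting label (hypothetical) to the key used in prompts.json.
SETTING_KEYS = {
    "zero_shot": "zrs_prompt",
    "cot": "cot_prompt",
    "cot_icl": "icl_prompt",
}


def load_prompts(path="prompts.json"):
    """Load the full prompt dictionary from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_prompt(prompts, task, setting, word):
    """Combine the task instruction with the word to restore.

    Appending the word as a trailing "Word:" line is an assumption.
    """
    instruction = prompts[task][SETTING_KEYS[setting]]
    return f"{instruction}\n\nWord: {word}"


if __name__ == "__main__":
    demo = {"eng_visual_transform": {"zrs_prompt": "Decode the leetspeak word."}}
    print(build_prompt(demo, "eng_visual_transform", "zero_shot", "H3ll0"))
```

Keeping the setting-to-key mapping in one dictionary means the same call works for all three prompting paradigms.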

# Examples of prompts
  "eng_visual_transform": {
    "zrs_prompt": "The given words are leetspeak, which is transformed into a letter similar to the original alphabet.\n\nAnswer format:\nAnswer: [Original word]",
    "cot_prompt": "The given words are leetspeak, which is transformed into a letter similar to the original alphabet. \nFollow these steps to decode the term and find the correct answer:\n1. Analyze each syllable.\n2. Guess visually similar alphabets.\n3. Combine the original words, including the syllables you guessed.\n\nAnswer format:\nProcessing: [Brief decoding steps]\nAnswer: [Original word]",
    "icl_prompt": "The given words are leetspeak, which is transformed into a letter similar to the original alphabet. \nFollow these steps to decode the term and find the correct answer:\n1. Analyze each syllable.\n2. Guess visually similar alphabets.\n3. Combine the original words, including the syllables you guessed.\n\nExample : \"H3ll0, w0rld!\" \n- \"3\" is converted to \"e\". Reason: \"3\" is visually similar to \"E\". \n- \"0\" is converted to \"o.\" Reason: \"0\" is visually similar to \"o.\" \nOutput: \"Hello, world!\"\n\nAnswer format:\nProcessing: [Brief decoding steps]\nAnswer: [Original word]"

Available Models

Task/gpt4o_batch.ipynb
Task/gemini.ipynb
Task/claude_batch.ipynb
Task/gpto3_mini.ipynb

Evaluation Preprocessing

Before evaluating LLM responses, preprocessing is performed using the scripts in the Preprocessing folder.

Preprocessing/abbreviation_preprocessing.ipynb

For the English and Korean abbreviation tasks, LLMs generate five responses. This script selects the response most similar to the ground truth for evaluation.
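The selection step can be sketched as follows. The similarity measure here is the standard library's `difflib.SequenceMatcher` ratio, which is an assumption; the notebook may use a different metric. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_best_response(responses, ground_truth):
    """Return the candidate response closest to the ground truth."""
    return max(responses, key=lambda r: similarity(r, ground_truth))


if __name__ == "__main__":
    candidates = [
        "lots of love",
        "laugh out loud",
        "league of legends",
        "lack of logic",
        "land of lakes",
    ]
    print(select_best_response(candidates, "laughing out loud"))
```

If several candidates tie, `max` returns the first one in the list, so the order of the five responses can affect the result.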

Evaluation

Assess the performance of different models and analyze the results.

Evaluating/evaluation_main_task.ipynb

Evaluates model performance for the main task.
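A minimal sketch of one way to score responses, assuming the model follows the `Answer: [Original word]` format requested in the prompts and that scoring is case-insensitive exact match. Both the metric and the function names are assumptions; the notebook's actual evaluation standard may differ.

```python
import re

# Captures the text after "Answer:" up to the end of that line.
ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def extract_answer(response):
    """Return the last 'Answer:' value in a response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


def exact_match_accuracy(responses, truths):
    """Fraction of responses whose extracted answer equals the ground truth."""
    if not truths:
        return 0.0
    correct = 0
    for response, truth in zip(responses, truths):
        predicted = extract_answer(response)
        if predicted is not None and predicted.casefold() == truth.casefold():
            correct += 1
    return correct / len(truths)


if __name__ == "__main__":
    outputs = ["Processing: 3 -> e, 0 -> o\nAnswer: Hello", "Answer: wrld", "no answer given"]
    print(exact_match_accuracy(outputs, ["hello", "world", "test"]))
```

Responses without an `Answer:` line count as incorrect rather than being dropped, so the denominator always matches the number of ground-truth items.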

Evaluating/evaluation_main_task_eng_consonant_vowel.ipynb

Uses a different evaluation standard for the English Consonant & Vowel Combination task.

Failure Case Analysis can be performed using the following scripts:

Evaluating/evaluation_failure_case.ipynb

Analyzes GPT o3-mini failure cases and compares them with those of other models.

Evaluating/evaluation_failure_case_eng_consonant_vowel.ipynb

Uses a different evaluation standard for failure cases in the English Consonant & Vowel Combination task.
