Fine-tuning AraGPT-2 for dialect-to-MSA generation and translation detection
| Resource | Link |
|---|---|
| 🤗 Live Demo | SlangGPT Space |
| 📄 Paper | SlangGPT Report |
| 📦 Training Dataset | egyptian-2-arabic |
| 💬 Feedback Dataset | slanggpt-feedback-dataset |
| 📦 Kaggle | egyptian-2-arabic |
Over 100 million Egyptians speak Egyptian Arabic daily — yet most Arabic NLP systems are trained almost entirely on Modern Standard Arabic (MSA). Egyptian Arabic is not simply a simplified form of MSA; it carries distinct vocabulary, grammar, and culturally loaded expressions that MSA-trained models consistently fail to handle. Words like جدع carry meanings of honor, loyalty, and social standing that have no direct MSA equivalent, yet they appear constantly in everyday Egyptian speech.
When we set out to build SlangGPT, we discovered there was no large, clean, publicly available Egyptian Arabic ↔ MSA parallel dataset suitable for this task. So we built one ourselves.
SlangGPT fine-tunes AraGPT-2 on an 18,250-sentence parallel Egyptian Arabic / MSA corpus to solve two tasks:
- Generation — Given an Egyptian Arabic sentence, generate the equivalent MSA translation
- Detection — Given an (Egyptian, MSA) pair, classify whether the translation is correct
This project extends the methodology of Hernandez & Naik (Stanford CS224N, 2025) — who fine-tuned GPT-2 for Gen-Z English slang understanding — to the Arabic dialect setting, replacing the English backbone with AraGPT-2 and the slang dataset with a parallel Egyptian–MSA corpus.
| Task | Model | chrF | BLEU | Accuracy |
|---|---|---|---|---|
| Generation | Zero-shot AraGPT-2 | 10.62 | 0.02 | — |
| Generation | Fine-tuned AraGPT-2 | 29.08 | 6.63 | — |
| Detection | Zero-shot AraGPT-2 | — | — | 0.500 |
| Detection | Fine-tuned AraGPT-2 | — | — | 0.956 |
Fine-tuning raises detection accuracy by 45.6 percentage points (0.500 → 0.956) and generation chrF by about 18.5 points (10.62 → 29.08) over the zero-shot baselines.
| Input (Egyptian) | Zero-shot Output | Fine-tuned Output |
|---|---|---|
| يلا فين؟ | مالذي جاء به من خير... (forum drift) | هيا، أين أنت؟ |
| أنا محتاج أتكلم معاكي | ياام.. يآآإك ياروحيتي... (social media drift) | أحتاج أن أتحدث معك |
| كنت فاكرك مش جاية | يااللي ما انخطبتك... (forum content) | كنت أذكرك، لستِ قادمة |
- Generation quality is still limited. 18K sentence pairs is relatively small for a full translation task. BLEU of 6.63, while a significant improvement over zero-shot, indicates the model is not yet production-ready for translation.
- AraGPT-2 was not pretrained on dialectal Arabic. The backbone was trained on MSA and online Arabic text, which creates an inherent ceiling on how well it can generate fluent dialectal-to-MSA translations without a much larger fine-tuning corpus.
- Culturally loaded words are hard to translate. Terms like جدع, أوي, or خلاص carry meanings that are difficult to capture in MSA even for humans. The model often produces technically correct but culturally flat translations.
- Dataset coverage is uneven. The 18K corpus skews toward conversational sentences and may not generalize well to domain-specific Egyptian Arabic (e.g., medical, legal, or technical speech).
We deployed the demo publicly with a community feedback loop specifically to address these limitations over time.
The live demo includes a feedback system where users can rate translations and submit corrections. Every submission is automatically saved to the open feedback dataset and will be used for periodic retraining to improve generation quality continuously.
```python
from datasets import load_dataset

df = load_dataset("AdhamAshraf/slanggpt-feedback-dataset", split="train").to_pandas()

# High-quality confirmed translations (rating 4 or 5)
high_quality = df[df["user_rating"] >= 4]

# Human corrections for fine-tuning; fillna guards against missing values
corrections = df[df["corrected_msa"].fillna("").str.strip() != ""]
```

The training corpus contains 18,250 parallel Egyptian Arabic / MSA sentence pairs, used to train both the generation and detection models.
| Split | Generation Pairs | Detection Examples |
|---|---|---|
| Train (80%) | 14,600 | 29,200 |
| Dev (10%) | 1,825 | 3,650 |
| Test (10%) | 1,825 | 3,650 |
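
The counts in the table are consistent with an 80/10/10 split in which every generation pair yields two detection examples, one correct and one incorrect. The sketch below reproduces those numbers under that assumption. The function names and the negative-sampling strategy (pairing each Egyptian sentence with the MSA side of a different pair) are illustrative, not taken from `data/prepare_data.py`.

```python
import random

def split_pairs(pairs, seed=0):
    """Shuffle and split into 80/10/10 train/dev/test using integer arithmetic."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_dev = n // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )

def build_detection_examples(pairs, seed=0):
    """Emit one positive and one negative example per pair (assumed scheme).

    The negative reuses the Egyptian sentence with the MSA side of another
    pair, chosen by a fixed non-zero offset so a pair never matches itself.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs to build a negative example")
    offset = random.Random(seed).randrange(1, n)
    examples = []
    for i, (egyptian, msa) in enumerate(pairs):
        examples.append((egyptian, msa, 1))
        examples.append((egyptian, pairs[(i + offset) % n][1], 0))
    return examples

# Placeholder strings stand in for the real corpus.
corpus = [(f"egy_{i}", f"msa_{i}") for i in range(18_250)]
train, dev, test = split_pairs(corpus)
train_detection = build_detection_examples(train)
print(len(train), len(dev), len(test), len(train_detection))
```

If two different pairs share the same MSA text, this scheme can produce a mislabeled negative, so a real pipeline would likely need a check for that case.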
🔗 huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic
```python
from datasets import load_dataset

dataset = load_dataset("AdhamAshraf/egyptian-2-arabic", split="train")
df = dataset.to_pandas()
```

Source & Derivation
Derived from Abdalrahmankamel/Egyptian_2_English. Modifications:
- Removed English translation column
- Added Modern Standard Arabic translations
- Applied Arabic normalization and diacritic removal
- Reformatted for dialect-to-MSA tasks
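
The exact normalization rules are not listed here, so the following is only a minimal sketch of what diacritic removal and light normalization might look like. It assumes three steps (strip tashkeel, drop tatweel, collapse whitespace) and leaves hamza forms such as أ untouched. The function name `normalize_arabic` is hypothetical.

```python
import re

# Arabic diacritics (tashkeel): U+064B..U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, purely typographic

def normalize_arabic(text: str) -> str:
    """Remove diacritics and tatweel, then collapse runs of whitespace."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()

print(normalize_arabic("مَرْحَبًا   بِكُمْ"))
```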
Human feedback collected from the live Space. Users rate model translations and provide corrections, forming a growing dataset for future fine-tuning.
| Field | Type | Description |
|---|---|---|
| `egyptian_arabic` | string | Original Egyptian Arabic input |
| `generated_msa` | string | SlangGPT's generated translation |
| `user_label` | string | correct or incorrect |
| `user_rating` | int64 | Quality score 0–5 |
| `corrected_msa` | string | Human correction (required if incorrect or rating ≤ 2) |
| `timestamp` | string | ISO 8601 UTC timestamp |
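
The schema implies a few constraints that can be checked before a feedback record is used for retraining. The sketch below is one way to express them. The `validate_feedback` helper is hypothetical and is not part of the demo's code.

```python
def validate_feedback(record: dict) -> list:
    """Return a list of problems found in one feedback record (empty if valid)."""
    problems = []

    if not str(record.get("egyptian_arabic") or "").strip():
        problems.append("egyptian_arabic is empty")

    label = record.get("user_label")
    if label not in {"correct", "incorrect"}:
        problems.append("user_label must be 'correct' or 'incorrect'")

    rating = record.get("user_rating")
    rating_ok = isinstance(rating, int) and 0 <= rating <= 5
    if not rating_ok:
        problems.append("user_rating must be an integer from 0 to 5")

    # A correction is required when the translation was rejected or rated low.
    needs_correction = label == "incorrect" or (rating_ok and rating <= 2)
    if needs_correction and not str(record.get("corrected_msa") or "").strip():
        problems.append("corrected_msa is required for incorrect or low-rated rows")

    return problems

example = {
    "egyptian_arabic": "يلا فين؟",
    "generated_msa": "هيا، أين أنت؟",
    "user_label": "correct",
    "user_rating": 5,
    "corrected_msa": "",
    "timestamp": "2026-01-01T00:00:00+00:00",
}
print(validate_feedback(example))
```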
Note: GPU required. CPU inference is too slow for practical use. Training was done on a T4 GPU via Google Colab.
```bash
git clone https://github.com/adhamashraf7788/SlangGPT.git
cd SlangGPT
pip install -r requirements.txt
```

```bash
python scripts/download_weights.py
```

Downloads weights to:
- model/weights/detection/best_model.pt (~527 MB)
- model/weights/generation/best/model.safetensors (~1.37 GB)
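```python
import os

from datasets import load_dataset

dataset = load_dataset("AdhamAshraf/egyptian-2-arabic", split="train")
df = dataset.to_pandas()

# data/raw/ is git-ignored, so create it if it is missing in a fresh clone
os.makedirs("data/raw", exist_ok=True)
df.to_csv("data/raw/NLP.csv", index=False, encoding="utf-8-sig")
```

```bash
python data/prepare_data.py --raw_csv data/raw/NLP.csv
```

```bash
python app/app.py
```

Open http://localhost:5000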
Training was done on Google Colab (T4 GPU). Open the notebooks in order:
| Notebook | Description |
|---|---|
| `01_preprocessing.ipynb` | Download dataset, clean, split, build detection pairs |
| `02_train_generation.ipynb` | Fine-tune AraGPT-2 medium for generation |
| `03_train_detection.ipynb` | Fine-tune AraGPT-2 base for detection |
| | Generation | Detection |
|---|---|---|
| Base model | aragpt2-medium | aragpt2-base |
| Parameters | ~355M | ~135M |
| Learning rate | 5e-5 | 2e-5 |
| Batch size | 8 (eff. 32) | 16 |
| LR schedule | Cosine | Linear |
| Warmup ratio | 10% | 10% |
| Weight decay | 0.01 | 0.01 |
| Epochs | 5 (best at ep. 3) | 8 |
| Train loss (start → end) | 2.50 → 0.76 | 0.71 → 0.10 |
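
To make the generation column concrete, the sketch below works out the step counts implied by the table and evaluates a linear-warmup, cosine-decay learning rate at a few points. It assumes the effective batch of 32 comes from gradient accumulation (32 / 8 = 4 steps) on a single GPU and that the last partial batch is kept. The `cosine_lr` helper is illustrative and may differ in detail from the scheduler used in the training script.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TRAIN_PAIRS = 14_600      # generation training pairs
PER_DEVICE_BATCH = 8
EFFECTIVE_BATCH = 32
EPOCHS = 5
BASE_LR = 5e-5

accumulation_steps = EFFECTIVE_BATCH // PER_DEVICE_BATCH      # assumed: 4
steps_per_epoch = math.ceil(TRAIN_PAIRS / EFFECTIVE_BATCH)    # optimizer steps
total_steps = steps_per_epoch * EPOCHS
warmup_steps = total_steps // 10                              # 10% warmup

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps, BASE_LR):.2e}")
```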
```bash
# Zero-shot baseline
python evaluation/baseline.py

# Fine-tuned model evaluation (chrF, BLEU, PPL, accuracy)
python evaluation/evaluate.py

# Error analysis (false positives / false negatives)
python evaluation/error_analysis.py
```

| | Count | Rate |
|---|---|---|
| Total test examples | 3,650 | — |
| Correct predictions | 3,491 | 95.6% |
| False Positives | 101 | 2.8% |
| False Negatives | 58 | 1.6% |
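
The rates in the table follow directly from the counts, with each rate taken over all 3,650 test examples. The short check below reproduces them. The helper name is illustrative.

```python
def error_rates(total, false_positives, false_negatives):
    """Return (accuracy, FP rate, FN rate), each as a fraction of all examples."""
    correct = total - false_positives - false_negatives
    return correct / total, false_positives / total, false_negatives / total

accuracy, fp_rate, fn_rate = error_rates(total=3650, false_positives=101, false_negatives=58)
print(f"accuracy={accuracy:.1%}  FP={fp_rate:.1%}  FN={fn_rate:.1%}")
```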
| Task | Base Model | Link |
|---|---|---|
| Generation | AraGPT-2 Medium | aubmindlab/aragpt2-medium |
| Detection | AraGPT-2 Base | aubmindlab/aragpt2-base |
Generation uses causal language modeling with prompt masking — only the MSA target tokens contribute to the training loss. Inference uses greedy decoding with repetition penalty.
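
A minimal sketch of both ideas, using plain Python lists in place of tensors. Prompt masking is shown with the label value -100, which PyTorch's cross-entropy loss ignores by default. The repetition penalty follows the common rule of dividing positive logits (and multiplying negative ones) for tokens that were already generated. The token ids, the penalty value of 1.2, and the function names are assumptions for illustration, not values from the training code.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_example(prompt_ids, target_ids, eos_id):
    """Concatenate prompt and target; only target tokens (and EOS) get labels."""
    input_ids = prompt_ids + target_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids + [eos_id]
    return input_ids, labels

def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make already-generated tokens less likely to be picked again."""
    adjusted = list(logits)
    for token in set(generated_ids):
        if adjusted[token] > 0:
            adjusted[token] /= penalty
        else:
            adjusted[token] *= penalty
    return adjusted

def greedy_pick(logits, generated_ids, penalty=1.2):
    """Greedy decoding step: argmax over the penalized logits."""
    adjusted = apply_repetition_penalty(logits, generated_ids, penalty)
    return max(range(len(adjusted)), key=adjusted.__getitem__)

input_ids, labels = build_example(prompt_ids=[11, 12, 13], target_ids=[21, 22], eos_id=0)
print(input_ids)
print(labels)
print(greedy_pick([2.0, 1.9, -1.0], generated_ids=[0]))
```

In the last call, token 0 has the highest raw logit but was already generated, so the penalty lowers it below token 1.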
Detection encodes a cloze-style Arabic prompt through AraGPT-2 and passes the last-token hidden state through a linear classifier head trained with binary cross-entropy.
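
The classifier head can be illustrated without a deep learning framework. The toy sketch below picks the hidden state of the last non-padding token (assuming right-side padding), applies a linear layer to get one logit, and scores it with binary cross-entropy. The hidden size, weights, and function names are made up for the example. The real model would use AraGPT-2 hidden states and a trained head.

```python
import math

def last_token_index(attention_mask):
    """Index of the last real token, assuming padding is on the right."""
    return sum(attention_mask) - 1

def classify(hidden_states, attention_mask, weight, bias):
    """Linear head on the last-token hidden state; returns a single logit."""
    h = hidden_states[last_token_index(attention_mask)]
    return sum(w * x for w, x in zip(weight, h)) + bias

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Three positions with hidden size 2; the third position is padding.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
logit = classify(hidden_states, attention_mask, weight=[0.5, -2.0], bias=0.25)
probability = 1.0 / (1.0 + math.exp(-logit))
print(logit, round(probability, 3), round(bce_with_logits(logit, 0), 4))
```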
```
SlangGPT/
├── app/ # Flask web application
│ ├── app.py
│ ├── model.py
│ ├── templates/index.html
│ └── static/style.css
├── data/
│ ├── prepare_data.py
│ ├── raw/NLP.csv # [git-ignored]
│ └── processed/ # [git-ignored]
├── model/
│ ├── config.py
│ ├── train_generation.py
│ ├── train_detection.py
│ └── weights/ # [git-ignored]
├── evaluation/
│ ├── baseline.py
│ ├── evaluate.py
│ ├── error_analysis.py
│ └── plots/
├── notebooks/ # [git-ignored]
├── scripts/
│ └── download_weights.py
├── report/
│ ├── main.tex
│ └── references.bib
├── requirements.txt
└── .gitignore
```
This project extends:
Hernandez & Naik, Extending GPT-2 for Informal and Slang Aware Language Understanding, Stanford CS224N, 2025
Which builds on:
- Antoun et al., AraGPT2, 2021
- Sun et al., Toward Informal Language Processing, 2024
- Radford et al., GPT-2, 2019
```bibtex
@misc{slanggpt2026,
  title={SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic to MSA Generation and Detection},
  author={Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
  year={2026},
  url={https://github.com/adhamashraf7788/SlangGPT}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
