A clean, minimal encoder–decoder Transformer for English → German translation, following Attention Is All You Need. This project focuses on explainability and readability: each module mirrors the paper’s components, with clear training and inference utilities.
Note: Training a high‑quality EN→DE model requires significant compute and data. This repo provides a faithful implementation and a training pipeline, but the bundled model is not pretrained.
- Encoder–decoder Transformer with multi‑head self‑attention and cross‑attention.
- SentencePiece BPE tokenization with joint EN/DE vocabulary.
- Padding‑aware loss (ignore
<pad>tokens) and attention masking. - Mixed‑precision training and checkpointing.
- Simple greedy decoding for inference.
Transformer/
README.md
config.py
main.py
spm_joint_32k.model
modules/
attention.py
encoder.py
decoder.py
ff.py
transformer.py
utils/
data.py
inference.py
train.py
train_Spm.py
- Python 3.10+
- PyTorch
- sentencepiece
- tqdm
- matplotlib
Place your parallel corpus under a DATA/ folder next to main.py:
Transformer/
DATA/
train.en
train.de
all.txt
train.en,train.de: aligned sentence pairsall.txt: concatenation of EN + DE for SentencePiece training
The training script will automatically train spm_joint_32k.model if it does not exist.
Run the training entrypoint:
python main.pyCheckpoints are saved to checkpoints/ every epoch.
Use the helper in utils/inference.py:
from utils.inference import translate
translation = translate(model, sp, "This is a test sentence.")
print(translation)All hyperparameters live in config.py:
n_layer,n_head,n_embdblock_sizedropoutbatch_size,lr,epochsdevice,fp16
The default settings follow the Transformer‑Base configuration.
Placeholder for diagram (add your encoder–decoder image here):
For query
- Scaling by
$\sqrt{d_k}$ stabilizes gradients. - Padding positions are masked to prevent the model from attending to
<pad>.
Multiple heads capture different alignment patterns:
Applied independently at each time step:
This project uses GELU for
Decoder self‑attention is masked so each position attends only to past tokens:
Cross‑entropy is computed while ignoring <pad> tokens, so the model isn’t penalized for padding predictions.
- Optimizer: AdamW with StepLR
- Mixed precision via
torch.amp - Gradient clipping to stabilize training
Because EN→DE translation is resource‑intensive, expect slow convergence without substantial GPU time.
- Vaswani et al., Attention Is All You Need, NeurIPS 2017.
- SentencePiece: https://github.com/google/sentencepiece
This project is a learning‑oriented reimplementation intended for clarity and experimentation. Contributions and improvements are welcome.
