Transformer (Attention Is All You Need) — PyTorch Reimplementation

A clean, minimal encoder–decoder Transformer for English → German translation, following Attention Is All You Need. This project focuses on explainability and readability: each module mirrors the paper’s components, with clear training and inference utilities.

Note: Training a high‑quality EN→DE model requires significant compute and data. This repo provides a faithful implementation and a training pipeline, but it does not include pretrained translation weights.

Highlights

Encoder–decoder Transformer with multi‑head self‑attention and cross‑attention.
SentencePiece BPE tokenization with joint EN/DE vocabulary.
Padding‑aware loss (<pad> tokens are ignored) and padding‑aware attention masking.
Mixed‑precision training and checkpointing.
Simple greedy decoding for inference.

Project Structure

Transformer/
  README.md
  config.py
  main.py
  spm_joint_32k.model
  modules/
    attention.py
    encoder.py
    decoder.py
    ff.py
    transformer.py
  utils/
    data.py
    inference.py
    train.py
    train_Spm.py

Setup

Requirements

Python 3.10+
PyTorch
sentencepiece
tqdm
matplotlib

Expected Data Layout

Place your parallel corpus under a DATA/ folder next to main.py:

Transformer/
  DATA/
    train.en
    train.de
    all.txt

train.en, train.de: aligned sentence pairs
all.txt: concatenation of EN + DE for SentencePiece training
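
The layout above can be checked, and all.txt assembled, with a short script. This is a minimal sketch rather than the repo's own preprocessing: the function name `build_joint_corpus` is hypothetical, and it assumes UTF-8 files with one sentence per line that fit in memory.

```python
from pathlib import Path


def build_joint_corpus(data_dir: str) -> int:
    """Verify that train.en/train.de are aligned and write all.txt (EN lines, then DE lines).

    Hypothetical helper for illustration; returns the number of sentence pairs.
    """
    root = Path(data_dir)
    en_lines = (root / "train.en").read_text(encoding="utf-8").splitlines()
    de_lines = (root / "train.de").read_text(encoding="utf-8").splitlines()

    # Aligned corpora must have exactly one target line per source line.
    if len(en_lines) != len(de_lines):
        raise ValueError(
            f"train.en has {len(en_lines)} lines but train.de has {len(de_lines)}"
        )

    # SentencePiece reads plain text, one sentence per line.
    with open(root / "all.txt", "w", encoding="utf-8") as out:
        for line in en_lines + de_lines:
            out.write(line + "\n")
    return len(en_lines)
```

Calling `build_joint_corpus("DATA")` from the project root would fail early on a misaligned corpus instead of silently training on mismatched pairs.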

How to Run

1) Train SentencePiece (auto)

The training script will automatically train spm_joint_32k.model if it does not exist.
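
The "train only if missing" behaviour could look roughly like the sketch below. It is not the code in utils/train_Spm.py: the function name `ensure_spm_model` is hypothetical, the 32,000 vocabulary size is inferred from the file name, and the special-token ids are assumptions.

```python
from pathlib import Path


def ensure_spm_model(corpus: str = "DATA/all.txt",
                     model_prefix: str = "spm_joint_32k",
                     vocab_size: int = 32000) -> Path:
    """Train a joint BPE SentencePiece model only if the .model file is missing."""
    model_path = Path(f"{model_prefix}.model")
    if model_path.exists():
        return model_path  # reuse the existing tokenizer

    # Imported lazily so the existence check works without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input=corpus,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
        # Special-token ids are assumptions; keep them in sync with the training code.
        pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    )
    return model_path
```

Reserving an explicit `pad_id` matters here because SentencePiece disables the padding token by default, while the loss and attention masks in this project rely on one.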

2) Train the Transformer

Run the training entrypoint:

python main.py

Checkpoints are saved to checkpoints/ every epoch.
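
Per-epoch checkpointing can be sketched as follows. The helper names, the dictionary keys, and the `epoch_XXX.pt` file naming are assumptions for illustration and may differ from what utils/train.py writes.

```python
from pathlib import Path

import torch
from torch import nn


def save_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer,
                    epoch: int, ckpt_dir: str = "checkpoints") -> Path:
    """Write model and optimizer state for one epoch (file naming is an assumption)."""
    Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
    path = Path(ckpt_dir) / f"epoch_{epoch:03d}.pt"
    torch.save(
        {"epoch": epoch,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )
    return path


def load_checkpoint(path, model: nn.Module, optimizer=None) -> int:
    """Restore state in place and return the epoch stored in the file."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    if optimizer is not None:
        optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]
```

Saving the optimizer state alongside the weights lets an interrupted run resume with the same AdamW moment estimates rather than restarting them from zero.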

3) Translate (greedy decoding)

Use the helper in utils/inference.py:

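from utils.inference import translate

# model: a trained Transformer instance; sp: the SentencePiece processor loaded from spm_joint_32k.model
translation = translate(model, sp, "This is a test sentence.")
print(translation)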
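
For readers unfamiliar with greedy decoding, the loop behind such a helper can be sketched as below. This is not the body of `translate`: `greedy_decode` and the `next_token_logits` callable are hypothetical names, and the sketch decodes a single sentence without batching.

```python
from typing import Callable, List

import torch


@torch.no_grad()
def greedy_decode(next_token_logits: Callable[[torch.Tensor], torch.Tensor],
                  bos_id: int, eos_id: int, max_len: int = 64) -> List[int]:
    """Generate token ids one at a time, always taking the arg-max token.

    `next_token_logits` is a hypothetical callable: given the target prefix of
    shape (1, t), it returns logits of shape (vocab_size,) for the next token.
    """
    ys = torch.tensor([[bos_id]], dtype=torch.long)
    for _ in range(max_len):
        logits = next_token_logits(ys)
        next_id = int(torch.argmax(logits, dim=-1))
        if next_id == eos_id:
            break  # the model signalled the end of the sentence
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
    return ys[0, 1:].tolist()  # drop the leading <bos>
```

In a real pipeline the callable would wrap the decoder, reusing the encoder output for the source sentence, and the returned ids would be turned back into text with the SentencePiece processor. Greedy decoding is simple and fast but can miss translations that beam search would find.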

Configuration

All hyperparameters live in config.py:

n_layer, n_head, n_embd
block_size
dropout
batch_size, lr, epochs
device, fp16

The default settings follow the Transformer‑Base configuration.
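
As an illustration of how these fields fit together, here is a hedged sketch of a config object. The class name `TransformerConfig` is hypothetical; the model-size values are the Transformer‑Base values from the paper, while the sequence length and optimisation values are placeholders, not the defaults in config.py.

```python
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    # Model size: the Transformer-Base values from the paper.
    n_layer: int = 6        # layers in the encoder and in the decoder
    n_head: int = 8
    n_embd: int = 512       # d_model
    dropout: float = 0.1
    # Placeholder values below are assumptions, not the repo's defaults.
    block_size: int = 128   # maximum sequence length in tokens
    batch_size: int = 64
    lr: float = 3e-4
    epochs: int = 10
    device: str = "cuda"
    fp16: bool = True

    @property
    def head_dim(self) -> int:
        """Per-head width d_k = n_embd / n_head (64 for Transformer-Base)."""
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head
```

The divisibility check reflects a real constraint: the embedding is split evenly across heads, so `n_embd` must be a multiple of `n_head`.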

Architecture

Placeholder for diagram (add your encoder–decoder image here):

Core Mathematics (Intuition + Formulas)

1) Scaled Dot‑Product Attention

For query $Q$, key $K$, value $V$:

$$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$

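Dividing by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$; without it, large logits push the softmax into saturated regions where gradients become very small.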
Padding positions are masked to prevent the model from attending to <pad>.
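
The formula and the padding mask can be sketched in a few lines of PyTorch. This is an illustrative version, not necessarily the code in modules/attention.py; the function name and the `key_padding_mask` argument (True where a key is <pad>) are assumptions.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v, key_padding_mask=None):
    """q: (B, Tq, d_k), k: (B, Tk, d_k), v: (B, Tk, d_v).

    key_padding_mask: optional bool tensor (B, Tk), True where the key is <pad>.
    Returns the attended values (B, Tq, d_v) and the weights (B, Tq, Tk).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if key_padding_mask is not None:
        # Broadcast over the query dimension; -inf becomes weight 0 after softmax.
        scores = scores.masked_fill(key_padding_mask.unsqueeze(1), float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights
```

One caveat of this sketch: if every key in a row is masked, the softmax receives only `-inf` and returns NaN, so each sequence needs at least one real token.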

2) Multi‑Head Attention

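Several heads run attention in parallel on different learned projections of the inputs, so each head can capture a different alignment pattern. With $h$ heads, each head has width $d_k = d_{\text{model}}/h$:

$$ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O $$

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$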
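
A compact way to realise these two equations is to fuse the per-head projections into one matrix per role and reshape. The module below is a sketch under that assumption and omits masking and dropout for brevity; the class and attribute names are illustrative, not those of modules/attention.py.

```python
import torch
from torch import nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_head: int = 8):
        super().__init__()
        assert d_model % n_head == 0, "d_model must be divisible by n_head"
        self.n_head, self.d_k = n_head, d_model // n_head
        # Each fused matrix stacks the per-head W_i^Q, W_i^K, W_i^V side by side.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # W^O

    def _split(self, x: torch.Tensor) -> torch.Tensor:
        # (B, T, d_model) -> (B, n_head, T, d_k)
        b, t, _ = x.shape
        return x.view(b, t, self.n_head, self.d_k).transpose(1, 2)

    def forward(self, query, key, value):
        q = self._split(self.w_q(query))
        k = self._split(self.w_k(key))
        v = self._split(self.w_v(value))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v           # (B, n_head, Tq, d_k)
        b, _, t, _ = heads.shape
        # Concat(head_1, ..., head_h): merge the head and width dimensions.
        concat = heads.transpose(1, 2).reshape(b, t, self.n_head * self.d_k)
        return self.w_o(concat)
```

Passing the same tensor as query, key, and value gives self‑attention; passing decoder states as the query and encoder outputs as key and value gives cross‑attention.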

3) Position‑wise Feed‑Forward Network

Applied independently at each time step:

$$ \text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2 $$

This project uses GELU for $\sigma$; the original paper uses ReLU.
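
The feed‑forward block maps directly onto two linear layers. The sketch below assumes the Transformer‑Base inner width of 2048 and places dropout after the activation; both the class name and the dropout placement are assumptions rather than a copy of modules/ff.py.

```python
import torch
from torch import nn


class PositionwiseFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W_1 x + b_1
            nn.GELU(),                  # sigma
            nn.Dropout(dropout),        # placement is an assumption
            nn.Linear(d_ff, d_model),   # W_2 (.) + b_2
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.Linear acts on the last dimension only, so every position is
        # transformed with the same weights, independently of the others.
        return self.net(x)
```

Because no operation mixes the sequence dimension, changing the input at one position leaves the outputs at all other positions unchanged, which is what "position‑wise" means.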

4) Decoder Causality

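Decoder self‑attention is masked so that position $t$ attends only to itself and earlier positions, never to future tokens. An additive mask is applied to the attention logits before the softmax:

$$ \text{mask}_{t,\tau} = \begin{cases} 0 & \tau \le t \\ -\infty & \tau > t \end{cases} $$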
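
The causal mask can be built with `torch.triu`. This is a minimal sketch with a hypothetical function name; the example adds the mask to all‑zero scores to show its effect on the attention weights.

```python
import torch


def causal_mask(size: int) -> torch.Tensor:
    """Additive (size, size) mask: 0 where tau <= t, -inf where tau > t."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


# Added to the attention scores before the softmax.
scores = torch.zeros(3, 3)
weights = torch.softmax(scores + causal_mask(3), dim=-1)
# Row t of `weights` is uniform over positions 0..t and zero afterwards.
```

In the decoder this mask is combined with the padding mask, so a position is hidden if it is either in the future or a <pad> token.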

5) Loss with Padding Ignore

Cross‑entropy is computed while ignoring <pad> tokens, so the model isn’t penalized for padding predictions.
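
In PyTorch this is usually done with the `ignore_index` argument of the cross‑entropy loss. The sketch below assumes the <pad> id is 0 and uses a hypothetical function name; the actual id should be read from the trained SentencePiece model.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumed id of <pad>; use the id from the trained SentencePiece model


def translation_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, vocab), targets: (B, T). Mean cross-entropy over non-pad tokens."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (B*T, vocab)
        targets.reshape(-1),                  # flatten to (B*T,)
        ignore_index=PAD_ID,
    )
```

With `ignore_index`, padded positions contribute neither to the loss sum nor to the count used for the mean, so short and long sentences in a batch are weighted by their real token counts.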

Training Notes

Optimizer: AdamW with a StepLR learning‑rate schedule
Mixed precision via torch.amp
Gradient clipping to stabilize training

Because EN→DE translation is resource‑intensive, expect slow convergence without substantial GPU time.

References

Vaswani et al., Attention Is All You Need, NeurIPS 2017.
SentencePiece: https://github.com/google/sentencepiece

Acknowledgements

This project is a learning‑oriented reimplementation intended for clarity and experimentation. Contributions and improvements are welcome.
