AshutoshKumar1007/Transformer

Transformer (Attention Is All You Need) — PyTorch Reimplementation

A clean, minimal encoder–decoder Transformer for English → German translation, following Attention Is All You Need. This project focuses on explainability and readability: each module mirrors the paper’s components, with clear training and inference utilities.

Note: Training a high‑quality EN→DE model requires significant compute and data. This repo provides a faithful implementation and a training pipeline, but the bundled model is not pretrained.


Highlights

  • Encoder–decoder Transformer with multi‑head self‑attention and cross‑attention.
  • SentencePiece BPE tokenization with joint EN/DE vocabulary.
  • Padding‑aware loss (ignore <pad> tokens) and attention masking.
  • Mixed‑precision training and checkpointing.
  • Simple greedy decoding for inference.

Project Structure

Transformer/
  README.md
  config.py
  main.py
  spm_joint_32k.model
  modules/
    attention.py
    encoder.py
    decoder.py
    ff.py
    transformer.py
  utils/
    data.py
    inference.py
    train.py
    train_Spm.py

Setup

Requirements

  • Python 3.10+
  • PyTorch
  • sentencepiece
  • tqdm
  • matplotlib

Expected Data Layout

Place your parallel corpus under a DATA/ folder next to main.py:

Transformer/
  DATA/
    train.en
    train.de
    all.txt

  • train.en, train.de: aligned sentence pairs, one sentence per line
  • all.txt: concatenation of EN + DE for SentencePiece training
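
As a rough sketch of how the layout above could be prepared, the helper below checks that the two sides have the same number of lines and writes `all.txt` by concatenating them. The function names `check_aligned` and `build_spm_corpus` are hypothetical and not part of this repository; skipping blank lines in `all.txt` is an assumption.

```python
from pathlib import Path


def check_aligned(data_dir: str) -> int:
    """Return the number of sentence pairs, or raise if the files differ in length."""
    data = Path(data_dir)
    counts = []
    for name in ("train.en", "train.de"):
        with (data / name).open(encoding="utf-8") as f:
            counts.append(sum(1 for _ in f))
    if counts[0] != counts[1]:
        raise ValueError(f"train.en has {counts[0]} lines, train.de has {counts[1]}")
    return counts[0]


def build_spm_corpus(data_dir: str) -> Path:
    """Concatenate train.en and train.de into all.txt, one sentence per line."""
    data = Path(data_dir)
    out_path = data / "all.txt"
    with out_path.open("w", encoding="utf-8") as out:
        for name in ("train.en", "train.de"):
            with (data / name).open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # assumption: blank lines are dropped from all.txt
                        out.write(line + "\n")
    return out_path
```

Calling `check_aligned("DATA")` before `build_spm_corpus("DATA")` catches a misaligned corpus early, before any tokenizer or model training starts.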

How to Run

1) Train SentencePiece (auto)

The training script will automatically train spm_joint_32k.model if it does not exist.
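
The "train only if missing" behaviour can be sketched as follows. This is an illustration and not the code in `utils/train_Spm.py`: the function name `ensure_spm_model`, the default paths, and the special-token ids are assumptions. The `SentencePieceTrainer.train` keyword arguments are standard SentencePiece options.

```python
from pathlib import Path


def ensure_spm_model(corpus="DATA/all.txt", prefix="spm_joint_32k", vocab_size=32000):
    """Return the path to the SentencePiece model, training it only if it is missing."""
    model_path = Path(prefix + ".model")
    if model_path.exists():
        return model_path  # reuse the existing tokenizer

    # Imported lazily: sentencepiece is only needed when training is required.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input=corpus,
        model_prefix=prefix,      # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,
        model_type="bpe",
        pad_id=0, unk_id=1, bos_id=2, eos_id=3,  # assumed special-token ids
    )
    return model_path
```

Training on the joint `all.txt` file is what gives English and German a shared vocabulary.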

2) Train the Transformer

Run the training entrypoint:

python main.py

Checkpoints are saved to checkpoints/ every epoch.
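
A per-epoch checkpoint typically stores the model and optimizer state together so training can resume. The sketch below shows one way to do this; the file naming scheme, the dictionary keys, and the helper names are assumptions, not necessarily what `utils/train.py` writes.

```python
from pathlib import Path

import torch


def save_checkpoint(model, optimizer, epoch, ckpt_dir="checkpoints"):
    """Write model and optimizer state for one epoch; return the file path."""
    Path(ckpt_dir).mkdir(parents=True, exist_ok=True)
    path = Path(ckpt_dir) / f"epoch_{epoch:03d}.pt"  # hypothetical naming scheme
    torch.save(
        {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        path,
    )
    return path


def load_checkpoint(path, model, optimizer=None, map_location="cpu"):
    """Restore state in place and return the epoch the checkpoint was saved at."""
    state = torch.load(path, map_location=map_location)
    model.load_state_dict(state["model"])
    if optimizer is not None:
        optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]
```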

3) Translate (greedy decoding)

Use the helper in utils/inference.py:

from utils.inference import translate

translation = translate(model, sp, "This is a test sentence.")
print(translation)
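
The internals of `translate` are not shown here, but greedy decoding itself is simple: start from `<bos>`, repeatedly append the highest-scoring next token, and stop at `<eos>` or a length limit. The sketch below illustrates that loop. `next_logits` is a hypothetical callable standing in for a decoder forward pass, and the token ids are assumptions.

```python
import torch


@torch.no_grad()
def greedy_decode(next_logits, bos_id, eos_id, max_len=64):
    """Greedy search over a callable that scores the next token.

    next_logits: hypothetical callable mapping the current target prefix
    (1-D LongTensor) to a 1-D tensor of vocabulary logits.
    """
    tokens = [bos_id]
    for _ in range(max_len):
        logits = next_logits(torch.tensor(tokens, dtype=torch.long))
        next_id = int(torch.argmax(logits))  # pick the single best token
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop <bos>


def toy_next_logits(prefix):
    """Toy stand-in for a decoder: emits ids 4, 5, 6 and then <eos> (id 3)."""
    logits = torch.zeros(10)
    logits[3 if len(prefix) >= 4 else 3 + len(prefix)] = 1.0
    return logits


print(greedy_decode(toy_next_logits, bos_id=2, eos_id=3))  # [4, 5, 6]
```

Greedy search is cheap but commits to one token per step; beam search usually gives better translations at higher cost.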

Configuration

All hyperparameters live in config.py:

  • n_layer, n_head, n_embd
  • block_size
  • dropout
  • batch_size, lr, epochs
  • device, fp16

The default settings follow the Transformer‑Base configuration.
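
For orientation, a configuration in the spirit of Transformer‑Base might look like the sketch below. The values for `n_layer`, `n_head`, `n_embd`, and `dropout` are the base settings from the paper. Everything else (`block_size`, `batch_size`, `lr`, `epochs`, and the dataclass form itself) is an assumption for illustration and may differ from the actual `config.py`.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Transformer-Base values from the paper
    n_layer: int = 6
    n_head: int = 8
    n_embd: int = 512
    dropout: float = 0.1
    # Assumed values, chosen only for illustration
    block_size: int = 256   # maximum sequence length in tokens
    batch_size: int = 64
    lr: float = 3e-4
    epochs: int = 10
    device: str = "cuda"
    fp16: bool = True

    @property
    def head_dim(self) -> int:
        """Per-head dimension d_k; n_embd must split evenly across heads."""
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        return self.n_embd // self.n_head
```

With the base values, each head works in a 64‑dimensional subspace (`512 // 8`).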


Architecture

[Figure placeholder: Transformer encoder–decoder architecture diagram]


Core Mathematics (Intuition + Formulas)

1) Scaled Dot‑Product Attention

For query $Q$, key $K$, value $V$:

$$ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $$

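  • Dividing by $\sqrt{d_k}$ keeps the variance of the scores independent of $d_k$; without it, large dot products saturate the softmax and its gradients become very small.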
  • Padding positions are masked to prevent the model from attending to <pad>.
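
The formula and the padding rule above can be sketched in a few lines of PyTorch. This is a minimal illustration and not the code in `modules/attention.py`; the function name, the tensor layout, and the convention that `True` marks a `<pad>` key are assumptions.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v, key_padding_mask=None):
    """q: (B, Tq, d_k), k: (B, Tk, d_k), v: (B, Tk, d_v).

    key_padding_mask: optional (B, Tk) bool tensor, True where the key is <pad>.
    Returns the attended values (B, Tq, d_v) and the weights (B, Tq, Tk).
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, Tq, Tk)
    if key_padding_mask is not None:
        # -inf scores become exactly zero weight after the softmax
        scores = scores.masked_fill(key_padding_mask[:, None, :], float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights
```

Note that a row whose keys are all padding would produce NaN weights, so every sequence needs at least one real token.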

2) Multi‑Head Attention

Multiple heads capture different alignment patterns:

$$ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O $$

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
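
A compact way to implement these two equations is to fuse the per‑head projections $W_i^Q, W_i^K, W_i^V$ into one linear layer each and split the result into heads by reshaping. The module below is a sketch under that assumption. Masking and dropout are omitted for brevity, and the class and attribute names are illustrative, not those of `modules/attention.py`.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, n_embd=512, n_head=8):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head, self.d_k = n_head, n_embd // n_head
        self.w_q = nn.Linear(n_embd, n_embd)  # all W_i^Q stacked
        self.w_k = nn.Linear(n_embd, n_embd)  # all W_i^K stacked
        self.w_v = nn.Linear(n_embd, n_embd)  # all W_i^V stacked
        self.w_o = nn.Linear(n_embd, n_embd)  # W^O

    def _split(self, x):
        # (B, T, n_embd) -> (B, h, T, d_k)
        B, T, _ = x.shape
        return x.view(B, T, self.n_head, self.d_k).transpose(1, 2)

    def forward(self, query, key, value):
        q, k, v = self._split(self.w_q(query)), self._split(self.w_k(key)), self._split(self.w_v(value))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (B, h, Tq, Tk)
        heads = torch.softmax(scores, dim=-1) @ v             # (B, h, Tq, d_k)
        B, _, Tq, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(B, Tq, self.n_head * self.d_k)
        return self.w_o(concat)
```

For self‑attention, `query`, `key`, and `value` are the same tensor; for cross‑attention, `query` comes from the decoder and `key`/`value` from the encoder output.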

3) Position‑wise Feed‑Forward Network

Applied independently at each time step:

$$ \text{FFN}(x) = W_2\,\sigma(W_1 x + b_1) + b_2 $$

This project uses GELU for $\sigma$.
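
As a sketch, the feed‑forward block is two linear layers with GELU in between. The inner width `d_ff=2048` is the paper's base setting and the trailing dropout is an assumption; the class name and argument names are illustrative and may not match `modules/ff.py`.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    def __init__(self, n_embd=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, d_ff),   # W_1 x + b_1
            nn.GELU(),                 # sigma
            nn.Linear(d_ff, n_embd),   # W_2 (.) + b_2
            nn.Dropout(dropout),       # assumed regularization
        )

    def forward(self, x):
        # nn.Linear acts on the last dimension, so every position is
        # transformed independently with the same weights.
        return self.net(x)
```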

4) Decoder Causality

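Decoder self‑attention is masked so each position attends only to itself and earlier tokens. The mask is added to the attention scores before the softmax:

$$ \text{mask}_{t,\tau} = \begin{cases} 0 & \tau \le t \\ -\infty & \tau > t \end{cases} $$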
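
One common way to build such a mask is with `torch.triu`, which selects the entries strictly above the diagonal (the future positions). The helper name `causal_mask` is illustrative.

```python
import torch


def causal_mask(T: int) -> torch.Tensor:
    """Additive (T, T) mask: 0 where tau <= t, -inf where tau > t."""
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    return torch.zeros(T, T).masked_fill(future, float("-inf"))


scores = torch.zeros(3, 3)  # uniform scores, for illustration only
weights = torch.softmax(scores + causal_mask(3), dim=-1)
# Row t spreads its weight evenly over positions 0..t and gives 0 to the future.
```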

5) Loss with Padding Ignore

Cross‑entropy is computed while ignoring <pad> tokens, so the model isn’t penalized for padding predictions.
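
In PyTorch this is usually done with the `ignore_index` argument of `cross_entropy`, which drops the marked targets from both the sum and the averaging denominator. The sketch below assumes `<pad>` has id 0; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumption: id of <pad> in the tokenizer


def translation_loss(logits, targets, pad_id=PAD_ID):
    """logits: (B, T, V) decoder outputs; targets: (B, T) token ids."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        targets.reshape(-1),                  # (B*T,)
        ignore_index=pad_id,                  # padded positions contribute nothing
    )
```

Because padded positions are excluded, changing the logits at those positions leaves the loss, and therefore the gradients, unchanged.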


Training Notes

  • Optimizer: AdamW with a StepLR learning‑rate schedule
  • Mixed precision via torch.amp
  • Gradient clipping to stabilize training

Because EN→DE translation is resource‑intensive, expect slow convergence without substantial GPU time.
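
The three training ingredients listed above fit together roughly as in the sketch below. It is not the loop in `utils/train.py`: the helper names, the `step_size`/`gamma` values, and the clipping threshold are assumptions. Gradients are unscaled before clipping so that the threshold applies to their true magnitude.

```python
import torch
import torch.nn as nn


def make_training_tools(model, lr=3e-4, step_size=1, gamma=0.95, use_amp=False, device_type="cpu"):
    """Build the optimizer, LR scheduler, and gradient scaler (values are assumed)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
    try:
        scaler = torch.amp.GradScaler(device_type, enabled=use_amp)  # newer PyTorch
    except AttributeError:
        scaler = torch.cuda.amp.GradScaler(enabled=use_amp)          # older releases
    return optimizer, scheduler, scaler


def train_step(model, batch, loss_fn, optimizer, scaler, device_type="cpu", use_amp=False, max_norm=1.0):
    """One optimization step with optional mixed precision and gradient clipping."""
    inputs, targets = batch
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device_type, enabled=use_amp):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # clip true gradient values, not scaled ones
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

With `StepLR`, `scheduler.step()` is called once per epoch after the optimizer steps, multiplying the learning rate by `gamma` every `step_size` epochs.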


References


Acknowledgements

This project is a learning‑oriented reimplementation intended for clarity and experimentation. Contributions and improvements are welcome.
