A minimal implementation of a transformer-based Large Language Model (LLM), inspired by the Deepseek V2 architecture and by minimal codebases such as minGPT and nanoGPT. This project includes features like low-rank attention compression, SwiGLU activation, and rotary positional embeddings.
- Multi-head latent attention with low-rank compression for keys, values, and queries (see the sketch after this list).
- SwiGLU-style gated feed-forward layers (the current implementation uses a plain SiLU activation instead).
- Rotary Positional Embeddings (RoPE), which encode positions as rotations of the query/key vectors (also sketched below).
- Lightweight and modular design for easy experimentation.
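The low-rank compression idea behind multi-head latent attention can be shown with a short, self-contained sketch. This is not the code from src/model.py; it is a minimal PyTorch illustration in which keys and values are squeezed through a small shared latent before being expanded back per head, and all dimension names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankKVAttention(nn.Module):
    """Illustrative multi-head attention with low-rank (latent) compression of K and V."""

    def __init__(self, embed_dim: int = 512, num_heads: int = 8, kv_latent_dim: int = 64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)       # queries could be compressed the same way
        self.kv_down = nn.Linear(embed_dim, kv_latent_dim, bias=False)  # squeeze hidden states into a small latent
        self.k_up = nn.Linear(kv_latent_dim, embed_dim, bias=False)     # expand latent back to per-head keys
        self.v_up = nn.Linear(kv_latent_dim, embed_dim, bias=False)     # expand latent back to per-head values
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        latent = self.kv_down(x)  # (b, t, kv_latent_dim): this is what a KV cache would store
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, d))
```

The payoff of this factorisation is that only the small latent needs to be cached during generation, instead of full per-head keys and values.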
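Rotary embeddings can likewise be sketched in a few lines. The half-split channel pairing below is one common convention and is not necessarily the exact layout used in src/model.py.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate query/key vectors of shape (batch, heads, seq_len, head_dim) by position-dependent angles."""
    *_, seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per channel pair, decaying geometrically with channel index.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=x.dtype, device=x.device) / half))
    angles = torch.arange(seq_len, dtype=x.dtype, device=x.device)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()  # each (seq_len, half), broadcast over batch and heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
```

Queries and keys are both passed through `apply_rope` before attention scores are computed, so the dot product depends only on relative positions.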
Here's an overview of the key files and directories in this project:
- `src/model.py`: Core implementation of the model, including attention mechanisms, feed-forward layers, and the transformer architecture.
- `src/trainer.py`: Training loop for the model.
- `src/main.py`: Entry point for running the model (inference).
- `src/config.py`: Configuration utilities for model hyperparameters and training settings.
- Clone the repository:

```bash
git clone https://github.com/clement-cvll/open-large-language-model
cd open-large-language-model
```

- Install dependencies:

```bash
uv sync
```
To train the model, use the provided script:

```bash
python src/trainer.py
```

Generate text with the trained model:

```bash
python src/main.py
```

Modify `src/config.py` to adjust hyperparameters such as the following (an illustrative config sketch follows the list):

- `embed_dim`: Embedding dimension.
- `num_attention_heads`: Number of attention heads.
- `num_layers`: Number of transformer layers.
- `device`: Torch device (`mps` is the default).
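As an illustration only (the real `src/config.py` may be organised differently), a configuration of this kind is often a small dataclass whose fields mirror the hyperparameters above; every value below except the `mps` device is an assumption rather than a project default.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Field names follow the hyperparameters listed above; the values are illustrative.
    embed_dim: int = 512
    num_attention_heads: int = 8
    num_layers: int = 6
    device: str = "mps"  # mps is the default torch device per this README

# Override a couple of fields for a quick experiment.
config = ModelConfig(embed_dim=256, num_layers=4)
```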
The tokenizer used in this project is the Pleias-350m-Preview tokenizer from Hugging Face (link).
The model is designed to work with the Common Corpus dataset (link), a large, open, and permissively licensed multilingual dataset.
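A hedged sketch of loading both with the `transformers` and `datasets` libraries is shown below. The Hugging Face hub ids `PleIAs/Pleias-350m-Preview` and `PleIAs/common_corpus`, as well as the `text` column name, are assumptions; confirm them against the linked pages.

```python
from transformers import AutoTokenizer
from datasets import load_dataset

# Hub ids below are assumptions; check the tokenizer and dataset links in this README.
tokenizer = AutoTokenizer.from_pretrained("PleIAs/Pleias-350m-Preview")
dataset = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

sample = next(iter(dataset))
# The "text" column name is also an assumption about the dataset schema.
ids = tokenizer(sample["text"], truncation=True, max_length=512)["input_ids"]
print(ids[:20])
```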
This project draws inspiration from:
- minGPT: A minimal PyTorch re-implementation of GPT by Andrej Karpathy.
- nanoGPT: A simplified and efficient GPT implementation, also by Andrej Karpathy.
- Deepseek V2: For modern architectural choices like low-rank attention and SwiGLU.