Custom Large Language Model (LLM) Implementation

A minimal implementation of a transformer-based Large Language Model (LLM) inspired by modern architectures like Deepseek V2, minGPT, and nanoGPT. This project includes features like low-rank attention compression, SwiGLU activation, and rotary positional embeddings.

Features

  • Multi-head latent attention with low-rank compression for keys, values, and queries (see the sketch after this list).
  • SwiGLU activation for improved gating (the current implementation uses a plain SiLU activation instead).
  • Rotary Positional Embeddings (RoPE) for better positional encoding.
  • Lightweight and modular design for easy experimentation.
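
As a rough illustration of the low-rank compression idea, here is a minimal, self-contained sketch of an attention block that projects the hidden state into a small latent before expanding it back into keys and values. The dimension names and module structure are assumptions for illustration and do not necessarily match the layers in src/model.py (RoPE is omitted for brevity):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LowRankAttention(nn.Module):
        """Multi-head attention that compresses K/V through a shared low-rank latent."""
        def __init__(self, embed_dim=512, latent_dim=128, num_heads=8):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = embed_dim // num_heads
            self.down_kv = nn.Linear(embed_dim, latent_dim, bias=False)  # compress to latent
            self.up_k = nn.Linear(latent_dim, embed_dim, bias=False)     # expand to keys
            self.up_v = nn.Linear(latent_dim, embed_dim, bias=False)     # expand to values
            self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
            self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

        def forward(self, x):
            B, T, C = x.shape
            latent = self.down_kv(x)  # (B, T, latent_dim) shared K/V latent
            q, k, v = self.q_proj(x), self.up_k(latent), self.up_v(latent)
            # Split into heads: (B, num_heads, T, head_dim)
            q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.out_proj(out.transpose(1, 2).reshape(B, T, C))

For example, LowRankAttention()(torch.randn(2, 16, 512)) returns a tensor of shape (2, 16, 512).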

Project Files

Here's an overview of the key files and directories in this project:

  • src/model.py: Core implementation of the model, including attention mechanisms, feed-forward layers, and the transformer architecture.
  • src/trainer.py: Training loop for the model.
  • src/main.py: Entry point for running the model (inference).
  • src/config.py: Configuration utilities for model hyperparameters and training config.

Installation

  1. Clone the repository:
    git clone https://github.com/clement-cvll/open-large-language-model
    cd open-large-language-model
  2. Install dependencies:
    uv sync

Usage

Training

To train the model, use the provided script:

python src/trainer.py
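
The trainer optimizes a standard next-token-prediction objective. A minimal sketch of one training step is shown below; the function and argument names are placeholders, not the actual src/trainer.py API:

    import torch.nn.functional as F

    def train_step(model, optimizer, batch_ids):
        """One next-token-prediction step; batch_ids is a (B, T+1) tensor of token ids."""
        inputs, targets = batch_ids[:, :-1], batch_ids[:, 1:]
        logits = model(inputs)  # (B, T, vocab_size), assumed output shape
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()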

Inference

Generate text with the trained model:

python src/main.py
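
A minimal sketch of greedy autoregressive decoding, assuming the model returns (batch, seq, vocab) logits; the names below are placeholders, not the project's actual API:

    import torch

    @torch.no_grad()
    def generate(model, tokenizer, prompt, max_new_tokens=50, device="mps"):
        # Encode the prompt, then append one greedily chosen token at a time.
        ids = torch.tensor([tokenizer.encode(prompt)], device=device)
        for _ in range(max_new_tokens):
            logits = model(ids)  # (1, T, vocab_size), assumed output shape
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
        return tokenizer.decode(ids[0].tolist())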

Configuration

Modify src/config.py to adjust hyperparameters such as the following (a sketch of such a config appears after this list):

  • embed_dim: Embedding dimension.
  • num_attention_heads: Number of attention heads.
  • num_layers: Number of transformer layers.
  • device: Torch device (mps is default).
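
A minimal sketch of how these hyperparameters could be grouped in a config object; the field names follow the list above, but the actual contents of src/config.py may differ:

    import torch
    from dataclasses import dataclass

    @dataclass
    class ModelConfig:
        embed_dim: int = 512            # embedding dimension
        num_attention_heads: int = 8    # number of attention heads
        num_layers: int = 6             # number of transformer layers
        # Fall back to CPU when Apple's Metal backend is unavailable.
        device: str = "mps" if torch.backends.mps.is_available() else "cpu"

    config = ModelConfig()
    print(config)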

Tokenizer and Dataset

Tokenizer

The tokenizer used in this project is the Pleias-350m-Preview tokenizer from Hugging Face (link).
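
It can be loaded with the transformers library; the hub id "PleIAs/Pleias-350m-Preview" below is an assumption about where the tokenizer lives on Hugging Face:

    from transformers import AutoTokenizer

    # Hub id assumed; replace it with the tokenizer actually used by the project.
    tokenizer = AutoTokenizer.from_pretrained("PleIAs/Pleias-350m-Preview")
    ids = tokenizer("Hello, world!")["input_ids"]
    print(ids)
    print(tokenizer.decode(ids))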

Dataset

The model is designed to work with the Common Corpus dataset (link), a large, open, and permissively licensed multilingual dataset.
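
A minimal sketch of streaming the dataset with the datasets library; the hub id "PleIAs/common_corpus" and the "text" column name are assumptions:

    from datasets import load_dataset

    # Stream the corpus instead of downloading it in full; hub id and column name assumed.
    dataset = load_dataset("PleIAs/common_corpus", split="train", streaming=True)
    for example in dataset.take(3):
        print(example["text"][:100])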

Inspiration

This project draws inspiration from:

  • minGPT: A minimal PyTorch re-implementation of GPT by Andrej Karpathy.
  • nanoGPT: A simplified and efficient GPT implementation, also by Andrej Karpathy.
  • Deepseek V2: For modern architectural choices like low-rank attention and SwiGLU.
