LLM Foundry is the source repository for the development of models, datasets, and accompanying artifacts of the Polyglot project at the University of Bonn. It bundles training, evaluation, post-training, data processing, and tokenization pipelines into a single, cluster-ready code base.
- Overview
- Code of Conduct
- How to Train a Model
- Repository Structure
- Installation
- Running the Tests
- Contributing
- License
- Acknowledgments
This repository contains all source code used to develop the models, datasets, and other accompanying artifacts of the Polyglot project at the University of Bonn. It is designed to run on the University of Bonn's Marvin and Bender clusters, both of which provide dual software stacks (AMD and Intel) that the code base detects and handles.
This project adheres to a Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to polyglot@uni-bonn.de.
For a step-by-step walkthrough of the LLM Foundry—covering data collection, tokenization, evaluation harness setup, pretraining, and post-training/alignment—see HOWTO.md.
The code base is organized into the following main folders:
- alignment/: Implementation of post-training techniques for alignment, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Reward Model training, and Group Relative Policy Optimization (GRPO) with verifier-based rewards.
  - alignment/gym/: Scripts for training and evaluating language models on custom environments.
- data/: Scripts for text preprocessing (e.g., filtering and tokenization).
  - data/cc/: Scripts for working with Common Crawl data.
  - data/filters/: Dataset filtering and annotation pipelines for text corpus curation.
  - data/parsers/: Parsers for converting raw datasets into a standardized format or for performing stratification for evaluation.
  - data/tokenization/: Tokenization, packing, decontamination, and validation split utilities for pretraining and SFT datasets.
- distributed/: Scripts for training and evaluating language models with DDP and FSDP.
- evals/: Scripts for evaluating language models via the lm-evaluation-harness.
- merge/: Scripts for running different merging techniques via mergekit.
- synthetic/: Scripts for generating synthetic datasets with vLLM.
- tests/: Unit and integration tests for our code base.
- tokenizer/: Scripts for training and evaluating tokenizers.
- utils/: Miscellaneous utilities for our code base.
The entire code base is designed to run on Marvin or Bender, the University of Bonn HPC clusters. You only need to set things up on the cluster itself, not on your local machine. Locally, you can simply clone the repository and work with the files (e.g., editing code or writing new scripts) without worrying about dual-stack setups or module loading.
On Marvin, we work with workspaces that are allocated with a specific file system.
Use utils/marvin_create_workspace.sh to allocate a workspace, clone the repository, and prepare the directory layout. Open the script first and edit the user customization section at the top (username, file_system, work_group, email, workspace_name) to match your account, then run it from a Marvin login node:
bash utils/marvin_create_workspace.sh

For Bender users, /home/$USER is the default workspace directory, so you can just clone the repository there and start working.
Marvin and Bender have a dual software stack (AMD and Intel). The single .modules.sh file at the repository root loads the right build for you. It auto-detects the stack from the SLURM environment, so most of the time you can just source it and forget about it:
# Marvin:
# - Partitions with "gpu" in the name (e.g. sgpu, mlgpu) -> AMD stack
# - All other partitions -> Intel stack
#
# Bender:
# - Partition "a100" -> AMD stack
# - Partition "a40" -> Intel stack
source "$workdir/.modules.sh"

You can also force a specific stack by setting the LLM_FOUNDRY_STACK environment variable before sourcing:
LLM_FOUNDRY_STACK=amd source "$workdir/.modules.sh" # GPU/training stack
LLM_FOUNDRY_STACK=intel source "$workdir/.modules.sh" # CPU/data stack

Sourcing prints which stack was selected, why, and the resulting module list, so your job logs always show the resolved environment.
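The selection rules described above can be restated as a small Python function. This is only an illustrative sketch: the actual logic lives in .modules.sh, and the function name `resolve_stack`, its arguments, and its behavior for unlisted Bender partitions (raising an error) are assumptions made for this example, not part of the repository.

```python
from typing import Optional

VALID_STACKS = ("amd", "intel")

# Bender partition-to-stack mapping, as listed in this README.
BENDER_PARTITIONS = {"a100": "amd", "a40": "intel"}


def resolve_stack(cluster: str, partition: str, override: Optional[str] = None) -> str:
    """Return "amd" or "intel" for a cluster/partition pair.

    `override` plays the role of the LLM_FOUNDRY_STACK environment variable.
    """
    # An explicit override takes precedence over auto-detection.
    if override:
        if override not in VALID_STACKS:
            raise ValueError(f"unknown stack override: {override!r}")
        return override

    if cluster == "marvin":
        # Partitions with "gpu" in the name use the AMD stack, all others Intel.
        return "amd" if "gpu" in partition else "intel"

    if cluster == "bender":
        # Assumption: partitions not listed in the README are rejected.
        if partition not in BENDER_PARTITIONS:
            raise ValueError(f"unknown Bender partition: {partition!r}")
        return BENDER_PARTITIONS[partition]

    raise ValueError(f"unknown cluster: {cluster!r}")
```

For instance, `resolve_stack("marvin", "sgpu")` returns `"amd"`, while `resolve_stack("bender", "a40")` returns `"intel"`. Inside a job, the partition name could be read from SLURM's `SLURM_JOB_PARTITION` variable; whether .modules.sh relies on that particular variable is not stated here.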
If you are working on JSC Jupiter, see utils/jupiter/README.md for JSC-specific module and installation scripts.
Use the pyproject.toml to install a specific set of dependencies. The available extras are:
- data: For downloading and preprocessing datasets.
- tokenizer: For training and evaluating tokenizers with the pinned SentencePiece-compatible stack.
- distributed: For training language models with our DDP and FSDP implementations.
- synth: For generating synthetic samples with vLLM.
- trl: For post-training and alignment with TRL.
- tests: For running our test suite.
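pip accepts several extras at once as a comma-separated list inside the brackets. The sketch below builds such a requirement string and rejects names that are not in the list above. The helper `editable_requirement` is hypothetical and not part of the repository; the default path simply mirrors the one used in this README.

```python
from typing import Iterable

# Extras listed in this README.
KNOWN_EXTRAS = ("data", "tokenizer", "distributed", "synth", "trl", "tests")


def editable_requirement(extras: Iterable[str], path: str = "./llm-foundry/.") -> str:
    """Build the argument for `pip install -e` from a set of extras."""
    # Drop duplicates while preserving the order given by the caller.
    unique = list(dict.fromkeys(extras))

    unknown = [name for name in unique if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras: {unknown}")

    # Without extras, the plain path installs only the base dependencies.
    if not unique:
        return path
    return f"{path}[{','.join(unique)}]"
```

Here `editable_requirement(["data", "tokenizer"])` yields `./llm-foundry/.[data,tokenizer]`, which can be quoted and passed to `pip install -e`.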
For example:
pip install -e "./llm-foundry/.[distributed]" # for DDP/FSDP training

Install the test dependencies first:

pip install -e "./llm-foundry/.[tests]"

Run all test scripts in sequence:

python tests/

Or run a specific script (e.g., the distributed training tests):

python tests/tests_distributed.py

Contributions are welcome! Please see CONTRIBUTING.md for details on how to set up your development environment, the contribution workflow (forking, branching, squashing commits, opening a pull request), and the project's style guide.
This project is licensed under the Apache License 2.0. See LICENSE for the full license text.
Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.
We also gratefully acknowledge access to the Marvin and Bender clusters, hosted by the University of Bonn, and maintained by the university's High Performance Computing Team.