LLM Foundry

LLM Foundry is the source repository for the development of models, datasets, and accompanying artifacts of the Polyglot project at the University of Bonn. It bundles training, evaluation, post-training, data processing, and tokenization pipelines into a single, cluster-ready code base.

Overview

This repository contains all source code used to develop the models, datasets, and accompanying artifacts of the Polyglot project at the University of Bonn. It is designed to run on the university's Marvin and Bender clusters, both of which provide dual software stacks (AMD and Intel) that the code base accounts for.

Code of Conduct

This project adheres to a Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to polyglot@uni-bonn.de.

How to Train a Model

For a step-by-step walkthrough of LLM Foundry (data collection, tokenization, evaluation harness setup, pretraining, and post-training/alignment), see HOWTO.md.

Repository Structure

The code base is organized into the following main folders:

alignment/ — Implementations of post-training techniques for alignment: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Reward Model training, and Group Relative Policy Optimization (GRPO) with verifier-based rewards.
- alignment/gym/ — Scripts for training and evaluating language models on custom environments.
data/ — Scripts for text preprocessing (e.g., filtering and tokenization).
- data/cc/ — Scripts for working with Common Crawl data.
- data/filters/ — Dataset filtering and annotation pipelines for text corpus curation.
- data/parsers/ — Parsers that convert raw datasets into a standardized format or stratify them for evaluation.
- data/tokenization/ — Tokenization, packing, decontamination, and validation split utilities for pretraining and SFT datasets.
distributed/ — Scripts for training and evaluating language models with DDP and FSDP.
evals/ — Scripts for evaluating language models via the lm-evaluation-harness.
merge/ — Scripts for running different merging techniques via mergekit.
synthetic/ — Scripts for generating synthetic datasets with vLLM.
tests/ — Unit and integration tests for our code base.
tokenizer/ — Scripts for training and evaluating tokenizers.
utils/ — Miscellaneous utilities for our code base.

Installation

The entire code base is designed to run on Marvin or Bender, the University of Bonn HPC clusters. You only need to set up the environment on the cluster itself, not on your local machine. Locally, you can simply clone the repository and work with the files (e.g., editing code or writing new scripts) without worrying about the dual-stack setup or module loading.

Workspace Setup

On Marvin, we work in workspaces, each of which is allocated on a specific file system.

Use utils/marvin_create_workspace.sh to allocate a workspace, clone the repository, and prepare the directory layout. Open the script first and edit the user customization section at the top (username, file_system, work_group, email, workspace_name) to match your account, then run it from a Marvin login node:

bash utils/marvin_create_workspace.sh

For Bender users, /home/$USER is the default workspace directory, so you can just clone the repository there and start working.

Module Stack Selection

Marvin and Bender have a dual software stack (AMD and Intel). The single .modules.sh file at the repository root loads the right build for you. It auto-detects the stack from the SLURM environment, so most of the time you can just source it and forget about it:

# Marvin:
# - Partitions with "gpu" in the name (e.g. sgpu, mlgpu)  -> AMD stack
# - All other partitions                                  -> Intel stack
#
# Bender:
# - Partition "a100"                                      -> AMD stack
# - Partition "a40"                                       -> Intel stack
source "$workdir/.modules.sh"

You can also force a specific stack by setting the LLM_FOUNDRY_STACK environment variable before sourcing:

LLM_FOUNDRY_STACK=amd   source "$workdir/.modules.sh"   # GPU/training stack
LLM_FOUNDRY_STACK=intel source "$workdir/.modules.sh"   # CPU/data stack

Sourcing prints which stack was selected, why, and the resulting module list, so your job logs always show the resolved environment.
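The selection rules above can be sketched as a small shell function. This is only an illustration of the documented behavior, not the contents of .modules.sh: the function name resolve_stack and its two arguments are hypothetical, and reading the partition from SLURM_JOB_PARTITION in the usage line is an assumption about how the partition might be obtained inside a job.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the stack-selection rules described above.
# NOT the actual .modules.sh; resolve_stack and its arguments are illustrative.

# resolve_stack CLUSTER PARTITION
# Prints "amd" or "intel"; returns non-zero for an unknown combination.
resolve_stack() {
    local cluster="$1" partition="$2"

    # An explicit override via LLM_FOUNDRY_STACK always wins.
    if [ -n "${LLM_FOUNDRY_STACK:-}" ]; then
        case "$LLM_FOUNDRY_STACK" in
            amd|intel) echo "$LLM_FOUNDRY_STACK"; return 0 ;;
            *) echo "unknown LLM_FOUNDRY_STACK: $LLM_FOUNDRY_STACK" >&2; return 1 ;;
        esac
    fi

    case "$cluster" in
        marvin)
            # Partitions with "gpu" in the name use AMD; all others use Intel.
            case "$partition" in
                *gpu*) echo amd ;;
                *)     echo intel ;;
            esac
            ;;
        bender)
            case "$partition" in
                a100) echo amd ;;
                a40)  echo intel ;;
                *) echo "unknown Bender partition: $partition" >&2; return 1 ;;
            esac
            ;;
        *)
            echo "unknown cluster: $cluster" >&2
            return 1
            ;;
    esac
}

# Example (inside a SLURM job on Marvin; the variable choice is an assumption):
#   resolve_stack marvin "${SLURM_JOB_PARTITION:-}"
```

Checking the override first keeps a forced stack independent of the partition name, which matches the description of LLM_FOUNDRY_STACK above.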

If you are working on JSC Jupiter, see utils/jupiter/README.md for JSC-specific module and installation scripts.

Installing Dependencies

Dependencies are grouped into extras in pyproject.toml, so you can install only the set you need. The available extras are:

data — For downloading and preprocessing datasets.
tokenizer — For training and evaluating tokenizers with the pinned SentencePiece-compatible stack.
distributed — For training language models with our DDP and FSDP implementations.
synth — For generating synthetic samples with vLLM.
trl — For post-training and alignment with TRL.
tests — For running our test suite.

For example:

pip install -e "./llm-foundry/.[distributed]"  # for DDP/FSDP training
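Several extras can be installed in one go, since pip accepts a comma-separated list inside the brackets. The helper below is a minimal sketch of building such an install target while catching typos early. The function name extras_spec and the KNOWN_EXTRAS variable are hypothetical and not part of the repository; the list of names simply mirrors the extras listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the editable-install target for one or more extras.
# extras_spec and KNOWN_EXTRAS are illustrative names, not part of the repository.

KNOWN_EXTRAS="data tokenizer distributed synth trl tests"

# extras_spec REPO_DIR EXTRA [EXTRA...]
# Prints REPO_DIR/.[extra1,extra2,...]; fails on an unknown or missing extra.
extras_spec() {
    local repo_dir="$1"; shift
    local joined="" extra
    for extra in "$@"; do
        case " $KNOWN_EXTRAS " in
            # Append with a comma only when something is already collected.
            *" $extra "*) joined="${joined:+$joined,}$extra" ;;
            *) echo "unknown extra: $extra" >&2; return 1 ;;
        esac
    done
    [ -n "$joined" ] || { echo "no extras given" >&2; return 1; }
    echo "${repo_dir}/.[${joined}]"
}

# Usage:
#   pip install -e "$(extras_spec ./llm-foundry data tokenizer)"
```

Keep the resulting target quoted when passing it to pip, as in the example above, so the shell does not treat the square brackets as a glob pattern.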

Running the Tests

Install the test dependencies first:

pip install -e "./llm-foundry/.[tests]"

Run all test scripts in sequence:

python tests/

Or run a specific script (e.g., the distributed training tests):

python tests/tests_distributed.py

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details on how to set up your development environment, the contribution workflow (forking, branching, squashing commits, opening a pull request), and the project's style guide.

License

This project is licensed under the Apache License 2.0. See LICENSE for the full license text.

Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge access to the Marvin and Bender clusters, hosted by the University of Bonn, and maintained by the university's High Performance Computing Team.
