A lightweight and secure Transformer language model trained from scratch on a single medical textbook—Grant's Atlas of Anatomy.
MiniHealthLM is designed for rapid experimentation, low-resource environments, and educational research in healthcare language modeling.
This project explores how to build and pretrain a domain-specific LLM from scratch using:
- Only a single PDF textbook as the corpus
- Custom tokenizer trained from that corpus
- Privacy-aware data sanitization
- Lightweight Qwen-style Transformer architecture
It aims to demonstrate how LLMs can be scaled down and securely trained for specific domains like medicine and anatomy, while remaining fully transparent and modular.
MiniHealthLM includes a complete end-to-end pretraining pipeline:
- 📖 PDF ➜ plain text (extract_text.py)
- 🧼 Secure preprocessing (sanitize_data.py, using PII filters)
- 🔡 Tokenizer training (Byte-Level BPE; a minimal sketch of these first three steps follows this list)
- 🧠 Transformer pretraining (Qwen-style GQA + RoPE + RMSNorm)
- 💾 Checkpointing & loss tracking with TQDM
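The first three stages can be condensed into a short sketch. This is only an illustration, not the repo's code: it assumes pypdf for PDF extraction, a simple regex-based PII scrub (sanitize_data.py may apply stronger filters), the Hugging Face tokenizers library for Byte-Level BPE, and the file layout shown later in this README.

```python
# Minimal sketch of the PDF ➜ text ➜ sanitized corpus ➜ tokenizer pipeline.
# Assumes pypdf and tokenizers are installed; paths follow the repo layout below.
import re
from pypdf import PdfReader
from tokenizers import ByteLevelBPETokenizer

# 1. PDF ➜ plain text (the job of extract_text.py)
reader = PdfReader("data/corpus.pdf")
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Secure preprocessing: mask obvious PII-like patterns (emails, phone numbers).
#    These regexes are illustrative only; a production pipeline would use a
#    dedicated PII-detection tool.
clean_text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", raw_text)
clean_text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", clean_text)
with open("data/corpus.txt", "w", encoding="utf-8") as f:
    f.write(clean_text)

# 3. Byte-Level BPE tokenizer trained only on the sanitized corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/corpus.txt"],
    vocab_size=32_000,  # assumed; see train_tokenizer.py for the real value
    min_frequency=2,
    special_tokens=["<pad>", "<bos>", "<eos>", "<unk>"],
)
tokenizer.save_model("data/tokenizer")
```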
Training Dataset:
📘 Grant's Atlas of Anatomy
➡️ PDF Link
| Component | Description |
|---|---|
| Embedding | Token Embedding (Byte BPE) |
| Layers | 20 Transformer blocks |
| Hidden Size | 3072 |
| Attention | Grouped Query Attention (GQA) |
| Heads | 16 query heads, 4 key-value heads |
| Norm | RMSNorm |
| Positional Bias | Rotary Position Embeddings (RoPE) |
| FFN | SiLU MLP (4x expansion) |
| Context Length | 1024 tokens |
| Params (est.) | ~0.6 Billion |
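For reference, the table above maps onto a configuration roughly like the sketch below. This is a hypothetical version of what src/config.py might hold; the field names and the vocabulary size are assumptions, and only the values copied from the table come from the repo.

```python
# Hypothetical configuration mirroring the architecture table above.
# Field names and vocab_size are assumptions; src/config.py is the source of truth.
from dataclasses import dataclass

@dataclass
class MiniHealthLMConfig:
    vocab_size: int = 32_000              # assumed; set by the trained BPE tokenizer
    hidden_size: int = 3072
    num_layers: int = 20                  # Transformer blocks
    num_attention_heads: int = 16         # query heads
    num_kv_heads: int = 4                 # key/value heads shared via GQA
    intermediate_size: int = 3072 * 4     # SiLU MLP, 4x expansion
    max_position_embeddings: int = 1024   # context length
    rms_norm_eps: float = 1e-6
    rope_theta: float = 10_000.0          # RoPE base frequency
```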
- ⚕️ Domain-First: Model trained purely on anatomical medical text
- 🔐 Secure Pretraining: Removes PII and sensitive content before tokenization
- 🔬 Small Yet Complete: Implements modern LLM features like RoPE, GQA, and RMSNorm (two of these are illustrated in the sketch after this list)
- 🧠 Trained From Scratch: No pretraining dependency or transfer — 100% domain-rooted
- ⚡ Single-GPU Ready: Efficient enough to run on local or lab hardware
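Two of the features named above are compact enough to show directly. The PyTorch sketch below illustrates RMSNorm and the key/value head sharing behind Grouped Query Attention; it mirrors the ideas used in src/model.py but is not copied from it.

```python
# Illustrative PyTorch sketches of RMSNorm and GQA key/value sharing.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def expand_kv_heads(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """GQA: repeat each key/value head so it is shared by n_rep query heads.

    kv: (batch, n_kv_heads, seq_len, head_dim)
    returns: (batch, n_kv_heads * n_rep, seq_len, head_dim)
    """
    return kv.repeat_interleave(n_rep, dim=1)

# With 16 query heads and 4 key/value heads, each KV head serves 4 query heads.
k = torch.randn(1, 4, 1024, 192)              # head_dim 192 = 3072 / 16
k_expanded = expand_kv_heads(k, n_rep=16 // 4)  # -> (1, 16, 1024, 192)
```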
pip install -r requirements.txt   # install dependencies
python extract_text.py            # PDF ➜ plain text
python sanitize_data.py           # remove PII and sensitive content
python train_tokenizer.py         # train the Byte-Level BPE tokenizer
python train.py                   # pretrain the model

MiniHealthLM/
├── checkpoints/ # Saved model checkpoints
├── data/
│ ├── corpus.pdf # Grant's Atlas PDF
│ ├── corpus.txt # Cleaned training text
│ └── tokenizer/ # Tokenizer files
├── src/
│ ├── config.py # Model settings
│ ├── model.py # Qwen-style transformer
│ ├── dataset.py # DataLoader
│ └── utils.py # Checkpointing utilities
├── extract_text.py
├── sanitize_data.py
├── train_tokenizer.py
├── test_tokenizer.py
├── train.py
└── requirements.txt
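The heart of train.py (loss tracking with TQDM and periodic checkpointing into checkpoints/) follows a standard pretraining loop. The sketch below is a hedged illustration under assumed names (model, dataloader, learning rate, checkpoint interval), not the repo's exact code.

```python
# Minimal sketch of a pretraining loop with tqdm loss tracking and checkpointing.
# Names and hyperparameters are assumptions; see train.py for the real loop.
import torch
from tqdm import tqdm

def train(model, dataloader, epochs=1, ckpt_every=1000, device="cuda"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    step = 0
    for epoch in range(epochs):
        pbar = tqdm(dataloader, desc=f"epoch {epoch}")
        for input_ids, labels in pbar:
            input_ids, labels = input_ids.to(device), labels.to(device)
            logits = model(input_ids)  # (batch, seq, vocab)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            pbar.set_postfix(loss=f"{loss.item():.4f}")  # live loss in the progress bar
            step += 1
            if step % ckpt_every == 0:
                torch.save(
                    {"model": model.state_dict(),
                     "optimizer": optimizer.state_dict(),
                     "step": step},
                    f"checkpoints/step_{step}.pt",
                )
```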
Elias Hossain
Machine Learning Researcher | Secure LLMs | Biomedical NLP