A lightweight and secure Transformer language model trained from scratch on a single medical textbook—Grant's Atlas of Anatomy.
MiniHealthLM is designed for rapid experimentation, low-resource environments, and educational research in healthcare language modeling.
This project explores how to build and pretrain a domain-specific LLM from scratch using:
- Only a single PDF textbook as the corpus
- Custom tokenizer trained from that corpus
- Privacy-aware data sanitization
- Lightweight Qwen-style Transformer architecture
It aims to demonstrate how LLMs can be scaled down and securely trained for specific domains like medicine and anatomy, while remaining fully transparent and modular.
MiniHealthLM includes a complete end-to-end pretraining pipeline:
- 📖 PDF ➜ plain text (extract_text.py)
- 🧼 Secure preprocessing (sanitize_data.py, using PII filters)
- 🔡 Tokenizer training (Byte-Level BPE; a minimal sketch of these first three steps follows this list)
- 🧠 Transformer pretraining (Qwen-style GQA + RoPE + RMSNorm)
- 💾 Checkpointing & loss tracking with TQDM
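The first three stages can be condensed into a short sketch. This is only an illustration, not the repo's code: it assumes pypdf for PDF extraction, a simple regex-based PII scrub (sanitize_data.py may apply stronger filters), the Hugging Face tokenizers library for Byte-Level BPE, and the file layout shown later in this README.

```python
# Minimal sketch of the PDF ➜ text ➜ sanitized corpus ➜ tokenizer pipeline.
# Assumes pypdf and tokenizers are installed; paths follow the repo layout below.
import re
from pypdf import PdfReader
from tokenizers import ByteLevelBPETokenizer

# 1. PDF ➜ plain text (the job of extract_text.py)
reader = PdfReader("data/corpus.pdf")
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Secure preprocessing: mask obvious PII-like patterns (emails, phone numbers).
#    These regexes are illustrative only; a production pipeline would use a
#    dedicated PII-detection tool.
clean_text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", raw_text)
clean_text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", clean_text)
with open("data/corpus.txt", "w", encoding="utf-8") as f:
    f.write(clean_text)

# 3. Byte-Level BPE tokenizer trained only on the sanitized corpus.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/corpus.txt"],
    vocab_size=32_000,  # assumed; see train_tokenizer.py for the real value
    min_frequency=2,
    special_tokens=["<pad>", "<bos>", "<eos>", "<unk>"],
)
tokenizer.save_model("data/tokenizer")
```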
Training Dataset:
📘 Grant's Atlas of Anatomy
➡️ PDF Link
| Component | Description |
|---|---|
| Embedding | Token Embedding (Byte BPE) |
| Layers | 20 Transformer blocks |
| Hidden Size | 3072 |
| Attention | Grouped Query Attention (GQA) |
| Heads | 16 query heads, 4 key-value heads |
| Norm | RMSNorm |
| Positional Bias | Rotary Position Embeddings (RoPE) |
| FFN | SiLU MLP (4x expansion) |
| Context Length | 1024 tokens |
| Params (est.) | ~0.6 Billion |
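For reference, the table above maps onto a configuration roughly like the sketch below. This is a hypothetical version of what src/config.py might hold; the field names and the vocabulary size are assumptions, and only the values copied from the table come from the repo.

```python
# Hypothetical configuration mirroring the architecture table above.
# Field names and vocab_size are assumptions; src/config.py is the source of truth.
from dataclasses import dataclass

@dataclass
class MiniHealthLMConfig:
    vocab_size: int = 32_000              # assumed; set by the trained BPE tokenizer
    hidden_size: int = 3072
    num_layers: int = 20                  # Transformer blocks
    num_attention_heads: int = 16         # query heads
    num_kv_heads: int = 4                 # key/value heads shared via GQA
    intermediate_size: int = 3072 * 4     # SiLU MLP, 4x expansion
    max_position_embeddings: int = 1024   # context length
    rms_norm_eps: float = 1e-6
    rope_theta: float = 10_000.0          # RoPE base frequency
```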
- ⚕️ Domain-First: Model trained purely on anatomical medical text
- 🔐 Secure Pretraining: Removes PII and sensitive content before tokenization
- 🔬 Small Yet Complete: Implements modern LLM features like RoPE, GQA, and RMSNorm (two of these are illustrated in the sketch after this list)
- 🧠 Trained From Scratch: No pretraining dependency or transfer — 100% domain-rooted
- ⚡ Single-GPU Ready: Efficient enough to run on local or lab hardware
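Two of the features named above are compact enough to show directly. The PyTorch sketch below illustrates RMSNorm and the key/value head sharing behind Grouped Query Attention; it mirrors the ideas used in src/model.py but is not copied from it.

```python
# Illustrative PyTorch sketches of RMSNorm and GQA key/value sharing.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def expand_kv_heads(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """GQA: repeat each key/value head so it is shared by n_rep query heads.

    kv: (batch, n_kv_heads, seq_len, head_dim)
    returns: (batch, n_kv_heads * n_rep, seq_len, head_dim)
    """
    return kv.repeat_interleave(n_rep, dim=1)

# With 16 query heads and 4 key/value heads, each KV head serves 4 query heads.
k = torch.randn(1, 4, 1024, 192)              # head_dim 192 = 3072 / 16
k_expanded = expand_kv_heads(k, n_rep=16 // 4)  # -> (1, 16, 1024, 192)
```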
pip install -r requirements.txt   # install dependencies
python extract_text.py            # PDF ➜ plain text
python sanitize_data.py           # remove PII and sensitive content
python train_tokenizer.py         # train the Byte-Level BPE tokenizer
python train.py                   # pretrain the model

MiniHealthLM/
├── checkpoints/ # Saved model checkpoints
├── data/
│ ├── corpus.pdf # Grant's Atlas PDF
│ ├── corpus.txt # Cleaned training text
│ └── tokenizer/ # Tokenizer files
├── src/
│ ├── config.py # Model settings
│ ├── model.py # Qwen-style transformer
│ ├── dataset.py # DataLoader
│ └── utils.py # Checkpointing utilities
├── extract_text.py
├── sanitize_data.py
├── train_tokenizer.py
├── test_tokenizer.py
├── train.py
└── requirements.txt
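The heart of train.py (loss tracking with TQDM and periodic checkpointing into checkpoints/) follows a standard pretraining loop. The sketch below is a hedged illustration under assumed names (model, dataloader, learning rate, checkpoint interval), not the repo's exact code.

```python
# Minimal sketch of a pretraining loop with tqdm loss tracking and checkpointing.
# Names and hyperparameters are assumptions; see train.py for the real loop.
import torch
from tqdm import tqdm

def train(model, dataloader, epochs=1, ckpt_every=1000, device="cuda"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    step = 0
    for epoch in range(epochs):
        pbar = tqdm(dataloader, desc=f"epoch {epoch}")
        for input_ids, labels in pbar:
            input_ids, labels = input_ids.to(device), labels.to(device)
            logits = model(input_ids)  # (batch, seq, vocab)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            pbar.set_postfix(loss=f"{loss.item():.4f}")  # live loss in the progress bar
            step += 1
            if step % ckpt_every == 0:
                torch.save(
                    {"model": model.state_dict(),
                     "optimizer": optimizer.state_dict(),
                     "step": step},
                    f"checkpoints/step_{step}.pt",
                )
```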
Elias Hossain
Machine Learning Researcher | Secure LLMs | Biomedical NLP