SwitLM is a simple yet powerful Python library for creating and training custom Language Models with minimal code. Train your own LLM and export to GGUF format in just a few lines!
- Simple API: Create and train LLMs with 3 lines of code
- Multiple Model Sizes: From 50M to 3B parameters
- Modern Architecture: RoPE, RMSNorm, SwiGLU (LLaMA-style)
- Rich Dataset Support: WikiText, AG News, IMDB, SQuAD, TinyStories, and more
- GGUF Export: Direct export to GGUF format for llama.cpp
- GPU Optimized: Memory-efficient training with gradient checkpointing
- W&B Integration: Built-in experiment tracking
- Flexible Configuration: Easy customization of architecture and training
pip install switlm

Or install from source:
git clone https://github.com/Avijit0001/switlm.git
cd switlm
pip install -e .

- Python >= 3.8
- PyTorch >= 2.0.0
- transformers >= 4.30.0
- datasets >= 2.12.0
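
The minimum versions above can be checked before installing SwitLM. The following is a minimal sketch using only the standard library; `check_requirements` and `parse_version` are hypothetical helper names that are not part of SwitLM, and the version parsing is deliberately simple (it reads only the leading `major.minor.patch` digits).

```python
import sys
from importlib import metadata

# Minimum versions taken from the requirements list above.
# "torch" is the pip distribution name for PyTorch.
REQUIREMENTS = {
    "torch": (2, 0, 0),
    "transformers": (4, 30, 0),
    "datasets": (2, 12, 0),
}


def parse_version(text):
    """Return the leading (major, minor, patch) digits of a version string."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(requirements=REQUIREMENTS):
    """Return a list of human-readable problems (empty if all is fine)."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if parse_version(installed) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{name} {installed} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

An empty result means every listed package is installed at or above its minimum version.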
from switlm import SwitLMTrainer

# Create and train a 1B parameter model
trainer = SwitLMTrainer(
    n_parameters="1B",
    dataset="wikitext",
    num_layers=24
)

# Train the model
trainer.train()

# Save as GGUF (ready for llama.cpp)
trainer.save("my_model.gguf")

# Generate text with your trained model
output = trainer.generate(
    "The future of artificial intelligence",
    max_length=100,
    temperature=0.7
)
print(output)

# Train on multiple datasets sequentially
trainer = SwitLMTrainer(
    n_parameters="500M",
    dataset=["wikitext", "ag_news", "imdb"],
    num_layers=16
)
trainer.train()
trainer.save("multi_dataset_model.gguf")

from switlm import SwitLMTrainer, ModelConfig, TrainingConfig
# Custom model architecture
model_config = ModelConfig(
    n_parameters="custom",
    num_layers=20,
    hidden_size=1536,
    num_heads=12,
    intermediate_size=6144
)

# Custom training settings
training_config = TrainingConfig(
    learning_rate=1e-4,
    batch_size=4,
    num_epochs=3,
    use_wandb=True
)

trainer = SwitLMTrainer(
    model_config=model_config,
    training_config=training_config,
    dataset="wikitext"
)
trainer.train()
trainer.save("custom_model.gguf")

| Size | Parameters | Layers | Hidden Size | Heads | Use Case |
|---|---|---|---|---|---|
| 50M | ~50M | 8 | 512 | 8 | Quick experiments |
| 100M | ~100M | 12 | 768 | 12 | Small projects |
| 500M | ~500M | 16 | 1024 | 16 | Medium tasks |
| 1B | ~1B | 24 | 2048 | 16 | Serious applications |
| 3B | ~3B | 32 | 2560 | 32 | Production use |
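
When choosing a custom architecture, it helps to estimate the parameter count from the layer count and hidden size. The sketch below uses a common back-of-the-envelope formula for LLaMA-style decoders. It is not SwitLM's own accounting: the function name `estimate_parameters`, the vocabulary size of 32,000, tied input/output embeddings, bias-free linear layers, and full multi-head attention are all assumptions, so the result may differ from the nominal sizes in the table.

```python
def estimate_parameters(num_layers, hidden_size, intermediate_size,
                        vocab_size=32000, tie_embeddings=True):
    """Rough parameter count for a LLaMA-style decoder (assumed layout)."""
    # Attention: Q, K, V and output projections, each hidden x hidden.
    attention = 4 * hidden_size * hidden_size
    # SwiGLU feed-forward: gate, up and down projections.
    feed_forward = 3 * hidden_size * intermediate_size
    # Two RMSNorm gain vectors per layer (pre-attention, pre-feed-forward).
    norms = 2 * hidden_size
    per_layer = attention + feed_forward + norms

    embeddings = vocab_size * hidden_size
    # A separate output head only if embeddings are not tied (assumption).
    output_head = 0 if tie_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    return embeddings + num_layers * per_layer + final_norm + output_head


if __name__ == "__main__":
    # Values from the custom ModelConfig example: 20 layers, 1536 hidden,
    # 6144 intermediate. vocab_size is the assumed default.
    total = estimate_parameters(20, 1536, 6144)
    print(f"{total / 1e6:.0f}M parameters")
```

Under these assumptions the custom configuration shown earlier comes out at roughly 0.8B parameters. Most of that is in the feed-forward projections, so `intermediate_size` is the most effective knob for adjusting model size.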
- `wikitext` - Wikipedia articles
- `ag_news` - News classification
- `imdb` - Movie reviews
- `squad` - Question answering
- `tiny_stories` - Short stories
- `openwebtext` - Web text
- `c4` - Colossal Clean Crawled Corpus
- `bookcorpus` - Books
- `pile` - Diverse text corpus

# Load a previously trained model
trainer = SwitLMTrainer(n_parameters="1B")
trainer.load("my_model.pt")

# Continue training on new data
trainer.datasets = ["squad", "imdb"]
trainer.train(num_epochs=2)
trainer.save("continued_model.gguf")

from switlm import TextGenerator

# Create generator
generator = TextGenerator(trainer.model, trainer.tokenizer)

# Generate with custom parameters
text = generator.generate(
    "Once upon a time",
    max_length=200,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2
)
print(text)

# Save as both PyTorch and GGUF
trainer.save("my_model", format="both")
# Outputs:
# - my_model.pt (PyTorch checkpoint)
# - my_model.gguf (GGUF format)

SwitLM implements a modern transformer architecture with:
- RoPE (Rotary Position Embeddings): Better positional encoding
- RMSNorm: Simpler and cheaper than LayerNorm (no mean subtraction), with stable training
- SwiGLU: Gated feed-forward activation (used in PaLM and LLaMA)
- Pre-norm Architecture: Better gradient flow
- Gradient Checkpointing: Memory-efficient training
- Mixed Precision: Faster training with FP16
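
To make the first three components concrete, here is a minimal pure-Python sketch of RMSNorm, a SwiGLU feed-forward block, and RoPE applied to a single vector. It is illustrative only and is not SwitLM's implementation: the function names are hypothetical, the RoPE variant rotates interleaved feature pairs with the conventional base of 10000 (some LLaMA-style code pairs features differently), and a real model would use batched PyTorch tensors.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply a gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    if weight is None:
        weight = [1.0] * len(x)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def apply_rope(x, position, base=10000.0):
    """Rotate each consecutive feature pair by a position-dependent angle."""
    dim = len(x)  # must be even
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


if __name__ == "__main__":
    q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.25]
    # The attention score depends only on the relative offset (here 2).
    print(dot(apply_rope(q, 5), apply_rope(k, 3)))
    print(dot(apply_rope(q, 12), apply_rope(k, 10)))
```

The two printed scores agree up to floating-point error, which is the property that makes RoPE a relative positional encoding: shifting both positions by the same amount leaves the query-key dot product unchanged.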
TrainingConfig(
    learning_rate=3e-4,             # Learning rate
    weight_decay=0.01,              # Weight decay for regularization
    beta1=0.9,                      # Adam beta1
    beta2=0.95,                     # Adam beta2
    warmup_ratio=0.1,               # Warmup ratio
    max_grad_norm=1.0,              # Gradient clipping
    batch_size=4,                   # Batch size (auto if None)
    gradient_accumulation_steps=8,  # Accumulation steps
    max_length=512,                 # Max sequence length
    num_epochs=1,                   # Number of epochs
    use_wandb=True,                 # W&B logging
    wandb_project="switlm"          # W&B project name
)

- Start Small: Begin with 50M or 100M models to test your pipeline
- GPU Memory: Larger models require more VRAM (1B model needs ~12GB)
- Dataset Size: More data generally means better models
- Learning Rate: Start with 3e-4, adjust based on loss curves
- Sequence Length: Shorter sequences (256-512) train faster
- Gradient Accumulation: If you run out of memory, lower `batch_size` and raise `gradient_accumulation_steps` to keep the same effective batch size
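
The interaction between batch size, gradient accumulation, and warmup can be worked out with a few lines of arithmetic. The sketch below is an illustration under stated assumptions, not SwitLM's scheduler: the helper names are hypothetical, and the schedule shown (linear warmup followed by cosine decay) is a common choice that the library may or may not use.

```python
import math


def effective_batch_size(batch_size, gradient_accumulation_steps,
                         num_devices=1):
    """Sequences contributing to each optimizer step."""
    return batch_size * gradient_accumulation_steps * num_devices


def tokens_per_optimizer_step(batch_size, gradient_accumulation_steps,
                              max_length):
    """Upper bound on tokens per optimizer step (full-length sequences)."""
    return effective_batch_size(batch_size,
                                gradient_accumulation_steps) * max_length


def learning_rate_at(step, total_steps, peak_lr=3e-4, warmup_ratio=0.1,
                     min_lr=0.0):
    """Assumed schedule: linear warmup, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    # Defaults from the TrainingConfig listing: batch_size=4,
    # gradient_accumulation_steps=8, max_length=512.
    print(effective_batch_size(4, 8))
    print(tokens_per_optimizer_step(4, 8, 512))
    for step in (0, 50, 100, 500, 1000):
        print(step, learning_rate_at(step, total_steps=1000))
```

With the default settings each optimizer step sees 32 sequences. Halving `batch_size` to 2 and doubling `gradient_accumulation_steps` to 16 keeps that number unchanged while reducing peak activation memory.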
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by modern LLM architectures (LLaMA, GPT, PaLM)
- Built with PyTorch and HuggingFace Transformers
- GGUF format support for llama.cpp integration
If you find SwitLM useful, please consider giving it a star! ⭐

Made with ❤️ by the SwitLM team (Avijit Paul)