Mini-vLLM

A lightweight inference and quantization engine for studying LLMs.

Decode Throughput

Measured on RTX 4050 6 GB · greedy · 128 decode tokens

Model	HF Transformers	mini-vllm	Speedup
Qwen3-0.6B	23.0 tok/s	114.5 tok/s (megakernel)	5.0x
Qwen3.5-0.8B	15.3 tok/s	90.9 tok/s (CUDA Graph + flash-attn + fla)	6.0x
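
The Speedup column is the ratio of the two throughput columns. As a quick sanity check, the sketch below recomputes it for the Qwen3-0.6B row and derives the wall-clock time implied by 128 decode tokens. The helper names `tokens_per_second` and `speedup` are illustrative only and are not part of mini-vllm.

```python
def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


def speedup(baseline_tps: float, optimized_tps: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return optimized_tps / baseline_tps


# Values taken from the Qwen3-0.6B row of the table above.
HF_TPS = 23.0
MEGAKERNEL_TPS = 114.5
DECODE_TOKENS = 128

print(f"{speedup(HF_TPS, MEGAKERNEL_TPS):.1f}x")  # prints 5.0x

# Wall-clock decode time implied by each throughput figure.
print(f"HF: {DECODE_TOKENS / HF_TPS:.2f} s, megakernel: {DECODE_TOKENS / MEGAKERNEL_TPS:.2f} s")
```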

Streaming Inference & Megakernel

mini-vllm megakernel vs HF Transformers — same Qwen3-0.6B, same prompt, ~8x faster

Qwen3-0.6B · bf16 · greedy · same prompt · RTX 4050 6 GB
Left: HF Transformers · Right: mini-vllm megakernel

Features

Fused CUDA Megakernel -- single-kernel decode pipeline fusing embedding, all transformer layers, norm and LM head
CUDA Graph Decode -- configurable bucketed CUDA Graphs to eliminate CPU launch overhead
AWQ 4-bit Quantization -- built-in calibration and inference with Triton/CUDA kernels
SwanLab Monitoring -- real-time throughput and memory tracking (docs)
lm-eval Benchmark -- built-in evaluation harness adapter (docs)
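
To give a feel for what 4-bit weight storage means in the AWQ item above, here is a minimal sketch of group-wise asymmetric round-to-nearest quantization in plain Python. It covers only the integer storage step. The activation-aware per-channel scaling that AWQ searches for during calibration is not shown, and the function names and the example group are assumptions made for illustration, not mini-vllm's actual kernels.

```python
def quantize_group(weights, n_bits=4):
    """Quantize one group of weights to unsigned n-bit integers.

    Returns (q, scale, zero) such that w is approximately (q - zero) * scale.
    """
    qmax = (1 << n_bits) - 1  # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # constant group: any non-zero scale works
    # Zero point: the integer that represents the real value 0.0, clamped to range.
    zero = max(0, min(qmax, round(-w_min / scale)))
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Reconstruct approximate float weights from the quantized group."""
    return [(v - zero) * scale for v in q]


# Hypothetical weight group that spans zero.
group = [-0.4, -0.1, 0.0, 0.2, 0.35]
q, scale, zero = quantize_group(group)
restored = dequantize_group(q, scale, zero)

print(q)
print(max(abs(a - b) for a, b in zip(group, restored)))  # worst-case error
```

For a group that spans zero, the reconstruction error of each weight stays within half a quantization step (`scale / 2`), which is why smaller groups, with their narrower value ranges, generally lose less precision.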

Supported Models

Model	Model Backend	Attention Backend
Qwen3	`default`, `megakernel_cuda`	`sdpa`, `flash_attn`, `naive`
Qwen3.5	`default`	`sdpa`, `flash_attn`, `naive`, `fla`
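
The table can be read as a lookup from model family to allowed backends. The sketch below encodes it as a dictionary and rejects unsupported choices early. The names `SUPPORTED_BACKENDS` and `validate_backends` are hypothetical and do not come from the repository. The table also does not say whether every model backend works with every attention backend, so this check only validates each column independently.

```python
# Backend names copied from the Supported Models table.
SUPPORTED_BACKENDS = {
    "Qwen3": {
        "model_backend": {"default", "megakernel_cuda"},
        "attention_backend": {"sdpa", "flash_attn", "naive"},
    },
    "Qwen3.5": {
        "model_backend": {"default"},
        "attention_backend": {"sdpa", "flash_attn", "naive", "fla"},
    },
}


def validate_backends(model: str, model_backend: str, attention_backend: str) -> None:
    """Raise ValueError if the requested backends are not listed for the model."""
    if model not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown model family: {model!r}")
    allowed = SUPPORTED_BACKENDS[model]
    if model_backend not in allowed["model_backend"]:
        raise ValueError(
            f"{model} does not list model backend {model_backend!r}; "
            f"choose from {sorted(allowed['model_backend'])}"
        )
    if attention_backend not in allowed["attention_backend"]:
        raise ValueError(
            f"{model} does not list attention backend {attention_backend!r}; "
            f"choose from {sorted(allowed['attention_backend'])}"
        )


validate_backends("Qwen3", "megakernel_cuda", "sdpa")  # passes silently
validate_backends("Qwen3.5", "default", "fla")         # passes silently
```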

Quick Start

1. Install

git clone https://github.com/BoundlessWindMoon/minivllm.git mini-vllm
cd mini-vllm

uv venv .venv --python 3.12
source .venv/bin/activate

# PyTorch (match your CUDA version)
uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128

# Core
uv pip install -e .

# Optional: flash-attn, lm-eval, swanlab, etc.
uv pip install -e ".[all]"

# flash-linear-attention (required for Qwen3.5 linear attention)
uv pip install flash-linear-attention --no-build-isolation

2. Download Model

HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3-0.6B --local-dir ~/huggingface/Qwen3-0.6B/

# Optional: Qwen3.5 (multimodal, requires flash-linear-attention)
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3.5-0.8B --local-dir ~/huggingface/Qwen3.5-0.8B/

3. Run

# Qwen3 with megakernel backend
python main.py

# Qwen3.5 with CUDA Graph
python main.py --config configs/qwen3_5.yaml
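
The Qwen3.5 run above uses CUDA Graph decode. A captured CUDA Graph replays with fixed tensor shapes, so a common approach to bucketing is to capture one graph per batch-size bucket and pad each decode batch up to the nearest bucket. The sketch below shows only that selection step in plain Python. The bucket sizes and the `pick_bucket` helper are assumptions for illustration, and mini-vllm's own selection logic may differ.

```python
import bisect

# Hypothetical bucket sizes, kept sorted in ascending order.
BUCKETS = [1, 2, 4, 8]


def pick_bucket(batch_size: int, buckets=BUCKETS):
    """Return the smallest bucket >= batch_size, or None if none is large enough.

    None signals that no captured graph fits and the caller should fall back
    to ordinary eager execution.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    i = bisect.bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None


for bs in (1, 3, 8, 9):
    bucket = pick_bucket(bs)
    if bucket is None:
        print(f"batch {bs}: no bucket, run eagerly")
    else:
        print(f"batch {bs}: replay graph for bucket {bucket}, pad {bucket - bs} slot(s)")
```

More buckets mean less padding waste per step but more graphs to capture and hold in GPU memory, which is the trade-off a configurable bucket list exposes.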

4. Benchmark

python -m eval.run --config configs/qwen3_5.yaml --tasks arc_easy --limit 50

See docs/benchmark.md for task customization and log output.

Documentation

Configuration Reference -- all YAML parameters
Benchmark Evaluation -- lm-eval integration
Profiling & Monitoring -- PyTorch profiler, Perfetto, SwanLab
AWQ Quantization -- calibration and quantized inference

Project Structure

mini-vllm/
├── configs/            # YAML configs
├── engine/             # Inference loop, model runner, sampler
├── kernels/            # Triton & CUDA kernels
├── layers/             # Attention, MLP, RMSNorm, Rotary
├── model/              # Qwen3 / Qwen3.5 architectures
├── quantization/       # AWQ calibration and quantized layers
├── eval/               # lm-eval adapter
├── scripts/            # Benchmark, verify, profile tools
├── main.py             # Entry point: inference
└── quant.py            # Entry point: AWQ calibration

Acknowledgements

nano-vllm -- minimalist inference architecture
AutoAWQ -- AWQ quantization
mega-qwen -- fused megakernel design

License

MIT
