Measured on RTX 4050 6 GB · greedy · 128 decode tokens
| Model | HF Transformers | mini-vllm | Speedup |
|---|---|---|---|
| Qwen3-0.6B | 23.0 tok/s | 114.5 tok/s (megakernel) | 5.0x |
| Qwen3.5-0.8B | 15.3 tok/s | 90.9 tok/s (CUDA Graph + flash-attn + fla) | 5.9x |
Qwen3-0.6B · bf16 · greedy · same prompt · 6 GB RTX 4050
Left: HF Transformers · Right: mini-vllm megakernel
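
The throughput and speedup figures above follow from simple arithmetic: tokens generated divided by wall-clock time, and the ratio of two throughputs. The sketch below shows one way to take such a measurement. The function names and the `fake_decode_step` stand-in are hypothetical and are not part of mini-vllm. When timing real GPU work, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels launch asynchronously.

```python
import time
from typing import Callable


def measure_decode_throughput(decode_step: Callable[[], None], num_tokens: int = 128) -> float:
    """Return tokens per second over `num_tokens` sequential decode steps."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_step()  # one greedy decode step that produces one token
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed


def speedup(candidate_tok_s: float, baseline_tok_s: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tok_s / baseline_tok_s


def fake_decode_step() -> None:
    """Hypothetical stand-in for a real model call."""
    time.sleep(0.0005)


throughput = measure_decode_throughput(fake_decode_step, num_tokens=16)

# Qwen3-0.6B row of the table: 114.5 tok/s vs. 23.0 tok/s
qwen3_speedup = round(speedup(114.5, 23.0), 1)
```

At 114.5 tok/s, the 128 decode tokens used in the benchmark take roughly 1.12 seconds.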
- Fused CUDA Megakernel -- single-kernel decode pipeline fusing embedding, all transformer layers, norm and LM head
- CUDA Graph Decode -- configurable bucketed CUDA Graphs to eliminate CPU launch overhead
- AWQ 4-bit Quantization -- built-in calibration and inference with Triton/CUDA kernels
- SwanLab Monitoring -- real-time throughput and memory tracking (docs)
- lm-eval Benchmark -- built-in evaluation harness adapter (docs)
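
A CUDA Graph replays a fixed sequence of kernels with fixed tensor shapes, so a common pattern is to capture one graph per batch-size bucket and pad each incoming batch up to the nearest bucket. The sketch below illustrates only that bucket-selection step in plain Python. The bucket sizes and function names are assumptions for illustration, and mini-vllm's actual bucketing logic may differ.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def pick_graph_bucket(batch_size: int, buckets: Sequence[int]) -> Optional[int]:
    """Return the smallest bucket >= batch_size, or None if no bucket fits.

    `buckets` must be sorted in ascending order.
    """
    idx = bisect_left(buckets, batch_size)
    return buckets[idx] if idx < len(buckets) else None


def pad_to_bucket(token_ids: list[int], bucket: int, pad_id: int = 0) -> list[int]:
    """Pad a batch of token ids with `pad_id` up to the bucket size."""
    return token_ids + [pad_id] * (bucket - len(token_ids))


BUCKETS = [1, 2, 4, 8, 16]  # hypothetical bucket sizes

bucket = pick_graph_bucket(3, BUCKETS)     # a batch of 3 uses the size-4 graph
padded = pad_to_bucket([5, 6, 7], bucket)  # one padding slot is added
```

A batch larger than the largest bucket returns `None`, in which case an engine would typically fall back to eager execution.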
| Model | Model Backend | Attention Backend |
|---|---|---|
| Qwen3 | default, megakernel_cuda | sdpa, flash_attn, naive |
| Qwen3.5 | default | sdpa, flash_attn, naive, fla |
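
The support matrix above can be expressed as a lookup that rejects unsupported combinations before a model is loaded. This is a minimal sketch: the dictionary layout, the key names, and `validate_backends` are hypothetical, and it assumes the model backend and attention backend can be chosen independently, which the table does not confirm.

```python
# Backend names are copied from the support table; the structure is illustrative.
SUPPORTED_BACKENDS = {
    "qwen3": {
        "model_backend": {"default", "megakernel_cuda"},
        "attention_backend": {"sdpa", "flash_attn", "naive"},
    },
    "qwen3.5": {
        "model_backend": {"default"},
        "attention_backend": {"sdpa", "flash_attn", "naive", "fla"},
    },
}


def validate_backends(model: str, model_backend: str, attention_backend: str) -> None:
    """Raise ValueError if the requested combination is not in the table."""
    if model not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown model: {model!r}")
    allowed = SUPPORTED_BACKENDS[model]
    if model_backend not in allowed["model_backend"]:
        raise ValueError(f"{model} does not support model backend {model_backend!r}")
    if attention_backend not in allowed["attention_backend"]:
        raise ValueError(f"{model} does not support attention backend {attention_backend!r}")


validate_backends("qwen3", "megakernel_cuda", "sdpa")  # accepted
validate_backends("qwen3.5", "default", "fla")         # accepted
```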
git clone https://github.com/BoundlessWindMoon/minivllm.git mini-vllm
cd mini-vllm
uv venv .venv --python 3.12
source .venv/bin/activate
# PyTorch (match your CUDA version)
uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128
# Core
uv pip install -e .
# Optional: flash-attn, lm-eval, swanlab, etc.
uv pip install -e ".[all]"
# flash-linear-attention (required for Qwen3.5 linear attention)
uv pip install flash-linear-attention --no-build-isolation

# Download Qwen3-0.6B weights (HF_ENDPOINT points at a Hugging Face mirror)
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3-0.6B --local-dir ~/huggingface/Qwen3-0.6B/
# Optional: Qwen3.5 (multimodal, requires flash-linear-attention)
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3.5-0.8B --local-dir ~/huggingface/Qwen3.5-0.8B/

# Qwen3 with megakernel backend
python main.py
# Qwen3.5 with cuda graph
python main.py --config configs/qwen3_5.yaml

# Evaluate with lm-eval
python -m eval.run --config configs/qwen3_5.yaml --tasks arc_easy --limit 50

See docs/benchmark.md for task customization and log output.
- Configuration Reference -- all YAML parameters
- Benchmark Evaluation -- lm-eval integration
- Profiling & Monitoring -- PyTorch profiler, Perfetto, SwanLab
- AWQ Quantization -- calibration and quantized inference
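
To give a feel for what 4-bit weight storage involves, the sketch below quantizes a weight row group by group to integer codes in the range 0 to 15 and reconstructs approximate values from them. It covers only plain round-to-nearest group quantization and omits AWQ's activation-aware scale search. It keeps a floating-point minimum per group for simplicity, whereas packed AWQ checkpoints typically store an integer zero point. All function names are hypothetical and are not taken from mini-vllm.

```python
def quantize_group(weights: list[float]) -> tuple[list[int], float, float]:
    """Quantize one group to 4-bit codes; return (codes, scale, group_min)."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 15 or 1.0  # fall back to 1.0 for constant groups
    codes = [min(15, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: list[int], scale: float, w_min: float) -> list[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def quantize_row(row: list[float], group_size: int = 4) -> list[tuple[list[int], float, float]]:
    """Split a weight row into groups and quantize each group independently."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


row = [0.1, -0.4, 0.25, 0.9, 2.0, 2.0, 2.0, 2.0]
groups = quantize_row(row, group_size=4)
restored = [w for codes, scale, w_min in groups for w in dequantize_group(codes, scale, w_min)]
```

Each reconstructed weight differs from the original by at most half of its group's scale, which is why smaller groups usually give lower error at the cost of storing more scales.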
mini-vllm/
├── configs/ # YAML configs
├── engine/ # Inference loop, model runner, sampler
├── kernels/ # Triton & CUDA kernels
├── layers/ # Attention, MLP, RMSNorm, Rotary
├── model/ # Qwen3 / Qwen3.5 architectures
├── quantization/ # AWQ calibration and quantized layers
├── eval/ # lm-eval adapter
├── scripts/ # Benchmark, verify, profile tools
├── main.py # Entry point: inference
└── quant.py # Entry point: AWQ calibration
- nano-vllm -- minimalist inference architecture
- AutoAWQ -- AWQ quantization
- mega-qwen -- fused megakernel design
MIT