BoundlessWindMoon/minivllm

Mini-vLLM

A lightweight inference and quantization engine for studying LLMs.

Python 3.10+ · PyTorch 2.9 · License: MIT

Decode Throughput

Measured on RTX 4050 6 GB · greedy · 128 decode tokens

| Model | HF Transformers | mini-vllm | Speedup |
|---|---|---|---|
| Qwen3-0.6B | 23.0 tok/s | 114.5 tok/s (megakernel) | 5.0x |
| Qwen3.5-0.8B | 15.3 tok/s | 90.9 tok/s (CUDA Graph + flash-attn + fla) | 6.0x |
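
For readers who want to reproduce numbers of this kind, decode throughput is the number of generated tokens divided by wall-clock decode time. The following is a minimal sketch of that timing loop, not the repository's benchmark script: `measure_decode_throughput`, `speedup`, and the `decode_step` callable are hypothetical names introduced here for illustration.

```python
import time
from typing import Callable


def measure_decode_throughput(
    decode_step: Callable[[], None],
    num_tokens: int = 128,
    warmup: int = 8,
) -> float:
    """Return decode throughput in tokens per second.

    `decode_step` is a hypothetical callable that produces exactly one token.
    On a GPU it should block until the token is ready (for example by calling
    torch.cuda.synchronize()), otherwise only kernel launch time is measured.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")

    # Warm-up iterations are excluded so one-time setup cost does not skew the result.
    for _ in range(warmup):
        decode_step()

    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_step()
    elapsed = time.perf_counter() - start

    # Guard against a zero reading from a trivially fast step.
    return num_tokens / max(elapsed, 1e-12)


def speedup(baseline_tok_s: float, candidate_tok_s: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tok_s / baseline_tok_s
```

Passing the Qwen3-0.6B figures from the table, `speedup(23.0, 114.5)`, gives roughly 4.98, which matches the reported 5.0x after rounding.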

Streaming Inference & Megakernel

mini-vllm megakernel vs HF Transformers — same Qwen3-0.6B, same prompt, ~8x faster

Qwen3-0.6B · bf16 · greedy · same prompt · 6 GB RTX 4050
Left: HF Transformers · Right: mini-vllm megakernel

Features

  • Fused CUDA Megakernel -- single-kernel decode pipeline fusing embedding, all transformer layers, norm and LM head
  • CUDA Graph Decode -- configurable bucketed CUDA Graphs to eliminate CPU launch overhead
  • AWQ 4-bit Quantization -- built-in calibration and inference with Triton/CUDA kernels
  • SwanLab Monitoring -- real-time throughput and memory tracking (docs)
  • lm-eval Benchmark -- built-in evaluation harness adapter (docs)
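
One common way to bucket CUDA Graphs is to capture a graph for each of a fixed set of sizes and, at decode time, replay the smallest captured graph that fits the current batch, padding the unused slots. The sketch below shows only that selection step. It assumes the buckets are batch sizes, which this README does not state, and `pick_graph_bucket` is a hypothetical helper rather than part of mini-vllm.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def pick_graph_bucket(batch_size: int, buckets: Sequence[int]) -> Optional[int]:
    """Return the smallest bucket that can hold `batch_size`.

    Returns None when the batch exceeds every bucket, in which case a caller
    would typically fall back to eager (non-graph) execution.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    ordered = sorted(buckets)
    # bisect_left finds the first bucket that is >= batch_size.
    idx = bisect_left(ordered, batch_size)
    return ordered[idx] if idx < len(ordered) else None


if __name__ == "__main__":
    buckets = [1, 2, 4, 8]  # illustrative values, not the project's defaults
    for bs in (1, 3, 8, 9):
        print(bs, "->", pick_graph_bucket(bs, buckets))
```

Fewer buckets mean fewer graphs to capture and less memory held by them, at the cost of more padding per step; more buckets trade the other way.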

Supported Models

| Model | Model Backend | Attention Backend |
|---|---|---|
| Qwen3 | default, megakernel_cuda | sdpa, flash_attn, naive |
| Qwen3.5 | default | sdpa, flash_attn, naive, fla |
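
The support matrix above can be checked before launching a run. The sketch below encodes the table as a lookup and reports unsupported combinations. The model keys (`qwen3`, `qwen3.5`) and the `validate_backends` function are assumptions made for illustration; the actual config keys used by the project may differ.

```python
# Support matrix transcribed from the table above.
# The dictionary keys are hypothetical identifiers, not confirmed config values.
SUPPORTED = {
    "qwen3": {
        "model_backend": {"default", "megakernel_cuda"},
        "attention_backend": {"sdpa", "flash_attn", "naive"},
    },
    "qwen3.5": {
        "model_backend": {"default"},
        "attention_backend": {"sdpa", "flash_attn", "naive", "fla"},
    },
}


def validate_backends(model: str, model_backend: str, attention_backend: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the combination is supported."""
    entry = SUPPORTED.get(model)
    if entry is None:
        return [f"unknown model: {model!r}"]

    problems = []
    if model_backend not in entry["model_backend"]:
        problems.append(f"{model} does not support model backend {model_backend!r}")
    if attention_backend not in entry["attention_backend"]:
        problems.append(f"{model} does not support attention backend {attention_backend!r}")
    return problems


if __name__ == "__main__":
    print(validate_backends("qwen3", "megakernel_cuda", "sdpa"))
    print(validate_backends("qwen3.5", "megakernel_cuda", "fla"))
```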

Quick Start

1. Install

git clone https://github.com/BoundlessWindMoon/minivllm.git
cd minivllm

uv venv .venv --python 3.12
source .venv/bin/activate

# PyTorch (match your CUDA version)
uv pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128

# Core
uv pip install -e .

# Optional: flash-attn, lm-eval, swanlab, etc.
uv pip install -e ".[all]"

# flash-linear-attention (required for Qwen3.5 linear attention)
uv pip install flash-linear-attention --no-build-isolation

2. Download Model

HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3-0.6B --local-dir ~/huggingface/Qwen3-0.6B/

# Optional: Qwen3.5 (multimodal, requires flash-linear-attention)
HF_ENDPOINT=https://hf-mirror.com huggingface-cli download Qwen/Qwen3.5-0.8B --local-dir ~/huggingface/Qwen3.5-0.8B/

3. Run

# Qwen3 with megakernel backend
python main.py

# Qwen3.5 with CUDA Graph
python main.py --config configs/qwen3_5.yaml

4. Benchmark

python -m eval.run --config configs/qwen3_5.yaml --tasks arc_easy --limit 50

See docs/benchmark.md for task customization and log output.

Documentation

Project Structure

mini-vllm/
├── configs/            # YAML configs
├── engine/             # Inference loop, model runner, sampler
├── kernels/            # Triton & CUDA kernels
├── layers/             # Attention, MLP, RMSNorm, Rotary
├── model/              # Qwen3 / Qwen3.5 architectures
├── quantization/       # AWQ calibration and quantized layers
├── eval/               # lm-eval adapter
├── scripts/            # Benchmark, verify, profile tools
├── main.py             # Entry point: inference
└── quant.py            # Entry point: AWQ calibration

Acknowledgements

License

MIT
