Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation
Official PyTorch Implementation
This is a PyTorch/GPU implementation of the paper Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Generation, which directly uses the features of a frozen pre-trained vision foundation model (VFM) to reconstruct the original image. To achieve this, VFMTok introduces two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features laid out on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Once the trained VFMTok is integrated into autoregressive (AR) generative models, it achieves notable results on class-to-image generation while accelerating convergence by a factor of three. It also enables high-fidelity class-conditional synthesis without classifier-free guidance (CFG).
In addition, VFMTok achieves new state-of-the-art image synthesis quality on class-to-image generation. Please check this repository for details.
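To make the two components concrete, here is a minimal, self-contained PyTorch sketch of the idea. It is illustrative only: the class name `ToyVFMTok`, the codebook size, and the toy reconstruction heads are assumptions rather than the repo's actual implementation; only the 256 region tokens and the 12-dimensional code space correspond to numbers used elsewhere in this README and in the training flags.

```python
import torch
import torch.nn as nn

class ToyVFMTok(nn.Module):
    """Illustrative sketch of the VFMTok idea, not the repo's implementation."""
    def __init__(self, vfm_dim=1024, code_dim=12, codebook_size=16384, num_tokens=256):
        super().__init__()
        # Learnable region queries pool information from the frozen VFM feature grid
        # (a stand-in for the paper's region-adaptive quantization).
        self.region_queries = nn.Parameter(torch.randn(num_tokens, vfm_dim))
        self.cross_attn = nn.MultiheadAttention(vfm_dim, num_heads=8, batch_first=True)
        self.to_code = nn.Linear(vfm_dim, code_dim)             # project into the low-dim code space
        self.codebook = nn.Embedding(codebook_size, code_dim)   # codebook size is an assumed value
        self.from_code = nn.Linear(code_dim, vfm_dim)
        self.pixel_head = nn.Linear(vfm_dim, 3 * 16 * 16)       # toy per-token patch decoder
        self.semantic_head = nn.Linear(vfm_dim, vfm_dim)        # regressed onto the frozen VFM features

    def forward(self, vfm_feats):                                # (B, N, vfm_dim) features from the frozen VFM
        q = self.region_queries.expand(vfm_feats.size(0), -1, -1)
        tokens, _ = self.cross_attn(q, vfm_feats, vfm_feats)     # compact region tokens
        z = self.to_code(tokens)                                 # (B, T, code_dim)
        # Nearest-neighbour vector quantization with a straight-through estimator.
        dists = torch.cdist(z.flatten(0, 1), self.codebook.weight)
        idx = dists.argmin(dim=-1).view(z.size(0), z.size(1))
        z_q = z + (self.codebook(idx) - z).detach()
        h = self.from_code(z_q)
        return self.pixel_head(h), self.semantic_head(h), idx
```

During training, the pixel head would drive an image reconstruction loss, while the semantic head is regressed onto the frozen VFM features themselves, which is what the semantic reconstruction objective refers to.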
This repo contains:
- 🪐 A simple PyTorch implementation of VFMTok and various AR generative models.
- ⚡️ Pre-trained VFMTok tokenizer and AR generative models trained on ImageNet.
- 🛸 Training and evaluation scripts for the tokenizer and generative models, also provided in this repo.
- 🎉 Hugging Face for easy access to pre-trained models.
- [2025/07/11] 🔥 VFMTok has been released. Check out the paper for details. 🔥
- [2025/09/18] 🔥 VFMTok has been accepted by NeurIPS 2025! 🔥
- [2025/10/11] 🔥 Image tokenizers and AR models for class-conditional image generation are released. 🔥
- [2025/10/11] 🔥 All code for VFMTok has been released. 🔥
If you are not using Linux, do NOT proceed.
- Clone this repository and navigate to the VFMTok folder
git clone https://github.com/CVMI-Lab/VFMTok.git
cd VFMTok
- Install Package
conda create -n vfmtok python=3.10 -y
conda activate vfmtok
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training, as required.
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Install deformable attention module
cd vfmtok/modules/ops
bash make.sh
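If the build follows the Deformable DETR convention that this repo builds on, it compiles a `MultiScaleDeformableAttention` extension. A quick, hedged sanity check (the module name is an assumption based on that upstream project):

```python
# Sanity check that the compiled deformable-attention extension is importable.
# The module name follows the Deformable DETR convention; adjust it if this
# repo builds the ops under a different name.
try:
    import MultiScaleDeformableAttention as MSDA  # compiled CUDA extension
    print("Deformable attention ops OK:", MSDA.__file__)
except ImportError as err:
    print("Deformable attention ops not found; re-run make.sh:", err)
```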
In this repo, we release:
- One image tokenizer: VFMTok(DINOv2).
- Class-conditional autoregressive generative models ranging from 111M to 3B parameters.
In this repo, we release one image tokenizer: VFMTok(DINOv2). It directly uses the features of the frozen pre-trained VFM, DINOv2, to reconstruct the image. VFMTok also introduces two key components, region-adaptive quantization and semantic reconstruction, which reduce the redundancy in the pre-trained features and maintain semantic fidelity, respectively.
| Method | tokens | rFID (256x256) | rIS (256x256) | weight |
|---|---|---|---|---|
| VFMTok | 256 | 0.98 | 215.4 | vfmtok-tokenizer.pt |
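The tokenizer consumes patch features from frozen DINOv2-L. Purely for illustration, the snippet below extracts such features via torch.hub at the 336×336 input resolution used by the training scripts; the repo itself loads DINOv2-L from the local `init_models` symlink described in the setup steps below.

```python
import torch

# Illustrative only: VFMTok consumes frozen DINOv2-L patch features like these.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()
img = torch.randn(1, 3, 336, 336)     # 336 / 14 = 24, so 24 * 24 = 576 patch tokens
with torch.no_grad():
    feats = dinov2.forward_features(img)["x_norm_patchtokens"]
print(feats.shape)                    # torch.Size([1, 576, 1024])
```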
Once the trained VFMTok(DINOv2) is integrated into autoregressive (AR) generative models, it achieves notable image generation performance.
| Method | params | epochs | FID | sFID | IS | Pre. | Rec. |
|---|---|---|---|---|---|---|---|
| VFMTok-B | 111M | 300 | 3.43 | 5.88 | 252.2 | 0.85 | 0.53 |
| VFMTok-L | 343M | 300 | 2.76 | 5.69 | 276.1 | 0.84 | 0.57 |
| VFMTok-XL | 775M | 200 | 2.38 | 5.54 | 277.2 | 0.83 | 0.60 |
| VFMTok-XXL | 1.4B | 200 | 2.28 | 5.49 | 274.3 | 0.83 | 0.60 |
| VFMTok-2B | 2.0B | 200 | 2.27 | 5.56 | 283.6 | 0.82 | 0.61 |
| VFMTok-3B | 3.1B | 200 | 2.07 | 5.46 | 280.4 | 0.82 | 0.61 |
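The evaluation script below exposes a `--cfg-scale` flag for classifier-free guidance at sampling time. As a generic sketch (it mirrors common LlamaGen-style class-conditional samplers and is not necessarily this repo's exact code), each AR step combines conditional and unconditional logits as follows:

```python
import torch

def cfg_next_token(cond_logits, uncond_logits, cfg_scale=2.0, temperature=1.0):
    """Combine conditional and unconditional logits, then sample the next token."""
    logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (B, 1) next codebook index
```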
The trained VFMTok(DINOv2), when integrated into AR generative models, can also achieve impressive image generation quality without classifier-free guidance (i.e., CFG-free).
| Method | params | epochs | FID | sFID | IS | Pre. | Rec. |
|---|---|---|---|---|---|---|---|
| VFMTok-B | 111M | 300 | 3.09 | 5.67 | 173.6 | 0.80 | 0.58 |
| VFMTok-L | 343M | 300 | 2.15 | 5.44 | 230.1 | 0.82 | 0.60 |
| VFMTok-XL | 775M | 200 | 2.06 | 5.59 | 257.2 | 0.82 | 0.61 |
| VFMTok-XXL | 1.4B | 200 | 2.09 | 5.48 | 259.3 | 0.82 | 0.61 |
| VFMTok-2B | 2.0B | 200 | 2.20 | 5.54 | 279.7 | 0.82 | 0.61 |
| VFMTok-3B | 3.1B | 200 | 2.04 | 5.43 | 267.8 | 0.82 | 0.61 |
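CFG-free generation means the unconditional branch is not needed at all; in terms of the sketch above it is the `cfg_scale = 1.0` special case (assuming the usual convention that a scale of 1.0 disables guidance), so each token costs a single forward pass instead of a conditional plus an unconditional one:

```python
import torch

# With cfg_scale = 1.0 the guided logits reduce to the conditional logits,
# i.e. plain conditional sampling -- the CFG-free setting reported above.
cond = torch.randn(2, 16384)    # toy (batch, codebook-size) logits
uncond = torch.randn(2, 16384)
guided = uncond + 1.0 * (cond - uncond)
assert torch.allclose(guided, cond)
```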
- Download the DINOv2-L pre-trained foundation model from the official model zoo.
- Create symbolic links in this directory that point to the pre-trained DINOv2-L model folder and the ImageNet training dataset folder.
- Create a dataset script for your own dataset. Here, we provide a template for training tokenizers and AR generative models using the ImageNet dataset in LMDB format; a minimal sketch of such a dataset class is shown after the commands below.
ln -s DINOv2-L_folder init_models
ln -s ImageNetFolder imagenet
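The training commands below read ImageNet from LMDB (`imagenet/lmdb/train_lmdb`). A minimal sketch of what such a dataset class can look like is given here; the key layout, serialization format, and class name are assumptions to be adapted to however your LMDB was built:

```python
import io
import pickle

import lmdb
from PIL import Image
from torch.utils.data import Dataset

class ImageNetLMDB(Dataset):
    """Hypothetical LMDB-backed dataset yielding (image, class-label) pairs."""
    def __init__(self, lmdb_path, transform=None):
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False)
        self.transform = transform
        with self.env.begin() as txn:
            self.length = int(txn.get(b"__len__").decode())  # assumed metadata key

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        with self.env.begin() as txn:
            img_bytes, label = pickle.loads(txn.get(f"{idx}".encode()))  # assumed record format
        img = Image.open(io.BytesIO(img_bytes)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```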
- Training VFMTok(DINOv2) tokenizer (see scripts/tokenizer/train_tok.sh):
export NODE_COUNT=1
export NODE_RANK=0
export PROC_PER_NODE=8
scripts/autoregressive/torchrun.sh vq_train.py --image-size 336 --results-dir output --mixed-precision bf16 --codebook-slots-embed-dim 12 \
--data-path imagenet/lmdb/train_lmdb --global-batch-size 16 --num-workers 4 --ckpt-every 5000 --epochs 50 \
--transformer-config configs/vit_transformer.yaml --log-every 1 --lr 1e-4 --ema --z-channels 512
- Training AR generative models (see scripts/autoregressive/run_train.sh)
model_type='GPT-L' # 'GPT-B' 'GPT-XL' 'GPT-XXL' 'GPT-2B'
scripts/autoregressive/torchrun.sh train_c2i.py --gpt-type c2i --image-size 336 --gpt-model ${model_type} --downsample-size 16 --num-workers 4 \
--anno-file imagenet/lmdb/train_lmdb --global-batch-size 512 --ckpt-every 10000 --ema --log-every 1 --results-dir output \
--vq-model VQ-16 --vq-ckpt tokenizer/vfmtok-tokenizer.pt --latent-size 16 --mixed-precision bf16 --epochs 300
- Resume from an AR generative checkpoint
model_type='GPT-L'
scripts/autoregressive/torchrun.sh train_c2i.py --gpt-type c2i --image-size 336 --gpt-model ${model_type} --downsample-size 16 --num-workers 4 \
--anno-file imagenet/lmdb/train_lmdb --global-batch-size 512 --ckpt-every 10000 --ema --log-every 1 --results-dir output \
--vq-model VQ-16 --vq-ckpt tokenizer/vfmtok-tokenizer.pt --latent-size 16 --mixed-precision bf16 --epochs 300 \
--gpt-ckpt output/vanilla/${model_type}/${model_type}-{ckpt_name}.pt
- Evaluate a pretrained tokenizer (see
scripts/tokenizer/run_tok.sh):
scripts/autoregressive/torchrun.sh vqgan_test.py --vq-model VQ-16 --image-size 336 --output_dir recons --batch-size $1 \
--z-channels 512 --vq-ckpt tokenizer/vfmtok-tokenizer.pt --codebook-slots-embed-dim 12
- Evaluate a pretrained AR generative model (see
scripts/autoregressive/run_test.sh)
model_type='GPT-L' # 'GPT-B' 'GPT-XL' 'GPT-XXL' 'GPT-2B'
scripts/autoregressive/torchrun.sh test_net.py --vq-ckpt tokenizer/vfmtok-tokenizer.pt \
--gpt-ckpt snapshot/model_dump/${model_type}-$1.pt --compile --gpt-model ${model_type} --image-size 336 \
--sample-dir samples --image-size-eval 256 --cfg-scale $2 --precision bf16 --per-proc-batch-size $3 \
--codebook-slots-embed-dim 12 --latent-size 16
If you find VFMTok useful for your research and applications, please kindly cite using this BibTeX:
@article{zheng2025vision,
title={Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation},
author={Zheng, Anlin and Wen, Xin and Zhang, Xuanyang and Ma, Chuofan and Wang, Tiancai and Yu, Gang and Zhang, Xiangyu and Qi, Xiaojuan},
journal={arXiv preprint arXiv:2507.08441},
year={2025}
}
The majority of this project is licensed under the Apache 2.0 License. Portions of the project are available under the separate licenses of the referenced projects, as detailed in the corresponding files.
Our codebase builds upon several excellent open-source projects, including LlamaGen, Deformable DETR, Hita and Paintmind. We are grateful to the communities behind them.
This codebase has been cleaned up but has not undergone extensive testing. If you encounter any issues or have questions, please open a GitHub issue. We appreciate your feedback!
